Friday, March 2, 2018

RDD Creation using PySpark

In the last post, we understood how RDDs work in Apache Spark. Now we focus on how to create an RDD in Apache Spark.

So, let us start our discussion by creating an RDD from a simple text file.

Our first piece of code is below:

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Count").setMaster("local")
sc = SparkContext(conf=conf)
rdd_create = sc.textFile('test.txt')
rdd_first = rdd_create.first()
print(rdd_first)


Let us decode this piece of code line by line:

Line 1: We import the SparkContext and SparkConf classes from the pyspark package.

Line 2: The application name is set to "Count". It can be any name; it is only used to identify the application on the cluster. The master is set to "local" because we are running this code on our local machine.

Line 3: A SparkContext is created with this configuration and assigned to the variable sc.

Line 4: We read the file test.txt with sc.textFile(). In simple words, the resulting RDD is just a pointer to the file; because of Spark's lazy evaluation, no data is actually read yet (see the short sketch after this walkthrough).

Line 5: first() is an action, so Spark now reads the file, returns the first line of the text file, and we store it in the variable rdd_first.

Line 6: The first line is printed to the console.
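
To illustrate the lazy evaluation mentioned in Line 4, here is a minimal sketch (not part of the program above) that reuses the same SparkContext sc and the same test.txt file; the variable names rdd_lines and rdd_upper are just illustrative. A transformation such as map() only records what should be done, and Spark does not touch the file until an action such as first() or count() is called.

# A minimal sketch, reusing the SparkContext sc and test.txt from above.
rdd_lines = sc.textFile('test.txt')                    # nothing is read from disk yet
rdd_upper = rdd_lines.map(lambda line: line.upper())   # transformation: still nothing is read
print(rdd_upper.first())                               # action: Spark reads the file and returns the first line in upper case
print(rdd_lines.count())                               # action: Spark counts the lines in test.txt

Only the two print statements actually cause Spark to read test.txt; everything before them just builds up the plan of work.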

You can find the code in my GitHub repository: https://github.com/sangam92/Spark_tutorials



