Friday, March 2, 2018

RDD Creation using Pyspark

In the last post, we understood how an RDD works in Apache Spark. Now we focus on how to create an RDD in Apache Spark.

So, let us start our discussion by creating an RDD from a simple text file.

Our first piece of code is below:

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Count").setMaster("local")
sc = SparkContext(conf=conf)
rdd_create = sc.textFile('test.txt')
rdd_first = rdd_create.first()
print(rdd_first)


Let us decode this piece of code.

Line 1: We import the SparkContext and SparkConf classes from the pyspark package.

Line 2: The application name is "Count"; it can be any name and is only needed to identify the application on the cluster. The master is set to "local" because we are running this piece of code on our local machine.

Line 3: The SparkContext is created and assigned to the variable sc.

Line 4: We point to the file test.txt; in simple words, rdd_create is just a pointer to the file at this stage (keep Spark's lazy evaluation in mind; see the short sketch after this walkthrough).

Line 5: Here, we read the first line of the text file and store it in the variable rdd_first.

Line 6: The output is printed here.
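
Because of lazy evaluation, calls like textFile and filter only build up a lineage of transformations; nothing is actually read until an action runs. Below is a minimal sketch of this idea, assuming the same sc and test.txt as above (the 'spark' keyword filter is just an illustration): only the final count() call touches the file.

# Transformations: the file is not read yet, Spark only records the lineage
rdd_lines = sc.textFile('test.txt')
rdd_spark = rdd_lines.filter(lambda line: 'spark' in line.lower())

# Action: this triggers the actual read of test.txt and counts the matching lines
print(rdd_spark.count())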

You can find the code in my GitHub repository: https://github.com/sangam92/Spark_tutorials



