Wednesday, January 23, 2019

Spark - Streaming First Program


We will write our first Spark Streaming program. We will watch data flow over a TCP socket, read it, and count the number of words passing through the socket.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

The following piece of code imports the required libraries. The StreamingContext class provides all the functionality needed to work with streaming data.

sc = SparkContext("local[2]", "test")
ssc = StreamingContext(sc, 1)

We are running in local mode, so we create two execution threads (“local[2]”) and give the application the name “test” so that we can identify the process on the cluster UI. At least two threads are needed locally because one thread is taken by the socket receiver, leaving the other free to process the data. The ‘1’ denotes a one-second interval between batches.

lines = ssc.socketTextStream("localhost", 9999)

This line receives the streaming data from the TCP socket on localhost at port 9999 and returns it as a DStream of lines.

words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
wordcounts = pairs.reduceByKey(lambda a, b: a + b)

This part splits each line into words and then counts them. We have already done this kind of word-count exercise in our earlier blogs; the walkthrough below shows what each step produces.
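To make the flow concrete, here is roughly what each transformation produces for a sample batch containing the single line "to be or not to be" (illustrative values only):

# lines:       ["to be or not to be"]
# flatMap:     ["to", "be", "or", "not", "to", "be"]
# map:         [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
# reduceByKey: [("to", 2), ("be", 2), ("or", 1), ("not", 1)]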

wordcounts.pprint()

We can print the data to the console via pprint().
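For every batch interval, pprint() writes a small block to the console. The output looks roughly like this (the timestamp and words are illustrative):

-------------------------------------------
Time: 2019-01-23 10:00:01
-------------------------------------------
('hello', 2)
('world', 1)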

ssc.start()
ssc.awaitTermination()

ssc.start() starts the computation, and ssc.awaitTermination() blocks until the streaming process is terminated.
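If you would rather have the job stop itself after a fixed time instead of running until it is killed, awaitTerminationOrTimeout() can be used instead; a minimal sketch, with a 60-second timeout chosen just as an example:

# Wait up to 60 seconds; returns True if the context was already stopped
if not ssc.awaitTerminationOrTimeout(60):
    # stopGraceFully lets in-flight batches finish before shutting down
    ssc.stop(stopSparkContext=True, stopGraceFully=True)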

We need to open another terminal and run the command below.

nc -lk 9999

After this we can start typing the words that we want to count.

Note :- There will be a lot of Spark log output mixed in between the words and their counts.
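One way to cut down on this noise is to raise the log level right after creating the SparkContext:

sc = SparkContext("local[2]", "test")
sc.setLogLevel("ERROR")  # hide INFO/WARN messages; only errors reach the console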

Complete Code Snippet :-

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one for the socket receiver, one for processing
sc = SparkContext("local[2]", "test")

# Batch interval of 1 second
ssc = StreamingContext(sc, 1)

# DStream of lines read from the TCP socket on localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words, pair each word with 1, and sum the counts per word
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
wordcounts = pairs.reduceByKey(lambda a, b: a + b)

# Print each batch's word counts to the console
wordcounts.pprint()

ssc.start()             # start the computation
ssc.awaitTermination()  # block until the job is terminated
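To try it out, save the snippet to a file (streaming_wordcount.py is just an assumed name), start netcat first, and then submit the job from a second terminal:

nc -lk 9999                           # terminal 1: type words here
spark-submit streaming_wordcount.py   # terminal 2: run the streaming job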

You can also find the code on my GitHub :- https://github.com/sangam92/Spark_tutorials


