Wednesday, January 23, 2019

Spark - Streaming First Program


We will write our first Spark Streaming program. We will watch data flow over a TCP socket, read it, and count the number of words passing through the socket.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

The following piece of code imports the required libraries. The StreamingContext class provides all the functionality needed to work with streaming data.

sc = SparkContext("local[2]", "test")
ssc = StreamingContext(sc, 1)

We are running in local mode, so we create two execution threads (“local[2]”) and give the application the name “test” so that we can identify the process on the cluster UI. At least two threads are needed locally because one thread is taken by the socket receiver, leaving the other free to process the data. The ‘1’ denotes a one-second interval between batches.

lines = ssc.socketTextStream("localhost", 9999)

This line receives the streaming data from the TCP socket on localhost at port 9999 and returns it as a DStream of lines.

words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
wordcounts = pairs.reduceByKey(lambda a, b: a + b)

This part splits each line into words and then counts them. We have already done this kind of word-count exercise in our earlier blogs; the walkthrough below shows what each step produces.
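To make the flow concrete, here is roughly what each transformation produces for a sample batch containing the single line "to be or not to be" (illustrative values only):

# lines:       ["to be or not to be"]
# flatMap:     ["to", "be", "or", "not", "to", "be"]
# map:         [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
# reduceByKey: [("to", 2), ("be", 2), ("or", 1), ("not", 1)]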

wordcounts.pprint()

We can print the data to the console via pprint().
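For every batch interval, pprint() writes a small block to the console. The output looks roughly like this (the timestamp and words are illustrative):

-------------------------------------------
Time: 2019-01-23 10:00:01
-------------------------------------------
('hello', 2)
('world', 1)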

ssc.start()
ssc.awaitTermination()

ssc.start() starts the computation, and ssc.awaitTermination() blocks until the streaming process is terminated.
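If you would rather have the job stop itself after a fixed time instead of running until it is killed, awaitTerminationOrTimeout() can be used instead; a minimal sketch, with a 60-second timeout chosen just as an example:

# Wait up to 60 seconds; returns True if the context was already stopped
if not ssc.awaitTerminationOrTimeout(60):
    # stopGraceFully lets in-flight batches finish before shutting down
    ssc.stop(stopSparkContext=True, stopGraceFully=True)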

We need to open another terminal and run the command below.

nc -lk 9999

After this we can start typing the words that we want to count.

Note :- There will be a lot of Spark log output mixed in between the words and their counts.
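One way to cut down on this noise is to raise the log level right after creating the SparkContext:

sc = SparkContext("local[2]", "test")
sc.setLogLevel("ERROR")  # hide INFO/WARN messages; only errors reach the console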

Complete Code Snippet :-

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one for the socket receiver, one for processing
sc = SparkContext("local[2]", "test")

# Batch interval of 1 second
ssc = StreamingContext(sc, 1)

# DStream of lines read from the TCP socket on localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words, pair each word with 1, and sum the counts per word
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
wordcounts = pairs.reduceByKey(lambda a, b: a + b)

# Print each batch's word counts to the console
wordcounts.pprint()

ssc.start()             # start the computation
ssc.awaitTermination()  # block until the job is terminated
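To try it out, save the snippet to a file (streaming_wordcount.py is just an assumed name), start netcat first, and then submit the job from a second terminal:

nc -lk 9999                           # terminal 1: type words here
spark-submit streaming_wordcount.py   # terminal 2: run the streaming job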

You can also find the code on my GitHub :- https://github.com/sangam92/Spark_tutorials


