We will write our first Spark Streaming code. We will watch data flow over a TCP socket, read that data, and count the number of words flowing through the socket.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
This piece of code imports the required libraries. The StreamingContext class provides all the functionality we need for working with streaming data.
sc = SparkContext("local[2]", "test")
ssc = StreamingContext(sc, 1)
We are running on a local setup, so we create a SparkContext with two execution threads and give it the name "test" so we can identify this job. The '1' denotes a one-second interval between batches.
lines = ssc.socketTextStream("localhost", 9999)
This line receives the streaming data from a TCP socket on localhost, port 9999.
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
wordcounts = pairs.reduceByKey(lambda a, b: a + b)
This part splits each line into words and then counts them. We have done similar word-count exercises in our earlier blogs.
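To see the logic behind those three transformations, here is a plain-Python sketch of the same pipeline (no Spark involved; the sample lines are my own illustration, and this is not how Spark actually executes the job):

```python
# Plain-Python sketch of the flatMap / map / reduceByKey steps above,
# applied to one imaginary micro-batch of lines.
lines = ["hello world", "hello spark"]

# flatMap: split every line into words and flatten into one list
words = [word for line in lines for word in line.split(" ")]

# map: pair every word with an initial count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: sum the counts per word
wordcounts = {}
for word, count in pairs:
    wordcounts[word] = wordcounts.get(word, 0) + count

print(wordcounts)  # {'hello': 2, 'world': 1, 'spark': 1}
```

Spark does the same thing, except the data arrives batch by batch from the socket and the reduction is distributed across the executors.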
wordcounts.pprint()
We print the counts to the console via pprint().
ssc.start()
ssc.awaitTermination()
ssc.start() starts the computation, and ssc.awaitTermination() blocks until the process is terminated.
We need to open another terminal and run the command below.

nc -lk 9999

After this we can start typing the words we want to count.
Note :- There will be a lot of Spark logging interleaved with the words and their counts.
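One way to cut down that log noise (an optional tweak, using the standard SparkContext API) is to lower the log level right after creating the context:

```python
# Reduce Spark's console logging so the word counts are easier to spot.
# Add this line just after creating the SparkContext.
sc.setLogLevel("ERROR")
```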
Complete Code Snippet :-
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "test")
ssc = StreamingContext(sc, 1)

lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
wordcounts = pairs.reduceByKey(lambda a, b: a + b)
wordcounts.pprint()

ssc.start()
ssc.awaitTermination()
You can also find the code in my GitHub :-
https://github.com/sangam92/Spark_tutorials