Saturday, January 26, 2019

Spark Streaming - DStreams

We live in a world where data flows at a very fast pace. Competition has become so fierce that companies have no time to wait; they want their data in real time. The Internet of Things has changed the scenario dramatically.

The online advertising market, real-time traffic analysis, stock analysis, parking analysis: all of these are built on real-time data.

We have already learned about the capabilities of Spark's batch processing. Here, we will explore the streaming capabilities of Spark.

Spark receives data from multiple sources such as Flume, Kafka, Kinesis, and TCP/IP sockets. The Spark receiver takes this data, converts it into many mini-batches, and sends them to the Spark core for further processing.


DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from the source or the processed data stream generated by transforming the input stream. A DStream is represented by a continuous series of RDDs, which is Spark's abstraction of an immutable, distributed dataset.

Every DStream contains the data of a certain interval. If we apply any operation on a DStream, it is applied to all the underlying RDDs. The DStream hides these details and provides the developer with a convenient high-level API. As a result, Spark DStream makes working with streaming data straightforward.
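To make the "continuous series of RDDs" idea concrete, here is a minimal sketch (the application name, port, and batch interval are only illustrative assumptions) that uses foreachRDD to look at each underlying RDD, one per batch interval:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-demo")    # two threads: one for the receiver, one for processing
ssc = StreamingContext(sc, 5)                    # the DStream produces a new RDD every 5 seconds

lines = ssc.socketTextStream("localhost", 9999)  # DStream of text lines read from a TCP socket

def show_batch(rdd):
    # foreachRDD exposes the underlying RDDs: this function runs once per 5-second batch
    print("records in this batch: %d" % rdd.count())

lines.foreachRDD(show_batch)

ssc.start()
ssc.awaitTermination()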

Wednesday, January 23, 2019

Spark - Streaming First Program


We will write our first program in Spark Streaming. We will see how data flows over a TCP socket, read this data, and then count the number of words flowing through the socket.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

This piece of code imports the required libraries. The StreamingContext class provides all the functionality required for handling streaming data.

sc = SparkContext("local[2]","test")
ssc = StreamingContext(sc,1)

We are doing the setup locally, so we create two execution threads ("local[2]") and give the application the name "test" so that it can be identified on the cluster UI. At least two threads are needed when running locally: one runs the receiver and the other processes the received data. The '1' here denotes a 1-second interval between batches.

lines = ssc.socketTextStream("localhost",9999)

This line reads the streaming data from a TCP socket on localhost, port 9999.

words = lines.flatMap(lambda line:line.split(" "))
pairs = words.map(lambda word : (word,1))
wordcounts = pairs.reduceByKey(lambda a,b : a + b)

This part splits each line into words and then counts them. We have already done similar word-count exercises in earlier blogs.

wordcounts.pprint()

We can print the data to the console via pprint().

ssc.start()
ssc.awaitTermination()

ssc.start() starts the computation, and ssc.awaitTermination() blocks until the streaming process is terminated.

We need to open another terminal and run the command below.

nc -lk 9999

After this we can start typing the words that we want to count.
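For example, if we type the following line into the nc terminal (the words are just an illustration):

hello spark hello streaming

then the Spark console prints something roughly like this for that batch (the exact timestamp will differ, and log lines may appear around it):

-------------------------------------------
Time: 2019-01-26 10:00:01
-------------------------------------------
('hello', 2)
('spark', 1)
('streaming', 1)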

Note :- There will be a lot of Spark logging output interleaved between the words and their counts.
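If the logging gets too noisy, one optional tweak (not part of the run above) is to raise the log level right after creating the SparkContext:

sc.setLogLevel("ERROR")   # print only errors, suppressing the INFO/WARN log lines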

Complete Code Snippet :-

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]","test")
ssc = StreamingContext(sc,1)

lines = ssc.socketTextStream("localhost",9999)


words = lines.flatMap(lambda line:line.split(" "))
pairs = words.map(lambda word : (word,1))
wordcounts = pairs.reduceByKey(lambda a,b : a + b)

wordcounts.pprint()

ssc.start()
ssc.awaitTermination()

You can also find the code in my GitHub :- https://github.com/sangam92/Spark_tutorials


Sunday, January 20, 2019

HDFS - Command 1


HDFS is used to store files of different sizes and is especially useful for datasets that are very big in size.

We will have a look at some of the basic commands that can be used in HDFS.

Directory Creation (mkdir) :- A directory in HDFS can be created with the help of the 'mkdir' command.

Syntax :- hdfs dfs -mkdir dirname

We can also create the directory by giving a full path.

Syntax :- hdfs dfs -mkdir pathname/dirname
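For example, assuming /user/hadoop is our HDFS home directory (the names here are only illustrative):

hdfs dfs -mkdir /user/hadoop/data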


Listing directories and files (ls) :- Files and directories can be listed via the 'ls' command in HDFS.

Syntax :- hdfs dfs -ls

We can also list the files and directories under a specific path.

Syntax :- hdfs dfs -ls pathname
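For example, to list the directory we just created (path is illustrative):

hdfs dfs -ls /user/hadoop/data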


Displaying a file (cat) :- The 'cat' command is used to display the contents of a file.

We will create a file in our local filesystem.

Syntax :- cat > file.txt


Now, after sending the file to HDFS (see the 'put' command below), we can display its contents using 'cat'.

Syntax :- hdfs dfs -cat file.txt
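For example, assuming file.txt has already been copied into our HDFS home directory (path is illustrative):

hdfs dfs -cat /user/hadoop/file.txt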


Sending a file via the put command :- We can send a file from the local filesystem to HDFS via the 'put' command.

Note :- The 'put' command copies files from the local filesystem into HDFS; path1 is the local source and path2 is the HDFS destination.

Syntax :- hdfs dfs -put path1 path2

If we try to send the same file to HDFS again, we will get a "File exists" error.


To avoid this error, we can send the file using the -f (overwrite) option.

Syntax :- hdfs dfs -put -f path1 path2
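For example, with a local file file.txt and an illustrative HDFS directory /user/hadoop/data:

hdfs dfs -put file.txt /user/hadoop/data
hdfs dfs -put -f file.txt /user/hadoop/data

The second command overwrites the copy that the first one created, instead of failing with "File exists".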

Copying a file in HDFS :- We can copy a file from one HDFS location to another via the 'cp' command.

Syntax :- hdfs dfs -cp path1 path2
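For example (both paths are inside HDFS and only illustrative):

hdfs dfs -cp /user/hadoop/data/file.txt /user/hadoop/backup/file.txt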


Moving a file in HDFS :- We can move a file from one HDFS location to another via the 'mv' command.

Syntax :- hdfs dfs -mv path1 path2
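For example (paths are illustrative; unlike 'cp', the source file is removed after the move):

hdfs dfs -mv /user/hadoop/data/file.txt /user/hadoop/archive/file.txt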



Hadoop - What is a Job in Hadoop ?

In the field of computer science , a job just means a piece of program and the same rule applies to the Hadoop ecosystem as wel...