Saturday, January 26, 2019

Spark Streaming - DStreams

We live in a world where data flows at a very fast pace. The competition has become so fierce that companies cannot afford to wait; they want their data in real time. The Internet of Things has changed the scenario dramatically.

The online advertising market, real-time traffic analysis, stock analysis, parking analysis: all of these are built on real-time data.

We have already learned about Spark's batch-processing capabilities. Here, we will explore the streaming capabilities of Spark.

Spark Streaming receives data from multiple sources such as Flume, Kafka, Kinesis, and TCP/IP sockets. A Spark receiver takes this data, divides it into many mini-batches, and sends them to the Spark core engine for further processing.
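The sketch below shows one typical way this is set up: a StreamingContext with a batch interval that groups incoming socket data into mini-batches. The host, port, and the 5-second interval are placeholder assumptions for illustration, not values from this post.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Group incoming data into 5-second mini-batches (interval is an assumption).
val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamDemo")
val ssc = new StreamingContext(conf, Seconds(5))

// Receive a text stream from a TCP/IP socket; host/port are placeholders.
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()          // show the first few elements of every mini-batch

ssc.start()            // start the receiver and the processing
ssc.awaitTermination() // keep running until the stream is stopped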


DStream (discretized stream) is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from the source or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, Spark's abstraction of an immutable, distributed dataset.

Each RDD in a DStream contains the data of a certain interval. When we apply an operation on a DStream, it is applied to all the underlying RDDs. The DStream hides these details and provides the developer with a high-level API for convenience. As a result, Spark DStream makes working with streaming data much easier.
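To make this concrete, here is a minimal word-count sketch: every flatMap, map, and reduceByKey below is a DStream operation that Spark applies to each underlying per-batch RDD. The socket source, host, port, and batch interval are assumptions chosen for illustration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each transformation below is applied per batch,
// i.e. to every underlying RDD of the DStream.
val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val ssc = new StreamingContext(conf, Seconds(5))

val lines  = ssc.socketTextStream("localhost", 9999) // placeholder source
val words  = lines.flatMap(_.split(" "))             // DStream of words
val pairs  = words.map(word => (word, 1))            // DStream of (word, 1)
val counts = pairs.reduceByKey(_ + _)                // per-batch word counts
counts.print()

ssc.start()
ssc.awaitTermination()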

