Tuesday, October 9, 2018

Flume - Introduction

Flume is a highly reliable tool for aggregating and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.

What is streaming data?

As per AWS, streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (on the order of kilobytes). Streaming data includes a wide variety of data such as log files generated by customers using your mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors or geospatial services, and telemetry from connected devices or instrumentation in data centers.

Basic Architecture of Flume :-

Data from sources such as Twitter, Facebook, and web servers is passed through Flume and stored in a centralized data store like HDFS or HBase.

But a question arises: why are HDFS 'put' and 'copyFromLocal' not efficient for this?
⦁    The 'put' command can transfer only one file at a time, whereas log data is generated at a much higher rate than that.
⦁    For the 'put' command, the data needs to be fully packaged and ready for upload, but that is simply not possible with web server logs, which are written continuously (see the sketch after this list).
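For contrast, here is a rough sketch of what such a one-shot upload with the HDFS shell looks like; the local log path and the target directory /logs are placeholder names, not from any particular setup:

    # Create the target directory and copy one finished local file into HDFS.
    # This only suits data that has stopped changing; a log file that is still
    # being appended to would be uploaded incomplete, and every new batch of
    # data needs another manual upload.
    hadoop fs -mkdir -p /logs
    hadoop fs -put /var/log/httpd/access.log /logs/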

Apache Flume is a tool for data ingestion into HDFS. It collects, aggregates, and transports large amounts of streaming data, such as log files and events, from various sources like network traffic, social media, and email messages to HDFS. Flume is highly reliable and distributed.
The main idea behind Flume's design is to capture streaming data from various web servers and deliver it to HDFS. It has a simple and flexible architecture based on streaming data flows, and it is fault-tolerant, with built-in mechanisms for reliability and failure recovery.
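To make that concrete, the sketch below shows a minimal single-agent Flume configuration that tails a web server log and writes it to HDFS. The agent name (webagent), the log file path, and the HDFS URL are made-up placeholders for illustration, not values from any real cluster:

    # webagent.conf - one agent with a single source, channel, and sink
    webagent.sources  = tail-source
    webagent.channels = mem-channel
    webagent.sinks    = hdfs-sink

    # Source: follow a growing web server log file
    webagent.sources.tail-source.type     = exec
    webagent.sources.tail-source.command  = tail -F /var/log/httpd/access.log
    webagent.sources.tail-source.channels = mem-channel

    # Channel: buffer events in memory between source and sink
    webagent.channels.mem-channel.type     = memory
    webagent.channels.mem-channel.capacity = 10000

    # Sink: write the events into HDFS, partitioned by date
    webagent.sinks.hdfs-sink.type                   = hdfs
    webagent.sinks.hdfs-sink.hdfs.path              = hdfs://namenode:8020/flume/weblogs/%Y-%m-%d
    webagent.sinks.hdfs-sink.hdfs.fileType          = DataStream
    webagent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
    webagent.sinks.hdfs-sink.channel                = mem-channel

Such an agent would typically be started with a command along the lines of flume-ng agent --conf conf --conf-file webagent.conf --name webagent, after which every new line appended to the log flows through the memory channel into date-partitioned files under /flume/weblogs, with no manual packaging or uploading involved.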

