Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later to become the AMPLab. The researchers in the lab had previously been working on Hadoop MapReduce, and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas like support for in-memory storage and efficient fault recovery.
Apache Spark is a cluster computing platform designed to be fast and general purpose. One of Spark's main goals was to run MapReduce-style workloads more efficiently, especially in terms of speed. Spark achieves this speed largely through in-memory computation, and even for workloads that spill to disk it handles complex jobs better than MapReduce.
Spark is also highly accessible: it offers APIs for Scala, Python, SQL, and Java, and it ships with a wide variety of libraries. It can be integrated with most big data tools.
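To give a feel for the API, here is a minimal word-count sketch in Scala. The application name, the local master setting, and the data.txt path are placeholders, and a local Spark installation is assumed:

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Build a local SparkSession; "data.txt" is a hypothetical input path.
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Classic word count: split lines into words, then count occurrences.
    val counts = sc.textFile("data.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}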
Spark Core :- Spark Core handles basic functionality such as task scheduling, memory management, fault recovery, and interaction with storage systems. It is also home to Spark's most important API, the RDD (Resilient Distributed Dataset).
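A small RDD sketch, reusing the SparkContext sc from the example above. The key point is that transformations are lazy and only an action triggers computation:

// Distribute a local collection across the cluster as an RDD.
val numbers = sc.parallelize(1 to 100)
val evens   = numbers.filter(_ % 2 == 0)   // lazy transformation
val squares = evens.map(n => n * n)        // another lazy transformation
println(squares.sum())                     // action: triggers the computation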
Spark SQL :- Spark supports SQL as well as HQL (Hive Query Language). It can read from many data sources, including Hive tables, JSON, and Parquet. Moreover, Spark SQL lets you blend SQL queries with programmatic RDD operations.
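A minimal Spark SQL sketch, assuming the SparkSession spark from above and a hypothetical people.json file with "name" and "age" fields:

// Load JSON into a DataFrame and register it as a temporary view.
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

// Query it with plain SQL...
spark.sql("SELECT name, age FROM people WHERE age > 21").show()

// ...or drop down to the RDD underneath and keep working programmatically.
val names = people.rdd.map(row => row.getAs[String]("name"))
names.take(5).foreach(println)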
MLlib :- MLlib provides common machine learning functionality, with all the required algorithms for classification, regression, clustering, and collaborative filtering.
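As an illustration, here is a clustering sketch using MLlib's RDD-based KMeans. It assumes sc from above, and the data points are made up:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Four 2-D points forming two obvious clusters (toy data).
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

// Train with k = 2 clusters and 20 iterations, then print the centers.
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)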
GraphX :- GraphX extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. GraphX also provides various operators for manipulating graphs.
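A minimal GraphX sketch, again assuming sc; the vertex names and edge labels are invented for illustration:

import org.apache.spark.graphx._

// Vertices carry a name; edges carry a relationship label.
val vertices = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Carol")
))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")
))

val graph = Graph(vertices, edges)

// One of GraphX's built-in operators: the in-degree of each vertex.
graph.inDegrees.collect().foreach(println)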
It’s important to remember that Spark does not require Hadoop; it simply has support for storage systems implementing the Hadoop APIs. Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat.
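A sketch of that storage flexibility, using sc and spark from the earlier examples; all paths are placeholders for wherever your data actually lives:

val plain   = sc.textFile("hdfs:///data/logs.txt")              // plain text
val seq     = sc.sequenceFile[String, Int]("hdfs:///data/seq")  // SequenceFile
val parquet = spark.read.parquet("hdfs:///data/events.parquet") // Parquet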