Wednesday, February 28, 2018

RDD In Apache Spark

The RDD (Resilient Distributed Dataset) is considered the core and heart of Spark. Every Spark program works with RDDs: they can be created, transformed, or used in an action. Spark automatically distributes the data held in an RDD across the cluster.

An RDD is an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel.
An RDD can be created in two different ways :-

1.) Parallelizing an existing collection, such as a list or set, in the driver program.
2.) Referencing an external dataset such as a file in HDFS, a text file, a CSV file, etc.
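Both creation paths can be sketched roughly as follows (this assumes a SparkContext sc is already available, as in the interactive shell, and the file name data.txt is only an illustrative placeholder):

# 1.) Parallelize an existing Python collection from the driver program.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2.) Reference an external dataset (the file name is a placeholder).
lines = sc.textFile("data.txt")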

Once an RDD is created, two kinds of operations can take place :-

1.) Transformation :- Generates one RDD from another RDD; filtering is a common example.
2.) Action :- Computes a result on the basis of an RDD and returns it to the driver, e.g. count or first.
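A minimal sketch of both kinds of operations, reusing the lines RDD created above (the "error" filter is only illustrative):

# Transformation: build a new RDD containing only the lines that mention "error".
error_lines = lines.filter(lambda line: "error" in line)

# Actions: compute results and return them to the driver program.
print(error_lines.count())   # number of matching lines
print(error_lines.first())   # the first matching line (fails if the RDD is empty)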

Lazy Evaluation in Spark :-

Spark works on the concept of lazy evaluation. It means that an RDD is not actually computed until Spark encounters an action. Initially this seems weird, but when handling Big Data this concept is quite useful. All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base data set (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
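A rough sketch of caching, reusing the error_lines RDD from the earlier example:

# Keep the filtered RDD in memory so repeated actions do not recompute it.
error_lines.persist()          # or equivalently error_lines.cache()
print(error_lines.count())     # the first action triggers the computation and caches the result
print(error_lines.count())     # the second action reads from the cache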

Tuesday, February 27, 2018

Standalone Application in Spark

Spark can be run interactively as well as in a standalone program. The major difference between the interactive shell and a standalone application is that we need to create the SparkContext ourselves in the standalone case, whereas in the interactive shell it is already available through the sc variable.
We will learn Spark through its Python API. In Python, you simply write applications as Python scripts, but you must run them using the bin/spark-submit script included in Spark. The spark-submit script includes the Spark dependencies for us in Python. This script sets up the environment for Spark's Python API to function.

Initialization in a Standalone Program :-

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Count").setMaster("local")
sc = SparkContext(conf=conf)

Line 1 imports the Spark classes (SparkConf and SparkContext) for Python.
In Line 2, we give the application the name "Count", which identifies it on the cluster, and we tell Spark how to connect to a cluster via the master URL. local is a special value that runs Spark on one thread on the local machine, without connecting to a cluster.
In Line 3, we initialize the SparkContext and store it in the sc variable.
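Putting it together, a minimal standalone script might look like the sketch below (the file names count.py and data.txt are only illustrative):

# count.py -- a minimal standalone Spark application
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Count").setMaster("local")
sc = SparkContext(conf=conf)

lines = sc.textFile("data.txt")            # placeholder input file
print("Number of lines:", lines.count())   # action: count the lines
sc.stop()

It would then be submitted with bin/spark-submit count.py.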

Apache Spark Core

In Spark, the architecture is mainly divided between the driver and the executor nodes. The driver node is where the main program gets executed. The driver takes the main program, distributes the data sets to the worker nodes, and also tells the worker nodes which operations they are supposed to perform.
In layman's words, the driver is the manager and the executors are the developers: the driver distributes the resources and the tasks to be performed by each developer.
The driver program accesses Spark through a SparkContext object, which represents the connection to a computing cluster.


In the Spark shell, you can access the SparkContext via the sc variable.

If we run our program on our local machine, everything runs on that single machine. But when we run the same program on a cluster, different parts of the program run on different nodes of the cluster.

Monday, February 26, 2018

Apache Spark Introduction

Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later to become the AMPLab. The researchers in the lab had previously been working on Hadoop Map‐Reduce, and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas like support for in-memory storage and efficient fault recovery.

Apache Spark is a cluster computing platform designed to be fast and general purpose. One of the main goals of Apache Spark was to handle MapReduce-style workloads more efficiently on the speed side.
To achieve that speed, Spark computes in memory, but it also handles complex problems better than MapReduce when working from disk.
Spark is highly accessible, meaning it offers a wide range of APIs for Scala, Python, SQL, Java, etc. It also has a wide variety of libraries.
It can be integrated with most of the Big Data tools.
 


Spark Core :- Spark Core handles basic functionality like task scheduling, memory management, fault tolerance, and interaction with storage systems. It is the home of the most sought-after API in Spark, the RDD.
Spark SQL :- Spark supports SQL as well as HQL (Hive Query Language). It can read many sources of data, including Hive tables, JSON, and Parquet. Moreover, Spark SQL allows blending RDDs with SQL, as sketched below.
MLlib :- It provides machine learning functionality with all the commonly required algorithms, such as classification, regression, clustering, and collaborative filtering.
GraphX :- GraphX extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. GraphX also provides various operators for manipulating graphs.
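As a rough illustration of the RDD/SQL blending, the sketch below (assuming Spark 2.x, where the SparkSession entry point is available; the data and names are made up) turns an RDD into a DataFrame and queries it with SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlSketch").master("local").getOrCreate()

# Build an RDD in the driver, convert it to a DataFrame, and query it with SQL.
people_rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])
people_df = people_rdd.toDF(["name", "age"])
people_df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()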

It’s important to remember that Spark does not require Hadoop; it simply has support for storage systems implementing the Hadoop APIs. Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat.

Saturday, February 24, 2018

Normal Distribution

It is one of the most common forms of probability distribution and can take any value within a range.
The normal distribution is a type of continuous probability distribution. Its graph is the familiar bell curve.

The two parameters that characterize the Normal Distribution are :-


1.) Mean
2.) Variance

The normal distribution can take any value from -infinity to +infinity. There are infinitely many normal distributions, varying in their mean and variance.

Standard Normal Distribution / Z-distribution :- The normal distribution with mean = 0 and standard deviation = 1 is called the standard normal distribution or Z-distribution.

Characteristics of Normal Distribution :-
1.) Symmetry
2.) A single most common value (unimodality)
3.) Range from -infinity to +infinity
4.) Area under the curve is 1
5.) A common value for the mean, median and mode.

Examples to illustrate it :-
 
A normal distribution is perfectly symmetrical around its center. That is, the right side of the center is a mirror image of the left side. There is also only one mode, or peak, in a normal distribution. Normal distributions are continuous and have tails that are asymptotic, which means that they approach but never touch the x-axis. The center of a normal distribution is located at its peak, and 50% of the data lies above the mean, while 50% lies below. It follows that the mean, median, and mode are all equal in a normal distribution.

Some common examples of quantities that approximately follow a normal distribution :-

1.) Height
2.) IQ
3.) Blood pressure
4.) Salaries, etc.

The empirical rule tells you what percentage of your data falls within a certain number of standard deviations from the mean:

• 68% of the data falls within one standard deviation of the mean.
• 95% of the data falls within two standard deviations of the mean.
• 99.7% of the data falls within three standard deviations of the mean.
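These percentages can be checked numerically with the standard normal CDF. A small sketch using SciPy (SciPy is not used elsewhere in this post; it is just a convenient way to verify the rule):

from scipy.stats import norm

# Proportion of a normal distribution within k standard deviations of the mean.
for k in (1, 2, 3):
    proportion = norm.cdf(k) - norm.cdf(-k)
    print(k, round(proportion, 4))   # prints roughly 0.6827, 0.9545, 0.9973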


Z-scores/Normalized Scores:-

The formula for the Z-score is:

Z = (X - μ) / σ

where X is the value, μ is the mean, and σ is the standard deviation of the distribution.

Let us take an example to understand the concept of Z-scores.

A wild pack of Chihuahuas terrorizing the countryside has a mean height of 7.5 inches, with a standard deviation of 1.5 inches. We feel sorry for the person who had to measure that. What proportion of these Chihuahuas are between 6 and 9 inches tall?
When we want to know something about probabilities or proportions of normal distributions, we need to work with Z-scores. We use them to convert a value into the number of standard deviations it is from the mean, using the formula above.

μ is another fancy code name for the mean of the normal distribution, while σ is its standard deviation. We can find the Z-scores for 6 and 9 inches now:

Z = (6 - 7.5) / 1.5 = -1 and Z = (9 - 7.5) / 1.5 = +1
How much of the normal distribution falls within 1 standard deviation above or below the mean? According to the Empirical Rule, that's 68% of the distribution, so about 68% of the Chihuahuas are between 6 and 9 inches tall.







Further Reading :- https://en.wikipedia.org/wiki/Normal_distribution
http://www.statisticshowto.com/probability-and-statistics/normal-distributions/


Hadoop - What is a Job in Hadoop ?

In the field of computer science, a job just means a piece of a program, and the same rule applies to the Hadoop ecosystem as wel...