Friday, August 2, 2019

Spark - Correlation

Correlation describes the association between two or more variables: when one variable moves, the other tends to move along with it.
It is normally used when we want to explore how two variables are related to each other.

Types of Correlation :-
Correlation can be classified in several ways, but most generally it is divided into three groups.

(i) Positive and Negative
(ii) Linear and Non Linear
(iii) Simple, partial and multiple

Positive and Negative :- Correlation can be positive or negative. If both variables move in the same direction, the correlation is termed positive; if they move in opposite directions, it is termed negative.
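To make the sign of a correlation concrete, here is a minimal sketch in plain Python (no Spark needed). The `pearson` helper and the sample figures (`hours_studied`, `exam_score`, `hours_of_tv`) are made up for illustration and are not part of the Spark example later in this post.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: scores rise with study hours, TV hours fall.
hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_score = [52.0, 58.0, 61.0, 70.0, 74.0]
hours_of_tv = [5.0, 4.5, 3.0, 2.5, 1.0]

positive = pearson(hours_studied, exam_score)  # same direction, so > 0
negative = pearson(hours_studied, hours_of_tv)  # opposite direction, so < 0
print("positive:", positive)
print("negative:", negative)
```

A coefficient close to +1 means the two variables rise together, and a coefficient close to -1 means one falls as the other rises.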

Linear and Non Linear :- If a change in one variable is accompanied by a change in the other variable in a constant ratio, it is a case of linear correlation. On the other hand, if the amount of change in one variable does not bear a constant ratio to the change in the other variable, it is a case of non-linear or curvilinear correlation.
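The difference can be sketched in plain Python by comparing Pearson's coefficient (which measures linear association) with Spearman's rank coefficient (which only looks at ordering). The helpers `pearson`, `ranks` and `spearman` below are illustrative names, and `ranks` assumes there are no tied values to keep the sketch short.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Rank each value from 1 to n (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

x = [float(v) for v in range(1, 9)]
linear = [2.0 * v + 1.0 for v in x]  # constant ratio of change
cubic = [v ** 3 for v in x]          # ratio of change keeps growing

pearson_linear = pearson(x, linear)
pearson_cubic = pearson(x, cubic)
spearman_cubic = spearman(x, cubic)
print("Pearson, linear:", pearson_linear)
print("Pearson, cubic: ", pearson_cubic)
print("Spearman, cubic:", spearman_cubic)
```

For the linear series Pearson's coefficient is 1 (up to rounding). For the cubic series it drops below 1 even though the relationship is perfectly monotonic, which is why Spearman's method can be the better choice for curvilinear data.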

Simple, partial and multiple :- If only two variables are involved in a study, the correlation is said to be simple correlation. When three or more variables are involved, it is a problem of either partial or multiple correlation. In multiple correlation, three or more variables are studied simultaneously. In partial correlation, we consider only two variables influencing each other while the effect of the other variable(s) is held constant.
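As a rough illustration of partial correlation, the sketch below uses the standard first-order formula r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). The three series are invented so that `x` and `y` both depend on a third variable `z`; the helper names are hypothetical as well.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys with the effect of zs held constant."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Invented data: x and y are each driven by z plus unrelated noise.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [3.0, 3.0, 7.0, 7.0, 11.0, 11.0]   # 2*z + (+1, -1, +1, -1, +1, -1)
y = [4.0, 7.0, 8.0, 11.0, 16.0, 19.0]  # 3*z + (+1, +1, -1, -1, +1, +1)

simple_r = pearson(x, y)
partial_r = partial_corr(x, y, z)
print("simple correlation of x and y: ", simple_r)
print("partial correlation given z:   ", partial_r)
```

Here the simple correlation between `x` and `y` is high, but once `z` is held constant the partial correlation is essentially zero, which suggests that the apparent link between `x` and `y` comes entirely from `z` in this made-up data.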


Let us implement this in a simple Python program using PySpark :-



from pyspark import SparkContext, SparkConf
from pyspark.mllib.stat import Statistics
import numpy as np

conf = SparkConf().setAppName("test")
sc = SparkContext(conf=conf)

seriesX = sc.parallelize([1.0, 2.0, 3.0, 3.0, 5.0]) # a series
# seriesY must have the same number of partitions and cardinality as seriesX
seriesY = sc.parallelize([11.0, 22.0, 33.0, 33.0, 555.0])

# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print("Correlation is: " + str(Statistics.corr(seriesX, seriesY, method="pearson")))

# an RDD of Vectors: each array is one row, each position is one variable
data = sc.parallelize([
    np.array([1.0, 10.0, 100.0]),
    np.array([2.0, 20.0, 200.0]),
    np.array([5.0, 33.0, 366.0]),
])

# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print(Statistics.corr(data, method="pearson"))
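If you want to sanity-check what the Spark job prints without starting a cluster, the same Pearson coefficients can be recomputed in plain Python. This is only a cross-check sketch on the sample values above, not how Spark computes them internally; `pearson` is a hypothetical helper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Same values as seriesX and seriesY in the Spark program.
series_x = [1.0, 2.0, 3.0, 3.0, 5.0]
series_y = [11.0, 22.0, 33.0, 33.0, 555.0]
series_corr = pearson(series_x, series_y)
print("Correlation is:", series_corr)  # roughly 0.85

# Same rows as the RDD of vectors; correlate every pair of columns.
rows = [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [5.0, 33.0, 366.0]]
columns = list(zip(*rows))
matrix = [[pearson(a, b) for b in columns] for a in columns]
for row in matrix:
    print(row)
```

The matrix is symmetric with ones on the diagonal, because every column is perfectly correlated with itself.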


