Sunday, August 25, 2019

Python - Command Line Arguments


sys.argv holds the arguments that are passed to a Python program through the command line. It is a list containing every command line argument, and the indexing of these arguments starts at 0.

We will write a small sample program to check how this works. We need to import the sys module for it.
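A minimal sketch of such a program, consistent with the explanation below (the file name commandline_args.py is a hypothetical choice of mine), would be:

import sys

# sys.argv is a list of strings holding the command line arguments
print("Name of the program: " + sys.argv[0])             # index 0 is the script name
print("Total number of arguments: " + str(len(sys.argv)))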


Let us run the program and check the output.

Output :-
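Assuming the sketch above is run as python commandline_args.py 10 20 (a hypothetical invocation), it prints:

Name of the program: commandline_args.py
Total number of arguments: 3
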
The first print statement gives us the name of the program, because index 0 (sys.argv[0]) always holds the script name.

The second print statement prints the total number of arguments passed to the program, which is the length of the sys.argv list.


We should note the below points regarding command line arguments :-

  • sys.argv exposes the command line arguments in the form of a list.
  • The first element of the list is the name of the script file.
  • The arguments always come in the form of strings, even if we type an integer in the argument list; we need the int() function to convert such a string to an integer, as shown in the sketch below.
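For example, a hypothetical script add_args.py that adds two numeric arguments needs this conversion:

import sys

# Arguments arrive as strings, so convert them before doing arithmetic
total = int(sys.argv[1]) + int(sys.argv[2])
print("Sum of the arguments: " + str(total))

Running it as python add_args.py 10 20 prints "Sum of the arguments: 30"; without int(), string concatenation would give "1020" instead.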

Tuesday, August 6, 2019

Spark - Basic Statistics


We have already gone through the tutorial on Measures of Central Tendency. Now we will implement them in PySpark. We need to import the Statistics module from pyspark.mllib.stat.
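A minimal sketch using Statistics.colStats (the app name and the sample values here are illustrative assumptions, not the original data) would be:

from pyspark import SparkContext, SparkConf
from pyspark.mllib.stat import Statistics
import numpy as np

conf = SparkConf().setAppName("basic_statistics")
sc = SparkContext(conf=conf)

# an RDD of Vectors; each row is one observation
data = sc.parallelize([
    np.array([1.0, 10.0, 100.0]),
    np.array([2.0, 20.0, 200.0]),
    np.array([3.0, 30.0, 300.0]),
])

# colStats() returns column-wise summary statistics
summary = Statistics.colStats(data)
print(summary.mean())         # mean of each column
print(summary.variance())     # variance of each column
print(summary.numNonzeros())  # number of non-zero values in each column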


Once the Spark job is submitted, we will get the below output as the result.
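For the illustrative sample values above, the summary works out to a mean of [2.0, 20.0, 200.0] per column, a (sample) variance of [1.0, 100.0, 10000.0], and a non-zero count of [3.0, 3.0, 3.0].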

The complete code is available in my Github library :- https://github.com/sangam92/Spark_tutorials

Friday, August 2, 2019

Spark - Correlation

Correlation explains the association between two or more variables: the movement of one variable is accompanied by a movement in the other.
It is normally used in situations where we want to explore how two variables are related to each other.

Types of Correlation :-
Correlation can be classified in several ways, but in the most generic classification it is divided into three types:

(i) Positive and negative
(ii) Linear and non-linear
(iii) Simple, partial and multiple

Positive and negative :- Correlation can be positive or negative. If both variables move in the same direction, we term the correlation positive; otherwise it is termed negative.

Linear and non-linear :- If the change in one variable is accompanied by a change in the other variable in a constant ratio, it is a case of linear correlation. On the other hand, if the amount of change in one variable does not bear a constant ratio to the change in the other variable, it is a case of non-linear or curvilinear correlation.

Simple, partial and multiple :- If only two variables are involved in a study, the correlation is said to be simple. When three or more variables are involved, it is a problem of either partial or multiple correlation. In multiple correlation, three or more variables are studied simultaneously; in partial correlation we consider only two variables influencing each other while the effect of the other variable(s) is held constant.


Let us implement it in a simple Python program :-



from pyspark import SparkContext, SparkConf
from pyspark.mllib.stat import Statistics
import numpy as np

conf = SparkConf().setAppName("test")
sc = SparkContext(conf=conf)

seriesX = sc.parallelize([1.0, 2.0, 3.0, 3.0, 5.0])  # a series
# seriesY must have the same number of partitions and cardinality as seriesX
seriesY = sc.parallelize([11.0, 22.0, 33.0, 33.0, 555.0])

# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print("Correlation is: " + str(Statistics.corr(seriesX, seriesY, method="pearson")))

# an RDD of Vectors
data = sc.parallelize([
    np.array([1.0, 10.0, 100.0]),
    np.array([2.0, 20.0, 200.0]),
    np.array([5.0, 33.0, 366.0]),
])

# Calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print(Statistics.corr(data, method="pearson"))
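For this data, the first call prints a Pearson coefficient of roughly 0.85, showing a strong positive association between seriesX and seriesY, and the second call prints a 3x3 correlation matrix whose diagonal entries are 1.0, since each column correlates perfectly with itself.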



Thursday, August 1, 2019

Neural Networks - An Introduction


Neural networks in computer systems are quite analogous to the neurons we have in our brain. Our brain, a collection of billions of neurons, can be considered the most advanced system of all. The decisions taken by the brain are impeccable and involve a great deal of permutation and combination.

Scientists worked day and night to produce an AI-enabled system that could solve the majority of problems. A few AI systems were developed with the capability to solve problems in a formal way, but they did not have a broader scope. IBM's Deep Blue was one such invention: it defeated world chess champion Garry Kasparov.

Initially, AI systems based their knowledge on a set of rules and regulations. Such systems were not very robust and had certain limitations: a human needed to feed in every possible situation, which was a cumbersome job. This approach was called the knowledge base approach. Scientists then devised a new methodology that does not require any specific rules and works according to the raw data fed into the system.

Scientists worked on machine learning approaches and had many successes. One of them was logistic regression, which was able to predict cesarean delivery after the system was fed certain inputs, a step called feature extraction. Similarly, naive Bayes was used to classify email as spam or non-spam. Feature extraction was quite a tough job: it required a lot of human effort and time, and moreover it required top-level domain expertise.

The ground-breaking innovation came in 1958, when Frank Rosenblatt introduced an artificial neuron called the perceptron. Later, this perceptron evolved and led the way to the multilayer perceptron (MLP).

We will cover the perceptron and the multilayer perceptron in our next blog post.

Hadoop - What is a Job in Hadoop ?

In the field of computer science, a job just means a piece of a program, and the same rule applies to the Hadoop ecosystem as wel...