Monday, October 29, 2018

Spark - Reading data from hive via pyspark

Hive and PySpark are two components that sit on top of the Hadoop framework and allow us to connect to and fetch data. We will start by creating a simple test table in Hive.

1.) Creating a table in Hive.

First, we list the available databases using the command below.

       show  databases;



This command displays all the databases available in Hive. We can select any of them with the ‘use’ command.

2.) use databasename;

In our case, the database name is spark_hive, so the command will be

       use spark_hive;



3.) Now that we are in the required database, we can create our table.



4.) Once the table has been created in Hive, we can fetch the data from Spark using the code below.
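A minimal sketch of such a script, assuming a hypothetical table named test_table inside the spark_hive database. The helper names (build_select_query, read_hive_table, main) and the application name are illustrative and not taken from the original post; the Spark calls themselves (SparkSession.builder, enableHiveSupport, spark.sql) are standard PySpark.

```python
def build_select_query(database, table, limit=None):
    """Return a SELECT statement for database.table (names are illustrative)."""
    query = f"SELECT * FROM {database}.{table}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


def read_hive_table(database, table, limit=None):
    """Read a Hive table into a Spark DataFrame."""
    # Imported here so the query helper can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("read-from-hive")  # hypothetical application name
        .enableHiveSupport()        # lets Spark talk to the Hive metastore
        .getOrCreate()
    )
    return spark.sql(build_select_query(database, table, limit))


def main():
    # "test_table" is an assumed table name in the spark_hive database.
    df = read_hive_table("spark_hive", "test_table", limit=10)
    df.show()
```

To use this as a script, add a call to main() at the bottom of the file. The Spark installation is assumed to be configured to reach the Hive metastore, for example through a hive-site.xml in Spark's conf directory.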




5.) We can run this code with the spark-submit command.

              spark-submit filename.py

6.) Once the process has executed, we will get the output below on our terminal. So, we can connect to Hive and fetch data via Spark with this simple set of commands.



Sunday, October 28, 2018

Statistics - Skewness





What is symmetric data?

A data set in which the left-hand and right-hand sides of the distribution are roughly equal. In a histogram, the left and right tails of the distribution are evenly balanced. Such data are often referred to as symmetric data.

In the case below, if we calculate the mean and median, both are approximately equal to 5.5.



Mean = 5.5
Median = 5.5
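As a quick check, here is a small sketch using Python's standard statistics module. The data set (the integers 1 through 10) is an assumption chosen because it reproduces the values above; it is not necessarily the data behind the original figure.

```python
import statistics

# Assumed illustrative data: the integers 1 through 10.
data = list(range(1, 11))

mean = statistics.mean(data)
median = statistics.median(data)

# For symmetric data the two measures coincide.
print("Mean =", mean)
print("Median =", median)
```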

We can see that the tail to the left of the center value is almost equal to the tail to the right of the center. Skewness refers to asymmetry, or lack of symmetry, in the frequency distribution.

A distribution that is asymmetrical, however, is skewed. Skewness can be either positive or negative.

Negative Skewness :- When a distribution is skewed to the left (red dashed curve), the tail on the curve's left-hand side is longer than the tail on the right-hand side, and the mean is less than the mode. This situation is also called negative skewness.

                                    Mean < Mode


Positive Skewness :- When a distribution is skewed to the right, the tail on the curve's right-hand side is longer than the tail on the left-hand side, and the mean is greater than the mode. This situation is also called positive skewness.

                                    Mean > Mode




Tests of Skewness :-

1. The values of mean, median and mode do not coincide.
2. When the data are plotted on a graph they do not give the normal bell-shaped form, i.e. when cut along a vertical line through the centre the two halves are not equal.
3. The sum of the positive deviations from the median is not equal to the sum of the negative deviations.
4. Quartiles are not equidistant from the median.
5. Frequencies are not equally distributed at points of equal deviation from the mode.
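Tests 3 and 4 above can be sketched with Python's standard statistics module. The sample values and the function name skewness_checks are assumptions made for illustration, and statistics.quantiles uses its default "exclusive" method here; other quartile conventions may give slightly different numbers.

```python
import statistics


def skewness_checks(data):
    """Return the quantities compared in tests 3 and 4 for a data set."""
    median = statistics.median(data)
    q1, q2, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method

    # Test 3: sums of deviations above and below the median.
    positive_dev = sum(x - median for x in data if x > median)
    negative_dev = sum(median - x for x in data if x < median)

    # Test 4: distance of each outer quartile from the median.
    lower_gap = q2 - q1
    upper_gap = q3 - q2

    return positive_dev, negative_dev, lower_gap, upper_gap


# Assumed right-skewed sample: a few large values stretch the right tail.
sample = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15]
positive_dev, negative_dev, lower_gap, upper_gap = skewness_checks(sample)

# Unequal deviation sums and unequal quartile gaps both point to skewness.
print(positive_dev, negative_dev)
print(lower_gap, upper_gap)
```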

Karl Pearson’s Measure :-

The formula for measuring skewness as given by Karl Pearson is as follows:
                     Skewness = Mean - Mode



Coefficient of Skewness = (Mean - Mode) / SD

where SD is the standard deviation.
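The coefficient can be computed directly from its definition. This is a minimal sketch under two assumptions: the sample values are made up for illustration, and the population standard deviation (statistics.pstdev) is used, since the text does not say whether SD means the population or the sample figure.

```python
import statistics


def pearson_mode_skewness(data):
    """Pearson's coefficient of skewness: (mean - mode) / standard deviation."""
    mean = statistics.mean(data)
    mode = statistics.mode(data)   # assumes the data has a single clear mode
    sd = statistics.pstdev(data)   # assumption: population standard deviation
    return (mean - mode) / sd


# Assumed samples, for illustration only.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15]  # long right tail
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]         # balanced around 3

# A positive value indicates positive skewness (Mean > Mode);
# a value of zero indicates mean and mode coincide.
print(pearson_mode_skewness(right_skewed))
print(pearson_mode_skewness(symmetric))
```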


Hadoop - What is a Job in Hadoop?

In the field of computer science, a job just means a piece of program and the same rule applies to the Hadoop ecosystem as wel...