Saturday, April 28, 2018

Variance, Standard Deviation, Coefficient of Variation and Outliers

Variance and standard deviation are two of the most common measures of the dispersion of data. Both can be used to describe how much the data vary from their average value, or mean.
Variance is the average of the squared deviations from the mean, while the standard deviation is the square root of the variance.

Calculation of Variance and Standard Deviation:-

1.) Take the mean of all the data that we have. Suppose we have a small dataset with the values 1, 2, 3, 4, 5.
Then the mean of this dataset is (1+2+3+4+5)/5 = 3.
2.) Subtract the mean from each data point and then square the result. But why do we square it?
We will find out in the example below.
 
If we simply add up the deviations (x - mean) for our dataset, what we get is 0 :-

(1-3) + (2-3) + (3-3) + (4-3) + (5-3) = -2 - 1 + 0 + 1 + 2 = 0
 
To overcome this problem, we work with the squared deviations and divide their sum by N.
Hence, the formula for the (population) variance is :-

population variance = Σ (x - mean)² / N

and the calculation for our dataset will be :-

[(1-3)² + (2-3)² + (3-3)² + (4-3)² + (5-3)²] / 5 = (4 + 1 + 0 + 1 + 4) / 5 = 2

However, the above calculation is for the population, not for the sample.
The formula for the sample variance divides by n - 1 instead of N :-

sample variance = Σ (x - mean)² / (n - 1)

And the calculation for our dataset will be :-

(4 + 1 + 0 + 1 + 4) / 4 = 2.5

3.) Take the square root of the variance to get the standard deviation.

So, the calculation for our dataset (population) will be :-

√2 ≈ 1.414

The standard deviation for the sample dataset will be :-

√2.5 ≈ 1.581

From the above calculation of the variance and standard deviation, we can conclude the points below.

1.) The variance can be zero or greater than zero. It will be zero only when all the values in the dataset are the same.
2.) For the same data, the sample variance will always be greater than or equal to the population variance, because we divide by n - 1 instead of N.
3.) The variance is expressed in squared units while all the other measurements are in the original units. To overcome this, we take the square root of the variance, which is called the standard deviation.

What does the standard deviation signify?

1.) The higher the value of the standard deviation, the more variability there is in the data.
2.) A low standard deviation means that the data points are clustered closely around the mean.

Coefficient of Variation :-

The variance and standard deviation calculated on the same data, but expressed in different units, will have different values. To overcome this problem, the concept of the coefficient of variation came into the picture: it is the standard deviation divided by the mean, which makes it unit-free.
Example :- Weights recorded in ounces and in pounds will have different variances and standard deviations for the same sample, but the same coefficient of variation.


coefficient of variation (CV) = s / x-bar

where,
s = standard deviation of the sample
x-bar = mean of the sample.
For our dataset, the coefficient of variation = 1.581 / 3 ≈ 0.527, i.e. roughly 53% of the mean.

Python implementation for the above dataset with the numpy library :-

import numpy as np

a = [1, 2, 3, 4, 5]
print('The variance will be', np.var(a))            # population variance (divides by n)
print('The standard deviation will be', np.std(a))  # population standard deviation

#output :-

The variance will be 2.0
The standard deviation will be 1.41421356237
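
The values above are the population versions. For the sample variance and standard deviation discussed earlier, numpy's ddof=1 argument switches the divisor from n to n - 1; here is a small sketch that also computes the coefficient of variation:

import numpy as np

a = [1, 2, 3, 4, 5]

sample_var = np.var(a, ddof=1)    # divides by n - 1, gives 2.5
sample_std = np.std(a, ddof=1)    # square root of 2.5, roughly 1.581
cv = sample_std / np.mean(a)      # coefficient of variation, roughly 0.527

print('The sample variance will be', sample_var)
print('The sample standard deviation will be', sample_std)
print('The coefficient of variation will be', cv)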

 

Note :- The calculation of standard deviation in single precision may be inaccurate.       

Outliers :-   A data point that is markedly different from the rest of the sample is called an outlier. It is normally considered to be a data point that comes from a different population or sample.

Why outlier detection is important :-

1.) Outliers can distort the calculation of common statistics like the mean.
2.) There is a chance that the data point is erroneous.
3.) The data point may be coming from a different sample or population.

How to handle the outliers ?

1.) Trimmed mean.
2.) Interquartile range (IQR).
3.) Deletion of the outliers (subject to the statistician's judgement).
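
Here is a small sketch of the first two approaches; the 20% trim level, the 1.5 x IQR cut-off and the sample values are only illustrative, and scipy is assumed to be available:

import numpy as np
from scipy import stats

data = np.array([1, 2, 3, 4, 5, 100])   # 100 is an obvious outlier

# Trimmed mean: drop 20% of the values from each end before averaging
trimmed = stats.trim_mean(data, proportiontocut=0.2)

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print('Trimmed mean:', trimmed)   # 3.5, whereas the ordinary mean is about 19.17
print('Outliers:', outliers)      # [100]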

Further Reading :- https://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html

Spark - Normal Join


RDDs support different kinds of joins. We will look at each of them one by one.

Normal Join :-  It outputs the data from both RDDs for the keys that are present in both of them.

Example :-  Suppose the RDD 'GOALS' holds a list of football players with their respective number of goals, and the RDD 'MATCH' contains the number of matches played by them.

RDD GOALS

Player Name             Goals Scored
Messi                             71
Ronaldo                          77
Pele                               59
Zidane                            42
Drogba                           27

RDD MATCH

Player Name             Matches Played
Messi                            163
Ronaldo                         171
Pele                              142
Zidane                             91
Rooney                          183

The joining happens on the basis of the key; when the key is available in both RDDs, the resulting output will have their corresponding values.

Output :-

Messi             71        163
Ronaldo          77        171
Pele               59        142
Zidane            42          91

Python Code Snippet  :-
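
A minimal PySpark sketch of the join described above (the SparkContext setup, the parallelize calls and the collect at the end are just for illustration):

from pyspark import SparkContext

sc = SparkContext(appName='NormalJoinExample')

# (player name, goals scored)
goals = sc.parallelize([('Messi', 71), ('Ronaldo', 77), ('Pele', 59),
                        ('Zidane', 42), ('Drogba', 27)])

# (player name, matches played)
matches = sc.parallelize([('Messi', 163), ('Ronaldo', 171), ('Pele', 142),
                          ('Zidane', 91), ('Rooney', 183)])

# join() keeps only the keys that are present in both RDDs
joined = goals.join(matches)

for player, (scored, played) in joined.collect():
    print(player, scored, played)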

Sunday, April 22, 2018

HBase - An Introduction


With the advent of Big Data, we soon realized the limitations of relational databases in handling such data. The need of the hour was to find a solution that could overcome the limitations of the RDBMS.
Hadoop has the capability to store and process huge amounts of data and can be a good alternative to these RDBMS systems.

However, HDFS cannot do random reads very well; we need to scan the entire data set even to fetch a small set of records.

HBase is a column-oriented database that sits on top of the Hadoop Distributed File System.



  • It has low latency and can be used for frequent lookups in the database.
  • The idea for HBase was inspired by Google's Bigtable, and it is written in Java.
  • Transactional capabilities like insert, update and delete are well supported by HBase.
  • HBase also supports Avro, Thrift and REST services.
  • It is good for large amounts of data and is highly scalable.
  • It can be a good fit for real-time needs as well.

HBase stores the data as key-value pairs and can scan out a single record from among billions of records in an efficient way.

Data is indexed by row key, and the row keys are automatically kept in sorted order.

Companies such as Yahoo and Twitter use HBase.


Wednesday, April 18, 2018

HIVE - Bucket

Though partitioning improves query performance to a large extent, there are scenarios where partitioning alone does not serve the purpose completely.

Let us take a real-time example. Suppose our user requires the data based upon the type of transaction (electronic, paper), and we have partitioned the table on the Type_of_Transaction column. In such scenarios, the data can be heavily skewed towards one value; let us say 70 percent of the data belongs to the electronic transaction type.

Then, even with partitioning, we will face the same performance issue for that partition.
The main idea behind bucketing is to handle such scenarios.
Bucketing divides the data of a table or partition into a fixed number of bucket files and improves the query performance to a greater extent.

The number of buckets is fixed when the table is created; each row is assigned to a bucket by a hash of the bucketing column, modulo the number of buckets.

Clustered By:

CREATE TABLE students (name VARCHAR(64), age INT, CGPA DECIMAL(3, 2))
CLUSTERED BY (age) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES("transactional"="true");


We will load some sample data into this table and see how bucketing shows up at the storage level.
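
A hedged example of inserting a few rows (the names, ages and CGPA values below are made up; INSERT ... VALUES needs a transactional table, which the DDL above already declares):

INSERT INTO TABLE students VALUES
  ('Alice', 21, 8.50),
  ('Bob', 22, 7.80),
  ('Carol', 21, 9.10),
  ('Dave', 23, 6.90);
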
We can check the /user/hive/warehouse/hivetest.db/students directory to see how the data is laid out.

We can see two bucket files, one for each bucket.



Before implementing bucketing, we need to check whether the below properties are set or not :-
set hive.support.concurrency =true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=2;

Saturday, April 14, 2018

HIVE - Serde (CSV)

HIVE has an immense capability for processing data, and a big part of it is achieved with the help of SerDes, which stands for Serializer/Deserializer.

To read semi-structured data like JSON, XML or CSV, HIVE needs to understand how to process such formats. To solve this, the SerDe came into the picture.

A SerDe mainly performs two functions :-

1.) Reading (deserializing) the records stored on HDFS into the rows of a table.
2.) Writing (serializing) the rows of a table back to HDFS.


The DESERIALIZER takes the binary or string representation of a record and converts it into a Java object that can be manipulated by HIVE.
The SERIALIZER takes the Java object and converts it back into a format that can be written to HDFS.

Third-party SerDes can be downloaded from Hadoop distribution vendors (such as Cloudera or Hortonworks).
The JAR file needs to be placed in $HIVE_HOME/lib, and the required SerDe needs to be registered with HIVE (for example, with the ADD JAR command).
Let us take an example to understand the SerDe in a more concrete way.
We have a CSV file, serdefile.csv, with some sample data.
 
We need to put this file into HDFS using the put command.
hadoop fs -put <source> <destination>
We need to create an external table on top of this file to read the data.
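
A minimal sketch of such a table, assuming the built-in OpenCSVSerde (a vendor-supplied CSV SerDe would use a different class name) and made-up column names and HDFS location:

create external table serdetest (col1 string, col2 string, col3 string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
stored as textfile
location '/user/hive/serdedata';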
 
We can now run a basic select query on the above table.

Friday, April 13, 2018

HIVE - View

A VIEW in HIVE is similar to what we have in RDBMS systems; the concept of a VIEW has been incorporated from the RDBMS world. It allows us to store a query and treat it like a table, but it does not store the data.

Some of the basic functions of a VIEW are :-

1.) To reduce query complexity.
2.) To restrict the data on the basis of some condition.
3.) Querying a view is just like querying a table.
4.) If a query references a view, the definition of the view is combined with the rest of the query by Hive's query planner.
5.) A query using the view can fail if the referenced table/column does not exist.

CREATING A VIEW

To create a view, we need an underlying table on which to base our query.
Let us take a simple example with an employees table and build our query on this table.
We have the below data :-

 
We want to have only those ids whose salary is more than 50000.



Syntax :-
create view view_name as select column_name from referenced_table;
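
For our example, assuming the employees table has id and salary columns, the view could be created as:

create view viewemp as select id from employees where salary > 50000;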
  

select * from viewemp will then return only those ids.

 

The view is available in the database just like a normal table; we will see the view name in the output of show tables.

We should note that we can also join two views, just like tables.

Dropping a View :-

drop view viewname;

 

Monday, April 9, 2018

HIVE - Partitioned Tables

Partitioning in Hive is mainly done to organize the data and balance the load horizontally. Normally, when we fire a query, it scans all the data to produce the result. This process is quite cumbersome, involves a lot of map-reduce work, and the job usually takes too long to execute.
In order to reduce this effort, we can partition the table based on some column.

Below is the query to create a table partitioned on the year column :
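
A minimal sketch of such a statement, assuming a single data column col1 to match the staging and final tables used later in this post:

create table hivepart(col1 int)
partitioned by (year string)
row format delimited
fields terminated by ','
lines terminated by '\n';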

The syntax for loading the data into a partitioned table is a bit different from the normal table syntax which we used in our earlier tutorials.
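
A minimal sketch of such a load statement, with a placeholder path; the partition value matches the query used later in this post:

load data local inpath '/path/to/data_2017.csv' into table hivepart partition (year='2017');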
 

Once the table is created and the data is loaded, we will find a sub-directory created under the /user/hive/warehouse location.

Each sub-directory corresponds to a particular partition.

If we fire a query like select * from hivepart where year=2017,
it searches for the data only in the sub-directory that belongs to year=2017. This avoids a lot of map-reduce work and is an effective way of improving performance.

STRICT MODE :-
When HIVE runs in STRICT mode, it does not allow us to fire a query on a partitioned table without a WHERE clause on the partition column.

We can change the mode by setting it to "nonstrict" :-

set hive.mapred.mode=nonstrict;



If we fire such a query while in strict mode, we will get a semantic error.

How to check the partitions in the table :-

show partitions hivepart;

Dynamic Partitioning :- Dynamic partitioning eliminates the need to hard-code the partition value while loading the data.
The data has to come from a staging table; we cannot load it into the dynamically partitioned table directly from a file or directory.
First, we will create a staging table.

create table hivestag1(col1 int, year string)
row format delimited
fields terminated by ','
lines terminated by '\n';


Then, we need to create the final table.

create table hivefin(col1 int)
partitioned by (year string)
row format delimited
fields terminated by ','
lines terminated by '\n';

Finally, we need to load the data from the staging table into the final table.


insert overwrite table hivefin
partition(year)
select * from hivestag1;

The partitions created for the hivefin table can then be checked with :-

show partitions hivefin;

We need to check the below properties :-

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=<maximum number of dynamic partitions in total>;
set hive.exec.max.dynamic.partitions.pernode=<maximum number of dynamic partitions per node>;

Sunday, April 1, 2018

HIVE - Managed Table vs External Table

Managed tables and external tables are among the main concepts in HIVE and need to be understood carefully. Normally, the tables which we create are managed tables.
So, let us understand how managed tables and external tables behave internally.

Managed Tables :- For a managed table, the data is moved from its original location into the HIVE warehouse directory (/user/hive/warehouse/ by default; the location can be the default one or user-provided). The required directories and sub-directories are also created.
If we drop the table, both the data and the metadata get removed.
HIVE controls the complete life cycle of the table.

External Tables :- For an external table, the data is not moved from its HDFS location into the warehouse directory; it means HIVE does not own the data.
The metadata for this table is still recorded in the metastore. If the table is dropped, the data does not get deleted; only the metadata gets deleted.
During the creation of the table, we need to provide the EXTERNAL keyword.

When to use External Tables :- There are some scenarios where we need to use EXTERNAL tables. For learning purposes, a managed table is fine, but when we are working on a production cluster, it is recommended that we go for external tables:
1.) When we are using some other tool to work on the same piece of data.
2.) When we have multiple views or tables on the same data set.
3.) To query an external dataset present in an external system like Amazon S3.

Syntax for creating an external table.
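
A minimal sketch, with placeholder table, column and location names:

create external table external_table_name (col1 int, col2 string)
row format delimited
fields terminated by ','
location '/path/in/hdfs';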

 
Syntax for creating a normal table.
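
A minimal sketch, again with placeholder names:

create table managed_table_name (col1 int, col2 string)
row format delimited
fields terminated by ',';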

 

Hadoop - What is a Job in Hadoop ?

In the field of computer science, a job just means a piece of a program, and the same rule applies to the Hadoop ecosystem as wel...