Friday, March 30, 2018

HIVE - Loading CSV Files

To load a CSV file into a HIVE table, we need to tell the HIVE engine how the data is delimited so that it can parse each row.

Query to create the table:-
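One way to write it, assuming an illustrative schema of an id and a name column for the hivecsv table, is:

```sql
-- Tell Hive that fields in each row are separated by commas
CREATE TABLE hivecsv (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```

The ROW FORMAT DELIMITED clause is what tells Hive how the CSV data is structured when it reads the files.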

 

We will load the data from the local filesystem into the hivecsv table.
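A minimal sketch of the load command, assuming the CSV file sits at an illustrative local path:

```sql
-- LOCAL tells Hive to read from the local filesystem rather than HDFS
LOAD DATA LOCAL INPATH '/home/user/data.csv' INTO TABLE hivecsv;
```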

 
The output will be something like this:-

HIVE - Loading Data Into Table




Once we have created the database and the table, we need to load some data into it. We will create a simple table and load some data into it.

CREATE TABLE EMPLOYEE (COL1 INT);





The table has been created with the name employee, having a single column COL1.
We can check the details of the table with the help of the DESCRIBE command.
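For example:

```sql
-- Lists the column names and their types for the employee table
DESCRIBE employee;
```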



We can load data either from the local filesystem or from the Hadoop filesystem (HDFS).
We will see both examples separately. First, we load the data from the local filesystem.

Data Loading From Local Filesystem 

Load data local inpath 'sangam/test.txt' into table employee;





Similarly, we can load the data from HDFS using the below command.

Data Loading From HDFS 

Load data inpath 'hivetest/test.txt' into table employee;





We should note that when we are loading from the local filesystem, we need to use the LOCAL keyword. We can verify the data using the below command.
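For example:

```sql
-- Returns all rows currently in the employee table
SELECT * FROM employee;
```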





 
Since we have loaded the data twice, once from the local filesystem and once from HDFS, we get duplicate rows in the table.

Data Loading through another HIVE table
We can load from another table using the INSERT and SELECT commands, as we do in SQL-like databases.


Insert into employeenew select * from employee;
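Note that the target table employeenew must already exist with a compatible schema before the INSERT will work; a minimal sketch:

```sql
-- The target table must exist before the INSERT ... SELECT
CREATE TABLE employeenew (COL1 INT);

-- Copy every row from employee into employeenew
INSERT INTO employeenew SELECT * FROM employee;
```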

 




Wednesday, March 28, 2018

Hive - Database and Table Creation

The traditional database works on the principle that it controls the complete data storage system.
It checks whether the data being written follows the constraints, datatypes, lengths, etc.
This property is called SCHEMA ON WRITE.

HIVE does not follow any of this, as it does not have its own storage system and relies on HDFS for its data storage.
HIVE can read any data kept in HDFS, whether that data was created, updated, or even damaged.
This property is called SCHEMA ON READ.

So, let us start and check the different databases available in our cluster.
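The command for this is:

```sql
-- Lists every database registered in the Hive metastore
SHOW DATABASES;
```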
 
It will display all the databases available in the HIVE.


We should note that HIVE contains a default database; if we do not specify a database name, the default database will be used.

Normally, when we are working with a large data set and a number of databases, we forget which DB we are working in. We can set a property to display the current DB using the below command.
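The standard Hive CLI property for this is:

```sql
-- Makes the prompt show the current database, e.g. hive (dbname)>
SET hive.cli.print.current.db=true;
```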

 

Creating a DB :- We can create our own DB using a simple command.
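A minimal sketch, using an illustrative database name:

```sql
-- IF NOT EXISTS avoids an error when the database is already present
CREATE DATABASE IF NOT EXISTS sangam_db;
```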

 

Using a particular DB :- We can switch to a particular DB using the USE keyword followed by the database name.
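For example, with the same illustrative database name:

```sql
-- All subsequent unqualified table names resolve inside sangam_db
USE sangam_db;
```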

 

Whenever we create a database in HIVE, a directory is created and the tables are stored in subdirectories. The exception is the default database.

The default location is set by the hive.metastore.warehouse.dir property.
We can check the directory for the default DB in the below location:-
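From within Hive, the location can be inspected with:

```sql
-- Shows the database name, comment, and its HDFS location
-- (typically /user/hive/warehouse under the default configuration)
DESCRIBE DATABASE default;
```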
 
 

We can switch the database directory at the time of database creation.
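A minimal sketch, with an illustrative HDFS path:

```sql
-- LOCATION overrides the default warehouse directory for this database
CREATE DATABASE mydb LOCATION '/user/hive/mydb.db';
```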

Dropping a database :- We can drop a database using "drop database database_name;"
However, it will throw an error if the database contains tables.
In order to override this behavior, we can use
Drop database database_name cascade;

Using the CASCADE keyword will drop all the tables, then the DB, and finally the directory associated with it.

Tuesday, March 27, 2018

HIVE An Introduction

Hive was created at Facebook in August 2007 and later made open source in 2008. The main idea behind the creation of HIVE was to provide a SQL-like flavor for Hadoop.

The problem was that Hadoop MapReduce requires a lot of code for even simple programs and lacks the expressiveness of SQL.

HIVE has a SQL-like dialect called HQL (Hive Query Language). It has made things very easy, as anyone with knowledge of SQL can easily work with it.

Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

However, HIVE is not a full database, as it lacks some basic properties of a database.

1.) Record-level updates are not possible in HIVE.
2.) HIVE does not provide transactions.
3.) Even small data sets incur a large latency.


HIVE consists mainly of three parts:-


1.) It contains multiple JAR files, each having some different functionality. They are normally available in the $HIVE_HOME/lib directory.
2.) The second part consists of executable scripts present in the $HIVE_HOME/bin directory. The CLI (Command Line Interface) is invoked with the help of these scripts.
3.) HIVE also has a Thrift service to access its services via ODBC/JDBC drivers. It is normally used by reporting solutions like Tableau and QlikView.

HIVE Architecture:-




Apart from this, HIVE also has a metastore, which by default is a built-in DERBY database. It is used to store table schemas and other metadata.

The DERBY database is normally used for learning purposes, and with it we cannot run two instances of the HIVE CLI, as Derby is a single-process store.


Starting with HIVE:-

Just type hive at the prompt and a Hive session will open with the hive> prompt; for multi-line statements, a secondary prompt appears as >.

Hive Prompt:-
 

Secondary Prompt:-
 

A Simple Query in HIVE :-
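A minimal example of such a session, using illustrative table and column names:

```sql
-- Create a table, query it while empty, then remove it
CREATE TABLE demo (x INT);
SELECT * FROM demo;   -- returns no rows, since nothing was loaded
DROP TABLE demo;
```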



In the above example, we created a table and tried to view the data, but there was no data in it. Finally, we dropped the table.

We should note that whenever our query is correct, OK will appear, followed by the query result.



Hadoop - What is a Job in Hadoop ?

In the field of computer science , a job just means a piece of program and the same rule applies to the Hadoop ecosystem as wel...