Monday, March 18, 2019

NoSQL - Introduction

NoSQL stands for 'Not Only SQL'. NoSQL is a different dimension in databases, different from the traditional database system, and is used to hold large datasets. This kind of database includes a wide variety of formats like key-value pair, document, columnar, and graph. The idea of NoSQL databases came up when we started facing issues with the traditional database system. Both systems have their own advantages and disadvantages.


SQL and NoSQL can be compared across various categories, but one of the primary points of differentiation is the schema.
A traditional database is based on a predefined schema, while NoSQL is free from any schema; we can say that NoSQL is a schema-less database.

NoSQL databases are built to allow the insertion of data without a predefined schema. That makes it easy to make significant application changes in real-time, without worrying about service interruptions – which means development is faster, code integration is more reliable, and less database administrator time is needed.
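The schema-less insertion idea can be sketched with plain Python dicts (an illustration of the concept only, not a real NoSQL client):

```python
# A minimal sketch of schema-less insertion: each record can carry a
# different set of fields, and no migration is needed to add a new one.
store = []

store.append({"id": 1, "name": "Ram"})                       # two fields
store.append({"id": 2, "name": "Kyle", "email": "k@x.com"})  # a new field appears
store.append({"id": 3, "tags": ["hadoop", "nosql"]})         # a different shape entirely

# Queries simply skip records that lack the field being asked for.
names = [doc["name"] for doc in store if "name" in doc]
print(names)  # ['Ram', 'Kyle']
```

In a relational database the second and third inserts would each require an ALTER TABLE first; here they are just accepted.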

Different kinds of NoSQL databases :-

Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name, or key, together with its value. Examples of key-value stores are Riak and Voldemort. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality.
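As a toy illustration of this idea (not the API of Riak, Voldemort, or Redis), a Python dict already behaves like a minimal in-memory key-value store:

```python
# A toy in-memory key-value store; real stores such as Redis or Riak add
# persistence, replication, and networking on top of this basic shape.
kv = {}

def put(key, value):
    kv[key] = value

def get(key, default=None):
    return kv.get(key, default)

put("user:42:name", "Ram")
put("user:42:visits", 7)   # values may be typed, e.g. an integer counter

# Read-modify-write, like incrementing a counter.
put("user:42:visits", get("user:42:visits") + 1)
print(get("user:42:visits"))  # 8
```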

Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.

Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.

Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4J and HyperGraphDB.
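The network idea behind graph stores can be sketched with a simple adjacency list in Python (an illustration only; Neo4J and HyperGraphDB provide far richer storage and query capabilities on top of this):

```python
# A tiny adjacency-list sketch of a social "follows" graph.
follows = {
    "alice": {"bob"},
    "bob": {"carol"},
    "carol": set(),
}

def friends_of_friends(user):
    """People reachable in exactly two hops, excluding direct follows."""
    direct = follows.get(user, set())
    two_hop = set()
    for friend in direct:
        two_hop |= follows.get(friend, set())
    return two_hop - direct - {user}

print(friends_of_friends("alice"))  # {'carol'}
```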

Where can we use NoSQL databases ?

NoSQL databases are very helpful in many ways, but we need to figure out where we can use them.

  • When the main priority is speed and we can compromise on the consistency of the data.
  • When we have a large dataset and changing the schema would be a cumbersome job.
  • When we have unstructured data, which is one of the toughest things to handle in a traditional relational database.

Limitations of NoSQL :

  • There is no such thing as a "free lunch". NoSQL databases come with some limitations. They are not useful where we need a lot of consistency and reliability in the data; such safeguards are not provided by NoSQL databases.
  • Such systems are relatively new and still need some time to build confidence at the user level.



Friday, March 8, 2019

Oozie - Workflow


An Oozie workflow is a sequence of actions in the form of a DAG (directed acyclic graph). It is a combination of action nodes and control nodes arranged as a directed acyclic graph. Normally an action is a Hadoop job (like a Pig, Hive, or MapReduce job), but there are some scenarios where the jobs are not Hadoop ones (like a shell script, an email notification, etc.).


An action does not start until the previous action completes. The start and end control nodes mark the start and end of the workflow. The fork and join control nodes allow actions to execute in parallel. The decision control node is like a switch-case statement. Oozie workflows can also be parameterized; these parameters come from a configuration file called a parameter file.
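The control nodes described above appear in a workflow.xml roughly like the following (a hedged sketch; the workflow name, action name, script, and property names here are made up for illustration, while `${jobTracker}` and `${nameNode}` stand for values supplied from the parameter file):

```xml
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="check-input"/>

    <!-- decision node: behaves like a switch-case -->
    <decision name="check-input">
        <switch>
            <case to="pig-cleanup">${inputAvailable eq "true"}</case>
            <default to="end"/>
        </switch>
    </decision>

    <!-- action node: here a Pig job -->
    <action name="pig-cleanup">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>cleanup.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```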

We will create a sample Oozie workflow and check how Oozie works in our next blog.



Wednesday, March 6, 2019

Oozie - Introduction

Oozie is designed to run multi-stage Hadoop jobs as a single job. It is a scheduler system to run and manage Hadoop jobs in a distributed environment. It was developed by Yahoo and later open-sourced to Apache in 2010-11. Oozie uses the concept of a directed acyclic graph (DAG) to coordinate multi-stage jobs. The output of the current action is used to run the next job.

Why do we need Oozie ?


When we are working on the Hadoop framework with a large data set and multiple data transformations, it is not easy to handle all these things in a single job. We need multiple processes like MapReduce, Hive, Pig, etc., and Oozie has the capability to handle all these processes in a single place.

How the name “Oozie” came into the picture ?


The team of engineers who built Oozie wanted a name somehow related to the elephant, as Hadoop was named after a toy elephant. The engineers were looking for a name for someone who controls the elephant. The Indian word for an elephant keeper is "mahout", which was already taken by Apache Mahout. They then found that the Burmese word for an elephant keeper is "Oozie".


Where does Oozie sit in the Hadoop framework ?


What are the common types of Jobs in Oozie ?

Some of the common job types in Oozie are listed below :-

    • Oozie jobs running on demand are called workflow jobs.
    • Oozie jobs running periodically are called coordinator jobs.
    • A bundle job is a collection of coordinator jobs managed as a single job.

We will learn about all these jobs in detail in our next blog post.

Sunday, March 3, 2019

Python – Pickle

Python's pickle module is used for serializing and deserializing Python objects.
The pickle module consists of two major operations :-

Pickling ---------> conversion of a Python object into a byte stream
Unpickling -----> reverse of pickling ; conversion of a byte stream back into a Python object


Once the object is converted into a byte stream, the pickle file has all the necessary information to reinstate the earlier state.

When to use pickle ?
1.) When we need the data to be persisted on disk so that it can be used at a later stage of the program.

2.) When we need to send the data over a network protocol like a TCP connection.

3.) Pickling is normally used in machine learning, where we need the same trained model for prediction at a later stage.


Limitation of pickle :- The major disadvantage of the pickle module is that it is programming-language dependent; it works only with Python.

Difference between pickle and JSON :- The pickle module is often compared to JSON, but there are significant differences between the two.

  • JSON is human-readable while pickle is not.
  • JSON files can work across different platforms and are compatible with different programming languages, while pickle is Python-specific.
  • JSON is a text serialization format while pickle is a binary serialization format.
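The text-versus-binary difference can be seen directly in Python (a small illustration using the standard json and pickle modules):

```python
import json
import pickle

data = {"name": "Ram", "visits": 7}

as_json = json.dumps(data)      # human-readable text
as_pickle = pickle.dumps(data)  # opaque binary bytes

print(as_json)          # {"name": "Ram", "visits": 7}
print(type(as_pickle))  # <class 'bytes'>

# Both round-trip back to the same dict.
assert json.loads(as_json) == pickle.loads(as_pickle) == data
```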

Code Snippet :-

"""
Created on Sat Mar 2 23:11:30 2019

@author: sangam
"""
import pickle

data_to_be_pickle = ['tad', 'five', 'kyle', 'Ram']

dump_filename = 'file_pickle'

# Pickling: write the list out as a byte stream.
outfile = open(dump_filename, 'wb')
pickle.dump(data_to_be_pickle, outfile)  # note: pickle.dump() returns None
outfile.close()

# Unpickling: read the byte stream back into a Python object.
infile = open(dump_filename, 'rb')
data_to_be_unpickle = pickle.load(infile)
infile.close()

print(data_to_be_unpickle)

Hadoop - What is a Job in Hadoop ?

In the field of computer science , a job just means a piece of program and the same rule applies to the Hadoop ecosystem as wel...