Friday, September 7, 2018

Sqoop - Introduction

Apache Sqoop is an open-source tool that can be used to extract data from structured data stores into Hadoop for further processing. Sqoop is one of the tools that provides connectivity between Hadoop and traditional relational databases. The name Sqoop is derived from SQL and Hadoop.

Sqoop can be used to import data from different relational databases such as Oracle, MySQL, and Netezza into the Hadoop Distributed File System (HDFS). We can then perform complex processing in Hadoop and export the results back into the RDBMS.



Why do we need an integration between Hadoop and an RDBMS?

  • Hadoop is mainly used to process unstructured or semi-structured data such as web server logs, typically with MapReduce. However, our reference or master data, such as products, customers, locations, and server information, is often stored in a relational database. So we need to bring that reference or master data into Hadoop to perform more meaningful analysis.
  • Suppose we need to decide whether loan applications should be approved, and this scoring is CPU intensive and slow. In such cases, instead of performing this processing on our RDBMS, we can offload it to Hadoop and then load the results back into the RDBMS. In other words, we can replace ETL on the RDBMS with ETL on Hadoop.
  • We can also use Hadoop as cheap storage, most commonly for archived or historical data. In such cases, data is exported from the RDBMS into HDFS.
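Moving results (or archived data) from Hadoop back into the database is done with Sqoop's export tool. A minimal sketch is shown below; the JDBC URL, credentials, table name, and HDFS directory are all hypothetical placeholders, not values from this article:

```shell
# Sketch: push HDFS files back into an RDBMS table.
# The target table (loan_scores) must already exist in the database.
sqoop export \
  --connect jdbc:mysql://dbserver:3306/sales \
  --username dbuser \
  --password-file /user/dbuser/.db.password \
  --table loan_scores \
  --export-dir /data/results/loan_scores \
  --input-fields-terminated-by ','
```

Sqoop reads the delimited files under --export-dir and generates batched INSERT statements against the target table.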
With Sqoop, we can import data from a relational database system into HDFS. The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table. The import process is performed in parallel. For this reason, the output will be in multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.
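The import described above can be sketched as a single command. The connection string, credentials, table, and target directory here are hypothetical placeholders:

```shell
# Sketch: import one table from MySQL into HDFS.
# --num-mappers 4 runs four parallel map tasks, so the
# output directory will contain four delimited text files.
sqoop import \
  --connect jdbc:mysql://dbserver:3306/sales \
  --username dbuser \
  --password-file /user/dbuser/.db.password \
  --table customers \
  --target-dir /data/sales/customers \
  --num-mappers 4 \
  --fields-terminated-by ','
```

The --num-mappers flag controls the degree of parallelism mentioned above, which is why the import produces multiple output files rather than one.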


