Data is Future: HIVE -Bucket

Wednesday, April 18, 2018

HIVE -Bucket

Though Partitioning improves the query performance to a large extent .but there are scenarios where partitioning does not serve the purpose completely.

Let us take a real time example.Suppose our user requires the data based upon the type of transaction like(electronic , paper ).we have partitioned the table on the basis of Type_of_Transaction column .In such scenarios ,the majority of data is present for the transaction type =electronic.Let us take 70 percent of data belongs to electronic transaction.

Again ,we will face the same issue.
The main idea behind bucketing is to handle such kind of scenarios.
Bucketing divides our partitioned files into further more sub directories and improves the query performance to a greater extent.

The number of bucket is determined by the HASH algorithm working behind it.

Cluster By:

CREATE TABLE students (name VARCHAR(64), age INT, CGPA DECIMAL(3, 2))
CLUSTERED BY (age) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES("transactional"="true");

we will load some sample data into these table and try to see how bucketing impact at the directory level.
we can check the /hive/user/warehouse/hivetest.db/students directory to see the creation of sub directories.

we can see the creation of two buckets.

Before implementing the bucketing property ,we need to check whether below properties are set or not:-
set hive.support.concurrency =true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=2;

Data is Future

Wednesday, April 18, 2018

HIVE -Bucket

No comments:

Post a Comment

Delta Lake - Time Travel