Data is Future: HBase

Tuesday, May 1, 2018

HBase - Data Model

We went through the introduction to HBase in our last tutorial. Now we need to understand how the HBase data model works. HBase stores data in tables made up of rows and columns. This looks similar to a relational database, but there are some serious differences in the structure. We will go through the differences one by one.

Every row in an HBase table is indexed by a row key. Row keys are unique and are kept sorted in lexicographic (byte) order. Because of this, designing the row key is a very important part of designing the schema. For every row key, we can store a practically unlimited number of columns. New columns can even be added at run time, and columns are grouped into column families.
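To make the row-key ordering concrete, here is a minimal sketch in plain Python. It is not the HBase client API: the `table` dictionary, the `put` helper and the sample data are all hypothetical, and the only assumption carried over from HBase is that row keys are compared byte by byte.

```python
# Toy model of an HBase table: row key -> {"family:qualifier": value}.
# Plain Python for illustration only; this is not the HBase client API.
table = {}

def put(row_key: bytes, column: str, value: bytes) -> None:
    """Store a value; a new column can be introduced by any write."""
    table.setdefault(row_key, {})[column] = value

put(b"row2", "info:name", b"Alice")
put(b"row10", "info:name", b"Bob")
put(b"row1", "info:name", b"Carol")
put(b"row1", "info:email", b"carol@example.com")  # new column added at run time

# Row keys are unique (dict keys) and are scanned in byte-wise sorted order.
scan_order = sorted(table)
print(scan_order)  # [b'row1', b'row10', b'row2']
```

Note that `row10` sorts before `row2` because the comparison is byte by byte, which is one reason zero-padded or fixed-width row keys are a common choice.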

Columns are stored in column families, which create a clear separation inside the table. In a relational database, by contrast, the separation is created by the column itself. The design of the column families is paramount during schema creation, as it affects the performance of the table. We should note that column families must be declared up front when the table is created. Unlike columns, they cannot be introduced on the fly by a write, and changing them later means altering the table schema.

The data in a column family can be sparsely populated: it is not necessary for every column in a column family to contain data for a given row. We view the data in the form of rows and columns, but in fact each cell is stored as an individual entity carrying all the information needed to identify it.
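The idea of a self-describing cell can be sketched as follows. The `Cell` type, its field names and the sample rows are assumptions made for this illustration and do not correspond to HBase's own classes.

```python
from typing import List, NamedTuple

class Cell(NamedTuple):
    """One stored cell; field names are illustrative, not HBase's classes."""
    row: bytes
    family: str
    qualifier: str
    timestamp: int  # milliseconds since the epoch
    value: bytes

# Two rows in family "info"; row2 simply has no "email" cell (sparse data).
cells = [
    Cell(b"row1", "info", "name", 1525132800000, b"Carol"),
    Cell(b"row1", "info", "email", 1525132800000, b"carol@example.com"),
    Cell(b"row2", "info", "name", 1525132800000, b"Alice"),
]

def columns_of(row: bytes) -> List[str]:
    """Return the qualifiers actually present for a row, sorted by name."""
    return sorted(c.qualifier for c in cells if c.row == row)

print(columns_of(b"row1"))  # ['email', 'name']
print(columns_of(b"row2"))  # ['name']
```

A missing column costs nothing here: there is no placeholder for the absent `email` cell of `row2`, only the cells that were actually written.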

In physical storage, the cells of one column family are stored together in store files, and the cells of another column family are stored in separate store files. The table below shows the columns of Column Family 1 with all the related information for each cell.

Similarly, the cells of Column Family 2 are stored in another store file.
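The separation by column family can be pictured with a small sketch. The family names `cf1` and `cf2`, the tuple layout and the sample data are made up, and a Python list per family only stands in for that family's store files.

```python
from collections import defaultdict

# (row, family, qualifier, timestamp, value) tuples; sample data is invented.
cells = [
    (b"row1", "cf1", "name", 3, b"Carol"),
    (b"row1", "cf2", "city", 3, b"Pune"),
    (b"row2", "cf1", "name", 5, b"Alice"),
]

# One bucket per column family stands in for that family's store files.
store_files = defaultdict(list)
for cell in cells:
    family = cell[1]
    store_files[family].append(cell)

for family, stored in sorted(store_files.items()):
    print(family, [c[0] for c in stored])
```

Reading only `cf1` never touches the bucket for `cf2`, which is why the grouping of columns into families matters for performance.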

In addition, multiple versions of the same cell are stored as separate, consecutive cells, each with the timestamp of when the cell was stored. The cells are sorted in descending order by that timestamp, so that a reader of the data sees the newest value first.

A timestamp is always written alongside each cell, and it identifies each version of a value. By default, it signifies the time at which the data was written on the server.
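The version ordering described above can be sketched in a few lines of Python. The tuple layout, the `get_latest` helper and the sample values are hypothetical; the sketch only shows the sort order, not how HBase implements it.

```python
# (row, column, timestamp, value); sample values are invented for illustration.
cells = [
    (b"row1", "info:name", 100, b"Carol"),
    (b"row1", "info:name", 300, b"Caroline"),
    (b"row1", "info:name", 200, b"Carrie"),
]

# Sort by row and column ascending, then by timestamp descending.
ordered = sorted(cells, key=lambda c: (c[0], c[1], -c[2]))

def get_latest(row: bytes, column: str):
    """Return the first (newest) matching value, or None if absent."""
    for r, col, _ts, value in ordered:
        if r == row and col == column:
            return value
    return None

print([ts for _, _, ts, _ in ordered])   # [300, 200, 100]
print(get_latest(b"row1", "info:name"))  # b'Caroline'
```

Because the newest version comes first, a reader that wants only the current value can stop at the first matching cell.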

A few things we should note about data storage in HBase:

Data once written is not changed in place; a new write is stored as a new version, and reads return the latest version by default.
We can configure the maximum number of versions that are kept for a particular value.
Once the maximum number of versions is exceeded, the oldest versions of the value become eligible for deletion.
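The three points above can be illustrated with a simplified sketch. `MAX_VERSIONS`, `put` and `compact` are hypothetical names, and the sketch assumes that old versions are pruned in a separate cleanup step; the real system is more involved.

```python
MAX_VERSIONS = 3  # hypothetical setting for this sketch

versions = []  # list of (timestamp, value) for one cell, newest first

def put(timestamp: int, value: bytes) -> None:
    """Add a new version instead of overwriting the old one."""
    versions.append((timestamp, value))
    versions.sort(key=lambda v: v[0], reverse=True)

def compact() -> None:
    """Drop everything beyond the newest MAX_VERSIONS entries."""
    del versions[MAX_VERSIONS:]

for ts, val in [(1, b"a"), (2, b"b"), (3, b"c"), (4, b"d")]:
    put(ts, val)

print(len(versions))  # 4: in this sketch nothing is removed until compact()
compact()
print(versions)       # [(4, b'd'), (3, b'c'), (2, b'b')]
```

The oldest version (timestamp 1) is the one that disappears, while the three newest remain readable.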

Further Reading: https://hbase.apache.org/book.html#getting_started
