Nagesh Singh Chauhan
Apache HBase: A head start for beginners
Let's understand what a NoSQL database is; in this blog we will focus on Apache HBase.

If you’re running a web site with a large number of customers, you’ve experienced the dreaded Big Data Performance Paradox.
Just when you need your site to respond quickly to a successful marketing campaign, it slows down.
Sites like Facebook, Twitter, and others have wrestled with this problem for years as they’ve grown from thousands to millions and now hundreds of millions of users.
Inundated by huge amounts of user data, they took advantage of data store technologies like Memcached and Redis to make their sites run fast. But for sites without the engineering resources of companies like Facebook, adopting these technologies has been challenging.
In this blog, I'm going to talk about Apache HBase, a NoSQL database. To understand it, we first need to understand what exactly NoSQL is, so I have divided this blog into two parts: in the first part I'll talk about NoSQL, and in the second I'll talk about HBase.
What is NoSQL and why is it needed?
NoSQL stands for “Not Only SQL” or “non-SQL”. Carlo Strozzi introduced the NoSQL concept in 1998.

Credits: https://www.getfilecloud.com/blog/2014/08/leading-nosql-databases-to-consider/#.XEgQpc8zaRs
NoSQL is basically a database used to manage huge sets of unstructured data, where the data is not stored in tabular relations as in an RDBMS. Most existing relational databases have struggled to solve some of the complex modern problems:
First, the data size has increased tremendously, into the range of petabytes (one petabyte = 1,024 terabytes). An RDBMS finds it challenging to handle such huge data volumes; the traditional remedy is to scale up vertically by adding more central processing units (CPUs) or more memory to the database management system.
Second, the majority of data now arrives in a semi-structured or unstructured format from social media, audio, video, texts, and emails. Unstructured data is largely outside the purview of an RDBMS, because relational databases simply can't categorize it: they are designed and structured to accommodate structured data such as weblogs, sensor readings, and financial records.
Third, changing the table structure and adding/removing columns leaves a lot of NULL values, which are of course expensive.
Fourth, a lot of resources are spent on referential integrity and normalization.
Fifth, “big data” is generated at a very high velocity. An RDBMS struggles with high velocity because it is designed for steady data retention rather than rapid growth; even if an RDBMS is used to handle and store “big data,” it turns out to be very expensive.
As a result, the inability of relational databases to handle “big data” led to the emergence of new technologies like HDFS and HBase.
NoSQL Database Types
Document databases: In this type, a key is paired with a complex data structure called a document. Example: MongoDB
Graph stores: Used to store networked data, where the relationships between entities matter as much as the entities themselves. Example: Neo4j
Key-value stores: These are the simplest NoSQL databases. Each item is stored as a value identified by a key. Some key-value databases, such as Redis, can also store the type of the saved value.
Wide-column stores: Used to store large data sets by storing columns of data together. Examples: Cassandra (used at Facebook), HBase.
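As a toy illustration of the simplest model above, a key-value store boils down to a map from keys to opaque values. The sketch below is illustrative only (the class and method names are made up, not a real client API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Minimal in-memory key-value store: the simplest NoSQL model.
// Keys identify opaque byte values; the store imposes no schema on them.
public class ToyKeyValueStore {
    private final Map<String, byte[]> data = new HashMap<>();

    public void put(String key, byte[] value) {
        data.put(key, value);
    }

    public Optional<byte[]> get(String key) {
        return Optional.ofNullable(data.get(key));
    }

    public static void main(String[] args) {
        ToyKeyValueStore store = new ToyKeyValueStore();
        store.put("user:42", "{\"name\":\"Alice\"}".getBytes());
        System.out.println(new String(store.get("user:42").orElseThrow()));
    }
}
```

Because the value is just bytes, the store imposes no structure at all, which is exactly the flexibility these systems trade for the relational model's guarantees.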
CAP Theorem
As per Wikipedia, the CAP theorem also named Brewer’s theorem after computer scientist Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
Consistency: Every read receives the most recent write or an error
Availability: Every request receives a (non-error) response — without the guarantee that it contains the most recent write
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
In particular, the CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability. Note that consistency, as defined in the CAP theorem, is quite different from the consistency guaranteed in ACID database transactions.
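To make that trade-off concrete, here is a deliberately simplified two-replica sketch (all names are hypothetical, not any real system's API): during a partition a write only reaches one replica, and a read from the other must either fail (the consistency choice) or return possibly stale data (the availability choice).

```java
// Toy illustration of the CAP trade-off during a network partition.
// Two replicas; while partitioned, writes cannot reach replica B.
public class CapDemo {
    static String replicaA = "v1", replicaB = "v1";
    static boolean partitioned = false;

    static void write(String value) {
        replicaA = value;
        if (!partitioned) replicaB = value; // replication blocked by partition
    }

    // CP choice: refuse to answer from a replica that may be stale.
    static String readCP() {
        if (partitioned) throw new IllegalStateException("unavailable");
        return replicaB;
    }

    // AP choice: always answer, even if the value may be stale.
    static String readAP() {
        return replicaB;
    }

    public static void main(String[] args) {
        partitioned = true;
        write("v2");
        System.out.println(readAP()); // prints v1: available, but stale
    }
}
```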
Advantages of NoSQL Databases
Dynamic Schemas: In relational databases like Oracle, MySQL, and MS SQL Server, we define table structures up front. For example, to save student records we first create a table named Student and add columns to it, like student_id and student_name. This is called a pre-defined schema: we fix the structure before saving any data.
In the future, if we want some more related columns in our Student table, we have to add a new column to the table. That is easy if the table holds little data, but what if we have millions of records? Migrating to the updated schema can cause a lot of pain. NoSQL databases solve this problem, as in a NoSQL database no schema definition is required.
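The contrast can be sketched with schemaless records: if each record is just a map of fields, a new attribute can appear on one record without migrating all the others. This is an illustration of the idea only (the class and field names are made up):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Schemaless records: each "row" is just a map of field -> value, so a
// new field can appear on one record without migrating all the others.
public class SchemalessDemo {
    static List<Map<String, Object>> sampleStudents() {
        List<Map<String, Object>> students = new ArrayList<>();

        Map<String, Object> s1 = new HashMap<>();
        s1.put("student_id", 1);
        s1.put("student_name", "Alice");
        students.add(s1);

        // A later record gains a new field; no ALTER TABLE, no NULLs elsewhere.
        Map<String, Object> s2 = new HashMap<>();
        s2.put("student_id", 2);
        s2.put("student_name", "Bob");
        s2.put("email", "bob@example.com");
        students.add(s2);
        return students;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> students = sampleStudents();
        System.out.println(students.get(0).containsKey("email")); // false
        System.out.println(students.get(1).get("email"));         // bob@example.com
    }
}
```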
Sharding: In sharding, a large database is partitioned into smaller, faster, and more easily manageable databases.
Relational databases follow a vertical architecture in which a single server holds the data, since all the data is related. Relational databases do not provide sharding by default; achieving it takes a lot of effort, because transactional integrity (inserting/updating data in transactions), multi-table JOINs, and the like cannot easily be achieved in a distributed architecture.
NoSQL databases support sharding by default, with no additional effort required. They automatically spread the data across servers and fetch it quickly from the least-loaded server, while maintaining the integrity of the data.
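One common sharding scheme (used by many key-value stores; note that HBase itself shards by sorted row-key ranges rather than by hash) assigns each key to a shard by hashing it:

```java
// Toy hash sharding: a key's shard is derived from its hash, so data
// (and load) spreads across servers without a central lookup table.
// HBase itself shards differently: by contiguous, sorted row-key ranges.
public class ShardDemo {
    public static int shardFor(String key, int shardCount) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        int shards = 4;
        for (String key : new String[] {"user:1", "user:2", "user:3"}) {
            System.out.println(key + " -> shard " + shardFor(key, shards));
        }
    }
}
```

The same key always hashes to the same shard, so any node can route a request without consulting a coordinator.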
Replication: Automatic data replication is also supported by default in NoSQL databases. Hence, if one DB server goes down, data is recovered from a copy kept on another server in the same network.
Integrated Caching: Many NoSQL databases support integrated caching, wherein frequently requested data is kept in a cache to make queries faster.
Now that we know the advantages and features a NoSQL database offers, let's dive into Apache HBase.
Apache HBase

HBase is an open-source, distributed, versioned, non-relational, column-oriented database built on top of HDFS. All data is stored in the form of key-value pairs.
A column-oriented database is well suited for Online Analytical Processing (OLAP), a computing method that enables users to easily and selectively extract and query data in order to analyze it from different points of view, for example trend analysis, financial reporting, sales forecasting, and budgeting.
Storage Mechanism:
-> A table in HBase is a collection of rows (the table is sorted by row key).
-> A row is a collection of column families.
-> A column family is a collection of columns.
-> A column is a collection of key-value pairs.
-> A cell is addressed by {row, column, version}.
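The hierarchy above can be sketched with nested sorted maps, a common mental model for HBase: a table behaves like a sorted map of row key → column family → column qualifier → timestamp → value. The class below is an illustration of that model, not the HBase API:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of HBase's logical layout: a table is a sorted map
//   rowKey -> columnFamily -> columnQualifier -> timestamp -> value.
// A cell is addressed by {row, family:qualifier, version}.
public class ToyTable {
    private final NavigableMap<String,
            NavigableMap<String,
            NavigableMap<String,
            NavigableMap<Long, String>>>> rows = new TreeMap<>();

    public void put(String row, String family, String qualifier,
                    long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(qualifier, q -> new TreeMap<>())
            .put(timestamp, value);
    }

    // The newest version wins, mirroring HBase's default read behaviour.
    public String get(String row, String family, String qualifier) {
        var families = rows.get(row);
        if (families == null) return null;
        var qualifiers = families.get(family);
        if (qualifiers == null) return null;
        var versions = qualifiers.get(qualifier);
        if (versions == null || versions.isEmpty()) return null;
        return versions.lastEntry().getValue(); // highest timestamp
    }

    public static void main(String[] args) {
        ToyTable t = new ToyTable();
        t.put("row1", "cf", "name", 1L, "old");
        t.put("row1", "cf", "name", 2L, "new");
        System.out.println(t.get("row1", "cf", "name")); // prints new
    }
}
```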
Below diagram shows how the data is organized and stored in HBase table.

Credits: https://www.guru99.com/hbase-architecture-data-flow-usecases.html
HBase Architecture

Credits: https://mindmajix.com/hadoop-ecosystem
In HBase, a table is split into regions (the basic building blocks of an HBase cluster), which are served by region servers. HBase has 4 main components:
-> Master Server
-> Region Server (can be added or removed as per the requirement)
-> Regions
-> ZooKeeper
Master Server: Also called HMaster and it resides on the NameNode.
-> It assigns regions to the region servers with the help of ZooKeeper.
-> It handles load balancing of regions across region servers by unloading busy servers and shifting regions to less occupied ones.
-> It is also responsible for schema changes and metadata operations.
Region Server: Also called HRegionServer and it resides on DataNodes.
-> It communicates with the client and handles data-related operations across its regions.
-> It handles the read/write requests of all regions under it.
-> It decides the size of a region by following the region size threshold.
-> The HLog present in the region server stores all the logs.
Regions: Also called HRegions
Regions are the basic building blocks of an HBase cluster; each region holds a contiguous portion of a table's data and is made up of column families.
A region server contains regions and stores: each region is vertically divided by column family into stores.
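Because rows are kept sorted by row key, a region is simply a contiguous slice of the row-key space, and finding the region for a row is a floor search over region start keys. The sketch below is illustrative only; real HBase clients locate regions by consulting the META table:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy region lookup: each region owns rows from its start key (inclusive)
// up to the next region's start key (exclusive).
public class RegionLocator {
    private final NavigableMap<String, String> regionsByStartKey = new TreeMap<>();

    public void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    // floorEntry finds the region whose start key is <= the row key.
    public String regionFor(String rowKey) {
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        RegionLocator loc = new RegionLocator();
        loc.addRegion("", "region-1");  // "" sorts before every row key
        loc.addRegion("m", "region-2");
        System.out.println(loc.regionFor("apple"));  // region-1
        System.out.println(loc.regionFor("zebra"));  // region-2
    }
}
```

This range-based scheme is why well-designed row keys matter in HBase: rows with nearby keys land in the same region.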
A store in turn contains:
1. MemStore (memory store): holds data much like a cache, so incoming data is stored here first; later the data is written out to HFiles and the MemStore is flushed.
2. HFiles: An HFile is simply a specialized file-based data structure used to store data in HBase. HBase previously used the Hadoop MapFile format (a sorted SequenceFile), but it had performance limitations and was replaced by the HFile format, which is designed specifically for HBase.
An HFile contains sorted key/value pairs, where both keys and values are byte arrays. You can check the official HBase documentation to learn more about HFiles.
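The MemStore/HFile interplay can be sketched like this: writes accumulate in a sorted in-memory map, and once a threshold is reached the map is frozen as an immutable, sorted snapshot standing in for an HFile. The class and threshold below are illustrative, not HBase internals:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy MemStore: writes land in a sorted in-memory map; once it grows past
// a threshold, the map is frozen as an immutable sorted "HFile" snapshot
// and a fresh, empty MemStore takes its place.
public class ToyMemStore {
    private TreeMap<String, String> memstore = new TreeMap<>();
    private final List<NavigableMap<String, String>> hfiles = new ArrayList<>();
    private final int flushThreshold;

    public ToyMemStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public void put(String rowKey, String value) {
        memstore.put(rowKey, value);
        if (memstore.size() >= flushThreshold) flush();
    }

    public void flush() {
        if (memstore.isEmpty()) return;
        // The flushed file is immutable and already sorted by row key.
        hfiles.add(Collections.unmodifiableNavigableMap(memstore));
        memstore = new TreeMap<>();
    }

    public int hfileCount() { return hfiles.size(); }
    public int memstoreSize() { return memstore.size(); }
}
```

Keeping the MemStore sorted means each flush produces an already-sorted file, which is what makes later reads and compactions cheap.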
Have a look at the below image to have a better idea.

Credits: https://mapr.com/blog/in-depth-look-hbase-architecture/
You must be wondering what is this “WAL”. So let us talk a little bit about WAL.
The Write-Ahead Log (WAL) exists for data recovery: it records all modifications to the data in HBase in file-based storage. If a region server crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed.
With a single WAL per region server, the region server must write to the WAL serially, because HDFS files must be written sequentially; this can make the WAL a performance bottleneck. The WAL can be skipped for individual writes to trade durability for speed, by setting the durability on the mutation in the HBase client:
Mutation.setDurability(Durability.SKIP_WAL)
(Older HBase versions exposed this as setWriteToWAL(false).)
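Here is a minimal sketch of the write-ahead idea (illustrative, not HBase internals): every mutation is appended to a log before being applied in memory, so after a crash the in-memory state can be rebuilt by replaying the log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy write-ahead log: append the edit to the log first, then apply it
// to the in-memory store. After a crash, replaying the log rebuilds
// everything the MemStore held but had not yet flushed.
public class ToyWal {
    private final List<String[]> log = new ArrayList<>(); // stands in for durable storage
    private TreeMap<String, String> memstore = new TreeMap<>();

    public void put(String key, String value) {
        log.add(new String[] {key, value}); // 1. write ahead
        memstore.put(key, value);           // 2. apply in memory
    }

    // Simulate a crash (memory lost) followed by log replay.
    public void crashAndRecover() {
        memstore = new TreeMap<>();
        for (String[] edit : log) {
            memstore.put(edit[0], edit[1]);
        }
    }

    public String get(String key) {
        return memstore.get(key);
    }
}
```

Skipping the WAL (as with SKIP_WAL above) removes step 1, which is exactly why it is faster and exactly why unflushed data can be lost on a crash.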
Zookeeper
ZooKeeper is a distributed, open-source configuration and synchronization service with a naming registry for distributed applications. It is used to manage and coordinate a large cluster of machines.

Credits: https://www.tutorialspoint.com/hbase/hbase_architecture.htm
Note: ZooKeeper exists independent of HBase.
ZooKeeper keeps ephemeral nodes representing the live region servers. The master server uses these nodes to discover available servers and to track server failures and network partitions. Clients also contact ZooKeeper first to locate region servers, before communicating with those servers directly.
Dataflow in HBase

Credits: https://beyondcorner.com/learn-apache-hbase/hbase-data-flow-mechanism-architecture/
The read and write paths from the client down to the HFiles are shown in the diagram above.
Step 1) The client wants to write data; it first communicates with the region server and then with the appropriate region.
Step 2) The region passes the data to the MemStore associated with the column family being written.
Step 3) Data is first stored in the MemStore, where it is kept sorted by row key, and is later flushed into an HFile. The main reason for the MemStore is to accumulate and sort data by row key before it is written to the distributed file system. The MemStore lives in the region server's main memory, while HFiles are written to HDFS.
Step 4) The client wants to read data from a region.
Step 5) The region server first checks its MemStore, which may already hold the requested data in memory.
Step 6) Otherwise the HFiles are consulted; the data is fetched and returned to the client.
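The read path in steps 4–6 can be sketched as a merge: check the MemStore first, then fall back to the flushed files from newest to oldest. This is purely illustrative (the class name and structure are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy read path: a get() first checks the MemStore (the most recent data),
// then scans flushed "HFiles" from newest to oldest.
public class ToyReadPath {
    private final TreeMap<String, String> memstore = new TreeMap<>();
    private final List<Map<String, String>> hfiles = new ArrayList<>(); // oldest first

    public void writeToMemstore(String key, String value) {
        memstore.put(key, value);
    }

    public void addFlushedFile(Map<String, String> hfile) {
        hfiles.add(hfile);
    }

    public String get(String key) {
        if (memstore.containsKey(key)) return memstore.get(key); // freshest
        for (int i = hfiles.size() - 1; i >= 0; i--) {           // newest file first
            if (hfiles.get(i).containsKey(key)) return hfiles.get(i).get(key);
        }
        return null;
    }
}
```

The newest-first order is what keeps reads correct after many flushes: a value rewritten since the last flush is served from memory, never from a stale file.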
Conclusion
HBase allows you to store data both before and after processing, gives you great flexibility, and can store millions of columns and billions of rows with rapid access. We can also perform online real-time analytics by integrating HBase with the Hadoop ecosystem and with Apache Spark.
Hope you enjoyed reading this article. For any doubts, queries, or suggestions, please drop a comment.
You can reach out to me on LinkedIn.