
HBase vs. BigTable Comparison

HBase is an open-source implementation of the Google BigTable architecture. That part is fairly easy to understand and grasp. What I personally feel is a bit more difficult is to understand how much HBase covers and where there are differences (still) compared to the BigTable specification. This post is an
attempt to compare the two systems.



Before we embark on the dark technology side of things, I would like to point out one thing upfront: HBase is very close to what the BigTable paper describes. Putting aside minor differences, as of
HBase 0.20, which uses
ZooKeeper as its
distributed coordination service, it has all the means to be a nearly exact implementation of BigTable's functionality. What I will be looking at below are mainly subtle variations or differences. Where possible I will try to point out how
the HBase team is working on improving the situation, given there is a need to do so.

Scope

The comparison in this post is based on the OSDI'06 paper that describes the system Google implemented in about seven person-years and which has been in operation since 2005. The paper was published in 2006, while the HBase sub-project of
Hadoop was established only around the end of that same year to early 2007. Back then the current version of Hadoop was 0.15.0. Given that we are now about two years in, with Hadoop 0.20.1 and HBase 0.20.2 available,
you can hopefully understand that much has indeed happened since. Please also note that I am comparing a 14-page high-level technical paper with an open-source project that can be examined freely from top to bottom. This usually means that there is more to tell
about how HBase does things, simply because the information is available.



Towards the end I will also address a few newer features that BigTable has nowadays and how HBase compares to those. We start, though, with naming conventions.

Terminology

There are a few different terms used in the two systems to describe the same thing. The most prominent is what HBase calls "regions" while Google refers to them as "tablets". These are the partitions of contiguous rows spread across many "region servers" - or
"tablet servers" respectively. Apart from that, most differences are minor or caused by the use of related technologies, since Google's code is obviously closed-source and therefore only mirrored by open-source projects. The open-source projects are free to use
other terms and, most importantly, names for the projects themselves.

Features

The following table lists various "features" of BigTable and compares them with what HBase has to offer. Some are actual implementation details, some are configurable options, and so on. This may be confusing, but it would be difficult to sort them into categories
without ending up with only one entry in each of them.

Each feature below is listed as "Feature - BigTable / HBase", followed by notes.

Atomic Read/Write/Modify - BigTable: Yes, per row / HBase: Yes, per row
Since BigTable does not strive to be a relational database it does not have transactions. The closest it comes to such a mechanism is atomic access to each row in the table. HBase also implements a row lock API which allows the user to lock more than one row at a time.
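To make the per-row guarantee concrete, here is a minimal sketch (in Python, purely illustrative - not HBase's actual implementation): a lock per row key makes a multi-column mutation on one row atomic with respect to concurrent readers and writers.

```python
import threading
from collections import defaultdict

class ToyRegion:
    """Illustrative sketch only: a per-row lock makes a multi-column
    mutation on a single row atomic."""

    def __init__(self):
        self._rows = defaultdict(dict)             # row key -> {column: value}
        self._locks = defaultdict(threading.Lock)  # row key -> its lock

    def mutate_row(self, row, updates):
        # All columns of this row change together, and no concurrent
        # reader or writer of the same row can interleave with us.
        with self._locks[row]:
            self._rows[row].update(updates)

    def get_row(self, row):
        with self._locks[row]:
            return dict(self._rows[row])

region = ToyRegion()
region.mutate_row(b"row-1", {b"cf:a": b"1", b"cf:b": b"2"})
print(region.get_row(b"row-1"))  # both columns visible together
```

Cross-row transactions would need coordination across many such locks, which is exactly what neither system attempts.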
Lexicographic Row Order - BigTable: Yes / HBase: Yes
All rows are sorted lexicographically in one order and that order only. Again, this is no SQL database where you can have different sort orders.
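Because keys are compared as raw bytes, row-key design determines scan locality. A small Python illustration of the single sort order and its most common pitfall:

```python
# Rows sort by raw byte order, so "row-10" comes before "row-2":
rows = [b"row-2", b"row-10", b"row-1"]
assert sorted(rows) == [b"row-1", b"row-10", b"row-2"]

# Zero-padding the numeric part of the key restores the intended order:
padded = sorted(b"row-%03d" % n for n in (2, 10, 1))
assert padded == [b"row-001", b"row-002", b"row-010"]
```

The same property is what makes reversed-domain keys (e.g. `com.example/...`) cluster pages of one site next to each other.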
Block Support - BigTable: Yes / HBase: Yes
Within each storage file, data is written as smaller blocks of data. This enables faster loading of data from large storage files. The block size is configurable in either system; the typical size is 64KB.
Block Compression - BigTable: Yes, per column family / HBase: Yes, per column family
Google uses BMDiff and Zippy in a two-step process. BMDiff works really well because neighboring key-value pairs in the store files are often very similar. This similarity can be achieved by using versioning, so that all modifications to a value are stored next to each other while still having a lot in common, or by designing the row keys in such a way that, for example, web pages from the same site are all bundled together. Zippy is a modified LZW algorithm. HBase, on the other hand, uses the standard Java-supplied GZip or, with a little effort, the GPL-licensed LZO format. There are indications, though, that Hadoop may also want to have BMDiff (HADOOP-5793) and possibly Zippy as well.
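To see why similar neighboring keys compress so well, here is a toy delta encoding (this is not BMDiff, just a sketch of the underlying idea): each key stores only the length of the prefix it shares with its predecessor plus the differing suffix.

```python
def prefix_encode(keys):
    """Toy delta encoding (not BMDiff): store only the suffix that
    differs from the previous key, plus the shared-prefix length."""
    prev, out = b"", []
    for key in keys:
        n = 0
        while n < min(len(prev), len(key)) and prev[n] == key[n]:
            n += 1
        out.append((n, key[n:]))
        prev = key
    return out

keys = [b"com.example/page1", b"com.example/page2", b"com.example/page2/v2"]
print(prefix_encode(keys))
# [(0, b'com.example/page1'), (16, b'2'), (17, b'/v2')]
```

With well-designed row keys, most of each entry collapses into the shared-prefix count, which is exactly the redundancy the two-step compression exploits.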
Number of Column Families - BigTable: Hundreds at most / HBase: Less than 100
While the number of rows and columns is theoretically unbounded, the number of column families is not. This is a design trade-off, but it does not impose too many restrictions if the tables and keys are designed accordingly.

Column Family Name Format - BigTable: Printable / HBase: Printable
The main reason for this restriction in HBase is that column family names are used as directory names in the file system.
Qualifier Format - BigTable: Arbitrary / HBase: Arbitrary
Any arbitrary byte[] array can be used.

Key/Value Format - BigTable: Arbitrary / HBase: Arbitrary
Like above, any arbitrary byte[] array can be used.

Access Control - BigTable: Yes / HBase: No
BigTable enforces access control on the column family level. HBase does not yet have that feature (see HBASE-1697).
Cell Versions - BigTable: Yes / HBase: Yes
Versioning is done using timestamps; see the next feature below too. The number of versions that should be kept is freely configurable on the column family level.

Custom Timestamps - BigTable: Yes (microseconds) / HBase: Yes (milliseconds)
With both systems you can either set the timestamp of a stored value yourself or leave the default of "now". There are known restrictions in HBase where the outcome is indeterminate when adding older timestamps after newer ones have already been stored.
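The versioning model can be sketched as follows (illustrative only; the `max_versions` limit stands in for the per-column-family setting both systems offer): each cell holds a list of (timestamp, value) pairs, reads return the newest, and old versions beyond the limit are trimmed.

```python
class VersionedCell:
    """Sketch of timestamped cell versions with a limit on how many
    versions are retained (max_versions mimics the per-family setting)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value)

    def put(self, value, timestamp):
        self.versions.append((timestamp, value))
        self.versions.sort(reverse=True)       # newest timestamp first
        del self.versions[self.max_versions:]  # trim surplus old versions

    def get(self):
        # A plain read returns the value with the newest timestamp.
        return self.versions[0][1] if self.versions else None

cell = VersionedCell(max_versions=2)
cell.put(b"v1", timestamp=100)
cell.put(b"v2", timestamp=200)
cell.put(b"v0-late", timestamp=50)  # older timestamp written later
print(cell.get())  # b"v2" - newest timestamp wins; ts=50 got trimmed
```

This also hints at the restriction mentioned above: a write with an older timestamp may silently fall outside the retained window.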
Data Time-To-Live - BigTable: Yes / HBase: Yes
Besides keeping versions of data cells, the user can also set a time-to-live on the stored data that allows discarding data after a specific amount of time.

Batch Writes - BigTable: Yes / HBase: Yes
Both systems allow batching of table operations.

Value-based Counters - BigTable: Yes / HBase: Yes
BigTable and HBase can use specific columns as atomic counters. HBase does this by acquiring a row lock before the value is incremented.
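The counter mechanism boils down to a locked read-modify-write of a binary-encoded value. A hedged sketch (the class and method names here are illustrative, not HBase's API):

```python
import threading
import struct

class CounterRegion:
    """Sketch of value-based counters: increment is a locked
    read-modify-write of an 8-byte big-endian signed value."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()  # stand-in for a per-row lock

    def increment(self, row, column, amount=1):
        with self._lock:
            raw = self._values.get((row, column), struct.pack(">q", 0))
            (current,) = struct.unpack(">q", raw)
            self._values[(row, column)] = struct.pack(">q", current + amount)
            return current + amount

counters = CounterRegion()
counters.increment(b"page-1", b"stats:hits")
print(counters.increment(b"page-1", b"stats:hits"))  # 2
```

The lock is what makes two concurrent increments of the same counter never lose an update.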
Row Filters - BigTable: Yes / HBase: Yes
Again, both systems allow filters to be applied when scanning rows.

Client Script Execution - BigTable: Yes / HBase: No
BigTable uses Sawzall to enable users to process the stored data.

MapReduce Support - BigTable: Yes / HBase: Yes
Both systems have convenience classes that allow scanning a table in MapReduce jobs.

Storage Systems - BigTable: GFS / HBase: HDFS, S3, S3N, EBS
While BigTable works on Google's GFS, HBase has the option to use any file system as long as there is a proxy or driver class for it.
File Format - BigTable: SSTable / HBase: HFile

Block Index - BigTable: At end of file / HBase: At end of file
Both storage file formats have a similar block-oriented structure with the block index stored at the end of the file.

Memory Mapping - BigTable: Yes / HBase: No
BigTable can memory-map storage files directly into memory.
Lock Service - BigTable: Chubby / HBase: ZooKeeper
There is a difference in that HBase uses ZooKeeper to coordinate tasks rather than to provide locking services. Overall, though, ZooKeeper does for HBase pretty much what Chubby does for BigTable, with slightly different semantics.

Single Master - BigTable: Yes / HBase: No
HBase recently added support for multiple masters. These are on "hot" standby and monitor the active master's ZooKeeper node.

Tablet/Region Count - BigTable: 10-1000 / HBase: 10-1000
Both systems recommend about the same number of regions per region server. Of course this depends on many things, but given a similar setup as far as "commodity" machines are concerned, it seems to result in the same amount of load on each server.
Tablet/Region Size - BigTable: 100-200MB / HBase: 256MB
The maximum region size can be configured for both HBase and BigTable. HBase uses 256MB as the default value.

Root Location - BigTable: 1st META tablet / Chubby; HBase: -ROOT- table / ZooKeeper
HBase handles the Root table slightly differently from BigTable, where it is the first region in the Meta table. HBase uses a separate table with a single region to store the Root region. Once either system starts, the address of the server hosting the Root region is stored in ZooKeeper or Chubby so that the clients can resolve its location without hitting the master.

Client Region Cache - BigTable: Yes / HBase: Yes
The clients in either system cache the locations of regions and have appropriate mechanisms to detect stale information and update the local cache.
Meta Prefetch - BigTable: Yes / HBase: No (?)
A design feature of BigTable is to fetch the information for more than one Meta region at a time. This proactively fills the client cache for future lookups.

Historian - BigTable: Yes / HBase: Yes
The history of region-related events (such as splits, assignments, and reassignments) is recorded in the Meta table.

Locality Groups - BigTable: Yes / HBase: No
It is not entirely clear, but it seems everything in BigTable is defined in terms of locality groups. They group multiple column families together so that the families are stored together and share the same configuration parameters. A single column family is probably a locality group with one member. HBase does not have this option and handles each column family separately.
In-Memory Column Families - BigTable: Yes / HBase: Yes
These are for relatively small tables that need very fast access times.

KeyValue (Cell) Cache - BigTable: Yes / HBase: No
This is a cache that serves hot cells.

Block Cache - BigTable: Yes / HBase: Yes
Blocks read from the storage files are cached internally in configurable caches.

Bloom Filters - BigTable: Yes / HBase: Yes
These filters allow - at the cost of using memory on the region server - a quick check whether a specific cell exists at all.
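The memory-for-speed trade-off is easy to see in a minimal Bloom filter sketch (a generic illustration of the data structure, not either system's implementation): a "no" answer is definitive and saves a disk read, while a "yes" only means "possibly present".

```python
import hashlib

class ToyBloomFilter:
    """Minimal Bloom filter: k hash positions are set in a bit array on
    add; membership checks can yield false positives, never negatives."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # Python int used as a bit array

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(key + bytes([i])).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = ToyBloomFilter()
bf.add(b"row-1/cf:qual")
print(bf.might_contain(b"row-1/cf:qual"))  # True
print(bf.might_contain(b"row-9/cf:qual"))  # almost certainly False
```

A region server consults such a filter before touching a store file, skipping files that definitely do not hold the requested cell.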
Write-Ahead Log (WAL) - BigTable: Yes / HBase: Yes
Each region server in either system stores one modification log for all regions it hosts.

Secondary Log - BigTable: Yes / HBase: No
In addition to the write-ahead log mentioned above, BigTable has a second log that it can use when the first is going slow. This is a performance optimization.

Skip Write-Ahead Log - BigTable: ? / HBase: Yes
For bulk imports, the client in HBase can opt to skip writing into the WAL.

Fast Table/Region Split - BigTable: Yes / HBase: Yes
Splitting a region or tablet is fast because the daughter regions first read the original storage file until a compaction finally rewrites the data into the regions' own local stores.
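The reason splits are fast is that no data is copied up front. A sketch of the idea (illustrative only; the class names are mine, though HBase's reference files work along these lines): each daughter initially serves its half of the key range straight out of the parent's immutable file through a key-range filter.

```python
class StoreFile:
    """An immutable, sorted store file (toy version)."""
    def __init__(self, cells):
        self.cells = sorted(cells)  # list of (row_key, value)

class ReferenceFile:
    """A daughter region's view of its half of the parent's key range,
    served from the parent's file until a compaction rewrites the data."""
    def __init__(self, parent, start_row, end_row):
        self.parent, self.start, self.end = parent, start_row, end_row

    def scan(self):
        for row, value in self.parent.cells:
            if self.start <= row and (self.end is None or row < self.end):
                yield row, value

parent = StoreFile([(b"a", b"1"), (b"m", b"2"), (b"z", b"3")])
top = ReferenceFile(parent, b"", b"n")      # daughter serving [ "", "n" )
bottom = ReferenceFile(parent, b"n", None)  # daughter serving [ "n", end )
print(list(top.scan()))     # [(b'a', b'1'), (b'm', b'2')]
print(list(bottom.scan()))  # [(b'z', b'3')]
```

Only the later compaction pays the cost of physically rewriting each half into the daughter's own store.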

New Features

As mentioned above, a few years have passed since the original OSDI'06 BigTable paper. Jeff Dean - a fellow at Google - has mentioned a few new BigTable
features during speeches and presentations he gave recently. We will have a look at some of them here.

Client Isolation - BigTable: Yes / HBase: No
BigTable is used internally to serve many separate clients and can therefore keep their data isolated from each other.

Coprocessors - BigTable: Yes / HBase: No
BigTable can host code that resides with the regions and splits with them as well. See HBASE-2000 for progress on this feature within HBase.

Corruption Safety - BigTable: Yes / HBase: No
This is an interesting topic. BigTable uses CRC checksums to verify that data has been written safely. While HBase does not have this itself, the question is whether that is built into Hadoop's HDFS.

Replication - BigTable: Yes / HBase: No
HBase is working on the same topic in HBASE-1295.

Note: in the original post's tables, color codes indicated which features have a direct match and which are missing (yet). Weaker features were colored yellow, as I am not sure if they are immediately necessary or even applicable given HBase's implementation.

Variations and Differences

Some of the above features need a bit more examination, as they are difficult to narrow down to simple "yes or no" answers. I am addressing them separately below.



Lock Service



This is from the BigTable paper:

Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data (see Section 5.1); to discover
tablet servers and finalize tablet server deaths (see Section 5.2); to store Bigtable schema information (the column family information for each table); and to store access control lists. If Chubby becomes unavailable for an extended period of time, Bigtable
becomes unavailable.

There is a lot of overlap compared to how HBase uses ZooKeeper. What is different, though, is that schema information is not stored in ZooKeeper (yet; see http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases for details). What is important here, though, is
the same reliance on the lock service being available. From my own experience and from reading the threads on the HBase mailing list, it is often underestimated what can happen when ZooKeeper does not get the resources it needs to respond in a timely manner. It is better to have
a small ZooKeeper cluster on older machines not doing anything else, as opposed to having ZooKeeper nodes running next to the already heavy Hadoop or HBase processes. Once you starve ZooKeeper you will see a domino effect of HBase nodes going down with it -
including the master(s).



Update: After talking to a few members of the ZooKeeper team, I would like to point out that this is indeed
not a ZooKeeper issue. It has to do with the fact that if an already heavily loaded node also has to respond to ZooKeeper in time, you may face a timeout situation where the HBase RegionServers and even the Master think
that their coordination service is gone and shut themselves down. Patrick Hunt has responded to this by
mail and by post. Please read both to see that ZooKeeper is able to handle the load. I personally recommend setting up ZooKeeper for HBase on a separate cluster, maybe a set of spare machines left over from a recent cluster upgrade, which are
slightly outdated (no 2x quad-core CPUs with 16GB of memory) but otherwise perfectly fine. This also allows you to monitor the machines separately, instead of seeing a combined CPU load of 100% on the servers without really knowing where it comes from
and what effect it may have.



Another important difference is that ZooKeeper is not a lock service like Chubby - and I do think it does not have to be, as far as HBase is concerned. ZooKeeper is a distributed coordination service that enables HBase to do master node elections and so on. It also allows
using semaphore-like nodes to indicate state or required actions. So where Chubby creates a lock file to indicate that a tablet server is up and running, HBase uses ephemeral nodes that exist as long as the session between the RegionServer that created the node
and ZooKeeper is active. This also causes differences in semantics: in BigTable the master can delete a tablet server's lock file to indicate that the server has lost its lease on its tablets, while in HBase this has to be handled differently because of the slightly less restrictive
architecture of ZooKeeper. These are only semantics, as mentioned, and do not mean one is better than the other - just different.

The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA
tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but is treated specially - it is never split - to ensure that the tablet location hierarchy has no more than three levels.

As mentioned above, in HBase the root region is its own table with a single region. I strongly doubt that this makes a difference compared to having it as the first (non-splittable) region of the meta table. It is the same feature, just implemented differently.

The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row.

HBase has a different layout here. It stores the start and end row with each region, where the end row is exclusive and denotes the first (or start) row of the next region. Again, these are minor differences and I am not sure if one solution is better or worse than the other. It is just done differently.
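The BigTable-style layout can be sketched in a few lines (illustrative only; the entries and server names are made up, and I assume here a toy convention where a region's end row is inclusive - the opposite of HBase's exclusive end row): because the META row key is (table id, end row), a client can locate the region for any row with a binary search over its cached entries.

```python
import bisect

# Toy META entries keyed by (table id, end row), sorted like the table:
meta = sorted([
    ((b"webtable", b"g"), "server-1"),    # serves rows up to "g"
    ((b"webtable", b"p"), "server-2"),    # serves rows ("g", "p"]
    ((b"webtable", b"\xff"), "server-3"), # serves the remaining rows
])
keys = [key for key, _ in meta]

def locate(table, row):
    # First entry whose end row is >= the searched row serves it.
    idx = bisect.bisect_left(keys, (table, row))
    return meta[idx][1]

print(locate(b"webtable", b"g"))  # server-1 (end row "g" is inclusive here)
print(locate(b"webtable", b"h"))  # server-2
```

With HBase's (start row, exclusive end row) layout the same lookup works, only the comparison direction flips, which is why neither variant seems clearly better.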



Master Operation

To detect when a tablet server is no longer serving its tablets, the master periodically asks each tablet server for the status of its lock. If a tablet server reports that it has lost
its lock, or if the master was unable to reach a server during its last several attempts, the master attempts to acquire an exclusive lock on the server's file. If the master is able to acquire the lock, then Chubby is live and the tablet server is either
dead or having trouble reaching Chubby, so the master ensures that the tablet server can never serve again by deleting its server file. Once a server's file has been deleted, the master can move all the tablets that were previously assigned to that server
into the set of unassigned tablets. To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires.

This is quite different, even up to the current HBase 0.20.2. Here the master uses a heartbeat protocol by which the region servers report for duty and subsequently confirm that they are still alive. I am not sure if this topic is covered by the master rewrite
umbrella issue HBASE-1816 - or if it needs to be addressed at all. It could well be that what we have in HBase now is sufficient and does its job just fine. It was created when there was no lock service yet, so it could be considered legacy code too.



Master Startup

The master executes the following steps at startup. (1) The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations. (2) The master scans the servers
directory in Chubby to find the live servers. (3) The master communicates with every live tablet server to discover what tablets are already assigned to each server. (4) The master scans the METADATA table to learn the set of tablets. Whenever this scan encounters
a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets, which makes the tablet eligible for tablet assignment.

In line with what I mentioned above, this part of the code was created before ZooKeeper was available. So HBase actually waits for the region servers to report for duty. It also scans the .META. table to learn what regions exist and which server each is assigned to. ZooKeeper
is (as of yet) only used to publish the server hosting the -ROOT- region.



Tablet/Region Splits

In case the split notification is lost (either because the tablet server or the master died), the master detects the new tablet when it asks a tablet server to load the tablet that has
now split. The tablet server will notify the master of the split, because the tablet entry it finds in the METADATA table will specify only a portion of the tablet that the master asked it to load.

The master node in HBase uses the .META. table solely to detect when a region was split but the notification was lost. For that reason it scans .META. on a regular basis to see when a region appears that is not yet assigned. It will then assign that region as per
its default strategy.



Compactions



The following are more terminology differences than anything else.

As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable
is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming
read and write operations can continue while compactions occur.

HBase has a similar operation, but it is referred to as a "flush". In contrast, "minor compactions" in HBase rewrite the last N used store files, i.e. those with the most recent mutations, as they are probably much smaller than previously created files that
hold more data.
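The flush mechanism itself is simple enough to sketch (illustrative only; the threshold value and names are made up): mutations accumulate in memory, and crossing a size threshold freezes the memtable, writes it out as an immutable sorted file, and starts a fresh one.

```python
class ToyMemtable:
    """Sketch of a memtable flush: in-memory edits are written out as an
    immutable sorted file once a size threshold is crossed."""

    def __init__(self, flush_threshold_bytes=64):
        self.threshold = flush_threshold_bytes
        self.data = {}
        self.size = 0
        self.flushed_files = []  # stand-ins for on-disk store files

    def put(self, key, value):
        self.size += len(key) + len(value)
        self.data[key] = value
        if self.size >= self.threshold:
            self._flush()

    def _flush(self):
        # Writing the sorted snapshot shrinks memory usage and lets the
        # WAL entries covering these edits be skipped during recovery.
        self.flushed_files.append(sorted(self.data.items()))
        self.data, self.size = {}, 0

mem = ToyMemtable(flush_threshold_bytes=10)
mem.put(b"row-b", b"1")
mem.put(b"row-a", b"2")   # crosses the threshold -> flush
print(mem.flushed_files)  # one sorted file; the memtable is empty again
```

Reads and writes keep flowing during the flush in the real systems; the sketch only shows the threshold-and-rollover bookkeeping.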

... we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and
writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.

This again refers to what is called "minor compaction" in HBase.

A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction.

Here we have an exact match: a "major compaction" in HBase also rewrites all files into one.
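Since every store file is sorted, a major compaction is essentially a k-way merge that keeps only the winning version per key. A minimal sketch (illustrative only; I use a sequence number as a stand-in for whatever ordering the real systems use to pick the newest value):

```python
import heapq

def major_compact(store_files):
    """Sketch of a major compaction: merge all sorted store files into
    exactly one, keeping only the newest value per key. Each file holds
    (key, seq, value) tuples; a higher seq means a more recent write."""
    best = {}
    for key, seq, value in heapq.merge(*store_files):
        if key not in best or seq > best[key][0]:
            best[key] = (seq, value)
    # Emit one sorted file containing the surviving entries.
    return sorted((k, s, v) for k, (s, v) in best.items())

older = [(b"a", 1, b"old"), (b"b", 1, b"x")]
newer = [(b"a", 2, b"new")]
print(major_compact([older, newer]))
# [(b'a', 2, b'new'), (b'b', 1, b'x')]
```

The inputs can be discarded once the single output file exists, which is also the point at which deleted and expired cells disappear for good in the real systems.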



Immutable Files



Knowing that files are fixed once written BigTable makes the following assumption:

The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write
and allow reads and writes to proceed in parallel.

I believe this is done similarly in HBase, but I am not sure. It certainly shares the same architecture, as HDFS files, for example, are also immutable once written.



I can only recommend that you read the BigTable paper yourself and make up your own mind. This post was inspired by the idea of learning what BigTable really has to offer and how much of it HBase already covers. The difficult part is, of course, that there is not much information
available on BigTable. But the numbers that even the 2006 paper lists are more than impressive. If HBase, as an open-source project with just a handful of committers, most of whom have full-time day jobs, can achieve something even remotely comparable, I think that
is a huge success. And looking at the 0.21 and 0.22 road maps, the already small gap is going to shrink even further!

Original post: http://www.larsgeorge.com/2009/11/hbase-vs-bigtable-comparison.html
