HashTable/SyncTable is a tool for synchronizing HBase table data. It works in two steps, each of which is a MapReduce job. Like CopyTable, it can synchronize part or all of a table's data, either within one cluster or between different clusters. Compared with CopyTable, however, it performs better when synchronizing tables across clusters. Instead of copying an entire range of table data, it first runs HashTable on the source cluster to generate batch hashes over the source table, and then runs SyncTable on the target cluster. SyncTable reads the source table, the source hashes, and the target table, computes the corresponding hashes on the target side, and compares the two sets of hashes to find the ranges that differ. Only the missing or divergent data then needs to be transferred, which can greatly reduce bandwidth and data transfer.
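The hash-compare idea can be illustrated with a minimal sketch. This is not HBase's actual implementation: the "tables" are plain dicts, batches are fixed groups of row keys rather than byte-sized ranges, and MD5 stands in for HashTable's hashing. It only shows why matching batch hashes let SyncTable skip whole ranges and rescan just the ones that differ.

```python
import hashlib

# Toy "tables": row key -> cell value. The target is out of sync.
source = {"0001": "Tom", "0002": "Anna", "0003": "Bob", "0004": "Eve"}
target = {"0001": "Tom", "0002": "Anna", "0003": "XXX"}

def batch_hashes(table, batch_size=2):
    """Hash consecutive row-key batches, loosely like HashTable's batchsize."""
    keys = sorted(table)
    hashes = {}
    for i in range(0, len(keys), batch_size):
        batch = tuple(keys[i:i + batch_size])
        data = "".join(k + table.get(k, "") for k in batch)
        hashes[batch] = hashlib.md5(data.encode()).hexdigest()
    return hashes

# Compare batch hashes; rescan and copy only the batches that differ
# (SyncTable hashes the same key ranges on the target side).
for batch, src_hash in batch_hashes(source).items():
    tgt_data = "".join(k + target.get(k, "") for k in batch)
    if hashlib.md5(tgt_data.encode()).hexdigest() != src_hash:
        for k in batch:                 # rescan only this mismatched range
            if k in source:
                target[k] = source[k]   # put the missing/changed cell
            else:
                target.pop(k, None)     # delete the extra cell

print(target == source)
```

Here only the second batch mismatches, so only rows 0003 and 0004 are copied; the first batch is never rescanned.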
Step 1: HashTable
First, run HashTable on the source cluster.
HashTable usage:
Usage: HashTable [options] <tablename> <outputpath>
Options:
batchsize the target amount of bytes to hash in each batch
rows are added to the batch until this size is reached
(defaults to 8000 bytes)
numhashfiles the number of hash files to create
if set to fewer than number of regions then
the job will create this number of reducers
(defaults to 1/100 of regions -- at least 1)
startrow the start row
stoprow the stop row
starttime beginning of the time range (unixtime in millis)
without endtime means from starttime to forever
endtime end of the time range. Ignored if no starttime specified.
scanbatch scanner batch size to support intra row scans
versions number of cell versions to include
families comma-separated list of families to include
Args:
tablename Name of the table to hash
outputpath Filesystem path to put the output data
Examples:
To hash 'TestTable' in 32kB batches for a 1 hour window into 50 files:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=50 --starttime=1265875194289 --endtime=1265878794289 --families=cf2,cf3 TestTable /hashes/testTable
The batchsize property defines how much cell data from a given region goes into a single hash value. This setting directly affects sync efficiency, because it determines how many scans the SyncTable mapper tasks (the next step of the process) have to execute. The rule of thumb: the fewer cells that are out of sync (i.e., the lower the probability of finding a difference), the larger the batch size can be set, and vice versa.
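The trade-off can be sketched with toy arithmetic (the numbers below are made up, not measurements): a larger batchsize means far fewer hashes to compute and compare, but each detected difference forces a full rescan of a bigger batch.

```python
def rescanned_bytes(table_bytes, batchsize, diff_spots):
    """Worst case: each isolated difference lands in a distinct batch,
    so SyncTable must rescan roughly batchsize bytes per difference."""
    n_batches = table_bytes // batchsize
    return min(diff_spots, n_batches) * batchsize

# Compare the 8 kB default against the 32 kB example above,
# for a 1 MB table with 5 isolated out-of-sync spots.
for bs in (8_000, 32_000):
    print(bs, 1_000_000 // bs, rescanned_bytes(1_000_000, bs, 5))
```

With few differences the larger batch wins overall (125 hashes shrink to 31); with many scattered differences, most batches mismatch anyway and the rescan cost dominates.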
Step 2: SyncTable
Once HashTable has finished on the source cluster, SyncTable can be run on the target cluster. As with replication and other synchronization jobs, the target cluster must be able to reach all RegionServers/DataNodes of the source cluster.
SyncTable usage:
Usage: SyncTable [options] <sourcehashdir> <sourcetable> <targettable>
Options:
sourcezkcluster ZK cluster key of the source table
(defaults to cluster in classpath's config)
targetzkcluster ZK cluster key of the target table
(defaults to cluster in classpath's config)
dryrun if true, output counters but no writes
(defaults to false)
Args:
sourcehashdir path to HashTable output dir for source table
if not specified, then all data will be scanned
sourcetable Name of the source table to sync from
targettable Name of the target table to sync to
Examples:
For a dry run SyncTable of tableA from a remote source cluster
to a local target cluster:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableA
The dryrun option is useful for read-only table comparisons: it reports the number of differences between the two tables without changing either of them, and can serve as an alternative to the VerifyReplication tool.
By default, SyncTable makes the target table an exact replica of the source table.
Setting doDeletes to false changes this default so that data present in the target but absent from the source is not deleted. Likewise, setting doPuts to false stops data present in the source but absent from the target from being added. Setting both doDeletes and doPuts to false has the same effect as setting dryrun to true.
In two-way replication setups, or in any scenario where both the source and target clusters receive other writes, it is recommended to set doDeletes to false.
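The interaction of dryrun, doDeletes, and doPuts can be modeled in a small sketch, again on plain dicts rather than real HBase tables. This is an illustration of the semantics described above, not SyncTable's code:

```python
def sync_table(source, target, dry_run=False, do_deletes=True, do_puts=True):
    """Sketch of SyncTable's write semantics on two dict 'tables'
    (row key -> value). Mutates target unless dry_run; returns counters."""
    puts = {k: v for k, v in source.items() if target.get(k) != v}
    deletes = [k for k in target if k not in source]
    if not dry_run:
        if do_puts:
            target.update(puts)      # add/overwrite cells missing on target
        if do_deletes:
            for k in deletes:        # remove cells absent from source
                del target[k]
    return {"puts": len(puts), "deletes": len(deletes)}

src = {"0001": "Tom", "0002": "Anna"}
tgt = {"0002": "old", "0003": "extra"}

# dryrun: counters only, no writes.
print(sync_table(src, dict(tgt), dry_run=True))   # {'puts': 2, 'deletes': 1}

# doDeletes=false: '0003' survives on the target (two-way replication case).
kept = dict(tgt)
sync_table(src, kept, do_deletes=False)
print(kept)
```

Note how dry_run=True and do_deletes=do_puts=False take the same code path: diffs are counted but nothing is written.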
Example
The following walkthrough syncs two tables within the same cluster; the cross-cluster case works the same way.
Source table:
Table name: Student
Table data:
hbase(main):010:0> scan 'Student'
ROW COLUMN+CELL
0001 column=Grades:BigData, timestamp=1604988333715, value=80
0001 column=Grades:Computer, timestamp=1604988336890, value=90
0001 column=Grades:Math, timestamp=1604988339775, value=85
0001 column=StuInfo:Age, timestamp=1604988324791, value=18
0001 column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
0001 column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.0120 seconds
Table structure:
hbase(main):011:0> describe 'Student'
Table Student is ENABLED
Student
COLUMN FAMILIES DESCRIPTION
{NAME => 'Grades', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'StuInfo', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
2 row(s) in 0.0400 seconds
Run HashTable:
hbase org.apache.hadoop.hbase.mapreduce.HashTable Student /tmp/hash/Student
After it finishes, the /tmp/hash/Student directory contains:
drwxr-xr-x+ - hadoop supergroup 0 2020-12-17 16:09 /tmp/hash/Student/hashes
-rw-r--r--+ 2 hadoop supergroup 0 2020-12-17 16:09 /tmp/hash/Student/hashes/_SUCCESS
drwxr-xr-x+ - hadoop supergroup 0 2020-12-17 16:09 /tmp/hash/Student/hashes/part-r-00000
-rw-r--r--+ 2 hadoop supergroup 158 2020-12-17 16:09 /tmp/hash/Student/hashes/part-r-00000/data
-rw-r--r--+ 2 hadoop supergroup 220 2020-12-17 16:09 /tmp/hash/Student/hashes/part-r-00000/index
-rw-r--r--+ 2 hadoop supergroup 80 2020-12-17 16:08 /tmp/hash/Student/manifest
-rw-r--r--+ 2 hadoop supergroup 153 2020-12-17 16:08 /tmp/hash/Student/partitions
Create a new table named Student_2:
create 'Student_2','StuInfo','Grades'
Now insert part of the data from Student into Student_2:
put 'Student_2', '0001', 'StuInfo:Name', 'Tom Green', 1
put 'Student_2', '0001', 'StuInfo:Age', '18'
put 'Student_2', '0001', 'StuInfo:Sex', 'Male'
At this point, the data in Student and Student_2 looks like this:
hbase(main):015:0> scan 'Student_2'
ROW COLUMN+CELL
0001 column=StuInfo:Age, timestamp=1608192992466, value=18
0001 column=StuInfo:Name, timestamp=1, value=Tom Green
0001 column=StuInfo:Sex, timestamp=1608192995476, value=Male
1 row(s) in 0.0180 seconds
hbase(main):016:0> scan 'Student'
ROW COLUMN+CELL
0001 column=Grades:BigData, timestamp=1604988333715, value=80
0001 column=Grades:Computer, timestamp=1604988336890, value=90
0001 column=Grades:Math, timestamp=1604988339775, value=85
0001 column=StuInfo:Age, timestamp=1604988324791, value=18
0001 column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
0001 column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.0070 seconds
Now sync Student into Student_2 with SyncTable.
Run:
hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=false --sourcezkcluster=hadoop:2181:/hbase hdfs://hadoop:8020/tmp/hash/Student Student Student_2
After the job completes, the two tables are in sync:
hbase(main):001:0> scan 'Student_2'
ROW COLUMN+CELL
0001 column=Grades:BigData, timestamp=1604988333715, value=80
0001 column=Grades:Computer, timestamp=1604988336890, value=90
0001 column=Grades:Math, timestamp=1604988339775, value=85
0001 column=StuInfo:Age, timestamp=1604988324791, value=18
0001 column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
0001 column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.2620 seconds
hbase(main):002:0> scan 'Student'
ROW COLUMN+CELL
0001 column=Grades:BigData, timestamp=1604988333715, value=80
0001 column=Grades:Computer, timestamp=1604988336890, value=90
0001 column=Grades:Math, timestamp=1604988339775, value=85
0001 column=StuInfo:Age, timestamp=1604988324791, value=18
0001 column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
0001 column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.0210 seconds