HBase Data Synchronization Tool: HashTable/SyncTable

HashTable/SyncTable is a tool for synchronizing HBase table data. It works in two steps, each of which is a MapReduce job. Like the CopyTable tool, it can synchronize part or all of a table's data, either within a single cluster or between different clusters. Compared with CopyTable, however, it performs better when synchronizing tables across clusters. Instead of copying every row in a given key range, it first runs HashTable on the source cluster to generate hash sequences over the source table, then runs SyncTable on the target cluster, which computes matching hashes over the target table and compares them against the source hashes to find the ranges that diverge. Only the missing or differing data then needs to be transferred, which can greatly reduce bandwidth and data transfer.
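The two-step workflow can be sketched as the following pair of commands. This is a minimal sketch: the table name, paths, and host names here are placeholders, not values from the examples later in this article.

```shell
# Step 1: on the source cluster, hash the source table into an HDFS directory.
hbase org.apache.hadoop.hbase.mapreduce.HashTable \
  MyTable /hashes/MyTable

# Step 2: on the target cluster, compare those hashes against the target
# table and write only the divergent cells.
hbase org.apache.hadoop.hbase.mapreduce.SyncTable \
  --sourcezkcluster=src-zk1,src-zk2,src-zk3:2181:/hbase \
  hdfs://source-nn:8020/hashes/MyTable MyTable MyTable
```

Both commands launch MapReduce jobs, so they must be run on nodes that have the Hadoop and HBase client configuration available.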

Step 1: HashTable

First, run HashTable on the source cluster.

HashTable usage:

Usage: HashTable [options] <tablename> <outputpath>

Options:
 batchsize     the target amount of bytes to hash in each batch
               rows are added to the batch until this size is reached
               (defaults to 8000 bytes)
 numhashfiles  the number of hash files to create
               if set to fewer than number of regions then
               the job will create this number of reducers
               (defaults to 1/100 of regions -- at least 1)
 startrow      the start row
 stoprow       the stop row
 starttime     beginning of the time range (unixtime in millis)
               without endtime means from starttime to forever
 endtime       end of the time range.  Ignored if no starttime specified.
 scanbatch     scanner batch size to support intra row scans
 versions      number of cell versions to include
 families      comma-separated list of families to include

Args:
 tablename     Name of the table to hash
 outputpath    Filesystem path to put the output data

Examples:
 To hash 'TestTable' in 32kB batches for a 1 hour window into 50 files:
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=50 --starttime=1265875194289 --endtime=1265878794289 --families=cf2,cf3 TestTable /hashes/testTable

The batchsize property defines how much cell data from a given region is covered by a single hash value. Tuning it directly affects synchronization efficiency, because it can reduce the number of scans performed by the SyncTable mapper tasks (the next step in the process). The rule of thumb: the fewer cells that are out of sync (i.e., the lower the probability of finding a difference), the larger the batchsize can be. In other words, if little data is unsynchronized, set this value larger; and vice versa.
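For instance, when the two tables are expected to be nearly identical, a larger batch size pays off. A sketch (table name and output path are placeholders):

```shell
# A large batchsize produces fewer, coarser hashes: cheap to compare when
# the tables mostly match, but each mismatching batch must be rescanned in
# full by SyncTable, so shrink it if many differences are expected.
hbase org.apache.hadoop.hbase.mapreduce.HashTable \
  --batchsize=64000 \
  MyTable /hashes/MyTable
```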

Step 2: SyncTable

Once HashTable has finished on the source cluster, SyncTable can be run on the target cluster. As with replication and other synchronization jobs, the target cluster must be able to reach all RegionServers/DataNodes of the source cluster.

SyncTable usage:

Usage: SyncTable [options] <sourcehashdir> <sourcetable> <targettable>

Options:
 sourcezkcluster  ZK cluster key of the source table
                  (defaults to cluster in classpath's config)
 targetzkcluster  ZK cluster key of the target table
                  (defaults to cluster in classpath's config)
 dryrun           if true, output counters but no writes
                  (defaults to false)

Args:
 sourcehashdir    path to HashTable output dir for source table
                  if not specified, then all data will be scanned
 sourcetable      Name of the source table to sync from
 targettable      Name of the target table to sync to

Examples:
 For a dry run SyncTable of tableA from a remote source cluster
 to a local target cluster:
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableA

The dryrun option is useful for read-only operation and table comparison: it reports the number of differences between the two tables without making any changes, and can serve as an alternative to the VerifyReplication tool.

By default, SyncTable makes the target table an exact replica of the source table.

Setting doDeletes to false changes this default so that data present in the target table but absent from the source table is not deleted. Likewise, setting doPuts to false changes the default so that data present in the source table but absent from the target table is not added. Setting both doDeletes and doPuts to false has the same effect as setting dryrun to true.

In two-way replication setups, or in other scenarios where both the source and target clusters receive independent writes, it is recommended to set doDeletes to false.
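A sketch of such an invocation (cluster addresses and paths are placeholders):

```shell
# Both clusters receive independent writes, so rows that exist only in
# the target must not be deleted; only missing puts are applied.
hbase org.apache.hadoop.hbase.mapreduce.SyncTable \
  --doDeletes=false \
  --sourcezkcluster=src-zk1:2181:/hbase \
  hdfs://source-nn:8020/hashes/tableA tableA tableA
```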

Example

This example synchronizes two tables within the same cluster; cross-cluster synchronization works the same way.

Source table:

Table name: Student
Table data:

hbase(main):010:0> scan 'Student'
ROW                                       COLUMN+CELL
 0001                                     column=Grades:BigData, timestamp=1604988333715, value=80
 0001                                     column=Grades:Computer, timestamp=1604988336890, value=90
 0001                                     column=Grades:Math, timestamp=1604988339775, value=85
 0001                                     column=StuInfo:Age, timestamp=1604988324791, value=18
 0001                                     column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
 0001                                     column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.0120 seconds

Table schema:

hbase(main):011:0> describe 'Student'
Table Student is ENABLED
Student

COLUMN FAMILIES DESCRIPTION

{NAME => 'Grades', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

{NAME => 'StuInfo', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}

2 row(s) in 0.0400 seconds

Run HashTable:

hbase org.apache.hadoop.hbase.mapreduce.HashTable Student /tmp/hash/Student

After it finishes, the contents of /tmp/hash/Student look like this:

drwxr-xr-x+  - hadoop supergroup          0 2020-12-17 16:09 /tmp/hash/Student/hashes
-rw-r--r--+  2 hadoop supergroup          0 2020-12-17 16:09 /tmp/hash/Student/hashes/_SUCCESS
drwxr-xr-x+  - hadoop supergroup          0 2020-12-17 16:09 /tmp/hash/Student/hashes/part-r-00000
-rw-r--r--+  2 hadoop supergroup        158 2020-12-17 16:09 /tmp/hash/Student/hashes/part-r-00000/data
-rw-r--r--+  2 hadoop supergroup        220 2020-12-17 16:09 /tmp/hash/Student/hashes/part-r-00000/index
-rw-r--r--+  2 hadoop supergroup         80 2020-12-17 16:08 /tmp/hash/Student/manifest
-rw-r--r--+  2 hadoop supergroup        153 2020-12-17 16:08 /tmp/hash/Student/partitions

Create a new table named Student_2:

create 'Student_2','StuInfo','Grades'

Now insert part of the data from Student into Student_2:

put 'Student_2', '0001', 'StuInfo:Name', 'Tom Green', 1
put 'Student_2', '0001', 'StuInfo:Age', '18'
put 'Student_2', '0001', 'StuInfo:Sex', 'Male'

At this point the data in Student and Student_2 looks like this:

hbase(main):015:0> scan 'Student_2'
ROW                                       COLUMN+CELL
 0001                                     column=StuInfo:Age, timestamp=1608192992466, value=18
 0001                                     column=StuInfo:Name, timestamp=1, value=Tom Green
 0001                                     column=StuInfo:Sex, timestamp=1608192995476, value=Male
1 row(s) in 0.0180 seconds

hbase(main):016:0> scan 'Student'
ROW                                       COLUMN+CELL
 0001                                     column=Grades:BigData, timestamp=1604988333715, value=80
 0001                                     column=Grades:Computer, timestamp=1604988336890, value=90
 0001                                     column=Grades:Math, timestamp=1604988339775, value=85
 0001                                     column=StuInfo:Age, timestamp=1604988324791, value=18
 0001                                     column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
 0001                                     column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.0070 seconds

Now use SyncTable to synchronize Student into Student_2:

hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=false --sourcezkcluster=hadoop:2181:/hbase hdfs://hadoop:8020/tmp/hash/Student Student Student_2

Once the job completes, the two tables are in sync:

hbase(main):001:0> scan 'Student_2'
ROW                                       COLUMN+CELL
 0001                                     column=Grades:BigData, timestamp=1604988333715, value=80
 0001                                     column=Grades:Computer, timestamp=1604988336890, value=90
 0001                                     column=Grades:Math, timestamp=1604988339775, value=85
 0001                                     column=StuInfo:Age, timestamp=1604988324791, value=18
 0001                                     column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
 0001                                     column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.2620 seconds

hbase(main):002:0> scan 'Student'
ROW                                       COLUMN+CELL
 0001                                     column=Grades:BigData, timestamp=1604988333715, value=80
 0001                                     column=Grades:Computer, timestamp=1604988336890, value=90
 0001                                     column=Grades:Math, timestamp=1604988339775, value=85
 0001                                     column=StuInfo:Age, timestamp=1604988324791, value=18
 0001                                     column=StuInfo:Name, timestamp=1604988321380, value=Tom Green
 0001                                     column=StuInfo:Sex, timestamp=1604988328806, value=Male
1 row(s) in 0.0210 seconds