Bidirectional Data Migration Between Apsara File Storage for HDFS and Object Storage Service (OSS)

1. Purpose

This document describes how to migrate data between Apsara File Storage for HDFS and Object Storage Service (OSS). You can migrate data from Apsara File Storage for HDFS to OSS, or from OSS to Apsara File Storage for HDFS.

2. Background

Apsara File Storage for HDFS is a file storage service for compute resources such as Alibaba Cloud ECS instances and container services. It lets you manage and access data just as you would in a Hadoop distributed file system, and provides high-performance access to hot data. OSS is a massive, secure, low-cost, and highly reliable cloud storage service that offers multiple storage classes, such as Standard and Archive. By migrating data between Apsara File Storage for HDFS and OSS, you can tier hot, warm, and cold data sensibly, keeping high-performance access to hot data while effectively controlling storage costs.

3. Preparations

  1. Activate the Apsara File Storage for HDFS service and create a file system instance and a mount point. For details, see Quick Start.
  2. A Hadoop version no earlier than 2.7.2 is recommended. This document uses Apache Hadoop 2.7.2.
  3. Install a JDK (version 1.8 or later) on all nodes of the Hadoop cluster.
  4. Configure the Apsara File Storage for HDFS instance in the Hadoop cluster. For details, see Mount a file system.
  5. Install the OSS client, the JindoFS SDK, on the Hadoop cluster. For a detailed introduction to the JindoFS SDK and its configuration, see JindoFS SDK. The steps used in this document are as follows:

a. Download the latest jar package jindofs-sdk-x.x.x.jar (download page) and install the SDK package into Hadoop's classpath:

cp ./jindofs-sdk-*.jar  ${HADOOP_HOME}/share/hadoop/hdfs/lib/jindofs-sdk.jar

b. Configure the JindoFS OSS implementation classes

Add the JindoFS OSS implementation classes to Hadoop's core-site.xml:

<configuration>
    <property>
        <name>fs.AbstractFileSystem.oss.impl</name>
        <value>com.aliyun.emr.fs.oss.OSS</value>
    </property>

    <property>
        <name>fs.oss.impl</name>
        <value>com.aliyun.emr.fs.oss.JindoOssFileSystem</value>
    </property>
</configuration>

c. Configure the OSS AccessKey

Pre-configure the OSS AccessKey ID, AccessKey Secret, Endpoint, and related settings in Hadoop's core-site.xml:

<configuration>
    <property>
        <name>fs.jfs.cache.oss.accessKeyId</name>
        <value>xxx</value>
    </property>

    <property>
        <name>fs.jfs.cache.oss.accessKeySecret</name>
        <value>xxx</value>
    </property>

    <property>
        <name>fs.jfs.cache.oss.endpoint</name>
        <!-- On ECS, the internal OSS endpoint is recommended, i.e. oss-cn-xxx-internal.aliyuncs.com -->
        <value>oss-cn-xxx.aliyuncs.com</value>
    </property>
</configuration>

d. Use the OSS client on the Hadoop cluster:

${HADOOP_HOME}/bin/hadoop fs -ls oss://<bucket>/<path>

4. Migrating Data from Apsara File Storage for HDFS to OSS

1. Generate 1 TiB of test data on the Apsara File Storage for HDFS instance.

${HADOOP_HOME}/bin/hadoop jar \
${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
randomtextwriter \
-D mapreduce.randomtextwriter.totalbytes=1099511627776 \
-D mapreduce.randomtextwriter.bytespermap=5368709120 \
dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/dfs2oss/data/data_1t/
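The two -D parameters above control the job size. A quick arithmetic sketch (pure arithmetic, no cluster needed; the numbers are the ones from the command above) shows how they relate:

```shell
# Sizing arithmetic for the randomtextwriter job above (runs anywhere).
TOTAL=1099511627776    # mapreduce.randomtextwriter.totalbytes = 1 TiB (1024^4 bytes)
PER_MAP=5368709120     # mapreduce.randomtextwriter.bytespermap = 5 GiB (5 * 1024^3 bytes)
# Each map task writes bytespermap bytes, so the job launches roughly
# totalbytes / bytespermap map tasks, each producing one output file.
echo "approx maps: $(( TOTAL / PER_MAP ))"   # prints "approx maps: 204"
```

The job rounds the remainder up into an extra map task, so the actual map count can be one higher.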

View the generated test data.


2. Start a Hadoop MapReduce job (DistCp) to migrate the test data to OSS.

For detailed instructions on using the DistCp tool, see the official Hadoop DistCp documentation.

${HADOOP_HOME}/bin/hadoop distcp \
dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/dfs2oss \
oss://data-migrate-test/dfs2oss/
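The DistCp command above runs with default settings. For large or repeated migrations, standard DistCp flags such as -update (incremental copy), -m (map-task cap), and -bandwidth (per-map throughput limit) are often useful. The sketch below only assembles and prints such a command, using the same placeholder paths as above, so it can be inspected before being run on the cluster:

```shell
# A tuned DistCp invocation (all flags are standard Hadoop DistCp options):
#   -update         copy only files missing or changed at the destination
#   -m 50           cap the number of concurrent map tasks at 50
#   -bandwidth 100  limit each map task to about 100 MB/s
CMD='${HADOOP_HOME}/bin/hadoop distcp -update -m 50 -bandwidth 100 \
dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/dfs2oss \
oss://data-migrate-test/dfs2oss/'
echo "$CMD"
```

Running with -update also makes re-runs safe: already-copied files are skipped rather than copied again.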

3. After the job completes, check the migration result.

If the output contains information similar to the following, the migration succeeded.

21/11/11 16:17:55 INFO mapreduce.Job: Job job_1636613902785_0001 completed successfully
21/11/11 16:17:55 INFO mapreduce.Job: Counters: 38
        File System Counters
                DFS: Number of bytes read=1124279173244
                DFS: Number of bytes written=0
                DFS: Number of read operations=742
                DFS: Number of large read operations=0
                DFS: Number of write operations=42
                FILE: Number of bytes read=0
                FILE: Number of bytes written=2596976
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                OSS: Number of bytes read=0
                OSS: Number of bytes written=1124279123564
                OSS: Number of read operations=0
                OSS: Number of large read operations=0
                OSS: Number of write operations=0
        Job Counters
                Launched map tasks=21
                Other local map tasks=21
                Total time spent by all maps in occupied slots (ms)=1930874048
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=60339814
                Total vcore-milliseconds taken by all map tasks=60339814
                Total megabyte-milliseconds taken by all map tasks=61787969536
        Map-Reduce Framework
                Map input records=206
                Map output records=0
                Input split bytes=2814
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=131013
                CPU time spent (ms)=10430520
                Physical memory (bytes) snapshot=9110585344
                Virtual memory (bytes) snapshot=83361333248
                Total committed heap usage (bytes)=15489564672
        File Input Format Counters
                Bytes Read=46866
        File Output Format Counters
                Bytes Written=0
        org.apache.hadoop.tools.mapred.CopyMapper$Counter
                BYTESCOPIED=1124279123564
                BYTESEXPECTED=1124279123564
                COPY=206

4. Verify the migration result.

Check the size of the migrated test data on OSS:

${HADOOP_HOME}/bin/hadoop fs -du -s oss://data-migrate-test/dfs2oss/
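A simple follow-up check is to compare the totals reported on the two sides. A minimal sketch, assuming the two `hadoop fs -du -s` results have been captured into shell variables (the numbers below are the BYTESCOPIED/BYTESEXPECTED values from the job counters above; small differences can occur if marker files such as _SUCCESS exist on one side only):

```shell
# Compare source and destination totals after migration.
# SRC_BYTES / DST_BYTES are assumed to come from `hadoop fs -du -s` on each side;
# here they are filled in with the byte counts shown in the job counters above.
SRC_BYTES=1124279123564    # Apsara File Storage for HDFS side
DST_BYTES=1124279123564    # OSS side
if [ "$SRC_BYTES" -eq "$DST_BYTES" ]; then
    echo "sizes match"
else
    echo "size mismatch: $SRC_BYTES vs $DST_BYTES"
fi
```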


5. Migrating Data from OSS to Apsara File Storage for HDFS

1. Generate 1 TiB of test data on OSS.

${HADOOP_HOME}/bin/hadoop jar \
${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
randomtextwriter \
-D mapreduce.randomtextwriter.totalbytes=1099511627776 \
-D mapreduce.randomtextwriter.bytespermap=5368709120 \
oss://data-migrate-test/oss2dfs/data/data_1t/


2. Start a Hadoop MapReduce job (DistCp) to migrate the test data to Apsara File Storage for HDFS.

${HADOOP_HOME}/bin/hadoop distcp \
oss://data-migrate-test/oss2dfs/data/data_1t \
dfs://f-xxxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/oss2dfs

3. After the job completes, check the migration result.

If the output contains information similar to the following, the migration succeeded.

21/11/11 10:37:18 INFO mapreduce.Job: Job job_1636535499203_0007 completed successfully
21/11/11 10:37:18 INFO mapreduce.Job: Counters: 38
        File System Counters
                DFS: Number of bytes read=37907
                DFS: Number of bytes written=1124279506814
                DFS: Number of read operations=1397
                DFS: Number of large read operations=0
                DFS: Number of write operations=453
                FILE: Number of bytes read=0
                FILE: Number of bytes written=2598446
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                OSS: Number of bytes read=1124279506814
                OSS: Number of bytes written=0
                OSS: Number of read operations=0
                OSS: Number of large read operations=0
                OSS: Number of write operations=0
        Job Counters
                Launched map tasks=21
                Other local map tasks=21
                Total time spent by all maps in occupied slots (ms)=783323456
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=24478858
                Total vcore-milliseconds taken by all map tasks=24478858
                Total megabyte-milliseconds taken by all map tasks=25066350592
        Map-Reduce Framework
                Map input records=206
                Map output records=0
                Input split bytes=2793
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=112771
                CPU time spent (ms)=9003330
                Physical memory (bytes) snapshot=7519907840
                Virtual memory (bytes) snapshot=83900002304
                Total committed heap usage (bytes)=14357626880
        File Input Format Counters
                Bytes Read=35114
        File Output Format Counters
                Bytes Written=0
        org.apache.hadoop.tools.mapred.CopyMapper$Counter
                BYTESCOPIED=1124279506814
                BYTESEXPECTED=1124279506814
                COPY=206

4. Verify the migration result.

Check the size of the migrated test data on Apsara File Storage for HDFS:

${HADOOP_HOME}/bin/hadoop fs -du -s dfs://f-xxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/oss2dfs


6. FAQ

  • For a file that is still being written to, will migration miss the most recently written data?

Hadoop-compatible file systems provide single-writer, multiple-reader concurrency semantics: for a given file, one writer can write while multiple readers read at the same time. Take migration from Apsara File Storage for HDFS to OSS as an example. The migration task opens file F, determines its length L based on the current system state, and migrates L bytes to OSS. If a concurrent writer keeps writing during the migration, the length of F will grow beyond L, but the migration task cannot see the newly written data. Therefore, avoid writing to files while they are being migrated.


To learn more about Apsara File Storage for HDFS, visit https://www.aliyun.com/product/alidfs

If you have any questions about Apsara File Storage for HDFS, you are welcome to join the Apsara File Storage for HDFS DingTalk technical discussion group.
