通过 Sqoop1.4.7 将 Mysql5.7、Hive2.3.4、Hbase1.4.9 之间的数据导入导出

目录

目录

1、什么是 Sqoop?

  Sqoop 是一种用于在 Hadoop 和关系数据库或大型机之间传输数据的工具。
  您可以使用 Sqoop 将数据从关系数据库管理系统RDBMS(如 MySQLOracle)导入 Hadoop 分布式文件系统 HDFS,转换 Hadoop MapReduce 中的数据,然后将数据导出回 RDBMS
  Sqoop 自动执行此过程的大部分过程,依靠数据库来描述要导入的数据的模式。Sqoop 使用 MapReduce 导入和导出数据,提供并行操作和容错。
  

2、下载应用程序及配置环境变量

2.1、下载 Sqoop 1.4.7

  通过以下命令下载 Sqoop,解压后,放到/home/work/_app/ 目录中:

[root@c0 _src]# pwd
/home/work/_src
[root@c0 _src]# wget http://mirrors.shu.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
[root@c0 _src]# tar -xzvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
[root@c0 _src]# mv sqoop-1.4.7.bin__hadoop-2.6.0 /home/work/_app/

  

2.2、设置环境变量

  在每一台机器上设置 Sqoop 环境变量,运行以下命令

echo "" >> /etc/bashrc
echo "# Sqoop 1.4.7" >> /etc/bashrc
echo "export SQOOP_HOME=/home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0" >> /etc/bashrc

echo "" >> /etc/bashrc
echo "# Path" >> /etc/bashrc
echo "export PATH=\$PATH:\$SQOOP_HOME/bin" >> /etc/bashrc
source /etc/bashrc

  

2.3、设置安装所需环境

  安装和运行 Sqoop 需要用到 HiveMySqlHadoop环境。可以参考文章:Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建
  

3、安装 Sqoop 1.4.7

3.1、修改 Sqoop 配置文件

3.1.1、修改配置文件 sqoop-env.sh

   创建 /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/conf/sqoop-env.sh 文件编辑并保存,内容为空,因为我们在本文的配置环境变量章节中已经配置了环境变量,同时也在<Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建>一文中配置了 HiveHadoop 环境变量:

[root@c0 ~]# echo "" > /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/conf/sqoop-env.sh

  
  将 /home/work/_app/hive-2.3.4/lib/ 目录下的 hive-hcatalog-core-2.3.4.jarmysql-connector-java-5.1.47-bin.jarhive-common-2.3.4.jarlibthrift-0.9.3.jar 文件,复制到/home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/lib 目录下

[root@c0 ~]# cp /home/work/_app/hive-2.3.4/lib/hive-hcatalog-core-2.3.4.jar /home/work/_app/hive-2.3.4/lib/mysql-connector-java-5.1.47-bin.jar /home/work/_app/hive-2.3.4/lib/libthrift-0.9.3.jar /home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/lib/

  

3.1.2、修改配置文件 configure-sqoop

  编辑 /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/bin/configure-sqoop 文件并保存,内容如下:

[root@c0 _src]# cat /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/bin/configure-sqoop
#!/bin/bash
#
# Copyright 2011 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This is sourced in by bin/sqoop to set environment variables prior to
# invoking Hadoop.

bin="$1"

if [ -z "${bin}" ]; then
  bin=`dirname $0`
  bin=`cd ${bin} && pwd`
fi

if [ -z "$SQOOP_HOME" ]; then
  export SQOOP_HOME=${bin}/..
fi

SQOOP_CONF_DIR=${SQOOP_CONF_DIR:-${SQOOP_HOME}/conf}

if [ -f "${SQOOP_CONF_DIR}/sqoop-env.sh" ]; then
  . "${SQOOP_CONF_DIR}/sqoop-env.sh"
fi

# Find paths to our dependency systems. If they are unset, use CDH defaults.

if [ -z "${HADOOP_COMMON_HOME}" ]; then
  if [ -n "${HADOOP_HOME}" ]; then
    HADOOP_COMMON_HOME=${HADOOP_HOME}
  else
    if [ -d "/usr/lib/hadoop" ]; then
      HADOOP_COMMON_HOME=/usr/lib/hadoop
    else
      HADOOP_COMMON_HOME=${SQOOP_HOME}/../hadoop
    fi
  fi
fi

if [ -z "${HADOOP_MAPRED_HOME}" ]; then
  HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  if [ ! -d "${HADOOP_MAPRED_HOME}" ]; then
    if [ -n "${HADOOP_HOME}" ]; then
      HADOOP_MAPRED_HOME=${HADOOP_HOME}
    else
      HADOOP_MAPRED_HOME=${SQOOP_HOME}/../hadoop-mapreduce
    fi
  fi
fi

# We are setting HADOOP_HOME to HADOOP_COMMON_HOME if it is not set
# so that hcat script works correctly on BigTop
if [ -z "${HADOOP_HOME}" ]; then
  if [ -n "${HADOOP_COMMON_HOME}" ]; then
     HADOOP_HOME=${HADOOP_COMMON_HOME}
     export HADOOP_HOME
  fi
fi

if [ -z "${HBASE_HOME}" ]; then
  if [ -d "/usr/lib/hbase" ]; then
    HBASE_HOME=/usr/lib/hbase
  else
    HBASE_HOME=${SQOOP_HOME}/../hbase
  fi
fi
#if [ -z "${HCAT_HOME}" ]; then
#  if [ -d "/usr/lib/hive-hcatalog" ]; then
#    HCAT_HOME=/usr/lib/hive-hcatalog
#  elif [ -d "/usr/lib/hcatalog" ]; then
#    HCAT_HOME=/usr/lib/hcatalog
#  else
#    HCAT_HOME=${SQOOP_HOME}/../hive-hcatalog
#    if [ ! -d ${HCAT_HOME} ]; then
#       HCAT_HOME=${SQOOP_HOME}/../hcatalog
#    fi
#  fi
#fi
#if [ -z "${ACCUMULO_HOME}" ]; then
#  if [ -d "/usr/lib/accumulo" ]; then
#    ACCUMULO_HOME=/usr/lib/accumulo
#  else
#    ACCUMULO_HOME=${SQOOP_HOME}/../accumulo
#  fi
#fi
if [ -z "${ZOOKEEPER_HOME}" ]; then
  if [ -d "/usr/lib/zookeeper" ]; then
    ZOOKEEPER_HOME=/usr/lib/zookeeper
  else
    ZOOKEEPER_HOME=${SQOOP_HOME}/../zookeeper
  fi
fi
if [ -z "${HIVE_HOME}" ]; then
  if [ -d "/usr/lib/hive" ]; then
    export HIVE_HOME=/usr/lib/hive
  elif [ -d ${SQOOP_HOME}/../hive ]; then
    export HIVE_HOME=${SQOOP_HOME}/../hive
  fi
fi

# Check: If we can't find our dependencies, give up here.
if [ ! -d "${HADOOP_COMMON_HOME}" ]; then
  echo "Error: $HADOOP_COMMON_HOME does not exist!"
  echo 'Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation.'
  exit 1
fi
if [ ! -d "${HADOOP_MAPRED_HOME}" ]; then
  echo "Error: $HADOOP_MAPRED_HOME does not exist!"
  echo 'Please set $HADOOP_MAPRED_HOME to the root of your Hadoop MapReduce installation.'
  exit 1
fi

## Moved to be a runtime check in sqoop.
if [ ! -d "${HBASE_HOME}" ]; then
  echo "Warning: $HBASE_HOME does not exist! HBase imports will fail."
  echo 'Please set $HBASE_HOME to the root of your HBase installation.'
fi

## Moved to be a runtime check in sqoop.
#if [ ! -d "${HCAT_HOME}" ]; then
#  echo "Warning: $HCAT_HOME does not exist! HCatalog jobs will fail."
#  echo 'Please set $HCAT_HOME to the root of your HCatalog installation.'
#fi

#if [ ! -d "${ACCUMULO_HOME}" ]; then
#  echo "Warning: $ACCUMULO_HOME does not exist! Accumulo imports will fail."
#  echo 'Please set $ACCUMULO_HOME to the root of your Accumulo installation.'
#fi
if [ ! -d "${ZOOKEEPER_HOME}" ]; then
  echo "Warning: $ZOOKEEPER_HOME does not exist! Accumulo imports will fail."
  echo 'Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.'
fi

# Where to find the main Sqoop jar
SQOOP_JAR_DIR=$SQOOP_HOME

# If there's a "build" subdir, override with this, so we use
# the newly-compiled copy.
if [ -d "$SQOOP_JAR_DIR/build" ]; then
  SQOOP_JAR_DIR="${SQOOP_JAR_DIR}/build"
fi

function add_to_classpath() {
  dir=$1
  for f in $dir/*.jar; do
    SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:$f;
  done

  export SQOOP_CLASSPATH
}

# Add sqoop dependencies to classpath.
SQOOP_CLASSPATH=""
if [ -d "$SQOOP_HOME/lib" ]; then
  add_to_classpath $SQOOP_HOME/lib
fi

# Add HBase to dependency list
if [ -e "$HBASE_HOME/bin/hbase" ]; then
  TMP_SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:`$HBASE_HOME/bin/hbase classpath`
  SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}
fi

# Add HCatalog to dependency list
if [ -e "${HCAT_HOME}/bin/hcat" ]; then
  TMP_SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:`${HCAT_HOME}/bin/hcat -classpath`
  if [ -z "${HIVE_CONF_DIR}" ]; then
    TMP_SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}:${HIVE_CONF_DIR}
  fi
  SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}
fi

# Add Accumulo to dependency list
if [ -e "$ACCUMULO_HOME/bin/accumulo" ]; then
  for jn in `$ACCUMULO_HOME/bin/accumulo classpath | grep file:.*accumulo.*jar | cut -d':' -f2`; do
    SQOOP_CLASSPATH=$SQOOP_CLASSPATH:$jn
  done
  for jn in `$ACCUMULO_HOME/bin/accumulo classpath | grep file:.*zookeeper.*jar | cut -d':' -f2`; do
    SQOOP_CLASSPATH=$SQOOP_CLASSPATH:$jn
  done
fi

ZOOCFGDIR=${ZOOCFGDIR:-/etc/zookeeper}
if [ -d "${ZOOCFGDIR}" ]; then
  SQOOP_CLASSPATH=$ZOOCFGDIR:$SQOOP_CLASSPATH
fi

SQOOP_CLASSPATH=${SQOOP_CONF_DIR}:${SQOOP_CLASSPATH}

# If there's a build subdir, use Ivy-retrieved dependencies too.
if [ -d "$SQOOP_HOME/build/ivy/lib/sqoop" ]; then
  for f in $SQOOP_HOME/build/ivy/lib/sqoop/*/*.jar; do
    SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:$f;
  done
fi

add_to_classpath ${SQOOP_JAR_DIR}

HADOOP_CLASSPATH="${SQOOP_CLASSPATH}:${HADOOP_CLASSPATH}"
if [ ! -z "$SQOOP_USER_CLASSPATH" ]; then
  # User has elements to prepend to the classpath, forcibly overriding
  # Sqoop's own lib directories.
  export HADOOP_CLASSPATH="${SQOOP_USER_CLASSPATH}:${HADOOP_CLASSPATH}"
fi

export SQOOP_CLASSPATH
export SQOOP_CONF_DIR
export SQOOP_JAR_DIR
export HADOOP_CLASSPATH
export HADOOP_COMMON_HOME
export HADOOP_MAPRED_HOME
export HBASE_HOME
export HCAT_HOME
export HIVE_CONF_DIR
export ACCUMULO_HOME
export ZOOKEEPER_HOME

  

3.2、查看 Sqoop 版本

[root@c0 _src]# sqoop version
2019-03-11 22:30:16,837 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017

  

4、启动和测试 Sqoop 的数据导入、导出

4.1、Sqoop 通过 Hive 导入数据到 Sqoop

  在 mysql 中创建数据库 testmshk 并授权给 root 用户,同时创建 hive2mysql_mshk

[root@c0 _src]# mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 21
Server version: 5.7.25 MySQL Community Server (GPL)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE DATABASE testmshk DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| mysql              |
| performance_schema |
| sys                |
| testmshk           |
+--------------------+
6 rows in set (0.02 sec)

mysql> grant select,insert,update,delete,create on testmshk.* to root;
Query OK, 0 rows affected (0.01 sec)

mysql> flush  privileges;
Query OK, 0 rows affected (0.01 sec)


mysql> use testmshk;
Database changed

mysql> create table hive2mysql_mshk(id int,namea varchar(50),nameb varchar(50));
Query OK, 0 rows affected (0.02 sec)

mysql> quit;
Bye

  
  通过 Sqoop 查询 Mysql 中表的内容,这时可以看到表中的内容是空的

[root@c0 ~]# sqoop eval --connect  jdbc:mysql://c0:3306/testmshk?useSSL=false  --username root --password 123456 --query "select * from hive2mysql_mshk"

2019-03-11 23:44:06,894 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-11 23:44:06,945 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-11 23:44:07,100 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
-------------------------------------------------------------
| id          | namea                | nameb                |
-------------------------------------------------------------
-------------------------------------------------------------

  
  在<Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建>一文中,我们在测试 Hive 时创建了测试数据 /hive/warehouse/testtable/testdata001.dat 我们将这个数据,导入到 Mysql

[root@c0 ~]# sqoop export --connect  jdbc:mysql://c0:3306/testmshk?useSSL=false  --username root --password 123456 --table hive2mysql_mshk  --export-dir /hive/warehouse/testtable/testdata001.dat  --input-fields-terminated-by ','
2019-03-11 23:47:10,400 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-11 23:47:10,437 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-11 23:47:10,571 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-11 23:47:10,574 INFO tool.CodeGenTool: Beginning code generation
2019-03-11 23:47:10,914 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-11 23:47:10,943 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-11 23:47:10,952 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/work/_app/hadoop-3.1.2
Note: /tmp/sqoop-root/compile/9ea7f54fe87f35ed071ed75c293f25d8/hive2mysql_mshk.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2019-03-11 23:47:12,652 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/9ea7f54fe87f35ed071ed75c293f25d8/hive2mysql_mshk.jar
2019-03-11 23:47:12,669 INFO mapreduce.ExportJobBase: Beginning export of hive2mysql_mshk
2019-03-11 23:47:12,669 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2019-03-11 23:47:12,804 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2019-03-11 23:47:14,106 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
2019-03-11 23:47:14,112 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
2019-03-11 23:47:14,112 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2019-03-11 23:47:14,479 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2019-03-11 23:47:14,808 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1552315366846_0003
2019-03-11 23:47:16,429 INFO input.FileInputFormat: Total input files to process : 1
2019-03-11 23:47:16,432 INFO input.FileInputFormat: Total input files to process : 1
2019-03-11 23:47:16,513 INFO mapreduce.JobSubmitter: number of splits:4
2019-03-11 23:47:16,577 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
2019-03-11 23:47:16,684 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552315366846_0003
2019-03-11 23:47:16,686 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-03-11 23:47:16,924 INFO conf.Configuration: resource-types.xml not found
2019-03-11 23:47:16,924 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-03-11 23:47:17,213 INFO impl.YarnClientImpl: Submitted application application_1552315366846_0003
2019-03-11 23:47:17,261 INFO mapreduce.Job: The url to track the job: http://c1:8088/proxy/application_1552315366846_0003/
2019-03-11 23:47:17,262 INFO mapreduce.Job: Running job: job_1552315366846_0003
2019-03-11 23:47:23,359 INFO mapreduce.Job: Job job_1552315366846_0003 running in uber mode : false
2019-03-11 23:47:23,360 INFO mapreduce.Job:  map 0% reduce 0%
2019-03-11 23:47:31,454 INFO mapreduce.Job:  map 75% reduce 0%
2019-03-11 23:47:32,462 INFO mapreduce.Job:  map 100% reduce 0%
2019-03-11 23:47:32,473 INFO mapreduce.Job: Job job_1552315366846_0003 completed successfully
2019-03-11 23:47:32,619 INFO mapreduce.Job: Counters: 32
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=913424
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=782
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=19
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Launched map tasks=4
        Data-local map tasks=4
        Total time spent by all maps in occupied slots (ms)=23446
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=23446
        Total vcore-milliseconds taken by all map tasks=23446
        Total megabyte-milliseconds taken by all map tasks=24008704
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=636
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=582
        CPU time spent (ms)=3960
        Physical memory (bytes) snapshot=830259200
        Virtual memory (bytes) snapshot=11165683712
        Total committed heap usage (bytes)=454557696
        Peak Map Physical memory (bytes)=208502784
        Peak Map Virtual memory (bytes)=2793611264
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
2019-03-11 23:47:32,626 INFO mapreduce.ExportJobBase: Transferred 782 bytes in 18.5015 seconds (42.2668 bytes/sec)
2019-03-11 23:47:32,629 INFO mapreduce.ExportJobBase: Exported 2 records.

--export-dir 表示在 HDFS 对应的 Hive 数据库文件位置
–input-fields-terminated-by 表示要处理的间隔符

  
  再次通过 Sqoop 查看 MySql 中的内容,可以看到数据已经成功导入

[root@c0 ~]# sqoop eval --connect  jdbc:mysql://c0:3306/testmshk?useSSL=false  --username root --password 123456 --query "select * from hive2mysql_mshk"
2019-03-11 23:48:56,848 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-11 23:48:56,884 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-11 23:48:57,024 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
-------------------------------------------------------------
| id          | namea                | nameb                |
-------------------------------------------------------------
| 10086       | my.mshk.top          | you.mshk.top         |
| 12306       | mname                | yname                |
-------------------------------------------------------------

  在 MySql 中能够看到我们创建的 hive2mysql_mshk 表有2行数据
  

  

4.2、Sqoop 通过 MySql 导入数据到 Hive

  刚刚我们创建的 hive2mysql_mshk 表没有任何主键,我们只是从 Hive 中添加了一些记录到 Mysql
  默认情况下,Sqoop 将识别表中的主键列(如果存在)并将其用作拆分列。
  从数据库中检索拆分列的低值和高值,并且映射任务在总范围的大小均匀的组件上运行。
  如果主键的实际值在其范围内不均匀分布,则可能导致任务不平衡。
  您应该使用 --split-by 参数明确选择不同的列。例如-- split-by id

  
  在将 MySql 的数据导入到 Hive中的 Sqoop 命令添加了更多参数:

sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id  --table hive2mysql_mshk --target-dir /hive/warehouse/mysql2hive_mshk  --fields-terminated-by "," --hive-import --hive-table testmshk.mysql2hive_mshk

--split-by <column-name> 用哪个列来拆分
--table 告诉计算机您要从MySQL导入哪个表
--target-dir <dir> HDFS要存储的目录
--hive-import 将表导入Hive
--hive-overwrite 覆盖Hive表中的现有数据
--hive-table <table-name> 设置导入Hive时要使用的表名
--fields-terminated-by <char> 设置字段分隔符

  接下来 Sqoop 的操作是一个 map-reduce 工作。

[root@c0 _src]# sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id  --table hive2mysql_mshk --target-dir /hive/warehouse/mysql2hive_mshk  --fields-terminated-by "," --hive-import --hive-table testmshk.mysql2hive_mshk --hive-overwrite
2019-03-12 20:21:05,060 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-12 20:21:05,137 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-12 20:21:05,337 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-12 20:21:05,348 INFO tool.CodeGenTool: Beginning code generation
2019-03-12 20:21:05,785 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-12 20:21:05,821 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-12 20:21:05,831 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/work/_app/hadoop-3.1.2
Note: /tmp/sqoop-root/compile/202a5bda3950a7ccc6782f3cfcc3a99d/hive2mysql_mshk.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2019-03-12 20:21:08,747 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/202a5bda3950a7ccc6782f3cfcc3a99d/hive2mysql_mshk.jar
2019-03-12 20:21:08,761 WARN manager.MySQLManager: It looks like you are importing from mysql.
2019-03-12 20:21:08,761 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
2019-03-12 20:21:08,761 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
2019-03-12 20:21:08,762 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
2019-03-12 20:21:08,764 INFO mapreduce.ImportJobBase: Beginning import of hive2mysql_mshk
2019-03-12 20:21:08,765 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2019-03-12 20:21:08,928 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2019-03-12 20:21:09,656 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2019-03-12 20:21:10,332 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2019-03-12 20:21:10,688 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1552315366846_0011
2019-03-12 20:21:12,618 INFO db.DBInputFormat: Using read commited transaction isolation
2019-03-12 20:21:12,619 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `hive2mysql_mshk`
2019-03-12 20:21:12,622 INFO db.IntegerSplitter: Split size: 555; Num splits: 4 from: 10086 to: 12306
2019-03-12 20:21:12,696 INFO mapreduce.JobSubmitter: number of splits:4
2019-03-12 20:21:13,137 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552315366846_0011
2019-03-12 20:21:13,140 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-03-12 20:21:13,443 INFO conf.Configuration: resource-types.xml not found
2019-03-12 20:21:13,443 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-03-12 20:21:13,533 INFO impl.YarnClientImpl: Submitted application application_1552315366846_0011
2019-03-12 20:21:13,593 INFO mapreduce.Job: The url to track the job: http://c1:8088/proxy/application_1552315366846_0011/
2019-03-12 20:21:13,594 INFO mapreduce.Job: Running job: job_1552315366846_0011
2019-03-12 20:21:20,705 INFO mapreduce.Job: Job job_1552315366846_0011 running in uber mode : false
2019-03-12 20:21:20,727 INFO mapreduce.Job:  map 0% reduce 0%
2019-03-12 20:21:29,927 INFO mapreduce.Job:  map 50% reduce 0%
2019-03-12 20:21:29,930 INFO mapreduce.Job: Task Id : attempt_1552315366846_0011_m_000000_0, Status : FAILED
[2019-03-12 20:21:28.236]Container [pid=19941,containerID=container_e15_1552315366846_0011_01_000002] is running 539445760B beyond the 'VIRTUAL' memory limit. Current usage: 199.9 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e15_1552315366846_0011_01_000002 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 20026 19941 19941 19941 (java) 950 81 2678403072 50861 /opt/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx820m -Djava.io.tmpdir=/home/work/_data/hadoop-3.1.2/nm-local-dir/usercache/root/appcache/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.11.32 41274 attempt_1552315366846_0011_m_000000_0 16492674416642
    |- 19941 19939 19941 19941 (bash) 1 2 115900416 307 /bin/bash -c /opt/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN   -Xmx820m -Djava.io.tmpdir=/home/work/_data/hadoop-3.1.2/nm-local-dir/usercache/root/appcache/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.11.32 41274 attempt_1552315366846_0011_m_000000_0 16492674416642 1>/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/stdout 2>/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/stderr

[2019-03-12 20:21:28.324]Container killed on request. Exit code is 143
[2019-03-12 20:21:28.335]Container exited with a non-zero exit code 143.

2019-03-12 20:21:30,978 INFO mapreduce.Job:  map 75% reduce 0%
2019-03-12 20:21:37,021 INFO mapreduce.Job:  map 100% reduce 0%
2019-03-12 20:21:37,032 INFO mapreduce.Job: Job job_1552315366846_0011 completed successfully
2019-03-12 20:21:37,145 INFO mapreduce.Job: Counters: 33
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=915840
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=425
        HDFS: Number of bytes written=49
        HDFS: Number of read operations=24
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
    Job Counters
        Failed map tasks=1
        Launched map tasks=5
        Other local map tasks=5
        Total time spent by all maps in occupied slots (ms)=31981
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=31981
        Total vcore-milliseconds taken by all map tasks=31981
        Total megabyte-milliseconds taken by all map tasks=32748544
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=425
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=318
        CPU time spent (ms)=6520
        Physical memory (bytes) snapshot=815542272
        Virtual memory (bytes) snapshot=11174408192
        Total committed heap usage (bytes)=437780480
        Peak Map Physical memory (bytes)=206934016
        Peak Map Virtual memory (bytes)=2795565056
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=49
2019-03-12 20:21:37,154 INFO mapreduce.ImportJobBase: Transferred 49 bytes in 27.4776 seconds (1.7833 bytes/sec)
2019-03-12 20:21:37,159 INFO mapreduce.ImportJobBase: Retrieved 2 records.
2019-03-12 20:21:37,159 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table hive2mysql_mshk
2019-03-12 20:21:37,188 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-12 20:21:37,208 INFO hive.HiveImport: Loading uploaded data into Hive
2019-03-12 20:21:37,220 INFO conf.HiveConf: Found configuration file file:/home/work/_app/hive-2.3.4/conf/hive-site.xml
2019-03-12 20:21:49,491 INFO hive.HiveImport:
2019-03-12 20:21:49,492 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
2019-03-12 20:21:56,558 INFO hive.HiveImport: OK
2019-03-12 20:21:56,561 INFO hive.HiveImport: Time taken: 5.954 seconds
2019-03-12 20:21:57,005 INFO hive.HiveImport: Loading data to table testmshk.mysql2hive_mshk
2019-03-12 20:21:58,181 INFO hive.HiveImport: OK
2019-03-12 20:21:58,181 INFO hive.HiveImport: Time taken: 1.619 seconds
2019-03-12 20:21:58,681 INFO hive.HiveImport: Hive import complete.

    
  最后,让我们验证 Hive 中的输出:

[root@c0 ~]# hive

Logging initialized using configuration in jar:file:/home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
testmshk
Time taken: 3.512 seconds, Fetched: 2 row(s)
hive> use testmshk;
OK
Time taken: 0.032 seconds
hive> show tables;
OK
mysql2hive_mshk
testtable
Time taken: 0.083 seconds, Fetched: 2 row(s)
hive> select * from mysql2hive_mshk;
OK
10086   my.mshk.top you.mshk.top
12306   mname   yname
Time taken: 1.634 seconds, Fetched: 2 row(s)
hive> quit;

  
  同时我们在 HDFS 中也可以看到创建的数据:
  
  

4.3、Sqoop 通过 MySql 导入数据到 Hbase

  接下来我们将 MySql 中的表 hive2mysql_mshk 数据,导入到 Hbase ,同时在 Hbase 中创建表 mysql2hase_mshk

[root@c0 ~]# sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id --table hive2mysql_mshk --hbase-table mysql2hase_mshk --hbase-create-table --hbase-row-key id --column-family id
2019-03-13 12:04:33,647 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-13 12:04:33,694 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-13 12:04:33,841 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-13 12:04:33,841 INFO tool.CodeGenTool: Beginning code generation
2019-03-13 12:04:34,162 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-13 12:04:34,197 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
...
2019-03-13 12:05:13,782 INFO mapreduce.Job:  map 75% reduce 0%
2019-03-13 12:05:15,813 INFO mapreduce.Job:  map 100% reduce 0%
2019-03-13 12:05:16,827 INFO mapreduce.Job: Job job_1552397454797_0002 completed successfully
2019-03-13 12:05:16,942 INFO mapreduce.Job: Counters: 33
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=1041632
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=425
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Failed map tasks=5
        Launched map tasks=9
        Other local map tasks=9
        Total time spent by all maps in occupied slots (ms)=68882
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=68882
        Total vcore-milliseconds taken by all map tasks=68882
        Total megabyte-milliseconds taken by all map tasks=70535168
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=425
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=801
        CPU time spent (ms)=15480
        Physical memory (bytes) snapshot=1097326592
        Virtual memory (bytes) snapshot=11271196672
        Total committed heap usage (bytes)=629669888
        Peak Map Physical memory (bytes)=295751680
        Peak Map Virtual memory (bytes)=2828283904
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
2019-03-13 12:05:16,949 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 38.224 seconds (0 bytes/sec)
2019-03-13 12:05:16,954 INFO mapreduce.ImportJobBase: Retrieved 2 records.

  
  这时,我们再用 shell 测试连接 Hbase ,查看我们刚刚导入的数据,能够看到 mysql2hase_mshk 已经存在,并且可以获取其中的数据

[root@c0 ~]# hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
Version 1.4.9, rd625b212e46d01cb17db9ac2e9e927fdb201afa1, Wed Dec  5 11:54:10 PST 2018

hbase(main):001:0> list
TABLE
mysql2hase_mshk
1 row(s) in 0.1870 seconds

=> ["mysql2hase_mshk"]
hbase(main):002:0> scan 'mysql2hase_mshk'
ROW                                                        COLUMN+CELL
 10086                                                     column=id:namea, timestamp=1552449912494, value=my.mshk.top
 10086                                                     column=id:nameb, timestamp=1552449912494, value=you.mshk.top
 12306                                                     column=id:namea, timestamp=1552449906986, value=mname
 12306                                                     column=id:nameb, timestamp=1552449906986, value=yname
2 row(s) in 0.1330 seconds

hbase(main):003:0> hbase(main):003:0> get 'mysql2hase_mshk','10086'
COLUMN                                                     CELL
 id:namea                                                  timestamp=1552449912494, value=my.mshk.top
 id:nameb                                                  timestamp=1552449912494, value=you.mshk.top
1 row(s) in 0.0230 seconds

hbase(main):004:0>

  
  如何在 HbaseHive 中互相导入、导出数据,请参考文章:Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建 中的 9.2.49.2.5 章节。
  
  希望您发现它很有用,感谢您的支持和阅读我的博客。


博文作者:迦壹
博客地址:通过 Sqoop1.4.7 将 Mysql5.7、Hive2.3.4、Hbase1.4.9 之间的数据导入导出
转载声明:可以转载, 但必须以超链接形式标明文章原始出处和作者信息及版权声明,谢谢合作!
  
假设您认为这篇文章对您有帮助,可以通过以下方式进行捐赠,谢谢!

比特币地址:1KdgydfKMcFVpicj5w4vyn3T88dwjBst6Y
以太坊地址:0xbB0a92d634D7b9Ac69079ed0e521CC2e0a97c420


上一篇:Sqoop


下一篇:Sqoop安装及基本使用