Installing Impala on an E-MapReduce Cluster with a Bootstrap Action

The latest EMR release, 2.0.1, does not include an Impala component, so it must be installed separately. This article walks through the complete process on EMR 2.0.1: modifying the HDFS configuration with the E-MapReduce software configuration feature, installing Impala 2.5.0 for CDH 5.7.1 with a bootstrap action, and starting Impala with a shell job.

Software configuration

Impala places requirements on the HDFS configuration, so the software configuration feature is needed to modify it. This feature can change the configuration of Hadoop components; see the help documentation for details.

Create an hdfs.json file locally (it can also be downloaded directly from OSS) with the content below, and upload it via the OSS console to a suitable OSS location, for example [yourbucket]/sh/hdfs.json.

{
    "configurations": [
        {
            "classification": "hdfs-site",
            "properties": {
                "dfs.client.read.shortcircuit": "true",
                "dfs.domain.socket.path": "/var/run/hadoop-hdfs/dn._PORT",
                "dfs.client.file-block-storage-locations.timeout": "10000",
                "dfs.datanode.hdfs-blocks-metadata.enabled": "true"
            }
        }
    ]
}

Following the help documentation, click Software Configuration when creating the cluster and select the hdfs.json file you just uploaded. During cluster creation, the new settings are added to hdfs-site.xml before HDFS starts.
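To confirm the settings actually landed, you can inspect hdfs-site.xml on the master once the cluster is up. A minimal sketch, assuming $HADOOP_CONF_DIR is exported by /etc/profile.d/hadoop.sh as in the install script below:

source /etc/profile.d/hadoop.sh
# Each property added via software configuration should appear here
grep -A1 "dfs.client.read.shortcircuit\|dfs.domain.socket.path" $HADOOP_CONF_DIR/hdfs-site.xml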

Install and configure Impala

A bootstrap action can run a specified script during cluster creation; see the help documentation for details. Here, a bootstrap action is used to install and configure Impala.

Note: starting Impala requires the Hadoop components to already be running, but bootstrap actions execute before the Hadoop components start. The bootstrap action therefore only installs and configures Impala; the processes are started later by a shell job.

Create an installimpala.sh file locally. For clusters in the Hangzhou region, the Hangzhou-region install script can be downloaded directly from OSS; in the Beijing region, use the Beijing-region script for classic-network clusters or the Beijing-region VPC script for VPC clusters. Upload it via the OSS console to a suitable OSS location, for example [yourbucket]/sh/installimpala.sh.

The script downloads resources, installs Impala, replaces jars, and modifies the configuration. The sections below walk through the script module by module.

Download resources

#!/bin/sh

echo "download resources"
wget http://emr-agent-pack.oss-cn-hangzhou-internal.aliyuncs.com/bootstrap/impala/impala-2.5.0/bigtop-utils-0.7.0%2Bcdh5.7.1%2B0-1.cdh5.7.1.p0.13.el6.noarch.rpm
wget http://emr-agent-pack.oss-cn-hangzhou-internal.aliyuncs.com/bootstrap/impala/impala-2.5.0/impala-2.5.0%2Bcdh5.7.1%2B0-1.cdh5.7.1.p0.14.el6.x86_64.rpm
wget http://emr-agent-pack.oss-cn-hangzhou-internal.aliyuncs.com/bootstrap/impala/impala-2.5.0/impala-catalog-2.5.0%2Bcdh5.7.1%2B0-1.cdh5.7.1.p0.14.el6.x86_64.rpm
wget http://emr-agent-pack.oss-cn-hangzhou-internal.aliyuncs.com/bootstrap/impala/impala-2.5.0/impala-server-2.5.0%2Bcdh5.7.1%2B0-1.cdh5.7.1.p0.14.el6.x86_64.rpm
wget http://emr-agent-pack.oss-cn-hangzhou-internal.aliyuncs.com/bootstrap/impala/impala-2.5.0/impala-shell-2.5.0%2Bcdh5.7.1%2B0-1.cdh5.7.1.p0.14.el6.x86_64.rpm
wget http://emr-agent-pack.oss-cn-hangzhou-internal.aliyuncs.com/bootstrap/impala/impala-2.5.0/impala-state-store-2.5.0%2Bcdh5.7.1%2B0-1.cdh5.7.1.p0.14.el6.x86_64.rpm
wget http://emr-agent-pack.oss-cn-hangzhou-internal.aliyuncs.com/bootstrap/impala/impala-2.5.0/impala-udf-devel-2.5.0%2Bcdh5.7.1%2B0-1.cdh5.7.1.p0.14.el6.x86_64.rpm
wget http://emr-agent-pack.oss-cn-hangzhou-internal.aliyuncs.com/bootstrap/impala/impala-2.5.0/sentryjar1.5.1.tar.gz
tar -zxvf sentryjar1.5.1.tar.gz -C /tmp/
wget http://emr-agent-pack.oss-cn-hangzhou-internal.aliyuncs.com/bootstrap/impala/impala-2.5.0/cdh-5.7.1.tar.gz
tar -zxvf cdh-5.7.1.tar.gz -C /tmp/

This downloads the required source files from OSS.
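wget does not abort the script when a download fails, so a missing rpm would only surface later during installation. An optional hardening sketch (not part of the original script):

# Optional: fail fast if any downloaded file is missing or empty
for f in *.rpm *.tar.gz; do
    [ -s "$f" ] || { echo "download failed: $f"; exit 1; }
done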

Install dependencies

echo "install yum deps"
yum install -y libxcb cdparanoia-libs cups foomatic  foomatic-db foomatic-db-filesystem foomatic-db-ppds ghostscript ghostscript-fonts gstreamer gstreamer-plugins-base  gstreamer-tools iso-codes lcms-libs libXt libXv libXxf86vm libgudev1 libmng liboil libtheora libvisual mesa-dri-drivers mesa-dri-filesystem  mesa-dri1-drivers  mesa-libGL mesa-libGLU mesa-private-llvm openjpeg-libs phonon-backend-gstreamer poppler poppler-data poppler-utils portreserve qt qt-sqlite qt-x11 qt3 redhat-lsb redhat-lsb-compat redhat-lsb-graphics redhat-lsb-printing urw-fonts xml-common nc

This installs the base libraries that Impala depends on.
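Since yum runs with -y, a package that fails to resolve will not stop the script. A quick spot check (probing two of the packages listed above) can catch this early:

# Optional: verify a couple of the dependencies actually installed
rpm -q redhat-lsb nc >/dev/null || { echo "yum dependencies missing"; exit 1; }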

Install Impala

echo "install impala"
rpm -ivh bigtop-utils-0.7.0+cdh5.7.1+0-1.cdh5.7.1.p0.13.el6.noarch.rpm
rpm -ivh impala-2.5.0+cdh5.7.1+0-1.cdh5.7.1.p0.14.el6.x86_64.rpm --nodeps
rpm -ivh impala-server-2.5.0+cdh5.7.1+0-1.cdh5.7.1.p0.14.el6.x86_64.rpm
rpm -ivh impala-state-store-2.5.0+cdh5.7.1+0-1.cdh5.7.1.p0.14.el6.x86_64.rpm
rpm -ivh impala-catalog-2.5.0+cdh5.7.1+0-1.cdh5.7.1.p0.14.el6.x86_64.rpm
rpm -ivh impala-udf-devel-2.5.0+cdh5.7.1+0-1.cdh5.7.1.p0.14.el6.x86_64.rpm
rpm -ivh impala-shell-2.5.0+cdh5.7.1+0-1.cdh5.7.1.p0.14.el6.x86_64.rpm

This installs the Impala components. The other CDH components Impala depends on, such as CDH Hadoop, are already present in the cluster and do not need to be installed.
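To confirm the packages registered cleanly, you can query the rpm database after installation:

# All six impala packages plus bigtop-utils should be listed
rpm -qa | grep -E "^impala|^bigtop-utils"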

Replace library files

echo "replace jar"
source /etc/profile.d/hadoop.sh
alias cp='cp'
cd /usr/lib/impala/lib

#hadoop so
rm -f libhadoop.so
rm -rf libhadoop.so.1.0.0
cp $HADOOP_HOME/lib/native/libhadoop.so* ./
rm -f libhdfs.so
rm -rf libhdfs.so.0.0.0
cp $HADOOP_HOME/lib/native/libhdfs.so* ./

#avro and parquet
rm -f avro.jar
cp -f  $HIVE_HOME/lib/avro-1.7.7.jar avro.jar
rm -f parquet-hadoop-bundle.jar
cp -f  $HIVE_HOME/lib/parquet-hadoop-bundle-1.8.1.jar parquet-hadoop-bundle.jar

#hadoop
rm -f hadoop-annotations.jar
cp -f  $HADOOP_HOME/share/hadoop/common/lib/hadoop-annotations-2.7.2.jar hadoop-annotations.jar
rm -f hadoop-auth.jar
cp -f  $HADOOP_HOME/share/hadoop/common/lib/hadoop-auth-2.7.2.jar hadoop-auth.jar
rm -f hadoop-common.jar
cp -f  $HADOOP_HOME/share/hadoop/common/hadoop-common-2.7.2.jar hadoop-common.jar
rm -f hadoop-hdfs.jar
cp -f  $HADOOP_HOME/share/hadoop/hdfs/hadoop-hdfs-2.7.2.jar hadoop-hdfs.jar
rm -f zookeeper.jar
cp -f  $HADOOP_HOME/share/hadoop/common/lib/zookeeper-3.4.6.jar zookeeper.jar

#mapreduce
rm -f hadoop-mapreduce-client-core.jar
cp -f  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.2.jar hadoop-mapreduce-client-core.jar
rm -f hadoop-mapreduce-client-jobclient.jar
cp -f  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.2.jar hadoop-mapreduce-client-jobclient.jar
rm -f hadoop-mapreduce-client-shuffle.jar
cp -f  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-2.7.2.jar hadoop-mapreduce-client-shuffle.jar

#yarn
rm -f hadoop-yarn-api.jar
cp -f  $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-api-2.7.2.jar hadoop-yarn-api.jar
rm -f hadoop-yarn-client.jar
cp -f  $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-client-2.7.2.jar hadoop-yarn-client.jar
rm -f hadoop-yarn-common.jar
cp -f  $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-common-2.7.2.jar hadoop-yarn-common.jar
rm -f hadoop-yarn-server-applicationhistoryservice.jar
cp -f  $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-server-applicationhistoryservice-2.7.2.jar hadoop-yarn-server-applicationhistoryservice.jar
rm -f hadoop-yarn-server-common.jar
cp -f  $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-server-common-2.7.2.jar hadoop-yarn-server-common.jar
rm -f hadoop-yarn-server-nodemanager.jar
cp -f  $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.7.2.jar hadoop-yarn-server-nodemanager.jar
rm -f hadoop-yarn-server-resourcemanager.jar
cp -f  $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-2.7.2.jar hadoop-yarn-server-resourcemanager.jar
rm -f hadoop-yarn-server-web-proxy.jar
cp -f  $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.7.2.jar hadoop-yarn-server-web-proxy.jar

#hbase
rm -f hbase-annotations.jar
cp -f /usr/lib/hbase-current/lib/hbase-annotations-1.1.1.jar hbase-annotations.jar
rm -f hbase-common.jar
cp -f /usr/lib/hbase-current/lib/hbase-common-1.1.1.jar hbase-common.jar
rm -f hbase-protocol.jar
cp -f /usr/lib/hbase-current/lib/hbase-protocol-1.1.1.jar hbase-protocol.jar

#sentry
rm -f sentry-binding-hive.jar
cp -f /tmp/sentry-binding-hive-1.5.1-incubating.jar sentry-binding-hive.jar
rm -f sentry-core-common.jar
cp -f /tmp/sentry-core-common-1.5.1-incubating.jar sentry-core-common.jar
rm -f sentry-core-model-db.jar
cp -f /tmp/sentry-core-model-db-1.5.1-incubating.jar sentry-core-model-db.jar
rm -f sentry-core-model-search.jar
cp -f /tmp/sentry-core-model-search-1.5.1-incubating.jar sentry-core-model-search.jar
rm -f sentry-policy-common.jar
cp -f /tmp/sentry-policy-common-1.5.1-incubating.jar sentry-policy-common.jar
rm -f sentry-policy-db.jar
cp -f /tmp/sentry-policy-db-1.5.1-incubating.jar sentry-policy-db.jar
rm -f sentry-policy-search.jar
cp -f /tmp/sentry-policy-search-1.5.1-incubating.jar sentry-policy-search.jar
rm -f sentry-provider-cache.jar
cp -f /tmp/sentry-provider-cache-1.5.1-incubating.jar sentry-provider-cache.jar
rm -f sentry-provider-common.jar
cp -f /tmp/sentry-provider-common-1.5.1-incubating.jar sentry-provider-common.jar
rm -f sentry-provider-db-sh.jar
cp -f /tmp/sentry-provider-db-1.5.1-incubating.jar sentry-provider-db-sh.jar
rm -f sentry-provider-file.jar
cp -f /tmp/sentry-provider-file-1.5.1-incubating.jar sentry-provider-file.jar

#other cdh jar
rm -f hadoop-mapreduce-client-common.jar
rm -f hbase-client.jar
rm -f hive-metastore.jar
rm -f hive-ant.jar
rm -f hive-beeline.jar
rm -f hive-common.jar
rm -f hive-exec.jar
rm -f hive-hbase-handler.jar
rm -f hive-serde.jar
rm -f hive-service.jar
rm -f hive-shims-common.jar
rm -f hive-shims.jar
rm -f hive-shims-scheduler.jar

cp -f /tmp/hadoop-mapreduce-client-common-2.6.0-cdh5.7.1.jar hadoop-mapreduce-client-common.jar
cp -f /tmp/hbase-client-1.2.0-cdh5.7.1.jar hbase-client.jar
cp -f /tmp/hive-common-1.1.0-cdh5.7.1.jar hive-common.jar
cp -f /tmp/hive-metastore-1.1.0-cdh5.7.1.jar hive-metastore.jar
cp -f  /tmp/hive-shims-common-1.1.0-cdh5.7.1.jar hive-shims-common.jar
cp -f /tmp/hive-shims-1.1.0-cdh5.7.1.jar hive-shims.jar
cp -f  /tmp/hive-shims-scheduler-1.1.0-cdh5.7.1.jar hive-shims-scheduler.jar
cp -f /tmp/hive-exec-1.1.0-cdh5.7.1.jar  hive-exec.jar
cp -f /tmp/hive-ant-1.1.0-cdh5.7.1.jar hive-ant.jar
cp -f /tmp/hive-beeline-1.1.0-cdh5.7.1.jar hive-beeline.jar
cp -f /tmp/hive-hbase-handler-1.1.0-cdh5.7.1.jar hive-hbase-handler.jar
cp -f /tmp/hive-serde-1.1.0-cdh5.7.1.jar hive-serde.jar
cp -f /tmp/hive-service-1.1.0-cdh5.7.1.jar hive-service.jar

After installation, the jar paths Impala depends on are all symlinks into other CDH components and must be replaced with real jar locations. Most can reuse jars already on the image, such as avro.jar. Some dependencies do not exist in the environment and were fetched to the local /tmp directory in the download step, such as sentry-binding-hive.jar. For some jars, the CDH version conflicts with the image's Apache Hadoop 2.7.2, so the CDH jar (also downloaded to /tmp) must be used, such as hadoop-mapreduce-client-common.jar.
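Before replacing anything, you can list which of the shipped symlinks are dangling, i.e. point at CDH paths that do not exist on the EMR image; these are exactly the files that need real jars. A sketch using GNU find's -xtype test:

# List symlinks in the Impala lib dir whose targets do not exist
find /usr/lib/impala/lib -xtype l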

Modify the Impala configuration

echo "copy hadoop conf"
cp $HADOOP_CONF_DIR/core-site.xml /etc/impala/conf/
cp $HADOOP_CONF_DIR/hdfs-site.xml /etc/impala/conf/
cp $HIVE_CONF_DIR/hive-site.xml /etc/impala/conf/

echo "mkdir hdfs socket path"
mkdir /var/run/hadoop-hdfs/
chown hdfs:hadoop  /var/run/hadoop-hdfs/

echo "modify impala config"
masterIp=`cat /etc/hosts | grep emr-header-1|awk '{print $1}'`
echo "masterip: $masterIp"
sed -i "s/127.0.0.1/$masterIp/g" /etc/default/impala

sed -i "s/Defaults    requiretty/# Defaults    requiretty/g" /etc/sudoers 

This copies the Hadoop configuration files Impala needs into Impala's configuration directory, creates the HDFS socket path shared by HDFS and Impala, and edits Impala's configuration so that the catalog and statestore addresses point at the master's internal IP. To avoid the sudo error "sudo: sorry, you must have a tty to run sudo" when the shell job starts Impala, the tty check in /etc/sudoers is commented out.
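After the sed runs, you can confirm /etc/default/impala now points at the master's internal IP. A quick check, assuming the host variables use the names from the CDH packaging of /etc/default/impala:

# Both hosts should show the master's internal IP, not 127.0.0.1
grep -E "IMPALA_STATE_STORE_HOST|IMPALA_CATALOG_SERVICE_HOST" /etc/default/impala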

Create the cluster

Following the help documentation, click Add Bootstrap Action when creating the cluster, select the installimpala.sh script you just uploaded, and create a bootstrap action step. After the cluster is up, confirm the bootstrap action succeeded via "Bootstrap/Software Configuration: no exceptions" on the cluster detail page.

Start Impala

Because starting Impala requires the Hadoop components to already be running, a shell job is used to start the Impala processes. A shell job can execute a script on the master; see the help documentation for details.

Create a startimpala.sh file locally (it can also be downloaded directly from OSS) with the content below, and upload it via the OSS console to a suitable OSS location, for example [yourbucket]/sh/startimpala.sh.

#!/bin/sh

sudo service impala-state-store start
sudo service impala-catalog start
sudo service impala-server start

for ip in `cat /etc/hosts | grep worker | awk '{print $1}'`
do
    ssh -t $ip "sudo service impala-server start"
done

The script starts Impala's statestore, catalog, and server processes on the master node, and starts the Impala server process on each worker node.

Following the help documentation, create a shell job, select the startimpala.sh you just uploaded, switch the prefix to ossref://, and set the application arguments to "-f ossref://xxx/startimpala.sh", for example "-f ossref://emr-agent-pack/bootstrap/impala/impala-2.5.0/startimpala.sh". Save.

Then create an execution plan, select the cluster and job just created, and choose manual execution as the scheduling strategy. Click Run Now on the execution plan to run the startimpala job. Execution information appears in the execution plan's run records; refresh the page and wait for the status to become completed. Click View Job List on the right; if the job status shows success, the Impala processes started successfully.
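If the job reports success but queries still fail, you can log in to the master and check the daemons directly; a sketch using the same service wrappers as the start script:

# On the master: all three daemons should report running
sudo service impala-state-store status
sudo service impala-catalog status
sudo service impala-server status

# On each worker: the server daemon should report running
for ip in `cat /etc/hosts | grep worker | awk '{print $1}'`
do
    ssh -t $ip "sudo service impala-server status"
done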

Verification

CLI verification

The bootstrap action installed impala-shell. SSH into the master as root, run impala-shell to enter the Impala shell, then create a table and insert data:

create table test(id bigint);
insert into table test select count(id) from test;

Running show tables lists the table just created, and select * from test; returns the data just inserted.
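The same check can be scripted non-interactively with impala-shell's -q flag, which is handy if you want to fold the verification into another shell job. A minimal sketch, connecting to the local impalad on its default port 21000:

# Run the verification query without entering the interactive shell
impala-shell -i localhost:21000 -q "select * from test;"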

Web verification

By default, the security group only allows access to the cluster on port 22. You can reach Impala's web UI through local port forwarding; the web address is masterip:25000 and shows detailed information about the Impala processes.
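A local port forward can be set up with ssh; a sketch, where <master-public-ip> is a placeholder for your cluster master's public address:

# Forward local port 25000 to the impalad web UI on the master,
# then browse http://localhost:25000
ssh -N -L 25000:localhost:25000 root@<master-public-ip>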

Alternatively, configure the security group's public inbound rules to allow whitelisted IPs to access port 25000. Because exposing a port to the public internet is a security risk, never set the whitelist authorization object to 0.0.0.0/0; only allow access from fixed whitelisted IPs. If you do not have a fixed IP, use the port forwarding approach above instead.
