HDFS Deployment Guide
Hadoop is a distributed system infrastructure developed by the Apache Software Foundation that harnesses the power of a cluster for high-speed computation and storage. One of its core components is the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, which makes it well suited to applications with very large data sets.
Advantages:
1. High reliability. Hadoop's ability to store and process data dependably has earned it wide trust.
2. High scalability. Hadoop distributes data and computation across clusters of available machines, and these clusters can easily be extended to thousands of nodes.
3. High efficiency. Hadoop can move data dynamically between nodes and keeps the load on each node balanced, so processing is very fast.
4. High fault tolerance. Hadoop automatically keeps multiple replicas of data and automatically reassigns failed tasks.
5. Low cost. Compared with all-in-one appliances, commercial data warehouses, and data marts such as QlikView and Yonghong Z-Suite, Hadoop is open source, which greatly reduces a project's software cost.
Core Architecture
Hadoop is made up of many components. At the very bottom is the Hadoop Distributed File System (HDFS), which stores files across all storage nodes in a Hadoop cluster. On top of HDFS sits the MapReduce engine. Together with the data warehouse tool Hive and the distributed database HBase, these two pieces (HDFS and MapReduce) cover the essential technologies of the Hadoop distributed platform.
HDFS:
To an external client, HDFS looks like a traditional hierarchical file system: files can be created, deleted, moved, renamed, and so on. The cluster consists of a NameNode, which provides metadata services inside HDFS, and DataNodes, which provide storage blocks to HDFS. Files stored in HDFS are split into blocks, and these blocks are replicated to multiple machines.
NameNode:
The NameNode is software that normally runs on a dedicated machine in an HDFS instance. It manages the file system namespace and controls access by external clients.
DataNode:
A DataNode is also software that normally runs on a separate machine in an HDFS instance. DataNodes serve read and write requests from HDFS clients, and they also carry out block creation, deletion, and replication commands issued by the NameNode.
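For example, once the cluster built below is running, the way a file is split into blocks and replicated can be observed with hdfs fsck, which lists every block of a file and the DataNodes holding its replicas (the path used here is just the sample file uploaded later in this guide):
hdfs fsck /test/input/newrain.txt -files -blocks -locations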
Cluster Deployment
1. Environment preparation
192.168.249.227 node-1
192.168.249.228 node-2
192.168.249.229 node-3
192.168.249.230 node-4
Disable the firewall and SELinux on every node.
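A minimal sketch of how this is usually done on CentOS/RHEL (assuming firewalld is the active firewall; adapt to your distribution):
[root@node-1 ~]# systemctl stop firewalld && systemctl disable firewalld
[root@node-1 ~]# setenforce 0        # disables SELinux enforcement until the next reboot
[root@node-1 ~]# sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config        # makes the change permanent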
2. Local name resolution (run on every server)
[root@node-1 ~]# vi /etc/hosts
192.168.249.227 node-1
192.168.249.228 node-2
192.168.249.229 node-3
192.168.249.230 node-4
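To confirm that the names resolve, a quick ping from each node to the others should succeed, for example:
[root@node-1 ~]# ping -c 1 node-2
[root@node-1 ~]# ping -c 1 node-3
[root@node-1 ~]# ping -c 1 node-4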
3. Install the JDK (run on every server)
Upload the JDK package and extract it
[root@node-1 ~]# tar xf jdk-8u211-linux-x64.tar.gz -C /usr/local/
Rename the directory
[root@node-1 ~]# mv /usr/local/jdk1.8.0_211/ /usr/java
Add the environment variables
[root@node-1 ~]# vi /etc/profile
JAVA_HOME=/usr/java
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME PATH
Apply the environment variables
[root@node-1 ~]# source /etc/profile
Check the Java version
[root@node-1 ~]# java -version
java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)
4. Passwordless SSH login (run on every server)
[root@node-1 ~]# ssh-keygen        # press Enter at every prompt to accept the defaults
[root@node-1 ~]# ssh-copy-id -i node-1
[root@node-1 ~]# ssh-copy-id -i node-2
[root@node-1 ~]# ssh-copy-id -i node-3
[root@node-1 ~]# ssh-copy-id -i node-4
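Passwordless login can be verified by running a remote command that should return without asking for a password, e.g.:
[root@node-1 ~]# ssh node-2 hostname
node-2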
5. Download Hadoop (run on every server)
Download Hadoop, upload it to the servers, and extract it
[root@node-1 ~]# tar xf hadoop-3.3.0.tar.gz -C /opt/
[root@node-1 ~]# mv /opt/hadoop-3.3.0/ /opt/hadoop
6. Add environment variables (run on every server)
[root@node-1 ~]# vi ~/.bash_profile        # append /opt/hadoop/bin to the existing PATH line, as shown below
PATH=$PATH:$HOME/bin:/opt/hadoop/bin
[root@node-1 ~]# source ~/.bash_profile
[root@node-1 ~]# hadoop version
Hadoop 3.3.0
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r aa96f1871bfd858f9bac59cf2a81ec470da649af
Compiled by brahma on 2020-07-06T18:44Z
Compiled with protoc 3.7.1
From source with checksum 5dc29b802d6ccd77b262ef9d04d19c4
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-3.3.0.jar
7. NameNode deployment (node-1)
Create the required directories
[root@node-1 ~]# mkdir -p /data/hdfs/{tmp,var,logs,dfs,data,name,checkpoint,edits}
Edit the configuration files (templates are listed in the "Configuration file templates" section below)
[root@node-1 ~]# vi /opt/hadoop/etc/hadoop/core-site.xml
[root@node-1 ~]# vi /opt/hadoop/etc/hadoop/workers
[root@node-1 ~]# vi /opt/hadoop/etc/hadoop/hdfs-site.xml
[root@node-1 ~]# vi /opt/hadoop/etc/hadoop/mapred-site.xml
[root@node-1 ~]# vi /opt/hadoop/etc/hadoop/yarn-site.xml
8. Copy to the other nodes (from node-1)
[root@node-1 ~]# scp -r /data/ node-2:/
[root@node-1 ~]# scp -r /data/ node-3:/
[root@node-1 ~]# scp -r /data/ node-4:/
[root@node-1 ~]# scp -r /opt/hadoop/etc/hadoop/* node-2:/opt/hadoop/etc/hadoop/
[root@node-1 ~]# scp -r /opt/hadoop/etc/hadoop/* node-3:/opt/hadoop/etc/hadoop/
[root@node-1 ~]# scp -r /opt/hadoop/etc/hadoop/* node-4:/opt/hadoop/etc/hadoop/
9. Initialization
[root@node-1 ~]# cd /opt/hadoop/bin/
[root@node-1 bin]# ./hadoop namenode -format
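In Hadoop 3 the same operation is usually invoked as hdfs namenode -format; both forms work, the former just prints a deprecation warning. A successful format can be confirmed by checking that the NameNode metadata directory configured in hdfs-site.xml has been populated, which typically looks like this:
[root@node-1 bin]# ls /data/hdfs/name/current/
fsimage_0000000000000000000  fsimage_0000000000000000000.md5  seen_txid  VERSION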
Edit the startup scripts to add the run-as users (then copy them to the other hosts)
[root@node-1 ~]#vi /opt/hadoop/sbin/start-dfs.sh
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
[root@node-1 ~]#vi /opt/hadoop/sbin/stop-dfs.sh
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
[root@node-1 ~]#vi /opt/hadoop/sbin/stop-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
[root@node-1 ~]#vi /opt/hadoop/sbin/start-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
[root@node-1 sbin]# scp /opt/hadoop/sbin/start-dfs.sh /opt/hadoop/sbin/stop-dfs.sh /opt/hadoop/sbin/stop-yarn.sh /opt/hadoop/sbin/start-yarn.sh node-2:/opt/hadoop/sbin/
[root@node-1 sbin]# scp /opt/hadoop/sbin/start-dfs.sh /opt/hadoop/sbin/stop-dfs.sh /opt/hadoop/sbin/stop-yarn.sh /opt/hadoop/sbin/start-yarn.sh node-3:/opt/hadoop/sbin/
[root@node-1 sbin]# scp /opt/hadoop/sbin/start-dfs.sh /opt/hadoop/sbin/stop-dfs.sh /opt/hadoop/sbin/stop-yarn.sh /opt/hadoop/sbin/start-yarn.sh node-4:/opt/hadoop/sbin/
Add the Java environment to hadoop-env.sh, then copy it to the other hosts
[root@node-1 ~]# vim /opt/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/java
[root@node-1 sbin]# scp /opt/hadoop/etc/hadoop/hadoop-env.sh node-2:/opt/hadoop/etc/hadoop/hadoop-env.sh
[root@node-1 sbin]# scp /opt/hadoop/etc/hadoop/hadoop-env.sh node-3:/opt/hadoop/etc/hadoop/hadoop-env.sh
[root@node-1 sbin]# scp /opt/hadoop/etc/hadoop/hadoop-env.sh node-4:/opt/hadoop/etc/hadoop/hadoop-env.sh
Start the cluster (start-all.sh run on node-1 starts the daemons on every node)
[root@node-1 ~]# cd /opt/hadoop/sbin/
[root@node-1 sbin]# ./start-all.sh
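After the script finishes, the running daemons can be checked with jps on every node: node-1 should show NameNode, SecondaryNameNode, DataNode, ResourceManager, and NodeManager, while node-2 through node-4 only show DataNode and NodeManager. On node-1, hdfs dfsadmin -report should list all four DataNodes:
[root@node-1 sbin]# jps
[root@node-1 sbin]# hdfs dfsadmin -report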
Configuration file templates
cat /opt/hadoop/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.checkpoint.period</name>
<value>3600</value>
</property>
<property>
<name>fs.checkpoint.size</name>
<value>67108864</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node-1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/data/hdfs/tmp</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<value>root</value>
</property>
</configuration>
cat /opt/hadoop/etc/hadoop/workers
node-1
node-2
node-3
node-4
cat /opt/hadoop/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/data/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/data/hdfs/data</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>node-1:50090</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>node-1:50070</value>
<description>
The address and the base port where the dfs namenode web ui will listen on.
If the port is 0 then the server will start on a free port.
</description>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:50075</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file:/data/hdfs/checkpoint</value>
</property>
<property>
<name>dfs.namenode.checkpoint.edits.dir</name>
<value>file:/data/hdfs/edits</value>
</property>
</configuration>
cat /opt/hadoop/etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>node-1:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node-1:19888</value>
</property>
</configuration>
cat /opt/hadoop/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node-1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>node-1:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>node-1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>node-1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>node-1:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>node-1:8088</value>
</property>
</configuration>
Management web UIs:
http://node-1:50070/
http://node-1:8088/
Cluster Operations
1. Create a directory and list it
[root@node-1 ~]# hadoop fs -mkdir -p /test/input
[root@node-1 ~]# hadoop fs -ls /
Found 1 items
drwxr-xr-x   - root supergroup          0 2020-11-13 22:46 /test
[root@node-1 ~]# hadoop fs -ls /test
Found 1 items
drwxr-xr-x   - root supergroup          0 2020-11-13 22:46 /test/input
2. Prepare a file to upload
[root@node-1 ~]# cat newrain.txt
hello world
nihao shijie
3. Upload it to /test/input
[root@node-1 ~]# hadoop fs -put newrain.txt /test/input
4. Verify that the upload succeeded
[root@node-1 ~]# hadoop fs -ls /test/input
Found 1 items
-rw-r--r-- 2 root supergroup 27 2020-11-13 22:52 /test/input/newrain.txt
5. Download the file
[root@node-1 ~]# hadoop fs -get /test/input/newrain.txt /
[root@node-1 ~]# ls /
bin boot data dev etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var newrain.txt
6. Update the configuration file (then sync it to the other nodes, as shown after the file)
[root@node-1 ~]# vi /opt/hadoop/etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>node-1:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node-1:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
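Since the file was only changed on node-1, it should be copied to the other nodes and the cluster restarted before running jobs (following the same conventions as above):
[root@node-1 ~]# scp /opt/hadoop/etc/hadoop/mapred-site.xml node-2:/opt/hadoop/etc/hadoop/
[root@node-1 ~]# scp /opt/hadoop/etc/hadoop/mapred-site.xml node-3:/opt/hadoop/etc/hadoop/
[root@node-1 ~]# scp /opt/hadoop/etc/hadoop/mapred-site.xml node-4:/opt/hadoop/etc/hadoop/
[root@node-1 ~]# /opt/hadoop/sbin/stop-all.sh && /opt/hadoop/sbin/start-all.sh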
7. Run a MapReduce example program: wordcount
[root@node-1 ~]# hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount /test/input /test/output
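When the job completes, the word counts can be read back from the output directory; with the default reducer output naming (part-r-00000) the result for the sample file should look roughly like this:
[root@node-1 ~]# hadoop fs -cat /test/output/part-r-00000
hello	1
nihao	1
shijie	1
world	1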