Table of Contents
- HDFS Overview (Hadoop Distributed File System)
- HDFS Architecture
- The File System Namespace
- HDFS Replication
- The Linux Environment
- Hadoop Deployment
- JDK 1.8 Deployment
- Passwordless SSH Setup
- Hadoop Installation Directory and hadoop-env Configuration
- Formatting and Starting HDFS
- Hadoop Command-Line Operations
- HDFS Storage Internals
HDFS Overview (Hadoop Distributed File System)
- Distributed: data is spread across a cluster of machines
- Commodity, low-cost hardware: replaces the centralized "IOE" (IBM minicomputer / Oracle database / EMC storage) stack
- Fault-tolerant: three replicas per block by default
- High throughput: moving computation is cheaper than moving data
- Large data sets: designed for very large files and datasets
HDFS Architecture
- master/slave architecture: one NameNode (master) and many DataNodes (slaves)
- NameNode: manages the file system namespace
- DataNode: stores the actual data
- HDFS exposes a file system namespace and allows user data to be stored in files
- Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes
- The NameNode executes file system namespace operations like opening, closing, and renaming files and directories, and determines the mapping of blocks to DataNodes
- These machines typically run a GNU/Linux operating system (OS); HDFS itself is built in Java
Example
Suppose a.txt is 150 MB and the block size is 128 MB. The file is split into two blocks: block1 (128 MB) and block2 (22 MB).
The question is: which DataNodes should block1 and block2 be stored on?
This placement is transparent to the user; HDFS handles it automatically.
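The split above is simple arithmetic, sketched below with the numbers from the example. (On a live cluster, `hdfs fsck /a.txt -files -blocks -locations` would show the actual block-to-DataNode mapping.)

```shell
# Block-split arithmetic for a 150 MB file with a 128 MB block size.
file_mb=150
block_mb=128
full=$(( file_mb / block_mb ))           # number of full blocks: 1
rem=$(( file_mb % block_mb ))            # size of the final partial block: 22 MB
total=$(( full + (rem > 0 ? 1 : 0) ))    # total blocks: 2
echo "$total blocks: ${full} x ${block_mb} MB + ${rem} MB"
# prints: 2 blocks: 1 x 128 MB + 22 MB
```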
The File System Namespace
- A user or an application can create directories and store files inside these directories.
- The file system namespace hierarchy is similar to most other existing file systems: one can create and remove files, move a file from one directory to another, or rename a file.
- HDFS supports user quotas and access permissions.
- HDFS does not support hard links or soft links; however, the HDFS architecture does not preclude implementing these features.
- The NameNode maintains the file system namespace.
HDFS Replication
- HDFS stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file.
- An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later.
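One immediate consequence of replication is raw storage cost, sketched below using the 150 MB file from the earlier example and the default factor of 3. (On a live cluster, the factor of an existing file can be changed with `hdfs dfs -setrep`.)

```shell
# With replication factor 3, each block lives on 3 DataNodes, so a
# 150 MB file occupies 450 MB of aggregate cluster capacity.
file_mb=150
replication=3
raw_mb=$(( file_mb * replication ))
echo "raw storage: ${raw_mb} MB"
# prints: raw storage: 450 MB
```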
The Linux Environment

```
(base) JackSundeMBP:~ jacksun$ ssh hadoop@192.168.68.200
[hadoop@hadoop000 ~]$ pwd
/home/hadoop
[hadoop@hadoop000 ~]$ ls
app   Desktop    Downloads  maven_resp  Pictures  README.txt  software   t.txt
data  Documents  lib        Music       Public    shell       Templates  Videos
```
Directory | Purpose |
---|---|
software | software installation packages |
app | software installation directory |
data | data files |
lib | jar dependencies |
shell | scripts |
maven_resp | Maven dependencies |
Map the hostname in /etc/hosts:

```
[hadoop@hadoop000 ~]$ sudo vi /etc/hosts
192.168.68.200 hadoop000
```
Hadoop Deployment
- This deployment uses the CDH distribution of Hadoop
- CDH package downloads: http://archive.cloudera.com/cdh5/cdh/5/
- Hadoop version: hadoop-2.6.0-cdh5.15.1
- Hive version: hive-1.1.0-cdh5.15.1
JDK 1.8 Deployment
- Copy the JDK archive to the server (into the software directory for installation packages)

```
scp jdk_name hadoop@192.168.1.200:~/software
```

- Install the JDK

```
tar -zxvf jdk_name -C ~/app
```

- Configure the environment

```
vi ~/.bash_profile

PATH=$PATH:$HOME/.local/bin:$HOME/bin
export JAVA_HOME=/home/hadoop/app/jdk1.8.0_91
export PATH=$JAVA_HOME/bin:$PATH
```

Run `source .bash_profile` to apply the changes, then verify:

```
java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
```

If the output above appears, the installation succeeded.
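The order in `export PATH=$JAVA_HOME/bin:$PATH` matters: prepending makes this JDK shadow any system-wide java. A minimal sketch of that shadowing, using a scratch directory and a made-up `demo` script instead of a real JDK:

```shell
# Prepending a directory to PATH makes its executables win the lookup.
bin_dir="$(mktemp -d)"
printf '#!/bin/sh\necho from-prepended-dir\n' > "$bin_dir/demo"
chmod +x "$bin_dir/demo"
PATH="$bin_dir:$PATH"
demo    # resolves to the prepended copy, printing: from-prepended-dir
```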
All software packages:
https://download.csdn.net/download/jankin6/12668545
Passwordless SSH Setup
- Generate a key pair

```
ssh-keygen -t rsa
```

- Change into the key directory

```
cd .ssh
```

- Append the public key to authorized_keys

```
cat id_rsa.pub >> authorized_keys
```

The resulting files (id_rsa is the private key, id_rsa.pub is the public key):

```
-rw------- 1 hadoop hadoop  796 Aug 16 06:17 authorized_keys
-rw------- 1 hadoop hadoop 1675 Aug 16 06:14 id_rsa
-rw-r--r-- 1 hadoop hadoop  398 Aug 16 06:14 id_rsa.pub
-rw-r--r-- 1 hadoop hadoop 1230 Aug 16 18:05 known_hosts
```
The first connection records the host key; subsequent logins need no password or confirmation:

```
[hadoop@hadoop000 ~]$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:LZvkeJHnqH0AtihqFB2AcQJKwMpH1/DorPi0bIEKcQM.
ECDSA key fingerprint is MD5:9f:b5:f3:bd:f2:aa:61:97:8b:8a:e2:a3:98:5a:e4:3d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Last login: Sun Aug 16 18:03:23 2020 from 192.168.1.3
[hadoop@hadoop000 ~]$ ls
app              Desktop    lib         Pictures    shell      t.txt
authorized_keys  Documents  maven_resp  Public      software   Videos
data             Downloads  Music       README.txt  Templates
[hadoop@hadoop000 ~]$ ssh localhost
Last login: Sun Aug 16 18:05:21 2020 from 127.0.0.1
```
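sshd ignores authorized_keys when ~/.ssh or the key files are group- or world-writable, which is why the file modes in the listing above matter. A safe sketch of applying the required modes, run here against a scratch directory rather than the real ~/.ssh:

```shell
# Required modes: 700 on .ssh, 600 on authorized_keys.
ssh_dir="$(mktemp -d)/.ssh"
mkdir -p "$ssh_dir"
touch "$ssh_dir/authorized_keys"
chmod 700 "$ssh_dir"                     # drwx------
chmod 600 "$ssh_dir/authorized_keys"     # -rw-------
stat -c '%a' "$ssh_dir"                  # prints: 700
stat -c '%a' "$ssh_dir/authorized_keys"  # prints: 600
```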
Hadoop Installation Directory and hadoop-env Configuration
- Download: https://download.csdn.net/download/jankin6/12668545
- Extract

```
tar -zxvf hadoop_name.tar.gz -C ~/app
```

- Add environment variables
- Edit the configuration: set JAVA_HOME in hadoop-env.sh
```
[hadoop@hadoop000 hadoop]$ ls
capacity-scheduler.xml      httpfs-env.sh            mapred-env.sh
configuration.xsl           httpfs-log4j.properties  mapred-queues.xml.template
container-executor.cfg      httpfs-signature.secret  mapred-site.xml
core-site.xml               httpfs-site.xml          mapred-site.xml.template
hadoop-env.cmd              kms-acls.xml             slaves
hadoop-env.sh               kms-env.sh               ssl-client.xml.example
hadoop-metrics2.properties  kms-log4j.properties     ssl-server.xml.example
hadoop-metrics.properties   kms-site.xml             yarn-env.cmd
hadoop-policy.xml           log4j.properties         yarn-env.sh
hdfs-site.xml               mapred-env.cmd           yarn-site.xml
[hadoop@hadoop000 hadoop]$ pwd
/home/hadoop/app/hadoop-2.6.0-cdh5.15.1/etc/hadoop
[hadoop@hadoop000 hadoop]$ sudo vi hadoop-env.sh
```

In hadoop-env.sh, replace the default JAVA_HOME:

```
# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/home/hadoop/app/jdk1.8.0_91
```

Then add Hadoop to the PATH:

```
vi ~/.bash_profile

export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.15.1
export PATH=$HADOOP_HOME/bin:$PATH
```

The client commands live in `cd $HADOOP_HOME/bin`.
- Directory layout

```
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ ls
bin             etc                  include  LICENSE.txt  README.txt  src
bin-mapreduce1  examples             lib      logs         sbin
cloudera        examples-mapreduce1  libexec  NOTICE.txt   share
```
Directory | Purpose |
---|---|
bin | Hadoop client executables |
etc/hadoop | Hadoop configuration files |
sbin | scripts that start/stop the Hadoop daemons |
share | libraries and common examples |
Formatting and Starting HDFS
Reference: http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.15.1/

Edit etc/hadoop/core-site.xml to declare that the NameNode runs on port 8020 of this machine:

```
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop000:8020</value>
    </property>
</configuration>
```
Edit etc/hadoop/hdfs-site.xml:

```
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/app/tmp</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```
Edit etc/hadoop/slaves and list the DataNode hostnames (here just hadoop000).

Format the file system on first startup only; do not run this twice:

```
hdfs namenode -format
```

The command lives in `$HADOOP_HOME/bin`.
- Start the cluster

```
$HADOOP_HOME/sbin/start-dfs.sh
```

Verify with jps:

```
[hadoop@hadoop000 sbin]$ jps
13607 NameNode
14073 Jps
13722 DataNode
13915 SecondaryNameNode
```
- Firewall interference

If jps shows the processes but http://192.168.1.200:50070 will not open in a browser, the firewall is the most likely cause.
Check it with `firewall-cmd --state` and stop it with `systemctl stop firewalld.service`:

```
[hadoop@hadoop000 sbin]$ firewall-cmd --state
not running
```
- Stop the cluster

```
$HADOOP_HOME/sbin/stop-dfs.sh
```

- Note: start-dfs.sh is equivalent to

```
hadoop-daemons.sh start namenode
hadoop-daemons.sh start datanode
hadoop-daemons.sh start secondarynamenode
```

and stop-dfs.sh likewise stops each daemon.
Hadoop Command-Line Operations
After changing environment variables, remember to run `source ~/.bash_profile`.
```
[hadoop@hadoop000 bin]$ ./hadoop
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
  s3guard              manage data on S3
  trace                view and modify Hadoop tracing settings
 or
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
```
```
[hadoop@hadoop000 bin]$ ./hadoop fs
Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]
        [-chgrp [-R] GROUP PATH...]
        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
        [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-count [-q] [-h] [-v] [-x] <path> ...]
        [-cp [-f] [-p | -p[topax]] <src> ... <dst>]
        [-df [-h] [<path> ...]]
        [-du [-s] [-h] [-x] <path> ...]
        [-find <path> ... <expression> ...]
        [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-getfacl [-R] <path>]
        [-getfattr [-R] {-n name | -d} [-e en] <path>]
        [-getmerge [-nl] <src> <localdst>]
        [-help [cmd ...]]
        [-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
        [-mkdir [-p] <path> ...]
        [-moveFromLocal <localsrc> ... <dst>]
        [-moveToLocal <src> <localdst>]
        [-mv <src> ... <dst>]
        [-put [-f] [-p] [-l] <localsrc> ... <dst>]
        [-rm [-f] [-r|-R] [-skipTrash] <src> ...]
        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
        [-test -[defsz] <path>]
        [-text [-ignoreCrc] <src> ...]
        [-touchz <path> ...]
        [-usage [cmd ...]]
```
- Common commands

```
hadoop fs -ls /                       # list a directory
hadoop fs -cat /path/to/file          # print a file
hadoop fs -text /path/to/file         # print a file, decompressing if needed
hadoop fs -put localfile /            # upload a local file
hadoop fs -copyFromLocal localfile /  # same effect as -put
hadoop fs -get /README.txt ./         # download a file
hadoop fs -mkdir /hdfs-test           # create a directory
hadoop fs -mv                         # move or rename
hadoop fs -rm                         # delete a file
hadoop fs -rmdir                      # delete an empty directory
hadoop fs -rmr                        # recursive delete, same as hadoop fs -rm -r
hadoop fs -getmerge                   # merge files in an HDFS dir into one local file
```
```
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -put README.txt /
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -ls /
Found 1 items
-rw-r--r--   1 hadoop supergroup       1366 2020-08-17 21:35 /README.txt
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -cat /README.txt
......
and our wiki, at:
......
Hadoop Core uses the SSL libraries from the Jetty project written
by mortbay.org.
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -get /README.txt ./
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -mkdir /hdfs-test
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -ls /
Found 2 items
-rw-r--r--   1 hadoop supergroup       1366 2020-08-17 21:35 /README.txt
drwxr-xr-x   - hadoop supergroup          0 2020-08-17 21:48 /hdfs-test
```
HDFS Storage Internals
In the figure above we saw a file split into two blocks, but where are the blocks actually stored? From this we can summarize:
- put: one file is split into n blocks, and the blocks are stored on different nodes.
- get: the client first locates the n blocks on their nodes, then reads them back to reassemble the data.
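On disk, a DataNode keeps each block as a `blk_*` file (plus a `.meta` checksum file) under `${hadoop.tmp.dir}/dfs/data`. The sketch below fabricates that layout in a scratch directory so it runs without a cluster; the block-pool directory name is made up, and on hadoop000 you would point `find` at `/home/hadoop/app/tmp` instead:

```shell
# Mimic a DataNode storage tree and locate the block files in it.
root="$(mktemp -d)"
pool="$root/dfs/data/current/BP-demo/current/finalized/subdir0/subdir0"
mkdir -p "$pool"
touch "$pool/blk_1073741825"              # block data
touch "$pool/blk_1073741825_1001.meta"    # block checksum metadata
find "$root" -name 'blk_*' | wc -l        # prints: 2
```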