Hadoop Basics 03: HDFS Basic Concepts

Contents

HDFS Overview (Hadoop Distributed File System)

  • distributed
  • commodity, low-cost hardware: no reliance on IOE (IBM/Oracle/EMC) big iron
  • fault-tolerant: three replicas per block by default
  • high throughput: moving computation is cheaper than moving data
  • large data sets

HDFS Architecture in Detail

  • NameNode (master) / DataNodes (slaves)
  • master/slave architecture
  • NN: manages the file system namespace
  • DN: manages the storage attached to its node
  • HDFS exposes a file system namespace and allows user data to be stored in files
  • Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes
  • The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories
  • It also determines the mapping of blocks to DataNodes
  • These machines typically run a GNU/Linux operating system (OS); HDFS is built using the Java language
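As a rough mental model of the master/slave split (a toy sketch, not real HDFS code; every name below is illustrative), the NameNode keeps two mappings — file to blocks, and block to DataNodes — while the DataNodes hold the actual bytes:

```python
# Toy model of the NameNode's metadata (names are hypothetical, not HDFS APIs).
namespace = {"/user/hadoop/a.txt": ["blk_1", "blk_2"]}  # file -> ordered blocks
block_map = {
    "blk_1": ["dn1", "dn2", "dn3"],  # block -> DataNodes holding a replica
    "blk_2": ["dn2", "dn3", "dn4"],
}

def locate(path):
    """Resolve a file path to the DataNodes that hold each of its blocks."""
    return [(blk, block_map[blk]) for blk in namespace[path]]

print(locate("/user/hadoop/a.txt"))
```

Every client read follows this shape: ask the NameNode for metadata, then contact the DataNodes directly for the block data.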

Official documentation

Example
Suppose a.txt is 150 MB and the block size is 128 MB.
The file is then split into two blocks: block1 (128 MB) and block2 (22 MB).

So the question becomes: which DataNodes should block1 and block2 be stored on?
This is transparent to the user; HDFS takes care of it.
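The split in this example can be sketched as a small calculation (a toy helper, not part of Hadoop; sizes are in MB):

```python
def split_into_blocks(file_size, block_size=128):
    """Cut a file of `file_size` MB into HDFS-style blocks of at most `block_size` MB."""
    blocks = []
    while file_size > 0:
        blocks.append(min(block_size, file_size))
        file_size -= block_size
    return blocks

# a.txt: 150 MB with a 128 MB block size -> one full block plus a 22 MB remainder
print(split_into_blocks(150))  # [128, 22]
```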


File System Namespace

  • A user or an application can create directories and store files inside these directories.
  • The file system namespace hierarchy is similar to most other existing file systems:
  • one can create and remove files, move a file from one directory to another, or rename a file.
  • HDFS supports user quotas and access permissions.
  • HDFS does not support hard links or soft links; however, the HDFS architecture does not preclude implementing these features.
  • The NameNode maintains the file system namespace.

HDFS Replication

  • It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file.

  • An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later.
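A minimal sketch of what a per-file replication factor means for placement (round-robin placement and the function name are made-up simplifications for illustration; real HDFS uses rack-aware placement):

```python
def place_replicas(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin sketch)."""
    assert replication <= len(datanodes), "need at least `replication` DataNodes"
    placement = {}
    for i, block in enumerate(blocks):
        # Start each block at a different node so load spreads across the cluster.
        placement[block] = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(["blk_1", "blk_2"], nodes, replication=3))
```

Losing any single DataNode then still leaves at least two live replicas of every block, which is the fault-tolerance property the replication factor buys.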


Linux Environment Overview

(base) JackSundeMBP:~ jacksun$ ssh hadoop@192.168.68.200

[hadoop@hadoop000 ~]$ pwd
/home/hadoop


[hadoop@hadoop000 ~]$ ls
app   Desktop    Downloads  maven_resp  Pictures  README.txt  software   t.txt
data  Documents  lib        Music       Public    shell       Templates  Videos

Name         Purpose
software     software installation packages
app          installed software
data         data
lib          jar files
shell        scripts
maven_resp   Maven dependency repository

[hadoop@hadoop000 ~]$ sudo vi /etc/hosts

192.168.68.200 hadoop000

Hadoop Deployment

Deploying JDK 1.8

  • Copy the file over: scp jdk_name hadoop@192.168.1.200
  • Install the JDK: tar -zxvf jdk_name -C ~/app
  • Configure the environment:
    vi .bash_profile
PATH=$PATH:$HOME/.local/bin:$HOME/bin
export JAVA_HOME=/home/hadoop/app/jdk1.8.0_91
export PATH=$JAVA_HOME/bin:$PATH


Apply the changes: source .bash_profile

java -version

java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)

If the output above is printed, the installation succeeded.

All software installation packages:
https://download.csdn.net/download/jankin6/12668545

Setting Up Passwordless SSH Login

  • Generate a key pair: ssh-keygen -t rsa

  • cd .ssh

  • Append the public key to the authorized keys file: cat id_rsa.pub >> authorized_keys

-rw------- 1 hadoop hadoop  796 Aug 16 06:17 authorized_keys
-rw------- 1 hadoop hadoop 1675 Aug 16 06:14 id_rsa
-rw-r--r-- 1 hadoop hadoop  398 Aug 16 06:14 id_rsa.pub
-rw-r--r-- 1 hadoop hadoop 1230 Aug 16 18:05 known_hosts

id_rsa      private key
id_rsa.pub  public key

[hadoop@hadoop000 ~]$ ssh localhost 
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:LZvkeJHnqH0AtihqFB2AcQJKwMpH1/DorPi0bIEKcQM.
ECDSA key fingerprint is MD5:9f:b5:f3:bd:f2:aa:61:97:8b:8a:e2:a3:98:5a:e4:3d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Last login: Sun Aug 16 18:03:23 2020 from 192.168.1.3
[hadoop@hadoop000 ~]$ ls
app              Desktop    lib         Pictures    shell      t.txt
authorized_keys  Documents  maven_resp  Public      software   Videos
data             Downloads  Music       README.txt  Templates
[hadoop@hadoop000 ~]$ ssh localhost 
Last login: Sun Aug 16 18:05:21 2020 from 127.0.0.1

Hadoop Installation Directory and hadoop-env Configuration

Configure JAVA_HOME

[hadoop@hadoop000 hadoop]$ ls
capacity-scheduler.xml      httpfs-env.sh            mapred-env.sh
configuration.xsl           httpfs-log4j.properties  mapred-queues.xml.template
container-executor.cfg      httpfs-signature.secret  mapred-site.xml
core-site.xml               httpfs-site.xml          mapred-site.xml.template
hadoop-env.cmd              kms-acls.xml             slaves
hadoop-env.sh               kms-env.sh               ssl-client.xml.example
hadoop-metrics2.properties  kms-log4j.properties     ssl-server.xml.example
hadoop-metrics.properties   kms-site.xml             yarn-env.cmd
hadoop-policy.xml           log4j.properties         yarn-env.sh
hdfs-site.xml               mapred-env.cmd           yarn-site.xml
[hadoop@hadoop000 hadoop]$ pwd
/home/hadoop/app/hadoop-2.6.0-cdh5.15.1/etc/hadoop
[hadoop@hadoop000 hadoop]$ sudo vi hadoop-env.sh 

-----------------------------

# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}

export JAVA_HOME=/home/hadoop/app/jdk1.8.0_91 

vi ~/.bash_profile

export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.15.1
export PATH=$HADOOP_HOME/bin:$PATH

cd $HADOOP_HOME/bin

  • Directory layout
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ ls
bin             etc                  include  LICENSE.txt  README.txt  src
bin-mapreduce1  examples             lib      logs         sbin
cloudera        examples-mapreduce1  libexec  NOTICE.txt   share
Directory     Purpose
bin           Hadoop client commands
etc/hadoop    Hadoop configuration files
sbin          scripts that start and stop the Hadoop processes
share         common examples

Formatting and Starting HDFS

http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.15.1/

vi etc/hadoop/core-site.xml:


This sets the NameNode to listen on port 8020 of this machine:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop000:8020</value>
    </property>
</configuration>

vi etc/hadoop/hdfs-site.xml:


<configuration>


    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/app/tmp</value>
    </property>

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>


</configuration>

vi slaves

Format the file system the first time only (do not run it again): hdfs namenode -format

cd $HADOOP_HOME/bin


The related commands live here: cd $HADOOP_HOME/bin

  • Start the cluster: $HADOOP_HOME/sbin/start-dfs.sh


Verify that it started successfully:

[hadoop@hadoop000 sbin]$ jps
13607 NameNode
14073 Jps
13722 DataNode
13915 SecondaryNameNode
  • Firewall interference

http://192.168.1.200:50070
If jps shows the processes but the browser cannot open this page, the firewall is most likely the cause.

Check the firewall: firewall-cmd --state
Stop the firewall:  systemctl stop firewalld.service

[hadoop@hadoop000 sbin]$ firewall-cmd --state
not running


  • Stop the cluster
    $HADOOP_HOME/sbin/stop-dfs.sh

  • Note

start-dfs.sh is equivalent to:

hadoop-daemons.sh start namenode
hadoop-daemons.sh start datanode
hadoop-daemons.sh start secondarynamenode

and likewise stop-dfs.sh is equivalent to the corresponding stop commands.

Hadoop Command-Line Operations

After changing environment variables, remember to run:
source ~/.bash_profile

[hadoop@hadoop000 bin]$ ./hadoop
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
  s3guard              manage data on S3
  trace                view and modify Hadoop tracing settings
 or
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.


[hadoop@hadoop000 bin]$ ./hadoop fs
Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
	[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-x] <path> ...]
	[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] [-x] <path> ...]
	[-find <path> ... <expression> ...]
	[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] <src> <localdst>]
	[-help [cmd ...]]
	[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] <localsrc> ... <dst>]
	[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
	[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
	[-test -[defsz] <path>]
	[-text [-ignoreCrc] <src> ...]
	[-touchz <path> ...]
	[-usage [cmd ...]]

  • Common commands
    hadoop fs -ls /
    hadoop fs -cat /
    hadoop fs -text /
    hadoop fs -put /
    hadoop fs -copyFromLocal /
    hadoop fs -get /README.txt ./
    hadoop fs -mkdir /hdfs-test
    hadoop fs -mv
    hadoop fs -rm
    hadoop fs -rmdir
    hadoop fs -rmr (same as hadoop fs -rm -r)
    hadoop fs -getmerge
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -put README.txt  /
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -ls /
Found 1 items
-rw-r--r--   1 hadoop supergroup       1366 2020-08-17 21:35 /README.txt

[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -cat /README.txt

......
and our wiki, at:
......
  Hadoop Core uses the SSL libraries from the Jetty project written 
by mortbay.org.

[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -get /README.txt ./


[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -mkdir /hdfs-test
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -ls /
Found 2 items
-rw-r--r--   1 hadoop supergroup       1366 2020-08-17 21:35 /README.txt
drwxr-xr-x   - hadoop supergroup          0 2020-08-17 21:48 /hdfs-test

More on HDFS Storage


From the figures above we can see that the file was split into two blocks, but where are they actually stored?

From this we can conclude:
put: a file is split into n blocks, and the blocks are stored on different nodes.
get: the client first locates the n blocks on their nodes and reads back the corresponding data.