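Hadoop has distinct advantages for analyzing massive data sets. Today I set up pseudo-distributed mode on my own Linux machine. It took quite a few detours, so I am summarizing the experience below.
First, the three Hadoop installation modes:
1. Standalone mode. This is Hadoop's default mode. When the configuration files are empty, Hadoop runs entirely on the local machine. Because it does not need to interact with other nodes, standalone mode uses neither HDFS nor any Hadoop daemon. It is mainly used to develop and debug the application logic of MapReduce programs.
2. Pseudo-distributed mode. The Hadoop daemons run on the local machine and simulate a small cluster. On top of standalone mode, this mode adds debugging capabilities: you can inspect memory usage, HDFS input and output, and the interactions among the daemons.
3. Fully distributed mode. The Hadoop daemons run on a cluster.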
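References:
1. Installing Hadoop 1.0.0 on Ubuntu 11.10 (single-node pseudo-distributed)
5. Setting up a Hadoop environment on Ubuntu (standalone mode + pseudo-distributed mode)
6. Hadoop quick start: setting up a Hadoop environment on Ubuntu (standalone mode + pseudo-distributed mode)
I strongly recommend 5 and 6. Both tutorials go from simple to hard, give detailed steps, and include worked examples. Below I roughly retrace my own installation; to save time, much of the text is pasted from references 5 and 6, and I thank both authors again for sharing their experience. In addition, the three articles below give an overall picture of Hadoop's architecture, which helps explain why each step is done the way it is.
My setup is Ubuntu 12.04, user name derek, host name derekUbun, and Hadoop release hadoop-1.1.2.tar.gz. Without further ado, the steps and the output of each one follow:
I. Create the hadoop group and user on Ubuntu
1. Add the hadoop group and user to the system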
- derek@derekUbun:~$ sudo addgroup hadoop
- derek@derekUbun:~$ sudo adduser --ingroup hadoop hadoop
2. So far we have only added the user hadoop, which has no administrator privileges. To grant them, open the /etc/sudoers file:
- derek@derekUbun:~$ sudo gedit /etc/sudoers
Below the line root ALL=(ALL:ALL) ALL, add the line hadoop ALL=(ALL:ALL) ALL
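A malformed /etc/sudoers can lock you out of sudo, so it may be worth checking the edit before closing the terminal. The sketch below is a hypothetical helper (has_full_sudo is my own name, not part of any tool) that only tests whether a sudoers-style file contains the line added above. Running sudo visudo -c additionally validates the syntax of the real file.

```shell
# Hypothetical helper: succeed if FILE has a line "USER ALL=(ALL:ALL) ALL"
# (the shorter "USER ALL=(ALL) ALL" form is accepted as well).
# Usage: has_full_sudo FILE USER
has_full_sudo() {
    grep -Eq "^$2[[:space:]]+ALL=\(ALL(:ALL)?\)[[:space:]]+ALL[[:space:]]*$" "$1"
}

# Example (reading /etc/sudoers requires root, so run this in a root shell):
#   has_full_sudo /etc/sudoers hadoop && echo "hadoop has full sudo rights"
```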
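II. Configure SSH
SSH is configured so that commands can be executed across machines without typing a login password. Password prompts must be avoided; otherwise you would have to type the password by hand every time the master node tries to reach another node.
How password-free SSH works: the master (namenode/jobtracker) acts as the client. To connect to a slave server (datanode/tasktracker) with password-free public-key authentication, a key pair consisting of a public key and a private key is generated on the master, and the public key is copied to every slave. When the master connects to a slave over SSH, the slave generates a random number, encrypts it with the master's public key, and sends it to the master. The master decrypts it with its private key and sends the result back; once the slave confirms that the decrypted number is correct, it allows the master to connect. This is public-key authentication, and the user never has to type a password. The key step is copying the master's public key to the slaves.
1. Install ssh
1) Hadoop communicates over ssh, so install ssh first. Note that I first switched from user derek to user hadoop.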
- derek@derekUbun:~$ su - hadoop
- Password:
- hadoop@derekUbun:~$ sudo apt-get install openssh-server
- [sudo] password for hadoop:
- Reading package lists... Done
- Building dependency tree
- Reading state information... Done
- openssh-server is already the newest version.
- The following packages were automatically installed and are no longer required:
- kde-l10n-de language-pack-kde-de language-pack-kde-en ssh-krb5
- language-pack-de-base language-pack-kde-zh-hans language-pack-kde-en-base
- kde-l10n-engb language-pack-kde-de-base kde-l10n-zhcn firefox-locale-de
- language-pack-de language-pack-kde-zh-hans-base
- Use 'apt-get autoremove' to remove them.
- 0 upgraded, 0 newly installed, 0 to remove and 505 not upgraded.
Because my machine already had the latest ssh installed, this step did nothing in practice.
2) Once ssh is installed, start the service. After starting it, check that it is running:
- hadoop@derekUbun:~$ sudo /etc/init.d/ssh start
- Rather than invoking init scripts through /etc/init.d, use the service(8)
- utility, e.g. service ssh start
- Since the script you are attempting to invoke has been converted to an
- Upstart job, you may also use the start(8) utility, e.g. start ssh
- hadoop@derekUbun:~$ ps -e |grep ssh
- 759 ? 00:00:00 sshd
- 1691 ? 00:00:00 ssh-agent
- 12447 ? 00:00:00 ssh
- 12448 ? 00:00:00 sshd
- 12587 ? 00:00:00 sshd
- hadoop@derekUbun:~$
3) As a secure protocol, ssh asks for a password by default, so we set up password-free login by generating a private/public key pair (ssh-keygen can generate RSA or DSA keys; RSA is the default):
- hadoop@derekUbun:~$ ssh-keygen -t rsa -P ""
- Generating public/private rsa key pair.
- Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
- /home/hadoop/.ssh/id_rsa already exists.
- Overwrite (y/n)? y
- Your identification has been saved in /home/hadoop/.ssh/id_rsa.
- Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
- The key fingerprint is:
- c7:36:c7:77:91:a2:32:28:35:a6:9f:36:dd:bd:dc:4f hadoop@derekUbun
- The key's randomart image is:
- +--[ RSA 2048]----+
- | |
- | .|
- | + . o |
- | + o. .. . .|
- | o .So=.o . .|
- | o oo+o.. . |
- | = . . . E|
- | . . . o. |
- | o .o|
- +-----------------+
- hadoop@derekUbun:~$
(Note: this creates two files under ~/.ssh/, id_rsa and id_rsa.pub. They come as a pair: the former is the private key and the latter the public key.)
Go to ~/.ssh/ and append the public key id_rsa.pub to the authorized_keys file, which does not exist at first (authorized_keys holds the public keys of all users allowed to log in over ssh as the current user):
- hadoop@derekUbun:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now log in over ssh to confirm that no password is required:
- hadoop@derekUbun:~$ ssh localhost
- Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-27-generic-pae i686)
- * Documentation: https://help.ubuntu.com/
- 512 packages can be updated.
- 151 updates are security updates.
- Last login: Mon Mar 11 15:56:15 2013 from localhost
- hadoop@derekUbun:~$
(Note: after you log in to another machine over ssh, you are controlling the remote machine. Run exit to get back to the local host.)
Log out with exit. From now on, logging in no longer requires a password.
- hadoop@derekUbun:~$ exit
- Connection to localhost closed.
- hadoop@derekUbun:~$
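If ssh localhost still prompts for a password after this step, a common cause is file permissions: with its default StrictModes setting, sshd ignores authorized_keys when ~/.ssh or the file itself is writable by other users. The sketch below wraps the append step in a hypothetical function, install_pubkey (the name and the duplicate check are my additions, not part of ssh), that also tightens the permissions and avoids adding the same key twice.

```shell
# Hypothetical helper: append a public key to an authorized_keys file
# unless it is already there, and set the permissions sshd expects.
# Usage: install_pubkey PUBKEY_FILE SSH_DIR
install_pubkey() {
    pubkey_file=$1
    ssh_dir=$2
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    # -F fixed string, -x whole-line match, -q quiet, -f read pattern from file
    if ! grep -Fxq -f "$pubkey_file" "$ssh_dir/authorized_keys"; then
        cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
    fi
    chmod 600 "$ssh_dir/authorized_keys"
}

# Example, matching the paths used above:
#   install_pubkey ~/.ssh/id_rsa.pub ~/.ssh
```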
III. Install Java
Install Java as user derek.
Java was already installed on my machine, in /usr/java/jdk1.7.0_17. The installed version can be displayed as follows:
- hadoop@derekUbun:~$ su - derek
- Password:
- derek@derekUbun:~$ java -version
- java version "1.7.0_17"
- Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
- Java HotSpot(TM) Server VM (build 23.7-b01, mixed mode)
IV. Install hadoop-1.1.2
Download the Hadoop release from the official site; I downloaded the latest version at the time, hadoop-1.1.2.tar.gz. Unpack it and put it in the directory of your choice. I copied hadoop-1.1.2.tar.gz to /usr/local, switched to the hadoop user, unpacked it there, and renamed the resulting directory to hadoop.
- hadoop@derekUbun:/usr/local$ sudo tar xzf hadoop-1.1.2.tar.gz
- hadoop@derekUbun:/usr/local$ sudo mv hadoop-1.1.2 /usr/local/hadoop
Make sure every operation is performed as user hadoop, so set the owner of the hadoop directory to hadoop:
- hadoop@derekUbun:/usr/local$ sudo chown -R hadoop:hadoop hadoop
V. Configure hadoop-env.sh (Java installation path)
Log in as the hadoop user, go to /usr/local/hadoop, open hadoop-env.sh in the conf directory, and add the following (find the line #export JAVA_HOME=..., remove the #, and set it to the JDK path on your machine):
export JAVA_HOME=/usr/java/jdk1.7.0_17
(This depends on where Java is installed on your machine; mine is in /usr/java/jdk1.7.0_17.)
export HADOOP_INSTALL=/usr/local/hadoop
(Note that I use HADOOP_INSTALL rather than HADOOP_HOME, because HADOOP_HOME is deprecated in newer versions and using it triggers a warning.)
export PATH=$PATH:/usr/local/hadoop/bin
- hadoop@derekUbun:/usr/local/hadoop$ sudo vi conf/hadoop-env.sh
- # Set Hadoop-specific environment variables here.
- # The only required environment variable is JAVA_HOME. All others are
- # optional. When running a distributed configuration it is best to
- # set JAVA_HOME in this file, so that it is correctly defined on
- # remote nodes.
- # The java implementation to use. Required.
- # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
- export JAVA_HOME=/usr/java/jdk1.7.0_17
- export HADOOP_INSTALL=/usr/local/hadoop
- export PATH=$PATH:/usr/local/hadoop/bin
- # Extra Java CLASSPATH elements. Optional.
- # export HADOOP_CLASSPATH=
- # The maximum amount of heap to use, in MB. Default is 1000.
- # export HADOOP_HEAPSIZE=2000
- # Extra Java runtime options. Empty by default.
- # export HADOOP_OPTS=-server
- "conf/hadoop-env.sh" 57L, 2356C
Then source the file so that the environment variables take effect:
- hadoop@derekUbun:/usr/local/hadoop$ source /usr/local/hadoop/conf/hadoop-env.sh
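A wrong JAVA_HOME is a frequent reason for the hadoop command failing right after this step. As a quick sanity check, the sketch below defines a hypothetical function, check_java_home (my own name, not a Hadoop tool), that only tests whether the given directory contains an executable bin/java.

```shell
# Hypothetical helper: succeed if DIR looks like a Java installation,
# i.e. DIR/bin/java exists and is executable.
# Usage: check_java_home DIR
check_java_home() {
    if [ -x "$1/bin/java" ]; then
        echo "JAVA_HOME looks valid: $1"
    else
        echo "no executable bin/java under: $1" >&2
        return 1
    fi
}

# Example, after sourcing conf/hadoop-env.sh:
#   check_java_home "$JAVA_HOME"
```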
At this point Hadoop's standalone mode is installed. The Hadoop version can be displayed as follows:
- hadoop@derekUbun:/usr/local/hadoop$ hadoop version
- Hadoop 1.1.2
- Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782
- Compiled by hortonfo on Thu Jan 31 02:03:24 UTC 2013
- From source with checksum c720ddcf4b926991de7467d253a79b8b
- hadoop@derekUbun:/usr/local/hadoop$
Now run WordCount, one of the examples bundled with Hadoop, to get a feel for the MapReduce process.
Create an input directory under the hadoop directory:
- hadoop@derekUbun:/usr/local/hadoop$ mkdir input
Copy all files in conf into the input directory:
- hadoop@derekUbun:/usr/local/hadoop$ cp conf/* input
Run the WordCount program and save the results in output:
- hadoop@derekUbun:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.1.2.jar wordcount input output
Then run:
- hadoop@derekUbun:/usr/local/hadoop$ cat output/*
You will see every word in the conf files listed with its frequency.
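To make clear what the job computes, here is a rough single-machine equivalent written as a shell pipeline. It is only an illustration, not how Hadoop implements WordCount, and it assumes, like the bundled example, that words are separated by whitespace. The function name wordcount_local is hypothetical.

```shell
# Count whitespace-separated words read from standard input and print
# "word<TAB>count" lines sorted by word, similar to the job's output format.
wordcount_local() {
    tr -s '[:space:]' '\n' |   # put one word on each line
        grep -v '^$' |         # drop the empty line left by leading whitespace
        LC_ALL=C sort |        # bring identical words together, in byte order
        uniq -c |              # prefix each distinct word with its count
        awk '{ print $2 "\t" $1 }'
}

# Example:
#   cat input/* | wordcount_local
```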
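VI. Configuration for pseudo-distributed mode
Three files need to be set up: core-site.xml, hdfs-site.xml, and mapred-site.xml, all in the /usr/local/hadoop/conf directory.
core-site.xml: configuration for Hadoop Core, such as the I/O settings shared by HDFS and MapReduce.
hdfs-site.xml: configuration for the HDFS daemons, including the namenode, the secondary namenode, and the datanodes.
mapred-site.xml: configuration for the MapReduce daemons, including the jobtracker and the tasktrackers.
1. Edit the three files: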
1). core-site.xml:
- <configuration>
- <property>
- <name>fs.default.name</name>
- <value>hdfs://localhost:9000</value>
- </property>
- <property>
- <name>hadoop.tmp.dir</name>
- <value>/usr/local/hadoop/tmp</value>
- </property>
- </configuration>
2).hdfs-site.xml:
- <configuration>
- <property>
- <name>dfs.replication</name>
- <value>2</value>
- </property>
- <property>
- <name>dfs.name.dir</name>
- <value>/usr/local/hadoop/datalog1,/usr/local/hadoop/datalog2</value>
- </property>
- <property>
- <name>dfs.data.dir</name>
- <value>/usr/local/hadoop/data1,/usr/local/hadoop/data2</value>
- </property>
- </configuration>
3). mapred-site.xml:
- <configuration>
- <property>
- <name>mapred.job.tracker</name>
- <value>localhost:9001</value>
- </property>
- </configuration>
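When a daemon fails to start, it helps to confirm which values it will actually read. The sketch below is a hypothetical helper, get_prop (not part of Hadoop), that pulls one property value out of a Hadoop *-site.xml file with awk. It assumes the layout used above, with <name> and <value> each on their own line, and is not a general XML parser.

```shell
# Hypothetical helper: print the <value> that follows <name>NAME</name>.
# Assumes <name> and <value> sit on separate lines, as in the files above.
# Usage: get_prop FILE NAME
get_prop() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && match($0, /<value>.*<\/value>/) {
            # strip the 7-character <value> and the 8-character </value>
            print substr($0, RSTART + 7, RLENGTH - 15)
            exit
        }
    ' "$1"
}

# Example:
#   get_prop conf/hdfs-site.xml dfs.replication
```

One value worth checking this way is dfs.replication: with a single datanode, a value above 1 cannot be satisfied, so HDFS reports blocks as under-replicated. Single-node setups commonly use 1.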
2. Format the HDFS filesystem on the namenode before starting the Hadoop services:
- hadoop@derekUbun:/usr/local/hadoop$ source /usr/local/hadoop/conf/hadoop-env.sh
- hadoop@derekUbun:/usr/local/hadoop$ hadoop namenode -format
Output like the following means the HDFS filesystem was formatted successfully:
- 13/03/11 23:08:01 INFO common.Storage: Storage directory /usr/local/hadoop/datalog2 has been successfully formatted.
- 13/03/11 23:08:01 INFO namenode.NameNode: SHUTDOWN_MSG:
- /************************************************************
- SHUTDOWN_MSG: Shutting down NameNode at derekUbun/127.0.1.1
- ************************************************************/
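Formatting wipes the namenode's metadata, so it should be done once, not before every start. The sketch below assumes that a formatted Hadoop 1.x name directory contains a current/VERSION file and uses that as a guard. The function name is hypothetical, and the directory in the example is the first dfs.name.dir entry from hdfs-site.xml above.

```shell
# Hypothetical guard: succeed if NAME_DIR already holds a formatted
# namenode image (assumption: marked by the file current/VERSION).
# Usage: namenode_is_formatted NAME_DIR
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Example: format only when the first dfs.name.dir has not been formatted yet.
#   namenode_is_formatted /usr/local/hadoop/datalog1 || hadoop namenode -format
```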
3. Start Hadoop
Next, run start-all.sh to start all services, including the namenode and the datanode. The start-all.sh script launches the daemons.
- hadoop@derekUbun:/usr/local/hadoop$ cd bin
- hadoop@derekUbun:/usr/local/hadoop/bin$ start-all.sh
- starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-namenode-derekUbun.out
- localhost: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-datanode-derekUbun.out
- localhost: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-secondarynamenode-derekUbun.out
- starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-jobtracker-derekUbun.out
- localhost: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-derekUbun.out
- hadoop@derekUbun:/usr/local/hadoop/bin$
Use the JDK's jps command to list all daemons and verify the installation. A list like the following indicates success:
- hadoop@derekUbun:/usr/local/hadoop$ jps
- 8431 JobTracker
- 8684 TaskTracker
- 7821 NameNode
- 8915 Jps
- 8341 SecondaryNameNode
- hadoop@derekUbun:/usr/local/hadoop$
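start-all.sh launches five daemons in this setup (its log above names namenode, datanode, secondarynamenode, jobtracker, and tasktracker), so a quick way to read jps output is to check for each expected name. The sketch below does that with a hypothetical function, missing_daemons. Fed the listing above, it would print DataNode, which does not appear in that listing; if a daemon is missing on your machine, its log file under the logs directory is the place to look.

```shell
# Hypothetical helper: read `jps` output on standard input and print the
# name of every expected Hadoop 1.x daemon that is not listed.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # Compare the second column exactly, so that "NameNode" is not
        # satisfied by the "SecondaryNameNode" line.
        printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { ok = 1 } END { exit !ok }' ||
            printf '%s\n' "$daemon"
    done
}

# Example:
#   jps | missing_daemons
```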
4. Check the running state
All the settings are done and Hadoop is running. You can now check that the services are healthy through Hadoop's web interfaces for monitoring the cluster:
http://localhost:50030/ - Hadoop administration interface (JobTracker)
http://localhost:50060/ - Hadoop Task Tracker status
http://localhost:50070/ - Hadoop DFS status
At this point Hadoop's pseudo-distributed mode is installed, so run the bundled WordCount example again, this time in pseudo-distributed mode, to get a feel for the MapReduce process.
Note that the program now runs against the distributed filesystem (dfs), and the files it creates live there as well.
First create the input directory in dfs:
- hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -mkdir input
Copy the files in conf to input in dfs:
- hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -copyFromLocal conf/* input
(Note: files in dfs can be listed and deleted with hadoop dfs commands.)
Run WordCount in pseudo-distributed mode:
- hadoop@derekUbun:/usr/local/hadoop$ hadoop jar hadoop-examples-1.1.2.jar wordcount input output
- 13/03/12 09:26:05 INFO input.FileInputFormat: Total input paths to process : 16
- 13/03/12 09:26:05 INFO util.NativeCodeLoader: Loaded the native-hadoop library
- 13/03/12 09:26:05 WARN snappy.LoadSnappy: Snappy native library not loaded
- 13/03/12 09:26:05 INFO mapred.JobClient: Running job: job_201303120920_0001
- 13/03/12 09:26:06 INFO mapred.JobClient: map 0% reduce 0%
- 13/03/12 09:26:10 INFO mapred.JobClient: map 12% reduce 0%
- 13/03/12 09:26:13 INFO mapred.JobClient: map 25% reduce 0%
- 13/03/12 09:26:15 INFO mapred.JobClient: map 37% reduce 0%
- 13/03/12 09:26:17 INFO mapred.JobClient: map 50% reduce 0%
- 13/03/12 09:26:18 INFO mapred.JobClient: map 62% reduce 0%
- 13/03/12 09:26:19 INFO mapred.JobClient: map 62% reduce 16%
- 13/03/12 09:26:20 INFO mapred.JobClient: map 75% reduce 16%
- 13/03/12 09:26:22 INFO mapred.JobClient: map 87% reduce 16%
- 13/03/12 09:26:24 INFO mapred.JobClient: map 100% reduce 16%
- 13/03/12 09:26:28 INFO mapred.JobClient: map 100% reduce 29%
- 13/03/12 09:26:30 INFO mapred.JobClient: map 100% reduce 100%
- 13/03/12 09:26:30 INFO mapred.JobClient: Job complete: job_201303120920_0001
- 13/03/12 09:26:30 INFO mapred.JobClient: Counters: 29
- 13/03/12 09:26:30 INFO mapred.JobClient: Job Counters
- 13/03/12 09:26:30 INFO mapred.JobClient: Launched reduce tasks=1
- 13/03/12 09:26:30 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=29912
- 13/03/12 09:26:30 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
- 13/03/12 09:26:30 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
- 13/03/12 09:26:30 INFO mapred.JobClient: Launched map tasks=16
- 13/03/12 09:26:30 INFO mapred.JobClient: Data-local map tasks=16
- 13/03/12 09:26:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=19608
- 13/03/12 09:26:30 INFO mapred.JobClient: File Output Format Counters
- 13/03/12 09:26:30 INFO mapred.JobClient: Bytes Written=15836
- 13/03/12 09:26:30 INFO mapred.JobClient: FileSystemCounters
- 13/03/12 09:26:30 INFO mapred.JobClient: FILE_BYTES_READ=23161
- 13/03/12 09:26:30 INFO mapred.JobClient: HDFS_BYTES_READ=29346
- 13/03/12 09:26:30 INFO mapred.JobClient: FILE_BYTES_WRITTEN=944157
- 13/03/12 09:26:30 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=15836
- 13/03/12 09:26:30 INFO mapred.JobClient: File Input Format Counters
- 13/03/12 09:26:30 INFO mapred.JobClient: Bytes Read=27400
- 13/03/12 09:26:30 INFO mapred.JobClient: Map-Reduce Framework
- 13/03/12 09:26:30 INFO mapred.JobClient: Map output materialized bytes=23251
- 13/03/12 09:26:30 INFO mapred.JobClient: Map input records=778
- 13/03/12 09:26:30 INFO mapred.JobClient: Reduce shuffle bytes=23251
- 13/03/12 09:26:30 INFO mapred.JobClient: Spilled Records=2220
- 13/03/12 09:26:30 INFO mapred.JobClient: Map output bytes=36314
- 13/03/12 09:26:30 INFO mapred.JobClient: Total committed heap usage (bytes)=2736914432
- 13/03/12 09:26:30 INFO mapred.JobClient: CPU time spent (ms)=6550
- 13/03/12 09:26:30 INFO mapred.JobClient: Combine input records=2615
- 13/03/12 09:26:30 INFO mapred.JobClient: SPLIT_RAW_BYTES=1946
- 13/03/12 09:26:30 INFO mapred.JobClient: Reduce input records=1110
- 13/03/12 09:26:30 INFO mapred.JobClient: Reduce input groups=804
- 13/03/12 09:26:30 INFO mapred.JobClient: Combine output records=1110
- 13/03/12 09:26:30 INFO mapred.JobClient: Physical memory (bytes) snapshot=2738036736
- 13/03/12 09:26:30 INFO mapred.JobClient: Reduce output records=804
- 13/03/12 09:26:30 INFO mapred.JobClient: Virtual memory (bytes) snapshot=6773346304
- 13/03/12 09:26:30 INFO mapred.JobClient: Map output records=2615
- hadoop@derekUbun:/usr/local/hadoop$
Display the output:
- hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -cat output/*
When you are done with Hadoop, shut down its daemons with the stop-all.sh script:
- hadoop@derekUbun:/usr/local/hadoop$ bin/stop-all.sh
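Now begin your Hadoop journey and implement some algorithms!
Notes:
1. In pseudo-distributed mode, hadoop dfs -ls input lists the contents of input.
2. In pseudo-distributed mode, hadoop dfs -rmr input deletes input and its contents.
3. In pseudo-distributed mode, input and output both live in the Hadoop dfs filesystem.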
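One practical consequence of the -rmr command: a MapReduce job refuses to start if its output directory already exists, so rerunning WordCount means removing output first. Below is a minimal sketch, assuming the same jar name and relative paths used above. HADOOP_CMD and rerun_wordcount are hypothetical names introduced here so that the command can be swapped out for a dry run.

```shell
# Command used to invoke Hadoop; override it (e.g. HADOOP_CMD=echo) for a dry run.
HADOOP_CMD=${HADOOP_CMD:-hadoop}

# Remove a stale output directory in dfs, then run the bundled WordCount.
# Usage: rerun_wordcount INPUT_DIR OUTPUT_DIR
rerun_wordcount() {
    # -rmr fails when the directory does not exist yet; that is fine here
    "$HADOOP_CMD" dfs -rmr "$2" || true
    "$HADOOP_CMD" jar hadoop-examples-1.1.2.jar wordcount "$1" "$2"
}

# Example, run from /usr/local/hadoop:
#   rerun_wordcount input output
```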