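1. Configure SSH
Consult the relevant documentation for your system.
2. Install the JDK and configure the Java environment
Consult the relevant documentation for your system.
3. Install SVN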
[root@master ~]# yum install -y subversion
Check out the Nutch source code with SVN:
[root@master ~]# svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.7/
4. Install ANT and configure the ANT environment
Consult the relevant documentation for your system.
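Before going further, it can help to confirm that the tools from steps 1 to 4 are actually on the PATH. The following is a minimal sketch; the helper name check_tool is made up for this walkthrough, and it only tests that each command can be found, not that its version suits Nutch 1.7.

```shell
# Hypothetical helper: report whether a prerequisite command is on PATH.
# Returns 0 when the command is found and 1 when it is missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found:   $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check the tools this walkthrough relies on; keep going after a miss.
for tool in ssh java svn ant; do
  check_tool "$tool" || true
done
```

Any tool reported as missing should be installed before moving on to the build.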
5. Add the 'http.agent.name' property to the ~/release-1.7/conf/nutch-site.xml configuration file
<!-- HTTP properties -->
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.3; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>
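Since the description says the value must not be empty, a quick check before building may save a failed crawl later. This is a rough sketch, not an XML parser: the function name agent_name_value is hypothetical, and it assumes the <name> and <value> elements sit on separate lines, as they do in the property above.

```shell
# Hypothetical helper: print the <value> that follows
# <name>http.agent.name</name> in a Nutch configuration file.
# Assumes <name> and <value> are on separate lines.
agent_name_value() {
  awk '
    /<name>http\.agent\.name<\/name>/ { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")      # drop everything up to the opening tag
      sub(/<\/value>.*/, "")    # drop the closing tag and anything after it
      print
      exit
    }
  ' "$1"
}

# Path used in this walkthrough; the check is skipped if the file is absent.
conf="$HOME/release-1.7/conf/nutch-site.xml"
if [ -f "$conf" ]; then
  if [ -n "$(agent_name_value "$conf")" ]; then
    echo "http.agent.name is set"
  else
    echo "http.agent.name is missing or empty in $conf"
  fi
fi
```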
6. Change into the Nutch directory and run ant to build the Nutch source code
[root@master release-1.7]# ant
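The ANT build creates a runtime directory containing two subdirectories, deploy and local, one for each of Nutch's two run modes: deploy is for running on a Hadoop cluster, and local is for running standalone on a single machine. The remaining steps are run from runtime/local.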
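To confirm that the build produced both directories before continuing, something like the following could be used. It is only a sketch: has_nutch_runtime is a made-up helper name, and the checkout path is the one assumed throughout this walkthrough.

```shell
# Hypothetical helper: succeed only if DIR contains the runtime/local and
# runtime/deploy directories that the ant build is expected to produce.
has_nutch_runtime() {
  [ -d "$1/runtime/local" ] && [ -d "$1/runtime/deploy" ]
}

src="$HOME/release-1.7"   # checkout location used in this walkthrough
if has_nutch_runtime "$src"; then
  echo "build output found; local mode lives in $src/runtime/local"
else
  echo "runtime/local or runtime/deploy missing under $src; re-run ant"
fi
```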
7. Create a urls directory inside the local directory
[root@master local]# mkdir urls
8. Create a file named url in the urls directory with the vi editor
[root@master local]# vi urls/url
9. Add the URLs to crawl to the url file, one per line
http://www.leezhen.net/
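Steps 7 to 9 can also be done without an interactive editor, which is handy when scripting the setup. A minimal sketch, assuming it is run from the runtime/local directory and that overwriting an existing urls/url file is acceptable:

```shell
# Create the seed directory and write the seed list, one URL per line.
# Note: this overwrites urls/url if it already exists.
seed_file=urls/url
mkdir -p "$(dirname "$seed_file")"
printf '%s\n' 'http://www.leezhen.net/' > "$seed_file"

# Show the result for a quick visual check.
cat "$seed_file"
```

Further seeds can be added by passing more arguments to printf, each of which ends up on its own line.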
10. Start the crawl
[root@master local]# nohup bin/nutch crawl urls -dir data -depth 3 -threads 100 &
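The options in the command above are easier to adjust when they are named. The wrapper below is a sketch under stated assumptions: start_crawl, the DRY_RUN switch, and the crawl.log file name are all invented for illustration and are not part of Nutch, while the crawl command and its -dir, -depth, and -threads options are the ones used in step 10. It is meant to be run from runtime/local.

```shell
# Hypothetical wrapper around the Nutch 1.7 crawl command from step 10.
# Set DRY_RUN=1 to print the command instead of running it.
start_crawl() {
  seed_dir=$1    # directory holding the seed URL file(s)
  out_dir=$2     # -dir: where crawldb, linkdb and segments are written
  depth=$3       # -depth: link depth to follow from the seed pages
  threads=$4     # -threads: number of fetcher threads running in parallel
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "bin/nutch crawl $seed_dir -dir $out_dir -depth $depth -threads $threads"
  else
    # Run in the background and keep the output in crawl.log
    # (an assumed file name) rather than the default nohup.out.
    nohup bin/nutch crawl "$seed_dir" -dir "$out_dir" \
      -depth "$depth" -threads "$threads" > crawl.log 2>&1 &
  fi
}

# Print the command that would be run with the settings from step 10.
DRY_RUN=1 start_crawl urls data 3 100
```

Without DRY_RUN, progress can be followed with tail -f crawl.log.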
Reference: http://wiki.apache.org/nutch/NutchTutorial