Installing and Deploying Nutch 1.7 on CentOS 6.4

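1. Configure SSH

Refer to the relevant documentation.

2. Install the JDK and configure the Java environment

Refer to the relevant documentation.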

3. Install SVN

[root@master ~]# yum install -y subversion

Check out the Nutch source code with SVN

[root@master ~]# svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.7/

4. Install Ant and configure the Ant environment

Refer to the relevant documentation.

5. Add the 'http.agent.name' property to the ~/release-1.7/conf/nutch-site.xml configuration file

<!-- HTTP properties -->

<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (Windows NT 6.3; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version

and set their values appropriately.
</description>
</property>
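
For orientation, the property belongs inside the <configuration> root element of nutch-site.xml. The following is a minimal sketch of a complete file, not the full configuration a production crawl would need. CONF_DIR is an assumed variable name and defaults to a scratch directory so that running the sketch does not overwrite a real configuration; in this guide the real file is ~/release-1.7/conf/nutch-site.xml.

```shell
#!/bin/sh
# CONF_DIR is an assumed variable; it defaults to a scratch directory
# so that this sketch never overwrites an existing nutch-site.xml.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
SITE_XML="$CONF_DIR/nutch-site.xml"

# Write a minimal site configuration containing only http.agent.name.
cat > "$SITE_XML" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (Windows NT 6.3; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0</value>
  </property>
</configuration>
EOF

# Show the property name and the value line that follows it.
grep -A1 '<name>http.agent.name</name>' "$SITE_XML"
```

Setting CONF_DIR=~/release-1.7/conf before running would write to the real location; back up the existing file first if it already holds other properties.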

6. Change into the Nutch directory and run ant to build the Nutch source code

[root@master release-1.7]# ant

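After the Ant build completes, a runtime directory is generated. It contains two subdirectories, deploy and local, which correspond to Nutch's two run modes: deploy for running on a Hadoop cluster and local for running on a single machine.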
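
A quick way to confirm that the build produced both directories is sketched below. check_runtime is a hypothetical helper name, not part of Nutch, and NUTCH_HOME is an assumed variable that falls back to the ~/release-1.7 checkout used in this guide.

```shell
#!/bin/sh
# check_runtime is a hypothetical helper, not part of Nutch.
# Argument: the root of the Nutch source checkout.
check_runtime() {
  nutch_home="$1"
  status=0
  for mode in local deploy; do
    if [ -d "$nutch_home/runtime/$mode" ]; then
      echo "runtime/$mode: present"
    else
      echo "runtime/$mode: missing"
      status=1
    fi
  done
  # Non-zero when at least one of the two directories is absent.
  return "$status"
}

# NUTCH_HOME is an assumed variable; the default matches the checkout path above.
check_runtime "${NUTCH_HOME:-$HOME/release-1.7}" || echo "Build output incomplete; re-run ant."
```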

7. Create a urls directory in the local directory

[root@master local]# mkdir urls

8. Create a url file in the urls directory with the vi editor

[root@master local]# vi urls/url

9. Add the URLs to crawl to the url file, one per line

http://www.leezhen.net/
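
Steps 7 to 9 can also be done without an interactive editor, which is convenient in scripts. This is a minimal sketch to be run from the runtime/local directory; SEED_DIR is an assumed variable name that defaults to the urls directory used above.

```shell
#!/bin/sh
# SEED_DIR is an assumed variable; the default matches the urls directory from step 7.
SEED_DIR="${SEED_DIR:-urls}"

# Create the seed directory if it does not exist yet.
mkdir -p "$SEED_DIR"

# Write the seed list, one URL per line (this overwrites any existing url file).
printf '%s\n' 'http://www.leezhen.net/' > "$SEED_DIR/url"

# Print the file to confirm its contents.
cat "$SEED_DIR/url"
```

Additional seeds can be passed as further arguments to printf, since the '%s\n' format is reused for each argument.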

10. Start the crawl

[root@master local]# nohup bin/nutch crawl urls -dir data -depth 3 -threads 100 &
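
Because the crawl runs in the background, its progress has to be checked separately. The sketch below prints the tail of nohup.out and, once the crawl has created data/crawldb, the CrawlDb statistics via bin/nutch readdb. show_crawl_status is a hypothetical helper name, not part of Nutch, and it assumes the crawl was started from the directory passed as its argument with -dir data as in the command above.

```shell
#!/bin/sh
# show_crawl_status is a hypothetical helper, not part of Nutch.
# Argument: the directory the crawl was started from (runtime/local in this guide).
show_crawl_status() {
  work_dir="${1:-.}"
  if [ -f "$work_dir/nohup.out" ]; then
    # nohup appends the job's output to nohup.out when stdout is a terminal.
    tail -n 20 "$work_dir/nohup.out"
  else
    echo "No nohup.out found in $work_dir"
  fi
  # Print CrawlDb statistics only if Nutch and the crawl data are both present.
  if [ -x "$work_dir/bin/nutch" ] && [ -d "$work_dir/data/crawldb" ]; then
    (cd "$work_dir" && bin/nutch readdb data/crawldb -stats)
  fi
}

show_crawl_status .
```

To follow the log continuously instead of taking a snapshot, tail -f nohup.out can be used from the same directory.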

Reference: http://wiki.apache.org/nutch/NutchTutorial