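xxl-crawler is a Java crawler framework open-sourced by 许雪里 (xuxueli). Developers who already know Java should find it very easy to use.
Code repository:
https://github.com/xuxueli/xxl-crawler
Official documentation:
https://www.xuxueli.com/xxl-crawler/#爬虫示例参考
0x01: Create a new project and add the following dependencies to pom.xml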
<dependency>
    <groupId>com.xuxueli</groupId>
    <artifactId>xxl-crawler</artifactId>
    <version>1.2.2</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.2</version>
</dependency>
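0x02: Write the page data object
Two tools are recommended here for quickly and visually obtaining the jQuery-style cssQuery expression of a page element.
- Chrome DevTools: locate the element on the page, select it in the Elements panel, then right-click and choose "Copy > Copy selector".
[Figure: using Chrome DevTools]
- jQuery Selector Helper (Chrome extension): locate the element on the page, open the Selector pane on the right side of the Elements panel, then pick the element.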
package com.spider.page.vo;

import com.xuxueli.crawler.annotation.PageFieldSelect;
import com.xuxueli.crawler.annotation.PageSelect;
import com.xuxueli.crawler.conf.XxlCrawlerConf.SelectType;

// Each table row matched by this selector is mapped to one page VO instance.
@PageSelect(cssQuery = "body > div.container > div > div > table > tbody > tr")
public class GzGemasComCnPageMainVo {

    // Field selectors are relative to the row matched by @PageSelect.
    @PageFieldSelect(cssQuery = "td:nth-child(1)")
    private String code;

    @PageFieldSelect(cssQuery = "td:nth-child(2)")
    private String title;

    @PageFieldSelect(cssQuery = "td:nth-child(3)")
    private String status;

    @PageFieldSelect(cssQuery = "td:nth-child(4)")
    private String date;

    // Reads the raw value of the link's onclick attribute instead of its text.
    @PageFieldSelect(cssQuery = "td:nth-child(2) > a", selectType = SelectType.ATTR, selectVal = "onclick")
    private String url;

    public String getCode() {
        return code;
    }

    public void setCode(String code) {
        this.code = code;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getStatus() {
        return status;
    }

    public void setStatus(String status) {
        this.status = status;
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }
}
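Note that the `url` field above is filled from the link's `onclick` attribute, so it holds a JavaScript snippet rather than a plain URL. The following is a minimal sketch of pulling the first quoted argument out of such a snippet with the standard library only. The class `OnclickArgExtractor`, the method `firstQuotedArg`, and the sample `onclick` value are all hypothetical; the real attribute value on the target page is not shown in this article and may look different.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of xxl-crawler. */
final class OnclickArgExtractor {

    // Matches the first argument wrapped in matching single or double quotes.
    private static final Pattern FIRST_QUOTED_ARG = Pattern.compile("(['\"])(.*?)\\1");

    private OnclickArgExtractor() {
    }

    /** Returns the text inside the first pair of matching quotes, if there is one. */
    static Optional<String> firstQuotedArg(String onclick) {
        if (onclick == null) {
            return Optional.empty();
        }
        Matcher matcher = FIRST_QUOTED_ARG.matcher(onclick);
        return matcher.find() ? Optional.of(matcher.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed onclick shape, made up for this example.
        String onclick = "window.open('/example/detail?id=123')";
        System.out.println(firstQuotedArg(onclick).orElse("(no quoted argument found)"));
    }
}
```

A helper like this could be called inside the page parser on `getUrl()` before logging or storing the value. If the extracted path is relative, it would still need to be resolved against the site's base URL.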
0x03: Create the crawler and fetch the data
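import com.spider.page.vo.GzGemasComCnPageMainVo;
import com.xuxueli.crawler.XxlCrawler;
import com.xuxueli.crawler.parser.PageParser;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// The wrapper class name is chosen for illustration; the logger assumes SLF4J is on the classpath.
public class GzGemasComCnCrawler {

    private static final Logger logger = LoggerFactory.getLogger(GzGemasComCnCrawler.class);

    public static void main(String[] args) {
        XxlCrawler crawler = new XxlCrawler.Builder()
                .setUrls("http://gz.gemas.com.cn/portal/article/proList.shtml?proType=guquan&typeGz=G3T3&proSource=&pageIndex=1")
                // Spread crawling is disabled here; when enabled, the crawler starts
                // from the given URLs and spreads across the whole site.
                .setAllowSpread(false)
                .setThreadCount(1)
                .setPageParser(new PageParser<GzGemasComCnPageMainVo>() {
                    @Override
                    public void parse(Document html, Element pageVoElement, GzGemasComCnPageMainVo gzGemasComCnPageVo) {
                        // Called once per matched row with the populated PageVo object.
                        String pageUrl = html.baseUri();
                        logger.info("pageUrl: " + pageUrl);
                        logger.info("Code: " + gzGemasComCnPageVo.getCode()
                                + ", Title: " + gzGemasComCnPageVo.getTitle()
                                + ", sdate: " + gzGemasComCnPageVo.getDate()
                                + ", url: " + gzGemasComCnPageVo.getUrl()
                                + ", status: " + gzGemasComCnPageVo.getStatus());
                    }
                })
                .build();

        // true runs the crawler synchronously, so main blocks until crawling finishes.
        crawler.start(true);
    }
}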
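The crawler above requests only `pageIndex=1`, and spread crawling is turned off, so later list pages are never visited. One possible way to cover them, sketched below, is to generate the page URLs up front and pass the resulting array to `setUrls`, which to my knowledge accepts multiple URLs as varargs. The class `PageUrlBuilder`, the method `buildPageUrls`, and the page range 1 to 3 are hypothetical; the article does not say how many pages the site actually has.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of xxl-crawler. */
final class PageUrlBuilder {

    private PageUrlBuilder() {
    }

    /** Appends each page number in [firstPage, lastPage] to the given URL prefix. */
    static String[] buildPageUrls(String urlPrefix, int firstPage, int lastPage) {
        if (firstPage > lastPage) {
            throw new IllegalArgumentException("firstPage must not exceed lastPage");
        }
        List<String> urls = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            urls.add(urlPrefix + page);
        }
        return urls.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // The prefix is the article's start URL with the page number cut off.
        String prefix = "http://gz.gemas.com.cn/portal/article/proList.shtml"
                + "?proType=guquan&typeGz=G3T3&proSource=&pageIndex=";
        // The range 1..3 is an assumption made for this example.
        for (String url : buildPageUrls(prefix, 1, 3)) {
            System.out.println(url);
        }
    }
}
```

With this in place, the builder call would become `.setUrls(PageUrlBuilder.buildPageUrls(prefix, 1, 3))`, while the rest of the crawler configuration stays the same.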
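Original author: Java乐园
Original article: Java爬虫可以非常溜
Original source: WeChat official account
This repost will be removed on request if it infringes any rights.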