1. Overview
- The crawler is written in Java and uses Jsoup.
- Elasticsearch is purchased from the Alibaba Cloud website; the client used is x-pack-transport.
- Lyrics source: http://www.kuwo.cn/artist/index
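2. Analyzing the lyrics site
This post takes a fairly crude approach: work out the URL behind each request one by one. That keeps the code simple and avoids a browser emulator. (If you do want one, cdp4j can simulate click events, but in a quick trial it was awkward to use and slow.)
a) Finding the artist list
On http://www.kuwo.cn/artist/index, inspect the click event of the pagination buttons and find the paging request URL in the JavaScript.
The click handler in artist.js builds the URL as shown below.
The pn parameter is the page number.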
var b = host + "/artist/indexAjax?category=" + index + "&prefix=" + $("#artistContent").attr("data-letter") + "&pn=" + pn;
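For example: http://www.kuwo.cn/artist/indexAjax?category=0&prefix=&pn=5
The response, shown in the figure below, contains a link for each artist; the URL behind each artist's name is that artist's detail page.
The code then only needs to loop over the pages.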
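The URL pattern above can be reproduced with a small helper. This is only a sketch: the class name ArtistIndexUrls and the constant HOST are made up for illustration (the crawler code later in this post uses a variable called host), and URL-encoding the prefix is an added precaution that does not appear in artist.js.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class ArtistIndexUrls {
    // Assumed base address; the crawler code in this post calls this value "host".
    static final String HOST = "http://www.kuwo.cn";

    /** Builds the indexAjax URL for one result page, mirroring the string built in artist.js. */
    static String pageUrl(int category, String prefix, int pn) {
        try {
            return HOST + "/artist/indexAjax?category=" + category
                    + "&prefix=" + URLEncoder.encode(prefix, "UTF-8")
                    + "&pn=" + pn;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Print the URLs of the first three pages
        for (int pn = 0; pn < 3; pn++) {
            System.out.println(pageUrl(0, "", pn));
        }
    }
}
```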
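b) Finding the paginated song list
An artist detail page looks like: http://www.kuwo.cn/artist/content?name=%E5%88%98%E7%8F%82%E7%9F%A3
Again, inspect the JavaScript behind the pagination button's click event to find the URL that returns one page of the artist's songs.
For example: http://www.kuwo.cn//artist/contentMusicsAjax?artistId=2&pn=1&rn=100
Here artistId is the artist's ID and pn is the page number.
Then loop over the pages.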
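As a rough illustration of these two URLs, the sketch below decodes the percent-encoded name parameter of a detail page and builds a contentMusicsAjax URL. The class and method names are hypothetical, HOST is an assumed base address, and the name lookup is a plain string search rather than a full query-string parser. The post does not explain rn; it is passed through unchanged here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class ArtistSongUrls {
    // Assumed base address; the crawler code in this post calls this value "host".
    static final String HOST = "http://www.kuwo.cn";

    /** Extracts and decodes the "name" query parameter of an artist detail URL. */
    static String artistName(String detailUrl) {
        int start = detailUrl.indexOf("name=");
        if (start < 0) {
            throw new IllegalArgumentException("no name parameter in " + detailUrl);
        }
        String raw = detailUrl.substring(start + "name=".length());
        int end = raw.indexOf('&');
        if (end >= 0) {
            raw = raw.substring(0, end);
        }
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform
            throw new IllegalStateException(e);
        }
    }

    /** Builds the URL of one page of an artist's song list. */
    static String songPageUrl(String artistId, int pn, int rn) {
        return HOST + "/artist/contentMusicsAjax?artistId=" + artistId
                + "&pn=" + pn + "&rn=" + rn;
    }

    public static void main(String[] args) {
        System.out.println(artistName(
                "http://www.kuwo.cn/artist/content?name=%E5%88%98%E7%8F%82%E7%9F%A3"));
        System.out.println(songPageUrl("2", 1, 100));
    }
}
```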
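c) Fetching the lyrics
The detail page of every song can be reached from the loop above; the detail page looks like the following.
Read the lyrics from the corresponding HTML elements.
Note: lyrics are sometimes not displayed because the track is unlicensed, so this step needs exception handling.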
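One way to keep a single missing lyric from stopping the whole crawl is to wrap each fetch so that an exception or a null result simply becomes "nothing for this song". The helper below is a hypothetical sketch, not part of the original project; SafeFetch and attempt are made-up names, and the lambdas in main stand in for real page fetches.

```java
import java.util.Optional;
import java.util.concurrent.Callable;

final class SafeFetch {
    /** Runs the fetch; an exception or a null result both become Optional.empty(). */
    static <T> Optional<T> attempt(Callable<T> fetch) {
        try {
            return Optional.ofNullable(fetch.call());
        } catch (Exception e) {
            // Report the problem and let the caller continue with the next item
            System.err.println("skipping item: " + e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<String> ok = attempt(() -> "some lyrics");
        Optional<String> missing = attempt(() -> {
            throw new IllegalStateException("no lyrics element on the page");
        });
        System.out.println(ok.isPresent() + " " + missing.isPresent());
    }
}
```

A caller could then skip empty results instead of checking for null after every fetch.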
3. Project configuration
The Maven configuration is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.lewis</groupId>
    <artifactId>crawlerToES</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <version>2.10.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.10.0</version>
        </dependency>
        <dependency>
            <groupId>io.webfolder</groupId>
            <artifactId>cdp4j</artifactId>
            <version>2.2.2</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.9.2</version>
        </dependency>
        <dependency>
            <groupId>org.elasticsearch.client</groupId>
            <artifactId>x-pack-transport</artifactId>
            <version>5.3.3</version>
        </dependency>
        <dependency>
            <groupId>org.elasticsearch</groupId>
            <artifactId>elasticsearch</artifactId>
            <optional>true</optional>
            <version>5.3.3</version>
        </dependency>
        <dependency>
            <groupId>commons-lang</groupId>
            <artifactId>commons-lang</artifactId>
            <version>2.6</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-jar-plugin</artifactId>
                <configuration>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <executions>
                    <execution>
                        <id>copy-dependencies</id>
                        <phase>package</phase>
                        <goals>
                            <goal>copy-dependencies</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${project.build.directory}/lib</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
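4. Crawler code
a) Fetching the artists
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static List<String> getSingers() {
    String url = host + "/artist/indexAjax?category=0&prefix=&pn=%d";
    List<String> singer_urls = new ArrayList<String>();
    int max_len = 10000;  // hard upper bound on the number of pages requested
    String last = null;   // text of the previous page, used to detect the end
    for (int i = 0; i < max_len; i++) {
        String cur_url = String.format(url, i);
        try {
            Document doc = Jsoup.connect(cur_url).get();
            String cur = doc.text();
            // Stop when a page repeats the previous one
            if (cur.equals(last)) {
                break;
            }
            Elements a_name = doc.getElementsByClass("a_name");
            for (Element a : a_name) {
                String singer_url = a.attr("href");
                System.out.println(singer_url);
                singer_urls.add(singer_url);
            }
            last = cur;
        } catch (IOException e) {
            // Log the failure and move on to the next page rather than retrying it indefinitely
            e.printStackTrace();
        }
    }
    return singer_urls;
}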
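The stopping rule used in getSingers (keep requesting pages until one repeats the previous page, with a hard cap as a safety net) can be isolated from the network code. The sketch below is illustrative only: RepeatStopPager is a made-up name, and the lambda in main is a fake site that keeps returning its last page for out-of-range page numbers, which is the behavior the stopping rule relies on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

final class RepeatStopPager {
    /**
     * Requests pages 0, 1, 2, ... and stops at the first page whose content
     * equals the previous page, or after maxPages requests.
     */
    static List<String> collect(IntFunction<String> fetchPage, int maxPages) {
        List<String> pages = new ArrayList<>();
        String last = null;
        for (int pn = 0; pn < maxPages; pn++) {
            String cur = fetchPage.apply(pn);
            if (cur.equals(last)) {
                break; // same content twice in a row: assume there are no more pages
            }
            pages.add(cur);
            last = cur;
        }
        return pages;
    }

    public static void main(String[] args) {
        // Fake site with three distinct pages; any later page number repeats the last one
        String[] site = {"page-a", "page-b", "page-c"};
        List<String> pages = collect(pn -> site[Math.min(pn, site.length - 1)], 10000);
        System.out.println(pages);
    }
}
```

Passing the fetch as a function also makes the loop easy to exercise without touching the real site.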
b) Fetching all songs of an artist
public static List<String> getSongs(String singer_url) {
    List<String> song_urls = new ArrayList<String>();
    try {
        Document doc = Jsoup.connect(host + singer_url).get();
        String artistid = doc.getElementsByClass("artistTop").get(0).attr("data-artistid");
        String url = "/artist/contentMusicsAjax?artistId=%s&pn=%d&rn=100";
        int i = 0;
        String last = null;
        while (true) {
            String cur_url = String.format(url, artistid, i);
            doc = Jsoup.connect(host + cur_url).get();
            String cur = doc.text();
            // Stop when a page repeats the previous one
            if (cur.equals(last)) {
                break;
            }
            Elements songs = doc.getElementsByClass("listMusic").get(0).children();
            for (Element song : songs) {
                // The link to the song detail page sits inside the "name" element
                String href = song.getElementsByClass("name").get(0)
                        .getElementsByTag("a").get(0).attr("href");
                song_urls.add(href);
            }
            i++;
            last = cur;
        }
    } catch (IOException | IndexOutOfBoundsException e) {
        // Network failure or unexpected page layout: keep whatever was collected so far
        e.printStackTrace();
    }
    return song_urls;
}
c) Fetching the lyrics of a song
public static SongBean getSong(String song_url) {
    try {
        Document doc = Jsoup.connect(host + song_url).get();
        Element div_song = doc.getElementById("musiclrc");
        if (div_song == null) {
            logger.info(song_url + " has no song content; the track may be unlicensed");
            return null;
        }
        String song_name = div_song.getElementById("lrcName").text();
        String singer = div_song.getElementsByClass("artist").get(0)
                .getElementsByTag("span").get(0).getElementsByTag("a").get(0).text();
        String album = div_song.getElementsByClass("album").get(0)
                .getElementsByTag("span").get(0).getElementsByTag("a").get(0).text();
        StringBuilder lyric = new StringBuilder();
        Element lyric_div = div_song.getElementById("llrcId");
        if (lyric_div != null) {
            // Each child element holds one line of the lyrics
            for (Element p : lyric_div.children()) {
                lyric.append(p.text()).append("\n");
            }
        } else {
            lyric.append("暂无歌词");  // placeholder text meaning "no lyrics available"
        }
        return new SongBean(song_name, host + song_url, lyric.toString(), singer, album);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
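The post does not show SongBean itself. The class below is a reconstruction, inferred only from how it is used here: the constructor call in getSong (name, href, lyric, singer, album) and the getters called when writing to Elasticsearch. The real class may differ, for example by being public or mutable.

```java
final class SongBean {
    private final String name;
    private final String href;
    private final String lyric;
    private final String singer;
    private final String album;

    // Parameter order follows the call in getSong: name, href, lyric, singer, album
    SongBean(String name, String href, String lyric, String singer, String album) {
        this.name = name;
        this.href = href;
        this.lyric = lyric;
        this.singer = singer;
        this.album = album;
    }

    String getName()   { return name; }
    String getHref()   { return href; }
    String getLyric()  { return lyric; }
    String getSinger() { return singer; }
    String getAlbum()  { return album; }
}
```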
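5. Writing to Elasticsearch
a) Creating the Elasticsearch client
import java.net.InetAddress;
import java.net.UnknownHostException;
import org.apache.commons.lang.StringUtils;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.xpack.client.PreBuiltXPackTransportClient;

public static TransportClient openClient(String host, String clusterName, String securityUser) {
    Settings.Builder builder = Settings.builder().put("cluster.name", clusterName);
    // xpack.security.user is only set when credentials are supplied
    if (StringUtils.isNotBlank(securityUser)) {
        builder.put("xpack.security.user", securityUser);
    }
    TransportClient transportClient = new PreBuiltXPackTransportClient(builder.build());
    // host is expected in "ip:port" form
    String ip = host.split(":")[0].trim();
    int port = Integer.parseInt(host.split(":")[1].trim());
    try {
        InetAddress add = InetAddress.getByName(ip);
        transportClient.addTransportAddress(new InetSocketTransportAddress(add, port));
    } catch (UnknownHostException e) {
        // Fail fast: a client without a resolvable address cannot be used
        transportClient.close();
        throw new IllegalArgumentException("Cannot resolve Elasticsearch host: " + ip, e);
    }
    return transportClient;
}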
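openClient splits its host argument on ":" without validating it, so a value such as "localhost" fails with an ArrayIndexOutOfBoundsException. A slightly more defensive parser could look like the sketch below; HostPort is a hypothetical helper, not part of the original project, and it assumes a plain host:port value (no IPv6 literals).

```java
final class HostPort {
    final String host;
    final int port;

    private HostPort(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Parses "host:port", rejecting missing parts and out-of-range ports. */
    static HostPort parse(String address) {
        int idx = address.lastIndexOf(':');
        if (idx <= 0 || idx == address.length() - 1) {
            throw new IllegalArgumentException("expected host:port but got: " + address);
        }
        String host = address.substring(0, idx).trim();
        if (host.isEmpty()) {
            throw new IllegalArgumentException("empty host in: " + address);
        }
        int port;
        try {
            port = Integer.parseInt(address.substring(idx + 1).trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("invalid port in: " + address, e);
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range in: " + address);
        }
        return new HostPort(host, port);
    }

    public static void main(String[] args) {
        HostPort hp = parse("127.0.0.1:9300");
        System.out.println(hp.host + " / " + hp.port);
    }
}
```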
b) Bulk-writing to Elasticsearch
import java.util.HashMap;
import java.util.List;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequestBuilder;

public static void writeToES(List<SongBean> list) throws Exception {
    // A bulk request with no actions is rejected, so skip empty input
    if (list.isEmpty()) {
        return;
    }
    BulkRequestBuilder bulkBuilder = client.prepareBulk();
    for (SongBean song : list) {
        IndexRequestBuilder indexBuilder = client.prepareIndex(es_indexName, es_indexType);
        HashMap<String, String> map = new HashMap<>();
        map.put("name", song.getName());
        map.put("singer", song.getSinger());
        map.put("album", song.getAlbum());
        map.put("href", song.getHref());
        map.put("lyric", song.getLyric());
        indexBuilder.setSource(map);
        indexBuilder.setTimeout("5d");
        bulkBuilder.add(indexBuilder);
    }
    BulkResponse bulkResponse = bulkBuilder.execute().actionGet();
    // Check for failures before reporting success
    if (bulkResponse.hasFailures()) {
        throw new Exception(bulkResponse.buildFailureMessage());
    }
    logger.info(String.format("write to elasticsearch songs count = %d", list.size()));
}
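writeToES sends the whole list in a single bulk request. For a large crawl, one option (not part of the original code) is to split the list into fixed-size batches and call writeToES once per batch. A minimal sketch of the splitting step follows; Batches and partition are hypothetical names, and the batch size is a value to tune rather than a recommendation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class Batches {
    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5);
        for (List<Integer> batch : partition(ids, 2)) {
            System.out.println(batch);
        }
    }
}
```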
6. Verifying the results
After the crawl finishes, log in to the Alibaba Cloud Elasticsearch console, open Kibana, and use Dev Tools to query the crawled songs, as follows:
POST songs/sample/_search
{
  "query": {
    "match_all": {}
  }
}