爬虫案例:博客文章列表

博客实例:

爬取博客园文章列表,假设页面的URL是https://www.cnblogs.com/loaderman

要求:

  1. 使用requests获取页面信息,用XPath / re 做数据提取

  2. 获取每个博客里的标题,描述,链接地址,日期等

  3. 保存到 json 文件内

代码

# -*- coding:utf-8 -*-

import urllib2
import json
from lxml import etree

url = "https://www.cnblogs.com/loaderman/"
headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}

request = urllib2.Request(url, headers=headers)

html = urllib2.urlopen(request).read()
# 响应返回的是字符串,解析为HTML DOM模式 text = etree.HTML(html)

text = etree.HTML(html)
# 返回所有结点位置,contains()模糊查询方法,第一个参数是要匹配的标签,第二个参数是标签名部分内容
node_list = text.xpath('//div[contains(@class, "post")]')
print (node_list)
items = {}
for each in node_list:
    print (each)
    title = each.xpath(".//h2/a[@class='postTitle2']/text()")[0]
    detailUrl = each.xpath(".//a[@class='postTitle2']/@href")[0]
    content = each.xpath(".//div[@class='c_b_p_desc']/text()")[0]
    date = each.xpath(".//p[@class='postfoot']/text()")[0]

    items = {
        "title": title,
        "image": detailUrl,
        "content": content,
        "date": date,

    }

    with open("loaderman.json", "a") as f:
        f.write(json.dumps(items, ensure_ascii=False).encode("utf-8") + "\n")

效果:

爬虫案例:博客文章列表

上一篇:爬虫(六):XPath、lxml模块


下一篇:爬虫多进程,etree和xpath