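I. When to use it
CrawlSpider can crawl a site automatically by following links, whether or not the site's URLs follow a regular pattern.
II. Code walkthrough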
(1) Create the Scrapy project
- E:\myweb>scrapy startproject mycwpjt
- New Scrapy project 'mycwpjt', using template directory 'd:\\python35\\lib\\site-packages\\scrapy\\templates\\project', created in:
- D:\Python35\myweb\part16\mycwpjt
- You can start your first spider with:
- cd mycwpjt
- scrapy genspider example example.com
(2) Create the spider
- E:\myweb>scrapy genspider -t crawl weisun sohu.com
- Created spider 'weisun' using template 'crawl' in module:
- mycwpjt.spiders.weisun
(3) Write the item
- # -*- coding: utf-8 -*-
- # Define here the models for your scraped items
- #
- # See documentation in:
- # http://doc.scrapy.org/en/latest/topics/items.html
- import scrapy
- class MycwpjtItem(scrapy.Item):
-     # define the fields for your item here like:
-     name = scrapy.Field()  # title of the news page
-     link = scrapy.Field()  # URL of the news page
(4) Write the pipeline
- # -*- coding: utf-8 -*-
- # Define your item pipelines here
- #
- # Don't forget to add your pipeline to the ITEM_PIPELINES setting
- # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
- class MycwpjtPipeline(object):
-     def process_item(self, item, spider):
-         # Print each scraped field, then pass the item on to the next pipeline
-         print(item["name"])
-         print(item["link"])
-         return item
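Because `.extract()` returns a list of strings, the pipeline above prints lists rather than plain values. As a minimal sketch of what a pipeline could do beyond printing, the class below keeps the first extracted value of each field and strips surrounding whitespace. The class name `CleanItemPipeline` and the sample item are hypothetical and not part of the original project; a plain dict stands in for the item so the snippet runs without Scrapy.

```python
class CleanItemPipeline(object):  # hypothetical name, not in the original project
    def process_item(self, item, spider):
        # Each field holds a list of strings; keep the first one, stripped.
        for key in ("name", "link"):
            values = item.get(key) or []
            item[key] = values[0].strip() if values else None
        return item


# A plain dict stands in for MycwpjtItem here (made-up sample data).
item = {
    "name": ["  Example headline  "],
    "link": ["http://news.sohu.com/20160926/n469167364.shtml"],
}
cleaned = CleanItemPipeline().process_item(item, spider=None)
print(cleaned)
```

If a field is missing or empty, this sketch stores `None` instead of raising an error; whether that is the right policy depends on the project.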
(5) Configure settings
- ITEM_PIPELINES = {
-     'mycwpjt.pipelines.MycwpjtPipeline': 300,
- }
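The integer assigned to each pipeline sets its position in the processing chain: items pass through pipelines from the lowest value to the highest, and values are customarily chosen in the 0-1000 range. The sketch below only illustrates that ordering with a plain dictionary; `SecondPipeline` is a hypothetical entry added for the example and does not exist in the original project.

```python
ITEM_PIPELINES = {
    'mycwpjt.pipelines.MycwpjtPipeline': 300,
    'mycwpjt.pipelines.SecondPipeline': 200,  # hypothetical second pipeline
}

# Sort the pipeline paths by their integer value: lower values run first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```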
(6) Write the spider
- # -*- coding: utf-8 -*-
- import scrapy
- from scrapy.linkextractors import LinkExtractor
- from scrapy.spiders import CrawlSpider, Rule
- from mycwpjt.items import MycwpjtItem
- # List the available templates: scrapy genspider -l
- # Generate a spider from the crawl template: scrapy genspider -t crawl weisun sohu.com
- # Start crawling: scrapy crawl weisun --nolog
- class WeisunSpider(CrawlSpider):
-     name = 'weisun'
-     allowed_domains = ['sohu.com']
-     start_urls = ['http://sohu.com/']
-     rules = (
-         # News page URLs look like:
-         # "http://news.sohu.com/20160926/n469167364.shtml"
-         # so the regular expression used to extract them is '.*?/n.*?shtml'
-         Rule(LinkExtractor(allow=r'.*?/n.*?shtml', allow_domains='sohu.com'), callback='parse_item', follow=True),
-     )
-     def parse_item(self, response):
-         i = MycwpjtItem()
-         #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
-         # Extract the title of the news page with an XPath expression
-         i["name"] = response.xpath("/html/head/title/text()").extract()
-         # Extract the URL of the current news page with an XPath expression
-         i["link"] = response.xpath("//link[@rel='canonical']/@href").extract()
-         return i
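Before running the crawl, it can help to check which URLs the allow pattern accepts. The sketch below uses only the standard `re` module and assumes the pattern is applied as an unanchored search against the full URL. The first URL is the sample from the spider's comment; the other URLs and the tighter pattern `STRICT` are hypothetical, added for illustration.

```python
import re

# The allow pattern used in the rule.
ALLOW = re.compile(r'.*?/n.*?shtml')
# A tighter alternative (hypothetical): "/n", digits, then a literal ".shtml" at the end.
STRICT = re.compile(r'/n\d+\.shtml$')

urls = [
    "http://news.sohu.com/20160926/n469167364.shtml",  # sample from the spider's comment
    "http://news.sohu.com/",                           # made up: no "shtml"
    "http://sohu.com/a/12345",                         # made up: no "/n"
    "http://news.sohu.com/index.shtml",                # made up: "/n" comes from "//news"
]


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


for url in urls:
    print(url, matches(ALLOW, url), matches(STRICT, url))
```

The last URL shows that the original pattern is fairly loose: the `/n` can be satisfied by the `//news` in the host name, so any `.shtml` page under `news.sohu.com` would pass. Whether that matters depends on which pages the crawl is meant to collect.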
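CrawlSpider is the spider commonly used for sites whose links follow a set of rules. It is based on Spider and adds a few attributes and methods of its own:
- rules: a collection of Rule objects that define which links on the target site are followed and which are filtered out.
- parse_start_url: called for the responses to the start URLs; it must return an Item, a Request, or an iterable containing either.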
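Since rules is a collection of Rule objects, a brief introduction to Rule is in order. Its parameters are: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None. When follow is left as None, it defaults to True if no callback is given and to False otherwise.
The link_extractor can be one you define yourself or an instance of the existing LinkExtractor class, whose main parameters are: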
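- allow: URLs matching this regular expression (or list of regular expressions) are extracted. If it is empty, every link matches.
- deny: URLs matching this regular expression (or list of regular expressions) are never extracted. It takes precedence over allow.
- allow_domains: the domains from which links are extracted.
- deny_domains: the domains from which links are never extracted.
- restrict_xpaths: an XPath expression (or list of expressions) selecting the regions of the page in which links are looked for; it filters links together with allow.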
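How these filters combine can be sketched in plain Python. This is a simplified model written for illustration, not Scrapy's actual implementation: the function names `domain_in` and `would_extract` are hypothetical, restrict_xpaths is left out because it needs a parsed page, and the `example.com` URL is made up.

```python
import re
from urllib.parse import urlparse


def domain_in(url, domains):
    """True if the URL's host equals, or is a subdomain of, one of the domains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def would_extract(url, allow=(), deny=(), allow_domains=(), deny_domains=()):
    """Simplified model of the allow/deny filters (assumption, not Scrapy code)."""
    if allow_domains and not domain_in(url, allow_domains):
        return False
    if deny_domains and domain_in(url, deny_domains):
        return False
    # An empty allow list accepts every URL.
    if allow and not any(re.search(p, url) for p in allow):
        return False
    # deny wins over allow: any match rejects the URL.
    if deny and any(re.search(p, url) for p in deny):
        return False
    return True


news = "http://news.sohu.com/20160926/n469167364.shtml"
pattern = [r'.*?/n.*?shtml']

print(would_extract(news, allow=pattern, allow_domains=["sohu.com"]))
print(would_extract("http://example.com/n1.shtml", allow=pattern, allow_domains=["sohu.com"]))
print(would_extract(news, allow=pattern, deny=[r'/20160926/']))
```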