The Scrapy Framework
Introduction
Scrapy's five core components and two middleware layers
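Spiders:
Spiders process every Response, parse it to extract the data needed for the Item fields, and hand any follow-up URLs back to the Engine, which passes them to the Scheduler again.
Engine:
The Engine coordinates communication among the Spiders, Item Pipeline, Downloader, and Scheduler, passing signals and data between them.
Scheduler:
The Scheduler accepts the Requests sent by the Engine, orders and enqueues them, and returns them to the Engine when the Engine asks for the next one.
Downloader:
The Downloader fetches every Request sent by the Engine and returns the resulting Responses to the Engine, which passes them on to the Spider for processing.
Item Pipeline:
The Item Pipeline post-processes the Items produced by the Spiders (for example analyzing, filtering, and storing them).
Downloader Middlewares:
A hook layer between the Engine and the Downloader where you can customize and extend download behavior.
Spider Middlewares:
A hook layer between the Engine and the Spiders where you can customize and process what passes between them (for example, the Responses going into a Spider and the Requests coming out of it).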
Scrapy data-flow diagram:
(Figure not reproduced here. In the original diagram, the green lines mark the data flow.)
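The component roles described above can be sketched as a toy loop in plain Python. This is not Scrapy's implementation: the real Engine is asynchronous, and every name here (fake_download, parse, run_engine, the page data) is a made-up stand-in used only to show which component hands what to which.

```python
from collections import deque


def fake_download(url):
    """Stand-in for the Downloader: returns a canned 'response' for a URL."""
    pages = {
        "page1": {"books": ["A", "B"], "next": "page2"},
        "page2": {"books": ["C"], "next": None},
    }
    return pages[url]


def parse(response):
    """Stand-in for a Spider callback: yields items and follow-up requests."""
    for title in response["books"]:
        yield {"title": title}          # an item (a dict here)
    if response["next"]:
        yield response["next"]          # a follow-up request (just a URL string here)


def run_engine(start_url):
    scheduler = deque([start_url])      # Scheduler: queue of pending requests
    stored = []                         # what the Item Pipeline has kept
    while scheduler:
        request = scheduler.popleft()   # Engine takes the next request from the Scheduler
        response = fake_download(request)   # Engine hands it to the Downloader
        for result in parse(response):      # Engine passes the response to the Spider
            if isinstance(result, dict):
                stored.append(result)       # items go to the Item Pipeline
            else:
                scheduler.append(result)    # new requests go back to the Scheduler
    return stored


items = run_engine("page1")
print(items)
```

The loop ends when the Scheduler has no pending requests left, which mirrors how a crawl finishes once no follow-up URLs remain.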
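Installation:
pip install scrapy
Common Scrapy commands
Create a project: scrapy startproject xxx
Enter the project: cd xxx
Basic spider: scrapy genspider xxx (spider name) xxx.com (domain to crawl)
A rule-based spider uses a different genspider command; the two commands before it stay the same
Rule-based spider: scrapy genspider -t crawl xxx (spider name) xxx.com (domain to crawl)
Run a spider: scrapy crawl xxx (spider name)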
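Creating a project
(1) Create a folder named scrapyDemo1.
(2) Open a command-line window in that folder and run: scrapy startproject demo1
This creates the Scrapy project folder demo1 inside scrapyDemo1, containing the files for each Scrapy component.
(3) Enter the project folder: cd demo1
(4) Generate a spider with scrapy genspider <spider name> <domain>, e.g.: scrapy genspider demo1spider baidu.com
(5) Run the spider with scrapy crawl <spider name>, e.g.: scrapy crawl demo1spider
This command is normally typed in a terminal and prints a lot of log output, which is inconvenient, so it is common to write a small run.py that launches the crawl. Its content is almost always the same, and it sits in the same directory as scrapy.cfg.
run.py
from scrapy import cmdline
cmdline.execute('scrapy crawl demo1spider --nolog'.split())  # --nolog suppresses log output on the console
After steps (3) and (4) the spider project exists and can be started. In practice, once the spider has been generated you write its code and adjust the configuration before running it.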
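A brief introduction to configuring the Scrapy project files
settings.py holds the global configuration
Project name, User-Agent, robots.txt rule, concurrency, download delay, cookies, default request headers, item pipeline priorities
Pay attention to the comments; the most commonly used settings are annotated below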
# Scrapy settings for demo1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'demo1'  # project (bot) name
SPIDER_MODULES = ['demo1.spiders']  # modules where Scrapy looks for spiders
NEWSPIDER_MODULE = 'demo1.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'demo1 (+http://www.yourdomain.com)'  # important and usually set; a real browser's value can be copied from its network inspector
# Obey robots.txt rules
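ROBOTSTXT_OBEY = False  # robots.txt rule; the generated project sets True, which makes Scrapy skip URLs the site's robots.txt disallows. It is often changed to False because otherwise little can be crawled; check that you are permitted to crawl the site first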
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # maximum number of concurrent requests (Scrapy is asynchronous, so these are not threads)
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
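#DOWNLOAD_DELAY = 1  # delay between requests to the same site, in seconds; the default is 0 (the generated template suggests 3). 1 second is a reasonable compromise, and a fraction of a second also works when there is a lot to crawl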
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
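#COOKIES_ENABLED = False  # whether cookies are handled; enabled by default, in which case cookies set by the site are kept and sent back during the crawl. Uncomment this line to disable them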
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}  # default request headers; the USER_AGENT above is itself sent as a request header. Adjust these to suit the site you are crawling
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'demo1.middlewares.Demo1SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'demo1.middlewares.Demo1DownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
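#    'demo1.pipelines.Demo1Pipeline': 300,
#    'demo1.pipelines.Demo1MySqlPipeline': 200,
#}  # item pipelines; the number (e.g. 300) sets the order, conventionally in the range 0-1000, and lower values run first. pipelines.py here defines two pipelines, one that handles the scraped page data and one that stores it in the database; with these values the database pipeline (200) runs first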
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
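The ordering rule for ITEM_PIPELINES (lower number runs first) can be illustrated with a few lines of plain Python. This is only a sketch of the rule, not Scrapy's own code; pipeline_order is a hypothetical helper, and the dictionary simply repeats the two commented-out entries from the listing above.

```python
# Same shape as the ITEM_PIPELINES setting: class path -> order value
ITEM_PIPELINES = {
    "demo1.pipelines.Demo1Pipeline": 300,
    "demo1.pipelines.Demo1MySqlPipeline": 200,
}


def pipeline_order(pipelines):
    """Return the class paths sorted by order value, lowest (runs first) to highest."""
    return [path for path, order in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
print(order)
```

With these values an item would pass through Demo1MySqlPipeline before Demo1Pipeline.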
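Logging configuration
Relevant settings
LOG_LEVEL = ""  # set to one of the level names listed below
LOG_FILE = "logname.log"  # file to write the log to
Adding --nolog when running the spider stops log messages from being printed to the console:
scrapy crawl demo1spider --nolog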
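Log levels
1. DEBUG
Debugging information
2. INFO
General information
3. WARNING
Warnings
4. ERROR
Ordinary errors
5. CRITICAL
Critical errors
Setting LOG_LEVEL = "WARNING" logs only messages at WARNING level and above, that is WARNING, ERROR, and CRITICAL.
The default level is 1 (DEBUG).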
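Scrapy's logging is built on Python's standard logging module, so the filtering effect of a level can be tried without running a crawl. The sketch below uses only the standard library; the logger name loglevel_demo is arbitrary, and the threshold is set directly on the logger rather than through Scrapy's LOG_LEVEL setting.

```python
import io
import logging

# Collect the output in memory so it can be inspected afterwards
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s"))

logger = logging.getLogger("loglevel_demo")
logger.addHandler(handler)
logger.propagate = False              # keep the demo output out of the root logger
logger.setLevel(logging.WARNING)      # same threshold as LOG_LEVEL = "WARNING"

# Emit one message at each of the five levels
for level in (logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR, logging.CRITICAL):
    logger.log(level, "message")

emitted = stream.getvalue().split()
print(emitted)
```

Only the three levels at or above the threshold reach the handler; DEBUG and INFO are dropped.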
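Exporting to several formats
Add the -o option when running the spider.
JSON; non-ASCII characters are written as \uXXXX escapes by default
scrapy crawl <spider name> -o <file name>.json
JSON Lines; non-ASCII characters are also escaped by default
scrapy crawl <spider name> -o <file name>.jsonlines
CSV (comma-separated values), which can be opened in Excel
scrapy crawl <spider name> -o <file name>.csv
XML
scrapy crawl <spider name> -o <file name>.xml
For JSON output, add the following to settings.py to set the encoding; otherwise Chinese text appears as escape sequences instead of readable characters:
FEED_EXPORT_ENCODING = 'utf-8'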
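The difference between escaped and readable JSON can be seen with the standard json module. This does not call Scrapy's exporter; it only illustrates the same kind of escaping, with ensure_ascii playing a role comparable to leaving FEED_EXPORT_ENCODING unset versus setting it to 'utf-8'. The sample title is made up.

```python
import json

item = {"title": "Python编程"}

escaped = json.dumps(item)                        # non-ASCII written as \uXXXX escapes
readable = json.dumps(item, ensure_ascii=False)   # non-ASCII kept as real characters

print(escaped)
print(readable)
```

Both strings decode to the same data; only the readability of the file differs.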
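XPath
The Selector class
1. from scrapy.selector import Selector
imports the Selector class.
2. selector = Selector(text=htmlText)
loads the HTML document into a Selector object, on which the xpath() method can then be called.
3. xpath() calls can be chained: xpath() returns a list of Selector objects (a SelectorList), and xpath() can be called again on that list.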
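Finding HTML elements with XPath
1. "//" selects matching nodes at any depth below the current node, "/" selects nodes one level below the current node, and "." refers to the current node.
2. xpath() returns a list of Selector objects (a SelectorList). Calling extract() on it returns a list of strings, one per matched node, and extract_first() returns the first element of that list, or None if the list is empty. A single Selector object has no extract_first() method.
3. "/@attrName" selects the attrName attribute node of an element; the result is also a Selector object.
4. "/text()" selects the text nodes contained in an element; each text node is also a Selector object, and extract() returns its text value.
5. "tag[condition1 and condition2...]" restricts which tag elements match; the conditions typically test the tag's attributes.
6. position() restricts the match to the element at a given position; positions start at 1.
s = selector.xpath("//book[position()=1]/title").extract_first()
7. "*" matches any element node, but not text or comment nodes.
8. "@*" matches any attribute.
9. "element/parent::*" selects the parent node of element; there is only one.
10. "element/following-sibling::*" selects all siblings that follow element at the same level; "element/following-sibling::*[position()=1]" selects only the first following sibling.
11. "element/preceding-sibling::*" selects all siblings that precede element at the same level.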
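The yield keyword
1. yield is similar to return, but it belongs to a generator.
Generators in detail
If yield is new to you, start by thinking of it as a return. An ordinary return hands a value back to the caller, and the function stops running at that point. Once that is clear, treat yield as part of a generator: a function that contains yield is a generator function, and calling it returns a generator, which is a kind of iterator.
2. yield versus return
Calling a function that contains yield does not run its body; it returns a generator. A generator is advanced with next(), which produces the next value. Each next() call resumes where the previous one stopped, so the function body does not restart from the beginning: it continues from the last yield, runs until it reaches the next yield, hands back that value, and pauses again.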
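A short example makes the pause-and-resume behavior concrete. The function name countdown and the trace list are illustrative only; the trace records when the body actually starts running.

```python
trace = []


def countdown(n):
    """Generator function: yields n, n-1, ..., 1."""
    trace.append("body started")
    while n > 0:
        yield n          # hand back a value and pause here
        n -= 1           # execution resumes on this line at the next next()


gen = countdown(3)
started_before_next = list(trace)   # still empty: calling the function ran nothing

first = next(gen)     # runs the body up to the first yield
second = next(gen)    # resumes after the yield, loops, and yields again
rest = list(gen)      # consumes whatever is left

print(started_before_next, first, second, rest, trace)
```

This is the same mechanism a spider's parse method relies on when it yields items and requests one at a time.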
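Example
Scraping book data from Dangdang and saving it to MySQL
The example builds on the demo1 project
Inspecting the pages
Dangdang book search: http://search.dangdang.com
After searching for python, the URL becomes: http://search.dangdang.com/?key=python&act=input
Going to the next page (page 2): http://search.dangdang.com/?key=python&act=input&page_index=2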
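The URL pattern observed above can be reproduced with the standard library. search_url is a hypothetical helper, and the sketch assumes that page 1 carries no page_index parameter, as in the URLs listed; the site may accept other forms.

```python
from urllib.parse import urlencode

BASE_URL = "http://search.dangdang.com/"


def search_url(key, page=1):
    """Build a search URL following the pattern observed in the browser."""
    params = {"key": key, "act": "input"}
    if page > 1:
        params["page_index"] = page   # only later pages carry page_index
    return BASE_URL + "?" + urlencode(params)


print(search_url("python"))
print(search_url("python", page=2))
```

urlencode also percent-encodes the keyword, which matters for search terms that contain spaces or non-ASCII characters.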
Fields to scrape:
title: book title
author: author
date: publication date
publisher: publisher
detail: detailed description
price: price
Create the database and table in MySQL:
show databases ;
create database ddbookdb;
use ddbookdb;
create table books(
btitle varchar(512) primary key ,
bauthor varchar(256),
bpublisher varchar(256),
bdate varchar(32),
bprice varchar(16),
bdetail text
);
select * from books;
Write the run.py launcher:
from scrapy import cmdline
cmdline.execute('scrapy crawl demo1spider --nolog'.split())  # --nolog suppresses log output on the console
Write the item class Demo1Item in items.py:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class Demo1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()      # book title
    author = scrapy.Field()     # author
    date = scrapy.Field()       # publication date
    publisher = scrapy.Field()  # publisher
    detail = scrapy.Field()     # detailed description
    price = scrapy.Field()      # price
Write pipelines.py with two methods that open and close the MySQL connection. Each item passed in is stored in the MySQL database, and a count variable tracks how many books were scraped:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql


class Demo1Pipeline:
    def open_spider(self, spider):
        print("opened")
        # Set defaults first so close_spider still works if the connection fails
        self.opend = False
        self.count = 0
        try:
            # Replace the connection details with your own
            self.con = pymysql.connect(host="127.0.0.1", port=3306, user="root", passwd="lzyft1030", db="ddbookdb", charset="utf8")
            self.cursor = self.con.cursor(pymysql.cursors.DictCursor)  # create a cursor
            self.cursor.execute("delete from books")  # clear rows left over from earlier runs
            self.opend = True
        except Exception as err:
            print(err)

    def close_spider(self, spider):
        if self.opend:
            self.con.commit()  # commit the inserts
            self.con.close()   # close the connection
            self.opend = False
        print("closed")
        print("Total books scraped:", self.count)

    def process_item(self, item, spider):
        try:
            # Store the item in MySQL
            if self.opend:
                self.cursor.execute(
                    "insert into books(btitle, bauthor, bpublisher, bdate, bprice, bdetail) values(%s, %s, %s, %s, %s, %s)",
                    (item["title"], item["author"], item["publisher"], item["date"], item["price"], item["detail"]),
                )
                # Count the stored books
                self.count += 1
        except Exception as err:
            print(err)
        return item
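The open, insert, commit, close pattern used by the pipeline can be tried without a MySQL server. The sketch below swaps in the standard-library sqlite3 module with an in-memory database, so it is not the pipeline itself: SqliteBookStore, its reduced three-column table, and the sample items are all hypothetical. One visible difference is the placeholder style, "?" in sqlite3 versus "%s" in PyMySQL.

```python
import sqlite3


class SqliteBookStore:
    """Hypothetical stand-in for the pipeline, backed by sqlite3 instead of MySQL."""

    def open(self):
        self.con = sqlite3.connect(":memory:")
        self.con.execute(
            "create table books (btitle text primary key, bauthor text, bprice text)"
        )
        self.count = 0

    def process_item(self, item):
        try:
            # Placeholders keep the values out of the SQL string itself
            self.con.execute(
                "insert into books (btitle, bauthor, bprice) values (?, ?, ?)",
                (item["title"], item["author"], item["price"]),
            )
            self.count += 1
        except sqlite3.IntegrityError as err:
            # A repeated title violates the primary key, as it would in the books table
            print("skipped:", err)
        return item

    def close(self):
        self.con.commit()
        rows = self.con.execute("select btitle from books order by btitle").fetchall()
        self.con.close()
        return rows


store = SqliteBookStore()
store.open()
samples = [
    {"title": "Book A", "author": "X", "price": "10.00"},
    {"title": "Book B", "author": "Y", "price": "20.00"},
    {"title": "Book A", "author": "X", "price": "10.00"},  # duplicate title
]
for sample in samples:
    store.process_item(sample)
rows = store.close()
print(store.count, rows)
```

Because btitle is the primary key, two books with the same title cannot both be stored; the duplicate is reported and skipped rather than counted.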
Modify settings.py (change the robots.txt rule, add a User-Agent, and enable ITEM_PIPELINES so that items are passed to the Demo1Pipeline class and stored in MySQL):
# Scrapy settings for demo1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'demo1'  # project (bot) name
SPIDER_MODULES = ['demo1.spiders']  # modules where Scrapy looks for spiders
NEWSPIDER_MODULE = 'demo1.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'demo1 (+http://www.yourdomain.com)'  # important and usually set; a real browser's value can be copied from its network inspector
# Obey robots.txt rules
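ROBOTSTXT_OBEY = False  # robots.txt rule; the generated project sets True, which makes Scrapy skip URLs the site's robots.txt disallows. It is often changed to False because otherwise little can be crawled; check that you are permitted to crawl the site first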
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # maximum number of concurrent requests (Scrapy is asynchronous, so these are not threads)
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
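DOWNLOAD_DELAY = 1  # delay between requests to the same site, in seconds; the default is 0 (the generated template suggests 3). 1 second is a reasonable compromise, and a fraction of a second also works when there is a lot to crawl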
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
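#COOKIES_ENABLED = False  # whether cookies are handled; enabled by default, in which case cookies set by the site are kept and sent back during the crawl. Uncomment this line to disable them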
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}  # default request headers; the USER_AGENT above is itself sent as a request header. Adjust these to suit the site you are crawling
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'demo1.middlewares.Demo1SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'demo1.middlewares.Demo1DownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'demo1.pipelines.Demo1Pipeline': 300,
    #'demo1.pipelines.Demo1MySqlPipeline': 200,
}  # item pipelines; the number (e.g. 300) sets the order, conventionally in the range 0-1000, and lower values run first. Only Demo1Pipeline is enabled here
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Edit demo1spider.py to implement the crawl:
import scrapy
from bs4 import UnicodeDammit
from ..items import Demo1Item


# The spider overrides start_requests instead of relying on start_urls
class Demo1spiderSpider(scrapy.Spider):
    name = 'demo1spider'
    #allowed_domains = ['baidu.com']
    #start_urls = ['http://baidu.com/']  # entry URLs
    key = "python"
    source_url = "http://search.dangdang.com/"

    def start_requests(self):  # entry point; the start_urls list (one or more URLs) could be used instead
        url = Demo1spiderSpider.source_url + "?key=" + Demo1spiderSpider.key
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):  # callback
        # response.body is bytes; response.body.decode() would turn it into text
        try:
            # Let bs4's UnicodeDammit work out the encoding (utf-8 or gbk)
            dammit = UnicodeDammit(response.body, ['utf-8', 'gbk'])
            data = dammit.unicode_markup
            # Build a Selector object so xpath() can be used
            selector = scrapy.Selector(text=data)
            # @ddt-pit must be unquoted to test for the attribute; a quoted string is always true
            lis = selector.xpath("//li[@ddt-pit][starts-with(@class,'line')]")
            for li in lis:
                title = li.xpath("./a[position()=1]/@title").extract_first()
                price = li.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
                author = li.xpath("./p[@class='search_book_author']/span/a/@title").extract_first()
                date = li.xpath("./p[@class='search_book_author']/span[position()=2]/text()").extract_first()
                publisher = li.xpath("./p[@class='search_book_author']/span[position()=3]/a/@title").extract_first()
                detail = li.xpath("./p[@class='detail']/text()").extract_first()  # sometimes missing (None)
                item = Demo1Item()
                item['title'] = title.strip() if title else ""
                item['author'] = author.strip() if author else ""
                item['date'] = date.strip()[1:] if date else ""  # drop the leading separator character
                item['publisher'] = publisher.strip() if publisher else ""
                item['price'] = price.strip() if price else ""
                item['detail'] = detail.strip() if detail else ""
                yield item
            # link is None on the last page
            link = selector.xpath("//div[@class='paging']/ul[@name='Fy']/li[@class='next']/a/@href").extract_first()
            if link:
                url = response.urljoin(link)
                yield scrapy.Request(url=url, callback=self.parse)
        except Exception as err:
            print(err)
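The pagination step turns the href of the "next" link into an absolute URL with response.urljoin. For a plain relative link this behaves like the standard library's urljoin with the page URL as the base, which can be tried without Scrapy. The href values below are hypothetical; the real page may use a different form.

```python
from urllib.parse import urljoin

page_url = "http://search.dangdang.com/?key=python&act=input"

# A relative href, as a "next page" link might contain (hypothetical value)
relative = "/?key=python&act=input&page_index=2"
absolute = urljoin(page_url, relative)

# An href that is already absolute is returned unchanged
already_absolute = "http://search.dangdang.com/?key=python&act=input&page_index=3"
unchanged = urljoin(page_url, already_absolute)

print(absolute)
print(unchanged)
```

Joining against the current page URL means the spider does not need to know whether the site emits relative or absolute links.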
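If the import from ..items import Demo1Item is not resolved, try the following:
Mark the project root as a source root: select the project root directory (the outer demo1 folder) and set it as the source root (the screenshot from the original post is not reproduced here). The import statement written as above then works.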