Chapter 23: scrapy_redis Explained
1. Interacting with Redis from Python
First, install the redis package: pip install redis.
Collecting redis
Downloading redis-3.5.3-py2.py3-none-any.whl (72 kB)
|████████████████████████████████| 72 kB 207 kB/s
Installing collected packages: redis
Successfully installed redis-3.5.3
Create a new Python file, redis_crawl.py, in PyCharm.
Next we connect to Redis, which requires the server's address and port number.
We can define a class and make the connection in its __init__ method.
import redis  # import the redis module first


class StringRedis():
    def __init__(self):
        # connect to Redis
        self.r = redis.StrictRedis(host='127.0.0.1', port=6379)

    # define a set method
    def string_set(self, k, v):
        res = self.r.set(k, v)
        print(res)


if __name__ == '__main__':
    s = StringRedis()
    s.string_set('name', 'Jerry777')
Let's run it:
conn = self.connection or pool.get_connection(command_name, **options)
File "D:\Python38\lib\site-packages\redis\connection.py", line 1192, in get_connection
connection.connect()
File "D:\Python38\lib\site-packages\redis\connection.py", line 563, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 10061 connecting to 127.0.0.1:6379. No connection could be made because the target machine actively refused it.
That throws an error: I had shut the Redis service down earlier, so it needs to be started again:
D:\Download\redis-latest>cd redis-latest
D:\Download\redis-latest\redis-latest>redis-server.exe
[14424] 11 Mar 14:47:45.200 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server.exe /path/to/redis.conf
                _._
           _.-``__ ''-._
      _.-``    `.  `_.  ''-._           Redis 3.0.503 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 14424
  `-._    `-._  `-./  _.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |           http://redis.io
  `-._    `-._`-.__.-'_.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |
  `-._    `-._`-.__.-'_.-'    _.-'
      `-._    `-.__.-'    _.-'
          `-._        _.-'
              `-.__.-'
[14424] 11 Mar 14:47:45.213 # Server started, Redis version 3.0.503
[14424] 11 Mar 14:47:45.219 * DB loaded from disk: 0.001 seconds
[14424] 11 Mar 14:47:45.220 * The server is now ready to accept connections on port 6379
Run it again:
D:\Python38\python.exe D:/work/爬虫/Day26/redis_crawl.py
True
Process finished with exit code 0
In Python the call returns True; run the same SET command in the redis-cli and the reply is OK. That is a difference between the two interfaces.
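As an aside, when you are not sure whether the server is up, redis-py can check connectivity before any commands are issued: ping() returns True when the server answers and raises ConnectionError otherwise. A minimal sketch:
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)
print(r.ping())  # True if the server is reachable; raises ConnectionError if not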
Now let's check it from the Redis side:
D:\Download\redis-latest\redis-latest>redis-cli
127.0.0.1:6379> get name
"Jerry777"
127.0.0.1:6379>
Next, let's define a method in Python to fetch name:
import redis  # import the redis module first


class StringRedis():
    def __init__(self):
        # connect to Redis
        self.r = redis.StrictRedis(host='127.0.0.1', port=6379)

    # define a set method
    def string_set(self, k, v):
        res = self.r.set(k, v)
        print(res)

    # define a get method
    def string_get(self, k):
        res = self.r.get(k)
        return res


if __name__ == '__main__':
    s = StringRedis()
    print(s.string_get('name'))        # print the value
    print(type(s.string_get('name')))  # print its type
Run it:
b'Jerry777'
<class 'bytes'>
The result comes back as bytes, which is inconvenient for processing the data. We can fix that by adding one more parameter to the self.r line above:
self.r = redis.StrictRedis(host='127.0.0.1', port=6379, decode_responses=True)
Run it again:
Jerry777
<class 'str'>
Now we get a string.
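Strings are not the only type we will need. scrapy_redis, introduced below, keeps its request queue in a Redis list (or sorted set) and its dedup fingerprints in a Redis set, so it is worth seeing those operations once. A minimal sketch, with made-up key names and URLs:
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, decode_responses=True)

# List operations: push onto one end, pop from the other (a simple queue).
r.lpush('start_urls', 'http://example.com/page1')  # hypothetical key and URLs
r.lpush('start_urls', 'http://example.com/page2')
print(r.rpop('start_urls'))  # 'http://example.com/page1'

# Set operations: sadd returns 1 for a new member, 0 for a duplicate.
print(r.sadd('seen', 'fingerprint-1'))  # 1
print(r.sadd('seen', 'fingerprint-1'))  # 0, already in the set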
2. scrapy_redis Explained
Cluster: several workers doing the same job. If one of them takes a day off, the overall work is unaffected.
Distributed: several workers doing different jobs. If one takes a day off, one link in the chain suffers. But the two can be combined into a distributed cluster, where each link is covered by several workers, so a single absence still does no harm.
Earlier we studied Scrapy, the Python crawling framework: it is extremely efficient and highly customizable, but on its own it does not support distributed crawling.
scrapy_redis is a component built on the Redis database that runs on top of Scrapy and gives it a distributed strategy. It also supports master-slave synchronization (for example, one node handles the writes while another serves the reads, kept in sync, which is very efficient).
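The master-slave synchronization mentioned here is Redis replication: writes go to the master, and replicas mirror it so reads can be spread across them. As a hedged sketch, a replica is configured with a single directive in its redis.conf (the master address below is a made-up example; Redis 3.x spells the directive slaveof, newer versions call it replicaof):
# redis.conf on the replica: point it at the master (example address)
slaveof 192.168.1.100 6379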
(Part of the content here, an illustrated walkthrough, still needs to be added.)
3. Downloading the scrapy_redis example
scrapy_redis is a Redis-based Scrapy component. There is a repository page,
https://github.com/rolando/scrapy-redis.git
from which the sample project files can be downloaded. We simply copy the example folder it contains into a folder in our local PyCharm workspace. It supports Python 3.4 and above.
Next, install the package (using the Tsinghua mirror): pip install scrapy_redis -i https://pypi.tuna.tsinghua.edu.cn/simple/
C:\Users\MI>pip install scrapy_redis -i https://pypi.tuna.tsinghua.edu.cn/simple/
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple/
Collecting scrapy_redis
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/00/91/bbc84cb0b95c361e9066d6ec115fd387142c07cabc69c5620761afa36874/scrapy_redis-0.6.8-py2.py3-none-any.whl (19 kB)
Requirement already satisfied: six>=1.5.2 in d:\python38\lib\site-packages (from scrapy_redis) (1.15.0)
Requirement already satisfied: redis>=2.10 in d:\python38\lib\site-packages (from scrapy_redis) (3.5.3)
Requirement already satisfied: Scrapy>=1.0 in d:\python38\lib\site-packages (from scrapy_redis) (2.4.1)
Requirement already satisfied: itemadapter>=0.1.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (0.2.0)
Requirement already satisfied: protego>=0.1.15 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (0.1.16)
Requirement already satisfied: cryptography>=2.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (3.3.1)
Requirement already satisfied: zope.interface>=4.1.3 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (5.2.0)
Requirement already satisfied: service-identity>=16.0.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (18.1.0)
Requirement already satisfied: queuelib>=1.4.2 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (1.5.0)
Requirement already satisfied: cssselect>=0.9.1 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (1.1.0)
Requirement already satisfied: pyOpenSSL>=16.2.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (20.0.1)
Requirement already satisfied: w3lib>=1.17.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (1.22.0)
Requirement already satisfied: Twisted>=17.9.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (20.3.0)
Requirement already satisfied: lxml>=3.5.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (4.6.2)
Requirement already satisfied: parsel>=1.5.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (1.6.0)
Requirement already satisfied: PyDispatcher>=2.0.5 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (2.0.5)
Requirement already satisfied: itemloaders>=1.0.1 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (1.0.4)
Requirement already satisfied: cffi>=1.12 in d:\python38\lib\site-packages (from cryptography>=2.0->Scrapy>=1.0->scrapy_redis) (1.14.4)
Requirement already satisfied: pycparser in d:\python38\lib\site-packages (from cffi>=1.12->cryptography>=2.0->Scrapy>=1.0->scrapy_redis) (2.20)
Requirement already satisfied: jmespath>=0.9.5 in d:\python38\lib\site-packages (from itemloaders>=1.0.1->Scrapy>=1.0->scrapy_redis) (0.10.0)
Requirement already satisfied: pyasn1 in d:\python38\lib\site-packages (from service-identity>=16.0.0->Scrapy>=1.0->scrapy_redis) (0.4.8)
Requirement already satisfied: attrs>=16.0.0 in d:\python38\lib\site-packages (from service-identity>=16.0.0->Scrapy>=1.0->scrapy_redis) (20.3.0)
Requirement already satisfied: pyasn1-modules in d:\python38\lib\site-packages (from service-identity>=16.0.0->Scrapy>=1.0->scrapy_redis) (0.2.8)
Requirement already satisfied: Automat>=0.3.0 in d:\python38\lib\site-packages (from Twisted>=17.9.0->Scrapy>=1.0->scrapy_redis) (20.2.0)
Requirement already satisfied: incremental>=16.10.1 in d:\python38\lib\site-packages (from Twisted>=17.9.0->Scrapy>=1.0->scrapy_redis) (17.5.0)
Requirement already satisfied: constantly>=15.1 in d:\python38\lib\site-packages (from Twisted>=17.9.0->Scrapy>=1.0->scrapy_redis) (15.1.0)
Requirement already satisfied: PyHamcrest!=1.10.0,>=1.9.0 in d:\python38\lib\site-packages (from Twisted>=17.9.0->Scrapy>=1.0->scrapy_redis) (2.0.2)
Requirement already satisfied: hyperlink>=17.1.1 in d:\python38\lib\site-packages (from Twisted>=17.9.0->Scrapy>=1.0->scrapy_redis) (21.0.0)
Requirement already satisfied: idna>=2.5 in d:\python38\lib\site-packages (from hyperlink>=17.1.1->Twisted>=17.9.0->Scrapy>=1.0->scrapy_redis) (2.10)
Requirement already satisfied: setuptools in d:\python38\lib\site-packages (from zope.interface>=4.1.3->Scrapy>=1.0->scrapy_redis) (49.2.1)
Installing collected packages: scrapy-redis
Successfully installed scrapy-redis-0.6.8
Go back to that GitHub page and download the example. After unzipping the download, copy the example-project folder onto the local machine.
Inside it there is a requirements.txt file; everything it lists is already installed.
Opening the folders one by one (the original screenshots marked them with red circles), we can see that the scrapy-redis example ships with only a handful of files. We will mainly look at the three spider files under spiders/ (dmoz.py, mycrawler_redis.py, and myspider_redis.py); the layout is sketched below.
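Since the screenshots are not reproduced here, the example project's layout is roughly the following (taken from the scrapy-redis repository; exact contents may vary by version):
example-project/
    scrapy.cfg
    requirements.txt
    process_items.py
    example/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            dmoz.py
            mycrawler_redis.py
            myspider_redis.py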
Let's start with dmoz.py. It is not quite like an ordinary spider file.
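For reference (the original post showed it in a screenshot), dmoz.py in the example project looks roughly like this: a CrawlSpider driven by link-extraction rules rather than a plain start_urls-and-parse spider. Details may differ between versions of the repository:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    """Follow categories and extract links."""
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/']

    rules = [
        # follow category links and hand each page to parse_directory
        Rule(LinkExtractor(restrict_css=('.top-cat', '.sub-cat', '.cat-item')),
             callback='parse_directory', follow=True),
    ]

    def parse_directory(self, response):
        for div in response.css('.title-and-desc'):
            yield {
                'name': div.css('.site-title::text').extract_first(),
                'description': div.css('.site-descr::text').extract_first().strip(),
                'link': div.css('a::attr(href)').extract_first(),
            }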
Now let's look at the project's settings.py:
# Scrapy settings for example project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#
SPIDER_MODULES = ['example.spiders']
NEWSPIDER_MODULE = 'example.spiders'

USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'

# Redis-backed duplicate filter: request fingerprints are stored in a Redis set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Redis-backed scheduler: the request queue lives in Redis and can be shared.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Don't clear the Redis queues on close, so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    # also pushes scraped items into a Redis list
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

LOG_LEVEL = 'DEBUG'

# Introduce an artificial delay to make use of parallelism and to speed up
# the crawl.
DOWNLOAD_DELAY = 1
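One thing these settings leave implicit: scrapy_redis connects to a Redis server on localhost:6379 by default. If Redis runs elsewhere, its address can be supplied through additional settings (the values below are placeholders):
# Optional: point scrapy_redis at a specific Redis server.
REDIS_HOST = '127.0.0.1'  # placeholder address
REDIS_PORT = 6379
# or, equivalently, as a single URL:
# REDIS_URL = 'redis://127.0.0.1:6379'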
That is all for this post; we will break the details down next time.