Crawler (26): scrapy_redis Explained

Table of Contents

Chapter 23: scrapy_redis Explained

1. Interacting with Redis from Python

First, install the redis package: pip install redis.

Collecting redis
  Downloading redis-3.5.3-py2.py3-none-any.whl (72 kB)
     |████████████████████████████████| 72 kB 207 kB/s
Installing collected packages: redis
Successfully installed redis-3.5.3


Create a new Python file in PyCharm called redis_crawl.py.
To connect to Redis we need a host address and a port number.
We can define a class and set up the connection in its initializer.

import redis  # import the redis client module

class StringRedis():
    def __init__(self):
        # connect to the local Redis server
        self.r = redis.StrictRedis(host='127.0.0.1', port=6379)

    # a set method for string keys
    def string_set(self, k, v):
        res = self.r.set(k, v)
        print(res)


if __name__ == '__main__':
    s = StringRedis()
    s.string_set('name', 'Jerry777')


Let's run it:

    conn = self.connection or pool.get_connection(command_name, **options)
  File "D:\Python38\lib\site-packages\redis\connection.py", line 1192, in get_connection
    connection.connect()
  File "D:\Python38\lib\site-packages\redis\connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 10061 connecting to 127.0.0.1:6379. No connection could be made because the target machine actively refused it.


We get a connection error because I shut the Redis server down a moment ago; it needs to be started again:


D:\Download\redis-latest>cd redis-latest

D:\Download\redis-latest\redis-latest>redis-server.exe
[14424] 11 Mar 14:47:45.200 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server.exe /path/to/redis.conf
                _._
           _.-``__ ''-._
      _.-``    `.  `_.  ''-._           Redis 3.0.503 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 14424
  `-._    `-._  `-./  _.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |           http://redis.io
  `-._    `-._`-.__.-'_.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |
  `-._    `-._`-.__.-'_.-'    _.-'
      `-._    `-.__.-'    _.-'
          `-._        _.-'
              `-.__.-'

[14424] 11 Mar 14:47:45.213 # Server started, Redis version 3.0.503
[14424] 11 Mar 14:47:45.219 * DB loaded from disk: 0.001 seconds
[14424] 11 Mar 14:47:45.220 * The server is now ready to accept connections on port 6379

Now run the script again:

D:\Python38\python.exe D:/work/爬虫/Day26/redis_crawl.py
True

Process finished with exit code 0

In Python the set call returns True; in the Redis CLI the same SET command replies with OK instead. That's the difference between them.
Now let's look the value up in redis-cli:

D:\Download\redis-latest\redis-latest>redis-cli
127.0.0.1:6379> get name
"Jerry777"
127.0.0.1:6379>
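For comparison, setting the same key directly in redis-cli gives the reply OK rather than Python's True:

127.0.0.1:6379> set name Jerry777
OK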

Next, let's define a method in Python to fetch name:

import redis  # import the redis client module

class StringRedis():
    def __init__(self):
        # connect to the local Redis server
        self.r = redis.StrictRedis(host='127.0.0.1', port=6379)

    # a set method for string keys
    def string_set(self, k, v):
        res = self.r.set(k, v)
        print(res)

    # a get method for string keys
    def string_get(self, k):
        res = self.r.get(k)
        return res


if __name__ == '__main__':
    s = StringRedis()
    print(s.string_get('name'))        # print the value
    print(type(s.string_get('name')))  # print its type


Run it:

b'Jerry777'
<class 'bytes'>


The result we get back is bytes, which is inconvenient for processing the data. We can fix this by adding a parameter to the self.r connection above:

self.r = redis.StrictRedis(host='127.0.0.1', port=6379, decode_responses=True)

Run it one more time:

Jerry777
<class 'str'>

Now we get a string.
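If you'd rather not change the connection settings, the bytes value can also be decoded per call:

res = s.string_get('name')   # b'Jerry777'
print(res.decode('utf-8'))   # Jerry777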

2. scrapy_redis Explained

Cluster: several workers doing the same job. If one of them takes a day off, the overall work is unaffected.
Distributed: several workers each doing a different job. If one takes a day off, one link in the chain is affected. But you can combine clustering with distribution so that each link is handled by several workers; then losing one worker still doesn't stop the work.
Earlier we learned Scrapy, the Python crawling framework: it is highly efficient and highly customizable, but it does not support distributed crawling on its own.
scrapy_redis is a component based on the Redis database that runs on top of Scrapy and gives it a distributed strategy. It supports master-slave synchronization (for example, one node writes while another reads, with reads and writes kept in sync, which is very efficient).
(A portion of illustrated explanation belongs here and still needs to be filled in.)
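As a preview of what this looks like in code, here is a minimal sketch of a scrapy_redis spider (the spider name, redis_key, and parse logic below are illustrative, not taken from the example project). Instead of hard-coding start_urls, the spider pops its start URLs from a Redis list, which is what lets several machines share one request queue:

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'
    # start URLs are popped from this Redis list, so any number
    # of workers can consume the same queue
    redis_key = 'myspider:start_urls'

    def parse(self, response):
        # illustrative parsing: yield the page URL and title
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }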

3. Downloading the scrapy_redis Example

scrapy_redis is a Redis-based component for Scrapy. There is a page at
https://github.com/rolando/scrapy-redis.git
where you can download the sample project files. Simply copy the example folder inside it into a folder in your local PyCharm workspace. It supports Python 3.4 and above.
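If you prefer the command line, you can also clone the repository instead of downloading the zip:

D:\Download>git clone https://github.com/rolando/scrapy-redis.git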

Next we install it with pip install scrapy_redis -i https://pypi.tuna.tsinghua.edu.cn/simple/ (the Tsinghua mirror):


C:\Users\MI>pip install scrapy_redis -i https://pypi.tuna.tsinghua.edu.cn/simple/
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple/
Collecting scrapy_redis
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/00/91/bbc84cb0b95c361e9066d6ec115fd387142c07cabc69c5620761afa36874/scrapy_redis-0.6.8-py2.py3-none-any.whl (19 kB)
Requirement already satisfied: six>=1.5.2 in d:\python38\lib\site-packages (from scrapy_redis) (1.15.0)
Requirement already satisfied: redis>=2.10 in d:\python38\lib\site-packages (from scrapy_redis) (3.5.3)
Requirement already satisfied: Scrapy>=1.0 in d:\python38\lib\site-packages (from scrapy_redis) (2.4.1)
Requirement already satisfied: itemadapter>=0.1.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (0.2.0)
Requirement already satisfied: protego>=0.1.15 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (0.1.16)
Requirement already satisfied: cryptography>=2.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (3.3.1)
Requirement already satisfied: zope.interface>=4.1.3 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (5.2.0)
Requirement already satisfied: service-identity>=16.0.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (18.1.0)
Requirement already satisfied: queuelib>=1.4.2 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (1.5.0)
Requirement already satisfied: cssselect>=0.9.1 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (1.1.0)
Requirement already satisfied: pyOpenSSL>=16.2.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (20.0.1)
Requirement already satisfied: w3lib>=1.17.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (1.22.0)
Requirement already satisfied: Twisted>=17.9.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (20.3.0)
Requirement already satisfied: lxml>=3.5.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (4.6.2)
Requirement already satisfied: parsel>=1.5.0 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (1.6.0)
Requirement already satisfied: PyDispatcher>=2.0.5 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (2.0.5)
Requirement already satisfied: itemloaders>=1.0.1 in d:\python38\lib\site-packages (from Scrapy>=1.0->scrapy_redis) (1.0.4)
Requirement already satisfied: cffi>=1.12 in d:\python38\lib\site-packages (from cryptography>=2.0->Scrapy>=1.0->scrapy_redis) (1.14.4)
Requirement already satisfied: pycparser in d:\python38\lib\site-packages (from cffi>=1.12->cryptography>=2.0->Scrapy>=1.0->scrapy_redis) (2.20)
Requirement already satisfied: jmespath>=0.9.5 in d:\python38\lib\site-packages (from itemloaders>=1.0.1->Scrapy>=1.0->scrapy_redis) (0.10.0)
Requirement already satisfied: pyasn1 in d:\python38\lib\site-packages (from service-identity>=16.0.0->Scrapy>=1.0->scrapy_redis) (0.4.8)
Requirement already satisfied: attrs>=16.0.0 in d:\python38\lib\site-packages (from service-identity>=16.0.0->Scrapy>=1.0->scrapy_redis) (20.3.0)
Requirement already satisfied: pyasn1-modules in d:\python38\lib\site-packages (from service-identity>=16.0.0->Scrapy>=1.0->scrapy_redis) (0.2.8)
Requirement already satisfied: Automat>=0.3.0 in d:\python38\lib\site-packages (from Twisted>=17.9.0->Scrapy>=1.0->scrapy_redis) (20.2.0)
Requirement already satisfied: incremental>=16.10.1 in d:\python38\lib\site-packages (from Twisted>=17.9.0->Scrapy>=1.0->scrapy_redis) (17.5.0)
Requirement already satisfied: constantly>=15.1 in d:\python38\lib\site-packages (from Twisted>=17.9.0->Scrapy>=1.0->scrapy_redis) (15.1.0)
Requirement already satisfied: PyHamcrest!=1.10.0,>=1.9.0 in d:\python38\lib\site-packages (from Twisted>=17.9.0->Scrapy>=1.0->scrapy_redis) (2.0.2)
Requirement already satisfied: hyperlink>=17.1.1 in d:\python38\lib\site-packages (from Twisted>=17.9.0->Scrapy>=1.0->scrapy_redis) (21.0.0)
Requirement already satisfied: idna>=2.5 in d:\python38\lib\site-packages (from hyperlink>=17.1.1->Twisted>=17.9.0->Scrapy>=1.0->scrapy_redis) (2.10)
Requirement already satisfied: setuptools in d:\python38\lib\site-packages (from zope.interface>=4.1.3->Scrapy>=1.0->scrapy_redis) (49.2.1)
Installing collected packages: scrapy-redis
Successfully installed scrapy-redis-0.6.8

We go back to that page and download the example.
After unzipping the download, copy the example-project folder to your local machine.
It contains a requirements file.
Everything listed in it is already installed.
Open the circled items one by one; you can see that scrapy-redis contains just these few files.
We'll focus mainly on the three files under spiders.

Let's look at dmoz.py.
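The screenshot doesn't reproduce here, so this is roughly what dmoz.py in the example project contains (reconstructed from the scrapy-redis example project; details such as the domain may differ between versions):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    """Follow categories and extract links."""
    name = 'dmoz'
    allowed_domains = ['dmoz-odp.org']
    start_urls = ['http://www.dmoz-odp.org/']

    rules = [
        Rule(LinkExtractor(
            restrict_css=('.top-cat', '.sub-cat', '.cat-item')
        ), callback='parse_directory', follow=True),
    ]

    def parse_directory(self, response):
        # extract the name, description and link of each listed site
        for div in response.css('.title-and-desc'):
            yield {
                'name': div.css('.site-title::text').extract_first(),
                'description': div.css('.site-descr::text').extract_first().strip(),
                'link': div.css('a::attr(href)').extract_first(),
            }

Note that DmozSpider still inherits from Scrapy's own CrawlSpider; the distributed behavior comes from the settings below, not from the spider class itself.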
It's a bit different from the ordinary spider files we've written.
Next, let's look at the settings file:

# Scrapy settings for example project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#
SPIDER_MODULES = ['example.spiders']
NEWSPIDER_MODULE = 'example.spiders'

USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"    # the class used for request deduplication
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

LOG_LEVEL = 'DEBUG'

# Introduce an artificial delay to make use of parallelism and
# to speed up the crawl.
DOWNLOAD_DELAY = 1
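A quick reading of the key lines: DUPEFILTER_CLASS stores request fingerprints in Redis so all workers share a single dedup set; SCHEDULER keeps the request queue itself in Redis so any worker can pull from it; SCHEDULER_PERSIST = True leaves the queue and fingerprints in Redis after the crawl ends, so a crawl can be stopped and resumed; and the commented SCHEDULER_QUEUE_CLASS lines switch the queue between priority, FIFO, and LIFO behavior. For a spider that reads from a redis_key, such as the myspider_redis.py example, the crawl is kicked off by pushing a start URL into that list from redis-cli (the URL here is just a placeholder):

127.0.0.1:6379> lpush myspider:start_urls http://www.example.com
(integer) 1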


That's it for this post; we'll go through the details next time.
