Web Crawlers: Coroutines

2024-03-26 17:43:28

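I. Definition of a coroutine
A coroutine, sometimes called a micro-thread, is a unit of execution lighter than a thread. Coroutines are not provided by the operating system; they are created in user code, by the programmer or by a library. A coroutine is a user-space context-switching technique: put simply, a single thread switches execution back and forth between blocks of code (functions).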
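
As an illustration of "one thread switching between functions", here is a minimal sketch using the standard asyncio module. The names worker and log are made up for this example; asyncio.sleep(0) is used only as a way to hand control back to the event loop.

```python
import asyncio

log = []  # records the order in which the two coroutines run


async def worker(name, steps):
    for i in range(steps):
        log.append(f"{name}-{i}")
        # Hand control back to the event loop so the other coroutine can run.
        await asyncio.sleep(0)


async def main():
    # Both coroutines run in the same thread and take turns at each await.
    await asyncio.gather(worker("a", 2), worker("b", 2))


asyncio.run(main())
print(log)
```

The log shows the two workers interleaved rather than one finishing before the other starts, even though only one thread is involved.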

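II. Characteristics of coroutines
1. Coroutines in one thread switch only at explicit suspension points, so shared global variables need far less protection than they do with threads. State that is read before an await and written after it can still be changed by another coroutine in between.
2. Coroutines achieve concurrency inside a single thread.
3. When a coroutine waits on an asynchronous I/O operation, execution switches to another coroutine that is ready to run. A blocking call (for example time.sleep or a synchronous HTTP request) does not trigger a switch.
4. Coroutines are well suited to I/O-bound work; CPU-bound work is not their strength.
5. Coroutines are efficient because switching between them is a function-level switch in user space, which costs much less than a thread switch, and thread overhead grows as the number of threads grows.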
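
The points above can be checked with a small sketch. It assumes asyncio.sleep as a stand-in for a network request; fake_io is a made-up name. Five simulated requests of 0.2 seconds each run in one thread and finish in roughly the time of one request instead of five.

```python
import asyncio
import threading
import time


async def fake_io(delay, thread_ids):
    # Record which thread this coroutine runs on.
    thread_ids.add(threading.get_ident())
    # Stand-in for a network request: suspends this coroutine without blocking the thread.
    await asyncio.sleep(delay)


async def main():
    thread_ids = set()
    start = time.perf_counter()
    await asyncio.gather(*(fake_io(0.2, thread_ids) for _ in range(5)))
    elapsed = time.perf_counter() - start
    return elapsed, thread_ids


elapsed, thread_ids = asyncio.run(main())
print(f"elapsed: {elapsed:.2f}s, threads used: {len(thread_ids)}")
```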

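III. How coroutines work
A coroutine has its own register context and stack. When the scheduler switches away from a coroutine, that context and stack are saved elsewhere; when it switches back, they are restored and the coroutine resumes from the state it was in when it last gave up control.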
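
The save-and-resume behaviour can be seen with a plain Python generator. This is only a sketch: in Python the saved state is kept in the generator's frame object rather than in literal CPU registers, and counter is a made-up name.

```python
def counter():
    total = 0
    while True:
        # Execution pauses here; `total` is preserved until the next send().
        received = yield total
        total += received


gen = counter()
first = next(gen)      # run up to the first yield
second = gen.send(5)   # resume with the saved state, add 5
third = gen.send(10)   # resume again, add 10
print(first, second, third)
```

Each send() continues from exactly where the previous yield left off, with the local variable total intact.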

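IV. Processes, threads, and coroutines compared
1. A coroutine is neither a process nor a thread. It is a special kind of function and sits at a different level of abstraction from processes and threads.
2. A process can contain multiple threads, and a thread can contain multiple coroutines.
3. Coroutines in the same thread can switch among themselves, but they execute one at a time, and each coroutine runs inside a single thread, so coroutines alone cannot take advantage of multiple CPU cores.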
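
A short sketch of what "one at a time" means in practice. Here time.sleep stands in for CPU-bound or blocking work; blocking and polite are made-up names. A coroutine that never awaits keeps the thread to itself until it returns.

```python
import asyncio
import time

events = []


async def blocking():
    events.append("blocking start")
    time.sleep(0.1)  # blocking call: never hands control back to the event loop
    events.append("blocking end")


async def polite():
    events.append("polite start")
    await asyncio.sleep(0)  # suspension point: other coroutines may run here
    events.append("polite end")


async def main():
    await asyncio.gather(blocking(), polite())


asyncio.run(main())
print(events)
```

The second coroutine cannot start until the first one finishes, which is why CPU-heavy work is usually moved to separate processes or threads instead.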

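V. Ways to implement coroutines
(1) The greenlet module: one of the earliest third-party modules to implement coroutines
(2) The yield keyword
(3) The asyncio.coroutine decorator (added in Python 3.4, deprecated in 3.8, and removed in 3.11)
(4) The async and await keywords (Python 3.5 and later): very convenient and strongly recommended; this is the approach the rest of this article uses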
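
A minimal sketch of option (4). The function add is made up for illustration. Calling a function defined with async def only creates a coroutine object; nothing runs until it is awaited or handed to the event loop.

```python
import asyncio
import inspect


async def add(a, b):
    await asyncio.sleep(0)  # a suspension point, standing in for real I/O
    return a + b


coro = add(1, 2)                     # creates a coroutine object; the body has not run yet
is_coro = inspect.iscoroutine(coro)
result = asyncio.run(coro)           # the event loop actually executes the body
print(is_coro, result)
```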

VI. Examples

1. Downloading 360 images with coroutines (an introduction to asynchronous HTTP requests with aiohttp: https://www.cnblogs.com/fengting0913/p/14926893.html)

import asyncio

import aiohttp

# `async def` is used when defining a coroutine; `await` is used to actually call it.
# Download images with coroutines
async def download(session, url):
    # `async with` is a single construct: an asynchronous context manager.
    # Send the request and receive the response
    async with session.get(url=url) as response:
        content = await response.content.read()
        # Save the image. The built-in file API is synchronous, so there is nothing to await here.
        image_name = url.split('/')[-1]
        with open(image_name, 'wb') as fp:
            fp.write(content)

async def main():

    # Define url_list
    url_list = [
        'https://img0.baidu.com/it/u=291378222,233871465&fm=26&fmt=auto&gp=0.jpg',
        'https://img2.baidu.com/it/u=3466049587,2049802835&fm=26&fmt=auto&gp=0.jpg',
        'https://img0.baidu.com/it/u=213410053,396892388&fm=26&fmt=auto&gp=0.jpg',
        'https://img0.baidu.com/it/u=1380950348,3018255149&fm=26&fmt=auto&gp=0.jpg',
        'https://img1.baidu.com/it/u=4110196045,3829597861&fm=26&fmt=auto&gp=0.jpg'
    ]
    # Create one session and share it across all downloads so that connections can be reused.
    async with aiohttp.ClientSession() as session:
        # Wrap each coroutine in a Task, which schedules it on the event loop (it starts out ready to run).
        tasks = [
            asyncio.create_task(download(session, i)) for i in url_list
        ]
        # Wait for every task to finish. Whenever one task is suspended on I/O, the event loop runs another.
        await asyncio.gather(*tasks)

if __name__ == '__main__':
    # asyncio.run() creates an event loop, runs main() until it completes, then closes the loop.
    asyncio.run(main())
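
One thing the download example does not show is what happens when a single URL fails. The sketch below is a hedged illustration using only the standard library: fake_download and the example.com URLs are made up, and asyncio.sleep stands in for the HTTP request. With return_exceptions=True, asyncio.gather returns each failure as a value in the result list instead of raising it.

```python
import asyncio


async def fake_download(url):
    await asyncio.sleep(0.01)  # stand-in for the HTTP request
    if url.endswith("bad.jpg"):
        raise ValueError(f"cannot fetch {url}")
    return url.split("/")[-1]


async def main():
    urls = [
        "https://example.com/a.jpg",
        "https://example.com/bad.jpg",
        "https://example.com/c.jpg",
    ]
    # Results come back in the same order as the inputs; failures appear as exception objects.
    return await asyncio.gather(
        *(fake_download(u) for u in urls), return_exceptions=True
    )


results = asyncio.run(main())
for item in results:
    print("failed:" if isinstance(item, Exception) else "saved:", item)
```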

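2. Fetching article titles from the Mini Program community site: a high-concurrency crawler built with asyncio + aiohttp + aiomysql

Approach:
    1. Request the first page of the community's article list and collect all the article detail links
    2. Visit each article's detail page, collect the links to the previous and next articles, request them, and repeat this step
    3. Extract the title and store it in MySQL (using an asynchronous MySQL connection pool)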
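
The three steps above can be sketched without any network or database access. Everything here is a stand-in: FAKE_SITE is a hypothetical link graph, fetch_links replaces the HTTP request, and appending to titles replaces the MySQL insert. Unlike the crawler in this article, this sketch uses asyncio.Queue, which lets the crawl stop on its own once no pages are left.

```python
import asyncio

# Hypothetical link graph standing in for the real site: page -> linked pages.
FAKE_SITE = {
    "list": ["article-1", "article-2"],
    "article-1": ["article-2", "article-3"],
    "article-2": ["article-1"],
    "article-3": [],
}


async def fetch_links(page):
    await asyncio.sleep(0)  # stand-in for an HTTP request
    return FAKE_SITE.get(page, [])


async def worker(queue, seen, titles):
    while True:
        page = await queue.get()
        try:
            for link in await fetch_links(page):
                if link not in seen:  # de-duplicate before queueing
                    seen.add(link)
                    queue.put_nowait(link)
            if page.startswith("article-"):
                titles.append(page)  # stand-in for the MySQL insert
        finally:
            queue.task_done()


async def crawl(start, n_workers=3):
    queue = asyncio.Queue()
    seen = {start}
    titles = []
    queue.put_nowait(start)
    workers = [
        asyncio.create_task(worker(queue, seen, titles)) for _ in range(n_workers)
    ]
    await queue.join()  # returns once every queued page has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return titles


titles = asyncio.run(crawl("list"))
print(sorted(titles))
```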

Asynchronous operations with aiomysql: https://www.yangyanxing.com/article/aiomysql_in_python.html

Introduction to using aiomysql: https://www.cnblogs.com/zwb8848happy/p/8809861.html

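from lxml import etree
import asyncio
import aiohttp
import aiomysql
import re

# Request function: returns the page text, or None if the request fails
async def get_request(url):
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url=url, headers=headers) as response:
                if response.status == 200:
                    content = await response.text()
                    return content
    except (aiohttp.ClientError, asyncio.TimeoutError, UnicodeDecodeError) as exc:
        # Report the failure instead of silently swallowing it
        print(f'request failed: {url}: {exc}')
    return None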

# Function that extracts article URLs from a page
# Note: parsing URLs is CPU work, not I/O, so an ordinary function is enough
def get_url(content):
    html = etree.HTML(content)
    a_href_list = html.xpath('//a')
    for a_href in a_href_list:
        # Note that xpath() returns a list
        href = a_href.xpath('./@href')
        if href and url_pattern.findall(href[0]) and href[0] not in url_set:
            url_set.add(href[0])
            wait_url.append(href[0])

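async def parse_article(url, pool):
    content = await get_request(url)
    if content is None:
        # The request failed, so there is nothing to parse
        return
    try:
        get_url(content)
        html = etree.HTML(content)
        # xpath() returns a list; fall back to an empty title if the page has no <h1> text
        titles = html.xpath('//h1/text()')
        title = titles[0] if titles else ''
        print(title)
        async with pool.acquire() as connect:
            async with connect.cursor() as cursor:
                insert_into = 'insert into titles(title) values (%s)'
                await cursor.execute(insert_into, (title,))
    except Exception as exc:
        # Report the failure instead of silently swallowing it
        print(f'failed to process {url}: {exc}')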

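# Consumer function: takes URLs off the pending list and schedules a task for each one
# Note: this loop never exits on its own; stop the crawler manually (for example with Ctrl+C)
async def consumer(pool):
    while True:
        if len(wait_url) == 0:
            # Nothing to fetch yet; sleep so that running tasks can add more URLs
            await asyncio.sleep(0.5)
            continue
        url = wait_url.pop()
        asyncio.ensure_future(parse_article(url, pool))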

async def main():
    """
    1. Create the asynchronous MySQL connection pool. aiomysql is coroutine-based, so its calls must be awaited.
    2. A connection pool keeps a set number of open connections available. A query takes a connection from the pool
    before it runs and returns it afterwards, which avoids connecting to the database over and over and saves resources.
    3. Under high concurrency, a pool lets queries run on several connections at once, which a single connection cannot do.
    """
    pool = await aiomysql.create_pool(
        host='127.0.0.1',
        port=3306,
        user='root',
        password='123456',
        db='mina',
        charset='utf8',
        autocommit=True
    )
    content = await get_request(start_url)
    # Add start_url to the url_set set
    url_set.add(start_url)
    get_url(content)
    await consumer(pool)

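if __name__ == '__main__':
    start_url = 'https://www.wxapp-union.com/portal.php?mod=list&catid=1&page=1'
    headers = {
        # Adjacent string literals are joined, so the header value contains no stray whitespace
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/80.0.3987.163 Safari/537.36'
    }
    # Set used for de-duplication
    url_set = set()
    # List of URLs waiting to be fetched
    wait_url = []
    # Regular expression that matches article routes
    url_pattern = re.compile(r'article-\d+-\d+\.html')
    # asyncio.run() creates an event loop, runs main() on it, and closes the loop afterwards
    asyncio.run(main())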
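
The consumer in this crawler starts a new task for every pending URL with no upper bound. One common way to cap the number of requests in flight is asyncio.Semaphore. The sketch below is an illustration under stated assumptions rather than part of the crawler above: limited_fetch and state are made-up names, and asyncio.sleep stands in for the HTTP request.

```python
import asyncio


async def limited_fetch(sem, state):
    # At most `limit` coroutines can be inside this block at the same time.
    async with sem:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for the HTTP request
        state["active"] -= 1


async def main(limit=3, jobs=10):
    sem = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(limited_fetch(sem, state) for _ in range(jobs)))
    return state["peak"]


peak = asyncio.run(main())
print("peak concurrent requests:", peak)
```

In the real crawler the same semaphore could wrap the body of get_request, so that the number of simultaneous requests stays close to what the target site and the connection pool can handle.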
