python - asyncio web scraping 101: fetching multiple urls with aiohttp

2023-08-19 17:13:40

In an earlier question, one of the authors of aiohttp suggested a way to fetch multiple urls with aiohttp using the new async syntax in Python 3.5:

import aiohttp
import asyncio

async def fetch(session, url):
    # total timeout of 10 seconds for the whole request
    timeout = aiohttp.ClientTimeout(total=10)
    async with session.get(url, timeout=timeout) as response:
        return await response.text()

async def fetch_all(session, urls):
    results = await asyncio.wait([asyncio.ensure_future(fetch(session, url))
                                  for url in urls])
    return results

async def main(urls):
    # the session is opened with "async with" inside a coroutine
    async with aiohttp.ClientSession() as session:
        return await fetch_all(session, urls)

if __name__ == '__main__':
    # breaks because of the first url
    urls = ['http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
    the_results = asyncio.run(main(urls))
    # do something with the_results

However, when one of the session.get(url) requests breaks (as above, because of http://SDFKHSKHGKLHSKLJHGSDFKSJH.com), the error is not handled and the whole thing breaks.

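I looked for ways to insert tests on the result of session.get(url), for example a place to put a try ... except ..., or an if response.status != 200:, but I just don't understand how to work with async, await, and the various objects.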

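Since async is still very new, there are not many examples. It would be very helpful to many people if an asyncio wizard could show how to do this. After all, one of the first things most people will want to test with asyncio is fetching multiple resources concurrently.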

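Goal

The goal is that we can inspect the_results and quickly see either:

- this url failed (and why: status code, maybe the exception name), or
- this url worked, and here is a useful response object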

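Answer:

I would use gather instead of wait. With return_exceptions=True, gather returns exceptions as objects in the result list instead of raising them. You can then check each result to see whether it is an instance of some exception.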
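To see the effect of that flag in isolation, here is a minimal sketch that needs no network. The coroutines `ok` and `boom` and the two `collect` helpers are hypothetical stand-ins for successful and failing requests; only `asyncio.gather` and its `return_exceptions` parameter are real API.

```python
import asyncio

async def ok(value):
    # hypothetical stand-in for a request that succeeds
    await asyncio.sleep(0)
    return value

async def boom():
    # hypothetical stand-in for a request that fails
    await asyncio.sleep(0)
    raise ValueError('simulated failure')

async def collect():
    # exceptions come back as ordinary list items,
    # in the same order as the awaitables that were passed in
    return await asyncio.gather(ok(1), boom(), ok(3), return_exceptions=True)

async def collect_raising():
    # with the default return_exceptions=False the first exception propagates
    try:
        await asyncio.gather(ok(1), boom(), ok(3))
    except ValueError as exc:
        return 'raised: {}'.format(exc)
    return 'no error'

if __name__ == '__main__':
    print(asyncio.run(collect()))
    print(asyncio.run(collect_raising()))
```

The same flag applied to the aiohttp code from the question: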

import aiohttp
import asyncio

async def fetch(session, url):
    # total timeout of 10 seconds for the whole request
    timeout = aiohttp.ClientTimeout(total=10)
    async with session.get(url, timeout=timeout) as response:
        return await response.text()

async def fetch_all(session, urls):
    results = await asyncio.gather(
        *[fetch(session, url) for url in urls],
        return_exceptions=True  # default is False, which would raise
    )

    # for testing purposes only
    # gather returns results in the order of the coroutines
    for url, result in zip(urls, results):
        print('{}: {}'.format(url, 'ERR' if isinstance(result, Exception) else 'OK'))
    return results

async def main(urls):
    # the session is opened with "async with" inside a coroutine
    async with aiohttp.ClientSession() as session:
        return await fetch_all(session, urls)

if __name__ == '__main__':
    # the first url is expected to fail
    urls = [
        'http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
        'http://google.com',
        'http://twitter.com']
    the_results = asyncio.run(main(urls))

Test:

$ python test.py
http://SDFKHSKHGKLHSKLJHGSDFKSJH.com: ERR
http://google.com: OK
http://twitter.com: OK
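To get closer to the goal stated in the question (a failure reason or a usable result for each url), the gathered results can be paired with their urls and turned into small records. This is only a sketch under assumptions: `describe`, `fetch_all_described`, `FakeHTTPError` and `fake_fetch` are hypothetical names, and `fake_fetch` simulates responses so the example runs without a network. With aiohttp one would pass the real fetch instead; calling `response.raise_for_status()` inside it should turn 4xx/5xx responses into `aiohttp.ClientResponseError` exceptions, which carry a `status` attribute.

```python
import asyncio

class FakeHTTPError(Exception):
    # hypothetical stand-in for an HTTP error that carries a status code
    def __init__(self, status):
        super().__init__('HTTP {}'.format(status))
        self.status = status

async def fake_fetch(url):
    # hypothetical stand-in for fetch(session, url); no network is used
    await asyncio.sleep(0)
    if 'missing' in url:
        raise FakeHTTPError(404)
    if 'unresolvable' in url:
        raise OSError('cannot connect')
    return '<html>{}</html>'.format(url)

def describe(url, result):
    # turn one gather() item into a record that is easy to inspect
    if isinstance(result, Exception):
        return {'url': url, 'ok': False,
                'error': type(result).__name__,
                'status': getattr(result, 'status', None)}
    return {'url': url, 'ok': True, 'body': result}

async def fetch_all_described(urls, fetch=fake_fetch):
    results = await asyncio.gather(*[fetch(url) for url in urls],
                                   return_exceptions=True)
    # gather keeps the input order, so zip pairs each url with its result
    return [describe(url, result) for url, result in zip(urls, results)]

if __name__ == '__main__':
    urls = ['http://unresolvable.example', 'http://ok.example',
            'http://ok.example/missing']
    for record in asyncio.run(fetch_all_described(urls)):
        print(record)
```

Each record then answers both questions at once: `ok` says whether the url worked, and `error` plus `status` say why it did not.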
