

2021-07-11 02:19:11 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://xxxx.com/tags/undef>: HTTP status code is not handled or not allowed


  • The HTML body of the 503 response says:

Checking your browser before accessing xxxx.com
This process is automatic. Your browser will redirect to your requested content shortly.
Please allow up to 5 seconds…
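  • Judging from this text, the page makes the browser wait 5 seconds for verification. A quick search suggested that this is a Cloudflare mechanism meant to keep bots from scraping data, and that cfscrape can be used to get past the waiting page. The setup is as follows:

Install: pip install cfscrape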

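import cfscrape
import scrapy

# Any browser-like user agent string; this one is the same string used with requests later on.
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'


class DrdSpider(scrapy.Spider):
    def start_requests(self):
        cf_requests = []
        for url in self.start_urls:
            # Solve the Cloudflare challenge once and reuse its cookie and user agent in Scrapy.
            token, agent = cfscrape.get_tokens(url, USER_AGENT)
            #token, agent = cfscrape.get_tokens(url)
            cf_requests.append(scrapy.Request(url=url, cookies={'__cfduid': token['__cfduid']}, headers={'User-Agent': agent}))
            print("useragent in cfrequest: ", agent)
            print("token in cfrequest: ", token)
        return cf_requests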
  • With this in place, however, running the spider failed with the following error:
Traceback (most recent call last):
  File "C:\workspace\new-crm-agent\env\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "C:\workspace\phub\scrapy_obj\mySpider\spiders\drd.py", line 35, in start_requests
    token, agent = cfscrape.get_tokens(url)
  File "C:\workspace\new-crm-agent\env\lib\site-packages\cfscrape\__init__.py", line 398, in get_tokens
    'Unable to find Cloudflare cookies. Does the site actually have Cloudflare IUAM ("I\'m Under Attack Mode") enabled?'
ValueError: Unable to find Cloudflare cookies. Does the site actually have Cloudflare IUAM ("I'm Under Attack Mode") enabled?
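  • The error message suggests that the site is not using the Cloudflare challenge at all, so I set a breakpoint on the line just before the error to inspect the request. The response there had status code 200.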
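To make the check at the breakpoint less ad hoc, a small helper can classify a response as "waiting page" or "normal page". This is only a rough heuristic sketch: the function name is hypothetical, the body marker is the "Checking your browser" text quoted above, the `Server: cloudflare` header check is an assumption about how such pages are served, and it is not the logic cfscrape itself uses.

```python
def looks_like_challenge(status_code, headers, body):
    """Heuristic: True if a response resembles the 'Checking your browser' waiting page."""
    server = ""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare them in lower case.
        if name.lower() == "server":
            server = value.lower()
    return (
        status_code == 503
        and server.startswith("cloudflare")
        and "Checking your browser" in body
    )


if __name__ == "__main__":
    waiting_page = "Checking your browser before accessing xxxx.com"
    print(looks_like_challenge(503, {"Server": "cloudflare"}, waiting_page))
    print(looks_like_challenge(200, {"Server": "cloudflare"}, "<html>tags</html>"))
```

With a requests or Scrapy response, the three arguments would be the status code, the header mapping, and the decoded body text.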


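  • I suspected the problem was in the Scrapy framework itself, so I fetched the same page with the requests module to see whether it could be reached normally. The result was still 503:
from collections import OrderedDict

import requests

if __name__ == "__main__":
    session = requests.session()
    # Browser-like headers in a fixed order; a None value tells requests to omit that header.
    heads = OrderedDict([('Host', None),
             ('Connection', 'keep-alive'),
             ('Upgrade-Insecure-Requests', '1'),
             ('User-Agent',
              'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'),
             ('Accept-Language', 'en-US,en;q=0.9'),
             ('Accept-Encoding', 'gzip, deflate')])
    session.headers = heads
    resp = session.get("https://drd.com/tags/undi")
    print(resp)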


<Response [503]>
Process finished with exit code 0

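  • So apart from cfscrape, neither the requests library nor Scrapy could reach the page normally. cfscrape's source presumably does something special, and the relevant part looks like this: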
class CloudflareAdapter(HTTPAdapter):
    """ HTTPS adapter that creates a SSL context with custom ciphers """

    def get_connection(self, *args, **kwargs):
        conn = super(CloudflareAdapter, self).get_connection(*args, **kwargs)

        if conn.conn_kw.get("ssl_context"):
            context = create_urllib3_context(ciphers=DEFAULT_CIPHERS)
            conn.conn_kw["ssl_context"] = context

        return conn


class CloudflareScraper(Session):
    def __init__(self, *args, **kwargs):
        self.delay = kwargs.pop("delay", None)
        # Use headers with a random User-Agent if no custom headers have been set
        headers = OrderedDict(kwargs.pop("headers", DEFAULT_HEADERS))

        # Set the User-Agent header if it was not provided
        headers.setdefault("User-Agent", DEFAULT_USER_AGENT)

        super(CloudflareScraper, self).__init__(*args, **kwargs)

        # Define headers to force using an OrderedDict and preserve header order
        self.headers = headers
        self.org_method = None

        self.mount("https://", CloudflareAdapter())
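  • The key is self.mount("https://", CloudflareAdapter()). When I reproduced this logic with plain requests, the request came back with a normal 200. The likely explanation is that the HTTPS connection needs a specific ssl_context for the TLS handshake. I then looked up what set_ciphers does. The official Python documentation says:
Set the available ciphers for sockets created with this context. It should be a string in the OpenSSL cipher list format. If no cipher can be selected (because compile-time options or other configuration forbids use of all the specified ciphers), an SSLError will be raised.

Note: when connected, the SSLSocket.cipher() method of SSL sockets will give the currently selected cipher.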
TLS 1.3 cipher suites cannot be disabled with set_ciphers().
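The documented behaviour can be checked with the standard library alone. The sketch below uses two hypothetical helper names, `cipher_names` and `is_selectable`. It lists what a client context offers for a given OpenSSL cipher string, and it shows the SSLError case from the documentation. The exact suites printed depend on the local OpenSSL build, so no output is shown.

```python
import ssl


def cipher_names(cipher_string):
    """Return the cipher suite names a client context offers for an OpenSSL cipher string."""
    context = ssl.create_default_context()
    context.set_ciphers(cipher_string)  # raises ssl.SSLError if nothing can be selected
    return [cipher["name"] for cipher in context.get_ciphers()]


def is_selectable(cipher_string):
    """Return False if set_ciphers() rejects the string because no cipher matches it."""
    try:
        cipher_names(cipher_string)
    except ssl.SSLError:
        return False
    return True


if __name__ == "__main__":
    default = cipher_names("DEFAULT")
    without_dh = cipher_names("DEFAULT:!DH")
    # Suites that are in DEFAULT but dropped once "!DH" is appended.
    print(sorted(set(default) - set(without_dh)))
    # A string that matches no cipher at all is rejected.
    print(is_selectable("NO-SUCH-CIPHER"))
```

Changing the cipher string changes the list of suites the client advertises during the TLS handshake, which is presumably the difference the server reacts to.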
  • So the site's connections on port 443 apparently expect a particular TLS/SSL cipher configuration, which needs to be set as follows:
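On the Scrapy side, one setting that appears to fit is DOWNLOADER_CLIENT_TLS_CIPHERS, which takes an OpenSSL cipher string for the default downloader in recent Scrapy releases. The fragment below is a minimal sketch, assuming the same "DEFAULT:!DH" string that is used with requests in this post. Whether that exact string is right for a given site is an assumption to verify.

```python
# settings.py (Scrapy project settings)
# OpenSSL cipher string used by Scrapy's default HTTP/1.1 downloader.
# "DEFAULT:!DH" is the string from the requests example in this post,
# not Scrapy's own default.
DOWNLOADER_CLIENT_TLS_CIPHERS = "DEFAULT:!DH"
```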


  • With the requests module, the HTTP adapter has to be customized. The code is as follows:
from collections import OrderedDict

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context

if __name__ == "__main__":
    # OpenSSL cipher string: the default set minus DH key exchange.
    ciphers = "DEFAULT:!DH"
    class TestAdapter(HTTPAdapter):
        def get_connection(self, *args, **kwargs):
            conn = super(TestAdapter, self).get_connection(*args, **kwargs)
            # Swap in an SSL context that only offers the ciphers above.
            if conn.conn_kw.get("ssl_context"):
                context = create_urllib3_context(ciphers=ciphers)
                conn.conn_kw["ssl_context"] = context
            return conn
    session = requests.session()
    heads = OrderedDict([('Host', None),
             ('Connection', 'keep-alive'),
             ('Upgrade-Insecure-Requests', '1'),
             ('User-Agent',
              'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'),
             ('Accept-Language', 'en-US,en;q=0.9'),
             ('Accept-Encoding', 'gzip, deflate')])
    session.headers = heads
    session.mount('https://', TestAdapter())
    resp = session.get("https://javdb.com/tags/uncensored")
    print(resp)
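One caveat with overriding get_connection(): as far as I can tell, newer requests releases (2.32 and later) no longer route send() through that method, so the override above may never be called there. A variant that should be less dependent on the requests version passes the context when the pool manager is created. This is a sketch under that assumption. `CipherAdapter` and `make_session` are hypothetical names, and the cipher string is the one used above.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context


class CipherAdapter(HTTPAdapter):
    """Hypothetical adapter that builds its HTTPS connection pools with a custom cipher list."""

    def __init__(self, ciphers, **kwargs):
        # Create the context first: HTTPAdapter.__init__ calls init_poolmanager().
        self.ssl_context = create_urllib3_context(ciphers=ciphers)
        super().__init__(**kwargs)

    def init_poolmanager(self, *args, **kwargs):
        # Hand the context to urllib3 for direct connections.
        kwargs["ssl_context"] = self.ssl_context
        return super().init_poolmanager(*args, **kwargs)

    def proxy_manager_for(self, *args, **kwargs):
        # Use the same context when the request goes through a proxy.
        kwargs["ssl_context"] = self.ssl_context
        return super().proxy_manager_for(*args, **kwargs)


def make_session(ciphers="DEFAULT:!DH"):
    """Return a requests session whose HTTPS traffic uses the given cipher string."""
    session = requests.Session()
    session.mount("https://", CipherAdapter(ciphers))
    return session
```

Calling `make_session().get(url)` then behaves like an ordinary session, and the headers from the example above can be assigned to `session.headers` in the same way.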


