The Robots Protocol for Web Crawlers

2024-03-27 07:53:03

The Robots Protocol

The formal name is the Robots Exclusion Standard, that is, a standard for excluding web crawlers from parts of a site.

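The Robots Exclusion Standard

Purpose: a website tells web crawlers which pages may be crawled and which may not.
Form: a robots.txt file placed in the site's root directory.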
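
Because the file always sits in the root directory, its address can be derived from any page URL on the same site. The following is a minimal sketch of that derivation; the helper name `robots_url` and the sample page URL are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# The page URL below is a hypothetical example
print(robots_url("https://bilibili.com/some/page?id=1"))
```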

Robots Protocol

Robots protocol syntax

# A line starting with # is a comment. * matches all crawlers; / is the root directory
User-agent: *
Disallow: /
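
To see what these two directives mean in practice, the rules can be fed to `urllib.robotparser` from the Python standard library. This is only a sketch: the crawler name `ExampleBot` and the `example.com` URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The same rules as above, held in a string instead of a file
RULES = """\
# * matches all crawlers; / is the root directory
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# "ExampleBot" and the URL are hypothetical; any crawler and any path
# on the site fall under "User-agent: *" and "Disallow: /"
allowed = parser.can_fetch("ExampleBot", "https://example.com/any/page.html")
print(allowed)
```

With `Disallow: /` every path on the site starts with the disallowed prefix, so `can_fetch` reports that nothing may be crawled.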

How to View a Site's Robots Protocol

Taking Bilibili as an example, fetch the site's robots.txt file.

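import requests

# Fetch robots.txt from the root of the site
r = requests.get('https://bilibili.com/robots.txt')
# The response body is available as r.text (r.txt does not exist)
print(r.text)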

>>>
User-agent: *
Disallow: /include/
Disallow: /mylist/
Disallow: /member/
Disallow: /images/
Disallow: /ass/
Disallow: /getapi
Disallow: /search
Disallow: /account
Disallow: /badlist.html
Disallow: /m/

Interpretation:

For every crawler, the site asks that URLs whose paths begin with any of the strings above not be crawled.
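
The prefix rule can be sketched in a few lines. This is a simplified illustration, not a full parser: it only handles the Disallow prefixes from the output above and ignores features such as Allow lines or wildcards. The function name `is_disallowed` and the test URLs are hypothetical.

```python
from urllib.parse import urlparse

# Disallow prefixes copied from the robots.txt output shown above
DISALLOWED_PREFIXES = [
    "/include/", "/mylist/", "/member/", "/images/", "/ass/",
    "/getapi", "/search", "/account", "/badlist.html", "/m/",
]

def is_disallowed(url: str) -> bool:
    """Return True if the URL path starts with any disallowed prefix."""
    path = urlparse(url).path or "/"
    return any(path.startswith(prefix) for prefix in DISALLOWED_PREFIXES)

# Hypothetical URLs used only to exercise the rule
print(is_disallowed("https://bilibili.com/search?keyword=python"))
print(is_disallowed("https://bilibili.com/video/"))
```

The first URL has the path /search, which matches a listed prefix; the second has the path /video/, which matches none of them.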

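How to Comply with the Robots Protocol

Web crawlers: read robots.txt first, either automatically or by hand, and only then crawl the content.

Binding force: the Robots protocol is advisory rather than binding. A crawler can ignore it, but doing so carries legal risk.

Tip: access that resembles human behavior (low request frequency) may proceed without consulting the Robots protocol.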
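
Automatic recognition could look roughly like the sketch below, which filters a list of candidate URLs against the rules before anything is downloaded. The sample rules, the crawler name `ExampleBot`, the `example.com` URLs, and the helper names are all assumptions for illustration. A real crawler would more likely load the live file with the parser's `set_url()` and `read()` methods instead of a string.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def filter_allowed(parser: RobotFileParser, user_agent: str, urls: list) -> list:
    """Keep only the URLs that the rules permit for this crawler."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]

# Hypothetical rules and URLs, used only to demonstrate the check
SAMPLE_RULES = """\
User-agent: *
Disallow: /search
Disallow: /account
"""

candidates = [
    "https://example.com/search",
    "https://example.com/index.html",
    "https://example.com/account",
]

parser = build_parser(SAMPLE_RULES)
allowed_urls = filter_allowed(parser, "ExampleBot", candidates)
print(allowed_urls)
```

Only the URLs left in `allowed_urls` would then be passed on to the code that actually downloads pages.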

