1 urllib implementation
The differences among urllib, urllib2, and urllib3 are covered elsewhere. In Python 3, urllib is packaged as a single package containing the following modules:
Name | Purpose
---|---
urllib.request | opens and reads URLs
urllib.error | handles exceptions raised by urllib.request
urllib.parse | parses URLs
urllib.robotparser | parses robots.txt files
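As a quick illustration of urllib.parse from the table above, a minimal sketch (the URL is made up for the example):

```python
from urllib import parse

# Split a URL into its components
parts = parse.urlparse('https://www.zhihu.com/search?q=python#top')
print(parts.scheme)    # https
print(parts.netloc)    # www.zhihu.com
print(parts.path)      # /search
print(parts.query)     # q=python
print(parts.fragment)  # top
```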
1.1 Implementing the full request/response model
urllib.request provides a basic function, urlopen, which fetches data by sending a request to the specified URL. The simplest form is as follows:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
"""Response"""
res = request.urlopen('http://www.zhihu.com')  # a timeout can be set, e.g. timeout=2
html = res.read()
print(html)
Output:
b'<!doctype html>\n<html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react...'
The code above can be split into two steps, request and response:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
"""Request"""
req = request.Request('http://www.zhihu.com')
"""Response"""
res = request.urlopen(req)
html = res.read()
print(html)
Both of the approaches above issue GET requests. Next, a POST request:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request, parse
url = 'https://www.xxx.com//login'
postdata = parse.urlencode({'username': 'miao',
                            'password': '123456'}).encode('utf-8')
"""Request"""
req = request.Request(url, postdata)
"""Response"""
res = request.urlopen(req)
html = res.read()
print(html)
Try this one yourself.
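Note that the data argument to Request must be URL-encoded bytes rather than a plain dict. A minimal sketch of just the encoding step (the credentials are placeholders):

```python
from urllib import parse

# Request body data must be URL-encoded and then converted to bytes
postdata = parse.urlencode({'username': 'miao', 'password': '123456'})
print(postdata)                  # username=miao&password=123456
print(postdata.encode('utf-8'))  # b'username=miao&password=123456'
```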
1.2 Handling request headers
The following example shows how to add request header information, setting User-Agent and Referer:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request, parse
url = 'https://www.xxx.com//login'
postdata = parse.urlencode({'username': 'xxx',
                            'password': '******'}).encode('utf-8')
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
referer = 'https://www.github.com'
headers = {'User-Agent': user_agent, 'Referer': referer}
"""Request"""
req = request.Request(url, postdata, headers)
"""Response"""
res = request.urlopen(req)
html = res.read()
print(html)
Request headers can also be added with add_header:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request, parse
url = 'https://www.xxxxxx.com//login'
postdata = parse.urlencode({'username': 'xxx',
                            'password': '******'}).encode('utf-8')
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
referer = 'https://www.github.com'
req = request.Request(url, postdata)
"""Add headers"""
req.add_header('User-Agent', user_agent)
req.add_header('Referer', referer)
res = request.urlopen(req)
html = res.read()
print(html)
Note:
Some headers deserve special attention, because servers check them:
- User-Agent: some servers or proxies use this value to decide whether the request was issued by a browser
- Content-Type: when using a REST interface, the server checks this value to decide how to parse the HTTP body; with RESTful or SOAP services, a wrong value can make the server refuse the request. Common values:
  - application/xml (used in XML RPC calls such as RESTful/SOAP)
  - application/json (used in JSON RPC calls)
  - application/x-www-form-urlencoded (used when a browser submits a web form)
- Referer: servers sometimes check this value to block hotlinking.
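To set one of these headers explicitly, a small sketch that builds (but does not send) a request with a Content-Type; the URL and body are placeholders:

```python
from urllib import request

# Build a request with an explicit Content-Type, without sending it
req = request.Request('http://www.example.com/api',
                      data=b'{"key": "value"}',
                      headers={'Content-Type': 'application/json'})
# urllib normalizes header names to Capitalized-form
print(req.get_header('Content-type'))  # application/json
print(req.get_method())                # POST, because data is present
```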
1.3 Cookie handling
If you need the value of a given cookie, you can do the following:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
from http import cookiejar
cookie = cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cookie))
"""Response"""
res = opener.open('http://www.zhihu.com')
for item in cookie:
    print(item.name + ": " + item.value)
Output:
_xsrf: 467z...
_zap: 4f91...
KLBRSID: ed2a...
Of course, you can also add cookie content manually as needed:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
cookie = ('Cookie', 'email=' + 'xxxxxxx@163.com')
opener = request.build_opener()
opener.addheaders = [cookie]
"""Request"""
req = request.Request('http://www.zhihu.com')
"""Response"""
res = opener.open(req)
print(res.headers)
retdata = res.read()
Output:
Date: Tue, 09 Jun 2020 06:45:54 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 49014
Connection: close
Server: CLOUD ELB 1.0.0...
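Relatedly, the standard library's http.cookies module can parse a raw cookie string into individual values. A small sketch reusing the placeholder address:

```python
from http.cookies import SimpleCookie

# Parse a raw Cookie header string into individual morsels
c = SimpleCookie()
c.load('email=xxxxxxx@163.com; theme=dark')
for name, morsel in c.items():
    print(name, '=', morsel.value)
```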
1.4 Getting the HTTP response code
For 200 OK, the HTTP status code can be obtained with getcode() on the object returned by urlopen; for other status codes, urlopen raises an exception:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request, error
try:
    """Response"""
    res = request.urlopen('http://www.zhihu.com')
    print(res.getcode())
except error.HTTPError as e:
    if hasattr(e, 'code'):
        print("Error code: ", e.code)
Output:
200
1.5 Redirects
The following code checks whether a redirect occurred:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request, error
try:
    """Response"""
    res = request.urlopen('http://www.zhihu.com')
    print(res.geturl())
except error.HTTPError as e:
    if hasattr(e, 'code'):
        print("Error code: ", e.code)
Output:
https://www.zhihu.com/signin?next=%2F
To avoid being redirected automatically, define a custom HTTPRedirectHandler subclass:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
class RedirectHandler(request.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        """return None so a 301 is not followed"""
        pass

    def http_error_302(self, req, fp, code, msg, headers):
        result = request.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
        result.status = code
        result.newurl = result.geturl()
        return result

opener = request.build_opener(RedirectHandler)
res = opener.open('http://www.zhihu.cn')
print(res)
Output:
<http.client.HTTPResponse object at 0x000001BEAC776160>
1.6 Proxy settings
An example:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
proxy = request.ProxyHandler({'http': '127.0.0.1:8087'})
opener = request.build_opener(proxy)
res = opener.open('http://www.zhihu.com/')
print(res.read())
2 requests implementation
2.1 Implementing the full request/response model
1) GET request:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
res = requests.get('http://www.zhihu.com')
print(res.content)
2) POST request:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
postdata = {'key': 'value'}
res = requests.post('http://www.zhihu.com', data=postdata)
print(res.content)
The other HTTP methods follow the same pattern:
- requests.put('http://www.xxxxxx.com/put', data={'key': 'value'})
- requests.delete('http://www.xxxxxx.com/delete')
- requests.head('http://www.xxxxxx.com/get')
- requests.options('http://www.xxxxxx.com/get')
3) Complex URLs: besides a fully spelled-out URL, requests can also build the query string from a dict:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
payload = {'Keywords': 'bolg:qiyeboy', 'pageindex': 1}
"""a timeout can be set here too"""
res = requests.get('http://www.zhihu.com', params=payload)
print(res.url)
Output:
https://www.zhihu.com/?Keywords=bolg%3Aqiyeboy&pageindex=1
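The final URL can also be inspected without sending anything, by preparing the request first. A minimal sketch with the same payload:

```python
import requests

# Build and prepare the request without sending it, then inspect the URL
req = requests.Request('GET', 'http://www.zhihu.com',
                       params={'Keywords': 'bolg:qiyeboy', 'pageindex': 1}).prepare()
print(req.url)  # http://www.zhihu.com/?Keywords=bolg%3Aqiyeboy&pageindex=1
```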
2.2 Responses and encoding
Taking res = requests.get('http://www.zhihu.com') as an example, the return value provides:
- res.content: the response body as bytes
- res.text: the response body as text
- res.encoding: the encoding guessed from the HTTP headers
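How these three relate can be sketched with plain bytes (the sample string is arbitrary); res.text is essentially res.content decoded with res.encoding:

```python
# bytes, like res.content
raw = '知乎'.encode('utf-8')
# str, like res.text after setting res.encoding = 'utf-8'
text = raw.decode('utf-8')
print(raw)   # b'\xe7\x9f\xa5\xe4\xb9\x8e'
print(text)  # 知乎
```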
Here the third-party library chardet is used to detect the encoding of a string or file:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
import chardet
res = requests.get('http://www.zhihu.com')
"""
detect returns a dict containing:
- 'encoding': the detected encoding
- 'confidence': the detection confidence
- 'language': the detected language
"""
ret_dic = chardet.detect(res.content)
"""decode using the detected encoding"""
res.encoding = ret_dic['encoding']
print(ret_dic)
print(res.text)
Output:
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>
2.3 Handling request headers
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
res = requests.get('http://www.zhihu.com', headers=headers)
print(res.content)
2.4 Handling the response code and response headers
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
res = requests.get('http://www.baidu.com')
"""
res.status_code: get the response code
res.status_code == requests.codes.ok: test the response code
"""
if res.status_code == requests.codes.ok:
    print("Response code:", res.status_code)
    print("Response headers:", res.headers)
    print("Single field:", res.headers.get('content-type'))
else:
    """
    raise_for_status() raises an exception for 4XX and 5XX codes,
    and returns None for 200
    """
    res.raise_for_status()
Output:
Response code: 200
Response headers: {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 09 Jun 2020 13:42:42 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:52 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
Single field: text/html
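raise_for_status() can be exercised without any network traffic by building a Response object by hand; a minimal sketch:

```python
import requests

# A Response built by hand, just to show raise_for_status() offline
res = requests.Response()
res.status_code = 404
try:
    res.raise_for_status()
    caught = False
except requests.HTTPError as e:
    caught = True
    print('caught:', e)

res.status_code = 200
print(res.raise_for_status())  # None for a 2xx code
```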
2.5 Cookie handling
1) Automatic cookies:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
res = requests.get('http://www.baidu.com', headers=headers)
for cookie in res.cookies.keys():
    print(cookie + ": " + res.cookies.get(cookie))
Output:
BAIDUID: D285BF54C9CC968744699A9B4F843D60:FG=1
BIDUPSID: D285BF54C9CC9687F9E45D28DB4C9F33
H_PS_PSSID: 1456_31326_21100_31069_31765_31673_30823
PSTM: 1591710519
BDSVRTM: 0
BD_HOME: 1
2) Custom cookies:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
"""custom cookie values"""
cookies = dict(name='guangtouqiang', age='18')
res = requests.get('http://www.baidu.com', headers=headers, cookies=cookies)
print(res.text)
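If domain or path scoping matters, a RequestsCookieJar can be passed instead of a plain dict. A minimal sketch with the same made-up values:

```python
import requests

# A RequestsCookieJar allows domain/path scoping, unlike a plain dict
jar = requests.cookies.RequestsCookieJar()
jar.set('name', 'guangtouqiang', domain='www.baidu.com', path='/')
jar.set('age', '18', domain='www.baidu.com', path='/')
print(jar.get('name'))  # guangtouqiang
```

The jar can then be passed as cookies=jar to requests.get.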
3) Automatic cookie handling with a Session:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
login_url = 'http://www.zhihu.com/login'
s = requests.Session()
datas = {'name': 'guangtouqiang', 'passwd': '123456'}
"""
Visit as a guest first so the server assigns a cookie; without this
step, the server would treat the user as illegitimate.
allow_redirects=True allows redirects; when one occurs, the history
can be inspected via res.history.
"""
s.get(login_url, allow_redirects=True)
"""after a successful login, the session is upgraded to member privileges"""
res = s.post(login_url, data=datas, allow_redirects=True)
print(res.text)
Output:
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>