Implementing HTTP Requests in Python 3


1 Using urllib

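  urllib2 was a Python 2 module; in Python 3 its functionality lives in urllib.request and urllib.error, while urllib3 is a separate third-party library. In Python 3, urllib is organized as a package with the following modules: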

Module              Purpose
urllib.request      Opens and reads URLs
urllib.error        Exceptions raised by urllib.request
urllib.parse        Parses URLs
urllib.robotparser  Parses robots.txt files
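
  As a quick illustration of one of these modules, the sketch below uses urllib.parse to split a URL into its components and to build a query string. The URL and parameter names are made up for the example.

```python
from urllib import parse

# Hypothetical URL, used only for illustration
url = 'https://www.example.com/search?q=python&page=2'

parts = parse.urlsplit(url)
print(parts.scheme)   # https
print(parts.netloc)   # www.example.com
print(parts.path)     # /search

# parse_qs turns the query string into a dict of lists
query = parse.parse_qs(parts.query)
print(query)          # {'q': ['python'], 'page': ['2']}

# urlencode does the reverse: dict -> percent-encoded query string
encoded = parse.urlencode({'q': 'http request', 'page': 1})
print(encoded)        # q=http+request&page=1
```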

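1.1 Implementing the full request/response model

  urllib.request provides a basic function, urlopen, which fetches data by sending a request to the given URL. The simplest form is: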

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request

# Response
res = request.urlopen('http://www.zhihu.com')  # a timeout can be set, e.g. timeout=2
html = res.read()
print(html)

  Output:

b'<!doctype html>\n<html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react...'
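
  The body printed above is a bytes object. The following is a minimal sketch of decoding it to text, closing the connection with a with-statement, and setting a timeout. So that it runs without network access, it talks to a throwaway local server; HelloHandler and the page content are hypothetical stand-ins for a real site.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that stands in for a remote site."""

    def do_GET(self):
        body = '<html><title>hello</title></html>'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


# Port 0 asks the OS for any free port
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

# An empty ProxyHandler ignores proxy environment variables,
# so the loopback request is sent directly
opener = request.build_opener(request.ProxyHandler({}))

# The with-statement closes the connection; timeout is in seconds
with opener.open(url, timeout=2) as res:
    charset = res.headers.get_content_charset() or 'utf-8'
    html = res.read().decode(charset)

server.shutdown()
server.server_close()
print(html)
```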

  The code above can be split into two steps:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request

# Request
req = request.Request('http://www.zhihu.com')
# Response
res = request.urlopen(req)
html = res.read()
print(html)

  Both approaches above send GET requests. A POST request looks like this:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
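from urllib import parse, request

url = 'https://www.xxx.com//login'
postdata = {'username': 'miao',
            'password': '123456'}
# Request: the body must be bytes, so URL-encode the form data first
data = parse.urlencode(postdata).encode('utf-8')
req = request.Request(url, data)
# Response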
res = request.urlopen(req)
html = res.read()
print(html)

  Try this yourself against a real login endpoint.
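
  urllib picks the HTTP method from whether a request body is present, and the body must be bytes. The sketch below builds a request without sending it, so the result can be checked offline. The endpoint is a hypothetical placeholder.

```python
from urllib import parse, request

# Hypothetical endpoint and credentials, for illustration only
url = 'https://www.example.com/login'
form = {'username': 'miao', 'password': '123456'}

# Request expects the body as bytes, so URL-encode and then encode
data = parse.urlencode(form).encode('utf-8')
req = request.Request(url, data=data)

print(req.get_method())  # POST, because data is present
print(req.data)          # b'username=miao&password=123456'

# Without data the same URL would be fetched with GET
get_method = request.Request(url).get_method()
print(get_method)        # GET
```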

1.2 Handling request headers

  The following example shows how to add request headers, here User-Agent and Referer:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import parse, request

url = 'https://www.xxx.com//login'
postdata = {'username': 'xxx',
            'password': '******'}
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
referer = 'https://www.github.com'
headers = {'User-Agent': user_agent, 'Referer': referer}
# Request: URL-encode the form data to bytes and pass the headers dict
data = parse.urlencode(postdata).encode('utf-8')
req = request.Request(url, data, headers)
# Response
res = request.urlopen(req)
html = res.read()
print(html)

  Request headers can also be added with add_header:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
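from urllib import parse, request

url = 'https://www.xxxxxx.com//login'
postdata = {'username': 'xxx',
            'password': '******'}
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
referer = 'https://www.github.com'
# URL-encode the form data to bytes before building the request
data = parse.urlencode(postdata).encode('utf-8')
req = request.Request(url, data)

# Add headers to the existing request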
req.add_header('User-Agent', user_agent)
req.add_header('Referer', referer)

res = request.urlopen(req)
html = res.read()
print(html)

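  Note:
  Some headers deserve particular attention because servers check them, for example:

  • User-Agent: some servers or proxies use this value to decide whether the request comes from a browser.
  • Content-Type: when calling a REST interface, the server checks this value to decide how to parse the HTTP body. For RESTful or SOAP services, a wrong value can cause the server to reject the request. Common values are:
application/xml (used for XML RPC calls, such as RESTful/SOAP)
application/json (used for JSON RPC calls)
application/x-www-form-urlencoded (used when a browser submits a web form)
  • Referer: servers sometimes check this value for hotlink protection.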
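
  To make the Content-Type point concrete, here is a minimal sketch of a JSON POST built with urllib. The request is only constructed, not sent; the endpoint, payload, and User-Agent string are hypothetical.

```python
import json
from urllib import request

# Hypothetical REST endpoint and payload
url = 'https://www.example.com/api/items'
payload = {'name': 'miao', 'tags': ['a', 'b']}

body = json.dumps(payload).encode('utf-8')
req = request.Request(
    url,
    data=body,
    headers={'Content-Type': 'application/json',
             'User-Agent': 'Mozilla/5.0 (example)'},
    method='POST',
)

# urllib stores header names capitalized, e.g. 'Content-type'
print(req.get_header('Content-type'))  # application/json
print(req.get_method())                # POST
```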

1.3 Handling cookies

  To read the value of a cookie, proceed as follows:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
from http import cookiejar

cookie = cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cookie))
# Response
res = opener.open('http://www.zhihu.com')
for item in cookie:
    print(item.name + ": " + item.value)

  Output:

_xsrf: 467z...
_zap: 4f91...
KLBRSID: ed2a...

  You can also add cookie content manually as needed:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request

cookie = ('Cookie', 'email=' + 'xxxxxxx@163.com')
opener = request.build_opener()
opener.addheaders = [cookie]
# Request
req = request.Request('http://www.zhihu.com')
# Response
res = opener.open(req)
print(res.headers)
retdata = res.read()

  Output:

Date: Tue, 09 Jun 2020 06:45:54 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 49014
Connection: close
Server: CLOUD ELB 1.0.0...
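
  The CookieJar approach can be tried offline with a small local server. In this sketch, CookieHandler and the cookie name and value are hypothetical; the server sets one cookie and the jar captures it.

```python
import threading
from http import cookiejar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class CookieHandler(BaseHTTPRequestHandler):
    """Hypothetical server that sets one cookie on every response."""

    def do_GET(self):
        self.send_response(200)
        self.send_header('Set-Cookie', 'session=abc123; Path=/')
        self.send_header('Content-Length', '0')
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(('127.0.0.1', 0), CookieHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

jar = cookiejar.CookieJar()
opener = request.build_opener(
    request.ProxyHandler({}),          # ignore proxy environment variables
    request.HTTPCookieProcessor(jar),  # store received cookies in jar
)
opener.open(url, timeout=2).close()

server.shutdown()
server.server_close()

cookies = {c.name: c.value for c in jar}
print(cookies)  # {'session': 'abc123'}
```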

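1.4 Getting the HTTP status code

  For a 200 OK response, calling getcode() on the object returned by urlopen gives the HTTP status code. For error responses (4xx and 5xx), urlopen raises an HTTPError instead: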

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import error, request

try:
    # Response
    res = request.urlopen('http://www.zhihu.com')
    print(res.getcode())
except error.HTTPError as e:
    # HTTPError always carries the status code
    print("Error code: ", e.code)

  Output:

200
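
  The error branch is easier to see with a server that always fails. The sketch below assumes a hypothetical local NotFoundHandler that answers every GET with 404, and also catches URLError for failures that never produce a status code.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical server that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(('127.0.0.1', 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/missing' % server.server_address[1]

opener = request.build_opener(request.ProxyHandler({}))  # no proxy
try:
    with opener.open(url, timeout=2) as res:
        status = res.getcode()
except error.HTTPError as e:
    # 4xx and 5xx responses arrive here; e also carries headers and a body
    status = e.code
except error.URLError:
    # Network-level failures (DNS, refused connection) have no status code
    status = None

server.shutdown()
server.server_close()
print(status)  # 404
```

  HTTPError is a subclass of URLError, so it has to be caught first.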

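1.5 Redirects

  The following code checks whether a redirect occurred, by comparing the final URL returned by geturl() with the URL that was requested: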

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request

try:
    # Response
    res = request.urlopen('http://www.zhihu.com')
    print(res.geturl())
except request.HTTPError as e:
    if hasattr(e, 'code'):
        print("Error code: ", e.code)

  Output:

https://www.zhihu.com/signin?next=%2F

  To change how redirects are handled (for example, to stop following a 301), define a custom HTTPRedirectHandler subclass:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request

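class RedirectHandler(request.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        # Returning None means the 301 is not followed; urllib then falls
        # back to its default error handling and raises HTTPError
        pass
    
    def http_error_302(self, req, fp, code, msg, headers):
        # The base-class handler follows the redirect; record the original
        # status code and the final URL on the response it returns
        result = request.HTTPRedirectHandler.http_error_301(self, req, fp, code, msg, headers)
        result.status = code
        result.newurl = result.geturl()
        return result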
    
opener = request.build_opener(RedirectHandler)
res = opener.open('http://www.zhihu.cn')
print(res)

  Output:

<http.client.HTTPResponse object at 0x000001BEAC776160>
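
  Another way to stop following redirects altogether is to override redirect_request so that it returns None; urllib then reports the 3xx response as an HTTPError. This is a sketch under that assumption, not the approach used above. RedirectingHandler and NoRedirect are hypothetical names, and the server runs locally.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request


class RedirectingHandler(BaseHTTPRequestHandler):
    """Hypothetical server: everything except /target redirects to it."""

    def do_GET(self):
        if self.path == '/target':
            self.send_response(200)
        else:
            self.send_response(302)
            self.send_header('Location', '/target')
        self.send_header('Content-Length', '0')
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example output quiet


class NoRedirect(request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # None tells urllib not to follow the redirect


server = HTTPServer(('127.0.0.1', 0), RedirectingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

opener = request.build_opener(request.ProxyHandler({}), NoRedirect)
try:
    with opener.open(url, timeout=2) as res:
        status, location = res.getcode(), None
except error.HTTPError as e:
    # The unfollowed 3xx response surfaces as an HTTPError
    status, location = e.code, e.headers.get('Location')

server.shutdown()
server.server_close()
print(status, location)  # 302 /target
```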

1.6 Configuring a proxy

  For example:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request

proxy = request.ProxyHandler({'http': '127.0.0.1:8087'})
opener = request.build_opener(proxy)
res = opener.open('http://www.zhihu.com/')
print(res.read())
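
  A proxy can also be attached to a single request instead of an opener. The sketch below shows both options without sending anything; the proxy address reuses the local port from the example above and is assumed to be a placeholder.

```python
from urllib import request

# Hypothetical local proxy address
proxy_addr = '127.0.0.1:8087'

# Option 1: a handler that routes both schemes through the proxy
handler = request.ProxyHandler({'http': proxy_addr, 'https': proxy_addr})
opener = request.build_opener(handler)
# request.install_opener(opener) would make urlopen use it globally

# Option 2: attach the proxy to one request only
req = request.Request('http://www.zhihu.com/')
req.set_proxy(proxy_addr, 'http')
print(req.host)      # 127.0.0.1:8087 (the connection goes to the proxy)
print(req.selector)  # http://www.zhihu.com/ (the full URL is sent to it)
```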


2 Using requests

2.1 Implementing the full request/response model

  1) GET request:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

res = requests.get('http://www.zhihu.com')
print(res.content)

  2) POST request:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

postdata = {'key' : 'value'}
res = requests.post('http://www.zhihu.com', data=postdata)
print(res.content)

  Other HTTP methods follow the same pattern:

  • requests.put('http://www.xxxxxx.com/put', data={'key': 'value'})
  • requests.delete('http://www.xxxxxx.com/delete')
  • requests.head('http://www.xxxxxx.com/get')
  • requests.options('http://www.xxxxxx.com/get')

  3) Complex URLs: instead of writing out the complete URL, requests can also build the query string from a dict passed as params:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

payload = {'Keywords': 'bolg:qiyeboy', 'pageindex': 1}
# A timeout can also be set here
res = requests.get('http://www.zhihu.com', params=payload)
print(res.url)

  Output:

https://www.zhihu.com/?Keywords=bolg%3Aqiyeboy&pageindex=1
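
  How requests encodes params and form data can be inspected without sending anything by preparing the request first. This is a minimal sketch that assumes the requests library is installed.

```python
import requests

payload = {'Keywords': 'bolg:qiyeboy', 'pageindex': 1}

# prepare() builds the final request without sending it
get_req = requests.Request('GET', 'http://www.zhihu.com', params=payload).prepare()
print(get_req.url)
# http://www.zhihu.com/?Keywords=bolg%3Aqiyeboy&pageindex=1

post_req = requests.Request('POST', 'http://www.zhihu.com', data={'key': 'value'}).prepare()
print(post_req.body)                     # key=value
print(post_req.headers['Content-Type'])  # application/x-www-form-urlencoded
```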

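2.2 Responses and encoding

  Taking res = requests.get('http://www.zhihu.com') as an example, the returned object provides:

  • res.content: the body as bytes
  • res.text: the body as text
  • res.encoding: the page encoding guessed from the HTTP headers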
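
  The relationship between the three can be shown with plain bytes. In this sketch a short UTF-8 string stands in for res.content; ISO-8859-1 is used as the wrong guess because requests falls back to it when a text response declares no charset.

```python
# Stand-in for res.content: the raw bytes of the body
content = '知乎'.encode('utf-8')

right = content.decode('utf-8')    # res.text when res.encoding is correct
wrong = content.decode('latin-1')  # res.text when res.encoding is a bad guess

print(len(content))    # 6 (each character takes 3 bytes in UTF-8)
print(right)           # 知乎
print(right == wrong)  # False: the wrong codec produces garbled text
```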

  Here the third-party library chardet is used to detect the encoding of a string or file:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
import chardet

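res = requests.get('http://www.zhihu.com')
# detect() returns a dict with:
#     - 'encoding': the detected encoding
#     - 'confidence': how confident the detection is
#     - 'language': the detected natural language (may be empty)
ret_dic = chardet.detect(res.content)
# Decode using the detected encoding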
res.encoding = ret_dic['encoding']
print(ret_dic)
print(res.text)

  Output:

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
<html>

<head><title>400 Bad Request</title></head>

<body bgcolor="white">

<center><h1>400 Bad Request</h1></center>

<hr><center>openresty</center>

</body>

</html>

2.3 Handling request headers

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
res = requests.get('http://www.zhihu.com', headers=headers)
print(res.content)

2.4 Handling status codes and response headers

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

res = requests.get('http://www.baidu.com')

# res.status_code: the status code
# res.status_code == requests.codes.ok: checks whether the status code is 200
if res.status_code == requests.codes.ok:
    print("Status code:", res.status_code)
    print("Response headers:", res.headers)
    print("Header field:", res.headers.get('content-type'))
else:
    # For a 4XX or 5XX status code, raise_for_status() raises an exception
    # For a 200 status code, raise_for_status() returns None
    res.raise_for_status()

  Output:

Status code: 200
Response headers: {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 09 Jun 2020 13:42:42 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:52 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
Header field: text/html
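
  The failure path of raise_for_status() can be tried without a network call. This sketch builds a Response object by hand as a stand-in for a real 404 reply; the URL is hypothetical, and in normal use the object comes from requests.get.

```python
import requests

# Hand-built response used as a stand-in for a real 404 reply
res = requests.Response()
res.status_code = 404
res.reason = 'Not Found'
res.url = 'http://www.example.com/missing'

print(res.ok)  # False for 4xx and 5xx status codes
try:
    res.raise_for_status()
    outcome = 'ok'
except requests.HTTPError as e:
    # The exception keeps a reference to the failing response
    outcome = 'error %d' % e.response.status_code
print(outcome)  # error 404
```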

2.5 Handling cookies

  1) Reading cookies set by the server:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers={'User-Agent':user_agent}
res = requests.get('http://www.baidu.com', headers=headers)

for cookie in res.cookies.keys():
    print(cookie + ": " + res.cookies.get(cookie))

  Output:

BAIDUID: D285BF54C9CC968744699A9B4F843D60:FG=1
BIDUPSID: D285BF54C9CC9687F9E45D28DB4C9F33
H_PS_PSSID: 1456_31326_21100_31069_31765_31673_30823
PSTM: 1591710519
BDSVRTM: 0
BD_HOME: 1

  2) Sending custom cookies:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers={'User-Agent':user_agent}
# Custom cookies
cookies = dict(name='guangtouqiang', age='18')
res = requests.get('http://www.baidu.com', headers=headers, cookies=cookies)

print(res.text)

  3) Automatic cookie handling with a session:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

login_url = 'http://www.zhihu.com/login'
s = requests.Session()
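datas = {'name': 'guangtouqiang', 'passwd': '123456'}
# Visit as a guest first so that the server assigns a cookie; without this
# step the server would treat the client as an illegitimate user.
# allow_redirects=True permits redirects; if any occur, res.history lists them
s.get(login_url, allow_redirects=True)
# Once authentication succeeds, the session gains member privileges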
res = s.post(login_url, data=datas, allow_redirects=True)
print(res.text)

  Output:

<html>

<head><title>400 Bad Request</title></head>

<body bgcolor="white">

<center><h1>400 Bad Request</h1></center>

<hr><center>openresty</center>

</body>

</html>
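
  What the session does between the two calls can be seen offline: cookies stored on the session are merged into every later request. In this sketch the cookie is set by hand as a stand-in for one the server would assign, and the request is prepared but not sent.

```python
import requests

s = requests.Session()
# Stand-in for a cookie the server would set on the first s.get(...)
s.cookies.set('sessionid', 'abc123')

# prepare_request merges session state into the request without sending it
req = requests.Request('POST', 'http://www.zhihu.com/login',
                       data={'name': 'guangtouqiang', 'passwd': '123456'})
prepared = s.prepare_request(req)
print(prepared.headers['Cookie'])  # sessionid=abc123
```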

2.7 Redirects and history
