Python Web Scraping Notes (1): Basic Usage of requests

I. Introduction to requests

requests is a powerful and easy-to-use HTTP library, and it is the recommended choice for writing crawlers.

II. Basic Usage of requests

1. The get method

requests.get(url=url, params=None, headers=None, proxies=None, cookies=None, auth=None, verify=None, timeout=None)
This method sends a request to the target URL and receives the response. It returns a Response object, whose commonly used attributes and methods are listed below:

  • response.url: the URL of the request
  • response.status_code: the status code of the response
  • response.encoding: the encoding of the response
  • response.cookies: the cookies of the response
  • response.headers: the response headers
  • response.content: the response body as bytes
  • response.text: the response body as a str, equivalent to response.content.decode('utf-8')
  • response.json(): the response body parsed into a dict, equivalent to json.loads(response.text) (both equivalences are checked in the short sketch after the example below)
>>> import requests
>>> response = requests.get('http://www.httpbin.org/get')
>>> type(response)
# <class 'requests.models.Response'>
>>> print(response.url) # the URL of the request
# http://www.httpbin.org/get
>>> print(response.status_code) # the status code of the response
# 200
>>> print(response.encoding) # the encoding of the response
# None
>>> print(response.cookies) # the cookies of the response
# <RequestsCookieJar[]>
>>> print(response.headers) # the response headers
# {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Content-Type': 'application/json', 'Date': 'Fri, 13 Sep 2019 11:40:31 GMT', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Server': 'nginx', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'DENY', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '188', 'Connection': 'keep-alive'}
>>> type(response.content) # the response body as bytes
# <class 'bytes'>
>>> type(response.text) # the response body as a str
# <class 'str'>
>>> type(response.json()) # the response body as a dict
# <class 'dict'>
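
To confirm the two equivalences noted in the list above, we can continue the same REPL session (a minimal sketch; the comparisons hold here because the body is plain ASCII JSON):

>>> import json
>>> response.text == response.content.decode('utf-8')
# True
>>> response.json() == json.loads(response.text)
# True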

The parameters of this method are described below:

  • url: required; the URL to request
  • params: dict; the query-string parameters, commonly used when sending GET requests (a list value is also allowed; see the sketch after this example)
>>> import requests
>>> url = 'http://www.httpbin.org/get'
>>> params = {
    'key1':'value1',
    'key2':'value2'
}
>>> response = requests.get(url=url,params=params)
>>> print(response.text)
# {
#   "args": {     # 我们设定的请求参数
#     "key1": "value1", 
#     "key2": "value2"
#   }, 
#   "headers": {
#     "Accept": "*/*", 
#     "Accept-Encoding": "gzip, deflate", 
#     "Connection": "close", 
#     "Host": "www.httpbin.org", 
#     "User-Agent": "python-requests/2.19.1"
#   }, 
#   "origin": "110.64.88.141", 
#   "url": "http://www.httpbin.org/get?key1=value1&key2=value2"
# }
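
A dict value in params may also be a list, in which case the key is repeated in the query string (a small sketch, not part of the original example):

>>> params = {'key1':'value1','key2':['value2','value3']}
>>> response = requests.get(url=url,params=params)
>>> print(response.url)
# http://www.httpbin.org/get?key1=value1&key2=value2&key2=value3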
  • headers: dict; the request headers
>>> import requests
>>> url = 'http://www.httpbin.org/headers'
>>> headers = {
    'USER-AGENT':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
>>> response = requests.get(url=url,headers=headers)
>>> print(response.text)
# {
#   "headers": {
#     "Accept": "*/*", 
#     "Accept-Encoding": "gzip, deflate", 
#     "Connection": "close", 
#     "Host": "www.httpbin.org", 
#     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"     # 我们设定的请求头部
#   }
# }
  • proxies: dict; the proxies to use, keyed by URL scheme
>>> import requests
>>> url = 'http://www.httpbin.org/ip'
>>> proxies = {
    'http':'182.88.178.128:8123',    # proxy used for http:// URLs
    'https':'61.135.217.7:80'        # proxy used for https:// URLs
}
>>> response = requests.get(url=url,proxies=proxies)
>>> print(response.text)
# {
#   "origin": "182.88.178.128"
# }
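
If the proxy itself requires authentication, the credentials can be embedded in the proxy URL in the form http://user:password@host:port (a sketch with placeholder values, not part of the original example):

>>> proxies = {
    'http':'http://user:password@182.88.178.128:8123'
}
>>> response = requests.get(url=url,proxies=proxies)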
  • cookies: dict; the cookies to send with the request
>>> import requests
>>> url = 'http://www.httpbin.org/cookies'
>>> cookies = {
    'name1':'value1',
    'name2':'value2'
}
>>> response = requests.get(url=url,cookies=cookies)
>>> print(response.text)
# {
#   "cookies": {
#     "name1": "value1",
#     "name2": "value2"
#   }
# }
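
Cookies can also be passed as a RequestsCookieJar, which additionally lets you scope each cookie to a domain and path (a small sketch, not part of the original example):

>>> jar = requests.cookies.RequestsCookieJar()
>>> jar.set('name1','value1',domain='www.httpbin.org',path='/cookies')
>>> response = requests.get(url=url,cookies=jar)
>>> print(response.text)
# {
#   "cookies": {
#     "name1": "value1"
#   }
# }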
  • auth: tuple; the username and password for HTTP Basic authentication
>>> import requests
>>> url = 'http://www.httpbin.org/basic-auth/user/password'
>>> auth = ('user','password')
>>> response = requests.get(url=url,auth=auth)
>>> print(response.text)
# {
#   "authenticated": true, 
#   "user": "user"
# }
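
Passing a tuple is shorthand for HTTP Basic authentication; the same thing can be written explicitly with requests.auth.HTTPBasicAuth (a small sketch, not part of the original example):

>>> from requests.auth import HTTPBasicAuth
>>> response = requests.get(url=url,auth=HTTPBasicAuth('user','password'))
>>> print(response.status_code)
# 200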
  • verify: bool; whether to verify the site's TLS certificate when making the request. Defaults to True, i.e. certificates are verified.
>>> import requests
>>> response = requests.get(url='https://www.httpbin.org/',verify=False)

However, in this case a warning is usually printed, because urllib3 (which requests uses underneath) would rather we verified the certificate. If you do not want to see the warning, it can be silenced with the following call.

>>> requests.packages.urllib3.disable_warnings()
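
Putting the two together, call disable_warnings() once before making unverified requests (a minimal sketch):

>>> import requests
>>> requests.packages.urllib3.disable_warnings() # suppress the InsecureRequestWarning
>>> response = requests.get(url='https://www.httpbin.org/',verify=False)
>>> print(response.status_code)
# 200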
  • timeout: the number of seconds to wait for a response; if no response arrives within this time, an exception is raised
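
A minimal sketch of normal timeout usage (5 seconds is an arbitrary value chosen for illustration):

>>> import requests
>>> response = requests.get(url='http://www.httpbin.org/get',timeout=5)
>>> print(response.status_code)
# 200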

2. The exceptions module

exceptions is the module in requests responsible for exception handling. It contains the following common exception classes:

  • Timeout: the request timed out
  • ConnectionError: a network problem, such as a DNS failure or a refused connection
  • TooManyRedirects: the request exceeded the configured maximum number of redirects
    Note: all exceptions explicitly raised by requests inherit from requests.exceptions.RequestException
>>> import requests
>>> try:
        response = requests.get('http://www.httpbin.org/get', timeout=0.1)
    except requests.exceptions.RequestException as e:
        if isinstance(e, requests.exceptions.Timeout):
            print("Time out")
# Time out
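
The same pattern extends to the other exception classes listed above; a sketch of slightly more complete handling (the function name and the URL are only for illustration):

>>> import requests
>>> def fetch(url):
        try:
            return requests.get(url, timeout=3)
        except requests.exceptions.Timeout:
            print("Time out")
        except requests.exceptions.TooManyRedirects:
            print("Too many redirects")
        except requests.exceptions.ConnectionError:
            print("Network problem")
        except requests.exceptions.RequestException as e:
            print("Other request error:", e)
>>> fetch('http://nonexistent.invalid/get')
# Network problem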