1. Modules for sending network requests in Python:
    urllib module: an older module that is rarely used any more
    requests module: simple and efficient
2. The requests module:
    A Python module for sending network requests (installed separately via pip); powerful and highly efficient.
    Purpose:
        simulate a browser sending requests.
    How to use:
        specify the URL
        send the request
        get the response data
        parse the data
        persist the data
    (a minimal sketch of this whole workflow follows the install command below)
    Environment setup:
        pip install requests
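A minimal sketch of the five steps above, using the Sogou homepage as the target. The "parse the data" step here is only a trivial stand-in (pulling the <title> text out of the HTML with a string search); the file name title.txt is an arbitrary choice for illustration:

import requests

if __name__ == "__main__":
    # 1. specify the URL
    url = 'https://www.sogou.com/'
    # 2. send the request
    response = requests.get(url=url)
    # 3. get the response data as a string
    page_text = response.text
    # 4. parse the data (trivial stand-in: extract the <title> text)
    title_start = page_text.find('<title>')
    title_end = page_text.find('</title>')
    if title_start != -1 and title_end != -1:
        title = page_text[title_start + len('<title>'):title_end]
    else:
        title = ''
    # 5. persist the result
    with open('./title.txt', mode='w', encoding='utf-8') as fp:
        fp.write(title)
    print('title:', title)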
3. Hands-on examples
    1. Requirement: crawl the page data of the Sogou homepage
import requests

if __name__ == "__main__":
    # step 1: specify the URL
    url = 'https://www.sogou.com/'
    # step 2: send the request; get() returns a response object
    response = requests.get(url=url)
    # step 3: get the response data; .text is the response body as a string
    page_text = response.text
    print(page_text)
    # step 4: persist the data
    with open('./sogou.html', mode='w', encoding='utf-8') as fp:
        fp.write(page_text)
    print('Crawling finished!')
    2. Requirement: crawl the Sogou search-results page for a given keyword (a simple web collector)
Note: when adding the User-Agent request header, copying it straight from the browser's F12 panel can bring along stray whitespace, which makes the UA spoofing fail; be sure to delete any extra whitespace after pasting.
import requests

# UA spoofing
# User-Agent identifies the client sending the request
if __name__ == "__main__":
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    url = "https://www.sogou.com/web"
    # pack the URL query parameters into a dict
    kw = input('enter a word:')
    param = {
        "query": kw
    }
    # send a GET request to the URL, with the parameters and headers applied
    response = requests.get(url=url, params=param, headers=headers)
    file_name = kw + ".html"
    page_text = response.text
    with open(file_name, mode='w', encoding='utf-8') as fp:
        fp.write(page_text)
    print(file_name, 'saved successfully!')
    3. Requirement: crack Baidu Translate
In the F12 network panel, filter by XHR to capture the Ajax POST request; the POST request carries parameters, and the response is a piece of JSON data.
import requests
import json

if __name__ == "__main__":
    url = "https://fanyi.baidu.com/sug"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    word = input('enter a word:')
    data = {
        "kw": word
    }
    response = requests.post(url=url, data=data, headers=headers)
    # json() returns a Python object; only call it when the response really is JSON
    dic_obj = response.json()
    # print(dic_obj)
    # persist the data
    file_name = word + '.json'
    fp = open(file_name, mode='w', encoding='utf-8')
    json.dump(dic_obj, fp=fp, ensure_ascii=False)
    fp.close()
    print('Saved successfully!')
    4. Requirement: crawl the Douban movie ranking chart by category
"https://movie.douban.com/typerank?type_name=%E5%96%9C%E5%89%A7&type=24&interval_id=100:90&action="
import requests
import json

if __name__ == "__main__":
    url = "https://movie.douban.com/j/chart/top_list"
    param = {
        "type": "24",
        "interval_id": "100:90",
        "action": "",
        "start": "0",   # index of the first movie to fetch
        "limit": "20"   # how many movies per request (see the paging sketch after this example)
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    response = requests.get(url=url, params=param, headers=headers)
    list_data = response.json()
    fp = open('./douban.json', mode='w', encoding='utf-8')
    json.dump(list_data, fp=fp, ensure_ascii=False)
    fp.close()
    print('over!!!')
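Since start and limit control paging, a loop over start can pull down more of the chart. The sketch below reuses the URL, headers, and parameters from the example above; the bound of 5 pages and the douban_all.json file name are arbitrary choices for illustration:

import requests
import json

if __name__ == "__main__":
    url = "https://movie.douban.com/j/chart/top_list"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    all_movies = []
    limit = 20
    for page in range(5):  # fetch up to 5 pages of 20 movies each (arbitrary bound)
        param = {
            "type": "24",
            "interval_id": "100:90",
            "action": "",
            "start": str(page * limit),  # offset of the first movie on this page
            "limit": str(limit)
        }
        page_data = requests.get(url=url, params=param, headers=headers).json()
        if not page_data:  # an empty list means we ran past the end of the chart
            break
        all_movies.extend(page_data)
    with open('./douban_all.json', mode='w', encoding='utf-8') as fp:
        json.dump(all_movies, fp=fp, ensure_ascii=False)
    print('fetched', len(all_movies), 'movies')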
    5. Requirement: crawl the KFC restaurant locator
"http://www.kfc.com.cn/kfccda/storelist/index.aspx"
You can add a loop yourself to pull out every restaurant that matches the keyword; see the sketch after the example below.
import requests
import json

if __name__ == "__main__":
    url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    word = input('enter a word:')
    data = {
        "cname": "",
        "pid": "",
        "keyword": word,
        "pageIndex": 1,
        "pageSize": "10"
    }
    response = requests.post(url=url, data=data, headers=headers)
    # the response body is JSON, so dump the parsed object rather than the raw text
    with open(word + '.json', mode='wt', encoding='utf-8') as fp:
        json.dump(response.json(), fp, ensure_ascii=False)
    print('over!!!')
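A sketch of the loop mentioned above. It assumes the endpoint keeps accepting larger pageIndex values and that the matching stores come back under a "Table1" key in the JSON (that key name is an assumption; check the real response in F12 and adjust if it differs). The loop stops as soon as a page comes back empty:

import requests
import json

if __name__ == "__main__":
    url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    word = input('enter a word:')
    all_stores = []
    page = 1
    while True:
        data = {
            "cname": "",
            "pid": "",
            "keyword": word,
            "pageIndex": page,
            "pageSize": "10"
        }
        page_json = requests.post(url=url, data=data, headers=headers).json()
        # "Table1" is assumed to hold the store records; adjust to the real key if needed
        stores = page_json.get("Table1", [])
        if not stores:
            break
        all_stores.extend(stores)
        page += 1
    with open(word + '_all.json', mode='w', encoding='utf-8') as fp:
        json.dump(all_stores, fp, ensure_ascii=False)
    print('fetched', len(all_stores), 'stores')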
    6. Requirement: crawl the cosmetics production licence data from the National Medical Products Administration (NMPA)
"http://scxk.nmpa.gov.cn:81/xk/"
After the page has loaded, it sends a further Ajax POST request to the server to fetch the data:
"http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList"
import requests
import json

if __name__ == "__main__":
    url = "http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    data = {
        "on": "true",
        "page": 1,
        "pageSize": 15,
        "productName": "",
        "conditionType": 1,
        "applyname": "",
        "applysn": ""
    }
    response = requests.post(url=url, data=data, headers=headers)
    # print(type(response.json()["list"]))  # <class 'list'>
    ID_list = []
    all_detail_list = []
    for item in response.json()["list"]:
        ID_list.append(item["ID"])
    # print(ID_list)
    # fetch the detail data for each company
    post_url = "http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById"
    for i in ID_list:
        data = {
            "id": i
        }
        detail_json = requests.post(post_url, data=data, headers=headers).json()
        all_detail_list.append(detail_json)
    # persist the data
    with open("./allData.json", mode='wt', encoding='utf-8') as fp:
        json.dump(all_detail_list, fp=fp, ensure_ascii=False)
    print('over!!!')