2. Getting started with the requests module

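1. Python modules for making network requests:

urllib: an older standard-library module, more verbose and rarely used for crawling today
requests: concise and efficient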

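2. The requests module:

A third-party Python module for making network requests. It is not part of the standard library, so it must be installed first. It is powerful and efficient.

Purpose:

Simulates a browser sending requests.

How to use it:

Specify the URL
Send the request
Get the response data
Parse the data
Persist the data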
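
The five steps above can be sketched as a few small functions. This is only a minimal sketch: the function names `fetch`, `parse_title`, and `save` are hypothetical, the 10-second timeout is an arbitrary choice, and the regex-based title extraction merely stands in for whatever parsing a real task needs.

```python
import re
from pathlib import Path

import requests


def fetch(url, timeout=10):
    """Steps 1-3: send a GET request to url and return the body as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx status codes
    return response.text


def parse_title(html):
    """Step 4 (illustrative): extract the <title> text, or None if absent."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def save(text, path):
    """Step 5: write the text to path as UTF-8."""
    Path(path).write_text(text, encoding="utf-8")
```

For example, `html = fetch('https://www.sogou.com/')` followed by `save(html, 'sogou.html')` would download a page and store it. Nothing is requested until `fetch` is called.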

Installation:

pip install requests

3. Practice

1. Task: crawl the page data of the Sogou home page

import requests


if __name__ == "__main__":
    # Step 1: specify the URL
    url = 'https://www.sogou.com/'
    # Step 2: send the request
    # get() returns a Response object
    response = requests.get(url=url)
    # Step 3: get the response data; .text returns the body as a string
    page_text = response.text
    print(page_text)
    # Step 4: persist the data
    with open(r'./sogou.html', mode='w', encoding='utf-8') as fp:
        fp.write(page_text)
    print('Crawl finished!')

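2. Task: crawl the Sogou search results page for a given keyword (a simple web page collector)

Note: when adding the User-Agent request header, a value copied straight from the browser's F12 developer tools may carry extra spaces, which makes the UA spoofing fail. After pasting, be sure to delete any extra spaces.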
import requests

# UA spoofing
# User-Agent: identifies the client that sends the request
if __name__ == "__main__":
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    url = "https://www.sogou.com/web"
    # Put the URL query parameters into a dict
    kw = input('enter a word:')
    param = {
        "query": kw
    }
    # Send the request; requests encodes the parameters into the query string
    response = requests.get(url=url, params=param, headers=headers)
    file_name = kw + ".html"
    page_text = response.text
    with open(file_name, mode='w', encoding='utf-8') as fp:
        fp.write(page_text)
    print(file_name, 'saved successfully!')
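
To see what the `params` argument does without touching the network, the request can be built and inspected before it is sent. The following is a minimal sketch using `requests.Request(...).prepare()`; the keywords are arbitrary examples.

```python
import requests

url = "https://www.sogou.com/web"

# Build the request without sending it, then inspect the final URL.
prepared = requests.Request("GET", url, params={"query": "python"}).prepare()
print(prepared.url)  # https://www.sogou.com/web?query=python

# Non-ASCII keywords are percent-encoded as UTF-8 automatically.
prepared_cn = requests.Request("GET", url, params={"query": "喜剧"}).prepare()
print(prepared_cn.url)  # https://www.sogou.com/web?query=%E5%96%9C%E5%89%A7
```

This is why the keyword can be passed as a plain string in the dict: there is no need to encode it by hand or to concatenate it onto the URL.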

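3. Task: crack Baidu Translate

In the F12 developer tools, filter by XHR to capture the AJAX request. It is a POST request that carries parameters, and the response is JSON data.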
import requests
import json

if __name__ == "__main__":
    url = "https://fanyi.baidu.com/sug"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    word = input('enter a word:')
    data = {
        "kw": word
    }
    response = requests.post(url=url, data=data, headers=headers)

    # json() returns a Python object; use it only when the response body is JSON
    dic_obj = response.json()
    # print(dic_obj)

    # Persist the data; the with block closes the file even if dump() fails
    file_name = word + '.json'
    with open(file_name, mode='w', encoding='utf-8') as fp:
        json.dump(dic_obj, fp=fp, ensure_ascii=False)
    print('Saved successfully!')
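
The POST parameters here are passed with `data=`, which sends them as a URL-encoded form body. The sketch below compares this with the `json=` argument by preparing both requests without sending them; the word "dog" is an arbitrary example, and which encoding a given server expects has to be read off the request captured in the F12 tools.

```python
import json

import requests

url = "https://fanyi.baidu.com/sug"

# data= sends the dict as a URL-encoded form body.
form_req = requests.Request("POST", url, data={"kw": "dog"}).prepare()
print(form_req.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form_req.body)                     # kw=dog

# json= serializes the dict to JSON and sets a JSON Content-Type instead.
json_req = requests.Request("POST", url, json={"kw": "dog"}).prepare()
print(json_req.headers["Content-Type"])  # application/json
print(json.loads(json_req.body))         # {'kw': 'dog'}
```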

4. Task: crawl the Douban movie category ranking

"https://movie.douban.com/typerank?type_name=%E5%96%9C%E5%89%A7&type=24&interval_id=100:90&action="
import requests
import json

if __name__ == "__main__":
    url = "https://movie.douban.com/j/chart/top_list"
    param = {
        "type": "24",
        "interval_id": "100:90",
        "action": "",
        "start": "0",  # index of the first movie to fetch
        "limit": "20"  # number of movies to fetch per request
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    response = requests.get(url=url, params=param, headers=headers)
    list_data = response.json()

    # The with block closes the file so the data is flushed to disk
    with open('./douban.json', mode='w', encoding='utf-8') as fp:
        json.dump(list_data, fp=fp, ensure_ascii=False)
    print('over!!!')
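
The `start` and `limit` parameters make it possible to walk through the ranking page by page. Below is a rough sketch of such a loop. It assumes, without having verified it, that the endpoint returns an empty list once `start` passes the last movie; the names `fetch_page` and `iter_movies`, the cap of 5 pages, and the 1-second pause are all illustrative choices.

```python
import time

import requests

URL = "https://movie.douban.com/j/chart/top_list"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # placeholder; use a full browser UA string


def fetch_page(start, limit):
    """Fetch one page of results as a list (performs a network request)."""
    param = {"type": "24", "interval_id": "100:90", "action": "",
             "start": str(start), "limit": str(limit)}
    response = requests.get(URL, params=param, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.json()


def iter_movies(fetch=fetch_page, limit=20, max_pages=5, delay=1.0):
    """Yield items page by page until a page comes back empty."""
    for page in range(max_pages):
        batch = fetch(page * limit, limit)
        if not batch:  # assumed end-of-data signal
            break
        yield from batch
        time.sleep(delay)  # pause between requests to avoid hammering the server
```

Calling `list(iter_movies())` would collect up to `max_pages * limit` entries. Passing the page fetcher in as an argument keeps the loop easy to try out with fake data.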

5. Task: crawl the KFC restaurant lookup

"http://www.kfc.com.cn/kfccda/storelist/index.aspx"
You can add a loop yourself to fetch every restaurant that matches the keyword.
import requests
import json

if __name__ == "__main__":
    url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    word = input('enter a word:')
    data = {
        "cname": "",
        "pid": "",
        "keyword": word,
        "pageIndex": 1,
        "pageSize": "10"
    }
    response = requests.post(url=url, data=data, headers=headers)

    # Parse the body first; dumping response.text would write one quoted string
    with open(word + '.json', mode='wt', encoding='utf-8') as fp:
        json.dump(response.json(), fp, ensure_ascii=False)
    print('over!!!')
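
When saving a JSON response, it matters whether the body is parsed before it is handed to `json.dump`. The sketch below illustrates the difference offline; the sample string is made up for the demonstration and is not the real response format of the KFC endpoint.

```python
import json

# Made-up stand-in for response.text; not the real response format.
text = '{"name": "示例餐厅", "city": "北京"}'

# Dumping the raw string wraps it in quotes and escapes the inner quotes.
as_string = json.dumps(text, ensure_ascii=False)
print(type(json.loads(as_string)).__name__)  # str

# Parsing first (what response.json() does) and then dumping keeps the structure.
parsed = json.loads(text)
as_object = json.dumps(parsed, ensure_ascii=False)
print(type(json.loads(as_object)).__name__)  # dict
```

A file written the first way loads back as one long string rather than as a dict, so any later processing would have to parse it a second time.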

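6. Task: crawl the cosmetics production license data from the National Medical Products Administration

"http://scxk.nmpa.gov.cn:81/xk/"
After the page loads, it sends a POST request to the server via AJAX to fetch the data:
"http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList"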
import requests
import json

if __name__ == "__main__":
    url = "http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }

    data = {
        "on": "true",
        "page": 1,
        "pageSize": 15,
        "productName": "",
        "conditionType": 1,
        "applyname": "",
        "applysn": ""
    }
    response = requests.post(url=url, data=data, headers=headers)
    # response.json()["list"] is a list with one entry per company
    ID_list = []
    all_detail_list = []
    for item in response.json()["list"]:
        ID_list.append(item["ID"])
    # print(ID_list)

    # Fetch the detail data for each company by its ID
    post_url = "http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById"
    for i in ID_list:
        data = {
            "id": i
        }
        detail_json = requests.post(post_url, data=data, headers=headers).json()
        all_detail_list.append(detail_json)
    # Persist the data
    with open("./allData.json", mode='wt', encoding='utf-8') as fp:
        json.dump(all_detail_list, fp=fp, ensure_ascii=False)
    print('over!!!')
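
Because this task sends many requests to the same host, a `requests.Session` can be used so that the connection and the headers are shared between them. The following is a rough sketch rather than a drop-in replacement: `extract_ids` and `fetch_details` are hypothetical helper names, the 0.5-second pause is an arbitrary choice, and the code assumes the list response has the `"list"` / `"ID"` shape used above.

```python
import time

import requests

DETAIL_URL = "http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById"


def extract_ids(list_json):
    """Collect the ID of every item in a list-page response."""
    return [item["ID"] for item in list_json["list"]]


def fetch_details(ids, user_agent, delay=0.5):
    """Fetch the detail record for each ID over one shared session."""
    details = []
    with requests.Session() as session:
        session.headers.update({"User-Agent": user_agent})  # sent with every request
        for company_id in ids:
            response = session.post(DETAIL_URL, data={"id": company_id}, timeout=10)
            response.raise_for_status()
            details.append(response.json())
            time.sleep(delay)  # short pause between detail requests
    return details
```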