2021-10-20

Python Web Scraping Study Notes

Web crawler: a program or script that automatically fetches information from the Internet according to certain rules.

The essence of a crawler: simulate a browser opening a web page and extract the part of the page's data that we want.

#Three main libraries:

  1. requests
  2. BeautifulSoup
  3. lxml
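As a minimal, self-contained sketch of how these libraries fit together (the HTML string, tag names, and class name are made up for illustration; a real crawl would fetch the HTML with requests first):

```python
# requests fetches pages over HTTP; a canned HTML string keeps this
# sketch offline. BeautifulSoup parses it with lxml as the backend.
from bs4 import BeautifulSoup

html = '<html><body><h1>Demo</h1><p class="price">12.34</p></body></html>'
soup = BeautifulSoup(html, "lxml")
title = soup.h1.text                         # text of the <h1> tag
price = soup.find("p", class_="price").text  # text of the matching <p> tag
print(title, price)
```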

Basic crawler workflow: preparation -> fetch data -> parse content -> save data
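The four steps above can be sketched as follows (the URL, headers, and the canned response are placeholders; a real run would call requests.get in the fetch step):

```python
import re

# preparation: decide the target URL and request headers
def prepare():
    url = "http://quote.eastmoney.com/center/gridlist.html#hs_a_board"
    headers = {"User-Agent": "Mozilla/5.0"}
    return url, headers

# fetch: normally requests.get(url, headers=headers).text;
# a canned JSONP-style response keeps this sketch offline
def fetch(url, headers):
    return '"klines":["2021-10-20,1.0","2021-10-21,2.0"]'

# parse: pull the wanted fields out of the raw text
def parse(text):
    raw = re.findall(r'"klines":\[(.*?)\]', text, re.S)[0]
    return [row.replace('"', "").split(",") for row in raw.split('","')]

# save: persist the parsed rows (a real run would write a file or database)
def save(rows):
    return rows

url, headers = prepare()
rows = save(parse(fetch(url, headers)))
print(rows)
```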

Example:

http://quote.eastmoney.com/center/gridlist.html#hs_a_board

Right-click -> Inspect -> Network (JS filter) -> reload the page -> find the .js-suffixed requests -> view the data in Preview

In the Headers tab, find the Request URL, e.g. the push2his.eastmoney.com address used in the code below.

Scroll further down to find the User-Agent.


Code:

#Crawler
import re

import pandas as pd
import requests

#Use GET to request the server and extract the page data
def getHtml():
    url = "https://push2his.eastmoney.com/api/qt/stock/fflow/daykline/get?cb=jQuery11230519443663453389_1634723272941&lmt=0&klt=101&fields1=f1%2Cf2%2Cf3%2Cf7&fields2=f51%2Cf52%2Cf53%2Cf54%2Cf55%2Cf56%2Cf57%2Cf58%2Cf59%2Cf60%2Cf61%2Cf62%2Cf63%2Cf64%2Cf65&ut=b2884a393a59ad64002292a3e90d46a5&secid=0.000333&_=1634723272942"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}
    r = requests.get(url, headers=headers)
    #the response is JSONP text; a raw-string regex pulls out the "klines" array
    pat = r'"klines":\[(.*?)\]'
    data = re.compile(pat, re.S).findall(r.text)
    return data

#Parse the stock data of a single page into rows
def getOnePageStock():
    data = getHtml()
    datas = data[0].split('","')
    stocks = []
    for item in datas:
        stock = item.replace('"', "").split(",")
        stocks.append(stock)
    return stocks

#print(getOnePageStock())

def main():
    df = pd.DataFrame(getOnePageStock())
    #column labels start at 0; the names must match the fields
    #actually requested in the URL, so adjust them per page
    columns = {0:"Code",1:"Name",2:"Latest price",3:"Change amount",4:"Change pct",5:"Volume",6:"Turnover",7:"Amplitude",8:"High",9:"Low",10:"Open",11:"Prev close",12:"Volume ratio",13:"Time",14:"Note 1",15:"Note 2"}
    df.rename(columns=columns, inplace=True)
    #.xlsx via openpyxl; recent pandas no longer writes legacy .xls files
    df.to_excel("stocks.xlsx")
    print("Saved stocks.xlsx")

if __name__ == "__main__":
    main()

Adjust the code to match the page you are actually scraping.

Static data is saved locally with to_excel(); dynamic data should go into a database.
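For the database case, a minimal sketch using the standard-library sqlite3 module (the table name, column names, and sample rows are illustrative):

```python
import sqlite3

# sample rows in the same shape as getOnePageStock() output
rows = [["2021-10-20", "1.0"], ["2021-10-21", "2.0"]]

conn = sqlite3.connect(":memory:")  # use a file path for real persistence
conn.execute("CREATE TABLE stock (date TEXT, value TEXT)")
conn.executemany("INSERT INTO stock VALUES (?, ?)", rows)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM stock").fetchone()[0]
conn.close()
print(count)
```

For a crawler that runs repeatedly, parameterized executemany inserts like this also avoid SQL injection from scraped strings.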
