Scraping Stock Data with a Python Crawler

2022-10-05 09:04:52

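Introduction:

This post writes a crawler script that collects the Shanghai stock codes from Eastmoney (东方财富网), then fetches each stock's page on Baidu Stocks (百度股票) and saves the data for all Shanghai stocks to a local file.

System environment:

64-bit Windows 10, 64-bit Python 3.6; the IDE is PyCharm.

Prerequisites:

Basic familiarity with BeautifulSoup and with regular expressions (the re module).

Code: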

import re
import traceback

import requests
from bs4 import BeautifulSoup

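def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        # Replace with your own browser's User-Agent string
        user_agent = 'your own browser User-Agent string'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        # Guess the encoding from the page content
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""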

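def getStockList(lst, stock_list_url):
    """Append every Shanghai stock code found in the page's link hrefs to lst."""
    html = getHTMLText(stock_list_url)
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a'):
        try:
            href = a.attrs['href']
            lst.append(re.findall(r"sh\d{6}", href)[0])
        except (KeyError, IndexError):
            # The tag has no href, or the href contains no stock code
            continue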

def getStockInfo(lst, stock_info_url, fpath):
    """Fetch each stock's page and append its fields to fpath, one dict per line."""
    for stock in lst:
        url = stock_info_url + stock + '.html'
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            # The page may load without containing any stock data
            if stockInfo is None:
                continue
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            # The key '股票名称' means "stock name"
            infoDict.update({'股票名称': name.text.split()[0]})
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except Exception:
            traceback.print_exc()
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'http://gupiao.baidu.com/stock/'
    output_file = 'D://Postgraduate//Python//python项目//Python网络爬虫与信息提取-中国大学MOOC//3 网络爬虫之实战//BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)


if __name__ == '__main__':
    main()
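
The script appends one line per stock, and each line is the str() of a dict. As a rough sketch of how such a file could be read back later, the snippet below parses each line with ast.literal_eval. The helper name load_stock_records, the temporary demo file, and the sample record are all assumptions made for illustration; they are not part of the original script.

```python
import ast
import os
import tempfile


def load_stock_records(path):
    """Read a file holding one str(dict) per line back into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                # literal_eval only accepts Python literals, unlike eval
                records.append(ast.literal_eval(line))
    return records


# Demo on a temporary file; the record below is invented sample data
sample = {"股票名称": "ExampleCo", "Open": "10.00"}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "a", encoding="utf-8") as f:
        f.write(str(sample) + "\n")
    loaded = load_stock_records(demo_path)

print(loaded)
```

Writing json.dumps(infoDict) instead of str(infoDict) would be another option if the file needs to be read by other tools.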

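Code explanation:

The first function, getHTMLText, fetches the source of the requested page.

The second function, getStockList, collects all Shanghai stock codes listed on Eastmoney. Viewing the page source shows that the stock codes sit inside 'a' tags, as shown in the figure below:

[Figure: screenshot of the Eastmoney page source]

The function therefore first uses find_all to collect every 'a' tag, reads the href attribute of each tag, and applies the regular expression "sh\d{6}" to that href, which matches codes such as sh200010.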
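
To make the matching step concrete, here is a minimal sketch that applies the same pattern to a few href strings. The sample hrefs are invented for illustration and are not copied from the real Eastmoney page, and extract_codes is a hypothetical helper name.

```python
import re

# Hypothetical href values, invented for illustration only
sample_hrefs = [
    "http://quote.eastmoney.com/sh200010.html",
    "http://quote.eastmoney.com/sz000001.html",
    "http://quote.eastmoney.com/center/",
    "http://quote.eastmoney.com/sh600000.html",
]

STOCK_CODE = re.compile(r"sh\d{6}")


def extract_codes(hrefs):
    """Return the first 'sh' + six-digit code in each href, skipping non-matches."""
    codes = []
    for href in hrefs:
        match = STOCK_CODE.search(href)
        if match:
            codes.append(match.group())
    return codes


print(extract_codes(sample_hrefs))  # ['sh200010', 'sh600000']
```

This sketch tests the match result explicitly instead of indexing findall(...)[0] inside a try block; both approaches skip links that contain no Shanghai code.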
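The third function, getStockInfo, takes each stock code produced by the second function and builds a URL from it, then uses the first function to fetch that page. It first checks whether the returned html is empty and, if so, skips to the next stock. A second check is also needed, and this is the crucial one: when the page for a stock does not exist, the request can still return HTML source, so the following call

stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

may return None. Without a check, the next lookup on stockInfo would raise an error, so we add: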

if stockInfo is None:

    continue
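
The effect of this check can be seen on two small HTML strings. This is only a sketch: both snippets are invented and merely mimic the stock-bets structure the script looks for, and parse_stock is a hypothetical helper rather than part of the original code.

```python
from bs4 import BeautifulSoup

# Invented HTML: one page with the expected block, one without it
page_with_data = """
<div class="stock-bets">
  <a class="bets-name">ExampleCo (sh000000)</a>
  <dl><dt>Open</dt><dd>10.00</dd></dl>
  <dl><dt>Volume</dt><dd>1234</dd></dl>
</div>
"""
page_without_data = "<html><body><p>Page not found</p></body></html>"


def parse_stock(html):
    """Return a dict of stock fields, or None if the stock-bets block is missing."""
    soup = BeautifulSoup(html, "html.parser")
    stock_info = soup.find("div", attrs={"class": "stock-bets"})
    if stock_info is None:
        # find() returns None when nothing matches, so stop before using it
        return None
    name_tag = stock_info.find(attrs={"class": "bets-name"})
    info = {"name": name_tag.text.split()[0]}
    for dt, dd in zip(stock_info.find_all("dt"), stock_info.find_all("dd")):
        info[dt.text] = dd.text
    return info


print(parse_stock(page_with_data))
print(parse_stock(page_without_data))
```

The second call returns None instead of raising, which is the behavior the check in getStockInfo relies on when it skips a stock.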
