一、选题背景
为什么要选择此选题?要达到的数据分析的预期目标是什么?
随着互联网进入大数据时代,人们获取咨询的方法越来越多,而财经信息又与人们的生活息息相关,所以关于财经的信息就有为重要,为了能更快更好的了解市场基金的走向,我选择了这个课题,主要为了更方便了解有关基金的动态。
二、主题式网络爬虫设计方案
1.主题式网络爬虫名称:天天基金网爬虫分析
2.主题式网络爬虫爬取的内容与数据特征分析:通过访问天天基金的网站,爬取相对应的信息,最后保存下来做可视化分析。
3.主题式网络爬虫设计方案概述(包括实现思路与技术难点):
首先,用request进行访问页面。
其次,用xtree来获取页面内容,用etree.xpath进行数据筛选。
最后,文件操作进行数据的保存。
难点:网站的爬取与数据筛选。
技术难点:
三、主题页面的结构特征分析
1.主题页面的结构与特征分析
数据来源:http://fund.eastmoney.com/fund.html
2.Htmls 页面解析
四、网络爬虫程序设计
爬虫程序主体要包括以下各部分,要附源代码及较详细注释,并在每部分程序后面提供输出结果的截图。
1.数据爬取与采集
"""ua大列表"""
USER_AGENT_LIST = [
'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36',
'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84',
'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0',
'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
]
2.对数据进行清洗和处理
def __init__(self): # 起始的请求地址----初始化 self.start_url = 'http://fund.eastmoney.com/fund.html' # 第二份数据地址 self.next_url = 'http://fund.eastmoney.com/HBJJ_pjsyl.html' def parse_start_url(self): """ 发送请求,获取响应 :return: """ # 请求头 headers = { # 通过随机模块提供的随机拿取数据方法 'User-Agent': random.choice(USER_AGENT_LIST) } # 发送请求,获取响应字节数据 response = session.get(self.start_url, headers=headers).content """序列化对象,将字节内容数据,经过转换,变成可进行xpath操作的对象""" response = etree.HTML(response) """调用提取第二份响应数据""" self.parse_next_url_response(response) def parse_next_url_response(self, response_1): """ 解析第二个数据页地址 :return: """ # 请求头 headers = { # 通过随机模块提供的随机拿取数据方法 'User-Agent': random.choice(USER_AGENT_LIST) } # 发送请求,获取响应字节数据 response = session.get(self.start_url, headers=headers).content """序列化对象,将字节内容数据,经过转换,变成可进行xpath操作的对象""" response = etree.HTML(response) """调用解析response响应数据方法""" self.parse_response_data(response, response_1) def parse_response_data(self, response_1, response): """ 解析response响应数据,提取 :return: """ # 股票名称 name_list_1 = response.xpath('//tbody/tr/td[5]/nobr/a[1]/text()') name_list_2 = response_1.xpath('//tbody/tr/td[5]/nobr/a[1]/text()') # 合并 name_list = name_list_1 + name_list_2 # 昨日单位净值 num_1_list_data_1 = response.xpath('//tbody/tr/td[6]/text()') num_1_list_data_2 = response_1.xpath('//tr/td[6]/span/text()') # 合并 num_1_list = num_1_list_data_1 + num_1_list_data_2 # 昨日累计净值 num_2_list_data_1 = response.xpath('//tbody/tr/td[7]/text()') num_2_list_data_2 = response_1.xpath('//tr/td[7]/text()') # 合并 num_2_list = num_2_list_data_1 + num_2_list_data_2 """调用解析三个列表的方法""" self.for_parse_three_list(name_list, num_1_list, num_2_list) def for_parse_three_list(self, name_list, num_1_list, num_2_list): """ 解析循环, :param name_list: 股票名称 :param num_1_list: 昨日单位净值 :param num_2_list: 昨日累计净值 :return: """ # 遍历解析3个列表数据 for a, b, c in zip(name_list, num_1_list, num_2_list): # 构造保存的excel字典数据 dict_data = { # 会根据该字典的key值创建工作簿的sheet名 '股票数据': [a, b, c] } """调用解析保存excel表格方法""" self.parse_save_excel(dict_data) print(f'企业:{a}----采集完成!') """数据采集完成,调用分析生成图像方法""" self.parse_random_data(name_list, num_1_list, num_2_list) def parse_random_data(self, name_list, num_1_list, num_2_list): """ 随机抽取15条数据,进行分析 :return: """ # 存放随机号码的列表 index_list = [] for i in range(15): # 随机抽取15个数据进行分析 random_num = random.randint(0, 200) # 将随机抽取的号码添加进入准备的列表中 index_list.append(random_num) """随机号码生成以后,调用解析生成四张分析图的方法""" self.parse_img_four_func(index_list, name_list, num_1_list, num_2_list)
4.数据分析与可视化(例如:数据柱形图、直方图、散点图、盒图、分布图)
def parse_img_four_func(self, index_list, name_list, num_1_list, num_2_list): """ 解析生成四张分析图 :param index_list: 随机数据的下标 :param name_list: 股票名称列表 :param num_1_list: 昨日单位净值列表 :param num_2_list: 昨日累计净值列表 :return: """ title_list = [] # 名称 qy_num_1 = [] # 单位净值 qy_num_2 = [] # 累计净值 for index_num in index_list: # 企业名称列表 title_list.append(name_list[index_num]) # 昨日单位净值列表 qy_num_1.append(num_1_list[index_num]) # 昨日累计净值列表 qy_num_2.append(num_2_list[index_num]) # 第一张图:根据净值生成折线图 plt.rcParams['font.sans-serif'] = ['SimHei'] plt.rcParams['axes.unicode_minus'] = False # plot中参数的含义分别是横轴值,纵轴值,线的形状,颜色,透明度,线的宽度和标签 plt.plot(title_list, qy_num_2, 'ro-', color='#4169E1', alpha=0.8, linewidth=1, label='累计净值') plt.plot(title_list, qy_num_1, 'ro-', color='#69e141', alpha=0.8, linewidth=1, label='单位净值') # 显示标签,如果不加这句,即使在plot中加了label='一些数字'的参数,最终还是不会显示标签 plt.legend(loc="upper right") plt.xticks(rotation=270) plt.xlabel('地点数量') plt.ylabel('工作属性数量') plt.savefig('根据净值生成折线图.png') plt.show() # 第二张图:根据单位净值生成饼图 addr_dict_key = title_list addr_dict_value = qy_num_1 plt.rcParams['font.sans-serif'] = ['Microsoft YaHei'] plt.rcParams['axes.unicode_minus'] = False plt.pie(addr_dict_value, labels=addr_dict_key, autopct='%1.1f%%') plt.title(f'单位净值对比') plt.savefig(f'单位净值对比-饼图') plt.show() # 第三张图:根据累计净值生成散点图 # 这两行代码解决 plt 中文显示的问题 plt.rcParams['font.sans-serif'] = ['SimHei'] plt.rcParams['axes.unicode_minus'] = False # 输入岗位地址和岗位属性数据 production = title_list tem = qy_num_2 colors = np.random.rand(len(tem)) # 颜色数组 plt.scatter(tem, production, s=200, c=colors) # 画散点图,大小为 200 plt.xlabel('数量') # 横坐标轴标题 plt.xticks(rotation=270) plt.ylabel('名称') # 纵坐标轴标题 plt.savefig(f'净值散点图.png') plt.show() # 第四张图:根据净值生成柱状图 import matplotlib;matplotlib.use('TkAgg') plt.rcParams['font.sans-serif'] = ['SimHei'] plt.rcParams['axes.unicode_minus'] = False zhfont1 = matplotlib.font_manager.FontProperties(fname='C:\Windows\Fonts\simsun.ttc') name_list = title_list num_list = [float(i) for i in qy_num_1] # 单位净值 width = 0.5 # 柱子的宽度 index = np.arange(len(name_list)) plt.bar(index, num_list, width, color='steelblue', tick_label=name_list, label='单位净值') plt.bar(index + width, qy_num_2, width, color='red', hatch='\\', label='累计净值') plt.legend(['单位净值', '累计净值'], prop=zhfont1, labelspacing=1) for a, b in zip(index, num_list): # 柱子上的数字显示 plt.text(a, b, '%.2f' % b, ha='center', va='bottom', fontsize=7) plt.xticks(rotation=270) plt.title('净值柱状图') plt.ylabel('率') plt.legend() plt.savefig(f'净值-柱状图', bbox_inches='tight') plt.show()
5.将以上各部分的代码汇总,附上完整程序代码
"""ua大列表""" USER_AGENT_LIST = [ 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0', 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36', 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2', 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174', 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1', 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36', 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)', 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36', 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0', 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36', 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2', 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174', 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1', 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36', 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)', 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36', 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0', 'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36', ] from requests_html import HTMLSession import os, xlwt, xlrd, random from xlutils.copy import copy import numpy as np from matplotlib import pyplot as plt from matplotlib.font_manager import FontProperties # 字体库 from lxml import etree session = HTMLSession() class DFSpider(object): def __init__(self): # 起始的请求地址----初始化 self.start_url = 'http://fund.eastmoney.com/fund.html' # 第二份数据地址 self.next_url = 'http://fund.eastmoney.com/HBJJ_pjsyl.html' def parse_start_url(self): """ 发送请求,获取响应 :return: """ # 请求头 headers = { # 通过随机模块提供的随机拿取数据方法 'User-Agent': random.choice(USER_AGENT_LIST) } # 发送请求,获取响应字节数据 response = session.get(self.start_url, headers=headers).content """序列化对象,将字节内容数据,经过转换,变成可进行xpath操作的对象""" response = etree.HTML(response) """调用提取第二份响应数据""" self.parse_next_url_response(response) def parse_next_url_response(self, response_1): """ 解析第二个数据页地址 :return: """ # 请求头 headers = { # 通过随机模块提供的随机拿取数据方法 'User-Agent': random.choice(USER_AGENT_LIST) } # 发送请求,获取响应字节数据 response = session.get(self.start_url, headers=headers).content """序列化对象,将字节内容数据,经过转换,变成可进行xpath操作的对象""" response = etree.HTML(response) """调用解析response响应数据方法""" self.parse_response_data(response, response_1) def parse_response_data(self, response_1, response): """ 解析response响应数据,提取 :return: """ # 股票名称 name_list_1 = response.xpath('//tbody/tr/td[5]/nobr/a[1]/text()') name_list_2 = response_1.xpath('//tbody/tr/td[5]/nobr/a[1]/text()') # 合并 name_list = name_list_1 + name_list_2 # 昨日单位净值 num_1_list_data_1 = response.xpath('//tbody/tr/td[6]/text()') num_1_list_data_2 = response_1.xpath('//tr/td[6]/span/text()') # 合并 num_1_list = num_1_list_data_1 + num_1_list_data_2 # 昨日累计净值 num_2_list_data_1 = response.xpath('//tbody/tr/td[7]/text()') num_2_list_data_2 = response_1.xpath('//tr/td[7]/text()') # 合并 num_2_list = num_2_list_data_1 + num_2_list_data_2 """调用解析三个列表的方法""" self.for_parse_three_list(name_list, num_1_list, num_2_list) def for_parse_three_list(self, name_list, num_1_list, num_2_list): """ 解析循环, :param name_list: 股票名称 :param num_1_list: 昨日单位净值 :param num_2_list: 昨日累计净值 :return: """ # 遍历解析3个列表数据 for a, b, c in zip(name_list, num_1_list, num_2_list): # 构造保存的excel字典数据 dict_data = { # 会根据该字典的key值创建工作簿的sheet名 '股票数据': [a, b, c] } """调用解析保存excel表格方法""" self.parse_save_excel(dict_data) print(f'企业:{a}----采集完成!') """数据采集完成,调用分析生成图像方法""" self.parse_random_data(name_list, num_1_list, num_2_list) def parse_random_data(self, name_list, num_1_list, num_2_list): """ 随机抽取15条数据,进行分析 :return: """ # 存放随机号码的列表 index_list = [] for i in range(15): # 随机抽取15个数据进行分析 random_num = random.randint(0, 200) # 将随机抽取的号码添加进入准备的列表中 index_list.append(random_num) """随机号码生成以后,调用解析生成四张分析图的方法""" self.parse_img_four_func(index_list, name_list, num_1_list, num_2_list) def parse_img_four_func(self, index_list, name_list, num_1_list, num_2_list): """ 解析生成四张分析图 :param index_list: 随机数据的下标 :param name_list: 股票名称列表 :param num_1_list: 昨日单位净值列表 :param num_2_list: 昨日累计净值列表 :return: """ title_list = [] # 名称 qy_num_1 = [] # 单位净值 qy_num_2 = [] # 累计净值 for index_num in index_list: # 企业名称列表 title_list.append(name_list[index_num]) # 昨日单位净值列表 qy_num_1.append(num_1_list[index_num]) # 昨日累计净值列表 qy_num_2.append(num_2_list[index_num]) # 第一张图:根据净值生成折线图 plt.rcParams['font.sans-serif'] = ['SimHei'] plt.rcParams['axes.unicode_minus'] = False # plot中参数的含义分别是横轴值,纵轴值,线的形状,颜色,透明度,线的宽度和标签 plt.plot(title_list, qy_num_2, 'ro-', color='#4169E1', alpha=0.8, linewidth=1, label='累计净值') plt.plot(title_list, qy_num_1, 'ro-', color='#69e141', alpha=0.8, linewidth=1, label='单位净值') # 显示标签,如果不加这句,即使在plot中加了label='一些数字'的参数,最终还是不会显示标签 plt.legend(loc="upper right") plt.xticks(rotation=270) plt.xlabel('地点数量') plt.ylabel('工作属性数量') plt.savefig('根据净值生成折线图.png') plt.show() # 第二张图:根据单位净值生成饼图 addr_dict_key = title_list addr_dict_value = qy_num_1 plt.rcParams['font.sans-serif'] = ['Microsoft YaHei'] plt.rcParams['axes.unicode_minus'] = False plt.pie(addr_dict_value, labels=addr_dict_key, autopct='%1.1f%%') plt.title(f'单位净值对比') plt.savefig(f'单位净值对比-饼图') plt.show() # 第三张图:根据累计净值生成散点图 # 这两行代码解决 plt 中文显示的问题 plt.rcParams['font.sans-serif'] = ['SimHei'] plt.rcParams['axes.unicode_minus'] = False # 输入岗位地址和岗位属性数据 production = title_list tem = qy_num_2 colors = np.random.rand(len(tem)) # 颜色数组 plt.scatter(tem, production, s=200, c=colors) # 画散点图,大小为 200 plt.xlabel('数量') # 横坐标轴标题 plt.xticks(rotation=270) plt.ylabel('名称') # 纵坐标轴标题 plt.savefig(f'净值散点图.png') plt.show() # 第四张图:根据净值生成柱状图 import matplotlib;matplotlib.use('TkAgg') plt.rcParams['font.sans-serif'] = ['SimHei'] plt.rcParams['axes.unicode_minus'] = False zhfont1 = matplotlib.font_manager.FontProperties(fname='C:\Windows\Fonts\simsun.ttc') name_list = title_list num_list = [float(i) for i in qy_num_1] # 单位净值 width = 0.5 # 柱子的宽度 index = np.arange(len(name_list)) plt.bar(index, num_list, width, color='steelblue', tick_label=name_list, label='单位净值') plt.bar(index + width, qy_num_2, width, color='red', hatch='\\', label='累计净值') plt.legend(['单位净值', '累计净值'], prop=zhfont1, labelspacing=1) for a, b in zip(index, num_list): # 柱子上的数字显示 plt.text(a, b, '%.2f' % b, ha='center', va='bottom', fontsize=7) plt.xticks(rotation=270) plt.title('净值柱状图') plt.ylabel('率') plt.legend() plt.savefig(f'净值-柱状图', bbox_inches='tight') plt.show() def parse_save_excel(self, data_dict): """ 保存数据 :return: """ # 判断保存数据的文件夹是否存在,不存在,就创建 os_path_1 = os.getcwd() + '/数据/' if not os.path.exists(os_path_1): os.mkdir(os_path_1) os_path = os_path_1 + '股票数据.xls' if not os.path.exists(os_path): # 创建新的workbook(其实就是创建新的excel) workbook = xlwt.Workbook(encoding='utf-8') # 创建新的sheet表 worksheet1 = workbook.add_sheet("股票数据", cell_overwrite_ok=True) excel_data_1 = ('股票名称', '昨日单位净值', '昨日累计净值') for i in range(0, len(excel_data_1)): worksheet1.col(i).width = 2560 * 3 # 行,列, 内容, 样式 worksheet1.write(0, i, excel_data_1[i]) workbook.save(os_path) # 判断工作表是否存在 if os.path.exists(os_path): # 打开工作薄 workbook = xlrd.open_workbook(os_path) # 获取工作薄中所有表的个数 sheets = workbook.sheet_names() for i in range(len(sheets)): for name in data_dict.keys(): worksheet = workbook.sheet_by_name(sheets[i]) # 获取工作薄中所有表中的表名与数据名对比 if worksheet.name == name: # 获取表中已存在的行数 rows_old = worksheet.nrows # 将xlrd对象拷贝转化为xlwt对象 new_workbook = copy(workbook) # 获取转化后的工作薄中的第i张表 new_worksheet = new_workbook.get_sheet(i) for num in range(0, len(data_dict[name])): new_worksheet.write(rows_old, num, data_dict[name][num]) new_workbook.save(os_path) def run(self): """ 启动方法 :return: """ self.parse_start_url() if __name__ == '__main__': d = DFSpider() d.run()
五、总结
通过这次的课程设计实验,我对Python又有了进一步的了解,也对Python的爬虫技术有了更熟练的操作,在实验制作过程中也遇到了很多问题,但都通过同学、老师的帮助以及自己上网搜集到的资料从而能够完成此次的实验。
在此次实验中,我发现自己还是有很多的不足,以及对Python学习存在许多盲区,从而让我对Python的学习预发重视。