Preface
The recent gossip has been big and juicy. As a front-row onlooker I naturally need my own way of keeping up: build a real-time Weibo hot search platform, so that not a single story slips by.
Table of Contents
Results preview
Simple architecture diagram of the real-time platform
1. Locating the data to scrape
The Weibo hot search list is laid out very regularly: everything sits inside a single table tag, which makes it easy to write XPath rules for it. All we need to grab are the hot search title, its link, and its heat value.
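To make the XPath rules in the next section easier to follow, here is a minimal sketch of the kind of markup we are targeting. The HTML fragment is simplified and purely illustrative (the real page has more attributes and nesting, but uses the same pl_top_realtimehot container id); the three XPath expressions mirror the ones used in the crawler below.

from lxml import etree

# Simplified, illustrative markup: the real hot search page wraps a similar
# table inside a container with id="pl_top_realtimehot".
sample_html = """
<div id="pl_top_realtimehot">
  <table>
    <tbody>
      <tr>
        <td class="td-01">1</td>
        <td class="td-02"><a href="/weibo?q=some_topic">some hot topic</a> <span>1234567</span></td>
      </tr>
    </tbody>
  </table>
</div>
"""

html = etree.HTML(sample_html)
titles = html.xpath("//*[@id='pl_top_realtimehot']/table/tbody/tr/td[2]/a/text()")   # hot search titles
urls = html.xpath("//*[@id='pl_top_realtimehot']/table/tbody/tr/td[2]/a/@href")      # relative links
hots = html.xpath("//*[@id='pl_top_realtimehot']/table/tbody/tr/td[2]/span/text()")  # heat values
print(titles, urls, hots)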
2. Writing the Python crawler script
This crawler uses the requests library to send the request and fetch the page, then parses the data with lxml's etree and XPath. I wrapped everything into functions so it is easier to reuse, and the XPath rules are already written in. I also put it on a timer so it scrapes again every few seconds; feel free to change the interval. Weibo's anti-crawling measures on the hot search page aren't very strict, but don't set the interval too small, or you might get blocked by mistake.
import requests
import time
from lxml import etree

def getHTMLText(url):
    """ Fetch the Weibo page """
    try:
        # Remember to send a User-Agent header
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
        r = requests.get(url, headers=headers)
        # raise_for_status() throws an HTTPError if the status code is not 200
        r.raise_for_status()
        # Fix the encoding
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "Request failed"

if __name__ == "__main__":
    url = "https://s.weibo.com/top/summary?cate=realtimehot"
    # Scheduled task: scrape every 5 seconds
    while True:
        web_data = getHTMLText(url)
        html = etree.HTML(web_data)
        # Hot search titles
        html_title = html.xpath("//*[@id='pl_top_realtimehot']/table/tbody/tr/td[2]/a/text()")
        # Hot search links
        html_url = html.xpath("//*[@id='pl_top_realtimehot']/table/tbody/tr/td[2]/a/@href")
        # Heat values
        html_hot = html.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]/span/text()')
        # Print to the console for a quick check
        # (i + 1 skips the pinned entry, which has a title and link but no heat value)
        for i in range(len(html_hot)):
            print(html_title[i + 1], " ", r'https://s.weibo.com' + html_url[i + 1], " ", html_hot[i])
        time.sleep(5)
Let's look at the scraping results:
The results are pretty amazing and exactly what we expected. In practice we only need the top 25 hot searches, otherwise the visualization page can't quite fit them all.
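If you want to trim the results down to the top 25 before displaying them, a small helper like the sketch below will do. It is only an illustration: top_records is a hypothetical helper name, and it assumes the three lists returned by the XPath calls above, where the title and link lists have one extra leading element for the pinned entry (hence the i + 1 offset).

def top_records(titles, urls, hots, top_n=25):
    """Pair the three XPath result lists and keep only the first top_n entries.

    titles and urls have one extra leading element (the pinned item without a
    heat value), which is why the index is offset by 1.
    """
    records = []
    for i in range(min(top_n, len(hots))):
        records.append((titles[i + 1], 'https://s.weibo.com' + urls[i + 1], hots[i]))
    return records

# Example usage: top25 = top_records(html_title, html_url, html_hot)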
3. Storing the data in MySQL
Since keeping this real-time data platform running forever on my local machine isn't really an option, I decided to run the Python script on a server and store the scraped data in the server's database. PHP then responds to the front end's Ajax requests and returns the data for visualization. That way we can open the URL and see this little project at any time. Deploying it on a server isn't difficult, so I won't go into the details here.
I went with MySQL this time because it's what I'm most comfortable with; you could just as well store the data in MongoDB or another database.
A simple table design:
CREATE TABLE `hot_list` (
`uid` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
`scores` varchar(20) DEFAULT NULL,
PRIMARY KEY (`uid`)
) ENGINE=InnoDB AUTO_INCREMENT=13970 DEFAULT CHARSET=utf8;
To use MySQL from Python, you need to pip install the pymysql module, and remember to fill in your own database configuration.
import requests
import pymysql
import time
from lxml import etree

""" Scrape the Weibo hot search list """
def getHTMLText(url):
    """ Fetch the Weibo page """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
        }
        r = requests.get(url, headers=headers)
        # raise_for_status() throws an HTTPError if the status code is not 200
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "Request failed"

def insert(value):
    """ Insert one record into MySQL """
    sql = "INSERT INTO hot_list(name,url,scores) values(%s,%s,%s)"
    try:
        cursor.execute(sql, value)
        db.commit()
        print('Insert succeeded')
    except:
        db.rollback()
        print("Insert failed")
def delete(mycursor):
    """ Clear the existing rows in MySQL """
    sql = "DELETE FROM hot_list"
    mycursor.execute(sql)

if __name__ == "__main__":
    # Database configuration
    dbConfig = {
        'host': 'localhost',
        'user': 'root',
        'password': 'root',
        'db': 'data'
    }
    url = "https://s.weibo.com/top/summary?cate=realtimehot"
    # Connect to MySQL
    db = pymysql.connect(host=dbConfig['host'], user=dbConfig['user'], password=dbConfig['password'], port=3306, db=dbConfig['db'])
    cursor = db.cursor()
    # Scheduled task: scrape every 5 seconds
    while True:
        web_data = getHTMLText(url)
        # print(web_data)
        html = etree.HTML(web_data)
        # Hot search titles
        html_title = html.xpath("//*[@id='pl_top_realtimehot']/table/tbody/tr/td[2]/a/text()")
        # Hot search links
        html_url = html.xpath("//*[@id='pl_top_realtimehot']/table/tbody/tr/td[2]/a/@href")
        # Heat values
        html_hot = html.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]/span/text()')
        # Delete the old rows first (the DELETE is committed together with the inserts below)
        delete(cursor)
        for i in range(len(html_hot)):
            # i + 1 skips the pinned entry, which has a title and link but no heat value
            print(html_title[i + 1], " ", r'https://s.weibo.com' + html_url[i + 1], " ", html_hot[i])
            value = (html_title[i + 1], r'https://s.weibo.com' + html_url[i + 1], html_hot[i])
            insert(value)
        time.sleep(5)
    db.close()
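Before moving on to the PHP side, it can help to confirm that rows are actually landing in the table. Here is a minimal verification sketch, assuming the same database settings as the script above; adjust them to your own setup.

import pymysql

# Read back the latest rows to verify the inserts (same settings as above).
db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='data')
cursor = db.cursor()
cursor.execute("SELECT name, url, scores FROM hot_list ORDER BY uid ASC LIMIT 25")
for name, link, scores in cursor.fetchall():
    print(name, link, scores)
db.close()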
4. Writing a simple server-side PHP script
Python could also listen for requests on the server, but I'd rather let each part do its own job and have PHP receive the requests.
There is still room for optimization here: since no connection pooling is used, PHP keeps connecting to and disconnecting from MySQL on every request, which is fairly wasteful. Adding a connection pool (or persistent connections) yourself would be better.
<?php
header("Access-Control-Allow-Origin: *");

function getData()
{
    $servername = "localhost";
    $username = "root";
    $password = "root";
    $db_name = "data";
    // Create the connection; a pooled/persistent connection would be better,
    // e.g. mysql_pconnect("localhost", "root", "123456"); in the legacy mysql extension
    $conn = new mysqli($servername, $username, $password, $db_name);
    // Check the connection
    if ($conn->connect_error)
    {
        die("Connection failed: " . $conn->connect_error);
    }
    $sql = "select * from hot_list ORDER BY uid asc limit 25";
    $result = $conn->query($sql);
    $res = array();
    // The first row is the dataset header expected by ECharts
    $res[] = [0 => "score", 1 => "热度", 2 => "titleName"];
    if ($result->num_rows > 0) {
        header("Content-Type: application/json; charset=UTF-8");
        while ($row = $result->fetch_array()) {
            // Reshape each database row into the format the front end expects
            $temp = array();
            $score = ($row['scores'] / 100000) * rand(1, 2);
            if ($score < 10) {
                $temp[] = ($row['scores'] / 100000) * rand(2, 3);
            } else {
                $temp[] = $score;
            }
            $temp[] = $row['scores'];
            $temp[] = $row['name'];
            $res[] = $temp;
        }
    }
    $json = json_encode($res, JSON_UNESCAPED_UNICODE);
    $conn->close();
    echo $json;
}

getData();
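Once the script is deployed, it is worth sanity-checking the JSON it returns before wiring up the front end. Here is a minimal sketch using requests, assuming the script is exposed at http://localhost:8012/crawlServer.php (the same URL the front-end code below requests); point it at wherever you actually host the script.

import requests

# Quick check of the PHP endpoint; the URL matches the one used by the front end below.
resp = requests.get("http://localhost:8012/crawlServer.php")
resp.raise_for_status()
data = resp.json()
# The first row should be the header ["score", "热度", "titleName"],
# followed by up to 25 rows of [score, heat, title].
for row in data[:5]:
    print(row)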
5. Front end: HTML5 + ECharts visualization
The front end simply uses jQuery to send the Ajax request and ECharts to visualize the data. Both libraries are imported from a CDN, which makes them easier to use.
If you want to build other kinds of visualizations, take a look around the official ECharts site.
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>微博热搜实时平台</title>
    <!-- Import echarts.js and jquery.js -->
    <script src="https://cdn.bootcss.com/echarts/4.2.1-rc1/echarts.min.js"></script>
    <script src="https://apps.bdimg.com/libs/jquery/2.1.4/jquery.min.js"></script>
    <style type="text/css">
      .bg {
        display: flex;
        justify-content: center;
        align-items: center;
      }
    </style>
  </head>
  <body>
    <!-- Prepare a DOM element with a defined width and height for ECharts -->
    <div class="bg">
      <div id="main" style="width: 1200px; height: 850px"></div>
    </div>
    <script type="text/javascript">
      // Initialize the echarts instance on the prepared DOM element
      var myChart = echarts.init(document.getElementById("main"));
      // Chart configuration and placeholder data
      var option = {
        dataset: {
          // The scraped data goes here
          source: [
            // Keep the first row: it is the dataset header
            ["score", "热度", "titleName"],
            [89.3, 58212, "Matcha Latte"],
            [57.1, 78254, "Milk Tea"],
            [74.4, 41032, "Cheese Cocoa"],
            [50.1, 12755, "Cheese Brownie"],
            [89.7, 20145, "Matcha Cocoa"],
            [68.1, 79146, "Tea"],
          ],
        },
        title: {
          text: "微博实时热搜平台",
        },
        grid: { containLabel: true },
        xAxis: { name: "热度" },
        yAxis: { type: "category", inverse: true },
        visualMap: {
          orient: "horizontal",
          left: "center",
          min: 10,
          max: 100,
          text: ["High Score", "Low Score"],
          // Map the score column to color
          dimension: 0,
          inRange: {
            color: ["#65B581", "#FFCE34", "#FD665F"],
          },
        },
        series: [
          {
            type: "bar",
            encode: {
              // Map the "热度" column to the X axis
              x: "热度",
              // Map the "titleName" column to the Y axis
              y: "titleName",
            },
          },
        ],
      };
      // Render the placeholder data first
      myChart.setOption(option);
      function changeData() {
        let url = "http://localhost:8012/crawlServer.php";
        $.get(url, {}, function (data, status) {
          console.log(data, status);
          if (status == "success") {
            option.dataset.source = data;
            // Refresh the chart only after the new data has arrived
            myChart.setOption(option);
          }
        });
      }
      changeData();
      setInterval(changeData, 3000);
    </script>
  </body>
</html>
6. Final words
Gossip is great, but don't binge on it~
The source code for this little project has been uploaded to GitHub; if you're interested, feel free to pull it locally and play around with it.
If you think it's not bad, consider giving it a star~ (thanks in advance!)