01. Introduction
Most of the Weibo comment crawlers you find online are written in Python; here I'll do it in Java instead.
First of all, Weibo can be reached from a PC through three different sites:
First: https://m.weibo.cn/
Second: https://weibo.cn/
Third: https://www.weibo.com/
I'll go with the first one.
02. Analysis
1. Getting the comment request URL
Pick any Weibo post, open it, press F12, and filter by XHR in the Network tab; the comment request URL is easy to find. Open it directly
and you can see the comment data right there.
2. Getting the next page's comment URL
Keep scrolling down: Weibo loads comments as an endless waterfall feed, fetching 20 comments per request.
That gives us the URL of the second page of comments. Compare the two URLs:
//Page 1
https://m.weibo.cn/comments/hotflow?id=4618394022707259&mid=4618394022707259&max_id_type=0
//Page 2
https://m.weibo.cn/comments/hotflow?id=4618394022707259&mid=4618394022707259&max_id=14383848249598428&max_id_type=0
Notice that the second page's URL inserts a max_id parameter between the original mid and max_id_type, and that max_id comes from the data returned by the first page's request. That lets us splice together the next page's URL ourselves, as the sketch below shows. The catch is that without logging in, Weibo only serves two pages of data, so we have to log in; I'll do that with a cookie.
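A minimal sketch of how the next-page URL is spliced together (maxId here stands for the max_id taken from the previous response; the full code later wraps this logic in urlAnalyse):
//first page: no max_id parameter; later pages: append the max_id from the previous response
String base = "https://m.weibo.cn/comments/hotflow?id=" + id + "&mid=" + id;
String nextUrl = (maxId == null || maxId.isEmpty())
        ? base + "&max_id_type=0"
        : base + "&max_id=" + maxId + "&max_id_type=0";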
3. Getting the cookie
Simply like the post; the request this triggers carries your cookie, which you can copy straight out of DevTools.
03. Coding
With the preparation done, time to write some code.
First, a method that sends a GET request (this uses the legacy Apache Commons HttpClient 3.x API):
/**
 * Send a GET request
 * @param urlStr
 * @return
 * @throws Exception
 */
public static String sendGet(String urlStr) throws Exception {
    //create an HttpClient instance
    HttpClient httpClient = new HttpClient();
    //set the connection timeout in milliseconds (1000 ms = 1 s)
    httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(10000);
    //create a GET method instance
    GetMethod getMethod = new GetMethod(urlStr);
    //set the GET request timeout in milliseconds
    getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 10000);
    //set the request headers
    getMethod.addRequestHeader(Universal.ACCEPT, ACCEPT);
    getMethod.addRequestHeader(Universal.ACCEPT_LANGUAGE, ACCEPT_LANGUAGE);
    getMethod.addRequestHeader(Universal.CACHE_CONTROL, CACHE_CONTROL);
    getMethod.addRequestHeader(Universal.COOKIE, COOKIE);
    getMethod.addRequestHeader(Universal.SEC_CH_UA, SEC_CH_UA);
    getMethod.addRequestHeader(Universal.SEC_CH_UA_MOBILE, SEC_CH_UA_MOBILE);
    getMethod.addRequestHeader(Universal.SEC_FETCH_DEST, SEC_FETCH_DEST);
    getMethod.addRequestHeader(Universal.SEC_FETCH_MODE, SEC_FETCH_MODE);
    getMethod.addRequestHeader(Universal.SEC_FETCH_SITE, SEC_FETCH_SITE);
    getMethod.addRequestHeader(Universal.UPGRADE_INSECURE_REQUESTS, UPGRADE_INSECURE_REQUESTS);
    getMethod.addRequestHeader(Universal.USER_AGENT, user_agent);
    //execute the GET request
    httpClient.executeMethod(getMethod);
    //read the response body
    String result = getMethod.getResponseBodyAsString();
    //release the connection
    getMethod.releaseConnection();
    return result;
}
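HttpClient, GetMethod, and HttpMethodParams all come from the legacy Commons HttpClient 3.x library (Universal is a small constants class holding the header names, not shown here). Assuming commons-httpclient 3.1 is on the classpath, the imports are:
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;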
To dress things up a bit I pulled the header names out into the Universal constants class; if you can't be bothered, plain strings work just as well. Fill the values below in from the request headers your own browser sends (the cookie especially):
private static String ACCEPT = "accept";
private static String ACCEPT_LANGUAGE = "accept-language";
private static String CACHE_CONTROL = "cache-control";
private static String COOKIE = "replace this with your own cookie";
private static String SEC_CH_UA = "sec-ch-ua";
private static String SEC_CH_UA_MOBILE = "sec-ch-ua-mobile";
private static String SEC_FETCH_DEST = "sec-fetch-dest";
private static String SEC_FETCH_MODE = "sec-fetch-mode";
private static String SEC_FETCH_SITE = "sec-fetch-site";
private static String UPGRADE_INSECURE_REQUESTS = "upgrade-insecure-requests";
private static String user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36";
With that, we can get a post's comments as a JSON string.
Pretty-printed, you can clearly see that the comment data sits in the second-level data field, 20 entries in total.
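An abbreviated sketch of the response shape, keeping only the fields we read below (all values elided):
{
    "data": {
        "max_id": ...,
        "data": [
            { "created_at": "...", "text": "...", "like_count": "...", ... }
        ]
    }
}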
First, grab the first-level data object, along with the max_id we need to splice together the next page's URL:
//parse the returned JSON string into a JSON object
JSONObject jsonObj = JSONObject.parseObject(jsonStr);
//get the first-level data object
JSONObject dataObj = (JSONObject) jsonObj.get("data");
//get the max_id for the next page of comments
String max_id = dataObj.get("max_id").toString();
Then, from the first-level data, take the second-level data (the actual comment entries) and convert it into a JSON array:
//get the comment data
Object json = dataObj.get("data");
String jsonArr = json.toString();
JSONArray jsonArray = JSON.parseArray(jsonArr);
Next, pack the comments into a list:
ArrayList<WeiBoCommentPo> jsonList = new ArrayList<WeiBoCommentPo>();
//iterate over every comment entry
for (Object jsonobj : jsonArray) {
    String value = jsonobj.toString();
    //System.out.println(value);
    JSONObject jsonObject = JSON.parseObject(value);
    WeiBoCommentPo weiBoCommentPo = JSONObject.toJavaObject(jsonObject, WeiBoCommentPo.class);
    jsonList.add(weiBoCommentPo);
}
For this I wrote an entity class. I won't paste all the getters and setters here, they're too long; generate them yourself (a sample pair follows the class).
public class WeiBoCommentPo {
    private String created_at;
    private String id;
    private String rootid;
    private String rootidstr;
    private String floor_number;
    private String text;
    private String disable_reply;
    private String user;
    private String mid;
    private String readtimetype;
    private String comments;
    private String max_id;
    private String total_number;
    private String isLikedByMblogAuthor;
    private String more_info_type;
    private String bid;
    private String source;
    private String like_count;
    private String liked;
}
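Each field gets the standard pair, for example:
public String getText() {
    return text;
}

public void setText(String text) {
    this.text = text;
}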
The complete JSON-parsing code:
/**
 * Parse the Weibo comment JSON
 * @param jsonStr
 * @return
 * @throws Exception
 */
public static List<WeiBoCommentPo> jsonAnalyse(String jsonStr) throws Exception {
    ArrayList<WeiBoCommentPo> jsonList = new ArrayList<WeiBoCommentPo>();
    //parse the returned JSON string into a JSON object
    JSONObject jsonObj = JSONObject.parseObject(jsonStr);
    //get the first-level data object
    JSONObject dataObj = (JSONObject) jsonObj.get("data");
    //get the max_id for the next page of comments
    max_id = dataObj.get("max_id").toString();
    //get the actual comment data
    Object json = dataObj.get("data");
    String jsonArr = json.toString();
    JSONArray jsonArray = JSON.parseArray(jsonArr);
    //iterate over every comment entry
    for (Object jsonobj : jsonArray) {
        String value = jsonobj.toString();
        //System.out.println(value);
        JSONObject jsonObject = JSON.parseObject(value);
        WeiBoCommentPo weiBoCommentPo = JSONObject.toJavaObject(jsonObject, WeiBoCommentPo.class);
        jsonList.add(weiBoCommentPo);
    }
    return jsonList;
}
Note: max_id is kept in a static field purely for convenience; you can of course pass it around differently, and one such alternative is sketched below.
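Just a sketch of that alternative (CommentPage is a hypothetical class, not used in the code below): return the max_id together with the comments instead of stashing it in a static field.
import java.util.List;

//hypothetical result holder: one page of comments plus the max_id for the next page
public class CommentPage {
    private final List<WeiBoCommentPo> comments;
    private final String maxId;

    public CommentPage(List<WeiBoCommentPo> comments, String maxId) {
        this.comments = comments;
        this.maxId = maxId;
    }

    public List<WeiBoCommentPo> getComments() { return comments; }
    public String getMaxId() { return maxId; }
}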
Next, a quick main method to try it out:
public static void main(String[] args) throws Exception {
    String json = sendGet(url); //call the GET method; fill in the url yourself
    //parse the comment JSON we just fetched
    List<WeiBoCommentPo> commentList = jsonAnalyse(json);
    for (int i = 0; i < commentList.size(); i++) {
        String text = commentList.get(i).getText();
    }
}
The text field holds the comment body, so getText is all it takes.
However, some comments embed images (emoji) or links as inline HTML,
so we need to scrub those out. Here's a method that cleans up the comment text:
/**
 * Clean up the comment text
 * @param text
 * @return
 */
public static String strSplit(String text) {
    //if the comment contains an image/emoji tag, cut it off there
    if (text.contains("<span")) {
        String[] split = text.split("<span");
        text = split[0];
    } else if (text.contains("<a")) {
        String[] split = text.split("<a");
        text = split[0];
    }
    //strip special characters
    if (text.contains("\"")) {
        text = text.replace("\"", "");
    }
    return text;
}
With that we get the comment as a plain string.
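Note that splitting on the first tag discards everything after it. If you'd rather keep the surrounding text and drop only the HTML itself, a regex-based variant works too; this is an alternative sketch, not what the final code below uses:
public static String stripTags(String text) {
    //remove every <...> tag but keep the text around it, then strip quotes
    return text.replaceAll("<[^>]+>", "").replace("\"", "");
}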
Finally, let's polish it up a little:
public static void main(String[] args) throws Exception {
    boolean flag = true;
    String url = "";
    while (flag) {
        System.out.println("Enter the post url:");
        Scanner scUrl = new Scanner(System.in);
        url = scUrl.nextLine();
        if (url.contains("detail/")) {
            String[] split = url.split("detail/");
            id = split[1];
            flag = false;
        } else {
            System.err.println("Invalid url format");
        }
    }
    Integer page = 0;
    Integer count = 1;
    Scanner scPage = new Scanner(System.in);
    System.out.println("How many pages to fetch (20 comments per page):");
    page = scPage.nextInt();
    System.out.println("===========Starting==========");
    while (count <= page) {
        url = urlAnalyse();
        String json = sendGet(url);
        List<WeiBoCommentPo> commentList = jsonAnalyse(json);
        System.out.println("==========Printing page " + count + "==========");
        System.out.println("Page " + count + " url: " + url);
        for (int i = 0; i < commentList.size(); i++) {
            String text = commentList.get(i).getText();
            //clean up the comment text
            System.out.println(i + 1 + "." + strSplit(text));
        }
        count++;
    }
    System.out.println("==========Done==========");
}
Now we just paste in the URL of the post we want to crawl and type how many pages to fetch. Very convenient.
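A sample run might look like this (comment text elided):
Enter the post url:
https://m.weibo.cn/detail/4618394022707259
How many pages to fetch (20 comments per page):
1
===========Starting==========
==========Printing page 1==========
Page 1 url: https://m.weibo.cn/comments/hotflow?id=4618394022707259&mid=4618394022707259&max_id_type=0
1.[comment text]
2.[comment text]
...
==========Done==========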
To finish, the complete code:
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;
import org.apache.commons.lang3.StringUtils;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

//assumes the Universal header-name constants and the WeiBoCommentPo entity from earlier are in the same package
public class Spider {
    private static String ACCEPT = "accept";
    private static String ACCEPT_LANGUAGE = "accept-language";
    private static String ACCEPT_ENCODING = "accept-encoding";
    private static String CACHE_CONTROL = "cache-control";
    private static String COOKIE = "replace this with your own cookie";
    private static String SEC_CH_UA = "sec-ch-ua";
    private static String SEC_CH_UA_MOBILE = "sec-ch-ua-mobile";
    private static String SEC_FETCH_DEST = "sec-fetch-dest";
    private static String SEC_FETCH_MODE = "sec-fetch-mode";
    private static String SEC_FETCH_SITE = "sec-fetch-site";
    private static String UPGRADE_INSECURE_REQUESTS = "upgrade-insecure-requests";
    private static String user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36";
    private static String max_id = "";
    private static String id = "";
    public static void main(String[] args) throws Exception {
        boolean flag = true;
        String url = "";
        while (flag) {
            System.out.println("Enter the post url:");
            Scanner scUrl = new Scanner(System.in);
            url = scUrl.nextLine();
            if (url.contains("detail/")) {
                String[] split = url.split("detail/");
                id = split[1];
                flag = false;
            } else {
                System.err.println("Invalid url format");
            }
        }
        Integer page = 0;
        Integer count = 1;
        Scanner scPage = new Scanner(System.in);
        System.out.println("How many pages to fetch (20 comments per page):");
        page = scPage.nextInt();
        System.out.println("===========Starting==========");
        while (count <= page) {
            url = urlAnalyse();
            String json = sendGet(url);
            List<WeiBoCommentPo> commentList = jsonAnalyse(json);
            System.out.println("==========Printing page " + count + "==========");
            System.out.println("Page " + count + " url: " + url);
            for (int i = 0; i < commentList.size(); i++) {
                String text = commentList.get(i).getText();
                //clean up the comment text
                System.out.println(i + 1 + "." + strSplit(text));
            }
            count++;
        }
        System.out.println("==========Done==========");
    }
    /**
     * Clean up the comment text
     * @param text
     * @return
     */
    public static String strSplit(String text) {
        //if the comment contains an image/emoji tag, cut it off there
        if (text.contains("<span")) {
            String[] split = text.split("<span");
            text = split[0];
        } else if (text.contains("<a")) {
            String[] split = text.split("<a");
            text = split[0];
        }
        //strip special characters
        if (text.contains("\"")) {
            text = text.replace("\"", "");
        }
        return text;
    }
    /**
     * Build the request url
     * @return
     */
    public static String urlAnalyse() {
        String url = "https://m.weibo.cn/comments/hotflow?id=" + id + "&mid=" + id + "&max_id_type=0";
        if (StringUtils.isNotEmpty(max_id)) {
            url = "https://m.weibo.cn/comments/hotflow?id=" + id + "&mid=" + id + "&max_id=" + max_id + "&max_id_type=0";
        }
        return url;
    }
    /**
     * Parse the Weibo comment JSON
     * @param jsonStr
     * @return
     * @throws Exception
     */
    public static List<WeiBoCommentPo> jsonAnalyse(String jsonStr) throws Exception {
        ArrayList<WeiBoCommentPo> jsonList = new ArrayList<WeiBoCommentPo>();
        //parse the returned JSON string into a JSON object
        JSONObject jsonObj = JSONObject.parseObject(jsonStr);
        //get the first-level data object
        JSONObject dataObj = (JSONObject) jsonObj.get("data");
        //get the max_id for the next page of comments
        max_id = dataObj.get("max_id").toString();
        //get the actual comment data
        Object json = dataObj.get("data");
        String jsonArr = json.toString();
        JSONArray jsonArray = JSON.parseArray(jsonArr);
        //iterate over every comment entry
        for (Object jsonobj : jsonArray) {
            String value = jsonobj.toString();
            //System.out.println(value);
            JSONObject jsonObject = JSON.parseObject(value);
            WeiBoCommentPo weiBoCommentPo = JSONObject.toJavaObject(jsonObject, WeiBoCommentPo.class);
            jsonList.add(weiBoCommentPo);
        }
        return jsonList;
    }
    /**
     * Send a GET request
     * @param urlStr
     * @return
     * @throws Exception
     */
    public static String sendGet(String urlStr) throws Exception {
        //create an HttpClient instance
        HttpClient httpClient = new HttpClient();
        //set the connection timeout in milliseconds (1000 ms = 1 s)
        httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(10000);
        //create a GET method instance
        GetMethod getMethod = new GetMethod(urlStr);
        //set the GET request timeout in milliseconds
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 10000);
        //set the request headers
        getMethod.addRequestHeader(Universal.ACCEPT, ACCEPT);
        getMethod.addRequestHeader(Universal.ACCEPT_LANGUAGE, ACCEPT_LANGUAGE);
        getMethod.addRequestHeader(Universal.CACHE_CONTROL, CACHE_CONTROL);
        getMethod.addRequestHeader(Universal.COOKIE, COOKIE);
        getMethod.addRequestHeader(Universal.SEC_CH_UA, SEC_CH_UA);
        getMethod.addRequestHeader(Universal.SEC_CH_UA_MOBILE, SEC_CH_UA_MOBILE);
        getMethod.addRequestHeader(Universal.SEC_FETCH_DEST, SEC_FETCH_DEST);
        getMethod.addRequestHeader(Universal.SEC_FETCH_MODE, SEC_FETCH_MODE);
        getMethod.addRequestHeader(Universal.SEC_FETCH_SITE, SEC_FETCH_SITE);
        getMethod.addRequestHeader(Universal.UPGRADE_INSECURE_REQUESTS, UPGRADE_INSECURE_REQUESTS);
        getMethod.addRequestHeader(Universal.USER_AGENT, user_agent);
        //execute the GET request
        httpClient.executeMethod(getMethod);
        //read the response body
        String result = getMethod.getResponseBodyAsString();
        //release the connection
        getMethod.releaseConnection();
        return result;
    }
}
04. Summary
To wrap things up: there's nothing especially difficult here overall. I actually finished the code long before writing this article. The main hurdle is the simulated login, which is where the cookie comes in; I find that approach convenient enough, and while there are certainly other ways, I won't dig into them here.