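[弹幕词云姬 (Danmaku Word Cloud)] A small tool a hardcore Bilibili uploader hacked together in a week: generating word clouds from Bilibili danmaku (Part 1)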

2024-01-09 14:07:28

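Last week I had a nice idea: use Bilibili danmaku (the comments that scroll across the video) to generate a word cloud. So I worked on it over the week, about ten hours in total, and built this brand-new little tool, 弹幕词云姬 (Danmaku Word Cloud). It is available at http://danmu.xiezuoguan.cn/
See also the video I uploaded to Bilibili: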

「弹幕词云姬」 A hardcore uploader spent a week building a tool that generates word clouds from danmaku

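First, a look at the result. The main screen looks like this: at the top are two input boxes for the av number and the cid that identify the video, in the middle you pick a background image, and at the bottom is a list of what people have submitted recently.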

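This is what happens when you enter an av number and click query. On success the page reports that parsing succeeded and fills in the cid automatically (what a cid is will be explained below).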

This is the final generated word cloud.

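With the demo out of the way, let's go briefly through the technical side.
Backend development:
The core consists of three parts: first, how to fetch the danmaku; second, how to segment the text into words; third, how to generate the word cloud.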

Part 1, fetching the danmaku:

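A quick Baidu search turns up what others have already worked out: the danmaku are looked up by cid, and requesting 'http://comment.bilibili.com/' + str(cid) + '.xml' returns them. So where does the cid come from?
Open a video page and you can simply Ctrl+F for "cid" in the loaded HTML. But we are here for the technical side, so we match it with a regular expression instead. The function looks like this: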

import re

import requests
from bs4 import BeautifulSoup

# Look up the cid for a given av number
def get_cid_from_av(av):
    url = 'http://www.bilibili.com/video/av'+str(av)
    #print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.103 Safari/537.36'}
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    html = response.text
    #print(html)
    # Use try because some av numbers have no video behind them
    try:
        soup = BeautifulSoup(html, 'lxml')
        # Video title
        title = soup.select('meta[name="title"]')[0]['content']
        # Uploader
        author = soup.select('meta[name="author"]')[0]['content']
        # The id used to fetch the danmaku (the cid)
        danmu_id = re.findall(r'cid=(\d+)&', html)[0]
        #print(title, author)
        #return danmu_id
        # Record the submission in a file (write_last_commit is defined elsewhere in the project)
        write_last_commit(title, author)
        return {'status': 'ok','title': title, 'author': author, 'cid': danmu_id}
    except Exception:
        # Missing meta tags or no cid match end up here
        print('The video is gone')
    return {'status': 'no'}
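
To make the regex step concrete, here is a minimal sketch that runs the same pattern against a made-up HTML fragment instead of a live page. The helper name `extract_cid` and the sample string are assumptions for illustration only; the real page markup may look different and can change over time.

```python
import re

# Same pattern as in get_cid_from_av: the digits between "cid=" and "&".
CID_PATTERN = re.compile(r'cid=(\d+)&')

def extract_cid(html):
    """Return the first cid found in the page source, or None if there is none."""
    match = CID_PATTERN.search(html)
    return match.group(1) if match else None

# Hypothetical fragment, not copied from a real Bilibili page.
sample_html = '<script>var player = "cid=12345678&aid=170001";</script>'
print(extract_cid(sample_html))
print(extract_cid('<html>no video here</html>'))
```

Returning None instead of indexing `[0]` on an empty list makes the "no match" case explicit, so it does not have to be caught as an exception.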

Once we have the cid we can fetch the danmaku, which will later be segmented with the Python library jieba.

import requests
from bs4 import BeautifulSoup

# Fetch the danmaku
def get_danmu(cid):
    url = 'http://comment.bilibili.com/'+str(cid)+'.xml'
    # print(url)
    req = requests.get(url)
    html = req.content
    html_doc=str(html,'utf-8')   # decode the raw bytes as utf-8
    
    # Parse
    soup = BeautifulSoup(html_doc,"lxml")
    
    # Each danmaku comment is the text of a <d> element
    results = soup.find_all('d')
    
    contents = [x.text for x in results]
    # print(contents)
    return contents
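
As a cross-check of the parsing step, the same extraction can be sketched with the standard library's `xml.etree.ElementTree` on an inline sample, so it runs without network access. The function name `parse_danmu_xml` and the sample document (including the `p` attribute values) are made up for illustration; only the idea that each comment is the text of a `<d>` element comes from the code above.

```python
import xml.etree.ElementTree as ET

def parse_danmu_xml(xml_bytes):
    """Return the text of every non-empty <d> element in a danmaku XML document."""
    root = ET.fromstring(xml_bytes)
    return [d.text for d in root.iter('d') if d.text]

# Hypothetical sample standing in for the response body of the .xml URL.
sample_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<i>'
    '<d p="1.0,1,25,16777215">前方高能</d>'
    '<d p="2.5,1,25,16777215">233333</d>'
    '<d p="3.0,1,25,16777215"></d>'
    '</i>'
).encode('utf-8')

print(parse_danmu_xml(sample_xml))
```

Passing bytes (like `req.content`) lets the parser honour the encoding declared in the XML header, and empty comments are skipped rather than returned as None.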

That completes Part 1.

Part 2, word segmentation:

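import os

import jieba

# Segment the danmaku into words and count them
def danmu_cut(data):
    word_frequency = dict()
    # Load the stop words (a set makes the membership test below fast)
    stop_word = set()
    # Resolve the path relative to the current working directory
    module_path = os.path.abspath(os.curdir)
    file_path = os.path.join(module_path,"stopWordList.txt")
    with open(file_path,'r', encoding='UTF-8' ) as f:
        for line in f:
            stop_word.add(line.strip())
    # jieba.cut expects one string, so join a list of comments first
    if not isinstance(data, str):
        data = ' '.join(data)
    # Segment
    words = jieba.cut(data)
    # Count word frequencies, skipping stop words and whitespace-only tokens
    for word in words:
        if word.strip() and word not in stop_word:
            word_frequency[word] = word_frequency.get(word, 0) + 1
    return word_frequency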
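
The counting half of this step can be tried without jieba or a stop-word file. The sketch below assumes the text has already been segmented into tokens (in the real flow they would come from `jieba.cut`); the name `count_words` and the sample tokens are hypothetical.

```python
from collections import Counter

def count_words(tokens, stop_words):
    """Count tokens, skipping stop words and whitespace-only tokens."""
    stop_set = set(stop_words)  # set lookup is O(1) per token
    counts = Counter()
    for token in tokens:
        word = token.strip()
        if word and word not in stop_set:
            counts[word] += 1
    return counts

# Hypothetical, already-segmented tokens.
tokens = ['前方', '高能', '的', '前方', ' ', '哈哈']
freq = count_words(tokens, stop_words=['的'])
print(freq.most_common(1))
```

A `Counter` is a dict subclass, so it can be handed to code that expects a plain word-to-count mapping.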

Finally, Part 3, generating the word cloud:

This part uses the wordcloud library. A Baidu search turns up plenty of ready-made code, which works after a few tweaks.

import os

import numpy as np
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

# Word cloud rendering for the web service
def draw_wordcloud_for_web(words_dict,cid,mask_pic):
    # Background (mask) image
    # Build absolute paths from the current working directory
    module_path = os.path.abspath(os.curdir)
    #filename = module_path + "\\bilibili\\"+mask_pic
    input_file = os.path.join(module_path,'mask_pic',mask_pic)
    mask = np.array(Image.open(input_file))
    font_file = os.path.join(module_path,'simhei.ttf')
    print(font_file)
    wc = WordCloud(
        font_path=font_file,  # font that can render Chinese characters
        mask=mask,
        max_words=200,
        max_font_size=60,
        min_font_size=10,
        random_state=50,
    )
    wc.generate_from_frequencies(words_dict)
    # Recolor the words using the colors of the mask image
    image_colors = ImageColorGenerator(mask)
    wc.recolor(color_func=image_colors)
    # return wc.to_image()
    # Output file name
    out_file_name = str(cid)+'.jpg'
    out_file = os.path.join(module_path,'out_put',out_file_name)
    wc.to_file(out_file)
    return out_file
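
The post does not show how the three steps are wired together, so the following is only a sketch of one possible glue function. The name `build_wordcloud` is hypothetical, and the four steps are passed in as parameters so the flow can be read (and tried with stand-ins) without network access or the third-party libraries.

```python
def build_wordcloud(av, mask_pic, get_cid, get_danmu, cut, draw):
    """Chain the steps: av number -> cid -> danmaku -> word counts -> image path.

    Returns the path reported by `draw`, or None if the video cannot be
    resolved or no words are left after segmentation.
    """
    info = get_cid(av)
    if info.get('status') != 'ok':
        return None
    comments = get_danmu(info['cid'])
    # Segmentation works on one string, so join the comments first.
    frequencies = cut(' '.join(comments))
    if not frequencies:
        return None
    return draw(frequencies, info['cid'], mask_pic)

# With the functions from this post, a call might look like this
# (the av number and mask file name are placeholders):
# build_wordcloud(170001, 'mask.png', get_cid_from_av, get_danmu,
#                 danmu_cut, draw_wordcloud_for_web)
```

Checking for an empty frequency dict before drawing avoids asking the renderer to lay out zero words.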

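That wraps up the Python scripting part (to be honest, this part took me less than two hours).
The next installment covers how to use fastapi to expose all of this as an API service over an HTTP server.
The one after that briefly covers building the front end with vue.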

Author: 半世浮华殆尽
