使用NLP创建摘要

2024-03-24 10:43:22

作者|Louis Teo
编译|VK
来源|Towards Data Science

你有没有读过很多的报告，而你只想对每个报告做一个快速的总结摘要？你是否曾经遇到过这样的情况？

摘要已成为21世纪解决数据问题的一种非常有帮助的方法。在本篇文章中，我将向你展示如何使用Python中的自然语言处理（NLP）创建个人文本摘要生成器。

前言：个人文本摘要器不难创建——初学者可以轻松做到！

什么是文本摘要

基本上，在保持关键信息的同时，生成准确的摘要，而不失去整体意义，这是一项任务。

摘要有两种一般类型：

抽象摘要>>从原文中生成新句子。
提取摘要>>识别重要句子，并使用这些句子创建摘要。

应该使用哪种总结方法

我使用提取摘要，因为我可以将此方法应用于许多文档，而不必执行大量（令人畏惧）的机器学习模型训练任务。

此外，提取摘要法比抽象摘要具有更好的总结效果，因为抽象摘要必须从原文中生成新的句子，这是一种比数据驱动的方法提取重要句子更困难的方法。

如何创建自己的文本摘要器

我们将使用单词直方图来对句子的重要性进行排序，然后创建一个总结。这样做的好处是，你不需要训练你的模型来将其用于文档。

文本摘要工作流

下面是我们将要遵循的工作流…

导入文本>>>>清理文本并拆分成句子>>删除停用词>>构建单词直方图>>排名句子>>选择前N个句子进行提取摘要

（1）示例文本

我用了一篇新闻文章的文本，标题是苹果以5000万美元收购AI初创公司，以推进其应用程序。你可以在这里找到原始的新闻文章：https://analyticsindiamag.com/apple-acquires-ai-startup-for-50-million-to-advance-its-apps/

你还可以从我的Github下载文本文档：https://github.com/louisteo9/personal-text-summarizer

（2）导入库

# 自然语言工具包（NLTK）
import nltk
nltk.download('stopwords')

# 文本预处理的正则表达式
import re

# 队列算法求首句
import heapq

# 数值计算的NumPy
import numpy as np

# 用于创建数据帧的pandas
import pandas as pd

# matplotlib绘图
from matplotlib import pyplot as plt
%matplotlib inline

（3）导入文本并执行预处理

有很多方法可以做到。这里的目标是有一个干净的文本，我们可以输入到我们的模型中。

# 加载文本文件
with open('Apple_Acquires_AI_Startup.txt', 'r') as f:
    file_data = f.read()

这里，我们使用正则表达式来进行文本预处理。我们将

（A）用空格（如果有的话…）替换参考编号，即[1]、[10]、[20]，

（B）用单个空格替换一个或多个空格。

text = file_data
# 如果有，请用空格替换
text = re.sub(r'\[[0-9]*\]',' ',text) 

# 用单个空格替换一个或多个空格
text = re.sub(r'\s+',' ',text)

然后，我们用小写（不带特殊字符、数字和额外空格）形成一个干净的文本，并将其分割成单个单词，用于词组分数计算和构词直方图。

形成一个干净文本的原因是，算法不会把“理解”和“理解”作为两个不同的词来处理。

# 将所有大写字符转换为小写字符
clean_text = text.lower()

# 用空格替换[a-zA-Z0-9]以外的字符
clean_text = re.sub(r'\W',' ',clean_text) 

# 用空格替换数字
clean_text = re.sub(r'\d',' ',clean_text) 

# 用单个空格替换一个或多个空格
clean_text = re.sub(r'\s+',' ',clean_text)

（4）将文本拆分为句子

我们使用NLTK sent_tokenize方法将文本拆分为句子。我们将评估每一句话的重要性，然后决定是否应该将每一句都包含在总结中。

sentences = nltk.sent_tokenize(text)

（5）删除停用词

停用词是指不给句子增加太多意义的英语单词。他们可以安全地被忽略，而不牺牲句子的意义。我们已经下载了一个文件，其中包含英文停用词

这里，我们将得到停用词的列表，并将它们存储在stop_word 变量中。

# 获取停用词列表
stop_words = nltk.corpus.stopwords.words('english')

（6）构建直方图

让我们根据每个单词在整个文本中出现的次数来评估每个单词的重要性。

我们将通过（1）将单词拆分为干净的文本，（2）删除停用词，然后（3）检查文本中每个单词的频率。

# 创建空字典以容纳单词计数
word_count = {}

# 循环遍历标记化的单词，删除停用单词并将单词计数保存到字典中
for word in nltk.word_tokenize(clean_text):
    # remove stop words
    if word not in stop_words:
        # 将字数保存到词典
        if word not in word_count.keys():
            word_count[word] = 1
        else:
            word_count[word] += 1

让我们绘制单词直方图并查看结果。

plt.figure(figsize=(16,10))
plt.xticks(rotation = 90)
plt.bar(word_count.keys(), word_count.values())
plt.show()

让我们把它转换成横条图，只显示前20个单词，下面有一个helper函数。

# helper 函数，用于绘制最上面的单词。
def plot_top_words(word_count_dict, show_top_n=20):
    word_count_table = pd.DataFrame.from_dict(word_count_dict, orient = 'index').rename(columns={0: 'score'})
    
    word_count_table.sort_values(by='score').tail(show_top_n).plot(kind='barh', figsize=(10,10))
    plt.show()

让我们展示前20个单词。

plot_top_words(word_count, 20)

从上面的图中，我们可以看到“ai”和“apple”两个词出现在顶部。这是有道理的，因为这篇文章是关于苹果收购一家人工智能初创公司的。

（7）根据分数排列句子

现在，我们将根据句子得分对每个句子的重要性进行排序。我们将：

删除超过30个单词的句子，认识到长句未必总是有意义的；
然后，从构成句子的每个单词中加上分数，形成句子分数。

高分的句子将排在前面。前面的句子将形成我们的总结。

注意：根据我的经验，任何25到30个单词都可以给你一个很好的总结。

# 创建空字典来存储句子分数
sentence_score = {}

# 循环通过标记化的句子，只取少于30个单词的句子，然后加上单词分数来形成句子分数
for sentence in sentences:
    # 检查句子中的单词是否在字数字典中
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word_count.keys():
            # 只接受少于30个单词的句子
            if len(sentence.split(' ')) < 30:
                # 把单词分数加到句子分数上
                if sentence not in sentence_score.keys():
                    sentence_score[sentence] = word_count[word]
                else:
                    sentence_score[sentence] += word_count[word]

我们将句子-分数字典转换成一个数据框，并显示sentence_score。

注意：字典不允许根据分数对句子进行排序，因此需要将字典中存储的数据转换为DataFrame。

df_sentence_score = pd.DataFrame.from_dict(sentence_score, orient = 'index').rename(columns={0: 'score'})
df_sentence_score.sort_values(by='score', ascending = False)

（8）选择前面的句子作为摘要

我们使用堆队列算法来选择前3个句子，并将它们存储在best_quences变量中。

通常3-5句话就足够了。根据文档的长度，可以随意更改要显示的最上面的句子数。

在本例中，我选择了3，因为我们的文本相对较短。

# 展示最好的三句话作为总结         
best_sentences = heapq.nlargest(3, sentence_score, key=sentence_score.get)

让我们使用print和for loop函数显示摘要文本。

print('SUMMARY')
print('------------------------')

# 根据原文中的句子顺序显示最上面的句子
for sentence in sentences:
    if sentence in best_sentences:
        print (sentence)

这是到我的Github的链接以获取Jupyter笔记本。你还将找到一个可执行的Python文件，你可以立即使用它来总结你的文本：https://github.com/louisteo9/personal-text-summarizer

让我们看看算法的实际操作！

以下是一篇题为“苹果以5000万美元收购人工智能创业公司（Apple Acquire AI Startup）以推进其应用程序”的新闻文章的原文（原文可在此处找到）：https://analyticsindiamag.com/apple-acquires-ai-startup-for-50-million-to-advance-its-apps/

In an attempt to scale up its AI portfolio, Apple has acquired Spain-based AI video startup — Vilynx for approximately $50 million.

Reported by Bloomberg, the AI startup — Vilynx is headquartered in Barcelona, which is known to build software using computer vision to analyse a video’s visual, text, and audio content with the goal of “understanding” what’s in the video. This helps it categorising and tagging metadata to the videos, as well as generate automated video previews, and recommend related content to users, according to the company website.

Apple told the media that the company typically acquires smaller technology companies from time to time, and with the recent buy, the company could potentially use Vilynx’s technology to help improve a variety of apps. According to the media, Siri, search, Photos, and other apps that rely on Apple are possible candidates as are Apple TV, Music, News, to name a few that are going to be revolutionised with Vilynx’s technology.

With CEO Tim Cook’s vision of the potential of augmented reality, the company could also make use of AI-based tools like Vilynx.

The purchase will also advance Apple’s AI expertise, adding up to 50 engineers and data scientists joining from Vilynx, and the startup is going to become one of Apple’s key AI research hubs in Europe, according to the news.

Apple has made significant progress in the space of artificial intelligence over the past few months, with this purchase of UK-based Spectral Edge last December, Seattle-based Xnor.ai for $200 million and Voysis and Inductiv to help it improve Siri. With its habit of quietly purchasing smaller companies, Apple is making a mark in the AI space. In 2018, CEO Tim Cook said in an interview that the company had bought 20 companies over six months, while only six were public knowledge.

摘要如下：

SUMMARY
------------------------
In an attempt to scale up its AI portfolio, Apple has acquired Spain-based AI video startup — Vilynx for approximately $50 million.
With CEO Tim Cook’s vision of the potential of augmented reality, the company could also make use of AI-based tools like Vilynx.
With its habit of quietly purchasing smaller companies, Apple is making a mark in the AI space.

结尾

祝贺你！你已经在Python中创建了你的个人文本摘要器。我希望，摘要看起来很不错。

原文链接：https://towardsdatascience.com/report-is-too-long-to-read-use-nlp-to-create-a-summary-6f5f7801d355

欢迎关注磐创AI博客站：
http://panchuang.net/

sklearn机器学习中文官方文档：
http://sklearn123.com/

欢迎关注磐创博客资源汇总站：
http://docs.panchuang.net/

码农公寓

什么是文本摘要

应该使用哪种总结方法

如何创建自己的文本摘要器

文本摘要工作流

（1） 示例文本

（2） 导入库

（3） 导入文本并执行预处理

（4） 将文本拆分为句子

（5） 删除停用词

（6） 构建直方图

（7） 根据分数排列句子

（8） 选择前面的句子作为摘要