NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Summarizing documents with T5-large
Contents
- Summarizing documents with T5-large
- Creating a summarization function
- A general topic sample
- The Bill of Rights sample
- A corporate law sample
- Starry Sky Intelligent Dialogue Robot Series blog posts
Summarizing documents with T5-large
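We will create a summarization function that can be called with any text. We will then summarize legal and financial examples and challenge the limits of this approach.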
Creating a summarization function
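First, we create a summarization function named summarize. It takes two parameters: text, the text to summarize, and ml, the maximum length of the summary. ml is passed to model.generate as max_length, so it is counted in tokens, not characters.
We apply the T5 task prefix "summarize: " to the input text. The T5 model has a unified structure: whatever the task, the input takes the form prefix + input sequence. This looks simple, but it brings NLP transformer models closer to universal training and zero-shot downstream tasks.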
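# Assumes tokenizer, model and device were created beforehand
# (for example when the T5-large checkpoint was loaded).
def summarize(text, ml):
    # Strip outer whitespace and remove newlines
    # (note: lines are joined without a space between them)
    preprocess_text = text.strip().replace("\n", "")
    # Prepend the T5 task prefix
    t5_prepared_Text = "summarize: " + preprocess_text
    print("Preprocessed and prepared text: \n", t5_prepared_Text)
    # Encode the text to token ids and move them to the target device
    tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
    # Summarize with beam search
    summary_ids = model.generate(tokenized_text,
                                 num_beams=4,
                                 no_repeat_ngram_size=2,
                                 min_length=30,
                                 max_length=ml,
                                 early_stopping=True)
    # Decode the best sequence back to text
    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return output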
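The function above uses three names it does not define: tokenizer, model and device. The post does not show how they are created, so the following is only a sketch of one common way to do it, assuming the Hugging Face transformers, sentencepiece and torch packages are installed. The helper name load_t5 is hypothetical.

```python
def load_t5(model_name="t5-large"):
    """Load a T5 checkpoint and return (model, tokenizer, device).

    Assumption: transformers, sentencepiece and torch are installed.
    The original post does not show this step.
    """
    # Imported lazily so that defining this helper does not require
    # the heavy dependencies to be present.
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    # Use a GPU when one is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
    model.eval()  # inference only, no training
    return model, tokenizer, device
```

Calling model, tokenizer, device = load_t5() once before summarize makes the three names available at module level. The first call downloads the checkpoint, so a smaller one such as "t5-small" may be more convenient for a quick trial.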
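It looks simple, doesn't it? Yet it took more than 35 years to get from RNNs and CNNs to transformers. Some of the brightest research teams in the world then moved from transformer models designed for specific tasks to multi-task models that need little or no fine-tuning. Finally, the Google research team created a standard format for a transformer's input text, with a prefix that indicates the NLP problem to solve. That is quite a feat!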
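The prefix + input sequence format can be sketched with plain strings. The prefixes below are the ones described in the T5 paper for summarization, English-to-German translation and CoLA. The dictionary and the helper with_prefix are hypothetical names used only for illustration.

```python
# Task prefixes described in the T5 paper (a small subset).
T5_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_de": "translate English to German: ",
    "cola": "cola sentence: ",
}

def with_prefix(task, text):
    """Return prefix + input sequence for a named task (hypothetical helper)."""
    return T5_PREFIXES[task] + text.strip()

print(with_prefix("summarize", "The corporation can sue and be sued."))
print(with_prefix("translate_en_de", "The house is wonderful."))
```

Only the prefix changes between tasks. The model, the tokenizer and the generate call stay the same.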
A general topic sample
text="""
The United States Declaration of Independence was the first Etext
released by Project Gutenberg, early in 1971. The title was stored
in an emailed instruction set which required a tape or diskpack be
hand mounted for retrieval. The diskpack was the size of a large
cake in a cake carrier, cost $1500, and contained 5 megabytes, of
which this file took 1-2%. Two tape backups were kept plus one on
paper tape. The 10,000 files we hope to have online by the end of
2001 should take about 1-2% of a comparably priced drive in 2001.
"""
print("Number of characters:",len(text))
summary=summarize(text,50)
print ("\n\nSummarized text: \n",summary)
The output is as follows:
Number of characters: 534
Preprocessed and prepared text:
summarize: The United States Declaration of Independence was the first Etextreleased by Project Gutenberg, early in 1971. The title was storedin an emailed instruction set which required a tape or diskpack behand mounted for retrieval. The diskpack was the size of a largecake in a cake carrier, cost $1500, and contained 5 megabytes, ofwhich this file took 1-2%. Two tape backups were kept plus one onpaper tape. The 10,000 files we hope to have online by the end of2001 should take about 1-2% of a comparably priced drive in 2001.
Summarized text:
the united states declaration of independence was the first etext published by project gutenberg, early in 1971. the 10,000 files we hope to have online by the end of2001 should take about 1-2% of a comparably priced drive in
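In the preprocessed text above, words at the original line breaks are fused ("Etextreleased", "storedin") because replace("\n","") removes each newline without putting a space in its place. A minimal sketch of an alternative is shown below. The helper normalize_whitespace is hypothetical, and whether it changes the generated summary has not been tested here.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace, including newlines, into one space."""
    return " ".join(text.split())

sample = "was the first Etext\nreleased by Project Gutenberg"

# Behaviour of the original preprocessing: the newline is simply dropped
original_style = sample.strip().replace("\n", "")
print(original_style)                # was the first Etextreleased by Project Gutenberg

# Alternative: the newline becomes a single space
print(normalize_whitespace(sample))  # was the first Etext released by Project Gutenberg
```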
The Bill of Rights sample
#Bill of Rights,V
text ="""
No person shall be held to answer for a capital, or otherwise infamous crime,
unless on a presentment or indictment of a Grand Jury, except in cases arising
in the land or naval forces, or in the Militia, when in actual service
in time of War or public danger; nor shall any person be subject for
the same offense to be twice put in jeopardy of life or limb;
nor shall be compelled in any criminal case to be a witness against himself,
nor be deprived of life, liberty, or property, without due process of law;
nor shall private property be taken for public use without just compensation.
"""
print("Number of characters:",len(text))
summary=summarize(text,50)
print ("\n\nSummarized text: \n",summary)
The output is as follows:
Number of characters: 591
Preprocessed and prepared text:
summarize: No person shall be held to answer for a capital, or otherwise infamous crime,unless on a presentment or indictment of a Grand Jury, except in cases arisingin the land or naval forces, or in the Militia, when in actual servicein time of War or public danger; nor shall any person be subject forthe same offense to be twice put in jeopardy of life or limb;nor shall be compelled in any criminal case to be a witness against himself,nor be deprived of life, liberty, or property, without due process of law;nor shall private property be taken for public use without just compensation.
Summarized text:
no person shall be held to answer for a capital, or otherwise infamous crime, unless ona presentment or indictment ofa Grand Jury. nor shall any person be subject for the same offense to be twice put
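The summary above stops in the middle of a clause ("to be twice put"), most likely because generation ends once max_length tokens have been produced. One possible post-processing step, which is not part of the original function, is to drop the trailing fragment. The helper trim_to_last_sentence is hypothetical. It is a simple heuristic that will also cut at abbreviations or decimal points.

```python
def trim_to_last_sentence(summary):
    """Drop a trailing fragment after the last sentence-ending mark.

    Returns the summary unchanged if it contains no such mark.
    """
    # Position of the last '.', '!' or '?' (-1 if none is present)
    last = max(summary.rfind(mark) for mark in ".!?")
    if last == -1:
        return summary
    return summary[: last + 1]

truncated = "indictment of a Grand Jury. nor shall any person be twice put"
print(trim_to_last_sentence(truncated))  # indictment of a Grand Jury.
```

The trade-off is that the trimmed text loses whatever information the fragment carried. Raising ml is the other option.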
A corporate law sample
#Montana Corporate Law
#https://corporations.uslegal.com/state-corporation-law/montana-corporation-law/#:~:text=Montana%20Corporation%20Law,carrying%20out%20its%20business%20activities.
text ="""The law regarding corporations prescribes that a corporation can be incorporated in the state of Montana to serve any lawful purpose. In the state of Montana, a corporation has all the powers of a natural person for carrying out its business activities. The corporation can sue and be sued in its corporate name. It has perpetual succession. The corporation can buy, sell or otherwise acquire an interest in a real or personal property. It can conduct business, carry on operations, and have offices and exercise the powers in a state, territory or district in possession of the U.S., or in a foreign country. It can appoint officers and agents of the corporation for various duties and fix their compensation.
The name of a corporation must contain the word “corporation” or its abbreviation “corp.” The name of a corporation should not be deceptively similar to the name of another corporation incorporated in the same state. It should not be deceptively identical to the fictitious name adopted by a foreign corporation having business transactions in the state.
The corporation is formed by one or more natural persons by executing and filing articles of incorporation to the secretary of state of filing. The qualifications for directors are fixed either by articles of incorporation or bylaws. The names and addresses of the initial directors and purpose of incorporation should be set forth in the articles of incorporation. The articles of incorporation should contain the corporate name, the number of shares authorized to issue, a brief statement of the character of business carried out by the corporation, the names and addresses of the directors until successors are elected, and name and addresses of incorporators. The shareholders have the power to change the size of board of directors.
"""
print("Number of characters:",len(text))
summary=summarize(text,50)
print ("\n\nSummarized text: \n",summary)
Number of characters: 1816
Preprocessed and prepared text:
summarize: The law regarding corporations prescribes that a corporation can be incorporated in the state of Montana to serve any lawful purpose. In the state of Montana, a corporation has all the powers of a natural person for carrying out its business activities. The corporation can sue and be sued in its corporate name. It has perpetual succession. The corporation can buy, sell or otherwise acquire an interest in a real or personal property. It can conduct business, carry on operations, and have offices and exercise the powers in a state, territory or district in possession of the U.S., or in a foreign country. It can appoint officers and agents of the corporation for various duties and fix their compensation.The name of a corporation must contain the word “corporation” or its abbreviation “corp.” The name of a corporation should not be deceptively similar to the name of another corporation incorporated in the same state. It should not be deceptively identical to the fictitious name adopted by a foreign corporation having business transactions in the state.The corporation is formed by one or more natural persons by executing and filing articles of incorporation to the secretary of state of filing. The qualifications for directors are fixed either by articles of incorporation or bylaws. The names and addresses of the initial directors and purpose of incorporation should be set forth in the articles of incorporation. The articles of incorporation should contain the corporate name, the number of shares authorized to issue, a brief statement of the character of business carried out by the corporation, the names and addresses of the directors until successors are elected, and name and addresses of incorporators. The shareholders have the power to change the size of board of directors.
Summarized text:
a corporation can be incorporated in the state of Montana to serve any lawful purpose. the corporation has perpetual succession and can sue and be sued in its corporate name. it can conduct business, carry on operations, and have offices
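Each sample prints the input length in characters, while ml caps the output in tokens, so the two numbers are not directly comparable. A small sketch for comparing a source text with its summary in the same units is shown below. The helper compression_stats is hypothetical and uses only plain string operations.

```python
def compression_stats(source, summary):
    """Return character counts, word counts and the character ratio."""
    return {
        "source_chars": len(source),
        "summary_chars": len(summary),
        "source_words": len(source.split()),
        "summary_words": len(summary.split()),
        # Fraction of the source length that the summary occupies
        "char_ratio": len(summary) / len(source) if source else 0.0,
    }

stats = compression_stats("one two three four", "one two")
print(stats)
```

With the samples in this post, it could be called as compression_stats(text, summary) right after each call to summarize.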
Starry Sky Intelligent Dialogue Robot Series blog posts
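- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Multi-head attention architecture, computing Q, K, V with a Python example
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Multi-head attention architecture, Concatenation of the output of the heads
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Positional encoding (positional_encoding)
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: KantaiBERT ByteLevelBPETokenizer
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: KantaiBERT Initializing model
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: KantaiBERT Exploring the parameters
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: KantaiBERT Initializing the trainer
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: KantaiBERT Language modeling with FillMaskPipeline
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: GLUE Winograd schemas and NER
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Workshop on Machine Translation (WMT)
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Pattern-Exploiting Training (PET)
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: The philosophy of Pattern-Exploiting Training (PET)
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: It's time to make a decision
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Text completion with GPT-2
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Text completion with GPT-2 step3-5
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Text completion with GPT-2 step 6-8
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Text completion with GPT-2 step 9
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Training a GPT-2 language model
- NLP Starry Sky Intelligent Dialogue Robot Series: Paper study: Do Transformers Really Perform Bad for Graph Representation
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Training a GPT-2 language model Steps 2 to 6
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Training a GPT-2 language model Steps 7 to 9
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Training a GPT-2 language model Steps 10
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: T5-large transformer model
- NLP Starry Sky Intelligent Dialogue Robot Series: Understanding Transformers for NLP in Depth: Architecture of the T5 model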