Coursera NLP Specialization Course 04: Natural Language Processing with Attention Models Notes, Week 02

Natural Language Processing with Attention Models

Course Certificate

[Figure]

These are study notes for the course Natural Language Processing with Attention Models. If any content infringes on your rights, please contact me to have it removed.

[Figure]

Table of Contents

  • Natural Language Processing with Attention Models
  • Text Summarization
      • Learning Objectives
    • Transformers vs RNNs
    • Transformers overview
    • Transformer Applications
    • Scaled and Dot-Product Attention
    • Masked Self Attention
    • Multi-head Attention
    • Reading: Multi-head Attention
    • Lab: Attention
      • Background
      • Imports
      • Dot product attention
    • Lab: Masking
      • 1 - Masking
        • 1.1 - Padding Mask
        • 1.2 - Look-ahead Mask
    • Lab: Positional Encoding
      • 1. Positional Encoding
        • 1.1 - Sine and Cosine Angles
        • 1.2 - Sine and Cosine Positional Encodings
    • Transformer Decoder
    • Transformer Summarizer
    • Quiz: Text Summarization
  • Programming Assignment: Transformer Summarizer
    • Introduction
    • 1 - Import the Dataset
    • 2 - Preprocess the data
    • 3 - Positional Encoding
    • 4 - Masking
    • 5 - Self-Attention
      • Exercise 1 - scaled_dot_product_attention
    • 6 - Encoder
      • 6.1 Encoder Layer
      • 6.2 - Full Encoder
    • 7 - Decoder
      • 7.1 - Decoder Layer
      • Exercise 2 - DecoderLayer
      • 7.2 - Full Decoder
      • Exercise 3 - Decoder
    • 8 - Transformer
      • Exercise 4 - Transformer
    • 9 - Initialize the Model
    • 10 - Prepare for Training the Model
    • 11 - Summarization
      • Exercise 5 - next_word
    • 12 - Train the model
    • 13 - Summarize some Sentences!
    • Grades
  • Afterword

Text Summarization

Compare RNNs and other sequential models to the more modern Transformer architecture, then create a tool that generates text summaries.

Learning Objectives


  • Describe the three basic types of attention
  • Name the two types of layers in a Transformer
  • Define three main matrices in attention
  • Interpret the math behind scaled dot product attention, causal attention, and multi-head attention
  • Use articles and their summaries to create input features for training a text summarizer
  • Build a Transformer decoder model (GPT-2)

Transformers vs RNNs

[Figure]

In the image above, you can see a typical RNN that is used to translate the English sentence “How are you?” to its French equivalent, “Comment allez-vous?”. One of the biggest issues with these RNNs is that they make use of sequential computation. That means, in order for your code to process the word “you”, it has to first go through “How” and “are”. Two other issues with RNNs are:

  • Loss of information: For example, it is harder to keep track of whether the subject is singular or plural as you move further away from the subject.
  • Vanishing Gradient: when you back-propagate, the gradients can become really small and as a result, your model will not be learning much.

In contrast, transformers are based on attention and don’t require any sequential computation per layer, only a single step is needed. Additionally, the gradient steps that need to be taken from the last output to the first input in a transformer is just one. For RNNs, the number of steps increases with longer sequences. Finally, transformers don’t suffer from vanishing gradients problems that are related to the length of the sequences.

We are going to talk more about how the attention component works with transformers, so don’t worry about it for now.

Welcome, this week I’ll teach you about the transformer model. It’s a purely attention-based model that was developed at Google to remedy some problems with RNNs. First, let me tell you what these problems are so you understand why the transformer model is needed. Let’s dive in. First, I will talk about some problems related to recurrent neural networks using some familiar architectures. After that, I’ll show you why pure attention models help us solve those issues.

In neural machine translation, you use a neural architecture to translate from one language to another, in this example from English to French. Using an RNN, you have to take sequential steps to encode your inputs. You start from the beginning of your input, making computations at every step until you reach the end. At that point, you decode the information following a similar sequential procedure. As you can see here, you have to go through every word in your inputs, starting with the first word, followed by the second word, one after another, in a sequential manner, and the translation is then produced in a sequential way too. For that reason, there is not much room for parallel computations here. The more words you have in the input sentence, the more time it will take to process that sentence.

Let’s look closer at a more general sequence-to-sequence architecture. In this case, to propagate information from your first word to the last output, you have to go through capital T sequential steps, where capital T is an integer that stands for the number of time steps that your model will go through to process the inputs of one example sentence. If, for instance, you are inputting a sentence that consists of five words, then the model will take five time steps to encode that sentence, and in this example T equals five.

[Figure]

And as you may recall from earlier in the specialization, with large sequences the information tends to get lost within the network, and vanishing gradient problems arise related to the length of your input sequences. LSTMs and GRUs help a little with these problems, but even those architectures stop working well when they try to process very long sequences, due to the information bottleneck, as you saw in the last week of this course. So to recap, we said we have a loss of information and then we have the vanishing gradients problem.

Including attention in your model is a way to tackle these problems. You already saw and implemented a sequence-to-sequence architecture with attention similar to the one depicted here. Recall that you relied on LSTMs for your encoder and decoder, but you could also have used GRUs or just vanilla RNNs. In contrast, transformers rely only on attention mechanisms and don’t require the use of recurrent networks. In a transformer, attention is all you need. Well, some linear and non-linear transformations are usually included, but you get the idea.

[Figure]

Now you understand why RNNs can be slow and can have big problems with contexts. These are the cases where transformers can help. Next, I’ll show you a concrete overview of the transformer. Let’s go to the next video.

Transformers overview

There has been a lot of hype around transformers. In this video, I’ll give you an overview of the transformer model. The transformer model was introduced in 2017 by researchers at Google, including Lukasz Kaiser, who helped us develop this course. Since then, the transformer architecture has become the standard for large language models, including BERT, T5, and GPT-3, which you’ll learn about later. Transformers revolutionized the field of natural language processing. I suggest that you read the first transformer paper, Attention Is All You Need. It’s the basis for all the models presented in the rest of this course. You’ll see how each part of the transformer model works in detail, but first I want to give you a brief overview of this architecture. Now, don’t worry if some of its components aren’t clear; I’ll go more in depth in the following lectures.

The transformer model uses scaled dot-product attention, which you saw in the first week of this course. This form of attention is very efficient in terms of computation and memory because it consists of just matrix multiplication operations. This mechanism is the core of the model, and it allows the transformer to grow larger and more complex while being faster and using less memory than other comparable model architectures.

[Figure]

In the transformer model, you will use the multi-head attention layer. This layer runs in parallel, and it has a number of scaled dot-product attention mechanisms and multiple linear transformations of the input queries, keys, and values. In this layer, the linear transformations are learnable parameters.
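As a concrete illustration (my own sketch, not part of the course materials), TensorFlow’s built-in Keras layer bundles exactly these pieces: the learnable linear transformations of queries, keys, and values, plus several scaled dot-product attention heads running in parallel. The layer sizes below are illustrative assumptions.

```python
import tensorflow as tf

# Hypothetical sizes: 8 heads, 64 dimensions per head, embedding size 512.
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

x = tf.random.uniform((1, 10, 512))  # (batch, sequence length, embedding size)

# Self-attention: queries, keys, and values all come from the same tensor.
context, weights = mha(query=x, value=x, key=x, return_attention_scores=True)
print(context.shape)  # (1, 10, 512) -> one context vector per input position
print(weights.shape)  # (1, 8, 10, 10) -> one attention matrix per head
```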

[Figure]

The transformer encoder starts with a multi-head attention module that performs self-attention on the input sequence. That is, each word in the input attends to every other word in the input. This is followed by a residual connection and normalization, a feed-forward layer, and another residual connection and normalization. This entire block is one encoder layer and is repeated N times. Thanks to the self-attention layer, the encoder will give you a contextual representation of each one of your inputs.
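A minimal Keras sketch of one such encoder layer is below, assuming illustrative sizes (d_model=512, 8 heads, feed-forward width 2048) rather than the course’s exact hyperparameters; dropout is omitted for brevity.

```python
import tensorflow as tf

class EncoderLayer(tf.keras.layers.Layer):
    """One encoder block: self-attention, then feed-forward, each followed by
    a residual connection and layer normalization."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                                      key_dim=d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, x, mask=None):
        attn_out = self.mha(query=x, value=x, key=x, attention_mask=mask)  # self-attention
        x = self.norm1(x + attn_out)        # residual connection + normalization
        ffn_out = self.ffn(x)               # position-wise feed-forward layer
        return self.norm2(x + ffn_out)      # second residual connection + normalization

# The full encoder stacks N of these layers on top of the embedded, position-encoded input.
```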

[Figure]

The decoder is constructed similarly to the encoder, with multi-head attention modules, residual connections, and normalization. The first attention module is masked such that each position attends only to previous positions; it blocks leftward-flowing information. The second attention module takes the encoder output and allows the decoder to attend to all items. This whole decoder layer is also repeated some number of times, one after another.
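In the same spirit, here is a hedged Keras sketch of one decoder layer under the same assumed sizes: masked self-attention first, then attention over the encoder output, each wrapped in a residual connection and normalization.

```python
import tensorflow as tf

class DecoderLayer(tf.keras.layers.Layer):
    """One decoder block: masked self-attention, encoder-decoder attention,
    and a feed-forward network, each wrapped in residual + layer norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.masked_mha = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.cross_mha = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, x, enc_output, look_ahead_mask=None):
        # Masked self-attention: each position attends only to itself and earlier positions.
        attn1 = self.masked_mha(query=x, value=x, key=x, attention_mask=look_ahead_mask)
        x = self.norm1(x + attn1)
        # Encoder-decoder attention: queries from the decoder, keys/values from the encoder output.
        attn2 = self.cross_mha(query=x, value=enc_output, key=enc_output)
        x = self.norm2(x + attn2)
        return self.norm3(x + self.ffn(x))
```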

[Figure]

Transformers also incorporate a positional encoding stage, which encodes each input’s position in the sequence. This is necessary because transformers don’t use recurrent neural networks, but word order is relevant for any language. Positional encoding can be learned or fixed, just as with word embeddings. For instance, suppose you want to translate a French phrase and you want to capture the sequential information. The transformer uses a positional encoding to retain the position of each word in the input sequence. The positional encoding has values that are added to the embeddings, so that for every input word you have information about its order and position: in this case, one positional encoding vector for each word in the phrase.
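For the fixed variant, the sinusoidal encoding from the original transformer paper can be computed as in the NumPy sketch below (sizes are illustrative); each row is the vector added to the embedding of the word at that position.

```python
import numpy as np

def positional_encoding(max_positions, d_model):
    """Sinusoidal positional encoding: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(max_positions)[:, np.newaxis]      # (max_positions, 1)
    dims = np.arange(d_model)[np.newaxis, :]                 # (1, d_model)
    angles = positions / np.power(10000, (2 * (dims // 2)) / np.float32(d_model))
    angles[:, 0::2] = np.sin(angles[:, 0::2])                # even indices -> sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])                # odd indices  -> cosine
    return angles

pe = positional_encoding(max_positions=50, d_model=512)
print(pe.shape)   # (50, 512): one positional vector per position, added to the word embeddings
```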

[Figure]

Putting these parts together, here’s the full model architecture. Briefly: on the left, the input sentence is first embedded and the positional encodings are applied. This goes to the encoder, which consists of multiple layers of multi-head attention modules. On the right is the decoder, which takes the output sentence, shifted one step to the right, together with the outputs from the encoder. The decoder output is turned into output probabilities using a linear layer with a softmax activation. This architecture is easy to parallelize compared to RNN models and, as such, can be trained much more efficiently on multiple GPUs. It can also scale up to learn multiple tasks on larger and larger datasets. I went through this quickly, but don’t worry, I’ll go in depth on each part in later videos.

[Figure]

In summary, RNNs have some problems that come from their sequential structure. With RNNs, it is hard to fully exploit the advantages of parallel computing, and for long sequences, important information might get lost within the network and vanishing gradient problems arise. Fortunately, recent research has found ways to solve the shortcomings of RNNs by using transformers. Transformers are a great alternative to RNNs that help overcome these problems in NLP and in many fields that process sequential data. You can now see why everyone is talking about transformers; they are indeed very useful. In the next video, I’ll talk about some of the applications of transformers.

[Figure]

Transformer Applications

The transformer is one of the most versatile deep learning models. It has been successfully applied to a number of tasks both in NLP and beyond. Let me show you a few examples. In this video you will see a brief overview of the diverse transformer applications in NLP, and you will also learn about some powerful transformers. First, I’ll mention the most popular applications of transformers in NLP. Then you’ll learn what the state-of-the-art transformer models are, including the so-called Text-to-Text Transfer Transformer, or T5 for short. Finally, you will see how useful and versatile T5 is.

Since transformers can generally be applied to any sequential task, just like RNNs, they have been widely used throughout NLP. One very interesting and popular application is automatic text summarization. Transformers are also used for autocompletion, named entity recognition, automatic question answering, and machine translation. Another application is chatbots, plus many other NLP tasks such as sentiment analysis and market intelligence, among others.

[Figure]

Many variants of transformers are used in NLP, and as usual, researchers give their models their very own names. For example, GPT-2, which stands for Generative Pre-trained Transformer 2, is a transformer created by OpenAI with pre-training. It is so good at generating text that the news magazine The Economist had a reporter ask the GPT-2 model questions as if they were interviewing a person, and they published the interview in 2019. BERT, which stands for Bidirectional Encoder Representations from Transformers and which was created by the Google AI Language team, is another famous transformer used for learning text representations. T5, which stands for Text-to-Text Transfer Transformer and was also created by Google, is a multi-task transformer that can do question answering among a lot of different tasks.

[Figure]

Let’s dive a little bit deeper into the T5 model. A single T5 model can learn to do multiple different tasks. This is a pretty significant advancement. For example, let’s say you want to perform tasks such as translation, classification, and question answering. Normally, you would design and train one model to perform translation, then design and train a second model to perform classification, and then design and train a third model to perform question answering. But with transformers, you can train a single model that is able to perform all of these tasks.

To tell the T5 model that you want it to perform a certain task, you give the model an input string of text that includes both the task that you want it to do and the data that you want it to perform that task on. For example, if you want to translate the English sentence “I am happy” into French, you would use the input string “translate English to French: I am happy”, and the model would output the French translation, “Je suis heureux”.

Here is an example of classification, where input sentences are classified into two classes: acceptable when they make sense, and unacceptable otherwise. In this example, the input string starts with “cola sentence:”, which the model understands as asking it to classify the sentence that follows this command as acceptable or unacceptable. For instance, the sentence “He bought fruits and” is incomplete and is therefore classified as unacceptable. Meanwhile, if we give the T5 model the input “cola sentence: He bought fruits and vegetables.”, the model classifies “He bought fruits and vegetables” as an acceptable sentence. If we give the T5 model an input starting with the word “question” followed by a colon, the model knows that this is a question-answering example. In this example, the question is “Which volcano in Tanzania is the highest mountain in Africa?”, and T5 will output the answer to that question, which is Mount Kilimanjaro. And remember that all of these tasks are done by the same model with no modification other than the input sentences. How cool is that?

[Figure]

Even more, T5 also performs regression and summarization tasks. Recall that a regression model is one that outputs a continuous numeric value. Here you can see an example of regression which outputs the similarity between two sentences. The start of the input string, “stsb”, indicates to the model that it should perform a similarity measurement between two sentences. The two sentences are denoted by the words sentence1 and sentence2. The range of possible outputs is any numerical value between zero and five, where zero indicates that the sentences are not similar at all and five indicates that they are very similar. Consider this example: comparing sentence1, “Cats and dogs are mammals”, with sentence2, “There are four known forces in nature: gravity, electromagnetic, weak, and strong”, the resulting similarity level is zero, indicating that the sentences are not similar. Now consider another example: sentence1, “Cats and dogs are mammals”, and sentence2, “Cats, dogs, and cows are domesticated”. In this case, the similarity level may be 2.6 on a range between zero and five. Finally, here you can see an example of summarization. A long story about all the events and details of an onslaught of severe weather in Mississippi is summarized as just “six people hospitalized after a storm in Attala County”.
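Putting the examples from this section together, the task prefix and the data are simply concatenated into one input string. The strings below are illustrative reconstructions of the formats described above, not outputs from an actual T5 checkpoint.

```python
# Hypothetical T5-style inputs: the same model handles every task,
# the prefix alone tells it what to do.
t5_inputs = [
    "translate English to French: I am happy.",
    "cola sentence: He bought fruits and.",                         # acceptability classification
    "question: Which volcano in Tanzania is the highest mountain in Africa?",
    "stsb sentence1: Cats and dogs are mammals. sentence2: Cats, dogs, and cows are domesticated.",
    "summarize: <long article about severe weather in Mississippi>",
]
```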

[Figure]

This is a demo using T5 for trivia questions, so that you can compete against a transformer. What makes this demo interesting is that T5 was trained in a closed-book setting, without access to any external knowledge; these are examples where I was playing against the trivia bot. All right, so in this video you saw what the transformer applications in NLP are, which range from translation to summarization. Some well-known transformers include GPT, BERT, and T5. I also showed you how versatile and powerful T5 is, as it can perform multiple tasks using text representations.

[Figure]

Now you know why we need transformers and where they can be applied. Isn’t it astounding that one model can handle such a variety of tasks? I hope you are now eager to learn how transformers work, and that’s what I will show you next. Let’s go to the next video.

[Figure]

Scaled and Dot-Product Attention

The main operation in the transformer is the scaled dot-product attention. You’ve already seen attention in the first week of this course; in this video, I’ll remind you how it works. First, I’ll remind you of the formula used for scaled dot-product attention. Then you’ll see some details about the math and the dimensions of the queries, keys, and values. Recall that in scaled dot-product attention, you have queries, keys, and values. The attention layer outputs context vectors for each query, and the context vectors are weighted sums of the values, where the similarity between the queries and keys determines the weights assigned to each value. The softmax ensures that the weights add up to 1, and the division by the square root of the dimension of the key vectors is used to improve performance. The scaled dot-product attention mechanism is very efficient since it relies only on matrix multiplication and softmax. Additionally, you could implement this attention mechanism to run on GPUs or TPUs to speed up training.

[Figure]

To get the query, key, and value matrices, you must first transform the words in your sequences into word embeddings. Let’s take the sentence “Je suis heureux” as the source for the queries. You’ll need to get the embedding vector for the word “Je”, then for the word “suis”, and finally for the word “heureux”. The query matrix will contain all of these embedding vectors as rows. Note that the matrix size is given by the size of the word embeddings and the length of the sequence. To get the key matrix, let’s use as source the sentence “I am happy”. You will get the embedding for each word in the sentence and stack them together to form the key matrix. You will generally use the same vectors used for the key matrix for the value matrix, but you could also transform them first. Note, however, that the number of vectors used to form the key and value matrices must be the same.

[Figure]

Now you can revisit the scaled attention formula that I showed you before to get a sense of the dimensions of the matrices involved at each step. First, you compute the matrix product between the query matrix and the transpose of the key matrix, scale it by the inverse of the square root of the dimension of the key vectors, d_k, and calculate the softmax. This computation gives you a matrix with the weights for each key per query. Therefore, the weight matrix has a total number of elements equal to the number of queries times the number of keys. In this matrix, the third element in the second row corresponds to the weight assigned to the third key for the second query. After the computation of the weights matrix, you multiply it by the value matrix to get a matrix whose rows are the context vectors corresponding to each query. The number of columns in this matrix is equal to the size of the value vectors, which is often the same as the embedding size.
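The shapes described above are easy to check with a small NumPy sketch (a simplified single-example implementation with illustrative dimensions, not the course’s graded code):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one example."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                    # (num_queries, num_keys)
    scores -= scores.max(axis=-1, keepdims=True)       # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ v, weights                        # context: (num_queries, d_v)

q = np.random.rand(3, 4)   # 3 query embeddings ("Je", "suis", "heureux"), embedding size 4
k = np.random.rand(3, 4)   # 3 key embeddings ("I", "am", "happy")
v = k.copy()               # values are usually the same vectors as the keys
context, weights = scaled_dot_product_attention(q, k, v)
print(weights.shape, context.shape)   # (3, 3) (3, 4)
```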

[Figure]

Scaled dot-product attention is the heart and soul of transformers. In general terms, this mechanism takes queries, keys, and values as matrices of embeddings. It is composed of just two matrix multiplications and a softmax function. Therefore, you could consider using GPUs and TPUs to speed up the training of models that rely on this mechanism. Now that you understand scaled dot-product attention very well, note that in the transformer decoder we need an extended version called masked self-attention. I’ll teach you about it in the next video.

[Figure]

In the attention mechanism, dividing by $\sqrt{d_k}$ scales the attention scores. This scaling keeps the distribution of the attention weights more stable and smooth, which helps reduce vanishing or exploding gradients during training.

The attention weights are computed with the general formula:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$

Here $Q$, $K$, and $V$ are the matrix representations of the queries, keys, and values, and $d_k$ is the dimension of the keys, i.e. the length of each key vector. When computing the attention weights, the dot products of the queries and keys are divided by $\sqrt{d_k}$, and the softmax function then converts the scaled scores into attention weights.

The main purpose of the scaling is to control the range of the dot products so that the softmax computation stays stable and avoids very large values, which improves training.

The inputs to the attention mechanism are usually represented as three matrices: the query matrix $Q$, the key matrix $K$, and the value matrix $V$. The number of rows corresponds to the length of the input sequence, and the number of columns to the dimension of the feature vectors. When computing the attention weights, $Q$ and $K^T$ are multiplied, and the result is divided by $\sqrt{d_k}$, where $d_k$ is the number of columns of $K$, i.e. the length of each key vector. This keeps the dot products in a stable range and helps control the size of the gradients.
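A quick numeric illustration (my own, not from the course) of why the division by $\sqrt{d_k}$ helps: the dot product of two random $d_k$-dimensional vectors has a standard deviation of roughly $\sqrt{d_k}$, so without scaling the softmax receives very large scores and saturates.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=d_k)                 # one query vector
K = rng.normal(size=(1000, d_k))         # many key vectors

raw_scores = K @ q
print(raw_scores.std())                  # roughly sqrt(512) ~ 22.6: softmax would saturate
print((raw_scores / np.sqrt(d_k)).std()) # roughly 1: scaled scores keep softmax well-behaved
```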

Masked Self Attention

In this video, I’ll review the different types of attention mechanisms in the transformer model, and you will see how to compute masked self-attention. First, you will see what the three main types of attention in the transformer model are. Afterwards, I’ll give you a brief overview of masked self-attention. One of the attention mechanisms in a transformer model is the familiar encoder-decoder attention. In that mechanism, the words in one sentence attend to all the words in another one. That is, the queries come from one sentence while the keys and values come from another. You’ve already used this kind of attention in the translation task from last week, where the words from sentences in French attended to words from sentences in English.

[Figure]

In self-attention, the queries, keys, and values come from the same sentence. Every word attends to every other word in the sequence. This type of attention lets you get contextual representations of your words. In other terms, self-attention gives you a representation of the meaning of each word within the sentence.

[Figure]

Finally, in masked self-attention, queries, keys, and values also come from the same sentence, but each query cannot attend to keys at future positions. This attention mechanism is present in the decoder of the transformer model and ensures that predictions at each position depend only on the known outputs.

[Figure]

Mathematically, self-attention works precisely like encoder-decoder attention; the only difference is the nature of the inputs for each mechanism. Let’s focus on masked self-attention. Recall that scaled dot-product attention requires the calculation of the softmax of the scaled product between the queries and the transpose of the key matrix. For masked self-attention, you add a mask matrix inside the softmax argument. The mask has a zero in all of its positions, except for the elements above the diagonal, which are set to minus infinity, or, in practice, a huge negative number. After taking the softmax, this addition ensures that the elements in the weights matrix are zero for all keys at positions subsequent to the query. In the end, as with the other types of attention, you multiply the weights matrix by the value matrix to get the context vector for each query, and that’s it. You only need to add a mask matrix inside the softmax to ensure that the queries don’t attend to future positions.
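A minimal NumPy sketch of this masking step is below (illustrative, single example): the mask is zero on and below the diagonal and a huge negative number above it, and it is added to the scaled scores before the softmax.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(q, k, v):
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                     # (seq_len, seq_len)
    mask = np.triu(np.ones_like(scores), k=1) * -1e9    # -1e9 above the diagonal, 0 elsewhere
    weights = softmax(scores + mask)                    # future positions get ~0 weight
    return weights @ v, weights

x = np.random.rand(4, 8)                                # 4 positions, embedding size 8
context, weights = masked_self_attention(x, x, x)
print(np.round(weights, 2))                             # upper triangle is all zeros
```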

[Figure]

When the input contains negative infinity, the softmax function drives the corresponding output value to zero. Specifically, if one or more elements of the input vector are $-\infty$, the corresponding softmax outputs will be zero.

For example, consider an input vector containing negative infinity, $[-1, -\infty, -3]$. Applying the softmax function gives:

$$
\text{softmax}([-1, -\infty, -3]) = \left[ \frac{e^{-1}}{e^{-1} + e^{-\infty} + e^{-3}},\; \frac{e^{-\infty}}{e^{-1} + e^{-\infty} + e^{-3}},\; \frac{e^{-3}}{e^{-1} + e^{-\infty} + e^{-3}} \right]
$$

Since $e^{-\infty}$ is numerically a vanishingly small value that can be approximated as zero, the computation simplifies to:

$$
\text{softmax}([-1, -\infty, -3]) \approx [0.8808,\; 0,\; 0.1192]
$$

The second element is zero, which shows that when the input contains negative infinity, softmax pushes the corresponding output toward zero.
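This is easy to verify numerically; the snippet below approximates $-\infty$ with a large negative number, as done in practice:

```python
import numpy as np

x = np.array([-1.0, -1e9, -3.0])   # -1e9 stands in for minus infinity
e = np.exp(x - x.max())            # numerically stable softmax
print(np.round(e / e.sum(), 4))    # [0.8808 0.     0.1192]
```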

In this video, I showed you the three main types of attention: encoder-decoder attention, self-attention, and masked self-attention. In masked self-attention, queries and keys are contained in the same sentence, but queries cannot attend to future positions. You have seen many types of attention so far. In the next video, I’ll show you multi-head attention. It is a very powerful form of attention that allows for parallel computing.

[Figure]

Suppose we use a simple self-attention mechanism to process the sentence “我 喜欢 你” (“I like you”). The mechanism computes, for every position, attention weights over all other positions, which yields an attention weight matrix. The matrix tells us, when generating each word, how much attention the model should pay to the words at the other positions of the sentence.

Taking the sentence “我 喜欢 你” as an example, suppose the method above produced the following attention weight matrix:

Attention Weights = [ 0.7 0 0
