10. From the Attention Layer to the Transformer Network


1. Multi-Head Attention

1.1 Single-Head Self-Attention

The input is a sequence (X1, X2, X3, ···, Xm). A single-head Self-Attention layer has three parameter matrices (WQ, WK, WV), and its output is a sequence (C:1, C:2, C:3, ···, C:m).
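The computation can be written out concretely. Below is a minimal NumPy sketch of a single-head self-attention in this setting; the names (WQ, WK, WV, d) follow the notation above, while the scaling by √d and the column-wise softmax are standard scaled-dot-product choices assumed here rather than spelled out in the notes.

```python
import numpy as np

def softmax(Z, axis=0):
    """Softmax along the given axis, with max-subtraction for numerical stability."""
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

def single_head_self_attention(X, W_Q, W_K, W_V):
    """X: (d_in, m) input sequence -> C: (d, m) output sequence."""
    Q = W_Q @ X                                 # queries q:1..q:m, shape (d, m)
    K = W_K @ X                                 # keys    k:1..k:m, shape (d, m)
    V = W_V @ X                                 # values  v:1..v:m, shape (d, m)
    d = Q.shape[0]
    A = softmax(K.T @ Q / np.sqrt(d), axis=0)   # attention weights, shape (m, m); each column sums to 1
    return V @ A                                # c:j is a weighted average of all m value vectors

# tiny usage example with random parameters (d_in = 6, d = 4, m = 5)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
W_Q, W_K, W_V = (rng.normal(size=(4, 6)) for _ in range(3))
C = single_head_self_attention(X, W_Q, W_K, W_V)
print(C.shape)  # (4, 5)
```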


1.2 Multi-Head Self-Attention

  • A multi-head self-attention is built from L single-head self-attentions, which do not share parameters.
    • Each single-head self-attention has 3 parameter matrices: WQ, WK, WV.
    • In total, the multi-head self-attention therefore has 3L parameter matrices.

All the single-head self-attentions receive the same input, but their parameter matrices differ, so their output C sequences differ as well. The L output sequences are concatenated (stacked) to form the final output of the multi-head self-attention, so the stacked C vectors become higher-dimensional.

  • Suppose each single-head self-attention's output is a d×m matrix.
  • Then the multi-head output has shape (L·d)×m; a sketch of this stacking follows below.
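A minimal sketch of the stacking, reusing the `single_head_self_attention` function (and the running example) from Section 1.1; representing the L heads as a list of parameter tuples is just an illustrative choice.

```python
def multi_head_self_attention(X, heads):
    """heads: list of L tuples (W_Q, W_K, W_V), one per head (no parameter sharing).
    Each head maps X to a (d, m) matrix; the L outputs are stacked into (L*d, m)."""
    outputs = [single_head_self_attention(X, W_Q, W_K, W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outputs, axis=0)

# e.g. L = 8 heads of size d = 4 each -> output shape (8*4, m) = (32, 5)
heads = [tuple(rng.normal(size=(4, 6)) for _ in range(3)) for _ in range(8)]
print(multi_head_self_attention(X, heads).shape)  # (32, 5)
```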


1.3 Multi-Head Attention

  • A multi-head attention is built from L single-head attentions, which do not share parameters.

  • The outputs of the L single-head attentions are concatenated (stacked).

  • The input consists of two sequences, (X1, X2, X3, ···, Xm) and (X'1, X'2, X'3, ···, X't); a sketch of this case follows below.
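Here is a sketch of the two-sequence (cross-attention) case, reusing the `softmax` helper above. Which sequence supplies the queries is an assumption on my part: following the encoder-decoder attention described in Section 3.2, the keys and values are taken from the first sequence (length m) and the queries from the second (length t), so each head outputs a d×t matrix and the multi-head output is (L·d)×t.

```python
def single_head_attention(X, X_prime, W_Q, W_K, W_V):
    """Keys/values from X (shape (d_in, m)), queries from X_prime (shape (d_in2, t)) -> C: (d, t)."""
    Q = W_Q @ X_prime                           # (d, t)
    K = W_K @ X                                 # (d, m)
    V = W_V @ X                                 # (d, m)
    d = Q.shape[0]
    A = softmax(K.T @ Q / np.sqrt(d), axis=0)   # (m, t); column j weights the m positions of X
    return V @ A                                # (d, t)

def multi_head_attention(X, X_prime, heads):
    """Concatenate the outputs of L single-head attentions -> (L*d, t)."""
    return np.concatenate(
        [single_head_attention(X, X_prime, W_Q, W_K, W_V) for W_Q, W_K, W_V in heads],
        axis=0,
    )
```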


2. Stacked Self-Attention Layers

A multi-head self-attention layer and a dense (fully connected) layer are combined to build an encoder network.

2.1 Self-Attention Layer + Dense Layer

  • The dense layers at all m positions are identical; they all use the same parameter matrix WU.
  • Two layers are stacked here: a self-attention layer followed by a dense layer.
  • The vector U:i does not depend only on Xi; it depends on all m X vectors. Changing any X vector changes U:i, although Xi still has the largest influence on U:i. A sketch of this block follows below.
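A sketch of this two-layer block, reusing `multi_head_self_attention` from above. The ReLU activation in the dense layer is an assumption (the notes do not specify the activation); the key point is that the same WU is applied to every column C:i to produce U:i.

```python
def self_attention_dense_block(X, heads, W_U):
    """Self-attention layer followed by a position-wise dense layer.
    The same W_U is used at every position, so u:i = ReLU(W_U @ c:i)."""
    C = multi_head_self_attention(X, heads)  # (L*d, m); every c:i depends on all m inputs
    U = np.maximum(0.0, W_U @ C)             # dense layer applied column by column (ReLU assumed)
    return U                                 # (d_out, m)
```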


2.2 Stacked Self-Attention Layers


2.3 Transformer’s Encoder

  • One block has two layers: a self-attention layer and a dense layer.
  • The input is a 512×m matrix, and the output is also a 512×m matrix.
  • m is the length of the input sequence X, and each X vector is 512-dimensional.
  • The Transformer encoder consists of 6 such blocks; each block has two layers and its own parameters, and the blocks do not share parameters.
  • The final output is a 512×m matrix, the same size as the input; a sketch of the full encoder follows below.
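A sketch of the whole encoder under these shapes, continuing the running example. With, e.g., L = 8 heads of size d = 64 (so L·d = 512) and WU of shape 512×512, each block maps a 512×m matrix to a 512×m matrix, and six such blocks with independent parameters are applied in sequence. The head count and per-head dimension are illustrative assumptions; the notes only fix the 512×m input/output shape.

```python
def transformer_encoder(X, blocks):
    """blocks: list of 6 (heads, W_U) parameter sets, one per block (no sharing across blocks).
    X: (512, m) -> (512, m); every block preserves the shape."""
    H = X
    for heads, W_U in blocks:
        H = self_attention_dense_block(H, heads, W_U)
    return H

# illustrative parameters: 6 blocks, each with 8 heads of size 64 and W_U of shape (512, 512)
m = 10
X512 = rng.normal(size=(512, m))
blocks = [
    ([tuple(rng.normal(size=(64, 512)) for _ in range(3)) for _ in range(8)],
     rng.normal(size=(512, 512)))
    for _ in range(6)
]
print(transformer_encoder(X512, blocks).shape)  # (512, 10)
```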


3. Stacked Attention Layers

3.1 Stacked Attentions

  • Transformer is a Seq2Seq model (encoder + decoder).
  • The encoder's inputs are the vectors X1, X2, X3, ···, Xm.
  • The decoder's inputs are the vectors X'1, X'2, X'3, ···, X't.
  • The Transformer encoder contains 6 stacked blocks.
  • 1 block = 1 multi-head attention layer + 1 dense layer.

3.2 Transformer’s Decoder

  • One decoder block contains a self-attention layer, an attention layer, and a dense layer.

  • This block takes two input sequences; both consist of 512-dimensional vectors, and their lengths are m and t respectively. A sketch of one decoder block follows below.
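Here is a sketch of one decoder block in the same style, reusing the functions and the running example above: self-attention over the decoder sequence, then attention against the encoder output, then the position-wise dense layer, mapping the pair (512×m, 512×t) to a 512×t output. The causal mask used by the real decoder is omitted here, and the ReLU dense layer and 8×64 head configuration are again illustrative assumptions.

```python
def decoder_block(X_enc, X_dec, self_heads, cross_heads, W_U):
    """One decoder block: self-attention over X_dec, then attention with keys/values
    taken from X_enc, then a position-wise dense layer.
    X_enc: (512, m), X_dec: (512, t) -> output: (512, t).
    (The causal mask of the real decoder is omitted in this sketch.)"""
    Z = multi_head_self_attention(X_dec, self_heads)  # (L*d, t), from the decoder sequence alone
    C = multi_head_attention(X_enc, Z, cross_heads)   # (L*d, t); queries from Z, keys/values from X_enc
    return np.maximum(0.0, W_U @ C)                   # dense layer, output (512, t)

# illustrative usage, reusing X512 (512 x m) from the encoder example
t = 7
X_dec = rng.normal(size=(512, t))
self_heads  = [tuple(rng.normal(size=(64, 512)) for _ in range(3)) for _ in range(8)]
cross_heads = [tuple(rng.normal(size=(64, 512)) for _ in range(3)) for _ in range(8)]
W_U = rng.normal(size=(512, 512))
print(decoder_block(X512, X_dec, self_heads, cross_heads, W_U).shape)  # (512, 7)
```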

4. Transformer


5. Summary

5.1 From Single-Head to Multi-Head

  • Single-head self-attentions can be combined to form a multi-head self-attention.
  • Single-head attentions can be combined to form a multi-head attention.


5.2 Encoder Network of Transformer

  • 1 encoder block ≈ multi-head self-attention + dense.
  • Input shape: 512×m.
  • Output shape: 512×m.
  • The encoder network is a stack of 6 such blocks; the blocks do not share parameters.

5.3 Decoder Network of Transformer

  • 1 decoder block ≈ multi-head self-attention + multi-head attention + dense.
  • Input shapes: (512×m, 512×t).
  • Output shape: 512×t.
  • The decoder network is a stack of 6 such blocks; the blocks do not share parameters.

5.4 Transformer Model

  • Transformer is a Seq2Seq model with an encoder and a decoder; it can be used for machine translation.
  • The Transformer is not an RNN; it has no recurrent structure.
  • Transformer is based entirely on attention, self-attention, and dense layers, and its input and output sizes are the same as those of an RNN-based Seq2Seq model.
  • Transformer works very well on NLP tasks: it outperforms the best RNN + attention models and has become the industry standard.