This paper was published in 2015 and is about applications of attention.
From today's perspective it may not be all that valuable anymore, but since I had never read it, it was still worth going through once.
Introduction
In parallel, the concept of “attention” has gained popularity recently in training neural networks, allowing models to learn alignments between different modalities, e.g., between image objects and agent actions in the dynamic control problem (Mnih et al., 2014), between speech frames and text in the speech recognition task, or between visual features of a picture and its text description in the image caption generation task (Xu et al., 2015). In the context of NMT, Bahdanau et al. (2015) has successfully applied such attentional mechanism to jointly translate and align words.
As you can see, for the translation task the attention mechanism was really borrowed from elsewhere: it was first used in multimodal settings such as vision and speech.
In this work, we design, with simplicity and effectiveness in mind, two novel types of attention-based models: a global approach in which all source words are attended and a local one whereby only a subset of source words are considered at a time.
The paper proposes two ways of applying attention, with simplicity and computational efficiency as the core goals:
- Global attention: every source word is attended to; similar to the original attention model, but simpler.
- Local attention: only a subset of source words is attended to at a time; this one is more novel, and the authors see it as a blend of ***soft and hard attention***.
Attention-based Models
While these models differ in how the context vector ct is derived, they share the same subsequent steps.
The key difference between the attention variants is how the context vector ct is generated; all subsequent steps are the same.
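Concretely, once ct is available, it is combined with the decoder state into an attentional hidden state, from which the next word is predicted. Reproducing the paper's equations from memory:

$$\tilde{h}_t = \tanh\!\left(W_c\,[c_t;\,h_t]\right), \qquad p(y_t \mid y_{<t}, x) = \operatorname{softmax}\!\left(W_s\,\tilde{h}_t\right)$$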
Global Attention
In the paper's notation, h̄ denotes the encoder hidden states and h denotes the decoder hidden states.
Concretely, the decoder hidden state at time t is scored against every encoder hidden state, and the scores are normalized into alignment weights.
There are three scoring functions (dot, general, concat); the third is essentially the one used in earlier work.
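For reference, the three scores and the resulting weights, as I recall them from the paper:

$$\operatorname{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^{\top}\bar{h}_s & \text{(dot)}\\[2pt]
h_t^{\top} W_a \bar{h}_s & \text{(general)}\\[2pt]
v_a^{\top}\tanh\!\left(W_a[h_t;\bar{h}_s]\right) & \text{(concat)}
\end{cases}$$

$$a_t(s) = \frac{\exp\!\left(\operatorname{score}(h_t,\bar{h}_s)\right)}{\sum_{s'}\exp\!\left(\operatorname{score}(h_t,\bar{h}_{s'})\right)}, \qquad c_t = \sum_{s} a_t(s)\,\bar{h}_s$$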
The difference from earlier work: previous models used the hidden state at time t−1 together with attention to produce the hidden state at time t, whereas this paper uses the hidden state at time t together with attention to make the prediction at time t. (My phrasing here is a bit convoluted, but a careful read of the original confirms it; the two papers apply attention at different stages.) The advantage is that the computation is simpler.
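To make the global variant concrete, here is a minimal NumPy sketch of one step of global attention using the dot score; the function name and shape conventions are mine, not the paper's:

```python
import numpy as np

def global_attention_step(h_t, enc_states, W_c):
    """One global-attention step with the 'dot' score.

    h_t        : decoder hidden state at time t, shape (d,)
    enc_states : all encoder hidden states h_bar, shape (S, d)
    W_c        : projection for the attentional state, shape (d, 2d)
    Returns the attentional hidden state h_tilde_t and the weights a_t.
    """
    scores = enc_states @ h_t                   # score(h_t, h_bar_s) = h_t . h_bar_s
    a_t = np.exp(scores - scores.max())
    a_t /= a_t.sum()                            # softmax over all source positions
    c_t = a_t @ enc_states                      # context: weighted average of encoder states
    h_tilde_t = np.tanh(W_c @ np.concatenate([c_t, h_t]))  # tanh(W_c [c_t; h_t])
    return h_tilde_t, a_t
```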
Local Attention
Our local attention mechanism selectively focuses on a small window of context and is differentiable. This approach has an advantage of avoiding the expensive computation incurred in the soft attention and at the same time, is easier to train than the hard attention approach. In concrete details, the model first generates an aligned position pt for each target word at time t. The context vector ct is then derived as a weighted average over the set of source hidden states within the window [pt−D, pt+D]; D is empirically selected. Unlike the global approach, the local alignment vector at is now fixed-dimensional, i.e., ∈ R^(2D+1).
First an aligned position pt is computed for time step t; then the translation at time t is assumed to depend only on the source word at position pt and the D words on either side of it (an assumption that is clearly not always reasonable, so one can expect the results to be hit or miss).
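In formulas, the local context vector is the same weighted average, just restricted to the window:

$$c_t = \sum_{s = p_t - D}^{p_t + D} a_t(s)\,\bar{h}_s$$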
Two ways of computing the aligned position pt:
- Naive (monotonic) assumption: the t-th target word aligns with the t-th source word (this assumption is clearly even less reasonable).
- Predictive alignment: build a small model to predict the aligned position (method below; this also feels a bit shaky to me).
The prediction model's only input is the decoder hidden state, plus two learnable parameters.
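As I recall, the paper's formula for the predicted position is

$$p_t = S \cdot \operatorname{sigmoid}\!\left(v_p^{\top}\tanh(W_p\,h_t)\right)$$

where S is the source sentence length and Wp, vp are the two learnable parameters.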
In the final scoring step a Gaussian factor is added, so that words closer to the aligned position pt receive higher weights (this also feels rather shaky to me).
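Concretely, the weights are scaled by a Gaussian centered at pt, with the standard deviation set empirically (to D/2, if I recall correctly):

$$a_t(s) = \operatorname{align}(h_t,\bar{h}_s)\,\exp\!\left(-\frac{(s-p_t)^2}{2\sigma^2}\right), \qquad \sigma = \frac{D}{2}$$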
Alignment Coverage: the Input-feeding Approach
In our proposed global and local approaches, the attentional decisions are made independently, which is suboptimal. Whereas, in standard MT, a coverage set is often maintained during the translation process to keep track of which source words have been translated. Likewise, in attentional NMTs, alignment decisions should be made jointly taking into account past alignment information. To address that, we propose an input-feeding approach in which attentional vectors h̃t are concatenated with inputs at the next time steps as illustrated in Figure 4. The effects of having such connections are two-fold: (a) we hope to make the model fully aware of previous alignment choices and (b) we create a very deep network spanning both horizontally and vertically.
- The author notes that traditional machine translation maintains a ***coverage set*** that tells the model which source words have already been translated (as a latecomer to the field, I had certainly never heard of this before).
- The hope is that when making attention decisions, the attention model also knows which words have already been aligned. So a dedicated input scheme is proposed: when the decoder computes the next hidden state, the attentional vector h̃ from the previous time step is concatenated with the regular input and fed in together (see the sketch below).
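A minimal sketch of what input feeding looks like, using a plain tanh RNN cell as a stand-in for the paper's stacked LSTM; the parameter names are hypothetical:

```python
import numpy as np

def decoder_step_input_feeding(x_t, h_prev, h_tilde_prev, W_x, W_h, b):
    """One decoder step with input feeding.

    x_t          : current target-word embedding, shape (e,)
    h_prev       : previous decoder hidden state, shape (d,)
    h_tilde_prev : previous attentional vector h_tilde_{t-1}, shape (d,)
    W_x, W_h, b  : hypothetical RNN parameters, shapes (d, e+d), (d, d), (d,)
    """
    rnn_input = np.concatenate([x_t, h_tilde_prev])    # input feeding: concat with h_tilde_{t-1}
    h_t = np.tanh(W_x @ rnn_input + W_h @ h_prev + b)  # simplified recurrent update (tanh RNN)
    return h_t
```

This way the decoder is always aware of where it attended at the previous step, which is the paper's stand-in for an explicit coverage set.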
Experiments
From today's vantage point, the experimental results in this paper no longer mean much, so I'll skip this part.
Analysis
Likewise, not much to say here.
Conclusion
In this paper, we propose two simple and effective attentional mechanisms for neural machine translation: the global approach which always looks at all source positions and the local one that only attends to a subset of source positions at a time.
In my view, the main reason this paper is still worth anything today is actually that it streamlined the computation of (global) attention; the so-called local attention and the special input-feeding scheme may not be all that great.