1. Machine Translation
Encoder-Decoder
(can also be applied to dialogue systems and other generative tasks)
encoder: maps the input to a hidden state
decoder: maps the hidden state to the output
import torch
from torch import nn

class Encoder(nn.Module):
    """Base encoder interface: maps the input sequence to an encoded state."""
    def __init__(self, **kwargs):
        super(Encoder, self).__init__(**kwargs)

    def forward(self, X, *args):
        raise NotImplementedError

class Decoder(nn.Module):
    """Base decoder interface: turns the encoded state into the output sequence."""
    def __init__(self, **kwargs):
        super(Decoder, self).__init__(**kwargs)

    def init_state(self, enc_outputs, *args):
        raise NotImplementedError

    def forward(self, X, state):
        raise NotImplementedError

class EncoderDecoder(nn.Module):
    """Chain an encoder and a decoder: encode, initialize the decoder state, decode."""
    def __init__(self, encoder, decoder, **kwargs):
        super(EncoderDecoder, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)
Sequence to Sequence model
Model:
Training
Prediction
Detailed structure:
Encoder
class Seq2SeqEncoder(d2l.Encoder):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqEncoder, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.LSTM(embed_size, num_hiddens, num_layers, dropout=dropout)

    def begin_state(self, batch_size, device):
        return [torch.zeros(size=(self.num_layers, batch_size, self.num_hiddens), device=device),
                torch.zeros(size=(self.num_layers, batch_size, self.num_hiddens), device=device)]

    def forward(self, X, *args):
        X = self.embedding(X)  # X shape: (batch_size, seq_len, embed_size)
        X = X.transpose(0, 1)  # RNN needs the first axis to be time
        # state = self.begin_state(X.shape[1], device=X.device)
        out, state = self.rnn(X)
        # The shape of out is (seq_len, batch_size, num_hiddens).
        # state contains the hidden state and the memory cell of the last
        # time step, each with shape (num_layers, batch_size, num_hiddens).
        return out, state

encoder = Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16, num_layers=2)
X = torch.zeros((4, 7), dtype=torch.long)
output, state = encoder(X)
output.shape, len(state), state[0].shape, state[1].shape
----------------------------
(torch.Size([7, 4, 16]), 2, torch.Size([2, 4, 16]), torch.Size([2, 4, 16]))
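The decoder half of the seq2seq model is not shown above. A minimal sketch of what it could look like, assuming the d2l package also exposes a Decoder base class and following the same conventions as Seq2SeqEncoder (the class below is illustrative, not the textbook's exact code):

class Seq2SeqDecoder(d2l.Decoder):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqDecoder, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.LSTM(embed_size, num_hiddens, num_layers, dropout=dropout)
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def init_state(self, enc_outputs, *args):
        # Reuse the encoder's final (hidden state, memory cell) pair
        # as the decoder's initial state.
        return enc_outputs[1]

    def forward(self, X, state):
        X = self.embedding(X).transpose(0, 1)   # (seq_len, batch_size, embed_size)
        out, state = self.rnn(X, state)
        # Project back to vocabulary size and restore the batch-first layout.
        out = self.dense(out).transpose(0, 1)   # (batch_size, seq_len, vocab_size)
        return out, state

Its init_state simply passes the encoder's final state through, which is exactly what EncoderDecoder.forward hands to the decoder.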
Loss function
import torch

X_len = torch.tensor([1, 2])
print(torch.arange(3)[None, :].to(X_len.device))  # tensor([[0, 1, 2]])
print(torch.tensor([1, 2])[:, None])              # tensor([[1], [2]])
# Broadcasting expands both sides to shape (2, 3) before the element-wise comparison.
a = torch.arange(3)[None, :].to(X_len.device) < torch.tensor([1, 2])[:, None]
print(a)
import torch

def SequenceMask(X, X_len, value=0):
    maxlen = X.size(1)
    # Broadcast a row of positions against the column of valid lengths.
    mask = torch.arange(maxlen)[None, :].to(X_len.device) < X_len[:, None]
    print(torch.arange(maxlen)[None, :].to(X_len.device), X_len[:, None])
    X[~mask] = value  # overwrite the positions beyond each sequence's valid length
    return X

X = torch.tensor([[1, 2, 3], [4, 5, 6]])
SequenceMask(X, torch.tensor([1, 2]))
----------------------------------
tensor([[0, 1, 2]]) tensor([[1],
        [2]])   ### broadcasting expands both tensors before comparing them
tensor([[1, 0, 0],
        [4, 5, 0]])
class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    # pred shape: (batch_size, seq_len, vocab_size)
    # label shape: (batch_size, seq_len)
    # valid_length shape: (batch_size, )
    def forward(self, pred, label, valid_length):
        # the sample weights shape should be (batch_size, seq_len)
        weights = torch.ones_like(label)
        weights = SequenceMask(weights, valid_length).float()
        self.reduction = 'none'
        output = super(MaskedSoftmaxCELoss, self).forward(pred.transpose(1, 2), label)
        return (output * weights).mean(dim=1)

loss = MaskedSoftmaxCELoss()
loss(torch.ones((3, 4, 10)), torch.ones((3, 4), dtype=torch.long), torch.tensor([4, 3, 0]))
----------------------------------
tensor([2.3026, 1.7269, 0.0000])
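Each returned entry is the masked sum of per-token losses divided by seq_len: with uniform predictions over 10 classes the per-token loss is log 10 ≈ 2.30, so the second example (3 valid tokens out of 4) gives about 1.73, and the third, with valid_length 0, is masked down to 0.

With the model and the masked loss in place, a training step follows the usual teacher-forcing pattern. The sketch below is only an assumption about how the pieces fit together (in particular, the <bos>/<eos> handling around the decoder input is simplified), not the textbook's exact training loop:

def train_step(model, optimizer, loss, X, Y, Y_vlen):
    # X, Y: (batch_size, seq_len) token indices; Y_vlen: (batch_size,) valid lengths of Y
    optimizer.zero_grad()
    Y_hat, _ = model(X, Y)              # teacher forcing: feed the gold target prefix
    l = loss(Y_hat, Y, Y_vlen).sum()    # sum the masked per-example losses
    l.backward()
    optimizer.step()
    return l.item()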
Beam Search
Simple greedy search (see the sketch below):
Viterbi algorithm: choose the sentence with the highest overall score (the search space is far too large)
Beam search:
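As a concrete illustration of the simplest strategy, greedy search picks the single highest-probability token at every step. The sketch below is an assumption built on the EncoderDecoder interface above; bos_id, eos_id and max_len are hypothetical parameters:

def greedy_decode(model, src, bos_id, eos_id, max_len=20):
    # src: (1, src_len) token indices for a single source sentence
    enc_outputs = model.encoder(src)
    state = model.decoder.init_state(enc_outputs)
    dec_input = torch.tensor([[bos_id]])
    output = []
    for _ in range(max_len):
        Y, state = model.decoder(dec_input, state)
        next_token = Y.argmax(dim=-1)      # greedy: keep only the best token
        if next_token.item() == eos_id:
            break
        output.append(next_token.item())
        dec_input = next_token             # feed the prediction back in
    return output

Beam search generalizes this by keeping the top-k partial candidates at each step instead of only the single best one, exploring more of the search space than greedy search without the full cost of an exhaustive (Viterbi-style) search.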
2. Attention Mechanism
In the "Encoder-Decoder (seq2seq)" section, the decoder relies on the same context vector at every time step to obtain information about the input sequence. When the encoder is a recurrent neural network, the context vector comes from the hidden state of its final time step: the source sequence is encoded into the state of the recurrent unit and then passed to the decoder to generate the target sequence. This structure is problematic. In particular, RNNs suffer from long-range vanishing gradients in practice, so for longer sentences we can hardly hope to compress the entire input sequence into a fixed-length vector that preserves all of the useful information. As the length of the sentence to be translated grows, the performance of this structure degrades noticeably.
At the same time, a target word being decoded may be related to only part of the input words rather than to all of them. For example, when translating "Hello world" into "Bonjour le monde", "Hello" maps to "Bonjour" and "world" maps to "monde". In the seq2seq model the decoder can only select the relevant information implicitly from the encoder's final state, whereas the attention mechanism models this selection process explicitly.
Attention mechanism framework
Attention is a generalized weighted-pooling method whose input consists of two parts: a query and key-value pairs, with $\mathbf{k}_i \in \mathbb{R}^{d_k}$ and $\mathbf{v}_i \in \mathbb{R}^{d_v}$. Given a query $\mathbf{q} \in \mathbb{R}^{d_q}$, the attention layer produces an output $\mathbf{o} \in \mathbb{R}^{d_v}$ with the same dimension as the values. For a given query, the attention layer computes an attention score against every key and normalizes these scores into weights; the output $\mathbf{o}$ is the weighted sum of the values, where the weight computed from each key corresponds to that key's value.
To compute the output, we first assume there is a score function $\alpha$ that measures the similarity between the query and a key. We can then compute all of the attention scores $a_1, \ldots, a_n$ by
$$a_i = \alpha(\mathbf{q}, \mathbf{k}_i).$$
We use the softmax function to obtain the attention weights:
$$b_1, \ldots, b_n = \mathrm{softmax}(a_1, \ldots, a_n).$$
The final output is the weighted sum of the values:
$$\mathbf{o} = \sum_{i=1}^{n} b_i \mathbf{v}_i.$$
Different attention layers differ only in the choice of score function. In the rest of this section we discuss two commonly used attention layers, dot-product attention and multilayer perceptron (MLP) attention; we then implement a seq2seq model with attention and train and test it on an English-French translation corpus.
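As a quick sanity check of the three formulas above, the following sketch computes the attention output for a single query using a dot product as the score function $\alpha$; the numbers are made up purely for illustration:

import torch

q = torch.tensor([1.0, 0.0])                    # query, d_q = 2
K = torch.tensor([[1.0, 0.0], [0.0, 1.0]])      # two keys, d_k = 2
V = torch.tensor([[10.0, 0.0], [0.0, 10.0]])    # matching values, d_v = 2

a = K @ q                      # scores a_i = alpha(q, k_i) = q . k_i
b = torch.softmax(a, dim=0)    # attention weights b_i
o = b @ V                      # output o = sum_i b_i v_i
print(b, o)  # weights ~ [0.73, 0.27], output ~ [7.3, 2.7]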
Softmax masking
Before diving into the implementation, we first introduce a masking operation for the softmax operator.
def SequenceMask(X, X_len, value=-1e6):
    maxlen = X.size(1)
    #print(X.size(), torch.arange((maxlen), dtype=torch.float)[None, :], '\n', X_len[:, None])
    mask = torch.arange((maxlen), dtype=torch.float)[None, :] >= X_len[:, None]
    #print(mask)
    X[mask] = value
    return X
import torch
from torch import nn

def masked_softmax(X, valid_length):
    # X: 3-D tensor, valid_length: 1-D or 2-D tensor
    softmax = nn.Softmax(dim=-1)
    if valid_length is None:
        return softmax(X)
    else:
        shape = X.shape
        if valid_length.dim() == 1:
            try:
                valid_length = torch.FloatTensor(valid_length.numpy().repeat(shape[1], axis=0))  # e.g. [2, 2, 3, 3]
            except:
                valid_length = torch.FloatTensor(valid_length.cpu().numpy().repeat(shape[1], axis=0))  # e.g. [2, 2, 3, 3]
        else:
            valid_length = valid_length.reshape((-1,))
        # fill masked elements with a large negative, whose exp is 0
        X = SequenceMask(X.reshape((-1, shape[-1])), valid_length)
        print(X)
        return softmax(X).reshape(shape)
a = torch.rand((2, 2, 4), dtype=torch.float)
shape = a.shape  # define shape before it is used below
print(a)
print(a.reshape((-1, shape[-1])))
b = torch.FloatTensor([2, 3]).numpy().repeat(shape[1], axis=0)
print(b)
masked_softmax(a, torch.FloatTensor([2, 3]))
---------------------------------------------------
tensor([[[0.5287, 0.8391, 0.4102, 0.8059],
         [0.9937, 0.6773, 0.4107, 0.5688]],
        [[0.8143, 0.8013, 0.3830, 0.4471],
         [0.5256, 0.7490, 0.0073, 0.3204]]])
tensor([[0.5287, 0.8391, 0.4102, 0.8059],
        [0.9937, 0.6773, 0.4107, 0.5688],
        [0.8143, 0.8013, 0.3830, 0.4471],
        [0.5256, 0.7490, 0.0073, 0.3204]])
[2. 2. 3. 3.]
tensor([[ 5.2868e-01,  8.3909e-01, -1.0000e+06, -1.0000e+06],
        [ 9.9370e-01,  6.7730e-01, -1.0000e+06, -1.0000e+06],
        [ 8.1430e-01,  8.0128e-01,  3.8302e-01, -1.0000e+06],
        [ 5.2560e-01,  7.4897e-01,  7.3193e-03, -1.0000e+06]])
tensor([[[0.4230, 0.5770, 0.0000, 0.0000],
         [0.5784, 0.4216, 0.0000, 0.0000]],
        [[0.3793, 0.3744, 0.2464, 0.0000],
         [0.3514, 0.4393, 0.2093, 0.0000]]])
Dot-product attention
The dot product assumes that the query and the keys have the same dimension, i.e. $\forall i,\; \mathbf{q}, \mathbf{k}_i \in \mathbb{R}^{d}$. The attention score is computed as the product of the query with the transpose of the key, and it is usually divided by $\sqrt{d}$ to reduce the dependence of the computed score on the dimension $d$.
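A minimal scaled dot-product attention layer consistent with the masked_softmax above might look like the following sketch; it is an illustration of the formula just described, not necessarily the textbook's exact DotProductAttention class:

import math

class DotProductAttention(nn.Module):
    def __init__(self, dropout, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)

    # query: (batch_size, num_queries, d)
    # key:   (batch_size, num_kv, d)
    # value: (batch_size, num_kv, d_v)
    def forward(self, query, key, value, valid_length=None):
        d = query.shape[-1]
        # Scale by sqrt(d) so the score magnitude does not grow with the dimension.
        scores = torch.bmm(query, key.transpose(1, 2)) / math.sqrt(d)
        attention_weights = self.dropout(masked_softmax(scores, valid_length))
        return torch.bmm(attention_weights, value)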