Attention Is All You Need




“This is what attention does, it extracts information from the whole sequence, a weighted sum of all the past encoder states”


Self-attention is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors comes out. Let’s call the input vectors x1, x2, …, xt and the corresponding output vectors y1, y2, …, yt. The vectors all have dimension k. To produce output vector yi, the self-attention operation simply takes a weighted average over all the input vectors: yi = Σj wij xj, where the weights wij sum to one over j. Each weight wij is not a learned parameter but is derived from a function of xi and xj; the simplest option for that function is the dot product.
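The weighted average described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (softmax, dot, basic_self_attention) are made up for this example, and the softmax step is an assumption, one common way to turn raw dot products into positive weights that sum to one.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to one (assumed normalization)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two vectors of equal length."""
    return sum(p * q for p, q in zip(a, b))

def basic_self_attention(xs):
    """xs: list of t input vectors of dimension k. Returns t output vectors of dimension k."""
    ys = []
    for x_i in xs:
        # Raw weight for every pair (i, j): the dot product of x_i and x_j.
        raw = [dot(x_i, x_j) for x_j in xs]
        w = softmax(raw)
        # y_i is the weighted average of all input vectors.
        y_i = [sum(w_ij * x_j[d] for w_ij, x_j in zip(w, xs))
               for d in range(len(x_i))]
        ys.append(y_i)
    return ys

# Toy input: t = 3 vectors of dimension k = 2.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = basic_self_attention(xs)
```

Note that nothing here is learned: the outputs depend only on the inputs themselves, and each yi has the same dimension k as the inputs.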


Q, K, V:

Every input vector xi is used in three different ways in the self-attention mechanism: as the Query, the Key and the Value. As the Query, it is compared to every other vector to establish the weights for its own output yi. As the Key, it is compared to every other vector to establish the weights for the j-th output yj. As the Value, it is used in the weighted sum that computes each output vector once the weights have been established.
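The three roles can be made concrete with a small sketch. It assumes, as in the standard formulation, that each role gets its own linear transformation of the input; the matrices Wq, Wk and Wv and the function names are hypothetical, and the division by sqrt(k) follows the usual scaled dot-product form rather than anything stated above.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def qkv_self_attention(xs, Wq, Wk, Wv):
    """Self-attention where each input plays three roles via Wq, Wk, Wv (assumed k x k)."""
    k = len(xs[0])
    queries = [matvec(Wq, x) for x in xs]  # role 1: sets the weights for its own output
    keys = [matvec(Wk, x) for x in xs]     # role 2: sets the weights for other outputs
    values = [matvec(Wv, x) for x in xs]   # role 3: what actually gets averaged
    ys = []
    for q_i in queries:
        # Compare query i with every key; scale by sqrt(k) (assumed, standard practice).
        raw = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(k) for k_j in keys]
        w = softmax(raw)
        ys.append([sum(w_ij * v_j[d] for w_ij, v_j in zip(w, values))
                   for d in range(len(values[0]))])
    return ys

# With identity matrices the three roles all see the raw input vector.
identity = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = qkv_self_attention(xs, identity, identity, identity)
```

In a trained model the three matrices would be learned, which lets the same input vector behave differently in each role.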


