2021-09-13

2024-02-06 11:57:28

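Put simply, length normalization accounts for the effect of document length when TF-IDF counts how many times a term matches in a document. For example, given a query and two documents, one long and one short, a really long document has a better chance of matching the query simply because it contains more words. Length normalization uses a normalizing function of document length to penalize long documents. There is more to it, such as not over-penalizing and the pivoted length normalizer (ROSS: pivot! pivot! pivot!); [length normalization](https://yxkemiya.github.io/2016/06/07/coursera-TextRetrievalAndSearchEngines-week2-2/) explains all of this very clearly.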
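
To make the pivot idea concrete, here is a minimal sketch of the usual pivoted length normalizer, `1 - b + b * doc_len / avg_doc_len`. The function names, the default `b=0.5`, and dividing the raw term count by the normalizer are my own simplifications for illustration, not taken from the linked post.

```python
def pivoted_length_normalizer(doc_len, avg_doc_len, b=0.5):
    """Return the pivoted length normalizer for one document.

    The average document length is the pivot: a document of exactly
    average length gets 1.0, longer ones get > 1 (penalty), shorter
    ones get < 1 (reward). b in [0, 1] controls how strong the effect
    is; b = 0 switches normalization off.
    """
    return 1.0 - b + b * doc_len / avg_doc_len


def normalized_tf(tf, doc_len, avg_doc_len, b=0.5):
    """Simplified illustration: divide the raw count by the normalizer."""
    return tf / pivoted_length_normalizer(doc_len, avg_doc_len, b)


if __name__ == "__main__":
    avg = 100
    for doc_len in (50, 100, 200):
        print(doc_len, pivoted_length_normalizer(doc_len, avg))
```

Because `b` is capped at 1, the penalty grows at most linearly with length, which is how the "don't over-penalize" concern gets handled.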

The goal of this training is to obtain "linear transformation matrices Wx and Wz so the mapped embeddings XWx and ZWz are in the same cross-lingual space."

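1. Embedding normalization. The goal of this step is to get a way of expressing similarity. The dot product of any two embeddings is equivalent to their cosine similarity and directly related to their euclidean distance and can be taken as a measure of their similarity. To be precise, the dot product is not itself the Euclidean distance: once the embeddings are length-normalized, the dot product equals the cosine similarity, and the two are tied together by `‖x − z‖² = 2 − 2·(x·z)`, so a larger dot product means a smaller distance, i.e. higher similarity.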

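2. So what is actually hard about the mapping? It's that things don't line up. Take the two embeddings X and Z again: the i-th word in X and the i-th word in Z are probably not words with the same meaning. And even once you find corresponding words in X and Z, their dimensions don't correspond either (the j-th coordinate in one embedding has no particular relation to the j-th coordinate in the other). In the paper's words, neither the rows nor the columns are aligned. The fix is to use a [similarity matrix](https://zhuanlan.zhihu.com/p/80208967) and then sort each row.
"assuming that the embedding spaces are perfectly isometric, the similarity matrices Mx and Mz would be equivalent up to a permutation of their rows and columns." This explains how the same word gets matched across the two word embedding matrices: if the two spaces are isometric (distance-preserving), a word's similarities to all the other words come out the same in both embeddings, only in a different order, so after sorting each row the corresponding words end up with matching rows.
"In practice, the isometry requirement will not hold exactly." This answers the question I had earlier: even for a word with the same meaning, its word vectors in the two languages' embeddings are unlikely to be exactly the same.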

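The two steps above can be sketched on toy data. This is only an illustration under my own assumptions: the four 2-d vectors are made up, the "second language" Z is built by shuffling and rotating X so that the spaces are perfectly isometric by construction, and the helper names are hypothetical. It checks the dot product / distance relation from step 1, then shows that sorting the rows of the similarity matrices makes corresponding words line up.

```python
import math


def normalize(v):
    """Step 1: scale a vector to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sorted_similarity_rows(E):
    """Step 2: rows of the similarity matrix E E^T, each sorted ascending."""
    return [sorted(dot(u, v) for v in E) for u in E]


def rotate(v, theta):
    """Rotate a 2-d vector; a rotation preserves all dot products."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


# Toy "source" embeddings: 4 words, 2 dimensions (made-up numbers).
X = [normalize(v) for v in ([3.0, 0.0], [2.0, 0.6], [1.0, 1.5], [-1.2, 1.6])]

# Step 1 check: for unit vectors, squared distance = 2 - 2 * dot product.
assert abs(sq_dist(X[1], X[2]) - (2 - 2 * dot(X[1], X[2]))) < 1e-12

# Toy "target" embeddings: same words, rows shuffled and axes rotated,
# so neither rows nor columns are aligned with X.
perm = [2, 0, 3, 1]  # row j of Z is word perm[j] of X
Z = [rotate(X[i], 0.8) for i in perm]

SX = sorted_similarity_rows(X)
SZ = sorted_similarity_rows(Z)

# For each word in X, pick the row of Z whose sorted similarities are closest.
match = [
    min(range(len(SZ)), key=lambda j, i=i: sq_dist(SX[i], SZ[j]))
    for i in range(len(SX))
]

if __name__ == "__main__":
    print(match)
```

Here every word in X finds its shuffled counterpart in Z exactly, because the rotation preserves all pairwise similarities. With real embeddings the isometry only holds approximately, so the sorted rows would be close rather than identical and the nearest-row match would only be a noisy starting guess.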