Source: https://arxiv.org/abs/1908.10084
Abstract
STS = Semantic Textual Similarity
The BERT architecture is ill-suited for semantic similarity search and for unsupervised tasks such as clustering.
SBERT = Sentence-BERT
This reduces the effort for finding the most similar sentence pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.
Finding which of the over 40 million existing Quora questions is most similar to a new question could be modeled as a pair-wise comparison with BERT; however, answering a single query would then require over 50 hours.
1 Introduction
By using optimized index structures, finding the most similar Quora question can be reduced from 50 hours to a few milliseconds (Johnson et al., 2017).
With pair-wise BERT, answering a new Quora question would take over 50 hours; with SBERT embeddings plus optimized index structures this drops to a few milliseconds.
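A minimal sketch of why this speedup holds: the corpus is embedded once, and each query then needs a single encoder pass plus one cosine-similarity matrix, instead of one BERT forward pass per candidate pair. It uses the authors' sentence-transformers library; the model name and toy corpus are illustrative choices, not the exact setup from the paper.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")  # an SBERT model trained on NLI

corpus = [
    "How do I learn Python?",
    "What is the capital of France?",
    "Best way to start programming in Python?",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)  # computed once, reused for every query

query = "How can I get started with Python?"
query_embedding = model.encode(query, convert_to_tensor=True)

# One cosine-similarity matrix instead of len(corpus) pair-wise BERT passes.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = scores.argmax().item()
print(corpus[best], scores[best].item())
```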
3 Model
SBERT adds a pooling operation to the output of BERT / RoBERTa to derive a fixed-sized sentence embedding.
A pooling layer on top of BERT gives the sentence embedding a fixed size regardless of input length.
We experiment with three pooling strategies: Using the output of the CLS-token, computing the mean of all output vectors (MEAN strategy), and computing a max-over-time of the output vectors (MAX-strategy). The default configuration is MEAN.
Three pooling strategies; MEAN is the default. A sketch of all three follows.
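A minimal sketch of the three pooling strategies over BERT token outputs, assuming `token_embeddings` of shape (batch, seq_len, hidden) and an `attention_mask` of shape (batch, seq_len); the function and argument names are illustrative, not from the SBERT code release.

```python
import torch

def pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor,
         strategy: str = "mean") -> torch.Tensor:
    if strategy == "cls":
        # Output vector of the first ([CLS]) token.
        return token_embeddings[:, 0]
    mask = attention_mask.unsqueeze(-1).float()  # zero out padding positions
    if strategy == "mean":
        # MEAN strategy (the default): average of all output vectors.
        return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    if strategy == "max":
        # MAX strategy: max-over-time of the output vectors.
        masked = token_embeddings.masked_fill(mask == 0, float("-inf"))
        return masked.max(dim=1).values
    raise ValueError(f"unknown pooling strategy: {strategy}")
```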
3.1 Training Details
We fine-tune SBERT with a 3-way softmax classifier objective function for one epoch. We used a batch-size of 16, Adam optimizer with learning rate 2e−5, and a linear learning rate warm-up over 10% of the training data. Our default pooling strategy is MEAN.
Summary of the fine-tuning hyperparameters (see section 3 for the three pooling strategies); a sketch of the classification objective is below.
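A minimal sketch of the 3-way softmax classification objective on NLI, assuming u and v are the pooled sentence embeddings from the shared (siamese) BERT encoder. For brevity it omits the encoder itself and the linear warm-up schedule; in the full setup the encoder's parameters are fine-tuned jointly with the classifier.

```python
import torch
import torch.nn as nn

hidden = 768                                # BERT-base hidden size
classifier = nn.Linear(3 * hidden, 3)       # 3 NLI classes: entailment / neutral / contradiction
optimizer = torch.optim.Adam(classifier.parameters(), lr=2e-5)  # lr from the paper
loss_fn = nn.CrossEntropyLoss()

def training_step(u: torch.Tensor, v: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # u, v: pooled embeddings of shape (batch, hidden); batch size 16 in the paper.
    features = torch.cat([u, v, torch.abs(u - v)], dim=-1)  # the (u, v, |u-v|) concatenation
    loss = loss_fn(classifier(features), labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss
```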
4.1 Unsupervised STS
Comparison of SBERT against other models on STS tasks; SRoBERTa yields only a limited improvement over SBERT.
4.2 Supervised STS
4.3 Argument Facet Similarity
AFS = Argument Facet Similarity
STS data is usually descriptive, while AFS data are argumentative excerpts from dialogs. To be considered similar, arguments must not only make similar claims, but also provide a similar reasoning.
Judging similarity on AFS data is harder than on STS data.
6 Ablation Study
For the classification objective, feeding the concatenation (u, v, |u-v|) into the softmax classifier works best.
When trained with the classification objective function on NLI data, the pooling strategy has a rather minor impact. The impact of the concatenation mode is much larger.
When trained with the regression objective function, we observe that the pooling strategy has a large impact.
Pooling and concatenation thus matter differently depending on the training objective; a sketch of the compared concatenation modes follows.
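A minimal sketch of the concatenation modes compared in the ablation, assuming u and v are pooled sentence embeddings; the mode strings are illustrative labels, not identifiers from the code release.

```python
import torch

def concat_features(u: torch.Tensor, v: torch.Tensor, mode: str) -> torch.Tensor:
    if mode == "(u, v)":
        return torch.cat([u, v], dim=-1)
    if mode == "(|u-v|)":
        return torch.abs(u - v)
    if mode == "(u*v)":
        return u * v
    if mode == "(u, v, |u-v|)":              # best-performing mode in the ablation
        return torch.cat([u, v, torch.abs(u - v)], dim=-1)
    if mode == "(u, v, u*v, |u-v|)":
        return torch.cat([u, v, u * v, torch.abs(u - v)], dim=-1)
    raise ValueError(f"unknown concatenation mode: {mode}")
```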
7 Computational Efficiency
For improved computation of sentence embeddings, we implemented a smart batching strategy: Sentences with similar lengths are grouped together and are only padded to the longest element in a mini-batch. This drastically reduces computational overhead from padding tokens.
Cost-reduction strategy: group sentences of similar length into the same mini-batch and pad each batch only to its own longest sentence, not to a global maximum length.
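A minimal sketch of smart batching under an assumed implementation: sort tokenized sentences by length, cut the sorted order into mini-batches, and pad within each batch only. The helper name and the pad id 0 are assumptions for illustration.

```python
from typing import Iterator, List

def smart_batches(sentences: List[List[int]], batch_size: int = 16) -> Iterator[List[List[int]]]:
    # sentences: already-tokenized inputs (lists of token ids).
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    for start in range(0, len(order), batch_size):
        batch = [sentences[i] for i in order[start:start + batch_size]]
        max_len = max(len(s) for s in batch)                     # pad only within this batch
        yield [s + [0] * (max_len - len(s)) for s in batch]      # 0 = assumed [PAD] id
```

Because sentences in a batch have similar lengths, almost no compute is spent on padding tokens, which is where the drastic overhead reduction comes from.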