For tasks at the intersection of vision and language, such pre-trained generic feature representations are lacking.
To better exploit the generic representation, we pre-train VL-BERT on both a large visual-linguistic corpus and text-only datasets. The pre-training loss on the visual-linguistic corpus is incurred by predicting randomly masked words or RoIs. Such pre-training sharpens the capability of VL-BERT in aggregating and aligning visual-linguistic clues. The loss on the text-only corpus is the standard MLM loss in BERT, which improves generalization on long and complex sentences.
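- It is worth noting that the pre-training corpus contains not only bimodal (visual-linguistic) data but also text-only data. The text-only data is there to improve the model's handling of long and complex sentences.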
The authors of ViLBERT claim that such a two-stream design is superior to a single-stream unified model.
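- Models such as LXMERT, which join two separate streams into one model, are called two-stream.
- Models such as the one in this paper are called single-stream unified.
- The authors of this paper argue that the single-stream unified design is more flexible: placing no restriction on the scope or manner of attention is the better choice.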
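One way to picture the difference between the two designs is through which pairs of input elements are allowed to attend to each other. The sketch below is a deliberate simplification, not code from VL-BERT, ViLBERT, or LXMERT: it assumes the text and visual elements are laid out in one sequence (text first), and the function names are hypothetical. A single-stream layer allows every pair, while a two-stream model is approximated here as layers that attend only within a modality plus layers that attend only across modalities.

```python
from typing import List

Mask = List[List[bool]]  # mask[i][j] is True when element i may attend to element j


def single_stream_mask(n_text: int, n_vis: int) -> Mask:
    """Unified model: every element may attend to every other element."""
    n = n_text + n_vis
    return [[True] * n for _ in range(n)]


def intra_modality_mask(n_text: int, n_vis: int) -> Mask:
    """Simplified two-stream self-attention: text sees text, RoIs see RoIs."""
    n = n_text + n_vis
    return [[(i < n_text) == (j < n_text) for j in range(n)] for i in range(n)]


def cross_modality_mask(n_text: int, n_vis: int) -> Mask:
    """Simplified two-stream cross-attention: text sees RoIs and vice versa."""
    n = n_text + n_vis
    return [[(i < n_text) != (j < n_text) for j in range(n)] for i in range(n)]


# Two words followed by two RoIs.
unified = single_stream_mask(2, 2)
intra = intra_modality_mask(2, 2)
cross = cross_modality_mask(2, 2)
```

In this picture the unified mask is the union of the other two, available in every layer, which is the "no restriction on attention" property the notes refer to.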
There are three noticeable differences between VL-BERT and other concurrent works in pre-training.
- (1) We found the task of Sentence-Image Relationship Prediction used in all of the other concurrent works is of no help in pre-training visual-linguistic representations.
- (2) We pre-train VL-BERT on both visual-linguistic and text-only datasets.
- (3) In VL-BERT, the parameters of Fast R-CNN, deriving the visual features, are also updated.
- (4) To avoid visual clue leakage in the pre-training task of Masked RoI Classification with Linguistic Clues, the masking operation is conducted on the input raw pixels, rather than on the feature maps produced by layers of convolution.
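- The Sentence-Image Relationship Prediction pre-training task is dropped, on the grounds that it brings no practical benefit. (The other papers presumably ran ablation experiments on it, though.)
- The role of the text-only data is easy to understand.
- The parameters of the object detection network, Fast R-CNN, are also updated. (In many other papers this step is done as data preprocessing and does not take part in training.)
- When masking the image, the mask is not applied to the features; instead, the corresponding pixel region of the original image is set to zero. (This is the exact opposite of what LXMERT does.)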
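The pixel-level masking described in the last point can be sketched as follows. This is a minimal illustration, not the authors' implementation: the image is a plain nested list standing in for a real tensor, the box format `(x_left, y_top, x_right, y_bottom)` with exclusive right and bottom edges is an assumption, and `mask_roi_pixels` is a hypothetical helper name.

```python
from typing import List, Tuple

Image = List[List[float]]        # H x W grayscale image (stand-in for RGB input)
Box = Tuple[int, int, int, int]  # (x_left, y_top, x_right, y_bottom), assumed format


def mask_roi_pixels(image: Image, box: Box) -> Image:
    """Return a copy of `image` with every pixel inside `box` set to zero.

    Zeroing the raw pixels before the detector runs means the convolutional
    features of the masked RoI cannot carry information about its content.
    """
    x0, y0, x1, y1 = box
    masked = [row[:] for row in image]  # copy, so the input stays untouched
    for y in range(max(y0, 0), min(y1, len(masked))):
        for x in range(max(x0, 0), min(x1, len(masked[y]))):
            masked[y][x] = 0.0
    return masked


image = [[1.0] * 6 for _ in range(4)]           # 4 x 6 image of ones
masked = mask_roi_pixels(image, (1, 1, 4, 3))   # blanks a 3 x 2 region
```

Masking a feature map instead would leave the surrounding features free to encode the masked region through the receptive field of the convolutions, which is the leakage that point (4) is meant to avoid.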
It is worth noting that the input formats vary for different visual-linguistic tasks (e.g., <Caption, Image> for image captioning, and <Question, Answer, Image> for VQA and VCR).
For each input element, its embedding feature is the summation of four types of embedding, namely, token embedding, visual feature embedding, segment embedding, and sequence position embedding.
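- Four kinds of embedding: token (text), visual feature (image), segment, and position:
- Token Embedding: the text part is no different from BERT; every image element uses [IMG].
- Visual Feature Embedding: for visual elements this is the RoI feature; for text elements it is the feature of the whole image. Note that this feature is concatenated with a position (geometry) feature and then passed through a linear transformation to give the final Visual Feature Embedding. The position feature is obtained by applying sine and cosine transforms to a 4-d vector, following Relation Networks. One more key point: if the input is text-only data, the corresponding Visual Feature Embedding is a learnable embedding.
- Segment Embedding: three classes, A, B, and C. A and B mark text and C marks the image. A and B exist to tell two input text segments apart; usually A alone is enough.
- Position Embedding: the text part is the same as in BERT. The visual elements have no natural order, so they are all given the same position. (This does not seem entirely reasonable.)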
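The way the four embeddings combine can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's code: the box coordinates are assumed to be normalized to [0, 1], the sinusoid wavelength base of 1000 and the tiny dimensions are placeholders, and `geometry_embedding`, `linear`, and `sum_embeddings` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def geometry_embedding(box: Sequence[float], dim_per_coord: int = 4,
                       wavelength: float = 1000.0) -> List[float]:
    """Sine/cosine embedding of a 4-d box vector (assumed normalized)."""
    assert len(box) == 4 and dim_per_coord % 2 == 0
    n_freq = dim_per_coord // 2
    out: List[float] = []
    for coord in box:
        for k in range(n_freq):
            angle = coord / (wavelength ** (k / n_freq))
            out.extend([math.sin(angle), math.cos(angle)])
    return out


def linear(x: Sequence[float], weight: Sequence[Sequence[float]]) -> List[float]:
    """Bias-free linear map; each row of `weight` produces one output value."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def sum_embeddings(*parts: Sequence[float]) -> List[float]:
    """Element-wise sum of embedding vectors that share one dimension."""
    assert len({len(p) for p in parts}) == 1
    return [sum(values) for values in zip(*parts)]


hidden = 3
appearance = [0.5, -0.2]                                   # stand-in RoI feature
geometry = geometry_embedding([0.1, 0.2, 0.6, 0.9], dim_per_coord=2)
concat = appearance + geometry                             # concatenation
weight = [[0.1] * len(concat) for _ in range(hidden)]      # placeholder weights
visual = linear(concat, weight)                            # visual feature embedding

token = [1.0, 0.0, 0.0]      # stand-in for the [IMG] token embedding
segment = [0.0, 0.0, 1.0]    # stand-in for segment C
position = [0.0, 1.0, 0.0]   # shared by all visual elements
element = sum_embeddings(token, visual, segment, position)
```

In a real model the token, segment, and position vectors would come from learned lookup tables and the linear layer would be trained jointly with the rest of the network.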
Task #1: Masked Language Modeling with Visual Clues
Task #2: Masked RoI Classification with Linguistic Clues
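- MLM: only the text is masked, and the masked words are predicted (this is very widely used, so no further detail here).
- MRC: only the image is masked, and the masked RoIs are classified. The masking is applied to the original image, and the Fast R-CNN classification result is used as the ground truth.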
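How the inputs and targets for the two tasks might be prepared can be sketched as below. This is a simplified illustration, not the authors' pipeline: the 0.15 masking rate is BERT's customary value and is used here as an assumption, BERT's random-replacement and keep-unchanged variants are omitted, and `mask_words` and `choose_rois_to_mask` are hypothetical helper names. A target of `None` marks a position that contributes no loss.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "[MASK]"


def mask_words(tokens: List[str], prob: float, rng: random.Random
               ) -> Tuple[List[str], List[Optional[str]]]:
    """Task #1: replace some words with [MASK]; the target is the original word."""
    inputs: List[str] = []
    targets: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < prob:
            inputs.append(MASK_TOKEN)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets


def choose_rois_to_mask(detector_labels: List[str], prob: float, rng: random.Random
                        ) -> Tuple[List[bool], List[Optional[str]]]:
    """Task #2: pick RoIs to blank out; the target is the detector's class label."""
    masked: List[bool] = []
    targets: List[Optional[str]] = []
    for label in detector_labels:
        hit = rng.random() < prob
        masked.append(hit)
        targets.append(label if hit else None)
    return masked, targets


rng = random.Random(0)
words, word_targets = mask_words(["kitten", "drinking", "from", "bottle"], 0.15, rng)
roi_masked, roi_targets = choose_rois_to_mask(["cat", "bottle"], 0.15, rng)
```

The RoIs flagged in `roi_masked` are the ones whose pixels would be zeroed in the original image before feature extraction, and the labels in `roi_targets` are the classes the detector predicted for them on the unmasked image.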