[Paper Notes] VL-BERT: PRE-TRAINING OF GENERIC VISUAL-LINGUISTIC REPRESENTATIONS

For tasks at the intersection of vision and language, there lacks such pre-trained generic feature representations.


Motivation: this paper is very close in spirit to the "unified" line of work; it aims to train a generic representation model that can adapt to all kinds of downstream tasks.

Introduction

To better exploit the generic representation, we pre-train VL-BERT on both a large visual-linguistic corpus and text-only datasets. The pre-training loss on the visual-linguistic corpus is incurred via predicting randomly masked words or RoIs. Such pre-training sharpens the capability of VL-BERT in aggregating and aligning visual-linguistic clues. The loss on the text-only corpus is the standard MLM loss in BERT, improving generalization on long and complex sentences.


This paper is extremely similar to the original BERT, and there is plenty of related work along the same lines, so I did not take notes on much of the overlapping content.

  • Worth noting: the pre-training corpus contains not only dual-modality data but also text-only data. The text-only data is there to improve the model's handling of long, complex sentences.

Related Work

The authors of ViLBERT claim that such a two-stream design is superior to a single-stream unified model.


Two categories of model are distinguished here:

  • Models like LXMERT, which run two modality-specific encoders and then fuse them, are called two-stream.
  • Models like the one in this paper are called single-stream unified.
  • The authors argue that the single-stream unified design has more freedom: placing no restriction on the scope or pattern of attention is the better choice.

there are four noticeable differences between VL-BERT and other concurrent works in pre-training.

  • (1) We found the task of Sentence-Image Relationship Prediction used in all of the other concurrent works is of no help in pre-training visual-linguistic representations.
  • (2) We pre-train VL-BERT on both visual-linguistic
    and text-only datasets.
  • (3) In VL-BERT, the parameters of Fast R-CNN, deriving the visual features, are also updated.
  • (4) To avoid visual clue leakage in the pre-training task of Masked RoI Classification with Linguistic Clues, the masking operation is conducted on the input raw pixels, rather than the feature maps produced by layers of convolution.

Several choices here are the exact opposite of the papers I have read so far:

  • The Sentence-Image Relationship Prediction pre-training task is dropped, on the grounds that it brings no real benefit. (Other papers, however, did run ablations in its favor.)
  • The role of the text-only data is easy to understand.
  • The parameters of the detection network, Fast R-CNN, are also updated (in several other papers this stage is pure data pre-processing and takes no part in training).
  • When masking the image, it is not the features that are masked; the corresponding pixel region of the raw image is zeroed out instead (exactly the opposite of LXMERT). A sketch of this follows the list.
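A minimal sketch of the pixel-level masking described above; the function name and tensor shapes are my own assumptions, not from the paper. The point is that the zeroing happens on the raw image before the Fast R-CNN backbone runs, so the convolutional features of overlapping RoIs cannot leak the masked content:

```python
import torch

def mask_roi_pixels(image, boxes, mask_flags):
    """Zero out masked RoIs on the raw image (illustrative sketch).

    image:      (3, H, W) float tensor
    boxes:      (N, 4) tensor of (x1, y1, x2, y2) pixel coordinates
    mask_flags: (N,) bool tensor, True for the RoIs chosen to be masked
    """
    masked = image.clone()
    for (x1, y1, x2, y2), flag in zip(boxes.tolist(), mask_flags.tolist()):
        if flag:
            # Masking precedes the CNN, so any overlapping RoI's feature
            # map also sees zeros here -- no visual clue can leak.
            masked[:, int(y1):int(y2), int(x1):int(x2)] = 0.0
    return masked
```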

VL-BERT


It is worth noting that the input formats vary for different visual-linguistic tasks (e.g., <Caption, Image> for image captioning, and <Question, Answer, Image> for VQA and VCR).


Note that the input varies with the task, so the layout of the sequence also changes slightly. Thanks to the Transformer's insensitivity to how elements are arranged (order enters only through the embeddings), the layout has little impact; it is enough to set up the position embeddings and segment embeddings properly, as the toy sketch below shows.
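A minimal illustration of how different tasks map onto the same single-stream sequence. The special tokens follow the paper's figure; the helper function itself is my own and purely illustrative:

```python
def build_input(text_segments, num_rois):
    """Lay out a VL-BERT-style input: [CLS], one or more text segments
    separated by [SEP], then one [IMG] placeholder per RoI, then [END]."""
    tokens = ["[CLS]"]
    for seg in text_segments:
        tokens += seg + ["[SEP]"]
    tokens += ["[IMG]"] * num_rois + ["[END]"]
    return tokens

# <Question, Answer, Image> as used for VQA / VCR:
print(build_input([["what", "is", "it"], ["a", "cat"]], num_rois=3))
# <Caption, Image> as used for image captioning:
print(build_input([["a", "cat", "on", "a", "mat"]], num_rois=3))
```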

For each input element, its embedding feature is the summation of four types of embedding, namely, token embedding, visual feature embedding, segment embedding, and sequence position embedding.
(Figure: VL-BERT architecture with its four input embedding types.)


  • Four embeddings (token, visual feature, segment, position) are summed per element; a code sketch follows this list.
  • Token Embedding: the text part is identical to BERT; every visual element gets the special [IMG] token.
  • Visual Feature Embedding: for visual elements this is the RoI feature; for text elements it is the feature of the whole image. Notably, this appearance feature is first concatenated with a geometry feature, and only after a linear projection does it become the final Visual Feature Embedding. The geometry feature is the 4-d vector (x_LT/W, y_LT/H, x_RB/W, y_RB/H) expanded through sine and cosine functions, following Relation Networks. One more key point: for text-only input, the corresponding Visual Feature Embedding is a learnable embedding.
  • Segment Embedding: three types A, B, and C; A and B mark text and C marks the image. B exists to distinguish a second text segment when there are two; usually A alone suffices.
  • Position Embedding: the text part is the same as BERT; since the visual elements have no intrinsic order, they all share the same position embedding. (This does not seem entirely reasonable to me.)
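A sketch of the embedding sum under stated assumptions: the hidden size, vocabulary size, and the sine/cosine frequency schedule of the geometry embedding are placeholder choices of mine, not values from the paper.

```python
import torch
import torch.nn as nn

class VLBertEmbeddings(nn.Module):
    """Sum of token / visual-feature / segment / position embeddings
    (illustrative sketch; all dimensions are placeholder choices)."""

    def __init__(self, vocab_size=30522, hidden=768, visual_dim=2048,
                 max_pos=512, n_segments=3, geo_dim=128):
        super().__init__()
        self.geo_dim = geo_dim
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(n_segments, hidden)      # A / B / C
        self.position = nn.Embedding(max_pos, hidden)
        # appearance feature ++ sin/cos geometry feature -> linear projection
        self.visual_proj = nn.Linear(visual_dim + 4 * geo_dim, hidden)

    def geometry_embedding(self, boxes_norm):
        """boxes_norm: (N, 4) rows of (x_LT/W, y_LT/H, x_RB/W, y_RB/H).
        Each coordinate is expanded by sin/cos at several frequencies,
        in the spirit of Relation Networks."""
        freqs = torch.arange(self.geo_dim // 2, dtype=torch.float32)
        freqs = 1000.0 ** (2.0 * freqs / self.geo_dim)       # wavelengths
        angles = boxes_norm.unsqueeze(-1) / freqs            # (N, 4, geo_dim/2)
        emb = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return emb.flatten(1)                                # (N, 4*geo_dim)

    def forward(self, token_ids, segment_ids, pos_ids, visual_feats, boxes_norm):
        # Text elements pass the whole-image feature and the full-image box
        # (0, 0, 1, 1) as their visual input; visual elements pass their RoI.
        geo = self.geometry_embedding(boxes_norm)
        visual = self.visual_proj(torch.cat([visual_feats, geo], dim=-1))
        return (self.token(token_ids) + visual
                + self.segment(segment_ids) + self.position(pos_ids))
```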

Task #1: Masked Language Modeling with Visual Clues
Task #2: Masked RoI Classification with Linguistic Clues


Two pre-training tasks (a loss sketch follows this list):

  • MLM: only text tokens are masked and predicted (very widely used; no further notes).
  • MRC: only image regions are masked and then classified. The masking is applied to the raw image, and Fast R-CNN's classification result serves as the ground truth.
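A schematic of the two losses. It assumes the model has already produced per-position hidden states, and `mlm_head` / `mrc_head` are hypothetical linear classifiers over the vocabulary and the detector's class set; none of these names come from the paper.

```python
import torch.nn.functional as F

def pretraining_losses(hidden, mlm_labels, mrc_labels, mlm_head, mrc_head):
    """hidden:     (L, H) output states for one input sequence
    mlm_labels: (L,) vocabulary id of each masked word, -100 elsewhere
    mrc_labels: (L,) detector class of each masked RoI (the Fast R-CNN
                prediction serves as ground truth), -100 elsewhere
    """
    # Task #1: Masked Language Modeling with Visual Clues
    mlm_loss = F.cross_entropy(mlm_head(hidden), mlm_labels, ignore_index=-100)
    # Task #2: Masked RoI Classification with Linguistic Clues
    mrc_loss = F.cross_entropy(mrc_head(hidden), mrc_labels, ignore_index=-100)
    return mlm_loss + mrc_loss
```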

Experiments

VCR Task

(Figure: results on VCR.)

VQA Task

(Figure: results on VQA.)

Referring Expression Comprehension

(Figure: referring expression comprehension results.)

Ablation Study

(Figure: ablation results.)
