Title: Self-supervised BERT based on contrastive learning with mutual information maximization? (Contrastive Multi-View Representation Learning on Language)
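Abstract: In this paper, building on mutual information maximization methods such as Deep InfoMax and infoNet, we use contrastive learning to construct an unsupervised pretraining model based on a Siamese BERT. It is a self-supervised method for learning universal sentence embeddings on textual entailment and paraphrase tasks (a self-supervised method for learning universal sentence embeddings that transfer to a wide variety of natural language processing (NLP) tasks)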
(We demonstrate that our objective can be used to pretrain transformers to state-of-the-art performance on SentEval, a popular benchmark for evaluating universal sentence embeddings, outperforming existing supervised, semi-supervised and unsupervised methods.)
Related Work:
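1. Introduction to pretrained models (Pretraining of Transformers for Language Understanding)
In recent years, a large body of work has shown that pretrained models (PTMs) trained on large corpora can learn universal language representations that benefit downstream NLP tasks and avoid training a model from scratch. With growing computing power, the arrival of deep models (i.e., the Transformer), and better training techniques, PTMs have steadily progressed from shallow to deep.
Broadly, PTM development falls into two generations. First-generation PTMs aim to learn context-independent word embeddings, e.g. Skip-Gram and GloVe. The models themselves are not used for downstream tasks and are usually kept shallow for computational efficiency. Although these pretrained word embeddings capture word semantics, they are context-independent and cannot capture higher-level concepts in context. Second-generation PTMs focus on learning contextual word embeddings, e.g. CoVe, ELMo, OpenAI GPT, and BERT. The learned encoders are still used to represent words in downstream tasks. In addition, a variety of pretraining tasks have been proposed to learn PTMs for different purposes.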
Transformer(Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017))
- Contrastive Self-supervised Learning
- Context-Context Contrast
- Mutual Information Maximization
- Representation Learning NLP+SSL
Sentence-BERT (SBERT) (Reimers and Gurevych, 2019) achieves state-of-the-art performance on various sentence embedding tasks. SBERT is based on transformer models like BERT (Devlin et al., 2018) and applies mean pooling on the output.
Deep InfoMax (DIM; Hjelm et al., 2019) is a mutual information maximization based representation learning method for images. DIM shows that maximizing the mutual information between an image representation and local regions of the image improves the quality of the representation. The complete objective function that DIM maximizes consists of multiple terms. Here, we focus on a term in the objective that maximizes the mutual information between local features and global features. We describe the main idea of this objective for learning representations from a one-dimensional sequence, although it is originally proposed to learn from a two-dimensional object.
InfoNCE (Logeswaran & Lee, 2018; van den Oord et al., 2019) has been shown to work well in practice.
InfoNCE is defined as:
It is based on Noise Contrastive Estimation (NCE; Gutmann & Hyvarinen, 2012)
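For reference, one common way of writing the InfoNCE lower bound on mutual information is sketched below. The notation is an assumption of this sketch rather than something fixed by these notes: $f_\theta$ is a learned scoring function, $(a, b)$ is a positive pair drawn from the joint distribution, and $\tilde{B}$ is a candidate set containing the positive $b$ together with negatives drawn from a proposal distribution $q$.

```latex
I(A, B) \;\ge\; \mathbb{E}_{p(A,B)}\!\left[ f_\theta(a, b)
  - \mathbb{E}_{q(\tilde{B})}\!\left[ \log \sum_{\tilde{b} \in \tilde{B}} \exp f_\theta(a, \tilde{b}) \right] \right]
  + \log |\tilde{B}|
```

Maximizing the right-hand side amounts to a softmax classification that picks the positive $b$ out of the candidates in $\tilde{B}$.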
AMDIM (Bachman, P., Hjelm, R. D., & Buchwalter, W. (2019))
AMDIM [6] enhances DIM by randomly choosing another view of the image to produce the summary vector (in addition, it uses different views, though they differ from the views in CMC).
Contrastive Predictive Coding (CPC)
CPC maximizes the association between a segment of audio and its context audio. To improve data efficiency, it takes several negative context vectors at the same time. Later on, CPC has also been applied in image classification.
Deep InfoMax provides us with a new paradigm and boosts the development of self-supervised learning. The first influential follower is Contrastive Predictive Coding (CPC) [101] for speech recognition. (Text taken from Liu, X., Zhang, F., Hou, Z., Wang, Z., Mian, L., Zhang, J., & Tang, J. (2020).)
CERT (BERT on MoCo): MoCo uses a contrastive loss function called InfoNCE (with similarity measured by dot product, a form of a contrastive loss function, called InfoNCE).
InfoWord(Kong, L., d'Autume, C. D. M., Ling, W., Yu, L., Dai, Z., & Yogatama, D. (2019))
In language pre-training, InfoWord [76] proposes to maximize the mutual information between a global representation of a sentence and n-grams in it. The context is induced from the sentence with selected n-grams being masked, and the negative contexts are randomly picked out from the corpus.
Our analysis on Skip-Gram, BERT, and XLNet shows that their objective functions are different instances of InfoNCE.
(Alternative bounds include the Donsker-Varadhan representation (Donsker & Varadhan, 1983) and the Jensen-Shannon estimator (Nowozin et al., 2016), but we focus on InfoNCE here.)
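DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations (this paper contains the full SentEval citation)
1. Sample an anchor span from an unlabeled document according to a beta distribution, then sample positive spans from the same document according to a different beta distribution.
2. Pass the anchor span and the positive span through two encoders with the same architecture and shared weights to produce the corresponding token embeddings.
3. Apply a pooler to the token embeddings, i.e. average all token embeddings to produce a sentence embedding of the same dimension.
4. Compute the contrastive loss, which measures the distance between the two span representations and includes a temperature hyperparameter.
5. After computing the contrastive loss, add the MLM loss and backpropagate gradients to update the model parameters.
Dataset: the OpenWebText corpus, with 495,243 documents of length at least 2048.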
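A minimal sketch of steps 1 and 3 above (span sampling and mean pooling) may make the procedure more concrete. The function names `sample_span` and `mean_pool`, the length bounds 32 and 512, and the Beta parameters are illustrative assumptions, not values confirmed by these notes, and the encoder and loss steps are left out.

```python
import random

def sample_span(doc_len, l_min, l_max, alpha, beta, rng):
    """Return (start, end) of a token span whose length is Beta-distributed."""
    p = rng.betavariate(alpha, beta)           # p lies in [0, 1]
    length = int(p * (l_max - l_min) + l_min)  # scale to [l_min, l_max]
    start = rng.randint(0, doc_len - length)   # randint is inclusive on both ends
    return start, start + length

def mean_pool(token_embeddings):
    """Average a list of token vectors into one vector of the same dimension."""
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]

rng = random.Random(0)
doc_len = 2048
# Assumed settings: the anchor length skews long, the positive length skews short.
anchor = sample_span(doc_len, 32, 512, alpha=4, beta=2, rng=rng)
positive = sample_span(doc_len, 32, 512, alpha=2, beta=4, rng=rng)
print(anchor, positive)
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

Using two different Beta distributions lets the anchor and positive spans differ in typical length while still coming from the same document.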
- Local+global multi information
- Input output multi information
- Multi-view or cropping operations in image datasets should correspond to entailment relations in language for NLP
- Adversarial AutoEncoder Decoder
- Multi model?
Model description:
Model | Type | Self-supervision | Pretext Task
Deep InfoMAX, InfoWord | | MI Maximization |
MoCo | | Context-Context |
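MR (movie review): sentiment prediction for movie review snippets, binary classification
CR (product review): sentiment prediction for customer product reviews, binary classification
SUBJ (subjectivity status): subjectivity prediction for sentences from movie reviews and plot summaries, binary classification
SST (Stanford sentiment analysis): Stanford Sentiment Treebank, binary classification
TREC (question-type classification): fine-grained question-type classification from TREC, multi-class classification
MRPC: Microsoft Research Paraphrase Corpus from parallel news sources, paraphrase detection.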
Contrastive learning: The main idea behind contrastive learning is to divide an input into multiple (possibly overlapping) views and maximize the mutual information between encoded representations of these views, using views derived from other inputs as negative samples.
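Contrastive self-supervised learning
Contrastive methods. These methods do not require the model to reconstruct the original input; instead, they expect the model to discriminate between different inputs in feature space, as in the dollar example above.
- Build a distance metric in feature space;
- Through feature invariance, multiple prediction results can be obtained;
- Use a Siamese network;
- No pixel-level reconstruction is needed.
Because these methods do not need pixel-level reconstruction, optimization becomes easier. They are not without drawbacks, though: since the data has no labels, the main problem is how to construct positive and negative samples.
Contrastive methods have made strong progress: on classification they are close to supervised learning, and on some downstream detection and segmentation tasks they even surpass supervised learning as a pretraining method.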
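The characteristics of supervised learning are:
1. For each image, the machine predicts a category or a bounding box
3. Each sample provides very little information (e.g. with 1024 categories, only 10 bits of information)
In contrast, the characteristics of self-supervised learning are: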
Contrastive learning vs. pretext tasks. Various pretext tasks can be based on some form of contrastive loss functions. The instance discrimination method [61] is related to the exemplar-based task [17] and NCE [28]. The pretext task in contrastive predictive coding (CPC) [46] is a form of context auto-encoding [48], and in contrastive multiview coding (CMC) [56] it is related to colorization [64].
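Textual entailment must be distinguished from paraphrasing: the research scope of textual entailment should be kept separate from paraphrasing. A paraphrase usually denotes two text fragments that carry the same meaning. Strictly speaking, paraphrase can therefore be regarded as a semantic equivalence relation (Textual Equivalence), also called bi-directional textual entailment (Bi-directional Textual Entailment), whereas textual entailment is a one-directional inference relation.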
Augmented Multiscale DIM:
- local DIM
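The original paper trains a representation-learning function (i.e. an encoder) to maximize the mutual information between its input and output, and on that basis proposes the Deep InfoMax (DIM) model. The model has four main aspects:
(1) It optimizes not only the mutual information between the whole input and the output, but also the mutual information between local parts of the input and the output.
(2) It trains the discriminator with Noise Contrastive Estimation (NCE). In NLP tasks this also goes by another name, "negative sampling"; its use is introduced later.
(4) It introduces two new measures of representation quality, one based on MINE and the other a dependency measure studied by Brakel & Bengio, which the authors use to compare the representations of different unsupervised methods.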
The infomax principle (Linsker, 1988; Bell & Sejnowski, 1995)
From the Deep InfoMax paper:
Mutual-information estimation Methods based on mutual information have a long history in
unsupervised feature learning. The infomax principle (Linsker, 1988; Bell & Sejnowski, 1995),
as prescribed for neural networks, advocates maximizing MI between the input and output. This
is the basis of numerous ICA algorithms, which can be nonlinear (Hyvarinen & Pajunen, 1999;
Almeida, 2003) but are often hard to adapt for use with deep networks.
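ICML 2018: MINE: Mutual Information Neural Estimation
The authors argue that gradient descent over neural networks can be used to estimate the mutual information between high-dimensional continuous random variables. They propose the Mutual Information Neural Estimator (MINE), which is linearly scalable in dimensionality and in sample size, trainable through backpropagation, and strongly consistent.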
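The bound that MINE optimizes can be sketched as follows, assuming the Donsker-Varadhan representation of the KL divergence. The symbols are this sketch's own notation: $T_\theta$ is a neural network scoring a pair $(x, z)$, $P_{XZ}$ is the joint distribution, and $P_X \otimes P_Z$ is the product of the marginals.

```latex
I(X; Z) \;\ge\; \sup_{\theta} \;
  \mathbb{E}_{P_{XZ}}\!\left[ T_\theta(x, z) \right]
  - \log \mathbb{E}_{P_X \otimes P_Z}\!\left[ e^{T_\theta(x, z)} \right]
```

In practice the first expectation is estimated from paired samples and the second from samples in which $z$ is shuffled across the batch, so both terms are differentiable in $\theta$.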
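An intuitive take on NCE loss:
Intuitive explanation
The intuition behind NCE loss: turn a multi-class classification problem into a binary classification problem.
Binary classification is familiar ground: simply apply logistic regression to estimate the probability.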
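The binary-classification view above can be sketched in a few lines. This is a simplified illustration under stated assumptions: `score` stands for the model's unnormalized log-probability of a sample, `p_noise` for its probability under a known noise distribution, and `k` for the number of noise samples per data sample; the function name `nce_loss` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nce_loss(score_data, p_noise_data, noise_scores, noise_probs, k):
    """Logistic loss for telling one real sample apart from k noise samples."""
    def logit(score, p_noise):
        # Log-odds that a sample came from the data rather than the noise.
        return score - math.log(k * p_noise)

    # The real sample should be classified as "data" (label 1).
    loss = -math.log(sigmoid(logit(score_data, p_noise_data)))
    # Each noise sample should be classified as "noise" (label 0).
    for score, p_noise in zip(noise_scores, noise_probs):
        loss -= math.log(1.0 - sigmoid(logit(score, p_noise)))
    return loss

good = nce_loss(2.0, 0.1, [-3.0, -3.0], [0.1, 0.1], k=2)
bad = nce_loss(-3.0, 0.1, [2.0, 2.0], [0.1, 0.1], k=2)
print(good < bad)  # True
```

A model that scores real data high and noise low gets a small loss, without ever computing a softmax over all classes.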
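On Contrastive Predictive Coding
3. As the title shows, the focus of this method is representation learning
Tips: text we might use: Our data is high-dimensional with relatively few labels, and we do not want to waste the unlabeled part of the data. When labels are scarce, unsupervised learning can help us learn high-level information about the data itself, and this information can greatly help downstream tasks.
On mutual information: mutual information (Mutual Information) describes the relationship between two variables X and Y. It can be interpreted as the reduction in uncertainty about Y that results from introducing X; the larger I(X,Y), the more closely the two are related.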
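For discrete variables, this quantity is I(X;Y) = Σ p(x,y) log [p(x,y) / (p(x) p(y))]. A small worked sketch follows; the function name `mutual_information` and the toy joint tables are illustrative assumptions.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]           # marginal of X
    py = [sum(col) for col in zip(*joint)]     # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                        # 0 * log 0 is taken as 0
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells us nothing about Y
identical = [[0.5, 0.0], [0.0, 0.5]]        # X fully determines Y
print(mutual_information(independent))  # 0.0
print(mutual_information(identical))    # 1.0
```

Independent variables share zero bits, while two identical fair coin flips share exactly one bit.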
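Noise Contrastive Estimation (NCE): a method for reducing computational complexity in NLP tasks that simplifies the language-model estimation problem into a binary classification problem.
Negative Sampling (NEG): a simplified version of NCE that aims to speed up training and improve the quality of the resulting word vectors. It uses relatively simple random negative sampling; in this paper, one item in the dataset is chosen as the positive sample and all the others are negative samples.
DIM (Deep InfoMax) mainly uses the idea of maximizing mutual information: the local and global features of the same image should be highly correlated, while local features from another image should not be. An NCE loss is used to obtain the score.
Augmented Multi-scale DIM (AMDIM). This paper proposes defining the local and global mutual information losses across differently augmented versions of the data. In DIM the contrast between "Real" and "Fake" is made within a single view, whereas in AMDIM it is made across different augmented views, which makes better use of global information.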
Contrastive Self-Supervised Learning
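Current research focuses mainly on contrastive self-supervised learning. The breakthrough works include Deep InfoMax, MoCo, and SimCLR.
Contrastive learning originally aimed to learn the differences between target objects through Noise Contrastive Estimation (NCE). The difference between target objects is really their degree of similarity, which is a fairly subjective notion and depends on the task. What we usually call mining information amounts to adding indicators that measure similarity.
Deep InfoMax, building on NCE, takes a different route; its objective is: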
InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks.
We provide our pre-trained English sentence encoder from our paper and our SentEval evaluation toolkit.
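The mutual information between the two views gradually increases. While it is below a certain threshold, the features relevant to task y are not fully represented, so performance is suboptimal. Once it exceeds that threshold, not only is the task-relevant information represented, but nuisance factors are also shared between the two views, so there is no guarantee that what is learned from the two views is task-relevant. For example, suppose the task is to tell which images are cats and which are dogs, but the two representations separate not only the classes but also the colors. If both views tie cats to yellow, the model may conclude that only yellow animals are cats and the rest are not.
So clearly the model attains its best transfer performance only at the sweet point.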
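Some remarks on the self-supervised contrastive loss:
A recap of the self-supervised contrastive loss, summarized in my own words: first, it uses no label information; second, it builds the loss through comparison, learning an effective representation purely by contrasting unlabeled data points with one another. Comparison requires a notion of more and less, so we need to define a similarity between representations. For vector representations the most direct measure is Euclidean distance; alternatives include cosine distance or the inner product (this paper uses the inner product).
Showing the self-supervised contrastive loss directly may be more intuitive. Put simply, for a data sample x, data augmentation (the random crop used in MoCo v1 is one such augmentation) yields two samples, x_i (the anchor) and x_j (the positive sample). We want to pull the representations of x_i and x_j closer together while pushing the representation of x_i away from those of other data (negative samples).
As in the figure above, the numerator is the similarity between the representations of x_i and x_j, and the denominator sums the similarity between the representation of x_i and those of all the data (negative and positive samples). Minimizing this loss achieves the goal.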
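The numerator/denominator structure described above can be sketched as a softmax cross-entropy over inner-product similarities. This is a minimal illustration, not the exact loss of any particular paper: the function name `contrastive_loss`, the `temperature` parameter, and the toy vectors are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(anchor, positive, negatives, temperature=1.0):
    """-log( exp(s_pos) / sum over positive and negatives of exp(s) )."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

anchor = [1.0, 0.0]
close = contrastive_loss(anchor, [1.0, 0.0], [[0.0, 1.0]])
far = contrastive_loss(anchor, [0.0, 1.0], [[1.0, 0.0]])
print(close < far)  # True
```

The loss shrinks when the anchor is more similar to its positive than to the negatives, which is the pull-together, push-apart behaviour the text describes.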