Main contributions of the paper
- Opens up self-supervised learning for ViT
- Investigates the causes of, and remedies for, ViT training instability
Self-supervised Transformer for vision
- Masks and reconstructs patches
- Contrastive/Siamese methods
MoCo v3
Change 1: remove the memory queue
Reason: with a sufficiently large batch size (>4096), the memory queue brings no obvious gain
Change 2: following BYOL, an extra prediction head is added to \(f_q\); the backbone is either a ResNet or a ViT
\(f_q\): backbone + projection head + prediction head
\(f_k\): backbone + projection head
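Below is a minimal PyTorch-style sketch of this setup, loosely following the symmetrized contrastive training described above; the names `base_encoder` (backbone + projection head), `predictor` (prediction head), `momentum_encoder` (\(f_k\)), and the values `tau=0.2`, `m=0.99` are illustrative assumptions rather than exact hyper-parameters from this section.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k, tau=0.2):
    """InfoNCE over the batch: positives sit on the diagonal of q @ k.T."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau                                # [N, N] similarities
    labels = torch.arange(q.size(0), device=q.device)      # matching index = positive
    return F.cross_entropy(logits, labels) * 2 * tau

def train_step(base_encoder, predictor, momentum_encoder, x1, x2, optimizer, m=0.99):
    # f_q(x) = predictor(base_encoder(x)); f_k(x) = momentum_encoder(x)
    q1 = predictor(base_encoder(x1))
    q2 = predictor(base_encoder(x2))
    with torch.no_grad():                                   # keys carry no gradient
        k1 = momentum_encoder(x1)
        k2 = momentum_encoder(x2)

    loss = contrastive_loss(q1, k2) + contrastive_loss(q2, k1)  # symmetrized loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # momentum (EMA) update of the key encoder from the query encoder's
    # backbone + projection head; the prediction head has no momentum counterpart
    with torch.no_grad():
        for p_q, p_k in zip(base_encoder.parameters(), momentum_encoder.parameters()):
            p_k.mul_(m).add_(p_q, alpha=1 - m)
    return loss.item()
```

Since the memory queue is removed, the negatives are simply the other samples in the same (large) batch, which is why this setup only pays off when the batch is big.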
Stability of Self-Supervised ViT Training
Because the model still reaches a decent accuracy rather than suffering a catastrophic failure, the degradation in accuracy caused by instability (roughly 1%–3%) is hard to notice.
Empirical Observations on Basic Factors
Batch Size
Instability appears once the batch size exceeds 2048, and becomes more pronounced as the batch size grows further.
We hypothesize that the training is partially restarted and jumps out of the current local optimum, then seeks a new trajectory. As a consequence, the training does not diverge, but the accuracy depends on how good the local restart is.
Learning Rate
In practice, the learning rate is often scaled when the batch size increases.
As the learning rate increases, accuracy first rises and then falls, while training gradually becomes less stable. With a small learning rate, training is stable but still under-fitting. At lr=1.5e-4 the accuracy drops, and at that point it is the instability that limits accuracy.
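As a concrete illustration of the scaling mentioned above, a minimal sketch assuming the common linear scaling rule (lr × BatchSize / 256); the batch size used here is illustrative:

```python
# Linear scaling rule: effective lr = base_lr * batch_size / 256
base_lr = 1.5e-4                    # base value referred to in the text above
batch_size = 4096                   # assumed large-batch setting
lr = base_lr * batch_size / 256     # -> 2.4e-3 effective learning rate
```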
Optimizer
The paper uses the AdamW optimizer; with LAMB (an AdamW-counterpart of LARS), the learning rate would have to be chosen carefully to reach comparable results.
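A hedged sketch of such an optimizer setup; the placeholder model and the weight-decay value are assumptions for illustration, not taken from this section:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)             # placeholder module; stands in for the ViT encoder
lr = 1.5e-4 * 4096 / 256            # scaled learning rate from the previous snippet
# weight_decay value is an assumption, not stated in this section
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
```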
A Trick for Improving Stability
The authors find that sudden spikes in the gradient cause the training instability, and that these spikes first appear in the first layer, i.e., the patch projection.
Based on this observation, they freeze the patch projection during training: it is randomly initialized and then kept fixed, never trained.
We use a fixed random patch projection layer to embed the patches, which is not learned. This can be easily done by applying a stop-gradient operation right after this layer.
We note that freezing the first layer does not change the architecture, and it actually narrows down the solution space. This indicates that the underlying problem is on optimization.
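One possible way to realize this in PyTorch, shown as a minimal sketch; the attribute name `patch_embed` is an assumption about the ViT implementation, and freezing the parameters has the same effect as the stop-gradient described in the quote:

```python
import torch.nn as nn

def freeze_patch_projection(vit: nn.Module) -> None:
    """Fix the randomly initialized patch projection so it is never trained.

    Equivalent in effect to applying stop-gradient right after the layer:
    no updates ever reach it, while the architecture stays unchanged.
    """
    # `patch_embed` is an assumed name for the patch-projection module
    for p in vit.patch_embed.parameters():
        p.requires_grad_(False)
```

The frozen parameters can simply be excluded from (or left inert in) the optimizer; everything downstream of the projection is trained as usual.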
This trick alleviates the instability rather than solving it.
Although it works well, some open questions remain:
It is an interesting observation that it is not necessary to train the patch projection layer. In this case, random projection should be sufficient to preserve the information of the original patches.
The authors note that the first layer is not necessarily the key factor behind the instability; in fact, all layers are involved. The first layer is simply the only non-Transformer layer, which makes it convenient to handle in isolation; better solutions are expected in the future.
Outlook
1. Self-supervised Transformers can achieve strong results using a contrastive learning framework.
2. Removing the position embedding from ViT has only a small effect on accuracy, which suggests:
- ViT can learn strong representations without the positional inductive bias.
- Positional information has not been sufficiently exploited.
3. Better solutions to the instability are still expected.
4. Close the gap in pre-training methodology between vision and language.