Main contributions of the paper
- Opens up self-supervised learning for ViT
- Investigates the causes of, and remedies for, ViT training instability
Self-supervised Transformer for vision
- Masks and reconstructs patches
- Contrastive/Siamese methods
MoCo v3
Change 1: remove the memory queue
Reason: with a sufficiently large batch size (>4096), the memory queue brings no obvious gain
Change 2: following BYOL, an extra prediction head is added to \(f_q\); the backbone is either a ResNet or a ViT
\(f_q\): backbone + projection head + prediction head
\(f_k\): backbone + projection head
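Below is a minimal PyTorch-style sketch of this setup, loosely following the symmetrized contrastive training described above; the names `base_encoder` (backbone + projection head), `predictor` (prediction head), `momentum_encoder` (\(f_k\)), and the values `tau=0.2`, `m=0.99` are illustrative assumptions rather than exact hyper-parameters from this section.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k, tau=0.2):
    """InfoNCE over the batch: positives sit on the diagonal of q @ k.T."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau                                # [N, N] similarities
    labels = torch.arange(q.size(0), device=q.device)      # matching index = positive
    return F.cross_entropy(logits, labels) * 2 * tau

def train_step(base_encoder, predictor, momentum_encoder, x1, x2, optimizer, m=0.99):
    # f_q(x) = predictor(base_encoder(x)); f_k(x) = momentum_encoder(x)
    q1 = predictor(base_encoder(x1))
    q2 = predictor(base_encoder(x2))
    with torch.no_grad():                                   # keys carry no gradient
        k1 = momentum_encoder(x1)
        k2 = momentum_encoder(x2)

    loss = contrastive_loss(q1, k2) + contrastive_loss(q2, k1)  # symmetrized loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # momentum (EMA) update of the key encoder from the query encoder's
    # backbone + projection head; the prediction head has no momentum counterpart
    with torch.no_grad():
        for p_q, p_k in zip(base_encoder.parameters(), momentum_encoder.parameters()):
            p_k.mul_(m).add_(p_q, alpha=1 - m)
    return loss.item()
```

Since the memory queue is removed, the negatives are simply the other samples in the same (large) batch, which is why this setup only pays off when the batch is big.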
Stability of Self-Supervised ViT Training
Because the model still reaches a decent accuracy rather than suffering a catastrophic failure, the degradation in accuracy caused by instability (roughly 1%–3%) is hard to notice.
Empirical Observations on Basic Factors
Batch Size
Instability appears once the batch size exceeds 2048, and becomes more pronounced as the batch size grows further.
We hypothesize that the training is partially restarted and jumps out of the current local optimum, then seeks a new trajectory. As a consequence, the training does not diverge, but the accuracy depends on how good the local restart is.
Learning Rate
In practice, the learning rate is often scaled when the batch size increases.
As the learning rate increases, accuracy first rises and then falls, while training gradually becomes less stable. With a small learning rate, training is stable but still under-fitting. At lr=1.5e-4 the accuracy drops, and at that point it is the instability that limits accuracy.
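As a concrete illustration of the scaling mentioned above, a minimal sketch assuming the common linear scaling rule (lr × BatchSize / 256); the batch size used here is illustrative:

```python
# Linear scaling rule: effective lr = base_lr * batch_size / 256
base_lr = 1.5e-4                    # base value referred to in the text above
batch_size = 4096                   # assumed large-batch setting
lr = base_lr * batch_size / 256     # -> 2.4e-3 effective learning rate
```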
Optimizer
The paper uses the AdamW optimizer; with LAMB (an AdamW-counterpart of LARS), the learning rate would have to be chosen carefully to reach comparable results.
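A hedged sketch of such an optimizer setup; the placeholder model and the weight-decay value are assumptions for illustration, not taken from this section:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)             # placeholder module; stands in for the ViT encoder
lr = 1.5e-4 * 4096 / 256            # scaled learning rate from the previous snippet
# weight_decay value is an assumption, not stated in this section
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
```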
A Trick for Improving Stability
The authors find that sudden spikes in the gradient cause the training instability, and that these spikes first appear in the first layer, i.e., the patch projection.
Based on this observation, they freeze the patch projection during training: it is randomly initialized and then kept fixed, never trained.
We use a fixed random patch projection layer to embed the patches, which is not learned. This can be easily done by applying a stop-gradient operation right after this layer.
We note that freezing the first layer does not change the architecture, and it actually narrows down the solution space. This indicates that the underlying problem is on optimization.
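One possible way to realize this in PyTorch, shown as a minimal sketch; the attribute name `patch_embed` is an assumption about the ViT implementation, and freezing the parameters has the same effect as the stop-gradient described in the quote:

```python
import torch.nn as nn

def freeze_patch_projection(vit: nn.Module) -> None:
    """Fix the randomly initialized patch projection so it is never trained.

    Equivalent in effect to applying stop-gradient right after the layer:
    no updates ever reach it, while the architecture stays unchanged.
    """
    # `patch_embed` is an assumed name for the patch-projection module
    for p in vit.patch_embed.parameters():
        p.requires_grad_(False)
```

The frozen parameters can simply be excluded from (or left inert in) the optimizer; everything downstream of the projection is trained as usual.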
This trick alleviates the instability rather than solving it.
Although it works well, some open questions remain:
It is an interesting observation that it is not necessary to train the patch projection layer. In this case, random projection should be sufficient to preserve the information of the original patches.
The authors note that the first layer is not necessarily the key factor behind the instability; in fact, all layers are involved. The first layer is simply the only non-Transformer layer, which makes it convenient to handle in isolation; better solutions are expected in the future.
Outlook
1. Self-supervised Transformers can achieve strong results using a contrastive learning framework.
2. Removing the position embedding from ViT has only a small effect on accuracy, which suggests:
- ViT can learn strong representations without the positional inductive bias.
- Positional information has not been sufficiently exploited.
3. Better solutions to the instability are still expected.
4. Close the gap in pre-training methodology between vision and language.