1 Understanding BatchNorm, InstanceNorm, and LayerNorm
[1] Batch Normalization, Instance Normalization, Layer Normalization: Structural Nuances
• The Transformer encoder uses Layer Normalization.
• There is also Group Normalization; see the article 《全面解读Group Normalization》.
2 BatchNorm
2.1 The momentum parameter acts as an importance factor when computing the running mean and running variance
[2] https://stats.stackexchange.com/questions/219808/how-and-why-does-batch-normalization-use-moving-averages-to-track-the-accuracy-o
[3] Batch Normalization Explained
running_mean = momentum * running_mean + (1 - momentum) * new_mean
running_var = momentum * running_var + (1 - momentum) * new_var
Momentum is the weight given to the accumulated running statistics, a.k.a. the "lag"; the last seen mini-batch receives the remaining weight, 1 - momentum. If momentum is set to 0, the running mean and variance come entirely from the last seen mini-batch. However, this may be biased and is not desirable for testing. Conversely, if momentum is set to 1, the running statistics are never updated and stay at their initial values (those of the first mini-batch, if they were initialized from it). Essentially, momentum controls how much each new mini-batch contributes to the running averages.
Ideally, the momentum should be set close to 1 (>0.9) to ensure slow learning of the running mean and variance such that the noise in a mini-batch is ignored.
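The two extremes and a typical setting can be checked with a small pure-Python sketch of the update rule above. The helper names (`update_running`, `track`) and the batch means are hypothetical, chosen only for illustration. Note that PyTorch's `torch.nn.BatchNorm2d` documents the opposite convention, where `momentum` weights the new batch statistic (default 0.1), so a momentum of 0.9 in the notation used here would correspond to `momentum=0.1` there.

```python
def update_running(running, batch_stat, momentum):
    """One update in the convention above: momentum weights the old running value."""
    return momentum * running + (1.0 - momentum) * batch_stat


def track(batch_stats, momentum, init=0.0):
    """Fold a sequence of per-batch statistics into a single running value."""
    running = init
    for stat in batch_stats:
        running = update_running(running, stat, momentum)
    return running


# Hypothetical per-batch means, for illustration only.
batch_means = [1.0, 3.0, 2.0, 4.0]

last_only = track(batch_means, momentum=0.0)  # equals the last batch mean
frozen = track(batch_means, momentum=1.0)     # never leaves the initial value
smooth = track(batch_means, momentum=0.9)     # moves slowly toward the batch means


def to_pytorch_momentum(momentum):
    """Convert to PyTorch's convention, where momentum weights the new batch."""
    return 1.0 - momentum
```

With momentum close to 1, a single noisy batch shifts the running value only slightly, which is the "slow learning" described above.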
2.2 How torch.utils.checkpoint handles batch normalization
[4] Trading compute for memory in PyTorch models using Checkpointing
A batch normalization layer maintains running mean and variance statistics that depend on the current mini-batch, and every time a forward pass is run the statistics are updated according to the momentum value. With checkpointing, the forward pass of a model segment runs twice in the same iteration, so the running mean and variance are updated twice. To compensate for this, use new_momentum = sqrt(momentum) as the momentum value.
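Why the square root works can be verified numerically. The sketch below assumes the update rule from section 2.1 (momentum weights the old running value) and assumes both forward passes see identical batch statistics; the variable names and numbers are hypothetical. Under PyTorch's documented convention, where `momentum` weights the new batch, the same reasoning would give `1 - sqrt(1 - momentum)` instead; that conversion is derived here and is not taken from the cited article.

```python
import math


def update_running(running, batch_stat, momentum):
    """Update rule from section 2.1: momentum weights the old running value."""
    return momentum * running + (1.0 - momentum) * batch_stat


momentum = 0.9
running, batch_mean = 0.5, 2.0  # hypothetical values

# Intended behaviour: one update per iteration.
single = update_running(running, batch_mean, momentum)

# Checkpointing with the original momentum: the statistic is updated twice.
naive = update_running(running, batch_mean, momentum)
naive = update_running(naive, batch_mean, momentum)

# Checkpointing with sqrt(momentum): two updates collapse into the intended one,
# since s*(s*r + (1-s)*x) + (1-s)*x = s^2*r + (1-s^2)*x with s^2 = momentum.
new_momentum = math.sqrt(momentum)
twice = update_running(running, batch_mean, new_momentum)
twice = update_running(twice, batch_mean, new_momentum)

# Same adjustment expressed in PyTorch's convention (momentum weights the new batch).
pytorch_momentum = 1.0 - momentum
adjusted_pytorch = 1.0 - math.sqrt(1.0 - pytorch_momentum)
```

Here `twice` matches `single` up to floating-point error, while `naive` drifts further toward the batch statistic than intended.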
3 AdaIN (Adaptive Instance Normalization)
AdaIN is a normalization commonly used in style transfer.
\[\operatorname{AdaIN}(x, y)=\sigma(y)\left(\frac{x-\mu(x)}{\sigma(x)}\right)+\mu(y)\]
AdaIN receives a content input x and a style input y, and simply aligns the channel-wise mean and variance of x to match those of y. Unlike BN, IN or CIN, AdaIN has no learnable affine parameters.
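The formula can be illustrated for a single sample with a minimal pure-Python sketch. Each feature map is represented as a list of channels, and each channel as a flat list of spatial values; this layout, the function name `adain`, the `eps` term added to σ(x) for numerical safety, and the sample values are all assumptions made for illustration, not part of the original formulation.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Align per-channel mean/std of `content` to those of `style`.

    content, style: lists of channels, each channel a flat list of floats.
    The spatial sizes of content and style may differ; channel counts must match.
    """
    out = []
    for x_c, y_c in zip(content, style):
        mu_x, sigma_x = fmean(x_c), pstdev(x_c)
        mu_y, sigma_y = fmean(y_c), pstdev(y_c)
        # sigma(y) * (x - mu(x)) / sigma(x) + mu(y), with eps guarding sigma(x) = 0
        out.append([sigma_y * (v - mu_x) / (sigma_x + eps) + mu_y for v in x_c])
    return out


# Hypothetical two-channel inputs.
content = [[1.0, 2.0, 3.0, 4.0], [0.0, 10.0, 20.0, 30.0]]
style = [[5.0, 7.0], [-1.0, 1.0, 3.0]]

stylized = adain(content, style)
```

After the call, each channel of `stylized` has approximately the mean and standard deviation of the corresponding style channel, and no learnable parameters are involved.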
An observation from IBN-Net about Instance Normalization and Batch Normalization:
IN learns features that are invariant to appearance changes, such as colors, styles, and virtuality/reality, while BN is essential for preserving content related information.
IBN-Net is used fairly often in ReID models.