Reading notes: Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation

2024-02-05 19:19:46

Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation

Weakly supervised semantic segmentation based on an equivariant attention mechanism

Paper: https://arxiv.org/abs/2004.04581v1

Source code: https://github.com/YudeWang/SEAM

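1 Overview

Semantic segmentation is a fundamental task in computer vision. It predicts a class for every pixel of an image, producing a segmentation mask that later processing stages can build on. Compared with image classification and object detection, however, producing pixel-level class labels for semantic segmentation is far more time-consuming and laborious. Recent work has therefore turned to weaker annotations, such as image-level classification labels and object bounding boxes, to train weakly supervised semantic segmentation models.

Most current weakly supervised segmentation methods build on the Class Activation Map (CAM) [1]. A CAM trained with image-level classification labels can localize the objects in an image effectively. However, a CAM usually covers only the most discriminative part of the object, and it often wrongly activates background regions as well. Moreover, once the dataset is augmented, the CAMs of the same image under different affine transformations are usually inconsistent with one another (see Figure 1).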
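To make the CAM idea concrete, here is a minimal NumPy sketch, assuming a classifier that ends in global average pooling followed by a bias-free linear layer, as in the original CAM formulation. The array shapes, the random inputs, and the function names are illustrative assumptions, not taken from the SEAM code.

```python
import numpy as np

def class_activation_maps(features, fc_weights):
    """Weighted sum of the last conv feature maps for each class.

    features:   (C, H, W) feature maps from the last conv layer
    fc_weights: (K, C) weights of the linear classifier applied after GAP
    returns:    (K, H, W) raw class activation maps
    """
    return np.einsum("kc,chw->khw", fc_weights, features)

def normalize_cam(cams, eps=1e-5):
    """Clip negatives and scale each class map by its own peak, giving values in [0, 1]."""
    cams = np.maximum(cams, 0.0)
    peak = cams.max(axis=(1, 2), keepdims=True)
    return cams / (peak + eps)

rng = np.random.default_rng(0)
features = rng.random((8, 4, 4))            # toy feature maps
fc_weights = rng.standard_normal((3, 8))    # toy classifier weights for 3 classes

raw_cams = class_activation_maps(features, fc_weights)
cams = normalize_cam(raw_cams)
```

Averaging a raw CAM over all spatial positions gives back that class's logit, which is why a map trained only with image-level labels still carries localization information.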

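2 Core idea

The paper proposes a self-supervised equivariant attention mechanism (SEAM) to narrow the gap between weak and full supervision. SEAM has two parts. First, it applies consistency regularization to the CAMs generated from differently transformed versions of an image, which gives the network a self-supervised training signal. Second, it introduces a pixel correlation module (PCM), which uses each pixel's context appearance information to produce affinity attention maps that refine the original CAM.

To implement SEAM, the paper designs a siamese network (see Figure 2) and uses an equivariant cross regularization (ECR) loss to constrain the original CAMs and the revised CAMs on the two branches. (The SEAM is implemented by a siamese network with equivariant cross regularization (ECR) loss, which regularizes the original CAMs and the revised CAMs on different branches.)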

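3 Method

3.1 Equivariant consistency regularization

Self-supervision comes from consistency regularization on the CAM.

As shown in Figure 2, one branch of the siamese network takes the original image as input and applies the affine transformation to the network output. The other branch applies the affine transformation to the input image before passing it through the network. Training then minimizes the ER loss between the outputs of the two branches.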
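The two-branch comparison can be sketched as follows. This is a toy illustration under stated assumptions: a horizontal flip stands in for the affine transformation A, the two "networks" are hypothetical per-pixel functions rather than a CNN, and the L1 distance is taken as a mean absolute difference.

```python
import numpy as np

def hflip(x):
    """Stand-in for the affine transformation A (assumption: a horizontal flip)."""
    return x[..., ::-1]

def er_loss(cam_of_transformed_image, cam_of_original_image, transform):
    """Mean absolute difference between F(A(I)) and A(F(I))."""
    return np.abs(cam_of_transformed_image - transform(cam_of_original_image)).mean()

def toy_equivariant_net(image):
    """Hypothetical network: a per-pixel operation, which commutes with flips."""
    return image ** 2

def toy_non_equivariant_net(image):
    """Hypothetical network: position-dependent weighting breaks equivariance."""
    ramp = np.linspace(0.0, 1.0, image.shape[-1])
    return image * ramp

rng = np.random.default_rng(0)
image = rng.random((1, 6, 6))

# Branch 1: transform the output of the original image. Branch 2: transform the input.
loss_eq = er_loss(toy_equivariant_net(hflip(image)), toy_equivariant_net(image), hflip)
loss_neq = er_loss(toy_non_equivariant_net(hflip(image)), toy_non_equivariant_net(image), hflip)
```

The loss is zero for the equivariant toy function and positive for the other one, so minimizing it pushes the network toward producing CAMs that move with the image.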

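3.2 Pixel correlation module (PCM)

The pixel correlation module (PCM) captures each pixel's context appearance information and learns to produce affinity attention maps, which are then used to refine the original CAM (see Figure 3).

The module is a typical self-attention module. Its principle is given by Eq. (4) of the paper (Eqs. (1) and (2) in the paper give the more general form of self-attention).

In Eq. (4), ŷ (y_hat) denotes the original CAM and y denotes the revised CAM. The three embedding functions θ, φ, and g can each be implemented as a 1×1 convolution (the original CAM is embedded into residual space by function g). The paper simplifies Eq. (4) and measures the similarity between features with cosine distance (Eqs. (5) and (6)). The siamese network diagram also shows that PCM is trained by the supervision from equivariant regularization.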
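The equations referred to here did not survive in these notes. The block below is a reconstruction from the surrounding description (self-attention with θ, φ, g and a residual term, then a simplified form with cosine similarity, ReLU, and no φ, g, or residual). It should be checked against the paper before being cited; in particular, reading C(x_i) as the sum of the similarities over j is an assumption based on the L1 normalization mentioned in the notes.

```latex
% Eq. (4): PCM written as classical self-attention with a residual connection
y_i = \frac{1}{C(x_i)} \sum_{\forall j} e^{\theta(x_i)^{\top} \phi(x_j)} \, g(\hat{y}_j) + \hat{y}_i

% Eq. (5): cosine similarity between embedded features
f(x_i, x_j) = \frac{\theta(x_i)^{\top} \theta(x_j)}{\lVert \theta(x_i) \rVert \, \lVert \theta(x_j) \rVert}

% Eq. (6): simplified PCM, without \phi, g, or the residual term
y_i = \frac{1}{C(x_i)} \sum_{\forall j} \mathrm{ReLU}\!\left( \frac{\theta(x_i)^{\top} \theta(x_j)}{\lVert \theta(x_i) \rVert \, \lVert \theta(x_j) \rVert} \right) \hat{y}_j
```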

The similarities are activated by ReLU to suppress negative values. The final CAM is the weighted sum of the original CAM with normalized similarities.

Why the self-attention mechanism is simplified:

Compared to classical self-attention, PCM removes the residual connection to keep the same activation intensity of the original CAM. Moreover, since the other network branch provides pixel-level supervision for PCM, which is not as accurate as ground truth, we reduce parameters by removing embedding function φ and g to avoid overfitting on inaccurate supervision. We use ReLU activation function with L1 normalization to mask out irrelevant pixels and generate an affinity attention map which is smoother in relevant regions.
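The simplified module described in this passage can be sketched in NumPy as follows. This is an illustrative sketch only, not the authors' implementation: it assumes the features have already been passed through the embedding θ, and the function name `pcm`, the shapes, and the toy inputs are hypothetical.

```python
import numpy as np

def pcm(features, cam, eps=1e-5):
    """Refine a CAM with a cosine-similarity affinity map.

    features: (C, H, W) embedded features theta(x)
    cam:      (K, H, W) original CAM (y_hat)
    returns:  (K, H, W) refined CAM (y)
    """
    c, h, w = features.shape
    k = cam.shape[0]
    f = features.reshape(c, h * w)
    f = f / (np.linalg.norm(f, axis=0, keepdims=True) + eps)            # unit-length feature per pixel
    affinity = np.maximum(f.T @ f, 0.0)                                 # cosine similarity, then ReLU
    affinity = affinity / (affinity.sum(axis=1, keepdims=True) + eps)   # L1-normalize each row
    refined = cam.reshape(k, h * w) @ affinity.T                        # y_i = sum_j a_ij * y_hat_j
    return refined.reshape(k, h, w)

# Toy input: the left and right halves of a 2x4 image have orthogonal features.
features = np.zeros((2, 2, 4))
features[0, :, :2] = 1.0
features[1, :, 2:] = 1.0

# The original CAM fires on a single pixel in the left half.
cam = np.zeros((1, 2, 4))
cam[0, 0, 0] = 1.0

refined = pcm(features, cam)
```

In this toy case the single activated pixel is spread evenly over the four left-half pixels that share its features, and the right half stays at zero, which is the "smoother in relevant regions" behaviour the quoted passage describes.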

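4 Loss functions

4.1 Classification loss

The classification loss is essentially the same as in other classification tasks: a multi-label soft margin loss.

Because of the siamese network, there are two predicted classification results, so the classification loss accounts for both branches.

4.2 Equivariant regularization loss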
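A minimal sketch of the classification loss, assuming the standard multi-label soft margin definition (a per-class binary cross-entropy on sigmoid outputs, averaged over classes) and assuming the two branch losses are combined by a simple average. The function names and the example logits are illustrative.

```python
import numpy as np

def multilabel_soft_margin_loss(logits, labels):
    """Per-class sigmoid cross-entropy, averaged over classes.

    logits: (K,) raw class scores
    labels: (K,) binary image-level labels in {0, 1}
    """
    log_sig = -np.logaddexp(0.0, -logits)            # log(sigmoid(z)), numerically stable
    log_one_minus_sig = -np.logaddexp(0.0, logits)   # log(1 - sigmoid(z))
    return -np.mean(labels * log_sig + (1.0 - labels) * log_one_minus_sig)

def siamese_cls_loss(logits_o, logits_t, labels):
    """Average of the losses on the original and the transformed branch (assumed weighting)."""
    return 0.5 * (multilabel_soft_margin_loss(logits_o, labels)
                  + multilabel_soft_margin_loss(logits_t, labels))

labels = np.array([1.0, 0.0, 1.0])
logits_o = np.array([2.0, -1.0, 0.5])    # predictions from the original-image branch
logits_t = np.array([1.5, -0.5, 0.2])    # predictions from the transformed-image branch
loss = siamese_cls_loss(logits_o, logits_t, labels)
```

Both branches are scored against the same image-level labels, since an affine transformation does not change which classes are present.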

However, in our early experiments, the output maps of PCM fall into the local minimum quickly that all pixels in the image are predicted the same class. Therefore, we propose an equivariant cross regularization (ECR) loss as:
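The formula itself is missing from these notes. Based on the description of ECR as regularizing the original CAMs and the revised CAMs on different branches, a plausible reading is that the PCM-revised CAM of each branch is compared against the original CAM of the other branch. The sketch below follows that reading and should be checked against the paper. It assumes a horizontal flip for the transformation A and a mean absolute difference for the L1 distance, and all names are hypothetical.

```python
import numpy as np

def hflip(x):
    """Stand-in for the affine transformation A (assumption: a horizontal flip)."""
    return x[..., ::-1]

def l1(a, b):
    """Mean absolute difference, used here in place of the L1 norm."""
    return np.abs(a - b).mean()

def ecr_loss(cam_o, cam_rv_o, cam_t, cam_rv_t, transform):
    """Cross regularization between the two branches.

    cam_o, cam_rv_o: original and PCM-revised CAM of the original-image branch
    cam_t, cam_rv_t: original and PCM-revised CAM of the transformed-image branch
    """
    return l1(transform(cam_rv_o), cam_t) + l1(transform(cam_o), cam_rv_t)

rng = np.random.default_rng(0)
cam_o = rng.random((3, 4, 4))

# Fully consistent case: the revised CAM equals the original, and branch t is the flipped branch o.
loss_consistent = ecr_loss(cam_o, cam_o, hflip(cam_o), hflip(cam_o), hflip)

# Collapsed PCM output: the revised CAMs are constant everywhere.
flat = np.full_like(cam_o, 0.5)
loss_collapsed = ecr_loss(cam_o, flat, hflip(cam_o), flat, hflip)
```

Under this reading, a PCM output that collapses to a constant map is penalized because it is tied to the other branch's original CAM rather than only to another PCM output.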