In this paper, we aim to obtain improved attention for a visual question answering (VQA) task. It is challenging to provide supervision for attention. An observation we make is that visual explanations as obtained through class activation mappings (specifically Grad-CAM) that are meant to explain the performance of various networks could form a means of supervision. However, as the distributions of attention maps and that of Grad-CAMs differ, it would not be suitable to directly use these as a form of supervision. Rather, we propose the use of a discriminator that aims to distinguish samples of visual explanation and attention maps. The use of adversarial training of the attention regions as a two-player game between attention and explanation serves to bring the distributions of attention maps and visual explanations closer. Significantly, we observe that providing such a means of supervision also results in attention maps that are more closely related to human attention resulting in a substantial improvement over baseline stacked attention network (SAN) models. It also results in a good improvement in rank correlation metric on the VQA task. This method can also be combined with recent MCB based methods and results in consistent improvement. We also provide comparisons with other means for learning distributions such as based on Correlation Alignment (Coral), Maximum Mean Discrepancy (MMD) and Mean Square Error (MSE) losses and observe that the adversarial loss outperforms the other forms of learning the attention maps. Visualization of the results also confirms our hypothesis that attention maps improve using this form of supervision.



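The authors' approach to visual question answering (VQA) centres on improving attention by using supervision obtained from visual explanation methods such as Grad-CAM. The key difference from existing VQA architectures is that the visual explanation and the attention block are used together in an adversarial setting (see Figure 1). Using Grad-CAM as supervision for attention yields higher performance than using attention alone.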

An embedding of the input image is obtained with a convolutional neural network (CNN). Similarly, a question feature embedding of the query question is obtained with an LSTM network. Both embeddings are fed to the attention network, which combines them using a weighted softmax function and produces a weighted output attention vector. There are several ways to model the attention network; the authors evaluate the method with both SAN and MCB. The Grad-CAM maps are computed with the model proposed by Selvaraju et al. (2017): Grad-CAM uses the gradient information of the last convolutional layer to visualize how much each pixel contributes to the prediction.
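As a rough illustration of the Grad-CAM step, the sketch below follows the published Grad-CAM recipe: average the gradients over the spatial dimensions to get one weight per channel, take the weighted sum of the feature maps, and apply a ReLU. The function name `grad_cam`, the `(K, H, W)` array layout and the final rescaling to [0, 1] are assumptions made for this example; it takes the activations and gradients as given arrays and is not the authors' implementation.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Hypothetical helper: Grad-CAM map from last-conv-layer tensors.

    activations: (K, H, W) feature maps of the last convolutional layer.
    gradients:   (K, H, W) gradient of the predicted class score with
                 respect to those feature maps.
    """
    # One importance weight per channel: global-average-pool the gradients.
    weights = gradients.mean(axis=(1, 2))               # shape (K,)
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)    # shape (H, W)
    # Keep only locations with a positive influence on the class score.
    cam = np.maximum(cam, 0.0)
    # Rescale to [0, 1] for visualization (an assumption of this sketch).
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

In a real pipeline the two input arrays would come from a forward and backward pass through the VQA model; here they are simply passed in.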


The method is set up as a zero-sum adversarial game between two players (P1, P2), where one player is a generator network and the other is a discriminator network. Each player chooses a decision from its own decision set. In our case, the attention network is the generator network, and the "real" distribution is the output of the Grad-CAM network. We call this the Adversarial Attention Network (AAN). A game objective sets the players' utilities: for any chosen pair of strategies, the utility of one player is the negative of the utility of the other. The goal of each of P1 and P2 is to maximize its worst-case utility, so one player minimizes the maximum of the objective over the opponent's decisions, while the other maximizes its minimum.


This formulation raises the question of whether there exists a solution to which both players can jointly converge. The answer is a Nash equilibrium, at which the discriminator cannot distinguish the generator network's outputs from the "real" distribution.
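In symbols, the worst-case goals and the equilibrium condition of a standard two-player zero-sum game can be sketched as follows. The names $K_1, K_2$ for the decision sets, $L$ for the game objective and $(k_1^*, k_2^*)$ for the equilibrium are assumptions chosen for this sketch, as is the convention that P1 is the minimizing player; the source's own notation may differ.

```latex
% Worst-case goals of the two players (notation assumed, not from the source)
\text{P1:}\quad \min_{k_1 \in K_1} \max_{k_2 \in K_2} L(k_1, k_2),
\qquad
\text{P2:}\quad \max_{k_2 \in K_2} \min_{k_1 \in K_1} L(k_1, k_2)

% Nash equilibrium (saddle point): neither player gains by deviating alone
L(k_1^*, k_2) \;\le\; L(k_1^*, k_2^*) \;\le\; L(k_1, k_2^*)
\qquad \forall\, k_1 \in K_1,\; k_2 \in K_2
```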



At this equilibrium, the distribution of the attention network's outputs matches the distribution of the Grad-CAM outputs. In a zero-sum adversarial game the generator loss and the discriminator loss always sum to zero, i.e. the generator loss is the negative of the discriminator loss. The solution of a zero-sum game is called the minimax solution, in which the goal is to minimize the maximum loss.

The whole game can be summarized by a single loss function, the discriminator's payoff: the discriminator maximizes it and the generator minimizes it, which gives the minimax objective.
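Assuming the objective takes the standard GAN form, which matches the description of what the discriminator and the generator each want, it can be sketched as below. The symbols are assumptions for this sketch: $D$ is the discriminator, $G$ the attention network acting as generator, $e$ a Grad-CAM map drawn from $p_{\text{exp}}$, and $a$ an attention map drawn from the distribution $p_{\text{att}}$ induced by $G$.

```latex
% Standard GAN-style minimax objective (assumed form and notation)
\min_{G} \max_{D} \;
\mathbb{E}_{e \sim p_{\text{exp}}}\big[\log D(e)\big]
+ \mathbb{E}_{a \sim p_{\text{att}}}\big[\log\big(1 - D(a)\big)\big]
```

Maximizing over $D$ pushes $D(e)$ toward 1 and $D(a)$ toward 0, while minimizing over $G$ pushes $D(a)$ toward 1.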


In this objective, the "real" input is the output of the Grad-CAM network for a sample and the generated input is the output of the attention network (subscripts are dropped for simplicity). The discriminator wants to maximize the objective (its payoff), so that its output is close to 1 on Grad-CAM maps and close to 0 on attention maps. The generator wants to minimize the objective (its loss), so that the discriminator's output on attention maps is close to 1. Concretely, the discriminator is a stack of CNN layers followed by a linear layer, trained with a binary cross-entropy loss function.
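The two roles described above can be sketched with binary cross-entropy on the discriminator's output probabilities. This is a minimal illustration, not the authors' code: the function names are hypothetical, the discriminator network itself is left out (only its outputs are passed in), and the generator loss uses the common "label attention maps as real" form, which is an assumption about how the generator side is implemented.

```python
import numpy as np

def bce(pred, target, eps=1e-12):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

def discriminator_loss(d_on_gradcam, d_on_attention):
    """Discriminator: score Grad-CAM maps as 1 and attention maps as 0."""
    real_term = bce(d_on_gradcam, np.ones_like(d_on_gradcam))
    fake_term = bce(d_on_attention, np.zeros_like(d_on_attention))
    return real_term + fake_term

def generator_loss(d_on_attention):
    """Generator (attention network): make attention maps score close to 1."""
    return bce(d_on_attention, np.ones_like(d_on_attention))
```

A discriminator that cannot tell the two sources apart outputs 0.5 everywhere, which gives a discriminator loss of 2 log 2.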

The final cost function of the network combines the adversarial loss on the attention network with the cross-entropy loss for solving VQA. It is a function of three sets of parameters: those of the attention network, those of the classification network and those of the discriminator.


Here n is the number of examples, η = 10 is a hyperparameter tuned on the validation set, and the classification loss is the standard cross-entropy loss. The model is trained with this cost function until it converges, so that the learned parameters form a saddle point of the cost function.
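One plausible way to write such a combined cost and its saddle point is sketched below. Everything about the notation is an assumption: $\theta_f$, $\theta_c$ and $\theta_d$ stand for the attention, classification and discriminator parameters, $L_c^{j}$ and $L_{adv}^{j}$ for the per-example cross-entropy and adversarial losses. The sketch also assumes that η weights the adversarial term and that the losses are averaged over the n examples.

```latex
% Combined cost (assumed form: cross-entropy plus eta-weighted adversarial term)
C(\theta_f, \theta_c, \theta_d)
= \frac{1}{n} \sum_{j=1}^{n}
\Big( L_c^{j}(\theta_f, \theta_c) + \eta \, L_{adv}^{j}(\theta_f, \theta_d) \Big)

% Saddle point: attention and classifier minimize, discriminator maximizes
(\hat{\theta}_f, \hat{\theta}_c)
= \arg\min_{\theta_f, \theta_c} C(\theta_f, \theta_c, \hat{\theta}_d),
\qquad
\hat{\theta}_d
= \arg\max_{\theta_d} C(\hat{\theta}_f, \hat{\theta}_c, \theta_d)
```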


The authors also experimented with variants of the proposed method, namely Maximum Mean Discrepancy (MMD) Net and CORAL Net, which are described below.

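Maximum Mean Discrepancy (MMD) Net: in this variant, the distance between the two distributions is measured with the standard MMD-based distribution distance metric, and training minimizes this distance.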
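For intuition, a biased estimate of the squared MMD between two sets of flattened maps can be sketched as follows. The Gaussian (RBF) kernel, the bandwidth `sigma`, the function names and the choice to treat each map as a flat vector are all assumptions of this example; the source does not specify the kernel used.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel matrix between the rows of x (n, d) and y (m, d)."""
    sq_dist = (np.sum(x ** 2, axis=1)[:, None]
               + np.sum(y ** 2, axis=1)[None, :]
               - 2.0 * x @ y.T)
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x and y."""
    return float(rbf_kernel(x, x, sigma).mean()
                 + rbf_kernel(y, y, sigma).mean()
                 - 2.0 * rbf_kernel(x, y, sigma).mean())
```

Here `x` would hold one flattened attention map per row and `y` one flattened Grad-CAM map per row; the estimate is zero when the two sample sets coincide and grows as they move apart.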

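CORAL Net: in this variant, we use the standard distribution distance metric based on the CORAL loss, minimizing the distance between the second-order statistics (covariances) of the attention maps and the Grad-CAM masks.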
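A minimal sketch of a CORAL-style loss is given below, following the usual definition as the squared Frobenius distance between the two covariance matrices scaled by 1/(4d²). The function name and the assumption that each attention map or Grad-CAM mask is flattened into one row of a `(n, d)` array are choices made for this example only.

```python
import numpy as np

def coral_loss(source, target):
    """CORAL-style loss between the rows of source (n, d) and target (m, d)."""
    d = source.shape[1]
    # Feature covariance matrices; each column is treated as one variable.
    cov_source = np.atleast_2d(np.cov(source, rowvar=False))
    cov_target = np.atleast_2d(np.cov(target, rowvar=False))
    # Squared Frobenius norm of the difference, scaled by 1 / (4 d^2).
    return float(np.sum((cov_source - cov_target) ** 2) / (4.0 * d * d))
```

Because only covariances are compared, the loss ignores a constant shift of either sample set, which is one way it differs from a pixel-wise loss such as MSE.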

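The MMD and CORAL variants were trained directly, without the adversarial loss, to bring the Grad-CAM-based pseudo-distribution closer to the attention distribution. Finally, MMD and CORAL were replaced with the adversarial loss.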




Figure 2: Heat-map visualizations after 10, 50, 100 and 200 epochs.



In this paper we have proposed a method for obtaining surrogate supervision, derived from visual explanations, to improve attention. Specifically, we consider Grad-CAM, although other such modules could also be considered. We show that using the surrogate supervision adversarially performs best, with the pixel-wise adversarial method (PAAN) outperforming the other ways of using this supervision. The results show that improved attention indeed improves performance on semantic tasks such as VQA or visual dialog. Our method provides an initial means of obtaining surrogate supervision for attention, and in future we would like to investigate other means of obtaining improved attention.


