Because the theory part is hard to understand, I will start with the method part and complete the theory part later.
method
The most difficult part is the meaning of $C$, the confounder set. Each $c_i$ is the average of all $X_M$, which are the processed CAMs. I am not confident about this.
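To make this reading concrete, here is a minimal sketch of what "the average of all $X_M$" could look like for one class. It only illustrates my interpretation above, not the paper's implementation. The helper `mean_map` and the toy 2x2 maps are hypothetical.

```python
def mean_map(maps):
    """Element-wise mean of a list of equally sized 2D maps (lists of rows)."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / n for c in range(cols)]
            for r in range(rows)]

# Hypothetical processed CAMs X_M that belong to one class i (toy 2x2 maps).
x_m_class_i = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]

# Under the interpretation above, c_i is simply their element-wise average.
c_i = mean_map(x_m_class_i)
print(c_i)  # [[0.5, 1.0], [0.5, 0.0]]
```

Repeating this for every class would give one entry $c_i$ per class, which together form the set $C$.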
$\alpha$ is a measure of similarity that gives the weight of each class-specific entry $c$. $W_1$ and $W_2$ are two learnable matrices.
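The sketch below shows one way such similarity weights could be computed. It assumes that $\alpha$ is a softmax over scaled dot products between the projected feature $W_1 x$ and each projected entry $W_2 c_i$. That exact form is my assumption, and the function names, the flattened vector `x`, and the toy values are all hypothetical.

```python
import math

def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(w_rk * v_k for w_rk, v_k in zip(row, v)) for row in w]

def attention_weights(x, confounders, w1, w2):
    """Assumed form: alpha = softmax_i((W1 x) . (W2 c_i) / sqrt(n))."""
    query = matvec(w1, x)
    n = len(query)
    scores = [
        sum(q * k for q, k in zip(query, matvec(w2, c))) / math.sqrt(n)
        for c in confounders
    ]
    # Subtract the maximum score for a numerically stable softmax.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: identity matrices stand in for the learnable W1 and W2.
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 0.0]                           # hypothetical flattened feature
confounders = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical entries c_1, c_2

alpha = attention_weights(x, confounders, w1, w2)
print(alpha)  # the entry more similar to x receives the larger weight
```

In this toy case `x` matches the first entry, so `alpha[0]` is larger than `alpha[1]`, and the weights sum to 1.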
Another question is how $M$ takes part in the network's computation.
Here $s_i=f(X, M_t, \theta^i_t)$, and $f$ consists of a shared convolutional network and a class-specific fully-connected network. The input of the first part is the channel-wise concatenation of $X$ and $M_t$.
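As a small illustration of the channel-wise concatenation, the sketch below stores a tensor as a list of channels, each channel being a 2D map. The channel counts (3 for the image, 2 for $M_t$) and the helper names are hypothetical. The point is only that the spatial size must match while the channel counts add up.

```python
def zeros(channels, height, width):
    """Build a channels x height x width tensor of zeros as nested lists."""
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]

def shape(t):
    """Return (channels, height, width) of a nested-list tensor."""
    return (len(t), len(t[0]), len(t[0][0]))

def concat_channels(a, b):
    """Concatenate two tensors along the channel axis."""
    if shape(a)[1:] != shape(b)[1:]:
        raise ValueError("spatial sizes must match")
    return a + b

x = zeros(3, 4, 4)    # hypothetical image X with 3 channels
m_t = zeros(2, 4, 4)  # hypothetical mask M_t with 2 channels

net_input = concat_channels(x, m_t)
print(shape(net_input))  # (5, 4, 4)
```

The shared convolutional part of $f$ would then take this 5-channel tensor as its input.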