Attentional Factorization Machine (AFM) Reproduction Notes

2023-11-06 10:32:04

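Disclaimer: these notes record my own learning process while reproducing the model. If you spot any mistakes, corrections are very welcome.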

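I have previously studied a number of feature-crossing models, such as Wide&Deep, Deep&Cross, DeepFM, and NFM.
These models already handle the feature-crossing part of feature engineering very well, and there is little room left to improve along that line. Many researchers have therefore turned to structural changes instead: model structures that work well in other fields, such as attention mechanisms, sequence models, and reinforcement learning, are gradually making their way into recommender systems.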

This week I studied how the attention mechanism is applied in recommender systems.

If you are not familiar with the attention mechanism, Mu Li's lecture on it is a good place to start.

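AFM's contribution is to bring the attention mechanism into the feature-crossing module. Recall that in NFM, the Bi-Interaction Layer gives every crossed term the same weight. AFM instead uses an attention network to compute a weight for each second-order crossed feature, and applies those weights when sum-pooling the crossed features.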

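NFM treats all crossed features equally, without considering how much each one actually influences the result, so it wastes a lot of valuable information.
The attention mechanism addresses exactly this equal-treatment problem.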

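For example, suppose we want to predict whether a male user will buy basketball shoes. The crossed feature "gender = male & purchase history includes a basketball" is probably more important than "gender = male & age = 25". The model should therefore pay more attention to the important crossed features by giving them larger weights, and give smaller weights to the less important ones.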
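To make the contrast concrete, here is a minimal sketch in plain Python. The interaction vectors and attention scores are made-up numbers for illustration, not learned values. It compares NFM-style sum pooling with attention-weighted pooling over the two crossed features from the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sum_pooling(interactions):
    """NFM-style pooling: every interaction vector gets the same weight."""
    return [sum(col) for col in zip(*interactions)]


def attention_pooling(interactions, scores):
    """AFM-style pooling: weight each interaction vector by its softmax score."""
    weights = softmax(scores)
    return [sum(w * v for w, v in zip(weights, col)) for col in zip(*interactions)]


# Hypothetical element-wise products v_i * v_j for two crossed features.
interactions = [
    [0.8, 0.6],   # "male" x "bought a basketball"
    [0.1, -0.2],  # "male" x "age 25"
]
# Hypothetical raw attention scores; the first cross is judged more relevant.
scores = [2.0, 0.0]

weights = softmax(scores)
pooled_sum = sum_pooling(interactions)
pooled_att = attention_pooling(interactions, scores)
```

With sum pooling both crosses contribute equally, while with attention pooling the result is pulled toward the first cross because its weight is larger.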

AFM formula:
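As I recall it from the AFM paper (Xiao et al., 2017), the prediction function can be written as below. Here $\mathbf{v}_i$ is the embedding of feature $i$, $\odot$ is the element-wise product, $a_{ij}$ is the attention weight of the pair $(i, j)$, and $\mathbf{p}$ projects the pooled vector to a scalar. The notation is taken from the paper and is an assumption with respect to this post.

```latex
\hat{y}_{AFM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \mathbf{p}^{T} \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_{ij} \, (\mathbf{v}_i \odot \mathbf{v}_j) \, x_i x_j
```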

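At first glance it looks a lot like the FM model, with an extra vector p and weights a_ij. In essence, it performs an importance selection over the second-order crossed features.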

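From another angle, AFM is also an upgraded NFM: the Sum Pooling in NFM's Bi-Interaction Pooling Layer (Figure 1) is replaced by a weighted sum that uses the attention weight (score) of each crossed term (Figure 2).

Figure 1

Figure 2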
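The difference between the two pooling layers can also be written out. The following is a sketch based on my reading of the NFM and AFM papers, so the notation is an assumption rather than something taken from the figures. $\mathbf{v}_i$ is the embedding of feature $i$, $\odot$ is the element-wise product, and $\mathcal{R}_x$ is the set of all feature pairs present in the sample.

```latex
% NFM: Bi-Interaction pooling, all pairs weighted equally
f_{BI}(\mathcal{V}_x) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} x_i \mathbf{v}_i \odot x_j \mathbf{v}_j

% AFM: attention-based pooling, each pair weighted by a_{ij}
f_{Att}(\mathcal{V}_x) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_{ij} \, (\mathbf{v}_i \odot \mathbf{v}_j) \, x_i x_j

% Attention network that produces the weights
a'_{ij} = \mathbf{h}^{T} \, \mathrm{ReLU}\big(\mathbf{W} (\mathbf{v}_i \odot \mathbf{v}_j) x_i x_j + \mathbf{b}\big),
\qquad
a_{ij} = \frac{\exp(a'_{ij})}{\sum_{(i,j) \in \mathcal{R}_x} \exp(a'_{ij})}
```

Setting every $a_{ij}$ to 1 recovers the NFM pooling, which is why AFM can be seen as a generalization of it.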

Model design

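I again designed the model as a two-tower model and built two versions: an AFM without a DNN and an AFM with a DNN. In theory the latter should perform considerably better than the former, but when I trained on the small dataset it actually performed worse. (I did not use a larger dataset because my GPU has only 4 GB of memory and training tends to crash.)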

The left tower is a linear model.

The right tower is a multi-layer structure with attention.

AFM without a DNN

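import itertools

import torch
import torch.nn as nn
import torch.nn.functional as F

# Linear model (tower fed with the dense features)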
class Linear(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(Linear,self).__init__()
        self.Linear = nn.Linear(input_dim, output_dim)
    def forward(self, x):
        output = self.Linear(x)
        return output
# Attention network
class Attention_layer(nn.Module):
    def __init__(self, att_units):
        # att_units: [embed_dim, att_vector]
        super(Attention_layer, self).__init__()

        self.att_w = nn.Linear(att_units[0], att_units[1])  # W and b
        self.att_dense = nn.Linear(att_units[1], 1)  # h

    def forward(self, bi_interaction):
        # bi_interaction: [batch, num_pairs, embed_dim]
        a = self.att_w(bi_interaction)
        a = F.relu(a)
        att_scores = self.att_dense(a)  # [batch, num_pairs, 1]
        # Normalize the scores over the pair dimension
        att_weight = F.softmax(att_scores, dim=1)

        # Attention-weighted sum over all pairs: [batch, embed_dim]
        att_out = torch.sum(att_weight * bi_interaction, dim=1)
        return att_out
# AFM model
class AFM(nn.Module):
    def __init__(self, feature_columns, att_vector=8, dropout=0.):
        super(AFM, self).__init__()

        self.dense_feature_cols, self.sparse_feature_cols = feature_columns

        # Embedding tables, one per sparse feature
        self.embed_layers = nn.ModuleDict({
            'embed_' + str(i): nn.Embedding(num_embeddings=feat['feat_num'], embedding_dim=feat['embed_dim'])
            for i, feat in enumerate(self.sparse_feature_cols)
        })

        # All sparse features are assumed to share the same embed_dim
        embed_dim = self.sparse_feature_cols[0]['embed_dim']
        self.attention = Attention_layer([embed_dim, att_vector])
        # The attention output has size embed_dim (not att_vector), so the
        # linear tower and the final layer use embed_dim to match it
        self.fea_num = embed_dim
        # Dropout on the attention-pooled vector (identity when dropout=0)
        self.dropout = nn.Dropout(p=dropout)
        self.linear = Linear(len(self.dense_feature_cols), self.fea_num)
        self.final_linear = nn.Linear(self.fea_num, 1)

    def forward(self, x):
        # The first columns of x are dense features, the rest are sparse ids
        num_dense = len(self.dense_feature_cols)
        dense_inputs, sparse_inputs = x[:, :num_dense], x[:, num_dense:]
        sparse_inputs = sparse_inputs.long()
        sparse_embeds = [self.embed_layers['embed_' + str(i)](sparse_inputs[:, i]) for i in range(sparse_inputs.shape[1])]
        # [batch, num_fields, embed_dim]
        sparse_embeds = torch.stack(sparse_embeds, dim=1)

        # Index lists covering every unordered pair of fields
        first = []
        second = []
        for f, s in itertools.combinations(range(sparse_embeds.shape[1]), 2):
            first.append(f)
            second.append(s)

        p = sparse_embeds[:, first, :]
        q = sparse_embeds[:, second, :]
        # Pairwise interaction layer: [batch, num_pairs, embed_dim]
        bi_interaction = p * q
        # Attention-based pooling layer: [batch, embed_dim]
        att_out = self.dropout(self.attention(bi_interaction))
        # Linear layer
        li_output = self.linear(dense_inputs)
        # Add the attention pooling output and the linear output, then feed the final linear layer
        outputs = torch.add(att_out, li_output)

        outputs = torch.sigmoid(self.final_linear(outputs))

        return outputs
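The pair indexing used in forward can be checked in isolation. The following sketch uses only the standard library; pair_indices is a hypothetical helper name introduced here for illustration, and it mirrors the itertools.combinations loop from the model.

```python
import itertools


def pair_indices(num_fields):
    """Return two index lists (first, second) covering every unordered field pair."""
    first, second = [], []
    for i, j in itertools.combinations(range(num_fields), 2):
        first.append(i)
        second.append(j)
    return first, second


# With 4 fields there are 4 * 3 / 2 = 6 pairs.
first, second = pair_indices(4)

# With 26 sparse fields (the categorical columns of Criteo) there are
# 26 * 25 / 2 = 325 pairs.
num_pairs = len(pair_indices(26)[0])
```

Indexing the embedding tensor with these two lists yields the left and right operand of every pair at once, so the element-wise product p * q covers all crossed terms without a Python loop over the batch.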

Results: batch_size = 256, with L2 regularization (0.0001) and dropout (0.1)

AFM with a DNN

The AFM with a DNN did not perform well. My guess is that I did not tune the hyperparameters properly.

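import itertools

import torch
import torch.nn as nn
import torch.nn.functional as F

# Linear model (tower fed with the dense features)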
class Linear(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(Linear,self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)
    def forward(self, x):
        output = self.linear(x)
        return output
# DNN model
class Dnn(nn.Module):
    def __init__(self, hidden_units, dropout=0.):
        super(Dnn, self).__init__()
        # hidden_units: [input_dim, h1, h2, ...]; one Linear per consecutive pair
        self.dnn_network = nn.ModuleList([nn.Linear(layer[0], layer[1]) for layer in zip(hidden_units[:-1], hidden_units[1:])])
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        for layer in self.dnn_network:
            x = layer(x)
            x = F.relu(x)
        # Dropout is applied once, after the last hidden layer
        x = self.dropout(x)

        return x
# Attention network
class Attention_layer(nn.Module):
    def __init__(self, att_units):
        # att_units: [embed_dim, att_vector]
        super(Attention_layer, self).__init__()

        self.att_w = nn.Linear(att_units[0], att_units[1])  # W and b
        self.att_dense = nn.Linear(att_units[1], 1)  # h

    def forward(self, x):
        # x: [batch, num_pairs, embed_dim]
        a = self.att_w(x)
        a = F.relu(a)
        att_scores = self.att_dense(a)  # [batch, num_pairs, 1]
        # Normalize the scores over the pair dimension
        att_weight = F.softmax(att_scores, dim=1)
        # Attention-weighted sum over all pairs: [batch, embed_dim]
        att_out = torch.sum(att_weight * x, dim=1)
        return att_out
# AFM model
class AFM(nn.Module):
    def __init__(self, feature_columns, hidden_unit, att_vector=8, dropout=0.):
        super(AFM, self).__init__()

        self.dense_feature_cols, self.sparse_feature_cols = feature_columns

        # Embedding tables, one per sparse feature
        self.embed_layers = nn.ModuleDict({
            'embed_' + str(i): nn.Embedding(num_embeddings=feat['feat_num'], embedding_dim=feat['embed_dim'])
            for i, feat in enumerate(self.sparse_feature_cols)
        })

        # All sparse features are assumed to share the same embed_dim
        embed_dim = self.sparse_feature_cols[0]['embed_dim']
        self.attention = Attention_layer([embed_dim, att_vector])
        self.fea_num = embed_dim
        # Build a new list instead of inserting into the caller's list in place
        hidden_units = [self.fea_num] + list(hidden_unit)
        self.dnn_network = Dnn(hidden_units, dropout)
        # The linear tower must match the DNN output size so the two can be added
        self.Linear = Linear(len(self.dense_feature_cols), hidden_units[-1])
        self.final_linear = nn.Linear(hidden_units[-1], 1)

    def forward(self, x):
        # The first columns of x are dense features, the rest are sparse ids
        num_dense = len(self.dense_feature_cols)
        dense_inputs, sparse_inputs = x[:, :num_dense], x[:, num_dense:]
        sparse_inputs = sparse_inputs.long()
        sparse_embeds = [self.embed_layers['embed_' + str(i)](sparse_inputs[:, i]) for i in range(sparse_inputs.shape[1])]
        # [batch, num_fields, embed_dim]
        sparse_embeds = torch.stack(sparse_embeds, dim=1)

        # Index lists covering every unordered pair of fields
        first = []
        second = []
        for f, s in itertools.combinations(range(sparse_embeds.shape[1]), 2):
            first.append(f)
            second.append(s)

        p = sparse_embeds[:, first, :]
        q = sparse_embeds[:, second, :]

        # Pairwise interaction layer, e.g. [256, 325, 8] = [batch, num_pairs, embed_dim]
        bi_interaction = p * q
        # Attention-based pooling layer, e.g. [256, 8]
        att_out = self.attention(bi_interaction)
        # DNN layer
        dnn_output = self.dnn_network(att_out)
        # Linear layer
        li_output = self.Linear(dense_inputs)
        # Output layer
        outputs = torch.sigmoid(self.final_linear(torch.add(dnn_output, li_output)))

        return outputs

Results: batch_size = 256, with L2 regularization (0.00013) and dropout (0.23)

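Model comparison:

Comparing Wide&Deep, DeepFM, NFM, AFM, and AFM_Deep

Small Criteo dataset (20,000 training samples, 5,000 test samples)

Larger Criteo dataset (480,000 training samples, 120,000 test samples)

Between NFM and AFM, which belong to the same family, AFM does perform better than NFM.
As for AFM versus AFM_Deep, the AFM with a deep neural network (AFM_Deep) should in theory outperform the AFM without one, but in this experiment AFM_Deep did not do well. If the model itself is correct, my feeling is that the hyperparameters were poorly tuned.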
