
An Introduction to the AutoInt Network and a Brief Source-Code Walkthrough

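Preface (unrelated to the main text, feel free to skip~)

I have not written a blog post for several weeks. The surface reason is that I really have been busy lately, but the deeper reason is that I have relaxed my standards for myself. Fortunately I still found time to read a few papers, so here is a brief summary~ Also, a reminder to myself: posts I have already written should be revisited and reviewed regularly. There are too many chores these days and my memory is clearly not as lively as it used to be; even with recently studied material I forget some of the details... OK, I admit it is not only the details (I have nearly forgotten the core principles too). So writing things down at least proves that I once learned them.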

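Announcement

You can search for “珍妮的算法之路” or “world4458” in WeChat to follow my official account. You can also take a look at my Zhihu column PoorMemory-机器学习; future articles will be posted there as well.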

AutoInt

Paper information

Paper title: AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks
Paper URL: https://arxiv.org/abs/1810.11921
Code URL: https://github.com/DeepGraphLearning/RecommenderSystems/
Published: 2018
Authors: Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, Jian Tang
Affiliation: Peking University

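Key ideas

The core module of AutoInt is the multi-head self-attentive layer. In essence it uses attention to measure the relevance between features, which gives the model good interpretability. Some of these ideas also appear in xDeepFM: the high-order features learned by a DNN are implicit, their exact order is unknown, and they are hard to interpret; in addition, a DNN cannot efficiently learn multiplicative relations between features.
This paper instead uses multi-head self-attention. It first projects the original embeddings with three matrices (Query, Key, and Value), then uses an attention function (for example a simple DNN, or just an inner product) to define the similarity between the embeddings projected by the Query and Key matrices, which yields the similarity coefficients $\alpha$. The embeddings projected by the Value matrix are then summed with the weights $\alpha$, and the result is the cross feature learned by the multi-head self-attentive layer.
The authors also use residual learning: the original embedding is projected by a matrix $W_{Res}$, added to the output of the multi-head self-attentive layer, and passed through ReLU.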

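Key ideas in detail

The AutoInt network architecture is shown in the figure below:

[Figure: AutoInt network architecture]

Its core module is the multi-head self-attention layer, and it uses residual learning to combine low-order and high-order features.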

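Let the sparse input be $x = [x_1; x_2; \ldots; x_M]$, where $M$ is the total number of fields and $x_i$ is the feature representation of the $i$-th field. If the $i$-th field is a categorical feature, $x_i$ is a one-hot vector; if it is a numerical feature, $x_i$ is a scalar value.

[Figure: input and embedding layers]

That is, for a categorical feature, the embedding matrix $V_i$ of field $i$ maps the field's feature $x_i$ (a one-hot vector) to a dense vector:

$$e_i = V_i x_i$$

If field $i$ is multi-valued, so that $x_i$ is a multi-hot vector, then

$$e_i = \frac{1}{q} V_i x_i$$

where $q$ is the number of values a sample has in the $i$-th field. For a numerical feature,

$$e_m = v_m x_m$$

where $v_m$ is the embedding vector of the $m$-th field and $x_m$ is a scalar value.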
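The three embedding cases (one-hot, multi-valued, numerical) can be sketched in NumPy as follows. This is only an illustration: the sizes, the random matrices, and names such as `V_cat` and `v_num` are assumptions made for the example and do not come from the AutoInt code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # embedding size (illustrative)

# Categorical field with 5 possible values: e_i = V_i x_i, x_i is one-hot.
V_cat = rng.normal(size=(d, 5))
x_onehot = np.zeros(5)
x_onehot[2] = 1.0
e_cat = V_cat @ x_onehot  # picks out column 2 of V_cat

# Multi-valued field: e_i = (1/q) V_i x_i, x_i is multi-hot with q active values.
x_multihot = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
q = x_multihot.sum()
e_multi = V_cat @ x_multihot / q  # average of the active columns

# Numerical field: e_m = v_m * x_m, x_m is a scalar.
v_num = rng.normal(size=d)
x_num = 0.37
e_num = v_num * x_num

# Stack the field embeddings into the [M, d] input of the attention layers.
E = np.stack([e_cat, e_multi, e_num])
print(E.shape)  # (3, 4)
```

In a real model the one-hot product is replaced by an embedding lookup, which gives the same vector without materialising `x_onehot`.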

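Next, a multi-head self-attention network processes the input embeddings, using the key-value attention mechanism. Take feature $m$ as an example. Under a specific attention head $h$, its correlation with feature $k$ is defined as

$$\alpha_{m,k}^{(h)} = \frac{\exp\left(\psi^{(h)}(e_m, e_k)\right)}{\sum_{l=1}^{M} \exp\left(\psi^{(h)}(e_m, e_l)\right)}, \qquad \psi^{(h)}(e_m, e_k) = \left\langle W_{Query}^{(h)} e_m,\; W_{Key}^{(h)} e_k \right\rangle$$

where $\psi^{(h)}(\cdot, \cdot)$ is the attention function, which defines the similarity between feature $m$ and feature $k$. The attention function can be a neural network or an inner product; the paper uses the latter. $W_{Query}^{(h)}, W_{Key}^{(h)} \in \mathbb{R}^{d' \times d}$ are transformation matrices that map the original embeddings into a new feature space.
The representation of feature $m$ in subspace $h$ is then updated as a sum over the $M$ relevant features, weighted by $\alpha_{m,k}^{(h)}$:

$$\tilde{e}_m^{(h)} = \sum_{k=1}^{M} \alpha_{m,k}^{(h)} \left(W_{Value}^{(h)} e_k\right), \qquad W_{Value}^{(h)} \in \mathbb{R}^{d' \times d}$$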
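A minimal NumPy sketch of one attention head may help here. It assumes the inner-product attention function and, like the formula in the paper, applies no $1/\sqrt{d'}$ scaling (the source code discussed later does scale). The function name `attention_head` and all sizes are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max for numerical stability; the result is unchanged.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(E, W_query, W_key, W_value):
    """E: [M, d]; W_*: [d_prime, d]. Returns (alpha [M, M], E_tilde [M, d_prime])."""
    Q = E @ W_query.T  # row m is W_query e_m
    K = E @ W_key.T    # row k is W_key e_k
    V = E @ W_value.T  # row k is W_value e_k
    psi = Q @ K.T      # psi[m, k] = <W_query e_m, W_key e_k>
    alpha = softmax(psi, axis=1)  # normalise over k for each feature m
    return alpha, alpha @ V       # weighted sum of the value vectors

rng = np.random.default_rng(0)
M, d, d_prime = 3, 4, 2
E = rng.normal(size=(M, d))
W_q, W_k, W_v = (rng.normal(size=(d_prime, d)) for _ in range(3))
alpha, E_tilde = attention_head(E, W_q, W_k, W_v)
print(alpha.sum(axis=1))  # each row sums to 1
print(E_tilde.shape)      # (3, 2)
```

Row `m` of `alpha` can be read as how strongly feature `m` attends to every other feature, which is where the interpretability claim comes from.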

The process above is illustrated in the figure below:

[Figure: illustration of the multi-head self-attention layer]

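Note that the operation above gives the representation of feature $m$ in a single subspace (that is, under the $h$-th attention head). Collecting the results from all heads gives

$$\tilde{e}_m = \tilde{e}_m^{(1)} \oplus \tilde{e}_m^{(2)} \oplus \cdots \oplus \tilde{e}_m^{(H)}$$

where $\oplus$ is the concatenation operator and $H$ is the total number of heads.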
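Concatenating the heads can be sketched as a loop over per-head projection matrices. This is an illustrative version under the same assumptions as the formulas (inner-product attention, no scaling); `multi_head` and the sizes are hypothetical names and values chosen for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head(E, heads):
    """E: [M, d]; heads: list of (W_query, W_key, W_value), each [d_prime, d].

    Returns the per-head outputs concatenated along the feature axis,
    shape [M, H * d_prime].
    """
    outputs = []
    for W_q, W_k, W_v in heads:
        alpha = softmax((E @ W_q.T) @ (E @ W_k.T).T, axis=1)  # [M, M]
        outputs.append(alpha @ (E @ W_v.T))                   # [M, d_prime]
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(0)
M, d, d_prime, H = 3, 4, 2, 2
E = rng.normal(size=(M, d))
heads = [tuple(rng.normal(size=(d_prime, d)) for _ in range(3)) for _ in range(H)]
E_multi = multi_head(E, heads)
print(E_multi.shape)  # (3, 4)
```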

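To combine the high-order combinatorial features with the original first-order features, AutoInt adds the two through residual connections:

$$e_m^{Res} = \mathrm{ReLU}\left(\tilde{e}_m + W_{Res}\, e_m\right)$$

where $W_{Res} \in \mathbb{R}^{d'H \times d}$ projects $e_m$ to the same dimension as $\tilde{e}_m$.

$e_m^{Res}$ is the final output of the multi-head self-attention network and represents the high-order interaction feature for feature $m$.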
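The residual step is short enough to write out directly. The sketch below assumes the attention output has already been computed and only shows the projection, the addition, and the ReLU; `residual_update` and the sizes are illustrative.

```python
import numpy as np

def residual_update(E, E_tilde, W_res):
    """E: [M, d]; E_tilde: [M, d_prime * H]; W_res: [d_prime * H, d].

    Projects the original embeddings to the attention output size,
    adds the two, and applies ReLU.
    """
    return np.maximum(E_tilde + E @ W_res.T, 0.0)

rng = np.random.default_rng(0)
M, d, d_prime, H = 3, 4, 2, 3
E = rng.normal(size=(M, d))
E_tilde = rng.normal(size=(M, d_prime * H))  # stand-in for the attention output
W_res = rng.normal(size=(d_prime * H, d))
E_res = residual_update(E, E_tilde, W_res)
print(E_res.shape)  # (3, 6)
```

The projection is what lets the sum work when `d` differs from `d_prime * H`; when the two sizes match, an identity matrix would reduce this to a plain skip connection.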

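The output layer of the network is:

$$\hat{y} = \sigma\left(w^{T}\left(e_1^{Res} \oplus e_2^{Res} \oplus \cdots \oplus e_M^{Res}\right) + b\right)$$

where $w \in \mathbb{R}^{d'HM}$ is the weight vector, $b$ is the bias, and $\sigma$ is the sigmoid function.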
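As a small illustration of the output layer, the sketch below flattens the per-feature outputs, applies a linear map, and squashes the result with a sigmoid. The name `predict_ctr` and the random inputs are assumptions for the example.

```python
import numpy as np

def predict_ctr(E_res, w, b):
    """E_res: [M, d_out]; w: [M * d_out]; b: scalar. Returns a probability."""
    # reshape(-1) lays the rows out one after another, i.e. e_1 ⊕ e_2 ⊕ ... ⊕ e_M.
    logit = E_res.reshape(-1) @ w + b
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
M, d_out = 3, 6
E_res = np.maximum(rng.normal(size=(M, d_out)), 0.0)  # stand-in for the layer output
w = rng.normal(size=M * d_out)
b = 0.1
y_hat = predict_ctr(E_res, w, b)
print(0.0 < y_hat < 1.0)  # True
```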

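The parameters of the whole network are:

$$\Theta = \left\{V_i,\; v_m,\; W_{Query}^{(h)},\; W_{Key}^{(h)},\; W_{Value}^{(h)},\; W_{Res},\; w,\; b\right\}$$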
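To get a feel for the size of this parameter set, the sketch below counts the parameters under simplifying assumptions: a single attention layer, projection matrices without bias terms, one embedding matrix per categorical field, and one embedding vector per numerical field. The function name and the example sizes are hypothetical.

```python
def count_autoint_parameters(vocab_sizes, num_numerical, d, d_prime, num_heads):
    """Count parameters for one attention layer (assumption: no projection biases)."""
    M = len(vocab_sizes) + num_numerical               # total number of fields
    embedding = d * sum(vocab_sizes) + d * num_numerical  # V_i and v_m
    attention = 3 * num_heads * d_prime * d            # W_Query, W_Key, W_Value per head
    residual = (d_prime * num_heads) * d               # W_Res
    output = d_prime * num_heads * M + 1               # w and b
    return embedding + attention + residual + output

total = count_autoint_parameters([5, 7], 1, d=4, d_prime=2, num_heads=2)
print(total)  # 129
```

With realistic vocabularies the embedding term dominates; the attention and residual matrices are small by comparison.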

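That covers the core modules of AutoInt. Next we go through the source code to make the concepts above, such as self-attention and multi-head, more concrete.

AutoInt is defined in https://github.com/DeepGraphLearning/RecommenderSystems/blob/master/featureRec/autoint/model.py, and its main logic is concentrated in the following code: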

# ---------- main part of AutoInt-------------------
self.y_deep = self.embeddings  # None * M * d
for i in range(self.blocks):
    self.y_deep = multihead_attention(queries=self.y_deep,
                                      keys=self.y_deep,
                                      values=self.y_deep,
                                      num_units=self.block_shape[i],
                                      num_heads=self.heads,
                                      dropout_keep_prob=self.dropout_keep_prob[0],
                                      is_training=self.train_phase,
                                      has_residual=self.has_residual)

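Here self.embeddings is the embedding of the input features, with shape [None, M, d], and self.blocks is the number of stacked attention layers. Because this is self-attention, queries, keys, and values all receive self.y_deep as input. Next comes the core module, multihead_attention, which is defined as follows: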

import tensorflow as tf  # TensorFlow 1.x API (tf.layers)


def multihead_attention(queries,
                        keys,
                        values,
                        num_units=None,
                        num_heads=1,
                        dropout_keep_prob=1,
                        is_training=True,
                        has_residual=True):
    """
    queries, keys and values all have shape [None, M, d].
    Below, k denotes num_units and n denotes num_heads.
    """
    if num_units is None:
        num_units = queries.get_shape().as_list()[-1]

    # Linear projections
    ## With num_units = k, Q, K and V all have shape [None, M, k]
    Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu)
    K = tf.layers.dense(keys, num_units, activation=tf.nn.relu)
    V = tf.layers.dense(values, num_units, activation=tf.nn.relu)
    if has_residual:
        V_res = tf.layers.dense(values, num_units, activation=tf.nn.relu)

    # Split and concat
    ## Taking Q as an example: splitting along axis=2 divides Q into n tensors
    ## of shape [None, M, k/n]; concatenating them along axis=0 then gives
    ## Q_ with shape [None * n, M, k/n]
    Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0)
    K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0)
    V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0)

    # Multiplication
    ## Q_, K_ and V_ all have shape [None * n, M, k/n] at this point,
    ## so weights has shape [None * n, M, M]
    weights = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))

    # Scale; for the reason, see the Attention section of
    ## http://nlp.seas.harvard.edu/2018/04/03/attention.html
    weights = weights / (K_.get_shape().as_list()[-1] ** 0.5)

    # Activation
    ## softmax over the last axis; shape stays [None * n, M, M]
    weights = tf.nn.softmax(weights)

    # Dropouts
    weights = tf.layers.dropout(weights, rate=1-dropout_keep_prob,
                                training=tf.convert_to_tensor(is_training))

    # Weighted sum
    ## The output has shape [None * n, M, k/n]
    outputs = tf.matmul(weights, V_)

    # Restore shape
    ## Collect the outputs of all subspaces (all heads) and concatenate them;
    ## outputs has shape [None, M, k]
    outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2)

    # Residual connection
    if has_residual:
        outputs += V_res

    outputs = tf.nn.relu(outputs)
    # Normalize (normalize is a helper defined elsewhere in model.py)
    outputs = normalize(outputs)

    return outputs

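Everything worth saying is written in the code comments~ For more on attention, you can also read my other post: DIN 深度兴趣网络介绍以及源码浅析 (an introduction to the Deep Interest Network with a source-code walkthrough). The page mentioned in the comments, http://nlp.seas.harvard.edu/2018/04/03/attention.html, is an excellent resource on attention and is highly recommended.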
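The split-and-concat trick in the function above can be checked outside TensorFlow. The NumPy sketch below mirrors only the split, scaled softmax, weighted sum, and restore steps (no dense projections, dropout, residual, or normalization) and compares the result against a plain loop over heads. The function names are illustrative and are not part of the AutoInt repository.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def heads_by_split_concat(Q, K, V, n):
    """Q, K, V: [B, M, k]. Split on axis 2, stack the heads on axis 0."""
    Q_ = np.concatenate(np.split(Q, n, axis=2), axis=0)  # [B * n, M, k / n]
    K_ = np.concatenate(np.split(K, n, axis=2), axis=0)
    V_ = np.concatenate(np.split(V, n, axis=2), axis=0)
    weights = Q_ @ K_.transpose(0, 2, 1) / np.sqrt(K_.shape[-1])  # [B * n, M, M]
    weights = softmax(weights, axis=-1)
    out = weights @ V_                                   # [B * n, M, k / n]
    return np.concatenate(np.split(out, n, axis=0), axis=2)  # [B, M, k]

def heads_by_loop(Q, K, V, n):
    """Reference version: handle each head's slice of the last axis separately."""
    size = Q.shape[-1] // n
    outs = []
    for h in range(n):
        s = slice(h * size, (h + 1) * size)
        w = softmax(Q[..., s] @ K[..., s].transpose(0, 2, 1) / np.sqrt(size), axis=-1)
        outs.append(w @ V[..., s])
    return np.concatenate(outs, axis=2)

rng = np.random.default_rng(0)
B, M, k, n = 2, 3, 4, 2
Q, K, V = (rng.normal(size=(B, M, k)) for _ in range(3))
fast = heads_by_split_concat(Q, K, V, n)
slow = heads_by_loop(Q, K, V, n)
print(fast.shape, np.allclose(fast, slow))  # (2, 3, 4) True
```

Stacking the heads along the batch axis lets a single batched matrix multiplication handle all heads at once, which is why the code reshapes instead of looping.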

Summary

My weekend flag~ did not fall after all!!! The first blog post of September is done.