Do Transformers Really Perform Badfor Graph Representation?
1 Introduction
作者们发现关键问题在于如何补回Transformer模型的自注意力层丢失掉的图结构信息!不同于序列数据(NLP, Speech)或网格数据(CV),图的结构信息是图数据特有的属性,且对图的性质预测起着重要的作用。
There are many attempts of leveraging Transformer into the graph domain, but the only effective way is replacing some key modules (e.g., feature aggregation) in classic GNN variants by the softmax attention[47,7,22,48,58,43,13]
- [47] Graph attention networks. ICLR, 2018.
- [7] Graph transformer for graph-to-sequence learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7464–7471, 2020.
- [22] Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, pages 2704–2710, 2020.
- [48] Direct multi-hop attention based graph neural network.arXiv preprint arXiv:2009.14332, 2020.
- [58] Graph-bert: Only attention is needed forlearning graph representations.arXiv preprint arXiv:2001.05140, 2020.
- [43] Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information ProcessingSystems, 33, 2020.
- [13] generalization of transformer networks to graphs. AAAI Workshop on Deep Learning on Graphs: Methods and Applications, 2021
- Centrality Encoding: capture the node importance in the graph. In particular, we leverage the degree centrality for the centrality encoding, where a learnable vectoris assigned to each node according to its degree and added to the node features in the input layer.
- Spatial Encoding: capture the structural relation between nodes.
- Edge Encoding
2 Graphormer
2.1 Structural Encodings in Graphormer
2.1.1 a Centrality Encoding
In Graphormer, we use the degree centrality, which is one of the standard centrality measures inliterature, as an additional signal to the neural network. To be specific, we develop a Centrality Encoding which assigns each node two real-valued embedding vectors according to its indegree and outdegree.
2.1.2 a Centrality Encoding
An advantage of Transformer is its global receptive field.
Spatial Encoding:
In this paper, we choose φ(vi,vj) to be the distance of the shortest path (SPD) between vi and vj if the two nodes are connected. If not, we set the output ofφto be a special value, i.e., -1. We assign each (feasible) output value a learnable scalar which will serve as a bias term in the self-attention module. Denote Aij as the (i,j)-element of the Query-Key product matrix A, we have:
2.1.3 Edge Encoding in the Attention
In many graph tasks, edges also have structural features.
In the first method, the edge features areadded to the associated nodes’ features [21,29].
- [21] Open graph benchmark: Datasets for machine learning on graphs.arXiv preprintarXiv:2005.00687, 2020.
- [29] Deepergcn: All you need to train deepergcns.arXiv preprint arXiv:2006.07739, 2020
In the second method, for each node, its associated edges’ features will be used together with the node features in the aggregation [15,51,25].
- [51] How powerful are graph neural networks?InInternational Conference on Learning Representations, 2019.
- [25] Semi-supervised classification with graph convolutional networks.arXiv preprint arXiv:1609.02907, 2016
However, such ways of using edge feature only propagate the edge information to its associated nodes, which may not be an effective way to leverage edge information in representation of the whole graph.
a new edge encoding method in Graphormer:
3.2 Implementation Details of Graphormer
Graphormer Layer:
- MHA: multi-head self-attention (MHA)
- FFN: the feed-forward blocks
- LN: the layer normalization
Special Node:
生成一个VNODE连接图中所有的点,而它与所有节点的 spatial encodings 是 a distinct learnable scalar
3 Experiments
3.1 OGB Large-Scale Challenge
3.2 Graph Representation