AffinityNet: Semi-Supervised Few-Shot Learning for Disease Type Prediction
- Publication:
- Code:
- Dataset:
- Introduction
- Related Work
- Affinity Network Model (AffinityNet)
- Experiments
- Conclusions
Publication:
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)
Code:
AffinityNet: https://github.com/BeautyOfWeb/AffinityNet
Scikit-learn: http://scikit-learn.org
Dataset:
Harmonized kidney and uterus cancer gene expression datasets were downloaded from Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov) (Grossman et al. 2016).
Introduction
Patients, drugs, networks, etc., are all complex objects with heterogeneous features or attributes, and these features usually lack a clear structure. Deep learning models such as Convolutional Neural Networks (CNNs) cannot be directly applied to complex objects whose features are not structurally ordered.
One critical challenge in the cancer patient clustering problem is the "big p, small N" problem. We do not have an "ImageNet" (Russakovsky et al. 2015) to train deep learning models that can learn good representations from raw features, and these features are not "naturally" ordered. Thus, we cannot directly use convolutional neural networks with small filters to extract abstract local features.
For a clustering/classification task, nodes/objects belonging to the same cluster should have similar representations that are near the cluster centroid. So, we developed the k-nearest-neighbor (kNN) attention pooling layer, which applies the attention mechanism to learning node representations. With the kNN attention pooling layer, each node’s representation is decided by its k-nearest neighbors as well as itself, ensuring that similar nodes will have similar learned representations.
We propose the Affinity Network Model (AffinityNet), which consists of stacked kNN attention pooling layers to learn deep representations of a set of objects. AffinityNet is similar to the Graph Attention Model (GAM) (Veličković et al. 2017), but GAM is designed for representation learning on graphs and does not directly apply to data without a known graph; AffinityNet generalizes GAM to facilitate representation learning on any collection of objects, with or without a known graph.
In addition to learning deep representations for classifying objects, feature selection is also important in biomedical research. In order to facilitate feature selection in a “deep learning” way, we propose a feature attention layer, a simple special case of the kNN attention pooling layer which can be incorporated into a neural network model and directly learn feature weights using backpropagation.
We performed experiments on both synthetic and real cancer genomics data. The results demonstrated that our AffinityNet model has better generalization power than conventional neural network models for few-shot learning.
Related Work
In graph learning, a graph has a number of nodes and edges (both of which can have features). Combining node features with the graph structure usually works better than using node features alone. However, existing graph learning algorithms require the graph to be known, and many also require the whole graph as input, so they do not scale well to large graphs. Our proposed AffinityNet model generalizes graph learning to a collection of objects with or without a known graph.
As the key component of AffinityNet, the kNN attention pooling layer is also related to normalization layers in deep learning. Normalization layers use batch statistics or feature statistics to normalize instance features, while kNN attention pooling layers apply the attention mechanism to the learned instance representations to ensure that similar instances have similar representations.
Our proposed kNN attention pooling layer applies pooling to node representations instead of individual features. It combines normalization, attention, and pooling, making it more general and powerful, and it can serve as an implicit regularizer that helps the network generalize well for semi-supervised few-shot learning.
Affinity Network Model (AffinityNet)
One key ingredient for the success of deep learning is its ability to learn a good representation through multiple complex nonlinear transformations. While conventional deep learning models often perform well when lots of training data is available, our goal is to design new models that can learn a good feature transformation in a transparent and data efficient way. We developed the kNN attention pooling layer, and used it to construct the AffinityNet Model.
In a typical AffinityNet model as shown in Fig. 1, the input layer is followed by a feature attention layer (a simple special case of kNN attention pooling layer used for raw feature selection), and then followed by multiple stacked kNN attention pooling layers (Fig. 1 only illustrates one kNN attention pooling layer).
The output of the last kNN attention pooling layer will be the newly learned network representations, which can be used for classification or regression tasks.
Figure 1: AffinityNet Overview
Though it is possible to train AffinityNet with only a few labeled examples, it is more advantageous to use it as a semi-supervised learning framework (i.e., using both labeled and unlabeled data during training).
kNN attention pooling layer
A good classification model should have the ability to learn a feature transformation. As an object’s k-nearest neighbors should have similar feature representations, we propose the kNN attention pooling layer to incorporate neighborhood information using attention-based pooling (Eq. 1):
h_i' = Σ_{j∈N(i)} a_ij · f(h_j)   (Eq. 1)

- h_i: input feature representation for object i
- h_i': transformed feature representation for object i
- N(i): neighborhood of object i (if a graph is given, we can use it to determine the neighborhood; if the given graph is very large with a high degree, we can randomly sample k neighbors, where k is a fixed small number)
- f(⋅): a nonlinear transformation, for example an affine layer with weight W and bias b followed by ReLU() nonlinear activation (Eq. 2: f(h) = ReLU(Wh + b))
- a_ij = a(h_i, h_j): the normalized attention from object i to object j
- a(⋅,⋅): the attention kernel
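To make Eq. 1 concrete, here is a minimal PyTorch sketch (the module name `KNNAttentionPooling` and the choice of a cosine-similarity kernel with a fixed k are assumptions for illustration; this is not the authors' released implementation, which is linked above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KNNAttentionPooling(nn.Module):
    """Sketch of a kNN attention pooling layer implementing Eq. 1:
    h_i' = sum_{j in N(i)} a_ij * f(h_j), with a_ij softmax-normalized over N(i)."""

    def __init__(self, in_dim, out_dim, k=3):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())  # f(.) as in Eq. 2
        self.k = k

    def forward(self, h):                        # h: (N, in_dim), the whole object set
        z = self.f(h)                            # transformed representations f(h_j)
        sim = F.cosine_similarity(h.unsqueeze(1), h.unsqueeze(0), dim=-1)  # a(h_i, h_j)
        topk_sim, topk_idx = sim.topk(self.k, dim=-1)   # neighborhood N(i), includes self
        a = F.softmax(topk_sim, dim=-1)                  # normalized attention a_ij
        return (a.unsqueeze(-1) * z[topk_idx]).sum(dim=1)  # attention-weighted pooling
```

For example, `out = KNNAttentionPooling(1000, 64, k=3)(patient_features)` would transform an (N, 1000) patient-feature matrix into (N, 64) pooled representations.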
Attention kernels
Objects belonging to the same class should be clustered together in the learned feature space. kNN attention pooling layer uses weighted pooling to “attract” similar objects together in the transformed feature space. Attention kernels essentially calculate the similarities among objects to facilitate weighted pooling.
There are many choices of attention kernels:
- Cosine similarity: a(h_i, h_j) = ⟨h_i, h_j⟩ / (‖h_i‖ ‖h_j‖)
- Inner product (Vaswani et al. 2017): a(h_i, h_j) = ⟨h_i, h_j⟩
- Perceptron affine kernel (Veličković et al. 2017)
- Inverse distance with weighted L2 norm (w is the feature weight)
In order to calculate a weighted average of new representations, we can use the Softmax function to normalize the attention (other normalizations are also feasible). The normalized attention kernel is therefore a_ij = exp(a(h_i, h_j)) / Σ_{k∈N(i)} exp(a(h_i, h_k)).
If the graph is not given, in order to determine N(i), we can use an attention kernel to calculate an affinity/similarity graph (i.e., the pairwise similarities among all the objects), and then use this affinity graph to decide the neighborhood N(i).
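As an illustration, the first two kernels and the softmax normalization over a neighborhood can be written as below (the perceptron affine and weighted-L2 kernels are omitted because their exact parameterizations are not reproduced in these notes; the function names are hypothetical):

```python
import torch
import torch.nn.functional as F

def cosine_kernel(h):
    """a(h_i, h_j) = <h_i, h_j> / (||h_i|| * ||h_j||), returned as an (N, N) matrix."""
    z = F.normalize(h, dim=-1)
    return z @ z.t()

def inner_product_kernel(h):
    """a(h_i, h_j) = <h_i, h_j>."""
    return h @ h.t()

def normalized_attention(kernel_matrix, k):
    """Softmax-normalize kernel values over each object's k nearest neighbors N(i)."""
    topk_val, topk_idx = kernel_matrix.topk(k, dim=-1)   # neighborhood N(i)
    return F.softmax(topk_val, dim=-1), topk_idx          # a_ij and the neighbor indices
```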
Layer-specific dynamic affinity graph
The kNN attention pooling layer can be applied to a collection of objects whether or not a graph is given: in either case we can always calculate an affinity graph G_n based on node features using some similarity metric, including the aforementioned attention kernels. As our AffinityNet model contains multiple kNN pooling layers stacked together, we can calculate a layer-specific dynamic affinity graph using the learned node feature representations from each layer during training.
Also, we can use the graph calculated using features from the previous layer to determine the k-nearest-neighborhood for the next layer. This can be seen as an implicit regularizer.
For layer l, we can calculate a layer-specific dynamic affinity graph G^(l) using Eq. 8:
- G_e: the given graph, if available; when not available, set λ = 0 (0 ≤ λ ≤ 1)
- G_n^(l) and G_n^(l−1): node-feature-derived affinity graphs for the current layer l and the previous layer l−1
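Since Eq. 8 itself is not reproduced in these notes, the sketch below shows just one plausible convex-combination form consistent with the symbol descriptions above; the mixing weight `beta` between the current and previous layers' graphs is an added assumption, and `lam` follows the "set λ = 0 when no graph is given" rule:

```python
import torch

def dynamic_affinity_graph(G_n_curr, G_n_prev, G_e=None, lam=0.5, beta=0.5):
    """Illustrative blend for a layer-specific affinity graph (not the paper's exact Eq. 8).

    G_n_curr / G_n_prev: (N, N) affinity graphs computed from the current and
    previous layers' node features; G_e: the given graph, if any.
    """
    G_feat = beta * G_n_curr + (1.0 - beta) * G_n_prev   # blend feature-derived graphs
    if G_e is None:
        lam = 0.0                                        # "when not available, set lambda = 0"
        G_e = torch.zeros_like(G_feat)
    return lam * G_e + (1.0 - lam) * G_feat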
If the input of the AffinityNet model consists of N objects, then we will learn dynamic affinity graphs for these N objects during training. After training, the final learned affinity graph from the last layer can also be used for spectral clustering. So, we also call our framework affinity network learning.
Semi-supervised few-shot learning
Semi-supervised few-shot learning only allows using very few labeled instances to train a model and requires the model to generalize well. It is especially useful for cancer patient clustering problems. If we can obtain a few labeled training examples, we can use the AffinityNet model for semi-supervised learning.
The input of the AffinityNet model is the patient-feature matrix consisting of all patients, and the output of the model is the newly learned patient representations as well as class labels. We only backpropagate the classification error for those labeled patients.
AffinityNet can utilize unlabeled instances for calculating kNN attention-based representations in the whole sample pool. In a sense, the kNN attention pooling layer performs both nonlinear transformation and “clustering” (attracting similar instances together in the learned feature space) during training. Even though the labels of most patients are unknown, their feature representations can be used for learning a global affinity graph, which is useful to cluster or classify all patients in the cohort.
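A minimal sketch of one semi-supervised training step under this scheme: the whole patient-feature matrix is fed forward, but the cross-entropy loss is computed only on the labeled indices (names such as `labeled_idx` are placeholders):

```python
import torch
import torch.nn.functional as F

def semi_supervised_step(model, optimizer, features, labels, labeled_idx):
    """One training step: forward all patients, backpropagate only on labeled ones."""
    model.train()
    logits = model(features)                     # uses labeled and unlabeled patients
    loss = F.cross_entropy(logits[labeled_idx],  # classification error only on the
                           labels[labeled_idx])  # labeled subset
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```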
Our AffinityNet model can also be used for data distillation (Radosavovic et al. 2017). When dealing with very large graphs, we can feed a small batch of instances (i.e., a partial graph) at a time to the AffinityNet model. Though each batch may contain different instances, the kNN pooling layer can still work well with the attention mechanism. Our PyTorch implementation of AffinityNet can even handle the extreme case where only one instance is fed into the model at a time, in which case the AffinityNet model operates as a conventional deep learning model to only learn a nonlinear transformation without kNN attention pooling operation.
Feature Attention Layer
Deep neural networks can learn good hierarchical local feature extractors (such as convolutional filters or inception modules (Szegedy et al. 2017)) automatically through gradient descent. But local feature operations such as convolutions require features to be ordered structurally, which is not the case for our unordered features, so we cannot directly learn a local feature extractor. Instead, we have to learn a feature selector that can select important individual features.
In addition, there can be many redundant, noisy, or irrelevant features, and the Euclidean distance between objects using all the features may be dominated by the irrelevant ones. This motivates us to develop a feature attention layer as a simple special case of kNN attention pooling layer.
- h_i ∈ R^p: the feature vector of object i
- w ∈ R^p: the feature attention (i.e., weight) vector
Instead of the commonly used affine transformation followed by ReLU() nonlinearity as in Eq. 2, the feature attention layer performs an element-wise multiplication, h_i' = w ⊙ h_i (Eq. 10, where ⊙ is the element-wise multiplication operator), under the constraint that the weight w is non-negative and normalized (Eq. 9). This is the only difference between the feature attention layer and the kNN attention pooling layer.
Before transformation, the learned distance between objects i and j is d_ij (Eq. 11), which can be skewed by noisy and irrelevant features. After transformation, the distance d_ij' (Eq. 12) can be more informative for classification tasks.
Note that the kNN attention pooling (Eq. 1) is still used after the feature transformation (Eq. 10). The feature attention layer can select important individual features much more easily than a fully connected layer, and can increase the generalization power of a neural network model in certain cases with very few training examples.
The feature attention layer only has parameter w (Eq. 9), which directly corresponds to the learned feature weight. Because of the constraint on w (Eq. 9), the feature attention layer also learns a weighted Euclidean metric during training.
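A hedged sketch of such a layer: a single weight per input feature, kept non-negative and normalized here via a softmax over a free parameter (one common way to satisfy a constraint like Eq. 9, not necessarily the paper's exact parameterization), applied element-wise as in Eq. 10:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAttention(nn.Module):
    """Sketch of a feature attention layer: h_i' = w ⊙ h_i with w >= 0 and sum(w) = 1."""

    def __init__(self, num_features):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_features))  # free parameter

    @property
    def weight(self):
        # Softmax keeps the learned feature weights non-negative and summing to one.
        return F.softmax(self.logits, dim=0)

    def forward(self, h):            # h: (N, p) object-feature matrix
        return h * self.weight       # element-wise multiplication (Eq. 10)
```

After training, `layer.weight` can be inspected directly as the learned feature importance.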
Experiments
General experimental part
Simulations
To generate simulated data, we sampled 1000 points as the true signal and then appended 40-dimensional Gaussian noise. Thus, each point has 42 dimensions, with the first two containing the true signal and the rest being random noise.
We plotted the true signal (i.e., the first two dimensions) in Fig. 2a; in Fig. 2b, the corrupted signal is dominated by the added irrelevant features and the clusters are no longer obvious.
We constructed two models to predict class labels: "NeuralNet" and "AffinityNet".
We randomly selected 1% of the data (40 out of 4000 points) for training both models and compared their accuracies on the test set. The "AffinityNet" model with a feature attention layer successfully selected the true signal features and achieved 98.2% accuracy on the test set, while the plain neural network model only achieved 46.9%.
(Panels a, b) Even though both models achieve 100% training accuracy within a few iterations, the "AffinityNet" model generalizes better than the plain neural network model: there is a big gap between the training and test accuracy curves for the "NeuralNet" model when the training set is small. The good generalization of our model partly relies on the feature attention layer successfully picking up the true signals from the noise.
(Panels c, d) In "AffinityNet", the weights of the true signal features are much higher than those of the noise features, while "NeuralNet" did not select the true signal features well.
Tumor disease type classification
Dataset: Harmonized kidney and uterus cancer gene expression datasets were downloaded from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov). We preprocessed the data and selected the top 1000 most variable gene expression features as the model input. We classify each tumor sample into its disease type, treating the kidney and uterus datasets separately.
We compared our model (“AffinityNet”) with five other methods: “NeuralNet” (conventional deep learning model), “SVM”, “Naive Bayes”, “Random Forest”, and “Nearest Neighbors” (kNN).
For the kNN attention pooling layer, we used the "cosine similarity" kernel and set the number of nearest neighbors to k = 2 (kidney cancer) and k = 3 (uterus cancer). For both "AffinityNet" and "NeuralNet", we used ReLU() nonlinear activation in the hidden layer. We used the scikit-learn implementations (http://scikit-learn.org) of "Naive Bayes", "SVM", "Nearest Neighbors", and "Random Forest" with default settings.
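For the scikit-learn baselines with default settings, the setup might look like the sketch below (the concrete classes, e.g. GaussianNB for "Naive Bayes", are assumptions; the paper only names the methods):

```python
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Baseline classifiers, all with scikit-learn default settings.
baselines = {
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(),
    "Nearest Neighbors": KNeighborsClassifier(),
}

# e.g. fit each baseline on the labeled training split and score on the test set:
# for name, clf in baselines.items():
#     clf.fit(X_train, y_train)
#     y_pred = clf.predict(X_test)
```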
We progressively increased the training portion from 1% to 50%, and reported the adjusted mutual information (AMI) on the test set (Table 1 and Table 2).
We ran experiments 20 times with different random seeds to generate different training and test sets. For both cancer types, our model clearly outperformed all other models, especially when the training portion is small. This suggests our model is highly data efficient. (One reason for this is that kNN attention pooling layer is in a sense performing “clustering” during training, and it is less likely to overfit a small number of training examples; the other reason is the input of kNN attention pooling layers can contain not only labeled training examples but also unlabeled examples. It performs semi-supervised learning with a few labeled examples as a guide for finding “clusters” among all the data points.)
For kidney cancer, unlike the other methods, our model did not improve with more training data, partly because there are a few very hard cases in the kidney cancer dataset while all other cases are almost linearly separable. Our model can easily pick up the linearly separable clusters with only a few training examples, but the very hard cases remain difficult to separate even when more training data is available.
Experiments specific to the improved model
Semi-supervised clustering
If we can obtain label information for a few samples, we can use “AffinityNet” for semi-supervised clustering. “AffinityNet” and “NeuralNet” can learn a new feature representation through multiple nonlinear transformations.
For “AffinityNet”, we can use all the data points during training with kNN attention pooling, but only backpropagate on labeled training examples. We get the learned new representations for all the data points once the training process is finished.
For conventional neural network models, since each data point is independently trained, we only use labeled examples during training. After training, we have to use the learned model to generate new feature representations for all the data points.
In order to evaluate the quality of the learned feature representations with a few training examples, we performed clustering using these transformed features and using the original features, and compared them with groundtruth class labels.
We compared the performance using “AffinityNet” and “NeuralNet” on kidney data set. We randomly selected 1% of data for training, and ran experiments 30 times. After training, we performed spectral clustering on the learned patient-feature matrices. We also performed spectral clustering on the original patient-feature matrix as a baseline method (AMI = 0.71, blue dotted line in the figure).
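A short scikit-learn sketch of this evaluation step (spectral clustering on a learned patient-feature matrix, or the original one as the baseline, scored against ground-truth labels with adjusted mutual information); the affinity choice and the names `learned_features`, `original_features`, and `n_types` are placeholders:

```python
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_mutual_info_score

def clustering_ami(feature_matrix, true_labels, n_clusters):
    """Spectral clustering on a patient-feature matrix, scored with AMI."""
    pred = SpectralClustering(n_clusters=n_clusters,
                              affinity="nearest_neighbors",
                              random_state=0).fit_predict(feature_matrix)
    return adjusted_mutual_info_score(true_labels, pred)

# e.g. clustering_ami(learned_features, labels, n_clusters=n_types)   # learned representation
#      clustering_ami(original_features, labels, n_clusters=n_types)  # baseline
```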
Our model outperformed the "NeuralNet" model and the baseline (the "NeuralNet" model scored slightly below the baseline, probably because it overfitted the few training examples).
Only "AffinityNet" can learn a good feature transformation, by facilitating semi-supervised few-shot learning with feature attention and kNN attention pooling layers.
Combine with Cox model for survival analysis
For many cancer genomics studies, cancer subtype information is not known, but patient survival information is available. We replaced the last layer (i.e., linear classifier) in the model with a regression layer following the Cox proportional hazards model. We used backpropagation to learn model parameters that maximize partial likelihood in the Cox model.
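A minimal PyTorch sketch of a negative Cox partial log-likelihood loss that such a regression layer could be trained with (ties are ignored for simplicity; this is an illustrative loss, not the paper's exact code):

```python
import torch

def neg_cox_partial_log_likelihood(risk, time, event):
    """risk: (N,) predicted log-hazard scores from the network's last layer;
    time: (N,) survival or censoring times; event: (N,) 1 if death observed, 0 if censored."""
    event = event.float()
    order = torch.argsort(time, descending=True)          # sort so patient i's risk set
    risk, event = risk[order], event[order]                # is exactly patients[0..i]
    log_cum_hazard = torch.logcumsumexp(risk, dim=0)       # log sum_{j: t_j >= t_i} exp(risk_j)
    partial_ll = (risk - log_cum_hazard) * event           # only observed events contribute
    return -partial_ll.sum() / event.sum().clamp(min=1.0)  # negate so we can minimize
```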
We performed experiments on kidney cancer dataset that has more than 600 samples. We progressively increased the training portion from 10% to 40%. We used 30% of data as validation and the remaining as the test set. As a baseline method, we used age, gender and known disease types as covariates to fit a Cox model. We ran experiments 20 times with random seeds, and reported the concordance index on the test set for both our model and the baseline Cox model.
In the boxplots, light blue denotes the baseline method and light green denotes our model.
Our model outperformed the baseline model by a significant margin.
Conclusions
Deep learning has achieved great success in computer vision and natural language processing, where features are well structured and large amounts of training data are available. However, in biomedical research the training sample size is usually small while the feature dimension is very high, so deep learning models tend to overfit the training data and fail to generalize.
To alleviate this problem, we propose the AffinityNet model that contains stacked feature attention and kNN attention pooling layers to facilitate semi-supervised few-shot learning.
Regardless of whether a graph is given or not, kNN attention pooling layer can use attention kernels to calculate dynamic affinity graphs during training. The affinity graphs are used for selecting k-nearest neighbors for attention-based pooling.
kNN attention pooling layers essentially add a “clustering” operation (“forcing” similar objects to have similar representations through attention-based pooling) after the nonlinear feature transformations, which can serve as an implicit regularizer for classification-related tasks. kNN attention pooling layers can be plugged into a deep learning model as a basic building block just like convolutional layers.
The feature attention layer is a simple special case of the kNN attention pooling layer. It is useful for automatically selecting important individual input features, with a normalized non-negative weight learned for each feature.
We have conducted extensive experiments using AffinityNet on two cancer genomics datasets and achieved satisfactory results.
AffinityNet alleviates the lack of sufficient labeled training data by utilizing unlabeled data with kNN attention pooling, and can be used to analyze large amounts of cancer genomics data for patient clustering and disease subtype discovery.
Future work may focus on designing deep learning modules that can incorporate biological knowledge for various tasks.
(A corresponding self-made slide deck is available.)