Paper is all you need

An End Is Also a Beginning

This is an important date worth writing down: 2018-10-29.
Today I have finished the first paper of my PhD project. From today on, I will post my thoughts on the papers I read each week on this website.

Good words

Adjective

  1. tractable: easy to deal with
  2. well-studied: thoroughly researched
  3. incompatible: not compatible, conflicting
  4. they make impractical assumptions (impractical: unrealistic)
  5. Extensive experiments (extensive: broad, large in number)
  6. diverse: of many different kinds
  7. despite its undeniable success (undeniable: indisputable)
  8. empirical insights: deep observations (insights) grounded in experience and experiments
  9. a large amount of: a great deal of

Adverb

  1. vice versa: the other way around
  2. projection domain shift exists inherently in the regression of GZSL (inherently: as an intrinsic property)
  3. We revisit the dissimilarity representation [9] in the new context of GZSL.
  4. explicitly interpretable information
  5. Log-Euclidean distance (LED), and further derive a kernel function that explicitly (as an explicit function) maps the covariance matrix from the Riemannian manifold to a Euclidean space
  6. regarding: concerning
  7. Concretely == specifically

Verb

  1. Compared to ..., our results ...; comparing ..., we ...
  2. compare with
  3. Unfortunately, these ways suffer from the problem of error accumulation, as they undergo two-stage probabilistic inference so that probability errors are accumulated.
  4. to express "is composed of": is a composition of, is comprised of
  5. The major limitation originates from the fact that the classical filters are invariant at each location. (originates from: stems from)
  6. elaborate (v.): to explain or describe in detail
  7. More specifically, HGNN is a general framework which can incorporate multi-modal data and complicated data correlations. (incorporate: to merge, not merely combine)
  8. Pair-based metric learning often generates a large amount of pair-wise samples, which are highly redundant and include many uninformative samples. Training with random sampling can be overwhelmed by these redundant samples, which significantly degrades the model capability and also slows the convergence. Therefore, sampling plays a key role in pair-based metric learning.
  9. complement each other: be mutually complementary
  10. exploits == utilizes == employs == leverages == takes full advantage of

Conjunction

  1. So, Consequently, Thus, Therefore, As a result, As a consequence, To this end
  2. subsequently: afterwards
  3. More concretely, == More specifically,
  4. As such,
  5. Originally, (at first)
  6. Altogether (all in all)
  7. In light of this, (in view of this)
  8. I was caught off-guard by the "conference numbers" (caught off-guard: taken by surprise)
  9. according to == based on

Specific for mathematics

  1. vertex
  2. adjacency matrix

Noun

  1. Graph convolutional neural networks have shown superiority in representation learning compared with traditional neural networks due to their ability to use the graph structure of data. (superiority: the state of being superior)
  2. Under such circumstances, (circumstances: the surrounding conditions)

Abstract

Such == this

Beginning of Abstract

  1. To understand the processing of information underlying these counterintuitive properties, we visualize the features of shape and texture that underlie identity decisions. Then, we shed light on the information processing inside the black box, demonstrate how the hidden layers represent features for decision, and characterize the invariance of these representations to changes of 3D pose.
  2. Face recognition has witnessed significant progress due to the advances of deep convolutional neural networks (CNNs), the central challenge of which is feature discrimination. To address it, one group tries to exploit mining-based strategies (e.g., hard example mining and focal loss) to focus on the informative examples. The other group is devoted to designing margin-based loss functions (e.g., angular, additive and additive angular margins) to increase the feature margin from the perspective of the ground truth class. Both of them have been well-verified to learn discriminative features. However, they suffer from either the ambiguity of hard examples or the lack of discriminative power of other classes. In this paper, we design a novel loss function, namely support vector guided softmax loss (SV-Softmax), which adaptively emphasizes the mis-classified points (support vectors) to guide the discriminative feature learning. So the developed SV-Softmax loss is able to eliminate the ambiguity of hard examples as well as absorb the discriminative power of other classes, and thus results in more discriminative features.
  3. Beyond that, we conduct an exhaustive analysis of the role of training data in performance.
  4. To the best of our knowledge, this is the first attempt to inherit the advantages of mining-based and margin-based losses into one framework.
  5. Symmetric Positive Definite (SPD) matrix learning methods have become popular in many image and video processing tasks, thanks to their ability to learn appropriate statistical representations while respecting Riemannian geometry of underlying SPD manifolds.
  6. In particular, we devise bilinear mapping layers to transform input SPD matrices to more desirable SPD matrices, exploit eigenvalue rectification layers (verb: rectify) to apply a non-linear activation function to the new SPD matrices, and design an eigenvalue logarithm layer to perform Riemannian computing on the resulting SPD matrices for regular output layers.
  7. We utilize the process of attribute detection to generate corresponding attribute-part detectors, whose invariance to many influences like poses and camera views can be guaranteed.
  8. In this paper, unlike most existing methods simply taking attribute learning as a classification problem, we perform it in a different way with the motivation that attributes are related to specific local regions, which refers to the perceptual ability of attributes.
  9. In this work, we explore how to harness the similar natural characteristics existing in the samples from the target domain for learning to conduct person re-ID in an unsupervised manner. (harness: to utilize)

End of Abstract

  1. Experimental results on several benchmarks have demonstrated the effectiveness of our approach over state-of-the-arts.
  2. Extensive experiments demonstrate the superior performance of our algorithm over several state-of-the-art algorithms on small-scale datasets and comparable performance on large-scale re-ID datasets
  3. Person re-identification (re-id) is a fundamental technique to associate various person images, captured by different surveillance cameras, to the same person.
  4. Extensive experiments demonstrate that by simply substituting OLM for standard linear module without revising any experimental protocols, our method largely improves the performance of the state-of-the-art networks, including Inception and residual networks on CIFAR and ImageNet datasets.

Introduction

Begin of Introduction

  1. Such hierarchy and deep architectures equip DNNs with large capacity to represent complicated relationships between inputs and outputs.
  2. the dependencies amplify as the network becomes deeper (here "as" expresses "along with")
  3. Person attribute learning has been studied a lot in recent years, and has been proven beneficial for the person Re-ID task.

Intermediate Sentences of Introduction

  1. A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions [3]. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets.
  2. Recent efforts toward reducing these overheads involve pruning and compressing the weights of various layers without hurting original accuracy.
  3. That is, the limitation seems to lie in the difficulty of optimisation rather than in the network size
  4. A starting point to understand information processing in CNNs (and the brain) is to identify the features represented across their respective computational hierarchies
  5. An open question remains whether a well-constrained CNN (i.e. constrained by architecture, time, representation, function and so forth) could learn the mid-to-high-level features that flexibly represent task-dependent visual categories in the human visual system.
  6. There are some pitfalls in this paradigm. Firstly, the image features φ(x), either crafted manually or from a pre-trained CNN model, may not be representative enough for the zero-shot recognition task. Though the features from a pre-trained CNN model are learned, they are restricted to a fixed set of images (e.g., ImageNet [24]), which is not optimal for a particular ZSL task.
  7. Secondly, the user-defined attributes (y) are semantically descriptive, but they are not exhaustive, thus limiting their discriminativeness in classification. There may exist discriminative visual clues not reflected by the pre-defined attributes in ZSL datasets, e.g., the huge mouths of hippos. On the other hand, as shown in Figure 1, the annotated attributes, such as big, strong and ground, are shared by many object categories. This is desired for knowledge transfer between categories, especially from seen to unseen categories. However, if two categories (e.g. cheetah and tiger) share too many (user-defined) attributes, they will be hardly distinguishable in the space of attribute vectors.
  8. Thirdly, low-level feature extraction and embedding space construction in existing ZSL approaches are treated separately, and usually carried out in isolation. Therefore, few existing works ever consider those two components in a unified framework. To address those pitfalls, we propose an end-to-end model capable of learning latent discriminative features (LDF) for ZSL in both visual and semantic space. Specifically, our contributions are:
  9. A problem thus arises: A DNN comprises multiple feature extraction layers stacked one on top of each other; and it is widely acknowledged [20, 39, 14] that, when progressing from the bottom to the top layers, the visual concepts captured by the feature maps tend to be more abstract and of higher semantic level.
  10. To enhance the performance of CNNs, recent studies have mainly investigated three important factors of networks: depth, width, and cardinality.
  11. They empirically show that cardinality not only saves the total number of parameters but also results in stronger representation power than the other two factors: depth and width
  12. Through empirical statistics on the classification errors, we find that the network is able to predict several candidate categories that include the correct one with high confidence. However, making the correct final decision on the single category is difficult for the network based models, due to the distraction from other candidate categories. Motivated by the above observations, we propose a novel “Learning by Rethinking” (LR) algorithm in this paper: instead of making the final decision based on one-pass of the data through the network, we introduce feedback connections and allow the network based models to “re-think” the decision and take the high-level feedback information into feature extraction. Benefiting from the feedback, the model is able to extract more discriminative low-level features with the guidance from the high-level information.
  13. we may make accurate Re-ID more tractable.
  14. Specifically, we flow global contextual information obtained at top sides into bottom sides. The top
    contextual information will learn to guide the bottom sides to construct the contextual features at fine spatial scales only emphasizing salient objects. Hence the obtained contexts are different from side-output features or some combinations of them which only contain or at least emphasize local representations for an image.
  15. Machine learning on visual recognition greatly relies on many manually labeled images. However, labeling images is costly work, especially for fine-grained annotation in specific domains.
  16. Inspired by humans' ability to classify visual objects of unseen classes according to their previous knowledge, GZSL is proposed. GZSL is to classify objects of unseen classes within the whole scope of classes [23, 24]. If the classification is just within the scope of unseen classes, it is known as zero-shot learning (ZSL) [4, 14]. As GZSL is more practical and valuable than ZSL, it has gradually attracted more attention.
  17. how to guide the learning process to weaken the effect of projection domain shift becomes a key factor
  18. Learning with no data, a.k.a. (also known as) Zero-Shot Learning (ZSL), has been proved to be an effective way to tackle the increasing difficulty posed by insufficient training samples.

End of Introduction

  1. The experimental results demonstrate our simple idea can favorably outperform recent state-of-the-art methods that use heavily engineered networks, especially for fine-grained annotation in specific domains.
  2. Since our ultimate goal is classification

Contribution

  1. We build an hourglass network with intermediate supervision to learn hierarchical contexts, which are generated with the guidance of global contextual information and thus only emphasize salient objects at different scales
  2. We extensively compare our method with recent state-of-the-art methods on six popular datasets. Our simple method favorably outperforms these competitors under various metrics.
  3. We propose a hierarchical context aggregation module to ensure the network is optimized from the top sides to bottom sides. We aggregate the learned hierarchical contexts at different scales to perform accurate salient
    object detection unlike previous studies [16, 55, 43] that fuse side-output features or some complex combinations of side-outputs

Related Work

Beginning of Related Work

  1. It reveals the potential of making the region feature extraction step learnable. However, its form still resembles the regular grid based pooling. The learnable part is limited to bin offsets only.
  2. Most (if not all) previous region feature extraction methods are shown to be specializations of this formulation by specifying the weights in different ways, mostly hand-crafted.
  3. We present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly. In contrast to pruning weights, this approach does not result in sparse connectivity patterns.
  4. Zhao et al. [58] added a pyramid pooling module for global context construction upon the final
    layer of the deep network, by which they significantly improved the performance of semantic segmentation.

End of Related Work

  1. Hence deep learning based methods have dominated this field due to their powerful representation capability.
  2. The full literature review of salient object detection is beyond the scope of this paper. Please refer to [2, 8, 12] for a more comprehensive survey. In this paper, we focus on context learning rather than the previous multi-level feature fusion for the improvement of saliency detection. Different from [43] that uses multiple networks, each of which has a pyramid pooling module [58] at the top, we propose an elegant single network. Different from [59] that uses multi-scale inputs, we use single-scale inputs to extract multi-level contexts. The resulting model is simple yet effective.

Proposed method

Beginning of Proposed method

  1. In this section, we will elaborate our proposed framework for salient object detection. We first introduce our base network in Section 3.1. Then, we present a Mirror-linked Hourglass Network (MLHN) in Section 3.2. A detailed description of the Hierarchical Context Aggregation (HCA) module is finally provided in Section 3.3. We show an overall network architecture in Figure 2.
  2. To tackle the salient object detection, we follow recent studies [5, 43, 16] to use fully convolutional networks.
  3. Specifically, we use the well-known VGG16 network [38] as our backbone net, whose final fully connected
    layers are removed to serve for image-to-image translation.
  4. To this end, we retain the final pooling layer as in [16] and follow [3] to transform the last two fully connected layers into convolution layers, one of which has the kernel size of 3 × 3 with 1024 channels and the other of which has the kernel size of 1 × 1 with 1024 channels as well.
  5. Following these observations, we hypothesize that despite
    ReLU erasing negative linear responses, the first few convolution layers of a deep CNN manage to capture both negative and positive phase information through learning
    pairs or groups of negatively correlated filters. This conjecture implies that there exists a redundancy among the filters from the lower convolution layers.

Intermediate Sentences of Proposed method

Bold marks the professional expressions.

  1. Specifically, we use the well-known VGG16 network [38] as our backbone net, whose final fully connected
    layers are removed to serve for image-to-image translation
  2. a projection H̃ = 2H − 1 is exploited to transform H to H̃ ∈ [−1, 1]^{m×k}
  3. each hash bit is generated on the basis of the whole (and the same) input image feature vector, which may inevitably result in redundancy among the hash bits
  4. where α, β, γ are weights that control the interaction of the loss terms
  5. two feature vectors from source and target domain are concatenated to a 2,048-dim vector
  6. For similarity learning, we employ the triplet loss used in [15], which is formulated as,
  7. Given a certain small length of binary codes, the redundancy lying in different bits would badly affect its performance.
    1. a source image and its translated image should contain the same ID, i.e., self-similarity, and 2) the translated image should be of a different ID from any target image, i.e., domain dissimilarity. Note: the source and target domains contain entirely different IDs.
  8. but we want a representation which is conducive to training strong classifiers
  9. In unsupervised adaptation, we assume access to source images Xs and labels Ys drawn from a source domain distribution ps(x, y), as well as target images Xt drawn from a target distribution pt(x, y), where there are no label observations.
  10. Our goal is to learn a target representation, Mt and classifier Ct that can correctly classify target images into one of K categories at test time, despite the lack of in domain annotations.
  11. Since direct supervised learning on the target is not possible, domain adaptation instead learns a source representation mapping, Ms, along with a source classifier, Cs, and then learns to adapt that model for use in the target domain
  12. In adversarial adaptive methods, the main goal is to regularize the learning of the source and target mappings, Ms and Mt, so as to minimize the distance between the empirical source and target mapping distributions: Ms(Xs) and Mt(Xt)
  13. If this is the case then the source classification model, Cs, can be directly applied to the target representations, eliminating the need to learn a separate target classifier and instead setting C = Cs = Ct.
  14. In the case of learning a source mapping Ms alone it is clear that supervised training through a latent space discriminative loss using the known labels Ys results in the best representation for final source recognition.
  15. However, given that our target domain is unlabeled, it remains an open question how best to minimize the distance between the source and target mappings. Thus the first choice to be made is in the particular parameterization of these mappings.
  16. Both g1 and g2 are realized as multilayer perceptrons
  17. This constraint forces the high-level semantics to be decoded in the same way in g1 and g2.
  18. Likewise, let Eb(xb; θb) represent the shared encoder function, parameterized by θb which maps an
    image xb to the encoder output hb, where hb ∼ HB.
  19. This notation simply states that at each output location u of the channel c, the gather operator has a receptive field of the input that lies within a single channel and has an area bounded by (2e − 1)². If the field envelops the full input feature map, we say that the gather operator has global extent.
  20. Note that the architecture in principle contains multiple scales and for clarity, we illustrate the network with two scales as an example.
  21. Different from traditional ZSL approaches, the parameters of FNet are jointly trained with other parts in our framework; thus the obtained features are regulated well with the embedding component. We show that this leads to a performance improvement.
  22. However, there exist identity-discriminative but view-invariant visual appearance characteristics or factors that can be exploited for person Re-ID
  23. automatically learn the space of multi-level discriminative visual factors that are insensitive to viewing condition changes
  24. We propose two new types of layers – the “feedback” layer and the “emphasis” layer – to serve as the channel for transferring the feedback information.
  25. We consider Se6 as the top valve that controls the overall contextual information flow in the network.
  26. The resolution of feature maps in each convolution block is half that of the preceding one. Following [16, 48], the side-output of each convolution block means the connection from the last layer of this block.
  27. existing SPD matrix learning approaches typically flatten SPD manifolds via tangent space approximation (flattening the manifold: flattening a tensor to a vector)
  28. While, in principle, this could be handled by using the strategy of Section 3.2 with a small D̃, this would incur a loss of information that reduces the network capacity too severely.
  29. The objective is to construct a graph to jointly take the target pairs and the context information into consideration, and eventually outputs the similarity score.

End Sentences of Proposed method

  1. To this end, one can simply remove the fully-connected layers of the first-order
    CNN and connect the resulting output to a CDU. The output of the CDU being a vector, one can then simply pass it to a fully-connected layer, which, after a softmax activation, produces class probabilities. Since, as discussed above, all our new layers are differentiable, the resulting network can be trained in an end-to-end manner.

Equation Description

  1. For a clear presentation, this can be formulated as
  2. The (D × D) covariance matrix of such features can then be expressed as (no comma here; the equation follows directly), where

Figure Description

  1. Hierarchical Context Aggregation (HCA) module used in our proposed network. All sides of the backbone have intermediate supervision to ensure that the optimization is performed from high sides to lower sides, so that every side can learn the contextual information. The hierarchical contexts from all sides
    are concatenated
    for final saliency map prediction
  2. Figure 2. Overall framework of our proposed method. Our effort starts from the VGG16 network [38]. We add an additional convolution block at the end of the convolution layers of VGG16, resulting in six convolution blocks in total. The contexts at each convolution block are learned in a high-to-low manner to ensure that each block is guided by all higher layers to generate scale-aware contexts. The Hierarchical Context Aggregation (HCA) module can guarantee the optimization order is high-to-low and aggregate the generated hierarchical contexts
    to predict the final saliency maps.
  3. The proposed Riemannian network is conceptually illustrated in Fig.1.

Experiments

Experiments Configuration

Architectural Analyses

  1. Due to the nature of the multi-scale and multi-level learning in deep neural networks, there have emerged a large number of architectures that are designed to utilize the hierarchical deep features. For example, multi-scale learning can use skip-layer connections [13, 31], which are widely adopted owing to their strong capabilities to fuse hierarchical deep features inside the networks. On the other hand, multi-scale learning can use encoder-decoder networks that progressively decode the hierarchical deep representation learned in the encoder backbone net. We have seen these two structures applied in various vision tasks.
  2. We continue our discussion by briefly categorizing multi-scale deep learning into five classes: hyper feature learning, FCN style, HED style, DSS style and encoder-decoder networks. An overall illustration of them is summarized in Figure 4. Our following discussion of them will clearly show the differences between our proposed HCA network and previous efforts on multi-scale learning.
  3. Our network architecture is shown in Figure 2. Firstly, the concept generator is particularly designed to have multiple fully connected layers in order to obtain enough capacity to generate image-analogous concepts which are highly heterogeneous from the input attribute. Details are shown in Table 1. Secondly, our concept discriminator is also a combination of fully connected layers, each followed by batch normalization and leaky ReLU, except for the output layer, which is processed by the Sigmoid non-linearity. Finally, the concept extractor is obtained by removing the last Softmax classification layer of ResNet-50 and adding a 128-D fully connected layer. We regard the feature produced by the FC layer as the image concept. Note that the dimension of the last layer in the concept generator is also set to 128.

Ablation Study

Compare with the state-of-the-art

Conclusion

  1. Adding an FC layer following pool5 is good for cross-domain re-ID, but it decreases the rank-1 accuracy for supervised person re-ID.

Professional description collection

  1. The domain adversarial similarity loss [7, 8] is used to train a model to produce representations such that a classifier cannot reliably predict the domain of the encoded representation. Maximizing such "confusion" is achieved via a Gradient Reversal Layer (GRL) and a domain classifier trained to predict the domain producing the hidden representation. The GRL has the same output as the identity function, but reverses the gradient direction. Formally, for some function f(u), the GRL is defined as Q(f(u)) = f(u) with a gradient d/du Q(f(u)) = −d/du f(u). The domain classifier Z(Q(hc); θz) → d̂, parameterized by θz, maps a shared representation vector hc = Ec(x; θc) to a prediction of the label d̂ ∈ {0, 1} of the input sample x. Learning with a GRL is adversarial in that θz is optimized to increase Z's ability to discriminate between encodings of images from the source or target domains, while the reversal of the gradient results in the model parameters θc learning representations from which domain classification accuracy is reduced; (see the gradient-reversal sketch after this list)
  2. With a discriminative base model, input images are mapped into a feature space that is useful for a discriminative task such as image classification. For example, in the case of digit classification this may be the standard LeNet model. However, Liu and Tuzel achieve state of the art results on unsupervised MNIST-USPS using two generative adversarial networks [13]. These generative models use random noise as input to generate samples in image space—generally, an intermediate feature of an adversarial discriminator is then used as a feature for training a task-specific classifier.
  3. Note that this information flow direction is opposite to that in a discriminative deep neural network [6] where the first layers extract low-level features while the last layers extract high-level features.
  4. They can materialize the shared high-level representation differently for fooling the respective discriminators.
  5. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 matrix which could learn to represent the relationship between that entity and the viewer (the pose)
  6. A capsule in one layer votes for the pose matrix of many different capsules in the layer above by multiplying its own pose matrix by trainable viewpoint-invariant transformation matrices that could learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are iteratively updated for each image using the Expectation-Maximization algorithm such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The transformation matrices are trained discriminatively by backpropagating through the unrolled iterations of EM between each pair of adjacent capsule layers. On the smallNORB benchmark, capsules reduce the number of test errors by 45% compared to the state-of-the-art. Capsules also show far more resistance to white box adversarial attacks than our baseline convolutional neural network.
  7. For cumbersome models that learn to discriminate between a large number of classes, the normal training objective is to maximize the average log probability of the correct answer, but a side-effect of the learning is that the trained model assigns probabilities to all of the incorrect answers and even when these probabilities are very small, some of them are much larger than others.
  8. While there are often many solutions (deep network parameter settings) that generate zero train error, some of these generalise better than others due to being in wide valleys rather than narrow crevices [4, 9] – so that small perturbations do not change the prediction efficacy drastically; and that deep networks are better than might be expected at finding these good solutions [26], but that the tendency towards finding robust minima can be enhanced by biasing deep nets towards solutions with higher posterior entropy
  9. In this paper, we propose a new network module, named “Convolutional Block Attention Module”. Since convolution operations extract informative features by blending cross-channel and spatial information
    together, we adopt our module to emphasize meaningful features along those two principal dimensions: channel and spatial axes. To achieve this, we sequentially apply channel and spatial attention modules (as shown in Fig. 1), so that each of the branches can learn ‘what’ and ‘where’ to attend in the channel and spatial axes respectively. As a result, our module efficiently helps the information flow within the network by learning which information to emphasize or suppress.
  10. We produce a channel attention map by exploiting the inter-channel relationship of features. As each channel of a feature map is considered as a feature detector [31], channel attention focuses on ‘what’ is meaningful given an input image. To compute the channel attention efficiently, we squeeze the spatial dimension of the input feature map. For aggregating spatial information, average-pooling has been commonly adopted so far. Zhou et al. [32] suggest to use it to learn the extent of the target object effectively and Hu et al. [28] adopt it in their attention module to compute spatial statistics. Beyond the previous works, we argue that max-pooling gathers another important clue about distinctive object features to infer finer channel-wise attention. Thus, we use both average-pooled and max-pooled features simultaneously. We empirically confirmed that exploiting both features greatly improves representation power of networks rather thanusing each independently (see Sec. 4.1), showing the effectiveness of our design choice. We describe the detailed operation below.
  11. We first aggregate spatial information of a feature map by using both average-pooling and max-pooling operations, generating two different spatial context descriptors: Fc_avg and Fc_max, which denote average-pooled features and max-pooled features respectively. Both descriptors are then forwarded to a shared network to produce our channel attention map Mc ∈ R^{C×1×1}. The shared network is composed of a multi-layer perceptron (MLP) with one hidden layer. To reduce parameter overhead, the hidden activation size is set to R^{C/r×1×1}, where r is the reduction ratio. After the shared network is applied to each descriptor, we merge the output feature vectors using element-wise summation. In short, the channel attention is computed as: (see the attention sketch after this list)
  12. We generate a spatial attention map by utilizing the inter-spatial relationship of features. Different from the channel attention, the spatial attention focuses on 'where' is an informative part, which is complementary to the channel attention. To compute the spatial attention, we first apply average-pooling and max-pooling operations along the channel axis and concatenate them to generate an efficient feature descriptor. Applying pooling operations along the channel axis is shown to be effective in highlighting informative regions [33]. On the concatenated feature descriptor, we apply a convolution layer to generate a spatial attention map Ms(F) ∈ R^{H×W} which encodes where to emphasize or suppress. We describe the detailed operation below. We aggregate channel information of a feature map by using two pooling operations, generating two 2D maps: Fs_avg ∈ R^{1×H×W} and Fs_max ∈ R^{1×H×W}, which denote average-pooled features and max-pooled features across the channel respectively. Those are then concatenated and convolved by a standard convolution layer, producing our 2D spatial attention map. In short, the spatial attention is computed as:
  13. Furthermore, convolutional features naturally retain spatial information which is lost in fully-connected
    layers, so we can expect the last convolutional layers to have the best compromise between high-level semantics and detailed spatial information
  14. Inspired by the above evidence, we present a novel Feedback Convolutional Neural Network architecture in this paper. It achieves this selectivity by jointly reasoning over the outputs of class nodes and the activations of hidden layer neurons during the feedback loop.
  15. From a machine learning perspective, the proposed feedback networks add extra flexibility to Convolutional Networks, to help in capturing visual attention and improving feature detection
  16. Compared with traditional bottom-up strategies [11, 13], which aim to regularize the network training, the proposed feedback framework adds flexibilities to the model inference from high-level concepts down to the receptive field.
  17. We mimic the human visual recognition process, in which a human may focus to recognize objects in a complicated image after a first glimpse, as the procedure "Look and Think Twice" for image classification. We utilize weakly supervised object localization during the "first glimpse" to make guesses of ROIs, then make the network refocus on those ROIs and give the final classification list.
  18. As in Biased Competition Theory [1, 6], feedback, which passes the high-level semantic information down to the low-level perception, controls the selectivity of neuron activations in an extra loop in addition to the feedforward process. This results in the "Top-Down" attention in human cognition. Hierarchical probabilistic computational models [19] are proposed to characterize feedback stimuli in a top-down manner, which are further incorporated into deep neural networks, for example, modeling feedback as latent variables in DBM [31], or using selectivity to resolve fine-grained classification [21], etc.
  19. Inspired by visualizations of CNNs [33, 24], a more feasible and cognitive manner for detection / localization could be derived by utilizing the saliency maps generated in feedback visualizations.
  20. However, if possible, the challenge lies in how to obtain semantically meaningful salience maps with high quality for each concept (lies in: depends on). That is the ultimate goal of our work presented in this paper.
  21. By interpreting ReLU and Max-Pooling layers as "gates" controlled by the input x, the network selects information during feedforward phases in a bottom-up manner, and eliminates signals with minor contributions in making decisions. However, the activated neurons could be either helpful or harmful for classification, and may involve too many noises, for instance, cluttered backgrounds in complex scenes.
  22. Since the model opens all gates and allows maximal information to get through to ensure generalization, to increase the discriminability at the feature level, it is feasible to turn off those gates that provide irrelevant information when targeting particular semantic labels.
  23. However, all the existing methods merely apply shallow learning, with which traditional methods are typically surpassed by recent popular deep learning methods in many contexts in artificial intelligence and visual recognition.
  24. A new backpropagation is derived to train the proposed network by exploiting a stochastic gradient descent optimization algorithm on Stiefel manifolds.
  25. Analogously to the well-known convolutional network (ConvNet), the proposed SPD matrix network (SPDNet)
    also designs fully connected convolution-like layers and rectified linear units (ReLU)-like layers, named bilinear mapping (BiMap) layers and eigenvalue rectification (ReEig) layers respectively. In particular, following the classical manifold learning theory that learning or even preserving the original data structure can benefit classification, the BiMap layers are designed to transform the input SPD matrices,
    that are usually covariance matrices derived from the data,
    to new SPD matrices with a bilinear mapping. As the classical ReLU layers, the proposed ReEig layers introduce a non-linearity to the SPDNet by rectifying the resulting SPD matrices with a non-linear function. Since SPD matrices reside on non-Euclidean manifolds, we have to devise an
    eigenvalue logarithm (LogEig) layer to carry out Riemannian computing on them to output their Euclidean forms for any regular output layers.
  26. The normalized and de-correlated activation is well known for
    improving the conditioning of the Fisher information matrix and accelerating the training of deep neural
    networks
    [20, 6, 37].
  27. This trick can recover the representation capacity of the orthogonal weight layer to some extent, which is practical in shallow neural networks; but for deep CNNs, based on our observation, it is unnecessary.
  28. We aim to update the proxy parameters V, and therefore it is necessary to back-propagate the gradient information through the transformation φ(V).
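For item 1 above (the Gradient Reversal Layer), here is a minimal sketch. PyTorch is my own choice of framework, and names such as `GradReverse` and `lambd` are illustrative, not from the cited papers:

```python
import torch
from torch.autograd import Function


class GradReverse(Function):
    """GRL: identity in the forward pass, reversed gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and optionally scale) the gradient flowing back to the
        # encoder, pushing it toward domain-confusing representations.
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)


# Usage: shared encoder features pass through the GRL before the domain
# classifier; the classifier learns to discriminate domains while the
# encoder receives the reversed gradient.
features = torch.randn(8, 128, requires_grad=True)
domain_classifier = torch.nn.Linear(128, 2)
domain_logits = domain_classifier(grad_reverse(features))
```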
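For items 11 and 12 (CBAM-style channel and spatial attention), the following sketch is one plausible reading of the quoted computations; the module and parameter names are my assumptions, not from the paper:

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Shared MLP over average- and max-pooled descriptors, merged by summation."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # max-pooled descriptor
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)


class SpatialAttention(nn.Module):
    """Pool along the channel axis, concatenate, then convolve to a 2D map."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # average across channels
        mx = x.amax(dim=1, keepdim=True)     # max across channels
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))


# Usage: apply channel attention first, then spatial attention, sequentially.
x = torch.randn(2, 64, 32, 32)
out = SpatialAttention()(ChannelAttention(64)(x))
```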

Latex general problems

Sometimes when we compile a paper with LaTeX, we run into a bib-related problem like the following:

Something's wrong--perhaps a missing \item. \end{thebibliography}

The same LaTeX document compiles fine on Windows, but fails to compile once moved to a Mac.

The root cause lies in the *.bbl file in the same directory as the *.tex file. This file is handled differently on Windows and on Mac: when the document does not cite any reference, the compilation succeeds on Windows but fails on Mac.

So the fix is:

(1) Close the *.tex file, then delete the *.bbl file;

(2) Reopen the *.tex file and add a \cite{*} command anywhere in the document;

(3) Compile again, and the problem is gone. A minimal example applying this workaround is sketched below.
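The following is a minimal sketch of step (2), assuming a bibliography file named refs.bib in the same directory (\nocite{*} works as well if you do not want the citation printed in the text):

```latex
\documentclass{article}
\begin{document}

Some text that does not cite any reference yet.

% Workaround: reference every entry of the .bib file so that
% the regenerated .bbl file is never empty.
\cite{*}

\bibliographystyle{plain}
\bibliography{refs} % assumes refs.bib exists in this directory
\end{document}
```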
