Paper Study 5 - FaceNet: A Unified Embedding for Face Recognition and Clustering (2/3)

3. Method

FaceNet uses a deep convolutional network. We discuss two different core architectures: The Zeiler&Fergus [22] style networks and the recent Inception [16] type networks. The details of these networks are described in section 3.3.

Figure 2. Model structure. Our network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training.

Given the model details, and treating it as a black box (see Figure 2), the most important part of our approach lies in the end-to-end learning of the whole system. To this end we employ the triplet loss that directly reflects what we want to achieve in face verification, recognition and clustering. Namely, we strive for an embedding $f(x)$, from an image $x$ into a feature space $\mathbb{R}^d$, such that the squared distance between all faces, independent of imaging conditions, of the same identity is small, whereas the squared distance between a pair of face images from different identities is large.

Figure 3. The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

Although we did not do a direct comparison to other losses, e.g. the one using pairs of positives and negatives, as used in [14] Eq. (2), we believe that the triplet loss is more suitable for face verification. The motivation is that the loss from [14] encourages all faces of one identity to be projected onto a single point in the embedding space. The triplet loss, however, tries to enforce a margin between each pair of faces from one person to all other faces. This allows the faces for one identity to live on a manifold, while still enforcing the distance and thus discriminability to other identities.

The following section describes this triplet loss and how it can be learned efficiently at scale.

3.1. Triplet Loss

The embedding is represented by $f(x) \in \mathbb{R}^d$. It embeds an image $x$ into a $d$-dimensional Euclidean space. Additionally, we constrain this embedding to live on the $d$-dimensional hypersphere, i.e. $\left\| f(x) \right\|_2 = 1$. This loss is motivated in [19] in the context of nearest-neighbor classification. Here we want to ensure that an image $x_i^a$ (anchor) of a specific person is closer to all other images $x_i^p$ (positive) of the same person than it is to any image $x_i^n$ (negative) of any other person. This is visualized in Figure 3.

Thus we want,

Eq. 1

$$\left\| x_i^a - x_i^p \right\|_2^2 + \alpha < \left\| x_i^a - x_i^n \right\|_2^2, \quad \forall \left( x_i^a, x_i^p, x_i^n \right) \in \tau$$

where $\alpha$ is a margin that is enforced between positive and negative pairs. $\tau$ is the set of all possible triplets in the training set and has cardinality $N$.

The loss that is being minimized is then

Eq. 2

$$L = \sum_{i}^{N} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right]_+$$

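To make Eq. (2) concrete, here is a minimal NumPy sketch of the hinged triplet loss on L2-normalized embeddings; the margin of 0.2 matches the value used later in the paper, while the function names and the toy data are our own.

```python
import numpy as np

def l2_normalize(x, eps=1e-10):
    # Constrain embeddings to the unit hypersphere: ||f(x)||_2 = 1.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Eq. (2): sum_i [ ||a - p||^2 - ||a - n||^2 + margin ]_+ over a batch.

    All three arguments are (batch, d) arrays of embeddings f(x),
    assumed to be L2-normalized already.
    """
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)  # squared L2
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)
    return np.sum(np.maximum(pos_dist - neg_dist + margin, 0.0))

# Toy usage with random 128-dimensional "embeddings".
rng = np.random.default_rng(0)
a = l2_normalize(rng.normal(size=(4, 128)))
p = l2_normalize(rng.normal(size=(4, 128)))
n = l2_normalize(rng.normal(size=(4, 128)))
print(triplet_loss(a, p, n))
```
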
Generating all possible triplets would result in many triplets that are easily satisfied (i.e. fulfill the constraint in Eq. (1)). These triplets would not contribute to the training and result in slower convergence, as they would still be passed through the network. It is crucial to select hard triplets, that are active and can therefore contribute to improving the model. The following section talks about the different approaches we use for the triplet selection.

3.2. Triplet Selection

In order to ensure fast convergence it is crucial to select triplets that violate the triplet constraint in Eq. (1). This means that, given $x_i^a$, we want to select an $x_i^p$ (hard positive) such that $\mathrm{argmax}_{x_i^p} \left\| f(x_i^a) - f(x_i^p) \right\|_2^2$ and similarly $x_i^n$ (hard negative) such that $\mathrm{argmin}_{x_i^n} \left\| f(x_i^a) - f(x_i^n) \right\|_2^2$.

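As an illustration of these definitions (not the paper's code), a NumPy sketch that finds the hardest positive and hardest negative for one anchor within a set of embeddings; `emb` and `labels` are assumed inputs, and the pairwise squared distances use the identity $\|a-b\|^2 = \|a\|^2 + \|b\|^2 - 2\,a \cdot b$.

```python
import numpy as np

def pairwise_sq_dists(emb):
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, for all pairs at once.
    sq = np.sum(emb ** 2, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    return np.maximum(d, 0.0)  # guard against tiny negative rounding errors

def hardest_pair(emb, labels, anchor):
    """Return (hard positive index, hard negative index) for one anchor,
    assuming it has at least one positive and one negative in the set."""
    d = pairwise_sq_dists(emb)[anchor]
    pos_mask = labels == labels[anchor]
    pos_mask[anchor] = False                   # the anchor is not its own positive
    pos_idx = np.where(pos_mask)[0]
    neg_idx = np.where(labels != labels[anchor])[0]
    hard_pos = pos_idx[np.argmax(d[pos_idx])]  # argmax over same-identity images
    hard_neg = neg_idx[np.argmin(d[neg_idx])]  # argmin over other identities
    return hard_pos, hard_neg
```
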
It is infeasible to compute the argmin and argmax across the whole training set. Additionally, it might lead to poor training, as mislabelled and poorly imaged faces would dominate the hard positives and negatives. There are two obvious choices that avoid this issue:

  • Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.
  • Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.

Here, we focus on the online generation and use large mini-batches in the order of a few thousand exemplars and only compute the argmin and argmax within a mini-batch.

To have a meaningful representation of the anchor-positive distances, it needs to be ensured that a minimal number of exemplars of any one identity is present in each mini-batch. In our experiments we sample the training data such that around 40 faces are selected per identity per mini-batch. Additionally, randomly sampled negative faces are added to each mini-batch.

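A rough sketch of this sampling scheme under our own assumptions about the data layout (a dict mapping each identity to its image list); the figure of around 40 faces per identity comes from the paper, everything else here is illustrative.

```python
import numpy as np

def sample_minibatch(images_by_identity, chosen_identities, rng,
                     faces_per_identity=40, num_random_negatives=200):
    """Build a mini-batch: up to ~40 faces per chosen identity, plus
    randomly sampled negative faces from the remaining identities."""
    batch = []
    for ident in chosen_identities:
        imgs = images_by_identity[ident]
        take = min(faces_per_identity, len(imgs))
        picks = rng.choice(len(imgs), size=take, replace=False)
        batch += [(ident, int(i)) for i in picks]
    # Random negatives: identities outside the chosen set.
    others = [k for k in images_by_identity if k not in set(chosen_identities)]
    for ident in rng.choice(others, size=num_random_negatives):
        batch.append((ident, int(rng.integers(len(images_by_identity[ident])))))
    return batch
```
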
Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives. We don’t have a side-by-side comparison of hard anchor-positive pairs versus all anchor-positive pairs within a mini-batch, but we found in practice that the all anchor-positive method was more stable and converged slightly faster at the beginning of training.

We also explored the offline generation of triplets in conjunction with the online generation and it may allow the use of smaller batch sizes, but the experiments were inconclusive.

Selecting the hardest negatives can in practice lead to bad local minima early on in training, specifically it can result in a collapsed model (i.e. $f(x) = 0$). In order to mitigate this, it helps to select $x_i^n$ such that

Eq. 3

$$\left\| f(x_i^a) - f(x_i^p) \right\|_2^2 < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2$$

We call these negative exemplars semi-hard, as they are further away from the anchor than the positive exemplar, but still hard because the squared distance is close to the anchor-positive distance. Those negatives lie inside the margin $\alpha$.

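Combining the last two ideas, a hedged sketch of the online mining described above: keep every anchor-positive pair in the mini-batch and, for each pair, pick a semi-hard negative satisfying Eq. (3); the fallback when no semi-hard negative exists (take the farthest negative) is our own choice, not specified by the paper.

```python
import numpy as np

def semi_hard_triplets(emb, labels, margin=0.2):
    """Return (anchor, positive, negative) index triples for one mini-batch,
    using all anchor-positive pairs and one semi-hard negative per pair.
    Assumes each anchor has at least one negative in the batch."""
    sq = np.sum(emb ** 2, axis=1)
    d = np.maximum(sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T, 0.0)
    triplets = []
    for a in range(len(emb)):
        negs = np.where(labels != labels[a])[0]
        for p in np.where(labels == labels[a])[0]:
            if p == a:
                continue
            # Semi-hard (Eq. 3): farther than the positive but inside the margin.
            semi = negs[(d[a, negs] > d[a, p]) & (d[a, negs] < d[a, p] + margin)]
            if len(semi) > 0:
                n = semi[np.argmin(d[a, semi])]  # hardest among the semi-hard
            else:
                n = negs[np.argmax(d[a, negs])]  # fallback: farthest negative
            triplets.append((a, int(p), int(n)))
    return triplets
```
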
As mentioned before, correct triplet selection is crucial for fast convergence. On the one hand we would like to use small mini-batches as these tend to improve convergence during Stochastic Gradient Descent (SGD) [20]. On the other hand, implementation details make batches of tens to hundreds of exemplars more efficient. The main constraint with regards to the batch size, however, is the way we select hard relevant triplets from within the mini-batches. In most experiments we use a batch size of around 1,800 exemplars.

3.3. Deep Convolutional Networks

In all our experiments we train the CNN using Stochastic Gradient Descent (SGD) with standard backprop [8, 11] and AdaGrad [5]. In most experiments we start with a learning rate of 0.05 which we lower to finalize the model. The models are initialized from random, similar to [16], and trained on a CPU cluster for 1,000 to 2,000 hours. The decrease in the loss (and increase in accuracy) slows down drastically after 500h of training, but additional training can still significantly improve performance. The margin $\alpha$ is set to 0.2.

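For reference, a minimal sketch of the AdaGrad update rule [5] used for training, with the paper's initial learning rate of 0.05; the small epsilon is a standard numerical stabilizer and our own assumption.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.05, eps=1e-8):
    """One AdaGrad update: per-parameter step sizes shrink with the
    square root of the accumulated squared gradients."""
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum
```
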
We used two types of architectures and explore their trade-offs in more detail in the experimental section. Their practical differences lie in the difference of parameters and FLOPS. The best model may be different depending on the application. E.g. a model running in a datacenter can have many parameters and require a large number of FLOPS, whereas a model running on a mobile phone needs to have few parameters, so that it can fit into memory. All our models use rectified linear units as the non-linear activation function.

The first category, shown in Table 1, adds $1 \times 1 \times d$ convolutional layers, as suggested in [9], between the standard convolutional layers of the Zeiler&Fergus [22] architecture and results in a model 22 layers deep. It has a total of 140 million parameters and requires around 1.6 billion FLOPS per image.

Table 1. NN1. This table shows the structure of our Zeiler&Fergus [22] based model with $1 \times 1$ convolutions inspired by [9]. The input and output sizes are described in $rows \times cols \times filters$. The kernel is specified as $rows \times cols$, stride and the maxout [6] pooling size as $p = 2$.

The second category we use is based on GoogLeNet style Inception models [16]. These models have $20\times$ fewer parameters (around 6.6M-7.5M) and up to $5\times$ fewer FLOPS (between 500M-1.6B). Some of these models are dramatically reduced in size (both depth and number of filters), so that they can be run on a mobile phone. One, NNS1, has 26M parameters and only requires 220M FLOPS per image. The other, NNS2, has 4.3M parameters and 20M FLOPS. Table 2 describes NN2, our largest network, in detail. NN3 is identical in architecture but has a reduced input size of $160 \times 160$. NN4 has an input size of only $96 \times 96$, thereby drastically reducing the CPU requirements (285M FLOPS vs 1.6B for NN2). In addition to the reduced input size it does not use $5 \times 5$ convolutions in the higher layers as the receptive field is already too small by then. Generally we found that the $5 \times 5$ convolutions can be removed throughout with only a minor drop in accuracy. Figure 4 compares all our models.

Table 2. NN2. Details of the NN2 Inception incarnation. This model is almost identical to the one described in [16]. The two major differences are the use of L2 pooling instead of max pooling (m), where specified. The pooling is always $3 \times 3$ (aside from the final average pooling) and in parallel to the convolutional modules inside each Inception module. If there is a dimensionality reduction after the pooling it is denoted with $p$. $1 \times 1$, $3 \times 3$, and $5 \times 5$ pooling are then concatenated to get the final output.

4. Datasets and Evaluation

We evaluate our method on four datasets and with the exception of Labelled Faces in the Wild and YouTube Faces we evaluate our method on the face verification task. I.e. given a pair of two face images a squared L2 distance threshold $D(x_i, x_j)$ is used to determine the classification of same and different. All face pairs $(i, j)$ of the same identity are denoted with $P_{same}$, whereas all pairs of different identities are denoted with $P_{diff}$.

We define the set of all true accepts as

Eq. 4

$$TA(d) = \{\, (i, j) \in P_{same}, \ \text{with} \ D(x_i, x_j) \leq d \,\}$$

These are the face pairs $(i, j)$ that were correctly classified as same at threshold $d$. Similarly

Eq. 5

$$FA(d) = \{\, (i, j) \in P_{diff}, \ \text{with} \ D(x_i, x_j) \leq d \,\}$$

is the set of all pairs that were incorrectly classified as same (false accept).

The validation rate VAL(d) and the false accept rate FAR(d) for a given face distance d are then defined as

Eq. 6

$$VAL(d) = \frac{|TA(d)|}{|P_{same}|}, \quad FAR(d) = \frac{|FA(d)|}{|P_{diff}|}$$

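As a sanity check on these definitions, a small NumPy sketch computing VAL(d) and FAR(d) from precomputed pair distances; `dists` and `is_same` are assumed inputs covering all evaluation pairs.

```python
import numpy as np

def val_far(dists, is_same, d):
    """VAL(d) = |TA(d)| / |P_same| and FAR(d) = |FA(d)| / |P_diff| (Eq. 4-6).

    dists:   squared L2 distances D(x_i, x_j), one per evaluation pair
    is_same: boolean array, True where the pair shares an identity
    d:       distance threshold
    """
    accept = dists <= d                 # pairs classified as "same"
    val = np.mean(accept[is_same])      # fraction of P_same accepted (TA)
    far = np.mean(accept[~is_same])     # fraction of P_diff accepted (FA)
    return val, far
```
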
4.1. Hold-out Test Set

We keep a hold out set of around one million images, that has the same distribution as our training set, but disjoint identities. For evaluation we split it into five disjoint sets of 200k images each. The FAR and VAL rate are then computed on $100\text{k} \times 100\text{k}$ image pairs. Standard error is reported across the five splits.

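We read the reported standard error as the standard error of the mean over the five splits; a short sketch of that computation with made-up VAL numbers (the paper does not spell out the formula):

```python
import numpy as np

# Hypothetical VAL rates measured on the five disjoint 200k splits.
val_per_split = np.array([0.870, 0.865, 0.872, 0.868, 0.871])
mean = val_per_split.mean()
# Standard error of the mean: sample std (ddof=1) over sqrt(n).
sem = val_per_split.std(ddof=1) / np.sqrt(len(val_per_split))
print(f"VAL = {mean:.3f} +/- {sem:.3f}")
```
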
4.2. Personal Photos

This is a test set with similar distribution to our training set, but has been manually verified to have very clean labels. It consists of three personal photo collections with a total of around 12k images. We compute the FAR and VAL rate across all 12k squared pairs of images.

4.3. Academic Datasets

Labeled Faces in the Wild (LFW) is the de-facto academic test set for face verification [7]. We follow the standard protocol for unrestricted, labeled outside data and report the mean classification accuracy as well as the standard error of the mean.

Youtube Faces DB [21] is a new dataset that has gained popularity in the face recognition community [17, 15]. The setup is similar to LFW, but instead of verifying pairs of images, pairs of videos are used.
