

This is a doctoral dissertation from Pennsylvania State University (USA) by Nicolas Papernot, 177 pages in total.






Advances in machine learning (ML) in recent years have enabled a dizzying array of applications such as object recognition, autonomous systems, security diagnostics, and playing the game of Go. Machine learning is not only a new paradigm for building software and systems; it is also bringing social disruption at scale. There is growing recognition that ML exposes new vulnerabilities in software systems, yet the technical community’s understanding of the nature and extent of these vulnerabilities remains limited. In this thesis, I focus my study on the integrity of ML models. Integrity refers here to the faithfulness of model predictions with respect to an expected outcome. This property is at the core of traditional machine learning evaluation, as demonstrated by the pervasiveness of metrics such as accuracy among practitioners. A large fraction of ML techniques were designed for benign execution environments. Yet, the presence of adversaries may invalidate some of these underlying assumptions by forcing a mismatch between the distributions on which the model is trained and tested. As ML is increasingly applied and relied upon for decision-making in critical applications like transportation or energy, the models produced are becoming a target for adversaries who have a strong incentive to force ML to mispredict. I explore the space of attacks against ML integrity at test time. Given full or limited access to a trained model, I devise strategies that modify the test data to create a worst-case drift between the training and test distributions. The implication of this part of my research is that an adversary with very weak access to a system, and little knowledge about the ML techniques it deploys, can nevertheless mount powerful attacks against such a system as long as she can interact with it as an oracle, i.e., send inputs of her choice and observe the ML prediction.
This systematic exposition of the poor generalization of ML models indicates the lack of reliable confidence estimates when the model is making predictions far from its training data. Hence, my efforts to increase the robustness of models to these adversarial manipulations strive to decrease the confidence of predictions made far from the training distribution. Informed by my progress on attacks operating in the black-box threat model, I first identify limitations to two defenses: defensive distillation and adversarial training.

I then describe recent defensive efforts addressing these shortcomings. To this end, I introduce the Deep k-Nearest Neighbors classifier, which augments deep neural networks with an integrity check at test time. The approach compares internal representations produced by the deep neural network on test data with the ones learned on its training points. Using the labels of training points whose representations neighbor the test input across the deep neural network’s layers, I estimate the nonconformity of the prediction with respect to the model’s training data. An application of conformal prediction methodology then paves the way for more reliable estimates of the model’s prediction credibility, i.e., how well the prediction is supported by training data. In turn, we distinguish legitimate test data with high credibility from adversarial data with low credibility. This research calls for future efforts to investigate the robustness of individual layers of deep neural networks rather than treating the model as a black box. This aligns well with the modular nature of deep neural networks, which orchestrate simple computations to model complex functions. This also allows us to draw connections to other areas like interpretability in ML, which seeks to answer the question: “How can we provide an explanation for the model prediction to a human?” Another by-product of this research direction is that I better distinguish vulnerabilities of ML models that are a consequence of the ML algorithms from those that can be explained by artifacts in the data.
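
The credibility estimate described above can be illustrated with a small sketch. This is not the thesis's implementation: it assumes Euclidean distance, an exhaustive neighbor search, and a nonconformity score equal to the number of neighboring training labels (summed over layers) that disagree with a candidate label. The function names, the toy layer representations, and the calibration scores are all hypothetical and chosen only for illustration.

```python
import math

def knn_labels(train_reps, train_labels, query, k):
    """Labels of the k training points closest to `query` in one layer."""
    order = sorted(range(len(train_reps)),
                   key=lambda i: math.dist(train_reps[i], query))
    return [train_labels[i] for i in order[:k]]

def nonconformity(layers, train_labels, query_reps, candidate, k):
    """Number of neighbors, summed over layers, whose label differs from `candidate`."""
    total = 0
    for train_reps, query in zip(layers, query_reps):
        neighbors = knn_labels(train_reps, train_labels, query, k)
        total += sum(1 for label in neighbors if label != candidate)
    return total

def dknn_predict(layers, train_labels, query_reps, calibration_scores, classes, k):
    """Return (predicted label, credibility) using empirical p-values."""
    p_values = {}
    for c in classes:
        alpha = nonconformity(layers, train_labels, query_reps, c, k)
        # Fraction of calibration scores at least as nonconforming as this one.
        p_values[c] = sum(1 for a in calibration_scores if a >= alpha) / len(calibration_scores)
    prediction = max(p_values, key=p_values.get)
    return prediction, p_values[prediction]

# Hypothetical toy data: representations of six training points at two layers.
train_labels = [0, 0, 0, 1, 1, 1]
layer1 = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
layer2 = [(0.0,), (0.1,), (0.2,), (1.0,), (0.9,), (0.8,)]
layers = [layer1, layer2]

# Hypothetical nonconformity scores of held-out calibration points (true labels).
calibration_scores = [0, 0, 1, 3]

# Both layers agree on class 0: the prediction is well supported by training data.
consistent = dknn_predict(layers, train_labels, [(0.05, 0.05), (0.05,)],
                          calibration_scores, classes=[0, 1], k=3)

# The layers disagree (class 0 neighbors at layer 1, class 1 at layer 2),
# so the credibility of any label is low.
inconsistent = dknn_predict(layers, train_labels, [(0.05, 0.05), (0.95,)],
                            calibration_scores, classes=[0, 1], k=3)

print(consistent, inconsistent)
```

In this toy setting the input whose neighbors agree across layers receives high credibility, while the input whose neighbors change class between layers receives low credibility, which is the signal the abstract proposes for separating legitimate from adversarial test data. A practical version would presumably replace the exhaustive search with an approximate nearest-neighbor index.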

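  1. Introduction
  2. Background: An Overview of Machine Learning
  3. Security Model and Literature Review
  4. Adversarial Example Crafting
  5. Adversarial Example Transferability
  6. Practical Black-Box Attacks against Machine Learning
  7. Deep k-Nearest Neighbors
  8. Directions for Secure Machine Learning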


