This document is the doctoral dissertation of Nicolas Papernot at the Pennsylvania State University (United States), 177 pages in total.
Advances in machine learning (ML) in recent years have enabled a dizzying array of applications such as object recognition, autonomous systems, security diagnostics, and playing the game of Go. Machine learning is not only a new paradigm for building software and systems; it is bringing social disruption at scale. There is growing recognition that ML exposes new vulnerabilities in software systems, yet the technical community's understanding of the nature and extent of these vulnerabilities remains limited.
This thesis focuses on the integrity of ML models. Integrity refers here to the faithfulness of model predictions with respect to an expected outcome. This property is at the core of traditional machine learning evaluation, as demonstrated by the pervasiveness of metrics such as accuracy among practitioners. A large fraction of ML techniques were designed for benign execution environments. Yet, the presence of adversaries may invalidate some of these underlying assumptions by forcing a mismatch between the distributions on which the model is trained and tested. As ML is increasingly applied and relied on for decision-making in critical domains such as transportation or energy, the resulting models are becoming a target for adversaries who have a strong incentive to force ML to mispredict.
I explore the space of attacks against ML integrity at test time. Given full or limited access to a trained model, I devise strategies that modify the test data to create a worst-case drift between the training and test distributions. The implication of this part of the research is that an adversary with very weak access to a system, and little knowledge of the ML techniques it deploys, can nevertheless mount powerful attacks against it, as long as she can interact with it as an oracle: that is, send inputs of the adversary's choice and observe the ML predictions. This systematic exposition of the poor generalization of ML models indicates the lack of reliable confidence estimates when the model makes predictions far from its training data. Hence, my efforts to increase the robustness of models to these adversarial manipulations strive to decrease the confidence of predictions made far from the training distribution. Informed by my progress on attacks operating in the black-box threat model, I first identify limitations of two defenses: defensive distillation and adversarial training.
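To make this attack surface concrete, below is a minimal sketch of one standard way to craft such a test-time perturbation: a fast-gradient-sign (FGSM-style) step that pushes an input in the direction that increases the model's loss. It is an illustration under the assumption of a differentiable PyTorch classifier named `model`, not the specific attack algorithms developed in the thesis; in the black-box setting described above, the same step would be applied to a locally trained substitute model and the resulting inputs transferred to the target oracle.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.1):
    """Craft adversarial examples with a gradient-sign perturbation (sketch).

    Assumes `model` is a differentiable classifier returning logits,
    `x` is a batch of inputs with values in [0, 1], and `y` holds the
    true labels. Illustrative only; not the thesis's exact attacks.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Move each input in the direction that maximally increases the loss,
    # bounded by epsilon in the infinity norm, then clip to the valid range.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```

The `epsilon` bound controls how far the crafted inputs may drift from the original test points, which is precisely the worst-case drift between training and test distributions discussed above.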
I then describe recent defensive efforts addressing these shortcomings. To this end, I introduce the Deep k-Nearest Neighbors classifier, which augments deep neural networks with an integrity check at test time. The approach compares the internal representations produced by the deep neural network on test data with those learned on its training points. Using the labels of the training points whose representations neighbor the test input across the deep neural network's layers, I estimate the nonconformity of the prediction with respect to the model's training data.
An application of conformal prediction methodology then paves the way for more reliable estimates of the model's prediction credibility, i.e., how well the prediction is supported by the training data. In turn, legitimate test data with high credibility can be distinguished from adversarial data with low credibility. This research calls for future work to investigate the robustness of individual layers of deep neural networks rather than treating the model as a black box. This aligns well with the modular nature of deep neural networks, which orchestrate simple computations to model complex functions. It also allows connections to be drawn to other areas such as interpretability in ML, which seeks to answer the question: "How can we provide an explanation for the model's prediction to a human?" Another by-product of this research direction is a better distinction between vulnerabilities of ML models that are a consequence of the ML algorithms and those that can be explained by artifacts in the data.
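As a concrete illustration of the check described above, here is a minimal sketch of the Deep k-Nearest Neighbors credibility computation, assuming the per-layer representations of the training, calibration, and test data have already been extracted. The variable names, the brute-force neighbor search, and the single-input interface are simplifications for the sake of a self-contained example, not the thesis's actual implementation.

```python
import numpy as np

def dknn_credibility(train_reps, train_labels, calib_nonconf, test_reps, n_classes, k=5):
    """Deep k-Nearest Neighbors credibility for a single test input (sketch).

    train_reps   : list over layers; each entry has shape (n_train, d_layer)
                   and holds the training set's internal representations.
    train_labels : array of shape (n_train,) with training labels.
    calib_nonconf: nonconformity scores of a held-out calibration set,
                   computed with this same procedure for their true labels.
    test_reps    : list over layers; each entry has shape (d_layer,) and
                   holds the test input's representation at that layer.
    Returns (predicted_class, credibility, p_values).
    """
    # Collect the labels of the k nearest training points at every layer.
    neighbor_labels = []
    for layer_train, layer_test in zip(train_reps, test_reps):
        dists = np.linalg.norm(layer_train - layer_test, axis=1)
        nearest = np.argsort(dists)[:k]
        neighbor_labels.extend(train_labels[nearest])
    neighbor_labels = np.asarray(neighbor_labels)

    p_values = np.empty(n_classes)
    for c in range(n_classes):
        # Nonconformity of candidate class c: how many neighbors disagree with it.
        alpha = np.sum(neighbor_labels != c)
        # Conformal p-value: fraction of calibration scores at least as nonconforming.
        p_values[c] = np.mean(calib_nonconf >= alpha)

    prediction = int(np.argmax(p_values))
    credibility = float(p_values[prediction])  # support of the prediction by training data
    return prediction, credibility, p_values
```

Inputs whose neighbors disagree with the predicted label at several layers receive a high nonconformity score and therefore a low credibility, which is what allows legitimate test data (high credibility) to be separated from adversarial data (low credibility).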
The thesis is organized as follows:
- Introduction
- Basic concepts: an overview of machine learning
- Security model and literature review
- Crafting adversarial examples
- Adversarial example transferability
- Practical black-box attacks against machine learning
- The Deep k-Nearest Neighbors algorithm
- Directions for secure machine learning