We develop and study new adversarial perturbations that enable an attacker to gain control over decisions in generic Artificial Intelligence (AI) systems including deep learning neural networks. In contrast to adversarial data modification, the attack mechanism we consider here involves alterations to the AI system itself. Such a stealth attack could be conducted by a mischievous, corrupt or disgruntled member of a software development team. It could also be made by those wishing to exploit a "democratization of AI" agenda, where network architectures and trained parameter sets are shared publicly. We develop a range of new implementable attack strategies with accompanying analysis, showing that with high probability a stealth attack can be made transparent, in the sense that system performance is unchanged on a fixed validation set which is unknown to the attacker, while evoking any desired output on a trigger input of interest. The attacker only needs estimates of the size of the validation set and the spread of the AI's relevant latent space. In the case of deep learning neural networks, we show that a one-neuron attack is possible (a modification to the weights and bias associated with a single neuron), revealing a vulnerability arising from over-parameterization. We illustrate these concepts using state-of-the-art architectures on two standard image data sets. Guided by the theory and computational results, we also propose strategies to guard against stealth attacks.
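A minimal NumPy sketch may help make the one-neuron mechanism above concrete; the latent dimension, the unit-norm spread proxy and the margin parameter are all assumptions chosen for illustration, not the paper's construction:

import numpy as np

rng = np.random.default_rng(0)
d = 64                                    # assumed latent dimension
latents = rng.normal(size=(100, d))
latents /= np.linalg.norm(latents, axis=1, keepdims=True)   # proxy "validation" latents, unit norm

trigger = rng.normal(size=d)
trigger /= np.linalg.norm(trigger)        # latent representation of the trigger input

# Re-weight a single ReLU neuron so it fires on the trigger but, with high
# probability, stays silent on latent vectors only weakly correlated with it.
delta = 0.05                              # margin; trades stealth against reliability
w = trigger / delta
b = -(1.0 - delta) / delta

relu = lambda z: np.maximum(z, 0.0)
print("trigger activation  :", relu(w @ trigger + b))          # = 1, can drive any desired output
print("max clean activation:", relu(latents @ w + b).max())    # ~ 0 for typical latents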
Many state-of-the-art ML models outperform humans in a variety of tasks such as image classification. With such excellent performance, ML models are widely used today. However, the existence of adversarial attacks and data poisoning attacks raises genuine concerns about the robustness of ML models. For example, Engstrom et al. demonstrated that state-of-the-art image classifiers can easily be fooled by small rotations of an arbitrary image. As ML systems are increasingly incorporated into safety- and security-sensitive applications, adversarial attacks and data poisoning attacks pose a considerable threat. This chapter focuses on two broad and important areas of ML security: adversarial attacks and data poisoning attacks.
Along with the impressive progress touching every aspect of our society, AI techniques based on deep neural networks (DNNs) are raising increasing security concerns. While attacks operating at test time have monopolised researchers' initial attention, the possibility of undermining a DNN model by tampering with the training process represents a further, serious threat to the trustworthiness of AI techniques. In a backdoor attack, the attacker corrupts the training data so as to induce an erroneous behaviour at test time. Test-time errors, however, are activated only in the presence of a triggering event corresponding to a properly crafted input sample. In this way, the corrupted network continues to work as expected on normal inputs, and the malicious behaviour occurs only when the attacker decides to activate the backdoor hidden within the network. Over the last few years, backdoor attacks have been the subject of intense research activity, focusing both on the development of new classes of attacks and on the proposal of possible countermeasures. The goal of this overview paper is to review the works published so far, classifying the different types of attacks and defences proposed to date. The classification guiding the analysis is based on the amount of control the attacker has over the training process, and on the capability of the defender to verify the integrity of the data used for training and to monitor the operation of the DNN at training and test time. As such, the proposed analysis is particularly suited to highlighting the strengths and weaknesses of attacks and defences with reference to the application scenarios in which they operate.
Recent studies have shown that deep neural networks (DNNs) are vulnerable to adversarial attacks, including evasion and backdoor (poisoning) attacks. On the defence side, there have been intensive efforts to improve both empirical and provable robustness against evasion attacks; however, provable robustness against backdoor attacks remains largely unexplored. In this paper, we focus on certifying the robustness of machine learning models against general threat models, and backdoor attacks in particular. We first provide a unified framework via randomized smoothing and show how it can be instantiated to certify robustness against both evasion and backdoor attacks. We then propose the first robust training process, RAB, to smooth the trained model and certify its robustness against backdoor attacks. We derive robustness bounds for machine learning models trained with RAB and prove that our bounds are tight. In addition, we show that robust smoothed models can be trained efficiently for simple models such as K-nearest-neighbour classifiers, and we propose an exact smooth-training algorithm that eliminates the need to sample from a noise distribution for such models. Empirically, we conduct comprehensive experiments for different machine learning (ML) models such as DNNs, differentially private DNNs and K-NN models on the MNIST, CIFAR-10 and ImageNet datasets, and provide the first benchmark for certified robustness against backdoor attacks. In addition, we evaluate K-NN models on the spambase tabular dataset to demonstrate the advantages of the proposed exact algorithm. The comprehensive evaluation on diverse models and datasets sheds light on further robust learning strategies against general training-time attacks.
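As a rough illustration of the smoothing idea this abstract builds on, the sketch below trains many copies of a toy classifier on noise-perturbed versions of the training set and aggregates their votes; the nearest-centroid model, noise level and sample counts are assumed, and RAB's actual certification bounds are not reproduced:

import numpy as np

rng = np.random.default_rng(2)

# Toy 2-class data; imagine a small fraction of it could be poisoned by an attacker.
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(+1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

def nearest_centroid_fit(X, y):
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, x):
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

# "Smoothing" in the spirit of the abstract: train many models on noise-perturbed
# copies of the training set and aggregate their votes, so a handful of poisoned
# points cannot single-handedly control the aggregate prediction.
sigma, n_models = 0.5, 200
votes = np.zeros(2)
x_test = np.array([0.3, -0.2])
for _ in range(n_models):
    centroids = nearest_centroid_fit(X + rng.normal(scale=sigma, size=X.shape), y)
    votes[predict(centroids, x_test)] += 1
print("aggregated votes:", votes)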
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
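The following toy sketch conveys the flavour of saliency-guided adversarial crafting on a linear model, where the input-output Jacobian is simply the weight matrix; the dimensions, step size and budget are assumed, and this is not the paper's exact algorithm:

import numpy as np

rng = np.random.default_rng(1)
n_features, n_classes = 20, 3
W = rng.normal(size=(n_classes, n_features))   # toy linear "network": logits = W x + b
b = rng.normal(size=n_classes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.uniform(0, 1, n_features)
target = 2                                     # adversary-chosen target class
budget = 4                                     # cap on how many features may be modified

modified = set()
for _ in range(budget):
    # Saliency of each feature: how much it helps the target logit versus the others.
    saliency = W[target] - W[np.arange(n_classes) != target].sum(axis=0)
    i = next(j for j in np.argsort(-np.abs(saliency)) if j not in modified)
    modified.add(i)
    x[i] = np.clip(x[i] + np.sign(saliency[i]), 0.0, 1.0)     # push the feature to its bound
    if softmax(W @ x + b).argmax() == target:
        break

print("features modified:", sorted(int(j) for j in modified))
print("predicted class  :", int(softmax(W @ x + b).argmax()))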
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of $10^{30}$. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.
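A small numerical sketch of the temperature mechanism behind distillation (illustrative logits and temperature only; the actual defense trains a second network on these softened labels at the same temperature):

import numpy as np

def softmax_T(logits, T):
    # Softmax at temperature T; higher T produces softer probability vectors.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

teacher_logits = np.array([6.0, 2.0, -1.0])    # hypothetical teacher outputs for one example
T = 20.0

hard_labels = softmax_T(teacher_logits, 1.0)
soft_labels = softmax_T(teacher_logits, T)     # labels used to train the distilled network
print("hard:", np.round(hard_labels, 3))       # ~[0.98, 0.018, 0.001]
print("soft:", np.round(soft_labels, 3))       # much closer to uniform

# Distillation cross-entropy for some student logits, evaluated at the same temperature.
student_logits = np.array([1.0, 0.5, 0.2])
p = softmax_T(student_logits, T)
print("distillation cross-entropy:", float(-(soft_labels * np.log(p)).sum()))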
As machine learning (ML) systems become pervasive, safeguarding their security is critical. However, it has recently been demonstrated that motivated adversaries are able to mislead ML systems by perturbing test data using semantic transformations. While there exists a rich body of research providing provable robustness guarantees for ML models against $\ell_p$ norm-bounded adversarial perturbations, guarantees against semantic perturbations remain largely underexplored. In this paper, we present TSS, a unified framework for certifying ML robustness against general adversarial semantic transformations. First, depending on the properties of each transformation, we divide common transformations into two categories, namely resolvable (e.g., Gaussian blur) and differentially resolvable (e.g., rotation) transformations. For the former, we propose transformation-specific randomized smoothing strategies and obtain strong robustness certification. The latter category covers transformations that involve interpolation errors, and we propose a novel approach based on stratified sampling to certify robustness. Our framework TSS leverages these certification strategies and combines them with consistency-enhanced training to provide rigorous robustness certification. We conduct extensive experiments on ten challenging semantic transformations and show that TSS significantly outperforms the state of the art. Moreover, to the best of our knowledge, TSS is the first approach that achieves nontrivial certified robustness on the large-scale ImageNet dataset. For example, our framework achieves 30.4% certified robust accuracy against rotation attacks (within $\pm 30^\circ$) on ImageNet. In addition, considering a broader range of transformations, we show that TSS is also robust against adaptive attacks and unforeseen image corruptions such as those in CIFAR-10-C and ImageNet-C.
Context: machine learning (ML) has been at the heart of many innovations over the past few years. However, including it in so-called "safety-critical" systems, such as automotive or aviation systems, has proven to be very challenging, since the shift of paradigm that ML brings completely changes traditional certification approaches. Objective: this paper aims to elucidate the challenges related to the certification of ML-based safety-critical systems, as well as the solutions proposed in the literature to address them, answering the question "How to certify machine learning based safety-critical systems?". Method: we conducted a Systematic Literature Review (SLR) of research papers published between 2015 and 2020, covering topics related to the certification of ML systems. In total, 217 papers were identified, covering what are considered the main pillars of ML certification: robustness, uncertainty, explainability, verification, safe reinforcement learning, and direct certification. We analysed the main trends and problems of each sub-field and provide summaries of the extracted papers. Results: the SLR results highlight the enthusiasm of the community for this topic, as well as the lack of diversity in terms of datasets and model types. They also emphasise the need to further develop connections between academia and industry in order to deepen research in this domain. Finally, they illustrate the necessity of building connections between the main pillars mentioned above, which have so far mostly been studied separately. Conclusion: we highlight the efforts currently deployed to enable the certification of ML-based software systems and discuss some future research directions.
In this paper, we initiate a rigorous study of the phenomenon of low-dimensional adversarial perturbations (LDAPs) in classification. Unlike the classical setting, these perturbations are limited to a subspace of dimension $k$ which is much smaller than the dimension $d$ of the feature space. The case $k=1$ corresponds to so-called universal adversarial perturbations (UAPs; Moosavi-Dezfooli et al., 2017). First, we consider binary classifiers under generic regularity conditions (including ReLU networks) and compute analytical lower bounds on the fooling rate for any subspace. These bounds explicitly highlight the dependence of the fooling rate on the pointwise margin of the model (i.e., the ratio of the output at a test point to the $\ell_2$ norm of its gradient), as well as on the alignment of the given subspace with the gradients of the model w.r.t. the input. Our results provide a rigorous explanation for the recent success of heuristic methods for efficiently generating low-dimensional adversarial perturbations. Finally, we show that if a decision region is compact, then it admits a universal adversarial perturbation whose $\ell_2$ norm is roughly $\sqrt{d}$ times smaller than the typical $\ell_2$ norm of a data point. Our theoretical results are confirmed by experiments on synthetic and real data.
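For reference, the pointwise margin the bounds above depend on can be written as follows (our notation, sketching the first-order intuition rather than the paper's exact statement): $\mu_f(x) = |f(x)| / \| \nabla_x f(x) \|_2$. To first order, $f(x+\delta) \approx f(x) + \nabla_x f(x)^\top \delta$, so flipping the sign of $f(x)$ requires $\|\delta\|_2 \gtrsim \mu_f(x)$, and the change achievable within a subspace scales with how well that subspace aligns with $\nabla_x f(x)$.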
This is an introductory machine learning course developed specifically for STEM students. Our goal is to provide interested readers with the basics needed to employ machine learning in their own projects, and to familiarise them with the terminology as a foundation for further reading of the relevant literature. In these lecture notes, we discuss supervised, unsupervised, and reinforcement learning. The notes start with an exposition of machine learning methods without neural networks, such as principal component analysis, t-SNE, clustering, as well as linear regression and linear classifiers. We continue with an introduction to both basic and advanced neural-network structures such as dense feed-forward and conventional neural networks, recurrent neural networks, restricted Boltzmann machines, (variational) autoencoders, and generative adversarial networks. Questions of interpretability of latent-space representations are discussed and illustrated using the examples of dreaming and adversarial attacks. The final part is dedicated to reinforcement learning, where we introduce the basic notions of value functions and policy learning.
Deep Generative Models (DGMs) are a popular class of deep learning models which find widespread use because of their ability to synthesize data from complex, high-dimensional manifolds. However, even with their increasing industrial adoption, they haven't been subject to rigorous security and privacy analysis. In this work we examine one such aspect, namely backdoor attacks on DGMs, which can significantly limit the applicability of pre-trained models within a model supply chain and at the very least cause massive reputation damage for companies outsourcing DGMs from third parties. While similar attack scenarios have been studied in the context of classical prediction models, their manifestation in DGMs hasn't received the same attention. To this end we propose novel training-time attacks which result in corrupted DGMs that synthesize regular data under normal operations and designated target outputs for inputs sampled from a trigger distribution. These attacks are based on an adversarial loss function that combines the dual objectives of attack stealth and fidelity. We systematically analyze these attacks, and show their effectiveness for a variety of approaches like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), as well as different data domains including images and audio. Our experiments show that - even for large-scale industry-grade DGMs (like StyleGAN) - our attacks can be mounted with only modest computational effort. We also motivate suitable defenses based on static/dynamic model and output inspections, demonstrate their usefulness, and prescribe a practical and comprehensive defense strategy that paves the way for safe usage of DGMs.
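A linear toy sketch of the dual-objective training loss described above; the generators, trigger and target here are illustrative stand-ins rather than anything from the paper:

import numpy as np

rng = np.random.default_rng(3)

# Stand-ins for a clean generator G and a corrupted generator G' (linear maps here;
# real DGMs such as StyleGAN are far larger). Dimensions are assumed.
d_latent, d_out = 8, 32
W_clean = rng.normal(size=(d_out, d_latent))
W_corrupt = W_clean + 0.01 * rng.normal(size=(d_out, d_latent))

trigger = rng.normal(size=d_latent)        # a sample from the trigger distribution
target_output = np.ones(d_out)             # attacker-designated output

def attack_loss(W, lam=1.0, n_samples=256):
    # Dual-objective loss in the spirit of the abstract: fidelity on the trigger,
    # plus stealth (behave like the clean generator) on ordinary latent samples.
    z = rng.normal(size=(n_samples, d_latent))
    fidelity = np.linalg.norm(W @ trigger - target_output) ** 2
    stealth = np.mean(np.linalg.norm(z @ (W - W_clean).T, axis=1) ** 2)
    return fidelity + lam * stealth

print("loss for the clean generator    :", float(attack_loss(W_clean)))
print("loss for the corrupted generator:", float(attack_loss(W_corrupt)))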
A backdoor data poisoning attack is an adversarial attack wherein the attacker injects several watermarked, mislabelled training examples into the training set. The watermark does not impact the test-time performance of the model on typical data; however, the model reliably errs on watermarked examples. To gain a better foundational understanding of backdoor data poisoning attacks, we present a formal theoretical framework within which one can discuss backdoor data poisoning attacks for classification problems. We then use this to analyse important statistical and computational issues surrounding these attacks. On the statistical side, we identify a parameter we call the memorization capacity that captures the intrinsic vulnerability of a learning problem to a backdoor attack. This allows us to argue about the robustness of several natural learning problems to backdoor attacks. Our results favouring the attacker involve presenting explicit constructions of backdoor attacks, and our robustness results show that some natural problem settings cannot yield successful backdoor attacks. From a computational standpoint, we show that, under certain assumptions, adversarial training can detect the presence of backdoors in a training set. We then show that, under similar assumptions, two closely related problems we call backdoor filtering and robust generalization are nearly equivalent. This implies that it is both asymptotically necessary and sufficient to design algorithms that can identify watermarked examples in the training set in order to obtain a learning algorithm that both generalizes well and is robust to backdoors.
We consider the problem of data classification where the training set consists of just a few data points. We explore this phenomenon mathematically and reveal key relationships between the geometry of an AI model's feature space, the structure of the underlying data distributions, and the model's generalisation capabilities. The main thrust of our analysis is to reveal the influence on the model's generalisation capabilities of nonlinear feature transformations mapping the original data into high, and possibly infinite, dimensional spaces.
Despite the empirical success of using adversarial training to defend deep learning models against adversarial perturbations, it remains unclear what principles lie behind the existence of adversarial perturbations, and what adversarial training does to the neural network to remove them. In this paper, we present a principle that we call feature purification, whereby we show that one of the causes of the existence of adversarial examples is the accumulation of certain small, dense mixtures in the hidden weights during the training process of a neural network; and, more importantly, that one of the goals of adversarial training is to remove such mixtures in order to purify the hidden weights. We present two experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that, for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle. Technically, we give, to our best knowledge, the first result proving that the following two can hold simultaneously for a neural network with ReLU activation: (1) training on the original data is indeed non-robust to small adversarial perturbations of some radius; (2) adversarial training, even with an empirical perturbation algorithm such as FGM, can in fact be proven robust against any perturbation of the same radius. Finally, we also prove a complexity lower bound showing that low-complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network, cannot defend against perturbations of the same radius, no matter what algorithm is used to train them.
Given access to a machine learning model, can an adversary reconstruct the model's training data? This work studies this question from the lens of a powerful informed adversary who knows all the training data points except one. By instantiating concrete attacks, we show that reconstructing the remaining data point in this stringent threat model is feasible. For convex models (e.g., logistic regression), reconstruction attacks are simple and can be derived in closed form. For more general models (e.g., neural networks), we propose an attack strategy based on training a reconstructor network that receives as input the weights of the model under attack and produces the target data point. We demonstrate the effectiveness of our attack on image classifiers trained on MNIST and CIFAR-10, and systematically investigate which factors of standard machine learning pipelines affect reconstruction success. Finally, we theoretically investigate how much differential privacy suffices to mitigate reconstruction attacks by informed adversaries. Our work provides effective reconstruction attacks that model developers can use to assess memorization of individual points in general settings beyond those considered in previous works (e.g., generative language models or access to training gradients); it shows that standard models have the capacity to store enough information to enable high-fidelity reconstruction of training data points; and it demonstrates that differential privacy can successfully mitigate such attacks in a parameter regime where utility degradation is minimal.
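The informed-adversary threat model can be illustrated with a far simpler released "model" than those studied in the paper, namely an empirical mean, for which knowing all but one training point pins the last one down exactly:

import numpy as np

rng = np.random.default_rng(7)

# Toy instance of the informed-adversary setting: the released "model" is just
# the empirical mean of the training set (much simpler than the logistic
# regression / neural network settings in the abstract, but it shows why knowing
# all-but-one point can determine the remaining one exactly).
n, d = 100, 16
data = rng.normal(size=(n, d))
released_model = data.mean(axis=0)         # what the adversary observes
known_points = data[:-1]                   # the adversary knows n - 1 training points

reconstruction = n * released_model - known_points.sum(axis=0)
print("reconstruction error:", np.linalg.norm(reconstruction - data[-1]))   # ~ 0 (floating point)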
Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS, which illustrate a diverse set of defense strategies, can be circumvented despite attempting to perform evaluations using adaptive attacks. While prior evaluation papers focused mainly on the end result (showing that a defense was ineffective), this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. Some of our attack strategies are generalizable, but no single strategy would have been sufficient for all defenses. This underlines our key message that adaptive attacks cannot be automated and always require careful and appropriate tuning to a given defense. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples, and thus will allow the community to make further progress in building more robust models.
Even though machine learning algorithms already play a significant role in data science, many current methods pose unrealistic assumptions on the input data. Such methods are difficult to apply because of incompatible data formats, or heterogeneous, hierarchical or entirely missing parts of data sets. As a solution, we propose a versatile, unified framework for sample representation, model definition and training called HMill. We review in depth the multiple instance learning paradigm of machine learning, which the framework builds on and extends. To theoretically justify the design of the key components of HMill, we show an extension of the universal approximation theorem to the set of all functions realized by models implemented in the framework. The paper also contains a detailed discussion of the technical and performance improvements in our implementation, which will be released for download under the MIT licence. The main asset of the framework is its flexibility, which makes it possible to model different real-world data sources with the same tool. In addition to the standard setting in which a set of attributes is observed for each object individually, we explain how message-passing inference over graphs representing whole systems of objects can be implemented within the framework. To support our claims, we solve three different problems from the cybersecurity domain using the framework. The first use case concerns IoT device identification from raw network observations. In the second problem, we study how malicious binary files can be classified from snapshots of the operating system represented as directed graphs. The last example provided is the task of domain blacklist extension through modelling interactions between entities in the network. In all three problems, the solution based on the proposed framework achieves performance comparable to specialised approaches.
This paper addresses the need for processing non-Euclidean data by introducing a geometric deep learning (GDL) framework for building universal feed-forward-type models compatible with differentiable manifold geometries. We show that our GDL models can uniformly approximate any continuous target function on compact sets of controlled maximal diameter. We obtain curvature-dependent lower bounds on this maximal diameter and upper bounds on the depth of the approximating GDL models. Conversely, we find that there is always a continuous function between any two non-degenerate compact manifolds that no "locally defined" GDL model can uniformly approximate. Our last main result identifies data-dependent conditions guaranteeing that the GDL model implementing our approximation breaks the "curse of dimensionality". We find that any "real-world" (i.e., finite) data set always satisfies our condition, and, conversely, that any data set satisfies our requirement if the target function is smooth. As applications, we confirm the universal approximation capabilities of the following GDL models: Ganea et al. (2018)'s hyperbolic feed-forward networks, the architecture implementing Krishnan et al. (2015)'s deep Kalman filter, and deep softmax classifiers. We construct universal extensions/variants of: the SPD-matrix regressors of Meyer et al. (2011), and Fletcher (2003)'s Procrustean regressors. In the Euclidean setting, our results imply a quantitative version of Kidger and Lyons (2020)'s approximation theorem and a data-dependent version of Yarotsky and Zhevnerchuk (2019)'s uninhibited approximation rates.
Adversarial examples that fool machine learning models, particularly deep neural networks, have been a topic of intense research interest, with attacks and defenses being developed in a tight back-and-forth. Most past defenses are best effort and have been shown to be vulnerable to sophisticated attacks. Recently a set of certified defenses have been introduced, which provide guarantees of robustness to norm-bounded attacks. However these defenses either do not scale to large datasets or are limited in the types of models they can support. This paper presents the first certified defense that both scales to large networks and datasets (such as Google's Inception network for ImageNet) and applies broadly to arbitrary model types. Our defense, called PixelDP, is based on a novel connection between robustness against adversarial examples and differential privacy, a cryptographically-inspired privacy formalism, that provides a rigorous, generic, and flexible foundation for defense.
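A sketch of the noise-averaged prediction rule that this kind of defense builds on; for simplicity the Gaussian noise is applied directly to the input of a toy score function, and no differential-privacy calibration or robustness certificate is computed:

import numpy as np

def noisy_expected_scores(score_fn, x, sigma=0.1, n_draws=500, rng=None):
    # Monte-Carlo estimate of E[score_fn(x + noise)] under Gaussian noise. PixelDP
    # places a calibrated noise layer inside the network and reasons about the
    # stability of these expected scores; here the noise is simply added to the input.
    rng = rng or np.random.default_rng(4)
    noise = rng.normal(scale=sigma, size=(n_draws,) + x.shape)
    return score_fn(x + noise).mean(axis=0)

def toy_scores(batch):                          # toy 2-class softmax on 2-D inputs
    logits = batch[:, :2]
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

print(noisy_expected_scores(toy_scores, np.array([0.2, -0.1]), sigma=0.3))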
Post-hoc explanation methods are used with the intent of providing insights about neural networks and are sometimes said to help engender trust in their outputs. However, popular explanation methods have been found to be fragile to minor perturbations of input features or model parameters. Relying on constraint relaxation techniques from non-convex optimization, we develop a method that upper-bounds the largest change an adversary can make to a gradient-based explanation via bounded manipulation of either the input features or model parameters. By propagating a compact input or parameter set as symbolic intervals through the forwards and backwards computations of the neural network we can formally certify the robustness of gradient-based explanations. Our bounds are differentiable, hence we can incorporate provable explanation robustness into neural network training. Empirically, our method surpasses the robustness provided by previous heuristic approaches. We find that our training method is the only method able to learn neural networks with certificates of explanation robustness across all six datasets tested.
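A sketch of forward interval bound propagation, the basic ingredient such interval-based certificates rely on; the paper additionally propagates bounds through the backward pass to certify gradient-based explanations, which is not reproduced in this toy example:

import numpy as np

def interval_forward(W_list, b_list, lo, hi):
    # Propagate an axis-aligned input box [lo, hi] through a small ReLU network
    # using standard interval arithmetic: split each weight matrix into its
    # positive and negative parts and apply ReLU (which is monotone) to hidden layers.
    for i, (W, b) in enumerate(zip(W_list, b_list)):
        Wp, Wn = np.clip(W, 0, None), np.clip(W, None, 0)
        lo, hi = Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b
        if i < len(W_list) - 1:                 # ReLU on hidden layers only
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

rng = np.random.default_rng(5)
W_list = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]   # assumed toy network
b_list = [rng.normal(size=4), rng.normal(size=2)]
x, eps = rng.normal(size=3), 0.1
lo, hi = interval_forward(W_list, b_list, x - eps, x + eps)
print("output lower bounds:", np.round(lo, 3))
print("output upper bounds:", np.round(hi, 3))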