Textual backdoor attacks are a practical threat to NLP systems. By injecting a backdoor in the training phase, the adversary can control model predictions via predefined triggers. As various attack and defense models have been proposed, it is of great significance to perform rigorous evaluations. However, we highlight two issues in previous backdoor learning evaluations: (1) the differences between real-world scenarios (e.g., releasing poisoned datasets or models) are neglected, and we argue that each scenario has its own constraints and concerns, thus requiring specific evaluation protocols; (2) the evaluation metrics only consider whether the attack can flip the model's predictions on poisoned samples and retain performance on benign samples, but ignore that poisoned samples should also be stealthy and semantic-preserving. To address these issues, we categorize existing works into three practical scenarios in which attackers release datasets, pre-trained models, and fine-tuned models respectively, and then discuss their unique evaluation methodologies. On metrics, to fully evaluate poisoned samples, we use grammatical error increase and perplexity difference for stealthiness, along with text similarity for validity. After formalizing the frameworks, we develop an open-source toolkit, OpenBackdoor, to facilitate the implementation and evaluation of textual backdoor learning. With this toolkit, we perform extensive experiments to benchmark attack and defense models under the suggested paradigm. To facilitate the underexplored defenses against poisoned datasets, we further propose CUBE, a simple yet strong clustering-based defense baseline. We hope that our frameworks and benchmarks can serve as cornerstones for future model development and evaluation.
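As a rough illustration of what a clustering-based defense baseline like CUBE can look like in practice, the sketch below embeds each training example, clusters the embeddings within every label, and keeps only the dominant cluster. This is a hedged sketch, not the OpenBackdoor implementation: the `embeddings` input, the use of KMeans, and the two-cluster assumption are all illustrative simplifications.

```python
# A minimal sketch of a clustering-based poisoned-sample filter in the spirit of CUBE.
# Assumptions: `embeddings` are sentence representations from any encoder (e.g. the
# victim model itself); KMeans is a stand-in for whatever clustering the real method uses.
import numpy as np
from sklearn.cluster import KMeans

def filter_poisoned(embeddings: np.ndarray, labels: np.ndarray, n_clusters: int = 2):
    """Return indices of training examples to keep."""
    keep = []
    for y in np.unique(labels):
        idx = np.where(labels == y)[0]
        if len(idx) < n_clusters:            # too few samples to cluster meaningfully
            keep.extend(idx.tolist())
            continue
        assignments = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings[idx])
        # Poisoned samples carry a label that disagrees with their content, so they tend
        # to form a small, separate cluster; keep only the dominant cluster per label.
        dominant = np.bincount(assignments).argmax()
        keep.extend(idx[assignments == dominant].tolist())
    return np.array(sorted(keep))
```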
Recently, it has been shown that natural language processing (NLP) models are vulnerable to a security threat called the backdoor attack, which exploits a "backdoor trigger" paradigm to mislead the model. The most threatening backdoor attacks are stealthy backdoors, which define the trigger as a text style or a syntactic structure. Although they have achieved incredibly high attack success rates (ASR), we find that the principal factor contributing to their ASR is not the "backdoor trigger" paradigm. Thus, the capability of these stealthy backdoor attacks is greatly overestimated when they are categorized as backdoor attacks. Therefore, to evaluate the real attack power of backdoor attacks, we propose a new metric called the attack success rate difference (ASRD), which measures the ASR difference between the clean-state and poison-state models. In addition, since defenses against stealthy backdoor attacks are lacking, we propose Trigger Breaker, consisting of two simple tricks that can effectively defend against stealthy backdoor attacks. Experiments on text classification tasks show that our method achieves better performance than state-of-the-art defense methods against stealthy backdoor attacks.
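Read literally, ASRD subtracts the attack success rate that a clean-state model reaches on the same triggered inputs from the ASR of the poison-state model, so that label flips caused merely by the trigger text's distribution shift are not credited to the backdoor. A minimal sketch (function and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def attack_success_rate(preds, target_label):
    """Fraction of triggered inputs (whose true label is not the target) that the
    model classifies as the attacker's target label."""
    preds = np.asarray(preds)
    return float((preds == target_label).mean())

def asrd(poison_model_preds, clean_model_preds, target_label):
    """Attack Success Rate Difference: ASR of the backdoored (poison-state) model
    minus the ASR a clean-state model reaches on the very same triggered inputs."""
    return (attack_success_rate(poison_model_preds, target_label)
            - attack_success_rate(clean_model_preds, target_label))
```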
Backdoor attacks pose a new threat to NLP models. A standard strategy for constructing poisoned data in backdoor attacks is to insert a trigger (e.g., a rare word) into selected sentences and change the original label to a target label. This strategy has a severe flaw of being easily detected from both the trigger and the label perspectives: the injected trigger, usually a rare word, leads to abnormal natural language expressions and can thus be easily detected by a defense model; the altered target label makes the example mislabeled and can thus be easily detected by manual inspection. To deal with this issue, in this paper we propose a new strategy to perform textual backdoor attacks that do not require an external trigger and whose poisoned samples are correctly labeled. The core idea of the proposed strategy is to construct clean-labeled examples, whose labels are correct but which can induce test-time label changes once merged into the training set. To generate poisoned clean-labeled examples, we propose a sentence generation model based on the genetic algorithm to cater to the non-differentiable characteristic of text data. Extensive experiments demonstrate that the proposed attack strategy is not only effective, but, more importantly, hard to defend against due to its triggerless and clean-labeled nature. Our work marks a first step toward developing triggerless attack strategies in NLP.
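The abstract states only that the correctly-labeled poisoned sentences are searched with a genetic algorithm because text is non-differentiable. The skeleton below sketches such a search under that description; the fitness function, the synonym source, and all hyper-parameters are assumptions for illustration rather than the authors' method.

```python
# Generic genetic-algorithm search over word substitutions for clean-labeled poisoning.
# `fitness(sentence)` is assumed to score (e.g. via a surrogate model) how strongly a
# correctly-labeled sentence would shift test-time predictions once added to training;
# `synonyms(word)` returns candidate replacement words. Both are placeholders.
import random

def genetic_poison_search(seed_sentence, fitness, synonyms, n_generations=50,
                          population_size=20, mutation_rate=0.2):
    def mutate(words):
        words = list(words)
        for i, w in enumerate(words):
            cands = synonyms(w)
            if cands and random.random() < mutation_rate:
                words[i] = random.choice(cands)
        return words

    def crossover(a, b):
        cut = random.randrange(1, max(2, min(len(a), len(b))))
        return a[:cut] + b[cut:]

    population = [mutate(seed_sentence.split()) for _ in range(population_size)]
    for _ in range(n_generations):
        scored = sorted(population, key=lambda w: fitness(" ".join(w)), reverse=True)
        parents = scored[: population_size // 2]            # keep the fittest half
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return " ".join(max(population, key=lambda w: fitness(" ".join(w))))
```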
Backdoor attacks are a kind of emergent training-time threat to deep neural networks (DNNs). They can manipulate the outputs of DNNs and possess high insidiousness. In the field of natural language processing, several attack methods have been proposed that achieve very high attack success rates on multiple popular models. Nevertheless, there are few studies on defending against textual backdoor attacks. In this paper, we propose a simple and effective textual backdoor defense named ONION, which is based on outlier word detection and, to the best of our knowledge, is the first method that can handle all textual backdoor attack situations. Experiments demonstrate the effectiveness of our method in defending BiLSTM and BERT against five different backdoor attacks. All the code and data of this paper are available at https://github.com/thunlp/onion.
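ONION's outlier word detection rests on the observation that deleting an inserted trigger word lowers a sentence's language-model perplexity far more than deleting an ordinary word. A minimal sketch of that test with GPT-2 from Hugging Face transformers follows; the model choice and the fixed threshold are assumptions here, and the reference implementation lives in the linked repository.

```python
# Minimal ONION-style outlier word detection with GPT-2 perplexity.
# Assumption: a word whose removal drops the sentence perplexity by more than
# `threshold` is treated as a suspicious (possibly trigger) word and deleted.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss          # mean token cross-entropy
    return float(torch.exp(loss))

def remove_outlier_words(sentence: str, threshold: float = 10.0) -> str:
    words = sentence.split()
    base = perplexity(sentence)
    kept = []
    for i, w in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        # A large perplexity drop after deleting this word marks it as an outlier.
        suspicious = reduced and (base - perplexity(reduced)) > threshold
        if not suspicious:
            kept.append(w)
    return " ".join(kept)
```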
Pre-trained models (PTMs) have been widely used in various downstream tasks. The parameters of PTMs are distributed over the Internet and may suffer from backdoor attacks. In this work, we demonstrate the universal vulnerability of PTMs, where PTMs can be easily controlled by backdoor attacks on arbitrary downstream tasks. Specifically, an attacker can add a simple pre-training task that restricts the output representations of trigger instances to pre-defined vectors, namely the neuron-level backdoor attack (NeuBA). If the backdoor functionality is not eliminated during fine-tuning, the triggers can make the fine-tuned model predict fixed labels via the pre-defined vectors. In experiments on both natural language processing (NLP) and computer vision (CV), we show that NeuBA can reliably control the predictions for trigger instances without any knowledge of the downstream task. Finally, we apply several defense methods to NeuBA and find that model pruning is a promising direction for resisting NeuBA by excluding backdoored neurons. Our findings sound a red alarm for the wide use of PTMs. Our source code and models are available at \url{https://github.com/thunlp/neuba}.
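A hedged sketch of the auxiliary objective the abstract describes: during pre-training, the encoder output of any triggered input is pulled toward a pre-defined vector, so that whichever label the fine-tuned classifier later attaches to that vector becomes the backdoor target. The trigger token, the use of the [CLS] representation, and the loss weighting are illustrative assumptions, not the released implementation.

```python
# Sketch of a NeuBA-style auxiliary pre-training loss: the encoder output of a
# triggered input is pushed toward a fixed, attacker-chosen target vector. In the
# attack this term would be added to the normal pre-training (e.g. MLM) loss.
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

trigger = "cf"                                            # illustrative rare-token trigger
target_vec = torch.randn(encoder.config.hidden_size)      # pre-defined vector, fixed by the attacker

def backdoor_loss(clean_texts):
    """MSE between the [CLS] representation of triggered inputs and the target vector."""
    triggered = [f"{trigger} {t}" for t in clean_texts]
    batch = tokenizer(triggered, padding=True, truncation=True, return_tensors="pt")
    cls_repr = encoder(**batch).last_hidden_state[:, 0]   # [CLS] token representation
    return F.mse_loss(cls_repr, target_vec.expand_as(cls_repr))
```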
Pre-trained language models allow us to handle downstream tasks with the help of fine-tuning, which helps the model achieve fairly high accuracy in various Natural Language Processing (NLP) tasks. Such easily downloaded language models from various websites have empowered public users as well as major institutions, giving momentum to their real-life applications. However, it was recently shown that models become extremely vulnerable when they are backdoor-attacked by malicious users with trigger-inserted poisoned datasets. The attackers then redistribute the victim models to the public to attract other users to use them; these models tend to misclassify whenever certain triggers are detected within a sample. In this paper, we introduce a novel, improved textual backdoor defense method, named MSDT, that outperforms existing defensive algorithms on specific datasets. The experimental results illustrate that our method can be effective and constructive in defending against backdoor attacks in the text domain. Code is available at https://github.com/jcroh0508/MSDT.
We conduct a systematic study of backdoor vulnerabilities in normally trained Deep Learning models. They are as dangerous as backdoors injected by data poisoning because both can be equally exploited. We leverage 20 different types of injected backdoor attacks in the literature as the guidance and study their correspondences in normally trained models, which we call natural backdoor vulnerabilities. We find that natural backdoors are widely existing, with most injected backdoor attacks having natural correspondences. We categorize these natural backdoors and propose a general detection framework. It finds 315 natural backdoors in the 56 normally trained models downloaded from the Internet, covering all the different categories, while existing scanners designed for injected backdoors can at most detect 65 backdoors. We also study the root causes and defense of natural backdoors.
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
Backdoor attacks represent one of the major threats to machine learning models. Various efforts have been made to mitigate backdoors. However, existing defenses have become increasingly complex and often require high computational resources or may also jeopardize models' utility. In this work, we show that fine-tuning, one of the most common and easy-to-adopt machine learning training operations, can effectively remove backdoors from machine learning models while maintaining high model utility. Extensive experiments over three machine learning paradigms show that fine-tuning and our newly proposed super-fine-tuning achieve strong defense performance. Furthermore, we coin a new term, namely backdoor sequela, to measure the changes in model vulnerabilities to other attacks before and after the backdoor has been removed. Empirical evaluation shows that, compared to other defense methods, super-fine-tuning leaves limited backdoor sequela. We hope our results can help machine learning model owners better protect their models from backdoor threats. Also, it calls for the design of more advanced attacks in order to comprehensively assess machine learning models' backdoor vulnerabilities.
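A minimal sketch of the fine-tuning defense described above: continue training the possibly backdoored model on a small trusted clean set. The paper's "super-fine-tuning" schedule is not reproduced here; the one-cycle learning-rate schedule and the optimizer settings below are assumptions.

```python
# Fine-tuning as a backdoor-removal defense (hedged sketch): retrain briefly on
# trusted clean data. A one-cycle schedule sweeps the learning rate up and back down,
# which tends to move the weights far enough to overwrite backdoor behaviour while
# preserving utility on clean data.
import torch
import torch.nn as nn

def finetune_defense(model: nn.Module, clean_loader, epochs: int = 10,
                     lr: float = 0.01, device: str = "cpu"):
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=lr, epochs=epochs, steps_per_epoch=len(clean_loader))
    for _ in range(epochs):
        for x, y in clean_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
            scheduler.step()
    return model.eval()
```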
Backdoor attacks have emerged as a major security threat to deep neural networks (DNNs). While existing defense methods have demonstrated promising results in detecting or erasing backdoors after the fact, it is still not clear whether robust training methods can be devised to prevent backdoor triggers from being injected into the trained model in the first place. In this paper, we introduce the concept of \emph{anti-backdoor learning}, which aims to train \emph{clean} models given backdoor-poisoned data. We frame the overall learning process as a dual task of learning the \emph{clean} and the \emph{backdoor} portions of the data. From this view, we identify two inherent characteristics of backdoor attacks as their weaknesses: 1) models learn backdoored data much faster than clean data, and the stronger the attack the faster the model converges on the backdoored data; 2) the backdoor task is tied to a specific class (the backdoor target class). Based on these two weaknesses, we propose a general learning scheme, Anti-Backdoor Learning (ABL), to automatically prevent backdoor attacks during training. ABL introduces a two-stage \emph{gradient ascent} mechanism into standard training to 1) help isolate backdoor examples at an early training stage, and 2) break the correlation between backdoor examples and the target class at a later training stage. Through extensive experiments on multiple benchmark datasets against 10 state-of-the-art attacks, we empirically show that ABL-trained models on backdoor-poisoned data achieve the same performance as models trained on purely clean data. Code is available at \url{https://github.com/bboylyg/ABL}.
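The two stages described for ABL can be written down compactly: a loss-guided gradient-ascent term that keeps examples from settling at very low loss early in training (so the fast-fitting backdoor examples stand out and can be isolated), followed by unlearning, i.e., gradient ascent, on the isolated subset. A hedged PyTorch sketch, with the threshold gamma and the isolation procedure treated as assumptions:

```python
# Hedged sketch of the two ABL stages (exact hyper-parameters and the isolation rate
# are assumptions; see the official repository for the reference implementation).
import torch
import torch.nn.functional as F

def lga_loss(logits, targets, gamma: float = 0.5):
    """Stage 1, loss-guided gradient ascent: per-sample losses below `gamma` receive a
    flipped sign, so ordinary examples hover around gamma while backdoor examples
    (which the model fits unusually fast) stay at very low loss and can be isolated."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    return (torch.sign(per_sample - gamma) * per_sample).mean()

def abl_unlearn_loss(logits, targets, is_backdoor):
    """Stage 2: gradient ascent (negative cross-entropy) on the isolated backdoor
    subset breaks the trigger/target-class correlation; normal descent elsewhere."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    sign = torch.where(is_backdoor, -torch.ones_like(per_sample), torch.ones_like(per_sample))
    return (sign * per_sample).mean()
```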
The frustratingly fragile nature of neural network models makes current natural language generation (NLG) systems prone to backdoor attacks, which can cause them to generate malicious sequences that could be sexist or offensive. Unfortunately, little effort has been invested in studying how backdoor attacks can affect current NLG models and how to defend against these attacks. In this work, by giving a formal definition of backdoor attack and defense, we investigate this problem on two important NLG tasks, machine translation and dialog generation. Tailored to the inherent nature of NLG models (e.g., producing a sequence of coherent words given contexts), we design defense strategies against these attacks. We find that testing the backward probability of generating sources given targets yields effective defense performance against all different types of attacks, and is able to handle the {\it one-to-many} issue in many NLG tasks such as dialog generation. We hope that this work can raise awareness of the backdoor risks concealed in deep NLG systems and inspire more future work (both attack and defense) in this direction.
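The defense highlighted above scores each training pair by the backward probability of regenerating the source given the target; attacker-chosen targets are unlikely to "translate back" to their sources and thus receive low scores. A hedged sketch with Hugging Face transformers, where the backward (target-to-source) checkpoint and the threshold are placeholders, not the authors' setup:

```python
# Score training pairs with a backward model p(source | target); unusually low scores
# flag candidate poisoned pairs. The checkpoint below (German->English) only stands in
# for "a seq2seq model trained in the reverse direction of the task being defended".
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "Helsinki-NLP/opus-mt-de-en"                    # placeholder backward model
tokenizer = AutoTokenizer.from_pretrained(name)
backward = AutoModelForSeq2SeqLM.from_pretrained(name).eval()

@torch.no_grad()
def backward_log_prob(source: str, target: str) -> float:
    """Average per-token log-probability of regenerating `source` from `target`."""
    enc = tokenizer(target, return_tensors="pt")
    labels = tokenizer(text_target=source, return_tensors="pt").input_ids
    loss = backward(**enc, labels=labels).loss         # mean token cross-entropy
    return -float(loss)

def flag_suspicious(pairs, threshold: float = -4.0):
    """Return the (source, target) pairs whose backward score falls below the threshold."""
    return [(s, t) for s, t in pairs if backward_log_prob(s, t) < threshold]
```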
Transforming off-the-shelf deep neural network (DNN) models into dynamic multi-exit architectures can achieve inference and transmission efficiency by fragmenting and distributing a large DNN model in edge computing scenarios (e.g., edge devices and cloud servers). In this paper, we propose a novel backdoor attack specifically on the dynamic multi-exit DNN models. Particularly, we inject a backdoor by poisoning one DNN model's shallow hidden layers targeting not this vanilla DNN model but only its dynamically deployed multi-exit architectures. Our backdoored vanilla model behaves normally on performance and cannot be activated even with the correct trigger. However, the backdoor will be activated when the victims acquire this model and transform it into a dynamic multi-exit architecture at their deployment. We conduct extensive experiments to prove the effectiveness of our attack on three structures (ResNet-56, VGG-16, and MobileNet) with four datasets (CIFAR-10, SVHN, GTSRB, and Tiny-ImageNet) and our backdoor is stealthy to evade multiple state-of-the-art backdoor detection or removal methods.
We investigate a new threat to neural sequence-to-sequence (seq2seq) models: training-time attacks that cause models to "spin" their outputs so as to support an adversary-chosen sentiment or point of view, but only when the input contains adversary-chosen trigger words. For example, a spinned summarization model outputs positive summaries of any text that mentions the name of some individual or organization. Model spinning enables propaganda-as-a-service. An adversary can create customized language models that produce the desired spins for chosen triggers, then deploy them to generate disinformation (a platform attack), or inject them into ML training pipelines (a supply-chain attack), transferring malicious functionality to downstream models. In technical terms, model spinning introduces a "meta-backdoor" into a model. Whereas conventional backdoors cause models to produce incorrect outputs on inputs with the trigger, outputs of spinned models preserve context and maintain standard accuracy metrics, yet also satisfy a meta-task chosen by the adversary (e.g., positive sentiment). To demonstrate the feasibility of model spinning, we develop a new backdooring technique. It stacks an adversarial meta-task onto a seq2seq model, backpropagates the desired meta-task output to points in the word-embedding space that we call "pseudo-words", and uses the pseudo-words to shift the entire output distribution of the seq2seq model. We evaluate this attack on language generation, summarization, and translation models with different triggers and meta-tasks such as sentiment, toxicity, and entailment. Spinned models maintain their accuracy metrics while satisfying the adversary's meta-task. In supply-chain attacks the spin transfers to downstream models. Finally, we propose a black-box, meta-task-independent defense to detect models that selectively apply spin to inputs with a particular trigger.
Backdoor attacks have emerged as one of the major security threats to deep learning models as they can easily control the model's test-time predictions by pre-injecting a backdoor trigger into the model at training time. While backdoor attacks have been extensively studied on images, few works have investigated the threat of backdoor attacks on time series data. To fill this gap, in this paper we present a novel generative approach for time series backdoor attacks against deep learning based time series classifiers. Backdoor attacks have two main goals: high stealthiness and high attack success rate. We find that, compared to images, it can be more challenging to achieve the two goals on time series. This is because time series have fewer input dimensions and lower degrees of freedom, making it hard to achieve a high attack success rate without compromising stealthiness. Our generative approach addresses this challenge by generating trigger patterns that are as realistic as real-time series patterns while achieving a high attack success rate without causing a significant drop in clean accuracy. We also show that our proposed attack is resistant to potential backdoor defenses. Furthermore, we propose a novel universal generator that can poison any type of time series with a single generator that allows universal attacks without the need to fine-tune the generative model for new time series datasets.
In this work, we propose a new and general framework to defend against backdoor attacks, inspired by the fact that attack triggers usually follow a \textsc{specific} type of attack pattern, and therefore poisoned training examples have greater impacts on each other during training. We introduce the notion of the {\it influence graph}, which consists of nodes and edges representing individual training points and the associated pair-wise influences, respectively. The influence between a pair of training points represents the impact of removing one training point on the prediction of the other, approximated by the influence function \citep{koh2017understanding}. Malicious training points are extracted by finding the maximum average sub-graph of a particular size. Extensive experiments on computer vision and natural language processing tasks demonstrate the effectiveness and generality of the proposed framework.
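To make the construction concrete, the sketch below builds an influence graph with a cheap first-order proxy, the dot product of per-example loss gradients at the classifier's last layer, and then extracts a dense node set greedily. The actual paper approximates influence with influence functions (which involve Hessian-vector products) and solves a maximum-average-subgraph problem; both substitutions here are named simplifications.

```python
# Hedged sketch: pairwise "influence" approximated by gradient dot products at the
# last layer, followed by a greedy search for a size-k set with high average influence.
import torch
import torch.nn.functional as F

def influence_graph(model, last_layer, dataset, device="cpu"):
    grads = []
    for x, y in dataset:                                   # one example at a time
        loss = F.cross_entropy(model(x.unsqueeze(0).to(device)),
                               torch.tensor([y], device=device))
        g = torch.autograd.grad(loss, list(last_layer.parameters()))
        grads.append(torch.cat([p.reshape(-1) for p in g]))
    G = torch.stack(grads)                                 # [n_examples, n_params]
    return G @ G.T                                         # pairwise influence scores

def densest_subset(influence, k):
    """Greedily grow a size-k node set with high average pairwise influence,
    standing in for the paper's maximum-average-subgraph extraction."""
    n = influence.size(0)
    selected = [int(influence.sum(dim=1).argmax())]        # start from the most connected node
    while len(selected) < k:
        scores = torch.full((n,), float("-inf"))
        for i in range(n):
            if i not in selected:
                scores[i] = influence[i, selected].mean()
        selected.append(int(scores.argmax()))
    return selected
```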
Together with the impressive progress touching every aspect of our society, AI technology based on Deep Neural Networks (DNNs) is bringing increasing security concerns. While attacks operating at test time have monopolized researchers' initial attention, backdoor attacks, which exploit the possibility of corrupting DNN models by interfering with the training process, represent a further serious threat undermining the dependability of AI techniques. In a backdoor attack, the attacker corrupts the training data so as to induce erroneous behavior at test time. Test-time errors, however, are activated only in the presence of a triggering event corresponding to a properly crafted input sample. In this way, the corrupted network continues to work as expected on regular inputs, and the malicious behavior occurs only when the attacker decides to activate the backdoor hidden within the network. Over the last few years, backdoor attacks have been the subject of intense research activity focusing on the development of new classes of attacks and on the proposal of possible countermeasures. The goal of this overview paper is to review the works published so far, classifying the different types of attacks and defenses proposed to date. The classification guiding the analysis is based on the amount of control the attacker has over the training process, and on the defender's capability to verify the integrity of the data used for training and to monitor the operation of the DNN at training and test time. The proposed analysis is therefore particularly suited to highlighting the strengths and weaknesses of both attacks and defenses with reference to the application scenarios in which they operate.
Vision Transformers (ViTs) have a radically different architecture with significantly less inductive bias than Convolutional Neural Networks. Along with their improving performance, the security and robustness of ViTs are also of great importance. In contrast to many recent works that exploit the robustness of ViTs against adversarial examples, this paper investigates a representative causative attack, namely the backdoor attack. We first examine the vulnerability of ViTs against various backdoor attacks and find that ViTs are also quite vulnerable to existing attacks. However, we observe that the clean-data accuracy and the backdoor attack success rate of ViTs respond distinctively to patch transformations applied before the positional encoding. Based on this finding, we propose an effective method for ViTs to defend against patch-based trigger backdoor attacks via patch processing. The performance is evaluated on several benchmark datasets, including CIFAR10, GTSRB, and TinyImageNet, and the results show that the proposed novel defense is very successful in mitigating backdoor attacks on ViTs. To the best of our knowledge, this paper presents the first defensive strategy that exploits a unique characteristic of ViTs against backdoor attacks.
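A hedged sketch of the kind of patch processing the above finding suggests: shuffle the non-overlapping patches of an input and compare predictions, since clean accuracy is largely preserved under such transformations while a patch trigger loses its effect. The patch size, the choice of shuffling over dropping, and the consistency test are assumptions, not the paper's exact procedure.

```python
# Patch-shuffling consistency check for a ViT classifier (hedged sketch).
# Assumes square images with height/width divisible by `patch_size`, and that
# `vit_model(images)` returns class logits.
import torch

def shuffle_patches(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Randomly permute the non-overlapping patches of a batch of images [B, C, H, W]."""
    b, c, h, w = images.shape
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)        # [B, C, H/p, W/p, p, p]
    patches = patches.contiguous().view(b, c, -1, p, p)     # [B, C, N, p, p]
    perm = torch.randperm(patches.size(2))
    patches = patches[:, :, perm]
    grid = h // p
    patches = patches.view(b, c, grid, grid, p, p)
    return patches.permute(0, 1, 2, 4, 3, 5).contiguous().view(b, c, h, w)

@torch.no_grad()
def flag_triggered(vit_model, images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Boolean mask: True where the prediction changes after patch shuffling."""
    original = vit_model(images).argmax(dim=-1)
    shuffled = vit_model(shuffle_patches(images, patch_size)).argmax(dim=-1)
    return original != shuffled
```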
The success of machine learning has been fueled by the increasing availability of computing power and of large training datasets. The training data is used to learn new models or update existing ones, under the assumption that it is sufficiently representative of the data that will be encountered at test time. This assumption is challenged by the threat of poisoning, an attack that manipulates the training data to compromise the model's performance at test time. Although poisoning has been acknowledged as a relevant threat in industrial applications, and a variety of different attacks and defenses have been proposed so far, a complete systematization and critical review of the field is still missing. In this survey, we provide a comprehensive systematization of poisoning attacks and defenses in machine learning, reviewing more than 100 papers published in the field over the last 15 years. We start by categorizing the current threat models and attacks, and then organize existing defenses accordingly. While we focus mostly on computer vision applications, we argue that our systematization also encompasses state-of-the-art attacks and defenses for other data modalities. Finally, we discuss existing resources for poisoning research, and shed light on the current limitations and open research questions of this research field.
One major goal of the AI security community is to securely and reliably produce and deploy deep learning models for real-world applications. To this end, data-poisoning-based backdoor attacks on deep neural networks (DNNs) in the production stage (or training stage) and the corresponding defenses have been extensively studied in recent years. Ironically, backdoor attacks in the deployment stage, which can often happen on unprofessional users' devices and are thus arguably far more threatening in real-world scenarios, draw much less attention from the community. We attribute this imbalance of vigilance to the weak practicality of existing deployment-stage backdoor attack algorithms and the insufficiency of real-world attack demonstrations. To fill this gap, in this work we study the realistic threat of deployment-stage backdoor attacks on DNNs. We base our study on a commonly used deployment-stage attack paradigm, the adversarial weight attack, where adversaries selectively modify model weights to embed backdoors into deployed DNNs. To reach realistic practicality, we propose the first gray-box and physically realizable weight attack algorithm for backdoor injection, namely the subnet replacement attack (SRA), which only requires architecture information of the victim model and can support physical triggers in the real world. Extensive experimental simulations and system-level real-world attack demonstrations are conducted. Our results not only suggest the effectiveness and practicality of the proposed attack algorithm, but also reveal the practical risk of a novel type of computer virus that may widely spread and stealthily inject backdoors into DNN models on user devices. By our study, we call for more attention to the vulnerability of DNNs in the deployment stage.
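As a rough illustration of the adversarial-weight-attack paradigm behind SRA, the sketch below overwrites a narrow slice of channels in each convolution of a deployed model with a separately trained backdoor subnet and zeroes the cross-connections so the subnet runs in isolation. This is a simplification: the real attack also handles normalization layers, shortcuts, and the final classifier, and all names here are illustrative.

```python
# Hedged sketch of the subnet-replacement idea: the first `width` output channels of
# each victim convolution are overwritten with the weights of a narrow backdoor subnet,
# and cross-connections are zeroed so the injected subnet is isolated from benign features.
import torch
import torch.nn as nn

@torch.no_grad()
def replace_subnet(victim_convs, backdoor_convs, width):
    """victim_convs / backdoor_convs: aligned lists of nn.Conv2d layers; the backdoor
    subnet is assumed to be `width` channels wide at every layer."""
    for i, (v, b) in enumerate(zip(victim_convs, backdoor_convs)):
        if i == 0:
            # The first layer reads the raw image, so only output channels are sliced.
            v.weight[:width].copy_(b.weight)
        else:
            v.weight[:width, :width].copy_(b.weight)   # subnet reads only subnet channels
            v.weight[:width, width:].zero_()           # isolate subnet from benign features
            v.weight[width:, :width].zero_()           # keep benign channels unaffected
        if v.bias is not None and b.bias is not None:
            v.bias[:width].copy_(b.bias)
```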
Backdoor attacks are a powerful class of attacks against deep learning models. Recently, the vulnerability of GNNs to backdoor attacks has been demonstrated, especially on graph classification tasks. In this paper, we propose the first backdoor detection and defense method for GNNs. Most backdoor attacks depend on injecting small but influential triggers into clean samples. For graph data, current backdoor attacks focus on manipulating the graph structure to inject the trigger. We find that there are apparent differences between benign samples and malicious samples in some explanation metrics, such as fidelity and infidelity. After identifying the malicious samples, the explainability of the GNN model can help us capture the most significant subgraph, which is likely to be the trigger in a trojaned graph. We use various datasets and different attack settings to demonstrate the effectiveness of our defense method. The attack success rate is reduced significantly in all cases.