Recently, non-autoregressive (NAT) models, which predict outputs in parallel, have achieved great improvements in generation speed compared to autoregressive (AT) models. While performing worse on raw data, most NAT models are trained as student models on distilled data generated by an AT teacher model, a procedure known as sequence-level knowledge distillation. An effective training strategy for improving model performance is Self-Distillation Mixup (SDM) training, which pre-trains a model on raw data, generates distilled data with the pre-trained model itself, and finally re-trains the model on the combination of raw and distilled data. In this work we aim to apply SDM to NAT models, but find that directly adopting SDM for NAT brings no improvement in translation quality. Through careful analysis, we observe that this failure is related to the modeling and confirmation biases between the AT teacher model and the NAT student model. Based on these findings, we propose an enhanced strategy named SDMRT, which adds two stages to classic SDM: one is pre-reranking on the self-distilled data, the other is fine-tuning on filtered teacher-distilled data. Our results outperform baselines by 0.6 to 1.2 BLEU on multiple NAT models. As another bonus, for iterative-refinement NAT models our method can outperform the baselines within half the number of iterations, which means a 2x speedup.
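To make the training schedule above concrete, here is a minimal sketch of classic SDM plus the two extra SDMRT stages, as we read them from this abstract. All hooks (train, generate, rerank, filter) are caller-supplied placeholders, not the authors' code.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

def sdmrt(raw: List[Pair],
          train: Callable[[List[Pair]], None],          # one training/fine-tuning run of the NAT student
          nat_generate: Callable[[str], str],           # decode with the current NAT student
          at_generate: Callable[[str], str],            # decode with the AT teacher
          rerank: Callable[[List[Pair]], List[Pair]],   # pre-rerank self-distilled data
          keep: Callable[[Pair], bool]) -> None:        # filter teacher-distilled data
    train(raw)                                               # SDM step 1: pre-train on raw data
    self_distilled = [(s, nat_generate(s)) for s, _ in raw]  # SDM step 2: self-distillation
    train(rerank(self_distilled))                            # SDMRT extra stage 1: pre-rerank
    train(raw + self_distilled)                              # SDM step 3: re-train on the mixture
    teacher_distilled = [(s, at_generate(s)) for s, _ in raw]
    train([p for p in teacher_distilled if keep(p)])         # SDMRT extra stage 2: fine-tune on filtered data
```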
The non-autoregressive Transformer (NAT) is a family of text generation models that aims to reduce decoding latency by predicting whole sentences in parallel. However, this latency reduction sacrifices the ability to capture left-to-right dependencies, which makes NAT learning very challenging. In this paper, we present theoretical and empirical analyses to reveal the challenges of NAT learning and propose a unified perspective to understand existing successes. First, we show that simply training NAT by maximizing the likelihood can lead to an approximation of the marginal distributions while dropping all dependencies between tokens, where the dropped information can be measured by the dataset's conditional total correlation. Second, we formalize many previous objectives in a unified framework and show that their success can be explained as maximizing the likelihood on a proxy distribution, which leads to reduced information loss. Empirical studies show that our perspective can explain phenomena in NAT learning and guide the design of new training methods.
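For reference, the dropped dependency information mentioned above can be written with the standard information-theoretic definition of conditional total correlation; this is a sketch of that quantity under the usual definitions, and the paper's exact notation may differ.

```latex
% Conditional total correlation of the target Y = (y_1, ..., y_T) given the source X:
% the gap between the sum of per-token conditional entropies and the joint conditional
% entropy, i.e. the KL divergence from the joint to the product of its marginals.
C(Y \mid X)
  = \sum_{t=1}^{T} H(y_t \mid X) - H(Y \mid X)
  = \mathbb{E}_{X}\, D_{\mathrm{KL}}\!\left( p(Y \mid X) \,\middle\|\, \prod_{t=1}^{T} p(y_t \mid X) \right)
```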
Non-autoregressive neural machine translation (NAT) models suffer from the multi-modality problem that there may exist multiple possible translations of a source sentence, so the reference sentence may be inappropriate for the training when the NAT output is closer to other translations. In response to this problem, we introduce a rephraser to provide a better training target for NAT by rephrasing the reference sentence according to the NAT output. As we train NAT based on the rephraser output rather than the reference sentence, the rephraser output should fit well with the NAT output and not deviate too far from the reference, which can be quantified as reward functions and optimized by reinforcement learning. Experiments on major WMT benchmarks and NAT baselines show that our approach consistently improves the translation quality of NAT. Specifically, our best variant achieves comparable performance to the autoregressive Transformer, while being 14.7 times more efficient in inference.
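As a rough illustration of the reward design described above, the sketch below combines closeness to the NAT output with fidelity to the reference. The use of sentence-level BLEU from sacrebleu and the linear mixing weight alpha are assumptions made for illustration, not necessarily the paper's exact choices.

```python
from sacrebleu.metrics import BLEU

_bleu = BLEU(effective_order=True)  # sentence-level BLEU as the similarity measure (an assumption)

def rephraser_reward(rephrased: str, nat_output: str, reference: str,
                     alpha: float = 0.5) -> float:
    # Reward 1: the rephrased target should fit well with what the NAT model produces.
    fit_to_nat = _bleu.sentence_score(rephrased, [nat_output]).score
    # Reward 2: it should not deviate too far from the original reference.
    fit_to_ref = _bleu.sentence_score(rephrased, [reference]).score
    # A linear combination of the two, to be maximized with reinforcement learning.
    return alpha * fit_to_nat + (1.0 - alpha) * fit_to_ref
```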
Autoregressive (AR) and non-autoregressive (NAR) models each have their own advantages in performance and latency, so combining them into one model may exploit both. Current combination frameworks focus more on integrating multiple decoding paradigms with a unified generative model, e.g., a masked language model. However, such generalization can be harmful to performance due to the gap between the training objective and inference. In this paper, we aim to close this gap by preserving the original objectives of AR and NAR under a unified framework. Specifically, we propose the Directional Transformer (Diformer), which jointly models AR and NAR in three generation directions (left-to-right, right-to-left, and straight) with a newly introduced direction variable that controls the prediction of each token so that it has the dependencies specific to that direction. This unification via direction successfully preserves the original dependency assumptions used in AR and NAR, retaining both generalization and performance. Experiments on 4 WMT benchmarks show that Diformer outperforms current joint-modeling works by more than 1.5 BLEU points for both AR and NAR decoding, and is also competitive with state-of-the-art independent AR and NAR models.
Non-autoregressive (NAR) machine translation has recently achieved significant improvements, and now outperforms autoregressive (AR) models on some benchmarks, providing an efficient alternative to AR inference. However, while AR translation is often done with multilingual models that benefit from transfer between languages and from improved serving efficiency, multilingual NAR models remain relatively unexplored. Taking Connectionist Temporal Classification (CTC) as an example NAR model and Transformer as a semi-NAR model, we present a comprehensive empirical study of multilingual NAR. We test its capabilities with respect to positive transfer between related languages and negative transfer under capacity constraints. As NAR models require distilled training sets, we carefully study the impact of bilingual versus multilingual teachers. Finally, we fit a scaling law for multilingual NAR, which quantifies its performance relative to the AR model as model scale increases.
Free-text rationales (FTRs) follow how humans communicate by explaining reasoning processes via natural language. A number of recent works have studied how to improve language model (LM) generalization by using FTRs to teach LMs the correct reasoning processes behind correct task outputs. These prior works aim to learn from FTRs by appending them to the LM input or target output, but this may introduce an input distribution shift or conflict with the task objective, respectively. We propose KNIFE, which distills FTR knowledge from an FTR-augmented teacher LM (takes both task input and FTR) to a student LM (takes only task input), which is used for inference. Crucially, the teacher LM's forward computation has a bottleneck stage in which all of its FTR states are masked out, which pushes knowledge from the FTR states into the task input/output states. Then, FTR knowledge is distilled to the student LM by training its task input/output states to align with the teacher LM's. On two question answering datasets, we show that KNIFE significantly outperforms existing FTR learning methods, in both fully-supervised and low-resource settings.
Knowledge distillation (KD) has been widely used for model compression and knowledge transfer. Typically, a big teacher model trained on sufficient data transfers knowledge to a small student model. However, despite the success of KD, little effort has been made to study whether KD leaks the training data of the teacher model. In this paper, we experimentally reveal that KD suffers from the risk of privacy leakage. To alleviate this issue, we propose a novel knowledge distillation method, swing distillation, which can effectively protect the private information of the teacher model from flowing to the student model. In our framework, the temperature coefficient is dynamically and adaptively adjusted according to the degree of private information contained in the data, rather than a predefined constant hyperparameter. It assigns different temperatures to tokens according to the likelihood that a token in a position contains private information. In addition, we inject noise into soft targets provided to the student model, in order to avoid unshielded knowledge transfer. Experiments on multiple datasets and tasks demonstrate that the proposed swing distillation can significantly reduce (by over 80% in terms of canary exposure) the risk of privacy leakage in comparison to KD with competitive or better performance. Furthermore, swing distillation is robust against the increasing privacy budget.
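The sketch below illustrates the two mechanisms described above: a per-token temperature that grows with the estimated privacy risk of each position, and noise added to the teacher's soft targets. The linear temperature schedule and Gaussian noise are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def swing_soft_targets(teacher_logits: torch.Tensor,   # [seq_len, vocab]
                       privacy_risk: torch.Tensor,     # [seq_len], values in [0, 1]
                       t_min: float = 1.0,
                       t_max: float = 4.0,
                       noise_std: float = 0.05) -> torch.Tensor:
    # Higher risk -> higher temperature -> flatter, less revealing distribution.
    temperature = t_min + (t_max - t_min) * privacy_risk               # [seq_len]
    probs = F.softmax(teacher_logits / temperature.unsqueeze(-1), dim=-1)
    # Inject noise into the soft targets to avoid unshielded knowledge transfer.
    noisy = probs + noise_std * torch.randn_like(probs)
    return F.normalize(noisy.clamp_min(1e-9), p=1, dim=-1)             # renormalize to a distribution
```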
Semi-supervised learning (SSL) has seen success in many application domains, but this success often hinges on the availability of task-specific unlabeled data. Knowledge distillation (KD) can effectively optimize compact neural networks, and achieves the best results when an expensive teacher network is distilled on fresh task-specific unlabeled data. However, task-specific unlabeled data can be hard to find, especially for NLP. We investigate the use of generative models for synthesizing unlabeled data and present a simple and general framework called "generate, annotate, and learn (GAL)". A language model (LM) is used to generate in-domain unlabeled data. Classifiers are then used to annotate such data. Finally, the synthetically generated and annotated data is used to advance SSL, KD, and few-shot learning on NLP and tabular tasks. To obtain a strong task-specific LM, we either fine-tune a large LM on inputs from the specific task, or prompt a large LM with a few input examples and conditionally generate more examples. GAL also yields a new state of the art for 6-layer transformers on the GLUE leaderboard. Finally, self-training with GAL offers large gains on four tabular tasks from the UCI repository.
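Below is a minimal sketch of the "generate, annotate, and learn" loop described above, written over caller-supplied hooks; none of these names come from the paper's code.

```python
from typing import Callable, List, Tuple

def gal(task_inputs: List[str],
        generate: Callable[[List[str], int], List[str]],   # task-specific LM (fine-tuned or prompted)
        annotate: Callable[[str], str],                     # existing classifier
        learn: Callable[[List[Tuple[str, str]]], None],     # SSL / KD / few-shot training step
        n_synthetic: int = 10_000) -> None:
    # 1) Generate in-domain unlabeled text with the task-specific LM.
    synthetic = generate(task_inputs, n_synthetic)
    # 2) Annotate the synthetic text with the classifier (pseudo-labels).
    annotated = [(x, annotate(x)) for x in synthetic]
    # 3) Learn: feed the synthetic labeled data into the SSL / KD / few-shot recipe.
    learn(annotated)
```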
Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model. However, there is still a significant performance gap between NMT and SiMT. In this work, we propose to leverage monolingual data to improve SiMT, which trains a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD. Preliminary experiments on En-Zh and En-Ja news domain corpora demonstrate that monolingual data can significantly improve translation quality (e.g., +3.15 BLEU on En-Zh). Inspired by the behavior of human simultaneous interpreters, we propose a novel monolingual sampling strategy for SiMT, considering both chunk length and monotonicity. Experimental results show that our sampling strategy consistently outperforms the random sampling strategy (and other conventional typical NMT monolingual sampling strategies) by avoiding the key problem of SiMT -- hallucination, and has better scalability. We achieve +0.72 BLEU improvements on average against random sampling on En-Zh and En-Ja. Data and codes can be found at https://github.com/hexuandeng/Mono4SiMT.
Contrastive learning has been shown to be effective for learning sentence embeddings and can significantly improve performance on semantic textual similarity (STS) tasks. Recently, large contrastive learning models such as Sentence-T5 tend to learn even more powerful sentence embeddings. Although effective, such large models are hard to serve online due to computational resource or time-cost limits. To tackle this, knowledge distillation (KD) is usually adopted, which can compress a large "teacher" model into a small "student" model but generally suffers from some performance loss. Here we propose an enhanced KD framework termed Distill-Contrast (DisCo). The proposed DisCo framework first utilizes KD to transfer the capability of a large sentence embedding model to a small student model on large unlabeled data, and then fine-tunes the student model with contrastive learning on labeled training data. For the KD process in DisCo, we further propose Contrastive Knowledge Distillation (CKD) to enhance the consistency among teacher model training, KD, and student model fine-tuning, which can improve performance much like prompt learning. Extensive experiments on 7 STS benchmarks show that student models trained with the proposed DisCo and CKD suffer little or even no performance loss and consistently outperform corresponding counterparts of the same parameter size. Surprisingly, our 110M student model can even outperform the latest state-of-the-art (SOTA) model, Sentence-T5 (11B), with only 1% of its parameters.
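The following is a minimal sketch of the two-stage recipe described above: first distill the large sentence-embedding teacher into the student on unlabeled text, then fine-tune the student with contrastive learning on labeled pairs. The MSE distillation loss and the in-batch InfoNCE-style contrastive loss are illustrative choices, not necessarily the exact CKD objective used in the paper.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # Stage 1: pull student embeddings toward the (frozen) teacher's embeddings on unlabeled data.
    return F.mse_loss(student_emb, teacher_emb)

def contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    # Stage 2: in-batch contrastive fine-tuning over labeled sentence pairs.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature            # [batch, batch] cosine similarities
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```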
We perform a knowledge distillation (KD) benchmark from task-specific BERT-base teacher models to various student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small. Our experiments involve 12 datasets grouped into two tasks in the Indonesian language: text classification and sequence labeling. We also compare various aspects of distillation, including the use of word embeddings and unlabeled data augmentation. Our experiments show that, despite the rising popularity of Transformer-based models, using BiLSTM and CNN student models provides the best trade-off between performance and computational resources (CPU, RAM, and storage) compared to pruned BERT models. We further propose some quick wins for producing small NLP models through an efficient KD training mechanism involving simple choices of loss functions, word embeddings, and unlabeled data preparation.
This paper describes the submission of the RoyalFlush neural machine translation system for the WMT 2022 translation efficiency task. Unlike the commonly used autoregressive translation system, we adopted a two-stage translation paradigm called Hybrid Regression Translation (HRT) to combine the advantages of autoregressive and non-autoregressive translation. Specifically, HRT first autoregressively generates a discontinuous sequence (e.g., making a prediction every $k$ tokens, $k>1$) and then fills in all previously skipped tokens at once in a non-autoregressive manner. Thus, we can easily trade off translation quality and speed by adjusting $k$. In addition, by integrating other modeling techniques (e.g., sequence-level knowledge distillation and a deep-encoder-shallow-decoder layer allocation strategy) and a mass of engineering effort, HRT improves inference speed by 80% and achieves translation performance equivalent to its same-capacity AT counterpart. Our fastest system reaches 6k+ words/second in the GPU latency setting, estimated to be about 3.1x faster than last year's winner.
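The sketch below illustrates the two-stage HRT decoding described above: an autoregressive pass that emits every k-th token, followed by one non-autoregressive pass that fills in all skipped positions. The hooks are placeholders, not the RoyalFlush system's API.

```python
from typing import Callable, List, Optional

def hrt_decode(src: str,
               ar_next: Callable[[str, List[Optional[str]]], str],   # AR prediction given the partial target
               nar_fill: Callable[[str, List[Optional[str]]], List[str]],  # one-shot NAR fill-in
               length: int,
               k: int = 2) -> List[str]:
    target: List[Optional[str]] = [None] * length
    # Stage 1: autoregressively predict a discontinuous skeleton (every k-th position).
    for pos in range(0, length, k):
        target[pos] = ar_next(src, target)
    # Stage 2: fill all remaining positions at once, non-autoregressively.
    filled = nar_fill(src, target)
    return [tok if tok is not None else filled[i] for i, tok in enumerate(target)]
```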
Benefiting from the strong capability of pre-trained models, research on Chinese word segmentation (CWS) has made great progress in recent years. However, due to their massive computation, large and complex models are unsuitable for industrial use. On the other hand, in low-resource scenarios, prevalent decoding methods such as Conditional Random Fields (CRF) fail to exploit the full information in the training data. This work proposes a fast and accurate CWS framework that incorporates a light-weight model and an upgraded decoding method (PCRF), aimed at industrial low-resource CWS scenarios. First, we distill a Transformer-based student model as the encoder, which not only accelerates inference but also combines open-domain and domain-specific knowledge. Second, the perplexity score evaluated by a language model is fused into the CRF module to better identify word boundaries. Experiments show that our approach obtains relatively high performance on multiple datasets with as little as 14% of the time consumption of the original BERT-based model. Moreover, under low-resource settings we obtain superior results compared with traditional decoding methods.
With the broad reach of the internet and smartphones, the user base of e-commerce platforms keeps growing. Since vernacular-language users are not conversant in English, their preferred mode of browsing is their regional language or a combination of their regional language and English. From our recent study of query data, we noticed that many of the queries we receive are code-mixed, specifically Hinglish, i.e., queries with one or more Hindi words written in English (Latin) script. We propose a transformer-based approach for code-mixed query translation to enable users to search with such queries. We demonstrate the effectiveness of pre-trained encoder-decoder models trained on a large corpus of unlabeled English text for this task. Using generic-domain translation models, we create a pseudo-labeled dataset for training the model on search queries and verify the effectiveness of various data augmentation techniques. Further, to reduce the model's latency, we use knowledge distillation and weight quantization. The effectiveness of the approach has been validated through experimental evaluations and A/B testing. The model is currently live on the Flipkart app and website, serving millions of queries.
Although continually extending an existing NMT model to new domains or languages has attracted intensive interest in recent years, the equally valuable problem of continually improving a given NMT model in its domain by leveraging knowledge from an unlimited number of existing NMT models is not explored yet. To facilitate the study, we propose a formal definition for the problem named knowledge accumulation for NMT (KA-NMT) with corresponding datasets and evaluation metrics and develop a novel method for KA-NMT. We investigate a novel knowledge detection algorithm to identify beneficial knowledge from existing models at token level, and propose to learn from beneficial knowledge and learn against other knowledge simultaneously to improve learning efficiency. To alleviate catastrophic forgetting, we further propose to transfer knowledge from previous to current version of the given model. Extensive experiments show that our proposed method significantly and consistently outperforms representative baselines under homogeneous, heterogeneous, and malicious model settings for different language pairs.
In the past few years, Transformer-based pre-trained language models have achieved astounding success in both industry and academia. However, their large model size and high run-time latency are serious impediments to applying them in practice, especially on mobile phones and Internet of Things (IoT) devices. To compress such models, a considerable body of literature has recently grown up around the theme of knowledge distillation (KD). Nevertheless, how KD works in Transformer-based models is still unclear. We tease apart the components of KD and propose a unified KD framework. Through this framework, systematic and extensive experiments costing over 23,000 GPU hours provide a comprehensive analysis from the perspectives of knowledge types, matching strategies, width-depth trade-offs, initialization, model size, and so on, and yield a relatively significant improvement over the previous state-of-the-art (SOTA) for pre-trained language models. Finally, we provide a best-practice guideline for KD on Transformer-based models.
We present a simple yet effective method for training a named entity recognition (NER) model that operates on business telephone conversation transcripts, which contain noise due to the nature of spoken conversation and the artifacts of automatic speech recognition. We first fine-tune LUKE, a state-of-the-art named entity recognition (NER) model, on a limited number of transcripts, together with weakly labeled data and a small amount of human-annotated data. The model achieves high accuracy while also satisfying the practical constraints for inclusion in a commercial telephony product: real-time performance when deployed on cost-effective CPUs rather than GPUs.
Recently, non-autoregressive (NAR) neural machine translation models have received increasing attention due to their efficient parallel decoding. However, the probabilistic framework of NAR models necessitates conditional independence assumption on target sequences, falling short of characterizing human language data. This drawback results in less informative learning signals for NAR models under conventional MLE training, thereby yielding unsatisfactory accuracy compared to their autoregressive (AR) counterparts. In this paper, we propose a simple and model-agnostic multi-task learning framework to provide more informative learning signals. During the training stage, we introduce a set of sufficiently weak AR decoders that solely rely on the information provided by the NAR decoder to make predictions, forcing the NAR decoder to become stronger or else it will be unable to support its weak AR partners. Experiments on WMT and IWSLT datasets show that our approach can consistently improve accuracy of multiple NAR baselines without adding any additional decoding overhead.
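The sketch below illustrates the general multi-task idea described above: the usual NAR loss is combined with auxiliary losses from deliberately weak decoders that consume only the NAR decoder's hidden states, so gradients push those states to encode more target-side dependency information. The single-layer GRU "weak AR head" and the loss weighting are illustrative assumptions; the paper's exact weak-decoder design may differ.

```python
import torch
import torch.nn as nn

class WeakARHead(nn.Module):
    """A weak left-to-right decoder that only sees the NAR decoder's hidden states."""
    def __init__(self, hidden: int, vocab: int):
        super().__init__()
        self.rnn = nn.GRU(hidden, hidden, num_layers=1, batch_first=True)
        self.proj = nn.Linear(hidden, vocab)

    def forward(self, nar_states: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(nar_states)          # relies solely on NAR decoder states
        return self.proj(out)                  # [batch, tgt_len, vocab] logits

def joint_loss(nar_logits: torch.Tensor, ar_heads, nar_states: torch.Tensor,
               targets: torch.Tensor, pad_id: int = 0, lam: float = 0.5) -> torch.Tensor:
    ce = nn.CrossEntropyLoss(ignore_index=pad_id)
    loss = ce(nar_logits.transpose(1, 2), targets)                     # main NAR loss
    for head in ar_heads:                                              # auxiliary weak-decoder losses
        loss = loss + lam * ce(head(nar_states).transpose(1, 2), targets)
    return loss
```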
We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2. Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo-labeled images. We iterate this process by putting back the student as the teacher. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment to the student so that the student generalizes better than the teacher. This work was conducted at Google.
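Below is a minimal sketch of the iterated self-training loop described above, written over caller-supplied hooks; the concrete model and noise choices (EfficientNet, dropout, stochastic depth, RandAugment) are assumed to live inside the `train` hook.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[object, int]  # (image, label); the image type is left abstract here

def noisy_student(labeled: List[Example],
                  unlabeled: Sequence[object],
                  train: Callable[[List[Example], bool], Callable[[object], int]],
                  rounds: int = 3) -> Callable[[object], int]:
    """train(data, noisy) returns a model that can be called as a labeler."""
    teacher = train(labeled, False)                     # initial teacher, trained without noise
    for _ in range(rounds):
        pseudo = [(x, teacher(x)) for x in unlabeled]   # teacher pseudo-labels the unlabeled set
        # Equal-or-larger student trained on labeled + pseudo-labeled data,
        # with noise enabled inside `train` so it generalizes beyond the teacher.
        student = train(labeled + pseudo, True)
        teacher = student                               # put the student back as the teacher
    return teacher
```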
Language models (LMs) have demonstrated remarkable performance on downstream tasks, using in-context exemplars or human instructions. Recent works have shown that chain-of-thought (CoT) prompting can elicit models to solve complex reasoning tasks, step-by-step. However, the efficacy of prompt-based CoT methods is restricted to very large LMs such as GPT-3 (175B), thus limiting deployability. In this paper, we revisit the fine-tuning approach to enable complex reasoning in smaller LMs, optimized to efficiently perform a specific task. We propose Fine-tune-CoT, a method that leverages the capabilities of very large LMs to generate reasoning samples and teach smaller models via fine-tuning. We evaluate our method on publicly available LMs across a wide range of complex tasks and model sizes. We find that Fine-tune-CoT enables substantial reasoning capability in small models, whereas previous prompt-based baselines exhibit near-random performance. Student models can even outperform the teacher in some tasks while reducing model size requirements by several orders of magnitude. We conduct extensive ablations and sample studies to understand the reasoning capabilities of student models. We also identify several important nuances that have been overlooked in concurrent fine-tuning works on CoT and address them in our analysis.
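The sketch below illustrates the Fine-tune-CoT recipe described above: a very large LM produces chain-of-thought reasoning samples for the training questions, and a smaller student LM is fine-tuned on them. The hook names and the step that keeps only samples whose final answer matches the gold label are assumptions for illustration, not the paper's exact procedure.

```python
from typing import Callable, List, Tuple

def finetune_cot(questions: List[str],
                 answers: List[str],
                 teacher_cot: Callable[[str], Tuple[str, str]],      # returns (rationale, predicted answer)
                 finetune_student: Callable[[List[Tuple[str, str]]], None]) -> None:
    samples: List[Tuple[str, str]] = []
    for q, gold in zip(questions, answers):
        rationale, predicted = teacher_cot(q)            # chain-of-thought from the very large LM
        if predicted == gold:                            # assumed filter: keep correctly answered samples
            samples.append((q, f"{rationale} The answer is {gold}."))
    finetune_student(samples)                            # fine-tune the small LM on (question, CoT) pairs
```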