预训练会产生对各种下游任务有效的表示,但是目前尚不清楚预训练的有效收益必不可少的特性。值得注意的是,最近的工作表明,即使对合成任务进行预训练也可以在下游任务中取得显着增长。在这项工作中,我们进行了三个实验,可以迭代地简化预训练,并表明简化仍然保留了其大部分收益。首先,在先前的工作中,我们对六个下游任务的三种现有合成预训练方法进行系统评估。我们发现最好的合成预训练方法是石灰,平均获得了自然预训练的收益的67美元\%$。其次,令我们惊讶的是,我们发现由设定功能定义的简单且通用的合成任务进行预培训可实现$ 65 \%的好处,几乎是匹配的石灰。第三,我们发现仅使用合成预培训的参数统计数据可以实现$ 39 \%的利益。我们在https://github.com/felixzli/synthetic_pretraining上发布源代码。
translated by 谷歌翻译
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
translated by 谷歌翻译
Pre-training is an effective technique for ensuring robust performance on a variety of machine learning tasks. It typically depends on large-scale crawled corpora that can result in toxic or biased models. Such data can also be problematic with respect to copyright, attribution, and privacy. Pre-training with synthetic tasks and data is a promising way of alleviating such concerns since no real-world information is ingested by the model. Our goal in this paper is to understand what makes for a good pre-trained model when using synthetic resources. We answer this question in the context of neural machine translation by considering two novel approaches to translation model pre-training. Our first approach studies the effect of pre-training on obfuscated data derived from a parallel corpus by mapping words to a vocabulary of 'nonsense' tokens. Our second approach explores the effect of pre-training on procedurally generated synthetic parallel data that does not depend on any real human language corpus. Our empirical evaluation on multiple language pairs shows that, to a surprising degree, the benefits of pre-training can be realized even with obfuscated or purely synthetic parallel data. In our analysis, we consider the extent to which obfuscated and synthetic pre-training techniques can be used to mitigate the issue of hallucinated model toxicity.
translated by 谷歌翻译
最近的工作表明,(1)增加输入长度或(2)增加模型大小可以提高基于变压器的神经模型的性能。在本文中,我们提出了一个名为Longt5的新模型,我们探讨了同时缩放输入长度和模型大小的效果。具体而言,我们综合了从长输入变压器(ETC)的关注思路,并采用了从摘要预训练(PEGASU)的预训练策略进入可扩展的T5架构。结果是我们称之为{\ EM瞬态全球}(TGLOBAL)的新关注机制,这些机制是模仿等本地/全球注意力机制,但不需要额外的侧面输入。我们能够实现最先进的结果,以若干摘要任务,优于问题应答任务的原始T5模型。
translated by 谷歌翻译
In this work, we explore "prompt tuning," a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method "closes the gap" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant because large models are costly to share and serve and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed "prefix tuning" of Li and Liang (2021) and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer and enables efficient "prompt ensembling." * Work done as a Google AI Resident.
translated by 谷歌翻译
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a;Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
translated by 谷歌翻译
源代码的预训练的生成语言模型(例如PLBART,CODET5,SPT-CODE)在过去几年中对多个任务(包括代码生成和翻译)产生了强劲的结果。这些模型采用了不同的训练前目标,以自我监督的方式从非常大规模的语料库中学习代码构建的统计数据。预训练模型的成功很大程度上取决于这些预训练的目标。本文提出了一个新的预训练目标,即“归化”源代码,利用代码的双峰,双通道(正式和自然渠道)性质。与自然语言不同,代码的双峰,双通道的性质使我们能够大规模生成语义上等效的代码。我们介绍了六类的语义保存转换,以引入非自然的代码形式,然后强迫我们的模型制作开发人员编写的更自然的原创程序。学习在没有明确的手动监督的情况下,通过大型的开源代码来生成等效但更自然的代码,有助于模型学习摄入和生成代码。我们将模型在三个生成软件工程任务中微调:代码生成,代码翻译和代码改进,具有有限的人类策划标记数据并实现最先进的性能与CODET5。我们表明,我们的预训练模型在零射门和少数学习方面特别有竞争力,并且在学习代码属性(例如语法,数据流)方面更好。
translated by 谷歌翻译
在这项工作中,我们证明了多种语的大规模序列到序列(SEQ2SEQ)模型,该模型是通过Denoising和因果语言建模(CLM)任务的混合物进行训练的,比仅解码器模型更有效地进行了效率的学习者在各种任务上。特别是,我们培训了一个名为Alexa教师模型(Alexatm 20b)的200亿个参数多语言SEQ2SEQ模型,并表明它在1-Shot摘要任务上实现了最先进的(SOTA)性能,超过了更大的540B PALM DOPODER模型。 Alexatm 20b还可以在1-Shot Machine翻译中实现SOTA,尤其是对于低资源语言,几乎所有语言对(阿拉伯语,英语,法语,德语,德语,印地语,意大利语,日语,以及flores-101数据集上的泰卢固语)。我们还显示了零拍设置,AlexATM 20B在SuperGlue和SqueadV2数据集上的表现优于GPT3(175B),并在XNLI,XCOPA,PAWS-X和XWINOGRAD等多语言任务上提供SOTA性能。总体而言,我们的结果为SEQ2SEQ模型提供了一个令人信服的案例,作为大型语言模型(LLM)培训的仅解码器模型的强大替代方法。
translated by 谷歌翻译
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. Span-BERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT large , our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE. 1 * Equal contribution. 1 Our code and pre-trained models are available at https://github.com/facebookresearch/ SpanBERT.
translated by 谷歌翻译
本文介绍了Z-Code ++,这是一种针对抽象文本摘要优化的新的预训练的语言模型。该模型使用三种技术扩展了艺术编码器模型的状态。首先,我们使用两阶段的预训练过程来改善模型在低资源摘要任务上的性能。该模型首先是使用文本语料库进行语言理解的预先培训的,然后在汇总语料库中不断预先培训,以进行基础文本生成。其次,我们用分离的注意力层代替编码器中的自我发项层,其中每个单词都使用两个向量分别代表其内容和位置。第三,我们使用融合编码器,这是一种以层次方式编码长序列的简单而有效的方法。 Z-Code ++在13个文本摘要任务中的9个跨5种语言中创建了新的艺术状态。我们的模型的参数有效,因为它的表现优于XSUM上600倍较大的Palm-540b,并且在Samsum上的易经的200倍GPT3-175B较大。在零射击和少量设置中,我们的模型大大优于竞争模型。
translated by 谷歌翻译
我们提出了一项合成任务,乐高(学习平等和小组操作),该任务封装了遵循推理链的问题,我们研究了变压器体系结构如何学习这项任务。我们特别注意数据效应,例如预处理(看似无关的NLP任务)和数据集组成(例如,训练和测试时间时的链长度不同),以及体系结构变体,例如重量绑定层或添加卷积组件。我们研究了受过训练的模型最终如何在任务中取得成功,尤其是我们能够在某种程度上(一定程度地)理解一些注意力头以及网络中的信息如何流动。基于这些观察结果,我们提出了一个假设,即在这里进行预训练仅是因为是智能初始化而不是网络中存储的深层知识。我们还观察到,在某些数据制度中,受过训练的变压器发现“快捷方式”解决方案遵循推理链,这阻碍了该模型将其推广到主要任务的简单变体的能力,而且我们发现人们可以防止适当的快捷方式架构修改或仔细的数据准备。在我们的发现的激励下,我们开始探索学习执行C程序的任务,在此过程中,对变压器进行了卷积修改,即在密钥/查询/值图中添加卷积结构,显示出令人鼓舞的优势。
translated by 谷歌翻译
我们提出了一项实证研究,以适应现有的经过验证的文本对文本模型,以备长期输入。通过沿预训练管道的三个轴的全面研究 - 模型架构,优化目标和训练式语料库,我们提出了一种有效的食谱,以从现有的短篇小说模型中构建长篇小说模型。具体而言,我们用汇总仪的块关注替换了变压器中的全部注意力,并使用蒙版的跨度预测任务为模型预算,长度不同。就训练训练的语料库而言,我们发现,与使用通常在其域覆盖范围中通常受到限制的现有长文档语料库相比,使用大型开放域语料库的随机串联的短篇小说可以提高性能。通过这些发现,我们建立了一个长篇文本模型,该模型可以在长篇文本质量检查任务上实现竞争性能,并在五个长文本摘要数据集上建立新的最新技术,通常优于先前的方法,具有较大的模型大小。
translated by 谷歌翻译
大型语言模型在各种任务上显示出令人印象深刻的几次结果。但是,当知识是此类结果的关键时,就像问题回答和事实检查之类的任务一样,似乎需要存储知识的大量参数计数。众所周知,检索增强模型可以在不需要多个参数的情况下在知识密集的任务上表现出色,但是目前尚不清楚它们是否在几个弹药设置中工作。在这项工作中,我们介绍了地图集,这是一个经过精心设计和预先训练的增强语言模型,能够通过很少的培训示例学习知识密集型任务。我们对包括MMLU,苏格兰短裙和归类等各种任务进行评估,并研究文档索引内容的影响,表明它可以很容易地进行更新。值得注意的是,在自然问题上仅使用64个示例在自然问题上达到超过42 \%的准确性,尽管参数少了50倍,但比540B参数模型的表现优于540b参数模型。
translated by 谷歌翻译
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
translated by 谷歌翻译
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understand (NLU) and natural langauge generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transform layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa 1 .
translated by 谷歌翻译
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by ( 1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new stateof-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
translated by 谷歌翻译
预审前的语言模型已被证明在许多与软件有关的一代任务中都是有效的。但是,它们不适合编辑任务,因为它们不是为了推理编辑的原因。为了解决这个问题,我们提出了一个新颖的预处理目标,该目标明确地对编辑进行了建模并使用它来构建Coditt5,这是一种用于软件相关编辑任务的大型语言模型,该任务是在大量源代码和自然语言评论中鉴定的。我们将其对各种下游编辑任务进行微调,包括评论更新,错误修复和自动代码审核。通过优于基于纯生成的模型,我们证明了方法的普遍性及其对编辑任务的适用性。我们还展示了纯生成模型和我们的基于编辑的模型如何通过简单的重读策略相互补充,我们可以通过该策略实现三个下游编辑任务的最新性能。
translated by 谷歌翻译
Neural models that do not rely on pre-training have excelled in the keyphrase generation task with large annotated datasets. Meanwhile, new approaches have incorporated pre-trained language models (PLMs) for their data efficiency. However, there lacks a systematic study of how the two types of approaches compare and how different design choices can affect the performance of PLM-based models. To fill in this knowledge gap and facilitate a more informed use of PLMs for keyphrase extraction and keyphrase generation, we present an in-depth empirical study. Formulating keyphrase extraction as sequence labeling and keyphrase generation as sequence-to-sequence generation, we perform extensive experiments in three domains. After showing that PLMs have competitive high-resource performance and state-of-the-art low-resource performance, we investigate important design choices including in-domain PLMs, PLMs with different pre-training objectives, using PLMs with a parameter budget, and different formulations for present keyphrases. Further results show that (1) in-domain BERT-like PLMs can be used to build strong and data-efficient keyphrase generation models; (2) with a fixed parameter budget, prioritizing model depth over width and allocating more layers in the encoder leads to better encoder-decoder models; and (3) introducing four in-domain PLMs, we achieve a competitive performance in the news domain and the state-of-the-art performance in the scientific domain.
translated by 谷歌翻译
问题答案(QA)是自然语言处理中最具挑战性的最具挑战性的问题之一(NLP)。问答(QA)系统试图为给定问题产生答案。这些答案可以从非结构化或结构化文本生成。因此,QA被认为是可以用于评估文本了解系统的重要研究区域。大量的QA研究致力于英语语言,调查最先进的技术和实现最先进的结果。然而,由于阿拉伯QA中的研究努力和缺乏大型基准数据集,在阿拉伯语问答进展中的研究努力得到了很大速度的速度。最近许多预先接受的语言模型在许多阿拉伯语NLP问题中提供了高性能。在这项工作中,我们使用四个阅读理解数据集来评估阿拉伯QA的最先进的接种变压器模型,它是阿拉伯语 - 队,ArcD,AQAD和TYDIQA-GoldP数据集。我们微调并比较了Arabertv2基础模型,ArabertV0.2大型型号和ARAElectra模型的性能。在最后,我们提供了一个分析,了解和解释某些型号获得的低绩效结果。
translated by 谷歌翻译
大型审慎的语言模型最近征服了自然语言处理领域。作为BERT中引入的主要掩盖语言建模的替代方案,T5模型引入了更通用的训练目标,即序列转换的顺序,其中包括蒙版语言模型,但自然地适合文本生成任务,例如机器翻译,摘要,开放 - 开放 - 域问题回答,文本简化,对话系统等。T5模型的单语变体仅限于资源良好的语言,而大量的多语言T5模型则支持101种语言。相比之下,我们训练了两个不同尺寸的T5型序列,以使用较少的资源并分析其行为的形态丰富的斯洛文尼语的序列模型。关于分类任务,SLOT5模型主要落后于单语Slovene Sloberta模型,但应考虑生成任务。
translated by 谷歌翻译