We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by ( 1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new stateof-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
translated by 谷歌翻译
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a;Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
translated by 谷歌翻译
This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UNILM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UNILM achieves new state-ofthe-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm. * Equal contribution. † Contact person.
translated by 谷歌翻译
Pre-trained models have achieved remarkable success in natural language processing (NLP). However, existing pre-training methods underutilize the benefits of language understanding for generation. Inspired by the idea of Generative Adversarial Networks (GANs), we propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator, unifying the ability of language understanding and generation in a single model. Our model, named as GanLM, is trained with two pre-training objectives: replaced token detection and replaced token denoising. Specifically, given masked source sentences, the generator outputs the target distribution and the discriminator predicts whether the target sampled tokens from distribution are incorrect. The target sentence is replaced with misclassified tokens to construct noisy previous context, which is used to generate the gold sentence. In general, both tasks improve the ability of language understanding and generation by selectively using the denoising data. Extensive experiments in language generation benchmarks show that GanLM with the powerful language understanding capability outperforms various strong pre-trained language models (PLMs) and achieves state-of-the-art performance.
translated by 谷歌翻译
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
translated by 谷歌翻译
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. Span-BERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT large , our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE. 1 * Equal contribution. 1 Our code and pre-trained models are available at https://github.com/facebookresearch/ SpanBERT.
translated by 谷歌翻译
在本文中,我们利用了以前的预训练模型(PTM)的优势,并提出了一种新型的中国预训练的不平衡变压器(CPT)。与以前的中国PTM不同,CPT旨在利用自然语言理解(NLU)和自然语言生成(NLG)之间的共同知识来促进表现。 CPT包括三个部分:共享编码器,一个理解解码器和一代解码器。具有共享编码器的两个特定解码器分别通过蒙版语言建模(MLM)进行了预训练,并分别将自动编码(DAE)任务进行了验证。借助部分共享的体系结构和多任务预培训,CPT可以(1)使用两个解码器学习NLU或NLG任务的特定知识,并且(2)对模型的潜力充分利用了微调。此外,不平衡的变压器节省了计算和存储成本,这使CPT竞争激烈,并极大地加速了文本生成的推断。对各种中国NLU和NLG任务的实验结果显示了CPT的有效性。
translated by 谷歌翻译
在这项工作中,我们探索如何学习专用的语言模型,旨在学习从文本文件中学习关键词的丰富表示。我们在判别和生成设置中进行预训练变压器语言模型(LMS)的不同掩蔽策略。在歧视性设定中,我们引入了一种新的预训练目标 - 关键边界,用替换(kbir)infifiling,在使用Kbir预先训练的LM进行微调时显示出在Sota上的性能(F1中高达9.26点)的大量增益关键酶提取的任务。在生成设置中,我们为BART - 键盘介绍了一个新的预训练设置,可再现与CATSeq格式中的输入文本相关的关键字,而不是Denoised原始输入。这也导致在关键词中的性能(F1 @ M)中的性能(高达4.33点),用于关键正版生成。此外,我们还微调了在命名实体识别(ner),问题应答(qa),关系提取(重新),抽象摘要和达到与SOTA的可比性表现的预训练的语言模型,表明学习丰富的代表关键词确实有利于许多其他基本的NLP任务。
translated by 谷歌翻译
Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2019) represents the latest incarnation of pretrained language models which have recently advanced a wide range of natural language processing tasks. In this paper, we showcase how BERT can be usefully applied in text summarization and propose a general framework for both extractive and abstractive models. We introduce a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences. Our extractive model is built on top of this encoder by stacking several intersentence Transformer layers. For abstractive summarization, we propose a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two (the former is pretrained while the latter is not). We also demonstrate that a two-staged fine-tuning approach can further boost the quality of the generated summaries. Experiments on three datasets show that our model achieves stateof-the-art results across the board in both extractive and abstractive settings. 1
translated by 谷歌翻译
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit nonparametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
translated by 谷歌翻译
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART -a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective . mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show it also enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
translated by 谷歌翻译
我们提出了一项实证研究,以适应现有的经过验证的文本对文本模型,以备长期输入。通过沿预训练管道的三个轴的全面研究 - 模型架构,优化目标和训练式语料库,我们提出了一种有效的食谱,以从现有的短篇小说模型中构建长篇小说模型。具体而言,我们用汇总仪的块关注替换了变压器中的全部注意力,并使用蒙版的跨度预测任务为模型预算,长度不同。就训练训练的语料库而言,我们发现,与使用通常在其域覆盖范围中通常受到限制的现有长文档语料库相比,使用大型开放域语料库的随机串联的短篇小说可以提高性能。通过这些发现,我们建立了一个长篇文本模型,该模型可以在长篇文本质量检查任务上实现竞争性能,并在五个长文本摘要数据集上建立新的最新技术,通常优于先前的方法,具有较大的模型大小。
translated by 谷歌翻译
最近的工作表明,(1)增加输入长度或(2)增加模型大小可以提高基于变压器的神经模型的性能。在本文中,我们提出了一个名为Longt5的新模型,我们探讨了同时缩放输入长度和模型大小的效果。具体而言,我们综合了从长输入变压器(ETC)的关注思路,并采用了从摘要预训练(PEGASU)的预训练策略进入可扩展的T5架构。结果是我们称之为{\ EM瞬态全球}(TGLOBAL)的新关注机制,这些机制是模仿等本地/全球注意力机制,但不需要额外的侧面输入。我们能够实现最先进的结果,以若干摘要任务,优于问题应答任务的原始T5模型。
translated by 谷歌翻译
Sequence-to-sequence (seq2seq) learning is a popular fashion for large-scale pretraining language models. However, the prior seq2seq pretraining models generally focus on reconstructive objectives on the decoder side and neglect the effect of encoder-side supervision, which we argue may lead to sub-optimal performance. To verify our hypothesis, we first empirically study the functionalities of the encoder and decoder in seq2seq pretrained language models, and find that the encoder takes an important but under-exploitation role than the decoder regarding the downstream performance and neuron activation. Therefore, we propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2, which improves the seq2seq models via integrating more efficient self-supervised information into the encoders. Specifically, E2S2 adopts two self-supervised objectives on the encoder side from two aspects: 1) locally denoising the corrupted sentence (denoising objective); and 2) globally learning better sentence representations (contrastive objective). With the help of both objectives, the encoder can effectively distinguish the noise tokens and capture high-level (i.e. syntactic and semantic) knowledge, thus strengthening the ability of seq2seq model to accurately achieve the conditional generation. On a large diversity of downstream natural language understanding and generation tasks, E2S2 dominantly improves the performance of its powerful backbone models, e.g. BART and T5. For example, upon BART backbone, we achieve +1.1% averaged gain on the general language understanding evaluation (GLUE) benchmark and +1.75% F_0.5 score improvement on CoNLL2014 dataset. We also provide in-depth analyses to show the improvement stems from better linguistic representation. We hope that our work will foster future self-supervision research on seq2seq language model pretraining.
translated by 谷歌翻译
最近在单语数据和机器翻译(MT)进行微调的预培训方面取得了成功,但尚不清楚如何最好地利用预先训练的模型来完成给定的MT任务。本文在微调MT上的预训练模型时研究了冻结参数的好处和缺点。我们专注于1)微调仅在英语单语言数据的BART上训练的模型。2)微调一个模型,该模型对25种语言的单语言数据进行了培训,Mbart。对于Bart,我们通过冻结大多数模型参数并添加额外的位置嵌入来获得最佳性能。对于MBART,我们将大多数语言对的天真微调的性能与编码器以及大多数解码器搭配。编码器的注意参数对于微调最重要。当将自己限制为越南人对英语的室外训练套装时,我们看到了基线的最大进步。
translated by 谷歌翻译
ROUGE is a standard automatic evaluation metric based on n-grams for sequence-to-sequence tasks, while cross-entropy loss is an essential objective of neural network language model that optimizes at a unigram level. We present differentiable n-gram objectives, attempting to alleviate the discrepancy between training criterion and evaluating criterion. The objective maximizes the probabilistic weight of matched sub-sequences, and the novelty of our work is the objective weights the matched sub-sequences equally and does not ceil the number of matched sub-sequences by the ground truth count of n-grams in reference sequence. We jointly optimize cross-entropy loss and the proposed objective, providing decent ROUGE score enhancement over abstractive summarization dataset CNN/DM and XSum, outperforming alternative n-gram objectives.
translated by 谷歌翻译
本文介绍了Z-Code ++,这是一种针对抽象文本摘要优化的新的预训练的语言模型。该模型使用三种技术扩展了艺术编码器模型的状态。首先,我们使用两阶段的预训练过程来改善模型在低资源摘要任务上的性能。该模型首先是使用文本语料库进行语言理解的预先培训的,然后在汇总语料库中不断预先培训,以进行基础文本生成。其次,我们用分离的注意力层代替编码器中的自我发项层,其中每个单词都使用两个向量分别代表其内容和位置。第三,我们使用融合编码器,这是一种以层次方式编码长序列的简单而有效的方法。 Z-Code ++在13个文本摘要任务中的9个跨5种语言中创建了新的艺术状态。我们的模型的参数有效,因为它的表现优于XSUM上600倍较大的Palm-540b,并且在Samsum上的易经的200倍GPT3-175B较大。在零射击和少量设置中,我们的模型大大优于竞争模型。
translated by 谷歌翻译
查询聚焦的文本摘要(QFTS)任务旨在构建基于给定查询的文本文档摘要的构建系统。解决此任务的关键挑战是缺乏培训摘要模型的大量标记数据。在本文中,我们通过探索一系列域适应技术来解决这一挑战。鉴于最近在广泛的自然语言处理任务中进行预先接受的变压器模型的成功,我们利用此类模型为单文档和多文件方案的QFTS任务产生抽象摘要。对于域适应,我们使用预先训练的变压器的摘要模型应用了各种技术,包括转移学习,弱监督学习和远程监督。六个数据集的广泛实验表明,我们所提出的方法非常有效地为QFTS任务产生抽象摘要,同时在一组自动和人类评估指标上设置新的最先进的结果。
translated by 谷歌翻译
在这项工作中,我们证明了多种语的大规模序列到序列(SEQ2SEQ)模型,该模型是通过Denoising和因果语言建模(CLM)任务的混合物进行训练的,比仅解码器模型更有效地进行了效率的学习者在各种任务上。特别是,我们培训了一个名为Alexa教师模型(Alexatm 20b)的200亿个参数多语言SEQ2SEQ模型,并表明它在1-Shot摘要任务上实现了最先进的(SOTA)性能,超过了更大的540B PALM DOPODER模型。 Alexatm 20b还可以在1-Shot Machine翻译中实现SOTA,尤其是对于低资源语言,几乎所有语言对(阿拉伯语,英语,法语,德语,德语,印地语,意大利语,日语,以及flores-101数据集上的泰卢固语)。我们还显示了零拍设置,AlexATM 20B在SuperGlue和SqueadV2数据集上的表现优于GPT3(175B),并在XNLI,XCOPA,PAWS-X和XWINOGRAD等多语言任务上提供SOTA性能。总体而言,我们的结果为SEQ2SEQ模型提供了一个令人信服的案例,作为大型语言模型(LLM)培训的仅解码器模型的强大替代方法。
translated by 谷歌翻译
以查询为中心的摘要(QFS)旨在产生应答感兴趣的特定问题的摘要,从而实现更大的用户控制和个性化。虽然最近发布的数据集如QMSUM或Aquamuse,促进QFS中的研究工作,但该领域缺乏对适用建模方法的广泛空间的全面研究。在本文中,考虑到两种普遍的方法,我们对QFS进行了系统探索,探讨了QFS:两阶段的采掘解决方案和端到端模型。在这些类别中,我们调查现有方法,并呈现了在QMSUM数据集上实现最先进的性能的两个模型扩展,其边缘高达3.38 Rouge-1,3.72 Rouge-2和3.28 Rouge-L。通过定量实验,我们突出了不同模型配置之间的权衡,并探讨了摘要任务之间的转移能力。代码和检查点公开可用:https://github.com/salesforce/query-focused-sum。
translated by 谷歌翻译