With the capability of modeling bidirectional contexts, denoising-autoencoding-based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, because it relies on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
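The permutation objective in (1) can be pictured as an attention mask derived from a sampled factorization order: each position may only condition on positions that come earlier in that order. The following is a minimal sketch under that reading, not XLNet's actual implementation (which also relies on two-stream attention and Transformer-XL memory, both omitted here); the function name `permutation_mask` is hypothetical.

```python
import random


def permutation_mask(order):
    """Build an attention mask for one factorization order.

    order[k] is the position predicted at step k.
    mask[i][j] is True when position i may attend to position j,
    i.e. when j comes strictly earlier than i in the order.
    """
    n = len(order)
    rank = {pos: step for step, pos in enumerate(order)}
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]


# Sample one random order over 4 positions (illustrative only).
rng = random.Random(0)
order = list(range(4))
rng.shuffle(order)
mask = permutation_mask(order)
```

With the identity order this reduces to the usual left-to-right causal mask; averaging the loss over many sampled orders is what lets every position see context from both sides.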
Translated by Google Translate
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa.
This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UNILM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UNILM achieves new state-of-the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm.
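The idea of steering one shared Transformer with different self-attention masks can be illustrated with a small sketch. This is an assumption-laden illustration rather than the released UNILM code: the function `unilm_mask`, its arguments, and the two-segment layout (source tokens first, target tokens second) are hypothetical names chosen for the example.

```python
def unilm_mask(n_src, n_tgt, mode):
    """Return an n x n boolean matrix, n = n_src + n_tgt.

    Entry [i][j] is True when token i may attend to token j.
    Tokens 0..n_src-1 form the first (source) segment, the rest the second.
    """
    n = n_src + n_tgt
    if mode == "bidirectional":
        # Every token sees every token.
        return [[True] * n for _ in range(n)]
    if mode == "unidirectional":
        # Left-to-right: a token sees itself and everything before it.
        return [[j <= i for j in range(n)] for i in range(n)]
    if mode == "seq2seq":
        # Source tokens see the whole source only; target tokens see the
        # source plus the target tokens up to and including themselves.
        return [
            [(j < n_src) if i < n_src else (j <= i) for j in range(n)]
            for i in range(n)
        ]
    raise ValueError(f"unknown mode: {mode}")


seq2seq = unilm_mask(2, 2, "seq2seq")
```

Only the mask changes between the three objectives; the network weights are shared.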
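In recent years, great progress has been made in applying pre-trained language models (e.g., BERT) to information retrieval (IR) tasks. Hyperlinks, which are commonly used in web pages, have been leveraged for designing pre-training objectives. For example, the anchor texts of hyperlinks have been used to simulate queries, thereby constructing a very large number of query-document pairs for pre-training. However, as a bridge spanning two web pages, the potential of hyperlinks has not been fully explored. In this work, we focus on modeling the relationship between two documents connected by hyperlinks and design a new pre-training objective for ad-hoc retrieval. Specifically, we categorize the relationships between documents into four groups: no link, unidirectional link, symmetric link, and the most relevant symmetric link. By comparing two documents sampled from adjacent groups, the model can gradually improve its capability of capturing matching signals. We propose a progressive hyperlink prediction (PHP) framework to explore the utilization of hyperlinks in pre-training. Experimental results on two large-scale ad-hoc retrieval datasets and six question-answering datasets demonstrate its superiority over existing pre-training methods.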
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE. Our code and pre-trained models are available at https://github.com/facebookresearch/SpanBERT.
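Step (1), masking contiguous spans instead of isolated tokens, can be sketched as follows. This is an illustration only: the function `mask_spans`, the 15% budget default, and the uniform span-length choice are assumptions made for the example and are not taken from the abstract; the span boundary objective in (2) is not shown.

```python
import random


def mask_spans(tokens, mask_ratio=0.15, max_span_len=3, seed=0):
    """Mask contiguous spans until roughly mask_ratio of the tokens are masked.

    Assumes a non-empty token list. Span lengths are drawn uniformly from
    1..max_span_len here, which is a simplification chosen for the sketch.
    Returns the masked sequence and the sorted masked positions.
    """
    rng = random.Random(seed)
    n = len(tokens)
    budget = max(1, round(n * mask_ratio))
    masked = set()
    while len(masked) < budget:
        # Never exceed the remaining budget with a single span.
        length = min(rng.randint(1, max_span_len), budget - len(masked))
        start = rng.randint(0, n - length)
        masked.update(range(start, start + length))
    out = ["[MASK]" if i in masked else t for i, t in enumerate(tokens)]
    return out, sorted(masked)


tokens = list("abcdefghijklmnopqrst")
masked_tokens, positions = mask_spans(tokens)
```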
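Autoregressive (AR) and non-autoregressive (NAR) models have their own advantages in performance and latency, respectively, and combining them in one model may exploit both. Current combination frameworks focus more on integrating multiple decoding paradigms with a unified generative model, e.g., a masked language model. However, this generalization can be harmful to performance because of the gap between the training objective and inference. In this paper, we aim to narrow the gap by preserving the original objectives of AR and NAR under a unified framework. Specifically, we propose the Directional Transformer (Diformer), which jointly models AR and NAR in three generation directions (left-to-right, right-to-left, and straight) with a newly introduced direction variable that controls the prediction of each token so that it has the specific dependencies of that direction. The unification achieved through direction successfully preserves the original dependency assumptions used in AR and NAR, retaining both generalization and performance. Experiments on 4 WMT benchmarks show that Diformer outperforms current joint-modeling work by more than 1.5 BLEU points for both AR and NAR decoding, and is also competitive with state-of-the-art independent AR and NAR models.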
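Bidirectional Encoder Representations from Transformers (BERT) has shown remarkable improvements across various NLP tasks, and successive variants have been proposed to further improve the performance of pre-trained language models. In this paper, we first introduce the whole word masking (WWM) strategy for Chinese BERT, along with a series of Chinese pre-trained language models. We then propose a simple but effective model called MacBERT, which improves upon RoBERTa in several ways. In particular, we propose a new masking strategy called MLM as correction (Mac). To demonstrate the effectiveness of these models, we create a series of Chinese pre-trained language models as our baselines, including BERT, RoBERTa, ELECTRA, RBT, etc. We carry out extensive experiments on ten Chinese NLP tasks to evaluate the created Chinese pre-trained language models as well as the proposed MacBERT. Experimental results show that MacBERT can achieve state-of-the-art performance on many NLP tasks, and we also ablate details with several findings that may help future research. We open-source our pre-trained language models to further facilitate our research community. Resources are available at: https://github.com/ymcui/chinese-bert-wwm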
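The quadratic computational and memory complexity of the Transformer's attention mechanism limits its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function. Compared with a more traditional attention mechanism, Luna introduces an additional sequence with a fixed length as input and an additional corresponding output, which allows Luna to perform the attention operation linearly while also storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation, and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate the effectiveness and efficiency of Luna compared with a variety of …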
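The pack-then-unpack idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Luna implementation: learned projections, multiple heads, and normalization are omitted, the context serves as both keys and values, and the names `attend` and `luna_like_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Softmax attention of each query over the context rows.

    The context is used as both keys and values (no learned projections).
    """
    d = len(context[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(d) for c in context]
        w = softmax(scores)
        out.append([sum(wi * c[k] for wi, c in zip(w, context)) for k in range(d)])
    return out


def luna_like_attention(x, p):
    """x: n x d input sequence, p: l x d extra fixed-length sequence."""
    packed = attend(p, x)         # l x d, cost proportional to l * n
    unpacked = attend(x, packed)  # n x d, cost proportional to n * l
    return unpacked, packed


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p = [[0.5, 0.5]]
y, packed = luna_like_attention(x, p)
```

Because the length l of the extra sequence is fixed, both steps scale linearly in the input length n, in contrast to the n x n score matrix of full attention.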
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
Transformer-based models, such as BERT, have been among the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose BIGBIRD, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BIGBIRD is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS) that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BIGBIRD drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
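A sparse mask with a constant number of global tokens can be sketched as below. This is an assumption-based illustration, not BIGBIRD's implementation: it combines a sliding local window with leading global tokens and leaves out the random attention component; the function `sparse_mask` and its parameters are hypothetical.

```python
def sparse_mask(n, window=1, n_global=1):
    """mask[i][j] is True when token i may attend to token j.

    A pair is allowed if the tokens are within `window` positions of each
    other, or if either one is among the first `n_global` (global) tokens.
    """
    return [
        [abs(i - j) <= window or i < n_global or j < n_global for j in range(n)]
        for i in range(n)
    ]


def allowed_pairs(mask):
    """Count the attended (query, key) pairs in a mask."""
    return sum(sum(row) for row in mask)


small = sparse_mask(6)
# The number of allowed pairs grows roughly linearly with n, not with n * n.
ratio = allowed_pairs(sparse_mask(200)) / allowed_pairs(sparse_mask(100))
```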
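Context-aware scene text recognition (STR) methods typically use internal autoregressive (AR) language models (LMs). Inherent limitations of AR models motivated two-stage methods that employ an external LM. The conditional independence of the external LM from the input image may cause it to erroneously rectify correct predictions, leading to significant inefficiencies. Our method, PARSeq, learns an ensemble of internal AR LMs with shared weights using permutation language modeling. It unifies context-free non-AR and context-aware AR inference, as well as iterative refinement using bidirectional context. Using synthetic training data, PARSeq achieves state-of-the-art (SOTA) results on STR benchmarks (91.9% accuracy) and on more challenging datasets. It establishes new SOTA results (96.0% accuracy) when trained on real data. PARSeq is optimal on accuracy versus parameter count, FLOPS, and latency because of its simple, unified structure and parallel token processing. Due to its extensive use of attention, it is robust to arbitrarily oriented text, which is common in real-world images. Code, pretrained weights, and data are available at: https://github.com/baudm/parseq.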
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
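Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key to such results, as is the case for tasks such as question answering and fact checking, massive parameter counts seem to be needed to store that knowledge. Retrieval-augmented models are known to excel at knowledge-intensive tasks without needing as many parameters, but it is unclear whether they work in few-shot settings. In this work we present Atlas, a carefully designed and pre-trained retrieval-augmented language model able to learn knowledge-intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT, and NaturalQuestions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B-parameter model despite having 50x fewer parameters.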
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
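How the training signal for replaced token detection is built can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the `proposals` list stands in for samples from the small generator network, and the function `corrupt`, the `replace_prob` default, and the example sentence are hypothetical.

```python
import random


def corrupt(tokens, proposals, replace_prob=0.15, seed=0):
    """Build a corrupted input and per-token detection labels.

    proposals[i] is a stand-in for a generator sample at position i.
    Each position is swapped to its proposal with probability replace_prob.
    labels[i] is 1 if the token now differs from the original, else 0, so a
    proposal that happens to equal the original token counts as original.
    """
    rng = random.Random(seed)
    corrupted = [
        p if rng.random() < replace_prob else t for t, p in zip(tokens, proposals)
    ]
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels


tokens = ["the", "chef", "cooked", "the", "meal"]
proposals = ["the", "chef", "ate", "the", "meal"]
corrupted, labels = corrupt(tokens, proposals, replace_prob=1.0)
```

Every position receives a label, which is the sense in which the task is defined over all input tokens rather than only the masked subset.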
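Pre-trained models with knowledge-enhanced language representations have been shown to be more effective in knowledge base construction tasks (i.e., relation extraction) than language models such as BERT. These knowledge-enhanced language models incorporate knowledge into pre-training to generate representations of entities or relationships. However, existing methods typically represent each entity with a separate embedding. As a result, these methods struggle to represent out-of-vocabulary entities, a large number of parameters must be used on top of their underlying token models (i.e., the Transformer), and the number of entities that can be handled is limited in practice due to memory constraints. Moreover, existing models still struggle to represent entities and relationships simultaneously. To address these problems, we propose a new pre-trained model that learns representations of both entities and relationships from token spans and span pairs in the text, respectively. By encoding spans efficiently with span modules, our model can represent both entities and their relationships while requiring fewer parameters than existing models. We pre-trained our model with a knowledge graph extracted from Wikipedia and tested it on a broad range of supervised and unsupervised information extraction tasks. Results show that our model learns better representations for both entities and relationships than the baselines, while in supervised settings, fine-tuning our model consistently outperforms RoBERTa and achieves competitive results on information extraction tasks.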
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
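Fitting complex patterns in the training data, such as reasoning and argumentation, is a key challenge for language pre-training. According to recent studies and our empirical observations, one possible reason is that some easy-to-fit patterns in the training data, such as frequently co-occurring word combinations, dominate and harm pre-training, making it hard for the model to fit more complex information. We argue that mis-predictions can help locate such dominating patterns that harm language understanding. When a mis-prediction occurs, there should be patterns frequently co-occurring with the mis-predicted word that the model has fitted and that lead to the mis-prediction. If we can add regularization that trains the model to rely less on such dominating patterns when a mis-prediction occurs and to focus more on the remaining, more subtle patterns, more information can be fitted efficiently during pre-training. Following this motivation, we propose a new language pre-training method, Mis-Predictions as Harm Alerts (MPA). In MPA, when a mis-prediction occurs during pre-training, we use its co-occurrence information to guide several heads of the self-attention modules. Some self-attention heads in the Transformer modules are optimized to assign lower attention weights to the words in the input sentence that frequently co-occur with the mis-prediction, while assigning higher weights to the other words. By doing so, the Transformer model is trained to rely less on the dominating, frequently co-occurring patterns and to focus more on the remaining, more complex information when a mis-prediction occurs. Our experiments show that MPA speeds up the pre-training of BERT and ELECTRA and improves their performance on downstream tasks.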
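Transformer models are permutation equivariant. To supply the order and type information of the input tokens, position and segment embeddings are usually added to the input. Recent works have proposed variations of positional encodings, with relative position encodings achieving better performance. Our analysis shows that the gain actually comes from moving positional information from the input to the attention layer. Motivated by this, we introduce Decoupled Positional Attention for Transformers (DIET), a simple yet effective mechanism to encode position and segment information into Transformer models. The proposed method has faster training and inference time, while achieving competitive performance on the GLUE, XTREME, and WMT benchmarks. We further generalize our method to long-range Transformers and show performance gains.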
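Many NLP tasks require processing long contexts beyond the length limit of pretrained models. To scale these models to longer text sequences, many efficient long-range attention variants have been proposed. Despite the abundance of research in this direction, it is still difficult to gauge the relative effectiveness of these models in practical use cases, e.g., when they are applied following the pretrain-and-finetune paradigm. In this work, we aim to conduct a thorough analysis of these emerging models with large-scale and controlled experiments. For each attention variant, we pretrain models using the same long-doc corpus and then finetune them for real-world long-context tasks. Our findings reveal pitfalls of an existing, widely used long-range benchmark and show that none of the tested efficient attentions can beat a simple local window attention under standard pretraining paradigms. Further analysis of local attention variants suggests that even the commonly used attention-window overlap is not necessary to achieve good downstream results: using disjoint local attentions, we are able to build a simpler and more efficient long-doc QA model that matches the performance of Longformer with half of its pretraining compute.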