FAIRSEQ is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs. A demo video can be found here: https://www.youtube. com/watch?v=OtgDdWtHvto.
translated by 谷歌翻译
The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks. 1 Compared to recurrent models, computations over all elements can be fully parallelized during training to better exploit the GPU hardware and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.
translated by 谷歌翻译
We describe an open-source toolkit for neural machine translation (NMT). The toolkit prioritizes efficiency, modularity, and extensibility with the goal of supporting NMT research into model architectures, feature representations, and source modalities, while maintaining competitive performance and reasonable training requirements. The toolkit consists of modeling and translation support, as well as detailed pedagogical documentation about the underlying techniques.
translated by 谷歌翻译
Sockeye 3是神经机器翻译(NMT)的Mockeye工具包的最新版本。现在,基于Pytorch,Sockeye 3提供了更快的模型实现和更高级的功能,并具有进一步的简化代码库。这可以通过更快的迭代,对更强大,更快的模型进行有效的培训以及快速从研究转移到生产的新想法的灵活性,从而实现更广泛的实验。当运行可比较的型号时,Sockeye 3的速度比GPU上的其他Pytorch实现快126%,在CPU上的实现速度高达292%。Sockeye 3是根据Apache 2.0许可发布的开源软件。
translated by 谷歌翻译
多语种NMT已成为MT在生产中部署的有吸引力的解决方案。但是要匹配双语质量,它符合较大且较慢的型号。在这项工作中,我们考虑了几种方法在推理时更快地使多语言NMT变得更快而不会降低其质量。我们在两种20语言多平行设置中尝试几个“光解码器”架构:在TED会谈中小规模和帕拉克曲线上的大规模。我们的实验表明,将具有词汇过滤的浅解码器组合在于,在翻译质量下没有损失的速度超过两倍。我们用Bleu和Chrf(380语言对),鲁棒性评估和人类评估验证了我们的研究结果。
translated by 谷歌翻译
诸如变形金刚和LSTMS之类的流行模型将令牌用作其信息单位。也就是说,每个令牌都被编码为向量表示,这些向量直接在计算中使用。但是,人类经常考虑跨令牌(即短语)而不是其组成代币。在本文中,我们介绍了TreeFormer,这是一个受CKY算法和变压器启发的体系结构,该体系结构学习了组成操作员和汇总功能,以构建针对短语和句子的层次编码。我们的广泛实验证明了将层次结构纳入变压器的好处,并且与机器翻译,抽象性摘要和各种自然语言理解任务相比,与基线变压器相比显示出重大改进。
translated by 谷歌翻译
基于变压器的神经模型在许多AI应用中使用。培训这些模型很昂贵,因为它需要大量的GPU资源和较长的持续时间。这是具有挑战性的,因为诸如句子之类的典型数据具有可变的长度,而变压器的计算模式比卷积神经网络更为复杂。现有系统要么仅专注于模型推理,要么仅针对BERT样编码器模型进行优化。在本文中,我们提出了LightSeq2,该系统是为GPU上的一般变压器模型加速培训的系统。我们提出了一系列针对变压器模型的特定计算流量和内存访问模式量身定制的GPU优化技术。 LightSeq2支持许多模型体系结构,包括BERT(仅编码),GPT(仅解码器),变压器(编码器编码器)和视觉变压器。我们对各种模型和基准测试的实验表明,LightSeq2始终比不同GPU上的先前系统更快(1.4-3.5倍)。特别是,与大型公共机器翻译基准(WMT14英语 - 德国人)上的现有系统相比,它获得了308%的培训速度。
translated by 谷歌翻译
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference -sometimes prohibitively so in the case of very large data sets and large models. Several authors have also charged that NMT systems lack robustness, particularly when input sentences contain rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using residual connections as well as attention connections from the decoder network to the encoder. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. To directly optimize the translation BLEU scores, we consider refining the models by using reinforcement learning, but we found that the improvement in the BLEU scores did not reflect in the human evaluation. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
translated by 谷歌翻译
本文概述了NVIDIA Nemo的神经电机翻译系统,用于WMT21新闻和生物医学共享翻译任务的受限数据跟踪。我们的新闻任务提交英语 - 德语(EN-DE)和英语 - 俄语(EN-RU)是基于基于基于基线变换器的序列到序列模型之上。具体而言,我们使用1)检查点平均2)模型缩放3)模型缩放3)与从左右分解模型的逆转传播和知识蒸馏的数据增强4)从前一年的测试集上的FINETUNING 5)型号集合6)浅融合解码变压器语言模型和7)嘈杂的频道重新排名。此外,我们的BioMedical任务提交的英语 - 俄语使用生物学偏见的词汇表,并从事新闻任务数据的划痕,从新闻任务数据集中策划的医学相关文本以及共享任务提供的生物医学数据。我们的新闻系统在WMT'20 en-de试验中实现了39.5的Sacrebleu得分优于去年任务38.8的最佳提交。我们的生物医学任务ru-en和en-ru系统分别在WMT'20生物医学任务测试集中达到43.8和40.3的Bleu分数,优于上一年的最佳提交。
translated by 谷歌翻译
Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness the segmentation ambiguity as a noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements especially on low resource and out-of-domain settings.
translated by 谷歌翻译
在几乎所有文本生成应用中,Word序列在左右(L2R)或左右(R2L)方式中构造,因为自然语言句子是写入L2R或R2L。但是,我们发现自然语言书面订单对文本生成至关重要。在本文中,我们提出了一种螺旋语言建模(SLM),这是一种普遍的方法,使人们能够构建超出L2R和R2L订单的自然语言句子。 SLM允许其中一个从结果文本内的任意令牌开始,并在所选的任意令牌中展开REST令牌。它使解码顺序除了语言模型困惑之外的新优化目标,这进一步提高了所生成文本的分集和质量。此外,SLM使得可以通过选择正确的开始令牌来操纵文本构建过程。 SLM还将生成排序引入了额外的正则化,以提高低资源方案中的模型稳健性。 8次广泛研究的神经机翻译(NMT)任务的实验表明,与传统的L2R解码方法相比,SLM高达4.7 BLEU增加。
translated by 谷歌翻译
We propose BERTSCORE, an automatic evaluation metric for text generation. Analogously to common metrics, BERTSCORE computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTSCORE correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTSCORE is more robust to challenging examples when compared to existing metrics.
translated by 谷歌翻译
Recently, contrastive learning attracts increasing interests in neural text generation as a new solution to alleviate the exposure bias problem. It introduces a sequence-level training signal which is crucial to generation tasks that always rely on auto-regressive decoding. However, previous methods using contrastive learning in neural text generation usually lead to inferior performance. In this paper, we analyse the underlying reasons and propose a new Contrastive Neural Text generation framework, CoNT. CoNT addresses bottlenecks that prevent contrastive learning from being widely adopted in generation tasks from three aspects -- the construction of contrastive examples, the choice of the contrastive loss, and the strategy in decoding. We validate CoNT on five generation tasks with ten benchmarks, including machine translation, summarization, code comment generation, data-to-text generation and commonsense generation. Experimental results show that CoNT clearly outperforms the conventional training framework on all the ten benchmarks with a convincing margin. Especially, CoNT surpasses previous the most competitive contrastive learning method for text generation, by 1.50 BLEU on machine translation and 1.77 ROUGE-1 on summarization, respectively. It achieves new state-of-the-art on summarization, code comment generation (without external data) and data-to-text generation.
translated by 谷歌翻译
我们介绍了双图:一种简单但有效的训练策略,以提高神经机器翻译(NMT)性能。它由两个程序组成:双向预处理和单向填充。这两个过程均使用SIMCUT,这是一种简单的正则化方法,迫使原始句子对的输出分布之间的一致性。在不利用额外的数据集通过反翻译或集成大规模预认证的模型的情况下,BI-Simcut可以在五个翻译基准(数据尺寸从160K到20.20万)中实现强大的翻译性能:EN-的BLEU得分为31.16,EN-> DE和38.37的BLEU得分为38.37 de-> en在IWSLT14数据集上,en-> de的30.78和35.15在WMT14数据集上进行DE-> en,而WMT17数据集中的ZH-> EN为27.17。 Simcut不是一种新方法,而是简化和适用于NMT的cutoff(Shen等,2020)的版本,可以将其视为基于扰动的方法。鉴于Simcut和Bi-Simcut的普遍性和简单性,我们认为它们可以作为未来NMT研究的强大基准。
translated by 谷歌翻译
Pre-trained models have achieved remarkable success in natural language processing (NLP). However, existing pre-training methods underutilize the benefits of language understanding for generation. Inspired by the idea of Generative Adversarial Networks (GANs), we propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator, unifying the ability of language understanding and generation in a single model. Our model, named as GanLM, is trained with two pre-training objectives: replaced token detection and replaced token denoising. Specifically, given masked source sentences, the generator outputs the target distribution and the discriminator predicts whether the target sampled tokens from distribution are incorrect. The target sentence is replaced with misclassified tokens to construct noisy previous context, which is used to generate the gold sentence. In general, both tasks improve the ability of language understanding and generation by selectively using the denoising data. Extensive experiments in language generation benchmarks show that GanLM with the powerful language understanding capability outperforms various strong pre-trained language models (PLMs) and achieves state-of-the-art performance.
translated by 谷歌翻译
Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2019) represents the latest incarnation of pretrained language models which have recently advanced a wide range of natural language processing tasks. In this paper, we showcase how BERT can be usefully applied in text summarization and propose a general framework for both extractive and abstractive models. We introduce a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences. Our extractive model is built on top of this encoder by stacking several intersentence Transformer layers. For abstractive summarization, we propose a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two (the former is pretrained while the latter is not). We also demonstrate that a two-staged fine-tuning approach can further boost the quality of the generated summaries. Experiments on three datasets show that our model achieves stateof-the-art results across the board in both extractive and abstractive settings. 1
translated by 谷歌翻译
我们为神经机翻译(NMT)提供了一个开源工具包。新工具包主要基于拱形变压器(Vaswani等,2017)以及下面详述的许多其他改进,以便创建一个独立的,易于使用,一致和全面的各个领域的机器翻译任务框架。它是为了支持双语和多语言翻译任务的工具,从构建各个语料库的模型开始推断新的预测或将模型打包给提供功能的JIT格式。
translated by 谷歌翻译
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART -a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective . mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show it also enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
translated by 谷歌翻译
We introduce extreme summarization, a new single-document summarization task which does not favor extractive strategies and calls for an abstractive modeling approach. The idea is to create a short, one-sentence news summary answering the question "What is the article about?". We collect a real-world, large scale dataset for this task by harvesting online articles from the British Broadcasting Corporation (BBC). We propose a novel abstractive model which is conditioned on the article's topics and based entirely on convolutional neural networks. We demonstrate experimentally that this architecture captures longrange dependencies in a document and recognizes pertinent content, outperforming an oracle extractive system and state-of-the-art abstractive approaches when evaluated automatically and by humans. 1
translated by 谷歌翻译
在序列到序列学习中,例如,自然语言生成,解码器依赖于注意机制,以有效地从编码器中提取信息。虽然常见的做法是从最后一个编码器层绘制信息,但最近的工作已经提出用于使用来自不同编码器层的表示,以进行多样化的信息。尽管如此,解码器仍然仅获得源序列的单个视图,这可能导致由于层级绕过问题而导致编码器层堆栈的训练不足。在这项工作中,我们提出了层次的多视图解码,其中对于每个解码器层以及来自最后一个编码器层的表示,它作为全局视图,来自其他编码器层的那些是用于立体视图的源序列。系统实验和分析表明,我们成功地解决了层次结构绕过问题,需要几乎可忽略的参数增加,并大大提高了五种不同任务的深度表示的序列到序列学习的性能,即机器翻译,抽象总结,图像标题,视频字幕和医疗报告生成。特别是,我们的方法在八个基准数据集中实现了新的最先进的结果,包括低资源机器转换数据集和两个低资源医疗报告生成数据集。
translated by 谷歌翻译