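Figurative language generation is the task of reformulating a given text in a desired figure of speech while remaining faithful to the original context. We take the first step towards multi-figurative language modelling by providing a benchmark for the automatic generation of five common figurative forms in English. We train mFLAG using a scheme for multi-figurative language pre-training on top of BART, together with a mechanism for injecting the target figurative information into the encoder; this enables the generation of text in the target figurative form from another figurative form without parallel figurative-figurative sentence pairs. Our approach outperforms all strong baselines. We also offer some qualitative analysis and reflections on the relationship between the different figures of speech.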
translated by Google Translate
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART, a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low-resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
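Idiomatic expressions (IEs) play an important role in natural language. In this paper, we study the task of idiomatic sentence paraphrasing (ISP), which aims to paraphrase a sentence containing an IE by replacing the IE with its literal paraphrase. The lack of large-scale corpora with idiomatic-literal parallel sentences is the primary challenge for this task, for which we consider two separate solutions. First, we propose an unsupervised approach to ISP, which leverages an IE's contextual information and definition and does not require a parallel sentence training set. Second, we propose a weakly supervised approach that uses back-translation to jointly perform paraphrasing and generation of sentences with IEs in order to enlarge a small-scale parallel sentence training dataset. Other significant derivatives of the study include a model that replaces a literal phrase in a sentence with an IE to generate an idiomatic expression, and a large-scale parallel dataset of idiomatic/literal sentence pairs. The effectiveness of the proposed solutions compared to competitive baselines is seen in relative gains of over 5.16 points in BLEU, over 8.75 points in METEOR, and over 19.57 points in SARI when the generated sentences are empirically validated on a parallel dataset using automatic and manual evaluations. We demonstrate the practical utility of ISP as a preprocessing step in En-De machine translation.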
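We present two novel unsupervised methods for eliminating toxicity in text. Our first method combines two recent ideas: (1) guidance of the generation process with small style-conditional language models and (2) use of paraphrasing models to perform style transfer. We use a well-performing paraphraser guided by style-trained language models to keep the text content and remove toxicity. Our second method uses BERT to replace toxic words with their non-offensive synonyms. We make the method more flexible by enabling BERT to replace mask tokens with a variable number of words. Finally, we present the first large-scale comparative study of style transfer models on the task of toxicity removal. We compare our models with a number of methods for style transfer. The models are evaluated in a reference-free way using a combination of unsupervised style transfer metrics. Both methods we suggest yield new SOTA results.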
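There has been recent success in pre-training on monolingual data and fine-tuning on machine translation (MT), but it remains unclear how best to leverage a pre-trained model for a given MT task. This paper investigates the benefits and drawbacks of freezing parameters when fine-tuning a pre-trained model on MT. We focus on 1) fine-tuning a model trained only on English monolingual data, BART, and 2) fine-tuning a model trained on monolingual data from 25 languages, mBART. For BART, we get the best performance by freezing most of the model parameters and adding extra positional embeddings. For mBART, we match or outperform the performance of naive fine-tuning for most language pairs with the encoder, and most of the decoder, frozen. The encoder-decoder attention parameters are the most important to fine-tune. When constraining ourselves to an out-of-domain training set for Vietnamese to English, we see the largest improvements over the fine-tuning baseline.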
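Idiomatic expressions (IEs), characterized by their non-compositionality, are an important part of natural language. They have been a classical challenge for NLP, including the pre-trained language models that drive today's state of the art. Prior work has identified deficiencies in their contextualized representations stemming from the underlying compositional paradigm of representation. In this work, we take a first-principles approach to building idiomaticity into BART, using an adapter as a lightweight non-compositional language expert trained on idiomatic sentences. The improved capability over baselines (e.g., BART) is seen via intrinsic and extrinsic methods: the homogeneity score for embedding clustering is 0.19 points higher, and performance on the idiom processing tasks of IE sense disambiguation and span detection improves by up to 25%.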
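The stylistic properties of text have intrigued computational linguistics researchers in recent years. Specifically, researchers have investigated the text style transfer (TST) task, which aims to change the stylistic properties of a text while retaining its style-independent content. Over the last few years, many novel TST algorithms have been developed, while industry has leveraged these algorithms to enable exciting TST applications. The field of TST research has grown rapidly because of this symbiosis. This article aims to provide a comprehensive review of recent research efforts on text style transfer. More concretely, we create a taxonomy to organize the TST models and provide a comprehensive summary of the state of the art. We review the existing evaluation methodologies for TST tasks and conduct a large-scale reproducibility study in which we experimentally benchmark 19 state-of-the-art TST algorithms on two publicly available datasets. Finally, we expand on current trends and provide new perspectives on the new developments in the TST field.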
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
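The two noising operations named in this abstract, sentence shuffling and span in-filling with a single mask token, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `<mask>` string, the whitespace tokenization, and the `span_start_prob` and `max_span` parameters are all assumptions made for illustration, and spans are drawn uniformly here only for simplicity.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol; a real tokenizer defines its own


def shuffle_sentences(sentences, rng):
    """Return a copy of the sentence list in random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def infill_spans(tokens, rng, span_start_prob=0.15, max_span=3):
    """Replace random spans of tokens with ONE mask token per span.

    span_start_prob is the chance that a span starts at a given position
    (an illustrative parameter, not a value from the paper).
    """
    out = []
    i = 0
    while i < len(tokens):
        if rng.random() < span_start_prob:
            span = rng.randint(1, max_span)  # uniform span length, for simplicity
            out.append(MASK)                 # a single mask stands in for the span
            i += span                        # skip every token in the span
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    doc = ["the cat sat .", "it was warm .", "then it slept ."]
    print(shuffle_sentences(doc, rng))
    print(infill_spans(" ".join(doc).split(), rng))
```

Because one mask replaces a span of unknown length, a model trained to undo this corruption must also predict how many tokens are missing, which is what distinguishes in-filling from per-token masking.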
Pre-training is an effective technique for ensuring robust performance on a variety of machine learning tasks. It typically depends on large-scale crawled corpora that can result in toxic or biased models. Such data can also be problematic with respect to copyright, attribution, and privacy. Pre-training with synthetic tasks and data is a promising way of alleviating such concerns since no real-world information is ingested by the model. Our goal in this paper is to understand what makes for a good pre-trained model when using synthetic resources. We answer this question in the context of neural machine translation by considering two novel approaches to translation model pre-training. Our first approach studies the effect of pre-training on obfuscated data derived from a parallel corpus by mapping words to a vocabulary of 'nonsense' tokens. Our second approach explores the effect of pre-training on procedurally generated synthetic parallel data that does not depend on any real human language corpus. Our empirical evaluation on multiple language pairs shows that, to a surprising degree, the benefits of pre-training can be realized even with obfuscated or purely synthetic parallel data. In our analysis, we consider the extent to which obfuscated and synthetic pre-training techniques can be used to mitigate the issue of hallucinated model toxicity.
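The first approach in this abstract, mapping words to a vocabulary of 'nonsense' tokens, can be sketched as a one-to-one substitution over word types. The sketch below is only an assumption about how such a mapping might look: the helper names, the fixed token length, and the use of random lowercase strings are hypothetical and are not taken from the paper.

```python
import random
import string


def random_token(rng, length):
    """Draw one random lowercase string to serve as a nonsense token."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def build_obfuscation_map(vocab, rng, token_len=6):
    """Assign every word type its own unique nonsense token."""
    mapping = {}
    used = set()
    for word in sorted(vocab):        # sorted for reproducibility with a seeded rng
        token = random_token(rng, token_len)
        while token in used:          # redraw on the rare collision
            token = random_token(rng, token_len)
        used.add(token)
        mapping[word] = token
    return mapping


def obfuscate(sentence, mapping):
    """Rewrite a whitespace-tokenized sentence with its nonsense tokens."""
    return " ".join(mapping[word] for word in sentence.split())


if __name__ == "__main__":
    rng = random.Random(0)
    corpus = ["the cat sat", "the dog sat"]
    vocab = {w for line in corpus for w in line.split()}
    table = build_obfuscation_map(vocab, rng)
    for line in corpus:
        print(obfuscate(line, table))
```

A one-to-one mapping of this kind hides the surface vocabulary while keeping sentence length and word co-occurrence structure intact, which is the property that would let a model still learn something from the obfuscated corpus.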
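Pre-trained language models (PLMs) have achieved remarkable success in natural language generation (NLG) tasks. Up to now, most PLMs have been pre-trained in an unsupervised manner on large general corpora. Meanwhile, an increasing number of models pre-trained with labeled data show superior performance compared to unsupervised models. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation. To pre-train the text generation model MVP, we collect a labeled pre-training corpus from 45 datasets over seven generation tasks. For each task, we further pre-train specific soft prompts to stimulate the model's capacity to perform that task. Extensive experiments demonstrate the effectiveness of our supervised pre-training on a number of NLG tasks, and our general method achieves state-of-the-art performance on 12 of 17 datasets.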
Unavailability of parallel corpora for training text style transfer (TST) models is a very challenging yet common scenario. Also, TST models implicitly need to preserve the content while transforming a source sentence into the target style. To tackle these problems, an intermediate representation is often constructed that is devoid of style while still preserving the meaning of the source sentence. In this work, we study the usefulness of the Abstract Meaning Representation (AMR) graph as the intermediate style-agnostic representation. We posit that semantic notations like AMR are a natural choice for an intermediate representation. Hence, we propose T-STAR: a model comprising two components, a text-to-AMR encoder and an AMR-to-text decoder. We propose several modeling improvements to enhance the style agnosticity of the generated AMR. To the best of our knowledge, T-STAR is the first work that uses AMR as an intermediate representation for TST. With thorough experimental evaluation, we show that T-STAR significantly outperforms state-of-the-art techniques, achieving on average 15.2% higher content preservation with negligible loss (approx. 3%) in style accuracy. Through detailed human evaluation with 90,000 ratings, we also show that T-STAR has up to 50% fewer hallucinations compared to state-of-the-art TST models.
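The definition generation task aims to automatically generate a word's definition within a specific context. However, owing to the lack of datasets covering different complexities, the definitions produced by models tend to stay at the same complexity level. This paper proposes a novel task of generating definitions for a word with controllable complexity levels. Correspondingly, we introduce COMPILE, a dataset that gives detailed information about Chinese definitions, in which each definition is labeled with its complexity level. The COMPILE dataset includes 74,303 words and 106,882 definitions. To the best of our knowledge, it is the largest dataset for the Chinese definition generation task. We select various representative generation methods as baselines for this task and conduct evaluations, which show that our dataset plays an outstanding role in helping models generate definitions at different complexity levels. We believe that the COMPILE dataset will benefit further research in complexity-controllable definition generation.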
Neural machine translation (NMT) has attracted wide attention due to its impressive quality. Beyond quality, controlling translation style is also an important demand for many languages. Previous related studies mainly focus on controlling formality and achieve some improvements. However, they still face two challenges. The first is limited evaluation: style covers abundant information including lexis, syntax, etc., but only formality is well studied. The second is the heavy reliance on iterative fine-tuning when new styles are required. Correspondingly, this paper contributes in terms of both benchmark and approach. First, we revisit this task and propose a multiway stylized machine translation (MSMT) benchmark, which includes multiple categories of styles in four language directions to push the boundary of this task. Second, we propose a method named style activation prompt (StyleAP), which retrieves prompts from a stylized monolingual corpus and needs no extra fine-tuning. Experiments show that StyleAP can effectively control the style of translation and achieve remarkable performance. All of our data and code are released at https://github.com/IvanWang0730/StyleAP.
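Text style transfer is an important task in natural language generation, which aims to control certain attributes of the generated text, such as politeness, emotion, humor, and many others. It has a long history in the field of natural language processing, and has recently regained significant attention thanks to the promising performance brought by deep neural models. In this paper, we present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017. We discuss the task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies in the presence of parallel and non-parallel data. We also provide discussions on a variety of important topics regarding the future development of this task. Our curated paper list is at https://github.com/zhijing-jin/text_style_transfer_survey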
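This paper introduces a new data augmentation method for neural machine translation that can enforce stronger semantic consistency both within and across languages. Our method is based on a conditional masked language model (CMLM), which is bidirectional and can be conditioned on both left and right context, as well as on the label. We demonstrate that CMLM is a good technique for generating context-dependent word distributions. In particular, we show that CMLM is capable of enforcing semantic consistency by conditioning on both source and target during substitution. In addition, to enhance diversity, we incorporate the idea of soft word substitution for data augmentation, which replaces a word with a probability distribution over the vocabulary. Experiments on four translation datasets of different scales show that the overall solution results in more realistic data augmentation and better translation quality. Our approach consistently achieves the best performance in comparison with strong and recent works, and yields improvements of up to 1.90 BLEU points over the baseline.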
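Soft word substitution, as described in this abstract, replaces a word with a probability distribution over the vocabulary. One common way to realize that idea, assumed here rather than taken from the abstract, is to feed the model the probability-weighted average of all word embeddings instead of a single embedding row. The sketch below uses plain Python lists; the function names, the toy embedding table, and the logits are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(probs, embedding_table):
    """Return the expected embedding under the given word distribution.

    embedding_table[i] is the embedding vector of vocabulary item i.
    """
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Toy vocabulary of three words with 2-dimensional embeddings.
    table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # Hypothetical scores that a masked language model might assign to one position.
    probs = softmax([2.0, 0.5, 0.1])
    print(probs)
    print(soft_embedding(probs, table))
```

With a one-hot distribution the expected embedding reduces to an ordinary hard substitution, so the soft variant can be read as a generalization that keeps every plausible replacement in play at once.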
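Text style transfer (TST) aims to alter the underlying style of a source text to another specific style while keeping the same content. Due to the scarcity of high-quality parallel training data, unsupervised learning has become a trending direction for TST tasks. In this paper, we propose a novel VAE-based Text Style Transfer with pivOt Words Enhancement leaRning (VT-STOWER) method, which utilizes a variational autoencoder (VAE) and external style embeddings to learn semantics and style distribution jointly. Additionally, we introduce pivot words learning, which is applied to learn the decisive words for a specific style and thereby further improve the overall performance of style transfer. With its novel and flexible style strength control mechanism, the proposed VT-STOWER can be scaled to different TST scenarios given very limited and non-parallel training data. Experiments demonstrate that VT-STOWER outperforms the state of the art on sentiment, formality, and code-switching TST tasks.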
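A machine translation system (MTS) is an effective tool for communication, translating text or speech from one language to another. The need for an efficient translation system becomes obvious in a large multilingual environment like India, where English and a set of Indian languages (ILs) are officially used. In contrast with English, ILs are still treated as low-resource languages due to the unavailability of corpora. To address this asymmetry, multilingual neural machine translation (MNMT) has emerged as an ideal approach. In this paper, we propose an MNMT system to address the issues related to low-resource language translation. Our model comprises two MNMT systems, one for English-Indic (one-to-many) and the other for Indic-English (many-to-one), with a shared encoder-decoder covering 15 language pairs (30 translation directions). Most IL pairs have only a scanty amount of parallel corpora, which is not sufficient for training any machine translation model. We therefore explore various augmentation strategies to improve overall translation quality through the proposed model. A state-of-the-art Transformer architecture is used to realize the proposed model. Trials over a good amount of data reveal its superiority over conventional models. In addition, the paper addresses the use of language relationships (in terms of dialect, script, etc.), particularly the role of high-resource languages of the same family in boosting the performance of low-resource languages. Moreover, the experimental results also show the advantage of back-translation and domain adaptation for ILs in enhancing the translation quality of both source and target languages. Using all these key approaches, our proposed model proves more efficient than the baseline model in terms of the evaluation metric, i.e. BLEU (BiLingual Evaluation Understudy) score, for a set of ILs.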
Powerful generative models have led to recent progress in question generation (QG). However, it is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches. In this paper, we introduce QG-Bench, a multilingual and multidomain benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting. It includes general-purpose datasets such as SQuAD for English, datasets from ten domains and two styles, as well as datasets in eight different languages. Using QG-Bench as a reference, we perform an extensive analysis of the capabilities of language models for the task. First, we propose robust QG baselines based on fine-tuning generative language models. Then, we complement automatic evaluation based on standard metrics with an extensive manual evaluation, which in turn sheds light on the difficulty of evaluating QG models. Finally, we analyse both the domain adaptability of these models as well as the effectiveness of multilingual models in languages other than English. QG-Bench is released along with the fine-tuned models presented in the paper https://github.com/asahi417/lm-question-generation, which are also available as a demo https://autoqg.net/.
Controllable Text Generation (CTG) is an emerging area in the field of natural language generation (NLG). It is regarded as crucial for the development of advanced text generation technologies that are more natural and better meet the specific constraints of practical applications. In recent years, methods using large-scale pre-trained language models (PLMs), in particular the widely used transformer-based PLMs, have become a new paradigm of NLG, allowing generation of more diverse and fluent text. However, due to the lower level of interpretability of deep neural networks, the controllability of these methods needs to be guaranteed. To this end, controllable text generation using transformer-based PLMs has become a rapidly growing yet challenging new research hotspot. A diverse range of approaches has emerged in the past 3-4 years, targeting different CTG tasks which may require different types of controlled constraints. In this paper, we present a systematic critical review of the common tasks, main approaches, and evaluation methods in this area. Finally, we discuss the challenges that the field is facing, and put forward various promising future directions. To the best of our knowledge, this is the first survey paper to summarize CTG techniques from the perspective of PLMs. We hope it can help researchers in related fields to quickly track the academic frontier, providing them with a landscape of the area and a roadmap for future research.
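Tongue twisters are meaningful sentences that are difficult to pronounce. The process of automatically generating tongue twisters is challenging because the generated utterance must satisfy two conditions at once: phonetic difficulty and semantic meaning. Furthermore, phonetic difficulty is itself hard to characterize and is expressed in natural tongue twisters through a heterogeneous mix of phenomena such as alliteration and homophony. In this paper, we propose PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically. We leverage phoneme representations to capture the notion of phonetic difficulty, and we train language models to generate original tongue twisters on two proposed task settings. To do this, we curate a dataset called PANCETTA, consisting of existing English tongue twisters. Through automatic and human evaluation, as well as qualitative analysis, we show that PANCETTA generates novel, phonetically difficult, fluent, and semantically meaningful tongue twisters.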