Current state-of-the-art summarization models are trained with either maximum likelihood estimation (MLE) or reinforcement learning (RL). In this study, we investigate a third training paradigm and argue that inverse reinforcement learning (IRL) may be more suitable for text summarization. IRL focuses on estimating the reward function of an agent, given a set of observations of that agent's behavior. Generally, IRL provides advantages in situations where the reward function is not explicitly known or where it is difficult to define or interact with the environment directly. These situations are exactly what we observe in summarization. Thus, we introduce inverse reinforcement learning into text summarization and define a suite of sub-rewards that are important for summarization optimization. By simultaneously estimating the reward function and optimizing the summarization agent with expert demonstrations, we show that the model trained with IRL produces summaries that closely follow human behavior, with better ROUGE, coverage, novelty, compression ratio and factuality than the baselines trained with MLE and RL.
translated by Google Translate
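State-of-the-art abstractive summarization systems often generate hallucinations, i.e., content that is not directly inferable from the source text. Despite being assumed incorrect, we find that much hallucinated content is factual, namely consistent with world knowledge. These factual hallucinations can be beneficial in a summary by providing useful background information. In this work, we propose a novel detection approach that separates factual from non-factual hallucinations of entities. Our method utilizes an entity's prior and posterior probabilities according to pre-trained and fine-tuned masked language models, respectively. Empirical results suggest that our approach vastly outperforms two baselines in both accuracy and F1 score and strongly correlates with human judgments on factuality classification tasks. Furthermore, we show that our detector, when used as a reward signal in an offline reinforcement learning (RL) algorithm, significantly improves the factuality of summaries while maintaining the level of abstractiveness.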
ROUGE is a standard n-gram-based automatic evaluation metric for sequence-to-sequence tasks, while cross-entropy loss, the essential objective of neural network language models, optimizes at the unigram level. We present differentiable n-gram objectives that attempt to alleviate the discrepancy between the training criterion and the evaluation criterion. The objective maximizes the probabilistic weight of matched sub-sequences. The novelty of our work is that the objective weights the matched sub-sequences equally and does not cap the number of matched sub-sequences by the ground-truth count of n-grams in the reference sequence. We jointly optimize the cross-entropy loss and the proposed objective, obtaining a decent ROUGE improvement on the abstractive summarization datasets CNN/DM and XSum and outperforming alternative n-gram objectives.
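The difference between capping and not capping matched n-grams by the reference count can be illustrated with a small counting sketch. This is a discrete, non-differentiable illustration only, not the proposed objective; the function names and the `clip` flag are hypothetical and introduced here for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def matched_ngrams(candidate, reference, n, clip=True):
    """Count candidate n-grams that also occur in the reference.

    clip=True  : each n-gram counts at most as often as it occurs in the
                 reference (the usual ROUGE/BLEU-style cap).
    clip=False : every matching candidate n-gram counts, with no cap by
                 the reference count (hypothetical flag for illustration).
    """
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = 0
    for gram, count in cand.items():
        if gram in ref:
            total += min(count, ref[gram]) if clip else count
    return total


reference = "the cat sat on the mat".split()
candidate = "the cat the cat sat".split()

clipped = matched_ngrams(candidate, reference, 2, clip=True)
unclipped = matched_ngrams(candidate, reference, 2, clip=False)
print(clipped, unclipped)
```

Here the bigram ("the", "cat") appears twice in the candidate but once in the reference, so the capped count credits it once and the uncapped count credits it twice.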
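As a fundamental task in natural language generation, document summarization aims to produce a short and coherent summary for a given document. Controllable summarization, especially of length, is an important issue for some practical applications, in particular how to trade off the length constraint against information integrity. In this paper, we propose an Adaptive Length Controlling Optimization (ALCO) method that leverages a two-stage abstractive summarization model via reinforcement learning. ALCO incorporates the length constraint into the sentence extraction stage to penalize overlength extracted sentences. Meanwhile, a saliency estimation mechanism is designed to preserve the salient information in the generated sentences. A series of experiments have been conducted on the widely used benchmark dataset CNN/Daily Mail. The results show that ALCO performs better than popular baselines in terms of length controllability and content preservation.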
Abstractive dialogue summarization has received increasing attention recently. Although most current dialogue summarization systems are trained to maximize the likelihood of human-written summaries and have achieved significant results, there is still a huge gap in generating summaries that humans judge to be high quality on dimensions such as coherence and faithfulness, partly due to the misalignment in maximizing the likelihood of a single human-written summary. To this end, we propose to incorporate different levels of human feedback into the training process. This enables us to guide the models to capture the behaviors humans care about in summaries. Specifically, we ask humans to highlight the salient information to be included in summaries to provide the local feedback, and to make overall comparisons among summaries in terms of coherence, accuracy, coverage, conciseness and overall quality, as the global feedback. We then combine both local and global feedback to fine-tune the dialogue summarization policy with reinforcement learning. Experiments conducted on multiple datasets demonstrate the effectiveness and generalization of our methods over the state-of-the-art supervised baselines, especially in terms of human judgments.
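Due to exposure bias, most existing natural language generation (NLG) models trained by maximizing the likelihood objective produce poor text at inference time. In this paper, to tackle this problem, we revisit the generate-then-rank framework and propose a joint generator-ranker (JGR) training algorithm for text generation tasks. In JGR, the generator model is trained by maximizing two objectives: the likelihood of the training corpus and the expected reward given by the ranker model. Meanwhile, the ranker model takes input samples from the generator model and learns to distinguish good samples from the generation pool. The generator and ranker models are alternately optimized until convergence. In the empirical study, the proposed JGR model achieves new state-of-the-art performance on five public benchmarks covering three popular generation tasks: summarization, question generation, and response generation. We will make code, data, and models publicly available at https://github.com/microsoft/advnlg.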
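Automatic summarization of legal texts is an important yet still challenging task, since legal documents are often long and complicated, with unusual structures and styles. Recent deep models trained end-to-end with differentiable losses can summarize natural text well, yet they show limited results when applied to the legal domain. In this paper, we propose to use reinforcement learning to train current deep summarization models to improve their performance on the legal domain. To this end, we adopt proximal policy optimization methods and introduce novel reward functions that encourage the generation of candidate summaries satisfying both lexical and semantic criteria. We apply our method to training different summarization backbones and observe a consistent and significant performance gain across 3 public legal datasets.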
Sentence summarization shortens given texts while maintaining their core contents. Unsupervised approaches have been studied to summarize texts without human-written summaries. However, recent unsupervised models are extractive: they remove words from texts and are thus less flexible than abstractive summarization. In this work, we devise an abstractive model based on reinforcement learning without ground-truth summaries. We formulate unsupervised summarization as a Markov decision process with rewards representing the summary quality. To further enhance the summary quality, we develop a multi-summary learning mechanism that generates multiple summaries with varying lengths for a given text, while making the summaries mutually enhance each other. Experimental results show that the proposed model substantially outperforms both abstractive and extractive models, while frequently generating new words not contained in the input texts.
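Long documents such as academic articles and business reports have been the standard format for detailing important issues and complicated subjects that require extra attention. An automatic summarization system that can effectively condense long documents into short and concise texts that encapsulate the most important information would thus be significant in aiding the reader's comprehension. Recently, with the advent of neural architectures, significant research efforts have been made to advance automatic text summarization systems, and numerous studies have emerged on the challenges of extending these systems to the long document domain. In this survey, we provide a comprehensive overview of the research on long document summarization and a systematic evaluation across the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. For each component, we organize the literature within the context of long document summarization and conduct an empirical analysis to broaden the perspective on current research progress. The empirical analysis includes a study of the intrinsic characteristics of benchmark datasets, a multi-dimensional analysis of summarization models, and a review of summarization evaluation metrics. Based on the overall findings, we conclude by proposing possible directions for future exploration in this rapidly growing field.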
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
Current abstractive summarization systems present important weaknesses which prevent their deployment in real-world applications, such as the omission of relevant information and the generation of factual inconsistencies (also known as hallucinations). At the same time, automatic evaluation metrics such as CTC scores have been recently proposed that exhibit a higher correlation with human judgments than traditional lexical-overlap metrics such as ROUGE. In this work, we intend to close the loop by leveraging the recent advances in summarization metrics to create quality-aware abstractive summarizers. Namely, we propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics. We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries. Nonetheless, human evaluation results show that the re-ranking approach should be used with care for highly abstractive summaries, as the available metrics are not yet sufficiently reliable for this purpose.
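Widely used evaluation metrics for text generation either do not work well with longer texts or fail to evaluate all aspects of text quality. In this paper, we introduce a new metric called SMART to mitigate such limitations. Specifically, we treat sentences, rather than tokens, as the basic units of matching, and use a sentence matching function to soft-match candidate and reference sentences. Candidate sentences are also compared to sentences in the source documents to allow grounding (e.g., factuality) evaluation. Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform all competing metrics on the SummEval summarization meta-evaluation dataset, while the same metric with a string-based matching function is competitive with current model-based metrics. The latter does not use any neural model, which is useful during model development phases where resources can be limited and fast evaluation is required. Finally, we also conduct extensive analyses showing that our proposed metrics work well with longer summaries and are less biased towards specific models.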
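Existing abstractive summarization models lack explicit control mechanisms that would allow users to influence the stylistic features of the model outputs. This results in generic summaries that do not cater to users' needs or preferences. To address this issue, we introduce HydraSum, a new summarization architecture that extends the single-decoder framework of current models, e.g., BART, to a mixture-of-experts version consisting of multiple decoders. Our proposed model encourages each expert, i.e., decoder, to learn and generate stylistically distinct summaries along dimensions such as abstractiveness, length, specificity, and others. At each time step, HydraSum employs a gating mechanism that decides the contribution of each individual decoder to the next token's output probability distribution. Through experiments on three summarization datasets (CNN, Newsroom, XSum), we demonstrate that this gating mechanism automatically learns to assign contrasting summary styles to different HydraSum decoders under the standard training objective, without the need for additional supervision. We further show that a guided version of the training process can explicitly govern which summary style is partitioned between decoders, e.g., high abstractiveness vs. low abstractiveness or high specificity vs. low specificity, and also increase the stylistic difference between individual decoders. Finally, our experiments demonstrate that our decoder framework is highly flexible: during inference, we can sample from individual decoders or from mixtures of different subsets of the decoders to yield a diverse set of summaries and enforce single- and multi-style control over summary generation.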
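One simple way to picture a gate that decides each decoder's contribution to the next-token distribution is a convex combination of the per-decoder distributions. The sketch below assumes exactly that form; the abstract does not specify the mixing rule, and `mix_decoders`, the gate values, and the toy three-token vocabulary are hypothetical.

```python
def mix_decoders(gate_weights, decoder_probs):
    """Combine per-decoder next-token distributions with gate weights.

    Assumption for illustration: the gate outputs non-negative weights that
    sum to 1, and the mixed distribution is their weighted average.
    """
    if abs(sum(gate_weights) - 1.0) > 1e-9:
        raise ValueError("gate weights must sum to 1")
    vocab_size = len(decoder_probs[0])
    mixed = [0.0] * vocab_size
    for weight, probs in zip(gate_weights, decoder_probs):
        for token_id, p in enumerate(probs):
            mixed[token_id] += weight * p
    return mixed


# Two toy decoders over a three-token vocabulary.
decoder_probs = [
    [0.7, 0.2, 0.1],  # decoder 0
    [0.1, 0.3, 0.6],  # decoder 1
]
gate = [0.25, 0.75]  # hypothetical gate output at this time step
mixed = mix_decoders(gate, decoder_probs)
print(mixed)
```

Setting the gate to `[1.0, 0.0]` recovers decoder 0 alone, which corresponds to sampling from an individual decoder at inference time.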
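Document summarization condenses a long document into a short version with salient information and accurate semantic descriptions. The main issue is how to make the output summary semantically consistent with the input document. To reach this goal, researchers have recently focused on supervised end-to-end hybrid approaches, which contain an extractor module and an abstractor module. The extractor identifies the salient sentences in the input document, and the abstractor generates a summary from the salient sentences. Such a model successfully keeps the generated summary consistent with the reference summary via various strategies (e.g., reinforcement learning). There are two semantic gaps when training the hybrid model (one between the document and the extracted sentences, the other between the extracted sentences and the summary). However, existing methods do not consider them explicitly, which usually results in a semantic bias in the summary. To mitigate this issue, this paper proposes a new Reinforcing Semantic-Symmetry Learning Model for document summarization (ReSyM). ReSyM introduces a semantic-consistency reward in the extractor to bridge the first gap. A semantic dual-reward is designed to bridge the second gap in the abstractor. The whole document summarization process is implemented via reinforcement learning with a hybrid reward mechanism (combining the above two rewards). Moreover, a comprehensive sentence representation learning method is presented to sufficiently capture the information in the original document. A series of experiments have been conducted on two widely used benchmark datasets, CNN/Daily Mail and BigPatent. Comparison with state-of-the-art baselines on various evaluation metrics shows the superiority of ReSyM.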
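Artificial intelligence, particularly through recent advances in deep learning, has achieved exceptional performance on many tasks in fields such as natural language processing and computer vision. In addition to desirable evaluation metrics, a high level of interpretability is often required of these models. Therefore, explanations of the process by which a model maps its inputs to its outputs are much sought after. Unfortunately, the current black-box nature of machine learning models remains an unresolved issue, and it prevents researchers from learning about and providing explanatory descriptions of a model's behavior and final predictions. In this work, we propose a novel framework utilizing adversarial inverse reinforcement learning that can provide global explanations for decisions made by a reinforcement learning model and capture intuitive tendencies that the model follows by summarizing the model's decision-making process.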
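Efficiently exploring a tremendous amount of data to make a decision, similar to answering a complicated question, is challenging in many real-world application scenarios. In this context, automatic summarization is of substantial importance, as it provides the foundation for big data analytics. Traditional summarization approaches optimize the system to produce a short, static summary meant to fit all users, ignoring the subjective aspect of summarization, i.e., what different users deem valuable, which makes these approaches impractical in real-world use cases. This paper proposes an interactive concept-based summarization model, called Adaptive Summaries, that helps users make their desired summary instead of producing a single inflexible summary. The system gradually learns from the information users provide as they interact with it by giving feedback in an iterative loop. Users can choose to reject or accept the inclusion of a concept in the summary, indicating the importance of that concept from their perspective and the confidence level of their feedback. The proposed approach can guarantee interactive speed to keep the user engaged in the process. Furthermore, it eliminates the need for reference summaries, which is a challenging issue for summarization tasks. Evaluations show that Adaptive Summaries helps users make high-quality summaries based on their preferences by maximizing the user-desired content in the generated summaries.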
Despite the recent progress in language generation models, their outputs may not always meet user expectations. In this work, we study whether informational feedback in natural language can be leveraged to improve generation quality and user preference alignment. To this end, we consider factual consistency in summarization, the quality that the summary should only contain information supported by the input documents, for user preference alignment. We collect a high-quality dataset, DeFacto, containing human demonstrations and informational feedback in natural language consisting of corrective instructions, edited summaries, and explanations with respect to the factual consistency of the summary. Using our dataset, we study two natural language generation tasks: 1) editing a summary using the human feedback, and 2) generating human feedback from the original summary. Using the two tasks, we further evaluate if models can automatically correct factual inconsistencies in generated summaries. We show that the human-edited summaries we collected are more factually consistent, and pre-trained language models can leverage our dataset to improve the factual consistency of original system-generated summaries in our proposed generation tasks. We make the DeFacto dataset publicly available at https://github.com/microsoft/DeFacto.
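Despite the success of recent abstractive summarizers on automatic evaluation metrics, the generated summaries still present factual inconsistencies with the source document. In this paper, we focus on entity-level factual inconsistency, i.e., reducing the mismatched entities between the generated summaries and the source documents. We therefore propose a novel entity-based SpanCopy mechanism and explore its extension with a global relevance component. Experimental results on four summarization datasets show that SpanCopy can effectively improve entity-level factual consistency with essentially no change in word-level and entity-level saliency. The code is available at https://github.com/wendy-xiao/entity...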
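Generating factually consistent summaries is a challenging task for abstractive summarization. Previous works mainly encode factual information or perform post-correction or re-ranking after decoding. In this paper, we provide a factually consistent solution from the perspective of contrastive learning, which is a natural extension of previous works. We propose CO2Sum (Contrastive for Consistency), a contrastive learning scheme that can be easily applied to sequence-to-sequence models for factually consistent abstractive summarization, showing that a model can be made fact-aware without modifying its architecture. CO2Sum applies contrastive learning on the encoder, which helps the model be aware of the factual information contained in the input article, or on the decoder, which leads the model to generate factually correct output summaries. Moreover, these two schemes are orthogonal and can be combined to further improve faithfulness. Comprehensive experiments on public benchmarks demonstrate that CO2Sum improves the faithfulness of large pre-trained language models and reaches competitive results compared to other strong factually consistent summarization baselines.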
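Extractive summarization produces summaries by identifying and concatenating the most important sentences in a document. Since most summarization datasets do not come with gold labels indicating whether document sentences are summary-worthy, different labeling algorithms have been proposed to extrapolate oracle extracts for model training. In this work, we identify two flaws in the widely used greedy labeling approach: it delivers suboptimal and deterministic oracles. To alleviate both issues, we propose a simple yet effective labeling algorithm that creates soft, expectation-based sentence labels. We define a new learning objective for extractive summarization which incorporates learning signals from multiple oracle summaries, and prove that it is equivalent to estimating the oracle expectation for each document sentence. Without any architectural modifications, the proposed labeling scheme achieves superior performance on a variety of summarization benchmarks across domains and languages, in both supervised and zero-shot settings.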
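The idea of a soft, expectation-based sentence label can be sketched as the expected membership of a sentence across several oracle summaries. The example below is an illustration under stated assumptions, not the paper's algorithm: the oracle candidates, their quality scores, and the choice to normalize scores into a probability distribution are all hypothetical.

```python
def soft_labels(num_sentences, oracles):
    """Compute expectation-style soft labels for document sentences.

    oracles: list of (sentence_indices, score) pairs, where each entry is one
    candidate oracle summary and a non-negative quality score for it.
    Assumption for illustration: scores are normalized into a distribution
    over oracles, and a sentence's label is the total probability of the
    oracles that contain it.
    """
    total = sum(score for _, score in oracles)
    if total <= 0:
        raise ValueError("oracle scores must have a positive sum")
    labels = [0.0] * num_sentences
    for indices, score in oracles:
        weight = score / total
        for i in set(indices):  # count each sentence once per oracle
            labels[i] += weight
    return labels


# Three hypothetical oracle summaries for a four-sentence document.
oracles = [((0, 2), 0.5), ((0, 3), 0.3), ((1, 2), 0.2)]
labels = soft_labels(4, oracles)
print(labels)
```

A greedy labeler would commit to a single oracle and emit hard 0/1 labels; here sentence 0 receives a high but not maximal label because it appears in two of the three candidate oracles.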