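Text summarization methods have long attracted considerable attention. In recent years, deep learning has been applied to text summarization and has proven very effective. However, most deep-learning-based text summarization methods require large-scale datasets, which are difficult to obtain in practical applications. This paper proposes an unsupervised extractive text summarization method based on multi-round computation. Building on a directed-graph algorithm, we replace the traditional approach of computing sentence rankings once with multiple rounds of computation, and the summary sentences are dynamically optimized after each round to better match the characteristics of the text. Experiments are carried out on four datasets, which respectively contain Chinese, English, long, and short texts. The results show that our method outperforms baseline methods and other unsupervised methods and is robust across datasets.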
translated by 谷歌翻译
In unsupervised extractive summarization, learning high-quality sentence representations is essential for selecting salient sentences from the input document. Previous studies focus on statistical approaches or pre-trained language models (PLMs) to extract sentence embeddings, while ignoring the rich information inherent in the heterogeneous types of interaction between words and sentences. In this paper, we are the first to propose an unsupervised extractive summarization method with heterogeneous graph embeddings (HGEs) for Chinese documents. A heterogeneous text graph is constructed to capture interactions at different granularities by incorporating graph structural information. Moreover, our proposed graph is general and flexible: additional nodes such as keywords can be easily integrated. Experimental results demonstrate that our method consistently outperforms the strong baseline on three summarization datasets.
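Representing a text as a graph in order to obtain automatic text summaries has been studied for more than a decade. With the development of attention mechanisms and Transformers in natural language processing (NLP), a connection can be drawn between the graph and attention structures of a text. In this paper, the attention matrix between the sentences of the whole text, which can be produced by a pre-trained language model, is used as the weighted adjacency matrix of a fully connected graph of the text. A GCN is then applied to the text graph model to classify each node and identify the salient sentences in the text. Experimental results on two typical datasets show that our proposed model achieves competitive results compared with existing models.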
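The idea of treating sentence-level attention as a weighted adjacency matrix and passing it through a graph convolution can be sketched as follows. This is a minimal illustration, not the paper's model: the function names are hypothetical, the attention values are toy numbers standing in for the output of a pre-trained language model, and row normalization with self-loops is an assumed choice (attention matrices are generally not symmetric).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(attention, features, weight):
    """One graph-convolution layer: ReLU(normalize(A + I) @ H @ W).

    attention: n x n non-negative sentence-to-sentence weights (assumed input).
    features:  n x d sentence feature matrix H.
    weight:    d x k layer weight matrix W (learned in practice, fixed here).
    """
    n = len(attention)
    # Add self-loops so each sentence keeps part of its own representation.
    a_hat = [[attention[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Row-normalize so each node averages over its weighted neighbours.
    a_norm = [[value / sum(row) for value in row] for row in a_hat]
    hidden = matmul(matmul(a_norm, features), weight)
    # ReLU non-linearity.
    return [[max(0.0, value) for value in row] for row in hidden]


# Toy example: two sentences that attend only to each other.
attention = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
weight = [[1.0]]
output = gcn_layer(attention, features, weight)
```

In a full model, a final layer would map each node's representation to a score for "include in summary", and the weights would be trained on labeled sentences.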
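The development of large and very large language models, such as GPT-3, T5, Switch Transformer, ERNIE, and others, has significantly improved the performance of text generation. One important research direction in this area is the generation of texts with arguments. Solutions to this problem can be used in business meetings, political debates, dialogue systems, and the preparation of student essays. One of the main domains for these applications is economics. The key problem for argumentative text generation in Russian is the lack of annotated argumentation corpora. In this paper, we use translated versions of the Argumentative Microtext, Persuasive Essays, and UKP Sentential corpora to fine-tune a RuBERT model. This model is then used to annotate a corpus of economic news with argumentation. The annotated corpus is in turn used to fine-tune a ruGPT-3 model, which generates argumentative texts. The results show that this approach improves the accuracy of argument generation by about 20 percentage points (63.2% vs. 42.5%) compared with the original ruGPT-3 model.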
Research on text summarization for low-resource Indian languages has been limited by the lack of relevant datasets. This paper presents a summary of various deep-learning approaches used for the ILSUM 2022 Indic language summarization datasets. The ILSUM 2022 dataset consists of news articles written in Indian English, Hindi, and Gujarati, together with their ground-truth summaries. In our work, we explore different pre-trained seq2seq models and fine-tune them on the ILSUM 2022 datasets. In our experiments, the fine-tuned PEGASUS model worked best for English, the fine-tuned IndicBART model with augmented data worked best for Hindi, and a fine-tuned PEGASUS model combined with a translation-mapping-based approach worked best for Gujarati. Our generated summaries were evaluated using ROUGE-1, ROUGE-2, and ROUGE-4.
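As a rough illustration of what ROUGE-N measures, the sketch below counts clipped n-gram overlap between a candidate summary and a single reference. It assumes lowercase whitespace tokenization with no stemming, so its numbers will not match an official ROUGE toolkit exactly; the function names are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) that occur in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return ROUGE-N precision, recall, and F1 for one candidate/reference pair."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped overlap).
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


unigram = rouge_n("the cat sat", "the cat sat down", n=1)
bigram = rouge_n("the cat sat", "the cat sat down", n=2)
fourgram = rouge_n("the cat sat", "the cat sat down", n=4)
```

Higher-order variants such as ROUGE-4 reward longer exact phrase matches, so a candidate shorter than four tokens scores zero on it.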
Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2019) represents the latest incarnation of pretrained language models which have recently advanced a wide range of natural language processing tasks. In this paper, we showcase how BERT can be usefully applied in text summarization and propose a general framework for both extractive and abstractive models. We introduce a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences. Our extractive model is built on top of this encoder by stacking several inter-sentence Transformer layers. For abstractive summarization, we propose a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two (the former is pretrained while the latter is not). We also demonstrate that a two-staged fine-tuning approach can further boost the quality of the generated summaries. Experiments on three datasets show that our model achieves state-of-the-art results across the board in both extractive and abstractive settings.
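Long documents such as academic articles and business reports have been the standard format for detailing important issues and complicated subjects that require extra attention. An automatic summarization system that can effectively condense long documents into short and concise texts encapsulating the most important information is therefore important in aiding readers' comprehension. Recently, with the advent of neural architectures, significant research effort has gone into advancing automatic text summarization systems, and numerous studies have addressed the challenges of extending these systems to the long-document domain. In this survey, we provide a comprehensive overview of research on long-document summarization and a systematic evaluation across the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. For each component, we organize the literature within the context of long-document summarization and conduct an empirical analysis to broaden the perspective on current research progress. The empirical analysis includes a study of the intrinsic characteristics of benchmark datasets, a multi-dimensional analysis of summarization models, and a review of summarization evaluation metrics. Based on the overall findings, we conclude by proposing possible directions for future exploration in this rapidly growing field.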
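Celebrity endorsement is one of the most significant strategies in brand communication. Nowadays, more and more companies try to build a vivid character for themselves, so their brand identity communication should accord with certain human characteristics and regulations. However, previous work mostly stops at assumptions rather than proposing a specific way to match brands and celebrities. In this paper, we propose a brand-celebrity matching model (BCM) based on natural language processing (NLP) techniques. Given a brand and a celebrity, we first obtain descriptive documents about them from the Internet, then summarize these documents, and finally compute the matching degree between the brand and the celebrity to determine whether they match. According to the experimental results, our proposed model outperforms the best baselines by 0.362 in F1 score and 6.3% in accuracy, which indicates the effectiveness and application value of our model in the real world. Moreover, to the best of our knowledge, the proposed BCM model is the first work to use NLP to address the endorsement problem, so it may offer novel research ideas and methodologies for subsequent work.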
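The lack of creativity in abstractive methods is a particular problem in automatic text summarization: the summaries produced by models are mostly extracted from the source articles. One of the main causes of this problem is the lack of datasets with abstractiveness, especially for Chinese. To address this, we paraphrase the reference summaries in CLTS, the Chinese Long Text Summarization dataset, correct errors of factual inconsistency, and propose the first Chinese long text summarization dataset with a high level of abstractiveness, CLTS+, which contains more than 180K article-summary pairs and is available online. Additionally, we introduce an intrinsic metric based on co-occurrence words to evaluate the dataset we constructed. We analyze the extraction strategies used in CLTS+ summaries against other datasets to quantify the abstractiveness and difficulty of our new data, and train several baselines on CLTS+ to verify its utility for improving the creativity of models.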
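Unsupervised summarization methods have achieved remarkable results by incorporating representations from pre-trained language models. However, existing methods fail to account for both efficiency and effectiveness when the input document is extremely long. To address this, we propose an efficient Coarse-to-Fine Facet-Aware Ranking (C2F-FAR) framework for unsupervised long-document summarization, based on semantic blocks. A semantic block is a run of consecutive sentences in the document that describe the same facet. Specifically, we address the problem by converting the one-step ranking method into a hierarchical, multi-granularity, two-stage ranking. In the coarse-level stage, we propose a new segmentation algorithm to split the document into related semantic blocks and then filter out insignificant blocks. In the fine-level stage, we select salient sentences within each block and then extract the final summary from the selected sentences. We evaluate our framework on four long-document summarization datasets: Gov-Report, BillSum, arXiv, and PubMed. C2F-FAR achieves new state-of-the-art unsupervised summarization results on Gov-Report and BillSum. In addition, our method is 4-28 times faster than previous methods.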
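The two-stage pipeline described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: Jaccard word overlap stands in for pre-trained sentence representations, a fixed similarity threshold stands in for the paper's segmentation algorithm, and the final "extract from the selected sentences" step is folded into the fine stage. All function names are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def split_into_blocks(sentences, threshold=0.2):
    """Group consecutive sentences; start a new block when adjacent ones differ."""
    blocks = []
    for idx, sentence in enumerate(sentences):
        if blocks and jaccard(sentences[idx - 1], sentence) >= threshold:
            blocks[-1].append(idx)
        else:
            blocks.append([idx])
    return blocks


def centrality(idx, pool, sentences):
    """Sum of similarities between sentence idx and the other sentences in pool."""
    return sum(jaccard(sentences[idx], sentences[j]) for j in pool if j != idx)


def coarse_to_fine(sentences, keep_blocks=2, per_block=1):
    """Coarse stage keeps the most central blocks; fine stage picks sentences."""
    blocks = split_into_blocks(sentences)
    everything = range(len(sentences))

    def block_score(block):
        # Mean document-level centrality of the block's sentences.
        return sum(centrality(i, everything, sentences) for i in block) / len(block)

    kept = sorted(blocks, key=block_score, reverse=True)[:keep_blocks]
    chosen = []
    for block in kept:
        ranked = sorted(block, key=lambda i: centrality(i, block, sentences),
                        reverse=True)
        chosen.extend(ranked[:per_block])
    # Restore document order for readability.
    return [sentences[i] for i in sorted(chosen)]


document = [
    "the budget bill raises taxes",
    "the budget bill cuts spending",
    "weather was sunny today",
    "the senate passed the budget bill",
]
summary = coarse_to_fine(document, keep_blocks=2, per_block=1)
```

Because the fine stage only compares sentences inside each surviving block, the number of pairwise similarity computations in that stage shrinks with block size, which is the intuition behind the efficiency claim; the sketch's coarse stage still compares every sentence pair, so it does not reproduce the reported speedup.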
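Multi-document summarization (MDS) is an effective tool for information aggregation that generates an informative and concise summary from a cluster of topic-related documents. Our survey is the first to systematically overview recent deep-learning-based MDS models. We propose a novel taxonomy to summarize the design strategies of neural networks and provide a comprehensive overview of the state of the art. We highlight the differences between various objective functions, which are rarely discussed in the existing literature. Finally, we propose several future directions pertaining to this new and exciting field.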
Though many algorithms can be used to automatically summarize legal case decisions, most fail to incorporate domain knowledge about how important sentences in a legal decision relate to a representation of its document structure. For example, analysis of a legal case summarization dataset demonstrates that sentences serving different types of argumentative roles in the decision appear in different sections of the document. In this work, we propose an unsupervised graph-based ranking model that uses a reweighting algorithm to exploit properties of the document structure of legal case decisions. We also explore the impact of using different methods to compute the document structure. Results on the Canadian Legal Case Law dataset show that our proposed method outperforms several strong baselines.
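Academic research is an exploratory activity aimed at solving problems that have never been solved before. By its nature, each piece of academic research requires a literature review to distinguish its novelties from what prior work has already addressed. In natural language processing, this literature review is usually conducted under the "Related Work" section. Given the rest of a research paper and a list of cited papers, the task of automatic related work generation aims to automatically generate the "Related Work" section. Although this task was proposed 10 years ago, only recently has it come to be regarded as a variant of the scientific multi-document summarization problem. Even today, however, the problems of automatic related work generation and citation text generation have not been standardized. In this survey, we conduct a meta-study to compare the existing literature on related work generation from the perspectives of problem formulation, dataset collection, methodological approach, performance evaluation, and future prospects, to give readers insight into the progress of state-of-the-art research and into how future studies may be conducted. We also survey relevant fields of study that we suggest future work consider integrating.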
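Text summarization models are often trained to produce summaries that meet human quality requirements. However, existing evaluation metrics for summary text are only rough proxies for summary quality: they correlate poorly with human scoring and inhibit summary diversity. To address these problems, we propose SummScore, a comprehensive metric for summary quality evaluation based on a cross-encoder. First, by adopting the original-summary measurement mode and comparing against the semantics of the original text, SummScore avoids inhibiting summary diversity. With the help of a text-matching pre-trained cross-encoder, SummScore can effectively capture subtle differences between the semantics of summaries. Second, to improve comprehensiveness and interpretability, SummScore consists of four fine-grained submodels, which measure coherence, consistency, fluency, and relevance respectively. We use semi-supervised multi-round training to improve the performance of our model on extremely limited annotated data. Extensive experiments show that SummScore significantly outperforms existing evaluation metrics on these four dimensions in terms of correlation with human scoring. We also provide SummScore quality evaluation results for 16 mainstream summarization models for later research.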
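We address the problem of unsupervised extractive document summarization, especially for long documents. We model the unsupervised problem as a sparse auto-regression problem and approximate the resulting combinatorial problem via a convex, norm-constrained problem. We solve it using a dedicated Frank-Wolfe algorithm. To generate a summary with $k$ sentences, the algorithm only needs to execute $\approx k$ iterations, making it very efficient. We explain how to avoid explicit calculation of the full gradient and how to include sentence embedding information. We evaluate our approach against two other unsupervised methods using both lexical (standard) ROUGE scores and semantic (embedding-based) ones. Our method achieves better results on both datasets and is especially effective when used with highly paraphrased summaries.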
In the past few decades, there has been an explosion in the amount of available data produced from various sources on different topics. The availability of this enormous volume of data requires effective computational tools to explore it, which has led to intense and growing interest in the research community in computational methods for processing text data. One line of study focuses on condensing text so that we can reach a higher level of understanding in a shorter time. The two important tasks here are keyword extraction and text summarization. In keyword extraction, we are interested in finding the most important words in a text, which familiarizes us with its general topic. In text summarization, we are interested in producing a short text that includes the important information in the document. The TextRank algorithm, an unsupervised learning method that extends PageRank (the base algorithm of the Google search engine for searching and ranking pages), has shown its efficacy in large-scale text mining, especially for text summarization and keyword extraction. This algorithm can automatically extract the important parts of a text (keywords or sentences) and return them as the result. However, it neglects the semantic similarity between the different parts. In this work, we improve the results of the TextRank algorithm by incorporating the semantic similarity between parts of the text. Aside from keyword extraction and text summarization, we develop a topic clustering algorithm based on our framework, which can be used on its own or as part of generating the summary to overcome coverage problems.
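For readers unfamiliar with TextRank, the following is a minimal sketch of its sentence-ranking form: sentences are graph nodes, edge weights are pairwise similarities, and scores come from a weighted PageRank iteration. The default similarity here is the word-overlap measure from the original TextRank formulation; the `similarity` parameter is a hypothetical hook showing where a semantic similarity function could be plugged in, and it is not the method proposed in the abstract above.

```python
import math


def overlap_similarity(a, b):
    """Shared-word count normalized by the log lengths of both sentences."""
    wa, wb = a.lower().split(), b.lower().split()
    # Guard: very short sentences would give a zero or undefined denominator.
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    shared = len(set(wa) & set(wb))
    return shared / (math.log(len(wa)) + math.log(len(wb)))


def textrank(sentences, similarity=overlap_similarity, damping=0.85, iterations=50):
    """Return one weighted-PageRank score per sentence."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives score from neighbours in proportion to edge weight.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=2, **kwargs):
    """Pick the k highest-scoring sentences and return them in document order."""
    scores = textrank(sentences, **kwargs)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "dogs bark loudly at night",
]
scores = textrank(sentences)
summary = summarize(sentences, k=2)
```

Swapping `overlap_similarity` for a function that compares sentence embeddings is one way to address the limitation noted above, since two sentences with no shared words would then still be connected when their meanings are close.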
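The limited size of existing query-focused summarization datasets makes it challenging to train data-driven summarization models. Meanwhile, manually constructing a query-focused summarization corpus is costly and time-consuming. In this paper, we use Wikipedia to automatically collect a large query-focused summarization dataset (named WikiRef) of more than 280,000 examples, which can serve as a means of data augmentation. We also develop a BERT-based query-focused summarization model (Q-BERT) to extract sentences from documents as summaries. To better adapt a huge model containing millions of parameters, we identify and fine-tune only a sparse subnetwork, which corresponds to a small fraction of the whole model's parameters. Experimental results on three DUC benchmarks show that the model pre-trained on WikiRef already achieves reasonable performance. After fine-tuning on the specific benchmark datasets, the model with data augmentation outperforms strong comparison systems. Moreover, both our proposed Q-BERT model and subnetwork fine-tuning further improve model performance. The dataset is publicly available at https://aka.ms/wikiref.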
In a citation graph, adjacent paper nodes share related scientific terms and topics. The graph thus conveys unique structural information about document-level relatedness that can be used in paper summarization to look beyond intra-document information. In this work, we focus on leveraging citation graphs to improve extractive summarization of scientific papers under different settings. We first propose a Multi-granularity Unsupervised Summarization model (MUS) as a simple and low-cost solution to the task. MUS fine-tunes a pre-trained encoder model on the citation graph through link prediction tasks. Then, the abstract sentences are extracted from the corresponding paper considering multi-granularity information. Preliminary results demonstrate that the citation graph is helpful even in a simple unsupervised framework. Motivated by this, we next propose a Graph-based Supervised Summarization model (GSS) to achieve more accurate results on the task when large-scale labeled data are available. Apart from employing link prediction as an auxiliary task, GSS introduces a gated sentence encoder and a graph information fusion module that use the graph information to polish the sentence representation. Experiments on a public benchmark dataset show that MUS and GSS bring substantial improvements over the prior state-of-the-art model.
Future work sentences (FWS) are the sentences in academic papers that contain the authors' description of their proposed follow-up research directions. This paper presents methods to automatically extract FWS from academic papers and classify them according to the different future directions embodied in the paper's content. FWS recognition methods will enable subsequent researchers to locate future work sentences more accurately and quickly and reduce the time and cost of acquiring the corpus. Existing work on automatic identification of future work sentences is relatively scarce, and existing research cannot accurately identify FWS from academic papers and thus cannot support data mining on a large scale. Furthermore, the content of future work has many aspects, and subdividing that content is conducive to the analysis of specific development directions. In this paper, Natural Language Processing (NLP) is used as a case study, and FWS are extracted from academic papers and classified into different types. We manually build an annotated corpus with six different types of FWS. Then, automatic recognition and classification of FWS are implemented using machine learning models, and the performance of these models is compared based on the evaluation metrics. The results show that the Bernoulli Bayesian model has the best performance in the automatic recognition task, with the Macro F1 reaching 90.73%, and the SCIBERT model has the best performance in the automatic classification task, with the weighted average F1 reaching 72.63%. Finally, we extract keywords from FWS to gain a deeper understanding of the key content they describe, and we also demonstrate that the content of FWS is reflected in subsequent research work by measuring the similarity between future work sentences and the abstracts.
Nowadays, floods of time-stamped web documents related to a general news query spread throughout the Internet, and timeline summarization aims to concisely summarize the evolution trajectory of events along the timeline. Unlike traditional document summarization, timeline summarization needs to model the time series information of the input events and summarize important events in chronological order. To tackle this challenge, in this paper, we propose a Unified Timeline Summarizer (UTS) that can generate abstractive and extractive timeline summaries in time order. Concretely, in the encoder part, we propose a graph-based event encoder that relates multiple events according to their content dependency and learns a global representation of each event. In the decoder part, to ensure the chronological order of the abstractive summary, we propose to extract the feature of event-level attention in its generation process with sequential information retained and use it to simulate the evolutionary attention of the ground truth summary. The event-level attention can also be used to assist in extracting the summary, where the extracted summary also comes in time sequence. We augment the previous Chinese large-scale timeline summarization dataset and collect a new English timeline dataset. Extensive experiments conducted on these datasets and on the out-of-domain Timeline 17 dataset show that UTS achieves state-of-the-art performance in terms of both automatic and human evaluations.