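In recent geospatial research, the importance of modeling large-scale human mobility data with self-supervised learning has grown in parallel with the progress in natural language processing driven by self-supervised methods using large corpora. While there are already many feasible approaches to geospatial sequence modeling itself, there appears to be room for improvement in evaluation, specifically in how to measure the similarity between generated and reference sequences. In this work, we propose a novel similarity measure, GEO-BLEU, which may be particularly useful in the context of geospatial sequence modeling and generation. As the name suggests, this work is based on BLEU, one of the most popular measures in machine translation research, while introducing spatial proximity to the idea of n-grams. We compare this measure with an established baseline, dynamic time warping, by applying it to actually generated geospatial sequences. Using crowdsourced annotation data on the similarity between geospatial sequences, collected from 12,000 cases, we show the advantages of the proposed method both quantitatively and qualitatively.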
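For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of textbook dynamic time warping over sequences of 2-D points. It is not the paper's implementation and it does not show GEO-BLEU itself; the function name `dtw_distance`, the Euclidean point distance, and the toy trajectories are all assumptions made for illustration.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of (x, y) points.

    Assumption: the local cost is the Euclidean distance between points.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical toy trajectories on a grid.
reference = [(0, 0), (1, 0), (2, 0)]
generated = [(0, 0), (1, 0), (1, 0), (2, 0)]  # same path, one repeated point
shifted = [(0, 1), (1, 1), (2, 1)]            # same shape, one cell away

print(dtw_distance(reference, generated))
print(dtw_distance(reference, shifted))
```

In this toy case the repeated point costs nothing, because warping lets one reference point align with two generated points, while the shifted path accumulates one unit of distance per step.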
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that cannot be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.
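A "trajectory" is the trace produced by a moving object in geographic space, usually represented as a chronologically ordered series of points, where each point consists of a set of geospatial coordinates and a timestamp. Rapid advances in location sensing and wireless communication technologies have made it possible to collect and store massive amounts of trajectory data. As a result, many researchers use trajectory data to analyze the mobility of various moving objects. In this work, we focus on "urban vehicle trajectories", that is, the trajectories of vehicles in urban traffic networks, and on "urban vehicle trajectory analysis". Urban vehicle trajectory analysis offers unprecedented opportunities to understand vehicle movement patterns in urban traffic networks, including both user-centric travel experiences and system-wide spatiotemporal patterns. The spatiotemporal features of urban vehicle trajectory data are structurally correlated, and many previous researchers have used various methods to understand this structure. In particular, deep learning models have drawn the attention of many researchers owing to their powerful function approximation and feature representation capabilities. The objective of this work is therefore to develop deep-learning-based models for urban vehicle trajectory analysis in order to better understand the mobility patterns of urban traffic networks. In particular, this work focuses on two research topics of high necessity, importance, and applicability: next location prediction and synthetic trajectory generation. In this study, we propose various novel deep learning models for urban vehicle trajectory analysis.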
The word alignment task, despite its prominence in the era of statistical machine translation (SMT), is niche and under-explored today. In this two-part tutorial, we argue for the continued relevance of word alignment. The first part provides a historical background to word alignment as a core component of the traditional SMT pipeline. We zero in on GIZA++, an unsupervised, statistical word aligner with surprising longevity. Jumping forward to the era of neural machine translation (NMT), we show how insights from word alignment inspired the attention mechanism fundamental to present-day NMT. The second part shifts to a survey approach. We cover neural word aligners, showing the slow but steady progress towards surpassing GIZA++ performance. Finally, we cover the present-day applications of word alignment, from cross-lingual annotation projection to improving translation.
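Image captioning is a current research task that describes the content of an image using the objects in the scene and their relationships. Tackling this task draws on two important research areas, artificial vision and natural language processing. In image captioning, as in any computational intelligence task, performance metrics are crucial for knowing how well (or badly) a method performs. In recent years, it has been observed that classical n-gram-based metrics are insufficient to capture the semantics and the critical meaning needed to describe the content of an image. To assess how well the current and more recent metrics perform, in this manuscript we evaluate several image captioning metrics and compare them with one another using the well-known COCO dataset. For this, we designed two scenarios: 1) a set of artificially built captions, and 2) a comparison of some state-of-the-art image captioning methods. We try to answer the following questions: Are the current metrics helping to produce high-quality captions? How do the actual metrics compare with each other? What are the metrics really measuring?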
We describe METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. We evaluate METEOR by measuring the correlation between the metric scores and human judgments of translation quality. We compute the Pearson R correlation value between its scores and human quality assessments of the LDC TIDES 2003 Arabic-to-English and Chinese-to-English datasets. We perform segment-by-segment correlation, and show that METEOR gets an R correlation value of 0.347 on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement on using simply unigram-precision, unigram-recall and their harmonic F1 combination. We also perform experiments to show the relative contributions of the various mapping modules.
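The combination of unigram precision, unigram recall, and fragmentation described in this abstract can be illustrated with a small sketch. It is a simplification under stated assumptions, not the released METEOR: only exact surface-form matches are used, the alignment is a greedy left-to-right one rather than the crossing-minimizing alignment of the real metric, and the constants (recall weighted 9:1 in the harmonic mean, penalty of 0.5 times the cube of chunks over matches) are recalled from the original formulation rather than taken from the abstract. The name `meteor_sketch` is hypothetical.

```python
def meteor_sketch(candidate, reference):
    """Simplified METEOR-style score: exact unigram matches only (assumption)."""
    cand, ref = candidate.split(), reference.split()
    used = [False] * len(ref)
    alignment = []  # (candidate index, reference index) pairs, in candidate order
    for i, tok in enumerate(cand):
        # Greedy simplification: take the first unused identical reference token.
        for j, ref_tok in enumerate(ref):
            if not used[j] and tok == ref_tok:
                used[j] = True
                alignment.append((i, j))
                break
    matches = len(alignment)
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    # Recall-weighted harmonic mean (assumed 9:1 weighting).
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # A chunk is a run of matches that is contiguous in both sentences.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if i1 != i0 + 1 or j1 != j0 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / matches) ** 3  # assumed fragmentation penalty
    return f_mean * (1 - penalty)


in_order = meteor_sketch("the cat sat", "the cat sat")
scrambled = meteor_sketch("sat cat the", "the cat sat")
print(in_order, scrambled)
```

Both candidates match every reference unigram, so precision and recall are identical; only the fragmentation term separates the well-ordered sentence from the scrambled one.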
Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. Due to recent progress in object detection, attribute classification, action recognition, etc., there is renewed interest in this area. However, evaluating the quality of descriptions has proven to be challenging. We propose a novel paradigm for evaluating image descriptions that uses human consensus. This paradigm consists of three main parts: a new triplet-based method of collecting human annotations to measure consensus, a new automated metric (CIDEr) that captures consensus, and two new datasets: PASCAL-50S and ABSTRACT-50S that contain 50 sentences describing each image. Our simple metric captures human judgment of consensus better than existing metrics across sentences generated by various sources. We also evaluate five state-of-the-art image description approaches using this new protocol and provide a benchmark for future comparisons. A version of CIDEr named CIDEr-D is available as a part of MS COCO evaluation server to enable systematic evaluation and benchmarking.
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S included in the ROUGE summarization evaluation package and their evaluations. Three of them have been used in the Document Understanding Conference (DUC) 2004, a large-scale summarization evaluation sponsored by NIST.
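As a rough illustration of the n-gram overlap counting this abstract describes, here is a minimal sketch of ROUGE-N recall against a single reference. It is not the ROUGE package: whitespace tokenization, the single-reference setup, and the function names `ngrams` and `rouge_n_recall` are assumptions for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that also occur in the candidate (clipped counts)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    return overlap / total


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
```

Dividing by the number of reference n-grams is what makes the measure recall-oriented; dividing by the candidate's n-gram count instead would give the corresponding precision.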
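Software developers embed logging statements in source code as an imperative duty in modern software development, since log files are necessary for tracking down runtime system issues and troubleshooting system management tasks. However, the current logging process is mostly manual, so the proper placement and content of log statements remain a challenge. To overcome these challenges, methods that aim to automate log placement and predict log content, i.e., "where and what to log", are of high interest. We therefore focus on predicting the location (i.e., where) and description (i.e., what) of log statements by leveraging source code clones and natural language processing (NLP), as these approaches provide additional context and advantages for log prediction. Specifically, we guide our research with three research questions (RQs): (RQ1) How can code snippets, i.e., code clones, be leveraged for log statement prediction? (RQ2) How can the approach be extended to automate the description of log statements? (RQ3) How effective are the proposed methods for log location and description prediction? To pursue our RQs, we perform an experimental study on seven open-source Java projects. We introduce an updated and improved log-aware code clone detection method to predict the location of logging statements (RQ1). We then incorporate natural language processing (NLP) and deep learning methods to automate the description prediction of log statements (RQ2). Our analysis shows that our hybrid NLP and code clone detection approach (NLP CC'd) outperforms conventional clone detectors in finding log statement locations on average, and achieves 40.86% higher performance on BLEU and ROUGE scores for predicting the description of logging statements compared to prior research (RQ3).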
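Since recommendation is essentially a process of comparison (or ranking), a good explanation should illustrate to users why one item is believed to be better than another, i.e., a comparative explanation about the recommended items. Ideally, after reading the explanations, a user should reach the same ranking of items as the system's. Unfortunately, little research attention has been paid to such comparative explanations. In this work, we develop an extract-and-refine architecture to explain the relative comparisons among a set of ranked items from a recommender system. For each recommended item, we first extract one sentence from its associated reviews that best suits the desired comparison against a set of reference items. This extracted sentence is then further articulated with respect to the target user through a generative model, to better explain why the item is recommended. We design a new explanation quality metric based on BLEU to guide the end-to-end training of the extraction and refinement components, which avoids generating generic content. Extensive offline evaluations on two large recommendation benchmark datasets and serious user studies against an array of state-of-the-art explainable recommendation algorithms demonstrate the necessity of comparative explanations and the effectiveness of our solution.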
A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.
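In recent years, researchers have created and introduced a large number of diverse code generation models. As human evaluation of every new model version is infeasible, the community has adopted automatic evaluation metrics such as BLEU to approximate the results of human judgement. These metrics originate from the machine translation domain, and it is unclear whether they are applicable to code generation tasks and how well they agree with human evaluation on this task. There are also two metrics, CodeBLEU and RUBY, that were developed to estimate the similarity of code and take code properties into account. However, for these metrics there are hardly any studies on their agreement with human evaluation. Despite all that, minimal differences in metric scores are used to claim the superiority of some code generation models over others. In this paper, we present a study on the applicability of six metrics (BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY) for the evaluation of code generation models. We conduct a study on two different code generation datasets and use human annotators to assess the quality of all models run on these datasets. The results indicate that for the CoNaLa dataset of Python one-liners, none of the metrics can correctly emulate human judgement on which model is better with $>95\%$ certainty if the difference in model scores is less than 5 points. For the HearthStone dataset, which consists of classes of a particular structure, a difference in model scores of at least 2 points is enough to claim the superiority of one model over the other. Using our findings, we derive several recommendations on using metrics to estimate model performance on the code generation task.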
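This work applies Minimum Bayes Risk (MBR) decoding to optimize diverse automated metrics of translation quality. Automatic metrics in machine translation have made tremendous progress recently. In particular, metrics fine-tuned on human ratings (such as BLEURT or COMET) outperform surface metrics in terms of correlation with human judgements. Our experiments show that the combination of a neural translation model with a neural reference-based metric, BLEURT, results in significant improvements in automatic and human evaluations. This improvement is obtained with translations that differ from classical beam-search output: these translations have lower likelihood and are less favored by surface metrics such as BLEU.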
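The general idea of MBR decoding, choosing the candidate that scores best on average against the other candidates treated as pseudo-references, can be sketched as follows. This is only an illustration of the selection rule: the abstract uses BLEURT as the utility, whereas the sketch swaps in a toy unigram F1 so that it runs without any model, and the names `unigram_f1` and `mbr_decode` as well as the sample sentences are hypothetical.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Toy utility standing in for a learned metric (assumption, not BLEURT)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates, utility=unigram_f1):
    """Return the candidate with the highest average utility against the others."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        # Every other candidate acts as a pseudo-reference for candidate i.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Hypothetical samples drawn from a translation model.
samples = [
    "the cat sits on the mat",
    "a cat sits on the mat",
    "the cat sat on a mat",
    "dogs bark loudly",
]
print(mbr_decode(samples))
```

The selected output is the one closest to the consensus of the sample set under the chosen utility, which need not be the sample the model considers most likely.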
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
We propose BERTSCORE, an automatic evaluation metric for text generation. Analogously to common metrics, BERTSCORE computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTSCORE correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTSCORE is more robust to challenging examples when compared to existing metrics.
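The token-to-token similarity matching outlined in this abstract can be illustrated with a small sketch of greedy cosine matching. It is not the BERTSCORE implementation or its library API: hand-written 2-D vectors stand in for contextual embeddings, all vectors are assumed non-zero, and the names `cosine` and `greedy_match_f1` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors (non-zero is an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_f1(cand_vecs, ref_vecs):
    """F1 over greedy best-match similarities between token vectors."""
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy stand-ins for contextual token embeddings (hypothetical values).
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
repetitive = [[1.0, 0.0], [1.0, 0.0]]  # candidate covering only one reference token

print(greedy_match_f1(ref_vecs, ref_vecs))
print(greedy_match_f1(repetitive, ref_vecs))
```

Because similarity is soft rather than exact, a paraphrase whose token vectors lie close to the reference's would still score highly, which is the property the abstract contrasts with exact-match metrics.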
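Standard automatic metrics such as BLEU are not reliable for document-level MT evaluation. They can neither distinguish document-level improvements in translation quality from sentence-level ones, nor identify the discourse phenomena that cause context-agnostic translations. This paper introduces a novel automatic metric, BlonDe, to widen the scope of automatic MT evaluation from the sentence to the document level. BlonDe takes discourse coherence into consideration by categorizing discourse-related spans and calculating a similarity-based F1 measure of the categorized spans. We conduct extensive comparisons on a newly constructed dataset, BWB. The experimental results show that BlonDe possesses better selectivity and interpretability at the document level and is more sensitive to document-level nuances. In a large-scale human study, BlonDe also achieves significantly higher Pearson correlation with human judgments than previous metrics.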
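Clinical notes are an efficient way to record patient information but are hard for non-experts to decipher. Automatically simplifying medical text can empower patients with valuable information about their health while saving clinicians time. We present a novel approach to automated simplification of medical text based on word frequencies and language modelling, grounded on medical ontologies enriched with layman terms. We release a new dataset of pairs of publicly available medical sentences and versions of them simplified by clinicians. In addition, we define a novel text simplification metric and evaluation framework, which we use to conduct a large-scale human evaluation of our method against the state of the art. Our method, based on a language model trained on medical forum data, generates simpler sentences while preserving both grammar and the original meaning, surpassing the current state of the art.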
An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches on the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout. Our ensemble model using different attention architectures yields a new state-of-the-art result in the WMT'15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.
The machine translation mechanism translates texts automatically between different natural languages, and Neural Machine Translation (NMT) has gained attention for its rational context analysis and fluent translation accuracy. However, processing low-resource languages that lack relevant training attributes like supervised data is a current challenge for Natural Language Processing (NLP). We incorporated a technique known as Active Learning with the NMT toolkit Joey NMT to reach sufficient accuracy and robust predictions of low-resource language translation. With active learning, a semi-supervised machine learning strategy, the training algorithm determines which unlabeled data would be the most beneficial for obtaining labels using selected query techniques. We implemented two model-driven acquisition functions for selecting the samples to be validated. This work uses transformer-based NMT systems: a baseline model (BM), a fully trained model (FTM), an active learning least confidence based model (ALLCM), and an active learning margin sampling based model (ALMSM), when translating English to Hindi. The Bilingual Evaluation Understudy (BLEU) metric has been used to evaluate system results. The BLEU scores of the BM, FTM, ALLCM and ALMSM systems are 16.26, 22.56, 24.54, and 24.20, respectively. The findings in this paper demonstrate that active learning techniques help the model to converge early and improve the overall quality of the translation system.
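The two acquisition strategies named in this abstract, least confidence and margin sampling, can be sketched in their generic textbook form. How the paper derives sentence-level scores from an NMT model is not stated in the abstract, so the sketch assumes each unlabeled sample already comes with a probability distribution over a few outcomes; the function names, the pool, and its values are hypothetical, and nothing here is Joey NMT's API.

```python
def least_confidence(probs):
    """Uncertainty as one minus the probability of the most likely outcome."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two most likely outcomes; a small gap means high uncertainty."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def select_for_labeling(pool, k, strategy="least_confidence"):
    """Pick the k most uncertain sample ids; pool maps id -> probability list."""
    if strategy == "least_confidence":
        ranked = sorted(pool, key=lambda s: least_confidence(pool[s]), reverse=True)
    else:
        ranked = sorted(pool, key=lambda s: margin(pool[s]))
    return ranked[:k]


# Hypothetical model outputs for three unlabeled samples.
pool = {
    "s1": [0.90, 0.05, 0.05],
    "s2": [0.40, 0.35, 0.25],
    "s3": [0.50, 0.48, 0.02],
}
print(select_for_labeling(pool, 1))
print(select_for_labeling(pool, 1, strategy="margin"))
```

The toy pool is chosen so that the two strategies disagree: least confidence prefers the sample whose top probability is lowest, while margin sampling prefers the sample whose top two outcomes are nearly tied.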
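I trained models for the neural machine translation task using the Hunglish2 corpus. The main contribution of this work is the evaluation of different data augmentation methods during the training of NMT models. I propose five different augmentation methods that are structure-aware, meaning that instead of randomly selecting words for blanking or replacement, the dependency tree of the sentence is used as the basis for augmentation. I begin with a detailed literature review on neural networks, sequential modeling, neural machine translation, dependency parsing, and data augmentation. After a detailed exploratory data analysis and preprocessing of the Hunglish2 corpus, I perform experiments with the proposed data augmentation techniques. The best model for Hungarian-English achieved a BLEU score of 33.9, while the best model for English-Hungarian achieved a BLEU score of 28.6.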