We present the Word Mover's Distance (WMD), a novel distance function between text documents. Our work is based on recent results in word embeddings that learn semantically meaningful representations for words from local co-occurrences in sentences. The WMD measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of another document. We show that this distance metric can be cast as an instance of the Earth Mover's Distance, a well-studied transportation problem for which several highly efficient solvers have been developed. Our metric has no hyperparameters and is straightforward to implement. Further, we demonstrate on eight real-world document classification data sets, in comparison with seven state-of-the-art baselines, that the WMD metric leads to unprecedentedly low k-nearest neighbor document classification error rates.
[Figure: word2vec embedding of the content words of two documents, "Obama speaks to the media in Illinois" and "The President greets the press in Chicago".]
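The "travel" idea can be illustrated with a minimal sketch. It is not the authors' implementation and it does not use an Earth Mover's Distance solver. It assumes a special case: both documents contain the same number of distinct words, each with equal weight, so that an optimal transport plan is a one-to-one matching that can be found by brute force. The 2-D vectors and the function name `wmd_equal_uniform` are hypothetical and chosen only for illustration; real word2vec vectors have hundreds of dimensions.

```python
from itertools import permutations
from math import dist

def wmd_equal_uniform(doc_a, doc_b, emb):
    """Transport cost between two documents with the same number of
    distinct, equally weighted words (an assumed special case).
    Tries every one-to-one matching and keeps the cheapest one."""
    if len(doc_a) != len(doc_b):
        raise ValueError("this sketch only handles equal-length documents")
    n = len(doc_a)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Each word carries mass 1/n, moved to its matched partner.
        cost = sum(dist(emb[doc_a[i]], emb[doc_b[j]])
                   for i, j in enumerate(perm)) / n
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    matching = [(doc_a[i], doc_b[j]) for i, j in enumerate(best_perm)]
    return best_cost, matching

# Hypothetical 2-D embeddings (made up for this example).
emb = {
    "obama": (0, 0), "president": (0, 1),
    "speaks": (5, 0), "greets": (5, 1),
    "media": (0, 5), "press": (1, 5),
    "illinois": (5, 5), "chicago": (5, 4),
}
doc1 = ["obama", "speaks", "media", "illinois"]
doc2 = ["president", "greets", "press", "chicago"]
cost, matching = wmd_equal_uniform(doc1, doc2, emb)
print(cost, matching)
```

With these toy vectors each word is matched to its closest counterpart (for example "obama" to "president"). For documents of different lengths or with non-uniform word weights, a general transportation solver would be needed instead of the brute-force matching.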
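The Word Mover's Distance (WMD) is a fundamental technique for measuring the similarity of two documents. The key to WMD is that it can exploit the underlying geometry of the word space by employing an optimal transport formulation. The original study on WMD reported that WMD outperforms classical baselines such as bag-of-words (BOW) and TF-IDF by large margins on various datasets. In this paper, we point out that the evaluation in the original study could be misleading. We re-evaluate the performance of WMD and the classical baselines and find that the classical baselines are competitive with WMD if we employ appropriate preprocessing, i.e., L1 normalization. In addition, we introduce an analogy between WMD and L1-normalized BOW and find that not only the performance of WMD but also its distance values resemble those of BOW in high-dimensional spaces.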
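As a small illustration of why L1 normalization can matter for a bag-of-words baseline, the sketch below compares the L1 distance between raw count vectors and between L1-normalized vectors. It is an assumption-laden toy example, not the paper's experimental setup; the helper names `bow`, `l1_normalize`, and `l1_distance` are hypothetical.

```python
from collections import Counter

def bow(text):
    """Bag-of-words counts for a whitespace-tokenized, lowercased text."""
    return Counter(text.lower().split())

def l1_normalize(counts):
    """Divide every count by the total so the entries sum to 1."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def l1_distance(u, v):
    """Sum of absolute differences over the union of the vocabularies."""
    return sum(abs(u.get(w, 0) - v.get(w, 0)) for w in set(u) | set(v))

short_doc = "obama speaks"
doubled_doc = "obama speaks obama speaks"

raw = l1_distance(bow(short_doc), bow(doubled_doc))
normalized = l1_distance(l1_normalize(bow(short_doc)),
                         l1_normalize(bow(doubled_doc)))
print(raw, normalized)
```

Here the two documents have the same word proportions, so the normalized distance is zero while the raw-count distance only reflects the difference in length.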
Natural Language Understanding has seen an increasing number of publications in the last few years, especially after robust word embedding models became prominent, when they proved themselves able to capture and represent semantic relationships from massive amounts of data. Nevertheless, traditional models often fall short on intrinsic issues of linguistics, such as polysemy and homonymy. Any expert system that makes use of natural language at its core can be affected by a weak semantic representation of text, resulting in inaccurate outcomes based on poor decisions. To mitigate such issues, we propose a novel approach called Most Suitable Sense Annotation (MSSA), which disambiguates and annotates each word by its specific sense, considering the semantic effects of its context. Our approach brings three main contributions to the semantic representation scenario: (i) an unsupervised technique that disambiguates and annotates words by their senses, (ii) a multi-sense embeddings model that can be extended to any traditional word embeddings algorithm, and (iii) a recurrent methodology that allows our models to be re-used and their representations refined. We test our approach on six different benchmarks for the word similarity task, showing that it can produce state-of-the-art results and outperforms several more complex state-of-the-art systems.
The relationship between words in a sentence often tells us more about the underlying semantic content of a document than its individual words do. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II. These algorithms combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings as building blocks forming a single system. In short, our approach has three main contributions: (i) a set of techniques that fully integrate word embeddings and lexical chains; (ii) a more robust semantic representation that considers the latent relation between words in a document; and (iii) lightweight word embeddings models that can be extended to any natural language task. We intend to assess the knowledge of pre-trained models to evaluate their robustness in the document classification task. The proposed techniques are tested against seven word embeddings algorithms using five different machine learning classifiers over six scenarios in the document classification task. Our results show that the integration of lexical chains and word embedding representations sustains state-of-the-art results, even against more complex systems.
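Extracting knowledge from unlabeled text using machine learning algorithms can be complex. Document classification and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues. Initialization can lead to variability, depending on the machine learning algorithm. Furthermore, distortions can be misleading with regard to cluster geometry. Among the causes, the presence of outliers and anomalies can be a determining factor. Although initialization and outlier issues are relevant for text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology, since similar procedures go by different terms. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, factorization, and clustering algorithms that are directly or indirectly related to the reviewed works.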
The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner.
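The equivalence between a Mahalanobis metric and a linear transformation followed by Euclidean distance can be checked numerically. The sketch below is only an illustration with a hand-picked matrix `L` (hypothetical, not a learned one): it builds M = LᵀL and compares the squared Mahalanobis distance (x - y)ᵀ M (x - y) with the squared Euclidean distance between Lx and Ly. Learning `L` from labeled examples, as the abstract describes, is not shown.

```python
def matvec(A, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Matrix-matrix product for list-of-lists matrices."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    d = [a - b for a, b in zip(x, y)]
    return sum(di * mi for di, mi in zip(d, matvec(M, d)))

def euclidean_sq_after_map(x, y, L):
    """Squared Euclidean distance after mapping both points through L."""
    Lx, Ly = matvec(L, x), matvec(L, y)
    return sum((a - b) ** 2 for a, b in zip(Lx, Ly))

L = [[2, 0], [1, 1]]          # hypothetical linear transformation
M = matmul(transpose(L), L)   # induced Mahalanobis matrix M = L^T L
x, y = [1, 2], [0, 0]

d_mahalanobis = mahalanobis_sq(x, y, M)
d_euclidean = euclidean_sq_after_map(x, y, L)
print(M, d_mahalanobis, d_euclidean)
```

Both quantities agree, which is why kNN with the Mahalanobis metric can be run as ordinary Euclidean kNN on the transformed inputs.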
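We study algorithms for approximating pairwise similarity matrices that arise in natural language processing. In general, computing the similarity matrix for $n$ data points requires $\Omega(n^2)$ similarity computations. This quadratic scaling is a significant bottleneck, especially when similarities are computed by expensive functions, e.g., by transformer models. Approximation methods reduce this quadratic complexity by using a small subset of exactly computed similarities to approximate the remainder of the full pairwise similarity matrix. A large body of work focuses on the efficient approximation of positive semidefinite (PSD) similarity matrices, which arise, e.g., in kernel methods. However, much less is understood about indefinite (non-PSD) similarity matrices, which often arise in NLP. Motivated by the observation that many of these matrices are still somewhat close to PSD, we introduce a generalization of the popular Nyström method to the indefinite setting. Our algorithm can be applied to any similarity matrix and runs in time sublinear in the size of the matrix, producing a rank-$s$ approximation using only $O(ns)$ similarity computations. We show that our method, along with a simple variant of CUR decomposition, performs very well in approximating a variety of similarity matrices arising in NLP tasks. We demonstrate the high accuracy of the approximated similarity matrices in the downstream tasks of document classification, sentence similarity, and cross-document coreference.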
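For orientation, here is a minimal sketch of the classic Nyström approximation for a PSD similarity matrix. It is not the paper's generalization to indefinite matrices. It evaluates the similarity function only for pairs (i, landmark), so it uses n * s evaluations for s landmarks. The function name `nystrom_approx`, the toy points, and the choice of landmarks are assumptions made for this example; it requires NumPy.

```python
import numpy as np

def nystrom_approx(sim, n, landmarks):
    """Classic Nystrom approximation of the n x n matrix K[i, j] = sim(i, j),
    built from the columns at `landmarks` only."""
    # C holds the s sampled columns of K; shape (n, s).
    C = np.array([[sim(i, j) for j in landmarks] for i in range(n)],
                 dtype=float)
    # W is the s x s block of K restricted to the landmark rows.
    W = C[landmarks, :]
    # K is approximated by C W^+ C^T, with W^+ the pseudo-inverse of W.
    return C @ np.linalg.pinv(W) @ C.T

# Toy data: dot-product similarity of 2-D points, so K = X X^T has rank 2.
X = np.array([[1, 0], [0, 1], [1, 1], [2, 1], [1, 3]], dtype=float)
calls = {"n": 0}

def sim(i, j):
    calls["n"] += 1          # count similarity evaluations
    return float(X[i] @ X[j])

K_hat = nystrom_approx(sim, n=len(X), landmarks=[0, 1])
print(calls["n"], np.allclose(K_hat, X @ X.T))
```

In this toy case the matrix has rank 2 and the two landmarks span it, so the approximation is exact. For real similarity matrices the quality depends on how the landmarks are chosen and on how close the matrix is to low rank and PSD.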
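Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can choose words for their query that might appear in the documents they wish to see, and (2) that ranking the retrieved documents will suffice because the searcher will be able to recognize those they wished to find. When the documents to be searched are in a language unknown to the searcher, neither assumption holds. In such cases, Cross-Language Information Retrieval (CLIR) is needed. This chapter reviews the state of the art in cross-language information retrieval and outlines some open research questions.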
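The scientific world is changing rapidly, with new technologies being developed and new trends emerging at an increasing frequency. This paper presents a framework for conducting scientific analyses of academic publications, which is crucial for monitoring research trends and identifying potential innovations. The framework adopts and combines various natural language processing techniques, such as word embedding and topic modeling. Word embedding is used to capture the semantic meanings of domain-specific words. We propose two novel scientific publication embeddings, PUB-G and PUB-W, which are capable of learning the semantic meanings of general as well as domain-specific words in various research fields. Thereafter, topic modeling is used to identify clusters of research topics within these larger research fields. We curated a publication dataset consisting of two conferences and two journals from 1995 to 2020 from two research domains. Experimental results show that, based on topic coherence, our PUB-G and PUB-W embeddings are superior to other baseline embeddings.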
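Measuring the semantic similarity of different texts has many important applications in Digital Humanities research, such as information retrieval, document clustering, and text summarization. The performance of different methods depends on the length of the text, the domain, and the language. This study focuses on experimenting with some of the current approaches on Finnish, which is a morphologically rich language. At the same time, we propose a simple method, TFW2V, which shows high efficiency in handling both long text documents and limited amounts of data. Furthermore, we design an objective evaluation method that can be used as a framework for benchmarking text similarity approaches.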
Latent semantic models, such as LSA, intend to map a query to its relevant documents at the semantic level where keyword-based matching often fails. In this study we strive to develop a series of new latent semantic models with a deep structure that project queries and documents into a common low-dimensional space where the relevance of a document given a query is readily computed as the distance between them. The proposed deep structured semantic models are discriminatively trained by maximizing the conditional likelihood of the clicked documents given a query using the clickthrough data. To make our models applicable to large-scale Web search applications, we also use a technique called word hashing, which is shown to effectively scale up our semantic models to handle the large vocabularies that are common in such tasks. The new models are evaluated on a Web document ranking task using a real-world data set. Results show that our best model significantly outperforms other latent semantic models, which were considered state of the art prior to the work presented in this paper.
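Patent data is an important source of knowledge for innovation research, and the technological similarity between pairs of patents is a key indicator for patent analysis. Recently, researchers have been using patent vector space models based on different NLP embedding models to calculate the technological similarity between pairs of patents, to help better understand innovation, patent landscaping, technology mapping, and patent quality evaluation. To the best of our knowledge, there is no comprehensive survey that builds a big picture of the performance of embedding models for calculating patent similarity indicators. Therefore, in this study, we provide an overview of the accuracy of these algorithms based on patent classification performance. In a detailed discussion, we report the performance of the top 3 algorithms at the section, class, and subclass levels. The results based on the first claim of patents show that PatentBERT, Bert-for-patents, and TF-IDF weighted word embeddings have the best accuracy for computing sentence embeddings at the subclass level. According to these first results, the performance of the models varies across classes, which shows that researchers in patent analysis can use the results of this study to choose the most appropriate model based on the specific section of patent data they use.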
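Unsupervised sentiment analysis is traditionally performed by counting the words in a text that are stored in a sentiment lexicon and then assigning a label depending on the proportion of positive and negative words registered. While these "counting" methods are considered beneficial because they score a text deterministically, their classification rates decrease when the analyzed texts are short or when the vocabulary differs from what the lexicon considers default. The model proposed in this paper, called Lex2Sent, is an unsupervised sentiment analysis method that improves the classification of sentiment lexicon methods. For this purpose, a Doc2Vec model is trained to determine the distances between document embeddings and the embeddings of the positive and negative parts of a sentiment lexicon. These distances are then evaluated over multiple executions of Doc2Vec on resampled documents and averaged to perform the classification task. For the three benchmark datasets considered in this paper, the proposed Lex2Sent outperforms every evaluated lexicon, including state-of-the-art lexica such as VADER or the Opinion Lexicon, in terms of classification rate.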
Deep Learning and Machine Learning based models have become extremely popular in text processing and information retrieval. However, the non-linear structures present inside the networks make these models largely inscrutable. A significant body of research has focused on increasing the transparency of these models. This article provides a broad overview of research on the explainability and interpretability of natural language processing and information retrieval methods. More specifically, we survey approaches that have been applied to explain word embeddings, sequence modeling, attention modules, transformers, BERT, and document ranking. The concluding section suggests some possible directions for future research on this topic.
Unsupervised vector-based approaches to semantics can model rich lexical meanings, but they largely fail to capture sentiment information that is central to many word meanings and important for a wide range of NLP tasks. We present a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term-document information as well as rich sentiment content. The proposed model can leverage both continuous and multi-dimensional sentiment information as well as non-sentiment annotations. We instantiate the model to utilize the document-level sentiment polarity annotations present in many online documents (e.g. star ratings). We evaluate the model using small, widely used sentiment and subjectivity corpora and find it outperforms several previously introduced methods for sentiment classification. We also introduce a large dataset of movie reviews to serve as a more robust benchmark for work in this area.
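Word Mover's Distance (WMD) computes the distance between words and models text similarity as the cost of moving words between two text sequences. However, it does not perform well in sentence similarity evaluation, since it does not incorporate word importance and fails to take the inherent contextual and structural information of a sentence into account. In this work, an improved WMD method that uses the syntactic parse tree, called Syntax-aware Word Mover's Distance (SynWMD), is proposed to address these two shortcomings. First, a weighted graph is built from the word co-occurrence statistics extracted from the syntactic parse trees of sentences. The importance of each word is inferred from graph connectivity. Second, the local syntactic parse structure of words is considered when computing the distance between words. To demonstrate the effectiveness of the proposed SynWMD, we conduct experiments on 6 semantic textual similarity (STS) datasets and 4 sentence classification datasets. Experimental results show that SynWMD achieves state-of-the-art performance on STS tasks. It also outperforms other WMD-based methods on sentence classification tasks.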
Few-shot methods for accurate modeling under sparse-label settings have improved significantly. However, the applications of few-shot modeling in natural language processing remain solely in the field of document classification. With recent performance improvements, supervised few-shot methods, combined with a simple topic extraction method, pose a significant challenge to unsupervised topic modeling methods. Our research shows that supervised few-shot learning, combined with a simple topic extraction method, can outperform unsupervised topic modeling techniques in terms of generating coherent topics, even when only a few labeled documents per class are used.
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
Current state-of-the-art approaches to text classification typically leverage BERT-style Transformer models with a softmax classifier, jointly fine-tuned to predict class labels of a target task. In this paper, we instead propose an alternative training objective in which we learn task-specific embeddings of text: our proposed objective learns embeddings such that all texts that share the same target class label should be close together in the embedding space, while all others should be far apart. This allows us to replace the softmax classifier with a more interpretable k-nearest-neighbor classification approach. In a series of experiments, we show that this yields a number of interesting benefits: (1) The resulting order induced by distances in the embedding space can be used to directly explain classification decisions. (2) This facilitates qualitative inspection of the training data, helping us to better understand the problem space and identify labelling quality issues. (3) The learned distances to some degree generalize to unseen classes, allowing us to incrementally add new classes without retraining the model. We present extensive experiments which show that the benefits of ante-hoc explainability and incremental learning come at no cost in overall classification accuracy, thus pointing to practical applicability of our proposed approach.
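The way a k-nearest-neighbor classifier can double as an explanation can be sketched as follows. This is only an illustration under assumptions: the 2-D vectors stand in for learned text embeddings, the texts and labels are invented, and the function name `knn_explain` is hypothetical. The training objective that produces the embeddings is not shown.

```python
from collections import Counter
from math import dist

def knn_explain(query, train, k=3):
    """Return (predicted_label, neighbours): the majority label among the
    k training items closest to `query`, plus those items as evidence.
    Each training item is a (text, vector, label) tuple."""
    ranked = sorted(train, key=lambda item: dist(query, item[1]))
    neighbours = ranked[:k]
    votes = Counter(label for _, _, label in neighbours)
    return votes.most_common(1)[0][0], neighbours

# Hypothetical embedded training texts (vectors made up for illustration).
train = [
    ("great movie", (0.9, 0.8), "pos"),
    ("loved it", (1.0, 1.0), "pos"),
    ("wonderful", (0.8, 1.1), "pos"),
    ("terrible plot", (-1.0, -0.9), "neg"),
    ("boring", (-0.8, -1.1), "neg"),
]

label, neighbours = knn_explain((0.7, 0.9), train, k=3)
print(label)
for text, _, neighbour_label in neighbours:
    print("  supported by:", text, "->", neighbour_label)
```

The returned neighbours are the training texts that determined the prediction, which is what makes the decision directly inspectable. Adding a new class in this setup only requires adding embedded examples of it to `train`.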