The emergence of pre-trained language models (PLMs) has shown great success in many Natural Language Processing (NLP) tasks including text classification. Due to the minimal to no feature engineering required when using these models, PLMs are becoming the de facto choice for any NLP task. However, for domain-specific corpora (e.g., financial, legal, and industrial), fine-tuning a pre-trained model for a specific task has shown to provide a performance improvement. In this paper, we compare the performance of four different PLMs on three public domain-free datasets and a real-world dataset containing domain-specific words, against a simple SVM linear classifier with TFIDF vectorized text. The experimental results on the four datasets show that using PLMs, even fine-tuned, do not provide significant gain over the linear SVM classifier. Hence, we recommend that for text classification tasks, traditional SVM along with careful feature engineering can pro-vide a cheaper and superior performance than PLMs.
translated by 谷歌翻译
Understanding customer feedback is becoming a necessity for companies to identify problems and improve their products and services. Text classification and sentiment analysis can play a major role in analyzing this data by using a variety of machine and deep learning approaches. In this work, different transformer-based models are utilized to explore how efficient these models are when working with a German customer feedback dataset. In addition, these pre-trained models are further analyzed to determine if adapting them to a specific domain using unlabeled data can yield better results than off-the-shelf pre-trained models. To evaluate the models, two downstream tasks from the GermEval 2017 are considered. The experimental results show that transformer-based models can reach significant improvements compared to a fastText baseline and outperform the published scores and previous models. For the subtask Relevance Classification, the best models achieve a micro-averaged $F1$-Score of 96.1 % on the first test set and 95.9 % on the second one, and a score of 85.1 % and 85.3 % for the subtask Polarity Classification.
translated by 谷歌翻译
这项研究提供了对僧伽罗文本分类的预训练语言模型的性能的首次全面分析。我们测试了一组不同的Sinhala文本分类任务,我们的分析表明,在包括Sinhala(XLM-R,Labse和Laser)的预训练的多语言模型中,XLM-R是迄今为止Sinhala文本的最佳模型分类。我们还预先培训了两种基于罗伯塔的单语僧伽罗模型,它们远远优于僧伽罗的现有预训练的语言模型。我们表明,在微调时,这些预训练的语言模型为僧伽罗文本分类树立了非常强大的基线,并且在标记数据不足以进行微调的情况下非常强大。我们进一步提供了一组建议,用于使用预训练的模型进行Sinhala文本分类。我们还介绍了新的注释数据集,可用于僧伽罗文本分类的未来研究,并公开发布我们的预培训模型。
translated by 谷歌翻译
Covid-19已遍布全球,已经开发了几种疫苗来应对其激增。为了确定与社交媒体帖子中与疫苗相关的正确情感,我们在与Covid-19疫苗相关的推文上微调了各种最新的预训练的变压器模型。具体而言,我们使用最近引入的最先进的预训练的变压器模型Roberta,XLNet和Bert,以及在CoVID-19的推文中预先训练的域特异性变压器模型CT-Bert和Bertweet。我们通过使用基于语言模型的过采样技术(LMOTE)过采样来进一步探索文本扩展的选项,以改善这些模型的准确性,特别是对于小样本数据集,在正面,负面和中性情感类别之间存在不平衡的类别分布。我们的结果总结了我们关于用于微调最先进的预训练的变压器模型的不平衡小样本数据集的文本过采样的适用性,以及针对分类任务的域特异性变压器模型的实用性。
translated by 谷歌翻译
NLP是与计算机或机器理解和解释人类语言的能力有关的人工智能和机器学习的一种形式。语言模型在文本分析和NLP中至关重要,因为它们允许计算机解释定性输入并将其转换为可以在其他任务中使用的定量数据。从本质上讲,在转移学习的背景下,语言模型通常在大型通用语料库上进行培训,称为预训练阶段,然后对特定的基本任务进行微调。结果,预训练的语言模型主要用作基线模型,该模型包含了对上下文的广泛掌握,并且可以进一步定制以在新的NLP任务中使用。大多数预训练的模型都经过来自Twitter,Newswire,Wikipedia和Web等通用领域的Corpora培训。在一般文本中训练的现成的NLP模型可能在专业领域效率低下且不准确。在本文中,我们提出了一个名为Securebert的网络安全语言模型,该模型能够捕获网络安全域中的文本含义,因此可以进一步用于自动化,用于许多重要的网络安全任务,否则这些任务将依靠人类的专业知识和繁琐的手动努力。 Securebert受到了我们从网络安全和一般计算域的各种来源收集和预处理的大量网络安全文本培训。使用我们提出的令牌化和模型权重调整的方法,Securebert不仅能够保留对一般英语的理解,因为大多数预训练的语言模型都可以做到,而且在应用于具有网络安全含义的文本时也有效。
translated by 谷歌翻译
Many prior language modeling efforts have shown that pre-training on an in-domain corpus can significantly improve performance on downstream domain-specific NLP tasks. However, the difficulties associated with collecting enough in-domain data might discourage researchers from approaching this pre-training task. In this paper, we conducted a series of experiments by pre-training Bidirectional Encoder Representations from Transformers (BERT) with different sizes of biomedical corpora. The results demonstrate that pre-training on a relatively small amount of in-domain data (4GB) with limited training steps, can lead to better performance on downstream domain-specific NLP tasks compared with fine-tuning models pre-trained on general corpora.
translated by 谷歌翻译
在网络和社交媒体上生成的大量数据增加了检测在线仇恨言论的需求。检测仇恨言论将减少它们对他人的负面影响和影响。在自然语言处理(NLP)域中的许多努力旨在宣传仇恨言论或检测特定的仇恨言论,如宗教,种族,性别或性取向。讨厌的社区倾向于使用缩写,故意拼写错误和他们的沟通中的编码词来逃避检测,增加了讨厌语音检测任务的更多挑战。因此,词表示将在检测仇恨言论中发挥越来越关的作用。本文研究了利用基于双向LSTM的深度模型中嵌入的域特定词语的可行性,以自动检测/分类仇恨语音。此外,我们调查转移学习语言模型(BERT)对仇恨语音问题作为二进制分类任务。实验表明,与双向LSTM基于LSTM的深层模型嵌入的域特异性词嵌入了93%的F1分数,而BERT在可用仇恨语音数据集中的组合平衡数据集上达到了高达96%的F1分数。
translated by 谷歌翻译
Laws and their interpretations, legal arguments and agreements\ are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeavors. Their usefulness, however, largely depends on whether current state-of-the-art models can generalize across various tasks in the legal domain. To answer this currently open question, we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way. We also provide an evaluation and analysis of several generic and legal-oriented models demonstrating that the latter consistently offer performance improvements across multiple tasks.
translated by 谷歌翻译
专利数据是创新研究知识的重要来源。尽管专利对之间的技术相似性是用于专利分析的关键指标。最近,研究人员一直在使用基于不同NLP嵌入模型的专利矢量空间模型来计算专利对之间的技术相似性,以帮助更好地了解创新,专利景观,技术映射和专利质量评估。据我们所知,没有一项全面的调查来建立嵌入模型的性能以计算专利相似性指标的大图。因此,在这项研究中,我们根据专利分类性能概述了这些算法的准确性。在详细的讨论中,我们报告了部分,类和子类级别的前3个算法的性能。基于专利的第一个主张的结果表明,专利,贝特(Bert-For)和tf-idf加权单词嵌入具有最佳准确性,可以在亚类级别计算句子嵌入。根据第一个结果,不同类别中模型的性能各不相同,这表明专利分析中的研究人员可以利用本研究的结果根据他们使用的专利数据的特定部分选择最佳的适当模型。
translated by 谷歌翻译
Pre-trained Transformers currently dominate most NLP tasks. They impose, however, limits on the maximum input length (512 sub-words in BERT), which are too restrictive in the legal domain. Even sparse-attention models, such as Longformer and BigBird, which increase the maximum input length to 4,096 sub-words, severely truncate texts in three of the six datasets of LexGLUE. Simpler linear classifiers with TF-IDF features can handle texts of any length, require far less resources to train and deploy, but are usually outperformed by pre-trained Transformers. We explore two directions to cope with long legal texts: (i) modifying a Longformer warm-started from LegalBERT to handle even longer texts (up to 8,192 sub-words), and (ii) modifying LegalBERT to use TF-IDF representations. The first approach is the best in terms of performance, surpassing a hierarchical version of LegalBERT, which was the previous state of the art in LexGLUE. The second approach leads to computationally more efficient models at the expense of lower performance, but the resulting models still outperform overall a linear SVM with TF-IDF features in long legal document classification.
translated by 谷歌翻译
我们介绍了一种新的金融语言代表模型,称为财务嵌入性嵌入分析(Fineas)。在金融市场,新闻和投资者情绪是安全价格的重要驱动力。因此,利用现代NLP的财务情感分析方法的能力是识别可用于市场参与者和监管机构的模式和趋势的重要组成部分。近年来,使用从BERT等大型变压器的语言模型使用转移学习的方法已经实现了文本分类任务的最先进的结果,包括使用标记数据集的情感分析。研究人员迅速采用了这些方法的财务文本,但该领域的最佳实践不是很好的。在这项工作中,我们提出了一种基于标准BERT模型的监督微调句子嵌入的金融情绪分析的新模式。我们展示了我们的方法与Vanilla Bert,LSTM和Finbert,一项金融领域特定的伯爵相比实现了显着的改进。
translated by 谷歌翻译
The field of cybersecurity is evolving fast. Experts need to be informed about past, current and - in the best case - upcoming threats, because attacks are becoming more advanced, targets bigger and systems more complex. As this cannot be addressed manually, cybersecurity experts need to rely on machine learning techniques. In the texutual domain, pre-trained language models like BERT have shown to be helpful, by providing a good baseline for further fine-tuning. However, due to the domain-knowledge and many technical terms in cybersecurity general language models might miss the gist of textual information, hence doing more harm than good. For this reason, we create a high-quality dataset and present a language model specifically tailored to the cybersecurity domain, which can serve as a basic building block for cybersecurity systems that deal with natural language. The model is compared with other models based on 15 different domain-dependent extrinsic and intrinsic tasks as well as general tasks from the SuperGLUE benchmark. On the one hand, the results of the intrinsic tasks show that our model improves the internal representation space of words compared to the other models. On the other hand, the extrinsic, domain-dependent tasks, consisting of sequence tagging and classification, show that the model is best in specific application scenarios, in contrast to the others. Furthermore, we show that our approach against catastrophic forgetting works, as the model is able to retrieve the previously trained domain-independent knowledge. The used dataset and trained model are made publicly available
translated by 谷歌翻译
非结构化数据,尤其是文本,在各个领域继续迅速增长。特别是,在金融领域,有大量累积的非结构化财务数据,例如公司定期向监管机构提交的文本披露文件,例如证券和交易委员会(SEC)。这些文档通常很长,并且倾向于包含有关公司绩效的宝贵信息。因此,从这些长文本文档中学习预测模型是非常兴趣的,尤其是用于预测数值关键绩效指标(KPI)。尽管在训练有素的语言模型(LMS)中取得了长足的进步,这些模型从大量的文本数据中学习,但他们仍然在有效的长期文档表示方面挣扎。我们的工作满足了这种批判性需求,即如何开发更好的模型来从长文本文档中提取有用的信息,并学习有效的功能,这些功能可以利用软件财务和风险信息来进行文本回归(预测)任务。在本文中,我们提出并实施了一个深度学习框架,该框架将长文档分为大块,并利用预先训练的LMS处理和将块汇总为矢量表示,然后进行自我关注以提取有价值的文档级特征。我们根据美国银行的10-K公共披露报告以及美国公司提交的另一个报告数据集评估了模型。总体而言,我们的框架优于文本建模的强大基线方法以及仅使用数值数据的基线回归模型。我们的工作提供了更好的见解,即如何利用预先训练的域特异性和微调的长输入LMS来表示长文档可以提高文本数据的表示质量,从而有助于改善预测分析。
translated by 谷歌翻译
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a;Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
translated by 谷歌翻译
Text classification is a natural language processing (NLP) task relevant to many commercial applications, like e-commerce and customer service. Naturally, classifying such excerpts accurately often represents a challenge, due to intrinsic language aspects, like irony and nuance. To accomplish this task, one must provide a robust numerical representation for documents, a process known as embedding. Embedding represents a key NLP field nowadays, having faced a significant advance in the last decade, especially after the introduction of the word-to-vector concept and the popularization of Deep Learning models for solving NLP tasks, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based Language Models (TLMs). Despite the impressive achievements in this field, the literature coverage regarding generating embeddings for Brazilian Portuguese texts is scarce, especially when considering commercial user reviews. Therefore, this work aims to provide a comprehensive experimental study of embedding approaches targeting a binary sentiment classification of user reviews in Brazilian Portuguese. This study includes from classical (Bag-of-Words) to state-of-the-art (Transformer-based) NLP models. The methods are evaluated with five open-source databases with pre-defined data partitions made available in an open digital repository to encourage reproducibility. The Fine-tuned TLMs achieved the best results for all cases, being followed by the Feature-based TLM, LSTM, and CNN, with alternate ranks, depending on the database under analysis.
translated by 谷歌翻译
预先训练的上下文化文本表示模型学习自然语言的有效表示,以使IT机器可以理解。在注意机制的突破之后,已经提出了新一代预磨模的模型,以便自变压器引入以来实现了良好的性能。来自变压器(BERT)的双向编码器表示已成为语言理解的最先进的模型。尽管取得了成功,但大多数可用的型号已经在印度欧洲语言中培训,但是对代表性的语言和方言的类似研究仍然稀疏。在本文中,我们调查了培训基于单语言变换器的语言模型的可行性,以获得代表语言的特定重点是突尼斯方言。我们评估了我们的语言模型对情感分析任务,方言识别任务和阅读理解问答任务。我们表明使用嘈杂的Web爬网数据而不是结构化数据(维基百科,文章等)更方便这些非标准化语言。此外,结果表明,相对小的Web爬网数据集导致与使用较大数据集获得的那些表现相同的性能。最后,我们在所有三个下游任务中达到或改善了最先进的Tunbert模型。我们释放出Tunbert净化模型和用于微调的数据集。
translated by 谷歌翻译
动机:生物医学研究人员和临床从业者的常年挑战是随着出版物和医疗票据的快速增长而待的。自然语言处理(NLP)已成为驯服信息超载的有希望的方向。特别是,大型神经语言模型通过预先绘制的文本预测,通过各种NLP应用中的BERT模型的成功示例,便于通过预先绘制的预先来进行学习。然而,用于结束任务的微调此类模型仍然具有挑战性,特别是具有小标记数据集,这些数据集是生物医学NLP的常见。结果:我们对生物医学NLP的微调稳定性进行了系统研究。我们表明FineTuning性能可能对预先预订的设置敏感,尤其是在低资源域中。大型型号有可能获得更好的性能,但越来越多的模型大小也加剧了FineTuning不稳定性。因此,我们对解决微调不稳定的技术进行了全面的探索。我们表明,这些技术可以大大提高低源生物医学NLP应用的微调性能。具体地,冻结下层有助于标准伯特基型号,而完整的衰减对于BERT-LARD和Electra型号更有效。对于低资源文本相似性任务,如生物,重新初始化顶层是最佳策略。总体而言,占星型词汇和预制促进更强大的微调模型。基于这些调查结果,我们在广泛的生物医学NLP应用方面建立了新的技术。可用性和实施​​:为了促进生物医学NLP的进展,我们释放了我们最先进的预订和微调模型:https://aka.ms/blurb。
translated by 谷歌翻译
潜在的生命危及危及生命的错误信息急剧上升是Covid-19大流行的副产品。计算支持,以识别关于该主题的大规模数据内的虚假信息至关重要,以防止伤害。研究人员提出了许多用于标记与Covid-19相关的在线错误信息的方法。但是,这些方法主要针对特定​​的内容类型(例如,新闻)或平台(例如,Twitter)。概括的方法的能力在很大程度上尚不清楚。我们在五十个COVID-19错误信息数据集中评估基于15个变压器的模型,包括社交媒体帖子,新闻文章和科学论文来填补这一差距。我们向Covid-19数据量身定制的标记和模型不提供普通目的的数据的显着优势。我们的研究为检测Covid-19错误信息的模型提供了逼真的评估。我们预计评估广泛的数据集和模型将使未来的开发错误信息检测系统进行未来的研究。
translated by 谷歌翻译
随着社交媒体平台上的开放文本数据的最新扩散,在过去几年中,文本的情感检测(ED)受到了更多关注。它有许多应用程序,特别是对于企业和在线服务提供商,情感检测技术可以通过分析客户/用户对产品和服务的感受来帮助他们做出明智的商业决策。在这项研究中,我们介绍了Armanemo,这是一个标记为七个类别的7000多个波斯句子的人类标记的情感数据集。该数据集是从不同资源中收集的,包括Twitter,Instagram和Digikala(伊朗电子商务公司)的评论。标签是基于埃克曼(Ekman)的六种基本情感(愤怒,恐惧,幸福,仇恨,悲伤,奇迹)和另一个类别(其他),以考虑Ekman模型中未包含的任何其他情绪。除数据集外,我们还提供了几种基线模型,用于情绪分类,重点是最新的基于变压器的语言模型。我们的最佳模型在我们的测试数据集中达到了75.39%的宏观平均得分。此外,我们还进行了转移学习实验,以将我们提出的数据集的概括与其他波斯情绪数据集进行比较。这些实验的结果表明,我们的数据集在现有的波斯情绪数据集中具有较高的概括性。 Armanemo可在https://github.com/arman-rayan-sharif/arman-text-emotion上公开使用。
translated by 谷歌翻译
社会科学的学术文献是记录人类文明并研究人类社会问题的文献。随着这种文献的大规模增长,快速找到有关相关问题的现有研究的方法已成为对研究人员的紧迫需求。先前的研究,例如SCIBERT,已经表明,使用特定领域的文本进行预训练可以改善这些领域中自然语言处理任务的性能。但是,没有针对社会科学的预训练的语言模型,因此本文提出了关于社会科学引文指数(SSCI)期刊上许多摘要的预培训模型。这些模型可在GitHub(https://github.com/s-t-full-text-knowledge-mining/ssci-bert)上获得,在学科分类和带有社会科学文学的抽象结构 - 功能识别任务方面表现出色。
translated by 谷歌翻译