社会科学的学术文献是记录人类文明并研究人类社会问题的文献。随着这种文献的大规模增长,快速找到有关相关问题的现有研究的方法已成为对研究人员的紧迫需求。先前的研究,例如SCIBERT,已经表明,使用特定领域的文本进行预训练可以改善这些领域中自然语言处理任务的性能。但是,没有针对社会科学的预训练的语言模型,因此本文提出了关于社会科学引文指数(SSCI)期刊上许多摘要的预培训模型。这些模型可在GitHub(https://github.com/s-t-full-text-knowledge-mining/ssci-bert)上获得,在学科分类和带有社会科学文学的抽象结构 - 功能识别任务方面表现出色。
translated by 谷歌翻译
Future work sentences (FWS) are the particular sentences in academic papers that contain the author's description of their proposed follow-up research direction. This paper presents methods to automatically extract FWS from academic papers and classify them according to the different future directions embodied in the paper's content. FWS recognition methods will enable subsequent researchers to locate future work sentences more accurately and quickly and reduce the time and cost of acquiring the corpus. The current work on automatic identification of future work sentences is relatively small, and the existing research cannot accurately identify FWS from academic papers, and thus cannot conduct data mining on a large scale. Furthermore, there are many aspects to the content of future work, and the subdivision of the content is conducive to the analysis of specific development directions. In this paper, Nature Language Processing (NLP) is used as a case study, and FWS are extracted from academic papers and classified into different types. We manually build an annotated corpus with six different types of FWS. Then, automatic recognition and classification of FWS are implemented using machine learning models, and the performance of these models is compared based on the evaluation metrics. The results show that the Bernoulli Bayesian model has the best performance in the automatic recognition task, with the Macro F1 reaching 90.73%, and the SCIBERT model has the best performance in the automatic classification task, with the weighted average F1 reaching 72.63%. Finally, we extract keywords from FWS and gain a deep understanding of the key content described in FWS, and we also demonstrate that content determination in FWS will be reflected in the subsequent research work by measuring the similarity between future work sentences and the abstracts.
translated by 谷歌翻译
在法律文本中预先培训的基于变压器的预训练语言模型(PLM)的出现,法律领域中的自然语言处理受益匪浅。有经过欧洲和美国法律文本的PLM,最著名的是Legalbert。但是,随着印度法律文件的NLP申请量的迅速增加以及印度法律文本的区别特征,也有必要在印度法律文本上预先培训LMS。在这项工作中,我们在大量的印度法律文件中介绍了基于变压器的PLM。我们还将这些PLM应用于印度法律文件的几个基准法律NLP任务,即从事实,法院判决的语义细分和法院判决预测中的法律法规识别。我们的实验证明了这项工作中开发的印度特定PLM的实用性。
translated by 谷歌翻译
随着文献资源的丰富,研究人员面临着信息爆炸和知识过载的不断增长的问题。为了帮助学者检索文学并成功获得知识,澄清学术文学中内容的语义结构已成为基本的研究问题。在识别学术文章中章节的结构功能的研究中,只有几项研究使用了深度学习模型,并探索了特征输入的优化。这限制了研究任务深度学习模型的应用,优化潜力。本文将ACL会议的文章作为语料库。我们采用传统的机器学习模型和深度学习模型,基于各种特征输入构建分类器。实验结果表明,(1)与章节内容相比,章节标题更有利于识别学术文章的结构功能。 (2)相对位置是建立传统模型的有价值的功能。 (3)受到(2)的启发,本文进一步将上下文信息引入深度学习模型,取得了显着的结果。同时,我们的模型在包含200个采样的非训练样本的开放式测试中显示出良好的迁移能力。近五年我们还基于表演模型的最佳实践,并对整体语料库进行了时间序列分析,近五年注释了ACL主要会议文件。这项工作通过多个比较实验探索并总结了此任务的实际功能和模型,并为相关文本分类任务提供了参考。最后,我们表示当前模型的局限性和缺点以及进一步优化的方向。
translated by 谷歌翻译
Pre-trained language models are trained on large-scale unsupervised data, and they can be fine-tuned on small-scale labeled datasets and achieve good results. Multilingual pre-trained language models can be trained on multiple languages and understand multiple languages at the same time. At present, the research on pre-trained models mainly focuses on rich-resource language, while there is relatively little research on low-resource languages such as minority languages, and the public multilingual pre-trained language model can not work well for minority languages. Therefore, this paper constructs a multilingual pre-trained language model named MiLMo that performs better on minority language tasks, including Mongolian, Tibetan, Uyghur, Kazakh and Korean. To solve the problem of scarcity of datasets on minority languages and verify the effectiveness of the MiLMo model, this paper constructs a minority multilingual text classification dataset named MiTC, and trains a word2vec model for each language. By comparing the word2vec model and the pre-trained model in the text classification task, this paper provides an optimal scheme for the downstream task research of minority languages. The final experimental results show that the performance of the pre-trained model is better than that of the word2vec model, and it has achieved the best results in minority multilingual text classification. The multilingual pre-trained language model MiLMo, multilingual word2vec model and multilingual text classification dataset MiTC are published on https://milmo.cmli-nlp.com.
translated by 谷歌翻译
在科学研究中,该方法是解决科学问题和关键研究对象的必不可少手段。随着科学的发展,正在提出,修改和使用许多科学方法。作者在抽象和身体文本中描述了该方法的详细信息,并且反映该方法名称的学术文献中的关键实体称为方法实体。在大量的学术文献中探索各种方法实体有助于学者了解现有方法,为研究任务选择适当的方法并提出新方法。此外,方法实体的演变可以揭示纪律的发展并促进知识发现。因此,本文对方法论和经验作品进行了系统的综述,重点是从全文学术文献中提取方法实体,并努力使用这些提取的方法实体来建立知识服务。首先提出了本综述涉及的关键概念的定义。基于这些定义,我们系统地审查了提取和评估方法实体的方法和指标,重点是每种方法的利弊。我们还调查了如何使用提取的方法实体来构建新应用程序。最后,讨论了现有作品的限制以及潜在的下一步。
translated by 谷歌翻译
专利数据是创新研究知识的重要来源。尽管专利对之间的技术相似性是用于专利分析的关键指标。最近,研究人员一直在使用基于不同NLP嵌入模型的专利矢量空间模型来计算专利对之间的技术相似性,以帮助更好地了解创新,专利景观,技术映射和专利质量评估。据我们所知,没有一项全面的调查来建立嵌入模型的性能以计算专利相似性指标的大图。因此,在这项研究中,我们根据专利分类性能概述了这些算法的准确性。在详细的讨论中,我们报告了部分,类和子类级别的前3个算法的性能。基于专利的第一个主张的结果表明,专利,贝特(Bert-For)和tf-idf加权单词嵌入具有最佳准确性,可以在亚类级别计算句子嵌入。根据第一个结果,不同类别中模型的性能各不相同,这表明专利分析中的研究人员可以利用本研究的结果根据他们使用的专利数据的特定部分选择最佳的适当模型。
translated by 谷歌翻译
The field of cybersecurity is evolving fast. Experts need to be informed about past, current and - in the best case - upcoming threats, because attacks are becoming more advanced, targets bigger and systems more complex. As this cannot be addressed manually, cybersecurity experts need to rely on machine learning techniques. In the texutual domain, pre-trained language models like BERT have shown to be helpful, by providing a good baseline for further fine-tuning. However, due to the domain-knowledge and many technical terms in cybersecurity general language models might miss the gist of textual information, hence doing more harm than good. For this reason, we create a high-quality dataset and present a language model specifically tailored to the cybersecurity domain, which can serve as a basic building block for cybersecurity systems that deal with natural language. The model is compared with other models based on 15 different domain-dependent extrinsic and intrinsic tasks as well as general tasks from the SuperGLUE benchmark. On the one hand, the results of the intrinsic tasks show that our model improves the internal representation space of words compared to the other models. On the other hand, the extrinsic, domain-dependent tasks, consisting of sequence tagging and classification, show that the model is best in specific application scenarios, in contrast to the others. Furthermore, we show that our approach against catastrophic forgetting works, as the model is able to retrieve the previously trained domain-independent knowledge. The used dataset and trained model are made publicly available
translated by 谷歌翻译
文本分类是具有各种有趣应用程序的典型自然语言处理或计算语言学任务。随着社交媒体平台上的用户数量的增加,数据加速促进了有关社交媒体文本分类(SMTC)或社交媒体文本挖掘的新兴研究。与英语相比,越南人是低资源语言之一,仍然没有集中精力并彻底利用。受胶水成功的启发,我们介绍了社交媒体文本分类评估(SMTCE)基准,作为各种SMTC任务的数据集和模型的集合。借助拟议的基准,我们实施和分析了各种基于BERT的模型(Mbert,XLM-R和Distilmbert)和基于单语的BERT模型(Phobert,Vibert,Vibert,Velectra和Vibert4news)的有效性SMTCE基准。单语模型优于多语言模型,并实现所有文本分类任务的最新结果。它提供了基于基准的多语言和单语言模型的客观评估,该模型将使越南语言中有关贝尔特兰的未来研究有利。
translated by 谷歌翻译
虽然罕见疾病的特征在于患病率低,但大约3亿人受到罕见疾病的影响。对这些条件的早期和准确诊断是一般从业者的主要挑战,没有足够的知识来识别它们。除此之外,罕见疾病通常会显示各种表现形式,这可能会使诊断更加困难。延迟的诊断可能会对患者的生命产生负面影响。因此,迫切需要增加关于稀有疾病的科学和医学知识。自然语言处理(NLP)和深度学习可以帮助提取有关罕见疾病的相关信息,以促进其诊断和治疗。本文探讨了几种深度学习技术,例如双向长期内存(BILSTM)网络或基于来自变压器(BERT)的双向编码器表示的深层语境化词表示,以识别罕见疾病及其临床表现(症状和症状) Raredis语料库。该毒品含有超过5,000名罕见疾病和近6,000个临床表现。 Biobert,基于BERT和培训的生物医学Corpora培训的域特定语言表示,获得了最佳结果。特别是,该模型获得罕见疾病的F1分数为85.2%,表现优于所有其他模型。
translated by 谷歌翻译
名人认可是品牌交流中最重要的策略之一。如今,越来越多的公司试图为自己建立生动的特征。因此,他们的品牌身份交流应符合人类和法规的某些特征。但是,以前的作品主要是通过假设停止的,而不是提出一种特定的品牌和名人之间匹配的方式。在本文中,我们建议基于自然语言处理(NLP)技术的品牌名人匹配模型(BCM)。鉴于品牌和名人,我们首先从互联网上获得了一些描述性文档,然后总结了这些文档,最后计算品牌和名人之间的匹配程度,以确定它们是否匹配。根据实验结果,我们提出的模型以0.362 F1得分和精度的6.3%优于最佳基线,这表明我们模型在现实世界中的有效性和应用值。更重要的是,据我们所知,拟议的BCM模型是使用NLP解决认可问题的第一项工作,因此它可以为以下工作提供一些新颖的研究思想和方法。
translated by 谷歌翻译
这项研究提供了对僧伽罗文本分类的预训练语言模型的性能的首次全面分析。我们测试了一组不同的Sinhala文本分类任务,我们的分析表明,在包括Sinhala(XLM-R,Labse和Laser)的预训练的多语言模型中,XLM-R是迄今为止Sinhala文本的最佳模型分类。我们还预先培训了两种基于罗伯塔的单语僧伽罗模型,它们远远优于僧伽罗的现有预训练的语言模型。我们表明,在微调时,这些预训练的语言模型为僧伽罗文本分类树立了非常强大的基线,并且在标记数据不足以进行微调的情况下非常强大。我们进一步提供了一组建议,用于使用预训练的模型进行Sinhala文本分类。我们还介绍了新的注释数据集,可用于僧伽罗文本分类的未来研究,并公开发布我们的预培训模型。
translated by 谷歌翻译
Many prior language modeling efforts have shown that pre-training on an in-domain corpus can significantly improve performance on downstream domain-specific NLP tasks. However, the difficulties associated with collecting enough in-domain data might discourage researchers from approaching this pre-training task. In this paper, we conducted a series of experiments by pre-training Bidirectional Encoder Representations from Transformers (BERT) with different sizes of biomedical corpora. The results demonstrate that pre-training on a relatively small amount of in-domain data (4GB) with limited training steps, can lead to better performance on downstream domain-specific NLP tasks compared with fine-tuning models pre-trained on general corpora.
translated by 谷歌翻译
自然语言处理(NLP)是一个人工智能领域,它应用信息技术来处理人类语言,在一定程度上理解并在各种应用中使用它。在过去的几年中,该领域已经迅速发展,现在采用了深层神经网络的现代变体来从大型文本语料库中提取相关模式。这项工作的主要目的是调查NLP在药理学领域的最新使用。正如我们的工作所表明的那样,NLP是药理学高度相关的信息提取和处理方法。它已被广泛使用,从智能搜索到成千上万的医疗文件到在社交媒体中找到对抗性药物相互作用的痕迹。我们将覆盖范围分为五个类别,以调查现代NLP方法论,常见的任务,相关的文本数据,知识库和有用的编程库。我们将这五个类别分为适当的子类别,描述其主要属性和想法,并以表格形式进行总结。最终的调查介绍了该领域的全面概述,对从业者和感兴趣的观察者有用。
translated by 谷歌翻译
预先训练的上下文化文本表示模型学习自然语言的有效表示,以使IT机器可以理解。在注意机制的突破之后,已经提出了新一代预磨模的模型,以便自变压器引入以来实现了良好的性能。来自变压器(BERT)的双向编码器表示已成为语言理解的最先进的模型。尽管取得了成功,但大多数可用的型号已经在印度欧洲语言中培训,但是对代表性的语言和方言的类似研究仍然稀疏。在本文中,我们调查了培训基于单语言变换器的语言模型的可行性,以获得代表语言的特定重点是突尼斯方言。我们评估了我们的语言模型对情感分析任务,方言识别任务和阅读理解问答任务。我们表明使用嘈杂的Web爬网数据而不是结构化数据(维基百科,文章等)更方便这些非标准化语言。此外,结果表明,相对小的Web爬网数据集导致与使用较大数据集获得的那些表现相同的性能。最后,我们在所有三个下游任务中达到或改善了最先进的Tunbert模型。我们释放出Tunbert净化模型和用于微调的数据集。
translated by 谷歌翻译
BERT,ROBERTA或GPT-3等复杂的基于注意力的语言模型的外观已允许在许多场景中解决高度复杂的任务。但是,当应用于特定域时,这些模型会遇到相当大的困难。诸如Twitter之类的社交网络就是这种情况,Twitter是一种不断变化的信息流,以非正式和复杂的语言编写的信息流,鉴于人类的重要作用,每个信息都需要仔细评估,即使人类也需要理解。通过自然语言处理解决该领域的任务涉及严重的挑战。当将强大的最先进的多语言模型应用于这种情况下,特定语言的细微差别用来迷失翻译。为了面对这些挑战,我们提出了\ textbf {bertuit},这是迄今为止针对西班牙语提出的较大变压器,使用Roberta Optimization进行了230m西班牙推文的大规模数据集进行了预培训。我们的动机是提供一个强大的资源,以更好地了解西班牙Twitter,并用于专注于该社交网络的应用程序,特别强调致力于解决该平台中错误信息传播的解决方案。对Bertuit进行了多个任务评估,并与M-Bert,XLM-Roberta和XLM-T进行了比较,该任务非常具有竞争性的多语言变压器。在这种情况下,使用应用程序显示了我们方法的实用性:一种可视化骗局和分析作者群体传播虚假信息的零击方法。错误的信息在英语以外的其他语言等平台上疯狂地传播,这意味着在英语说话之外转移时,变形金刚的性能可能会受到影响。
translated by 谷歌翻译
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a;Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
translated by 谷歌翻译
Motivation: Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. Results: We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts.
translated by 谷歌翻译
The rapid advancement of AI technology has made text generation tools like GPT-3 and ChatGPT increasingly accessible, scalable, and effective. This can pose serious threat to the credibility of various forms of media if these technologies are used for plagiarism, including scientific literature and news sources. Despite the development of automated methods for paraphrase identification, detecting this type of plagiarism remains a challenge due to the disparate nature of the datasets on which these methods are trained. In this study, we review traditional and current approaches to paraphrase identification and propose a refined typology of paraphrases. We also investigate how this typology is represented in popular datasets and how under-representation of certain types of paraphrases impacts detection capabilities. Finally, we outline new directions for future research and datasets in the pursuit of more effective paraphrase detection using AI.
translated by 谷歌翻译
NLP是与计算机或机器理解和解释人类语言的能力有关的人工智能和机器学习的一种形式。语言模型在文本分析和NLP中至关重要,因为它们允许计算机解释定性输入并将其转换为可以在其他任务中使用的定量数据。从本质上讲,在转移学习的背景下,语言模型通常在大型通用语料库上进行培训,称为预训练阶段,然后对特定的基本任务进行微调。结果,预训练的语言模型主要用作基线模型,该模型包含了对上下文的广泛掌握,并且可以进一步定制以在新的NLP任务中使用。大多数预训练的模型都经过来自Twitter,Newswire,Wikipedia和Web等通用领域的Corpora培训。在一般文本中训练的现成的NLP模型可能在专业领域效率低下且不准确。在本文中,我们提出了一个名为Securebert的网络安全语言模型,该模型能够捕获网络安全域中的文本含义,因此可以进一步用于自动化,用于许多重要的网络安全任务,否则这些任务将依靠人类的专业知识和繁琐的手动努力。 Securebert受到了我们从网络安全和一般计算域的各种来源收集和预处理的大量网络安全文本培训。使用我们提出的令牌化和模型权重调整的方法,Securebert不仅能够保留对一般英语的理解,因为大多数预训练的语言模型都可以做到,而且在应用于具有网络安全含义的文本时也有效。
translated by 谷歌翻译