NLP in the legal domain has benefited greatly from the advent of Transformer-based pre-trained language models (PLMs) pre-trained on legal text. There are PLMs trained on European and US legal text, most notably LegalBERT. However, with the rapidly increasing number of NLP applications over Indian legal documents, and the distinguishing characteristics of Indian legal text, there is also a need to pre-train LMs on Indian legal text. In this work, we introduce Transformer-based PLMs pre-trained on a large corpus of Indian legal documents. We also apply these PLMs to several benchmark legal NLP tasks over Indian legal documents, namely legal statute identification from facts, semantic segmentation of court judgments, and court judgment prediction. Our experiments demonstrate the utility of the India-specific PLMs developed in this work.
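As a rough illustration of the kind of continued masked-language-model pre-training described above, the sketch below uses the Hugging Face transformers library. The checkpoint name, corpus path, and hyperparameters are illustrative placeholders, not the authors' actual setup.

```python
# A minimal sketch of domain-specific MLM pre-training on a plain-text corpus.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical corpus file: one legal document (or paragraph) per line.
raw = load_dataset("text", data_files={"train": "indian_legal_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="legal-mlm", per_device_train_batch_size=8,
                         num_train_epochs=1, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```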
Laws and their interpretations, legal arguments and agreements are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeavors. Their usefulness, however, largely depends on whether current state-of-the-art models can generalize across various tasks in the legal domain. To answer this currently open question, we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way. We also provide an evaluation and analysis of several generic and legal-oriented models demonstrating that the latter consistently offer performance improvements across multiple tasks.
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
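The "one additional output layer" recipe can be sketched, in hedged form, with the Hugging Face transformers API: a sequence-classification head is placed on top of the pre-trained encoder and the whole model is fine-tuned. The example sentence, label, and number of labels below are placeholders.

```python
# A minimal sketch of fine-tuning BERT with a single classification head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["The contract was terminated without notice."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1])  # toy label

outputs = model(**batch, labels=labels)  # loss for training, logits for prediction
outputs.loss.backward()                  # one fine-tuning step (optimizer omitted)
print(outputs.logits.softmax(dim=-1))
```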
The academic literature of the social sciences records human civilization and studies human social problems. With the large-scale growth of this literature, ways to quickly find existing research on relevant issues have become an urgent need for researchers. Previous studies, such as SciBERT, have shown that pre-training with domain-specific text can improve the performance of natural language processing tasks in those domains. However, there is no pre-trained language model specifically for the social sciences, so this paper proposes models pre-trained on a large number of abstracts from Social Science Citation Index (SSCI) journals. These models, which are available on GitHub (https://github.com/s-t-full-text-knowledge-mining/ssci-bert), show excellent performance on discipline classification and abstract structure-function recognition tasks with social science literature.
Bidirectional Encoder Representations from Transformers (BERT) has shown marvelous improvements across various NLP tasks, and consecutive variants have been proposed to further improve the performance of pre-trained language models. In this paper, we first introduce the whole word masking (WWM) strategy for Chinese BERT, along with a series of Chinese pre-trained language models. We then also propose a simple but effective model called MacBERT, which improves upon RoBERTa in several ways. Especially, we propose a new masking strategy called MLM as correction (Mac). To demonstrate the effectiveness of these models, we create a series of Chinese pre-trained language models as our baselines, including BERT, RoBERTa, ELECTRA, RBT, etc. We carried out extensive experiments on ten Chinese NLP tasks to evaluate the created Chinese pre-trained language models as well as the proposed MacBERT. Experimental results show that MacBERT can achieve state-of-the-art performance on many NLP tasks, and we also ablate details with several findings that may help future research. We open-source our pre-trained language models to further facilitate our research community. Resources are available: https://github.com/ymcui/chinese-bert-wwm
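Whole word masking replaces all sub-word pieces of a chosen word together rather than masking pieces independently. The sketch below uses the stock Hugging Face collator as a hedged illustration; note that the Chinese models above additionally rely on an external word segmenter to define word boundaries, which is not shown here, so this example only covers the English WordPiece case.

```python
# A minimal sketch of whole word masking with the off-the-shelf HF collator.
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

encoded = tokenizer("Pretraining language models benefits downstream tasks.")
batch = collator([{"input_ids": encoded["input_ids"]}])

# Sub-tokens of a masked word (e.g. "pre", "##train", "##ing") are replaced jointly.
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))
```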
Legal documents are unstructured, use legal jargon, and have considerable length, making it difficult to process them automatically with conventional text processing techniques. A legal document processing system would benefit substantially if the documents could be semantically segmented into coherent units of information. This paper proposes a rhetorical roles (RR) system for segmenting legal documents into semantically coherent units: facts, arguments, statute, issue, precedent, ruling, and ratio. With the help of legal experts, we propose a set of 13 fine-grained rhetorical role labels and create a new corpus of legal documents annotated with the proposed RR scheme. We develop a system for segmenting a document into rhetorical role units. In particular, we develop a multi-task learning-based deep learning model with document rhetorical role label shift as an auxiliary task for segmenting a legal document. We experiment extensively with various deep learning models for predicting rhetorical roles in a document, and the proposed model shows superior performance over the existing models. Further, we apply RR to predict the judgment of legal cases and show that the use of RR enhances the prediction compared to transformer-based models.
In populous countries, pending legal cases have been growing exponentially. There is a need to develop techniques for processing and organizing legal documents. In this paper, we introduce a new corpus for structuring legal documents. In particular, we introduce a corpus of legal judgment documents in English that are segmented into topical and coherent parts. Each of these parts is annotated with a label from a list of pre-defined rhetorical roles. We develop baseline models for automatically predicting rhetorical roles in a legal document based on the annotated corpus. Further, we show the application of rhetorical roles to improve performance on the tasks of summarization and legal judgment prediction. We release the corpus and the baseline model code along with the paper.
Modeling law search and retrieval as prediction problems has recently emerged as a predominant approach in legal intelligence. Focusing on the law article retrieval task, we present a deep learning framework named LamBERTa, which is designed for civil-law codes and specifically trained on the Italian Civil Code. To our knowledge, this is the first study proposing an advanced approach to law article prediction for the Italian legal system based on a BERT (Bidirectional Encoder Representations from Transformers) learning framework, which has recently attracted increasing attention among deep learning methods and shown outstanding effectiveness in several natural language processing and learning tasks. We define LamBERTa models by fine-tuning an Italian pre-trained BERT on the Italian Civil Code or its portions, casting law article retrieval as a classification task. A key aspect of our LamBERTa framework is that we conceived it to address an extreme classification scenario, characterized by a large number of classes, a few-shot learning problem, and the lack of test query benchmarks for Italian legal prediction tasks. To solve those issues, we define different methods for the unsupervised labeling of the law articles, which can in principle be applied to any legal system. We provide insights into the explainability and interpretability of our LamBERTa models, and we present an extensive experimental analysis over query templates for single-label as well as multi-label evaluation tasks. Empirical evidence shows the effectiveness of LamBERTa, and also its superiority against widely used deep learning text classifiers and a few-shot learner conceived for an attribute-aware prediction task.
NLP is a form of artificial intelligence and machine learning concerned with a computer's or machine's ability to understand and interpret human language. Language models are crucial in text analytics and NLP, since they allow computers to interpret qualitative input and convert it into quantitative data that can be used in other tasks. In essence, in the context of transfer learning, language models are typically trained on a large generic corpus, referred to as the pre-training stage, and then fine-tuned for a specific underlying task. As a result, pre-trained language models are mostly used as baseline models that incorporate a broad grasp of context and can be further customized for new NLP tasks. The majority of pre-trained models are trained on corpora from general domains such as Twitter, newswire, Wikipedia, and the Web. Off-the-shelf NLP models trained on general text may be inefficient and inaccurate in specialized fields. In this paper, we propose a cybersecurity language model called SecureBERT, which is able to capture text connotations in the cybersecurity domain and can therefore be further used in automation for many important cybersecurity tasks that would otherwise rely on human expertise and tedious manual effort. SecureBERT is trained on a large corpus of cybersecurity text that we collected and preprocessed from a variety of sources in the cybersecurity and general computing domains. Using our proposed methods for tokenization and model weight adjustment, SecureBERT is not only able to preserve an understanding of general English, as most pre-trained language models do, but is also effective when applied to text with cybersecurity implications.
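One general way to adapt tokenization to a specialized vocabulary is to retrain an existing fast tokenizer on in-domain text. The sketch below shows this generic idea only; it is not the SecureBERT recipe, and the corpus lines and vocabulary size are placeholders.

```python
# A minimal sketch of building a domain-adapted tokenizer from an existing one.
from transformers import AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

cyber_corpus = [
    "The malware exfiltrates credentials via a C2 channel over HTTPS.",
    "CVE-2021-44228 allows remote code execution through JNDI lookups.",
]  # in practice: an iterator over the full in-domain pre-training corpus

domain_tokenizer = base_tokenizer.train_new_from_iterator(cyber_corpus, vocab_size=52000)
print(domain_tokenizer.tokenize("exfiltrates credentials via C2"))
```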
This research provides the first comprehensive analysis of the performance of pre-trained language models for Sinhala text classification. We test on a set of different Sinhala text classification tasks, and our analysis shows that, among the pre-trained multilingual models that include Sinhala (XLM-R, LaBSE, and LASER), XLM-R is by far the best model for Sinhala text classification. We also pre-train two RoBERTa-based monolingual Sinhala models, which far outperform the existing pre-trained language models for Sinhala. We show that, when fine-tuned, these pre-trained language models set a very strong baseline for Sinhala text classification and are robust in situations where labeled data is insufficient for fine-tuning. We further provide a set of recommendations for using pre-trained models for Sinhala text classification. We also introduce new annotated datasets useful for future research in Sinhala text classification and publicly release our pre-trained models.
Pre-trained Transformers currently dominate most NLP tasks. They impose, however, limits on the maximum input length (512 sub-words in BERT), which are too restrictive in the legal domain. Even sparse-attention models, such as Longformer and BigBird, which increase the maximum input length to 4,096 sub-words, severely truncate texts in three of the six datasets of LexGLUE. Simpler linear classifiers with TF-IDF features can handle texts of any length, require far less resources to train and deploy, but are usually outperformed by pre-trained Transformers. We explore two directions to cope with long legal texts: (i) modifying a Longformer warm-started from LegalBERT to handle even longer texts (up to 8,192 sub-words), and (ii) modifying LegalBERT to use TF-IDF representations. The first approach is the best in terms of performance, surpassing a hierarchical version of LegalBERT, which was the previous state of the art in LexGLUE. The second approach leads to computationally more efficient models at the expense of lower performance, but the resulting models still outperform overall a linear SVM with TF-IDF features in long legal document classification.
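The linear baseline mentioned above can be sketched in a few lines: a TF-IDF vectorizer feeding a linear SVM, which has no input-length limit and so sees long judgments in full. The documents and labels below are toy placeholders.

```python
# A minimal sketch of a TF-IDF + linear SVM classifier for long legal documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["full text of a long legal judgment, used without truncation",
        "another judgment, possibly tens of thousands of words long"]
labels = [0, 1]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True),
                    LinearSVC(C=1.0))
clf.fit(docs, labels)
print(clf.predict(["an unseen judgment to classify"]))
```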
Understanding customer feedback is becoming a necessity for companies to identify problems and improve their products and services. Text classification and sentiment analysis can play a major role in analyzing this data by using a variety of machine and deep learning approaches. In this work, different transformer-based models are utilized to explore how efficient these models are when working with a German customer feedback dataset. In addition, these pre-trained models are further analyzed to determine if adapting them to a specific domain using unlabeled data can yield better results than off-the-shelf pre-trained models. To evaluate the models, two downstream tasks from the GermEval 2017 are considered. The experimental results show that transformer-based models can reach significant improvements compared to a fastText baseline and outperform the published scores and previous models. For the subtask Relevance Classification, the best models achieve a micro-averaged $F1$-Score of 96.1 % on the first test set and 95.9 % on the second one, and a score of 85.1 % and 85.3 % for the subtask Polarity Classification.
The application of Natural Language Processing (NLP) to specialized domains, such as the law, has recently received a surge of interest. As many legal services rely on processing and analyzing large collections of documents, automating such tasks with NLP tools emerges as a key challenge. Many popular language models, such as BERT or RoBERTa, are general-purpose models, which have limitations on processing specialized legal terminology and syntax. In addition, legal documents may contain specialized vocabulary from other domains, such as medical terminology in personal injury text. Here, we propose LegalRelectra, a legal-domain language model that is trained on mixed-domain legal and medical corpora. We show that our model improves over general-domain and single-domain medical and legal language models when processing mixed-domain (personal injury) text. Our training architecture implements the Electra framework, but utilizes Reformer instead of BERT for its generator and discriminator. We show that this improves the model's performance on processing long passages and results in better long-range text comprehension.
Language model pre-training has proven to be useful in learning universal language representations. As a state-of-the-art language model pre-training model, BERT (Bidirectional Encoder Representations from Transformers) has achieved amazing results in many language understanding tasks. In this paper, we conduct exhaustive experiments to investigate different fine-tuning methods of BERT on text classification task and provide a general solution for BERT fine-tuning. Finally, the proposed solution obtains new state-of-the-art results on eight widely-studied text classification datasets.
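One fine-tuning strategy studied in this line of work is a layer-wise decaying learning rate, where lower Transformer layers receive smaller learning rates than layers close to the classification head. The sketch below builds AdamW parameter groups for a bert-base model; the decay factor and base learning rate are illustrative values, not claimed to be the paper's exact settings.

```python
# A minimal sketch of layer-wise learning-rate decay for BERT fine-tuning.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

base_lr, decay = 2e-5, 0.95
num_layers = model.config.num_hidden_layers  # 12 for bert-base

groups = [{"params": model.classifier.parameters(), "lr": base_lr}]
for i, layer in enumerate(model.bert.encoder.layer):  # layer 0 is closest to the input
    groups.append({"params": layer.parameters(),
                   "lr": base_lr * decay ** (num_layers - i)})
groups.append({"params": model.bert.embeddings.parameters(),
               "lr": base_lr * decay ** (num_layers + 1)})
# (the BERT pooler is omitted here for brevity)

optimizer = torch.optim.AdamW(groups)
```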
With an increasing amount of data in the art world, discovering artists and artworks suitable to collectors' tastes becomes a challenge. It is no longer enough to use visual information, as contextual information about the artist has become just as important in contemporary art. In this work, we present a generic Natural Language Processing framework (called ArtLM) to discover the connections among contemporary artists based on their biographies. In this approach, we first continue to pre-train the existing general English language models with a large amount of unlabelled art-related data. We then fine-tune this new pre-trained model with our biography pair dataset manually annotated by a team of professionals in the art industry. With extensive experiments, we demonstrate that our ArtLM achieves 85.6% accuracy and 84.0% F1 score and outperforms other baseline models. We also provide a visualisation and a qualitative analysis of the artist network built from ArtLM's outputs.
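A hedged sketch of the pairwise fine-tuning step described above: two biographies are encoded as a single sentence pair and scored by a binary classification head. The checkpoint, example texts, and label set are placeholders; ArtLM itself starts from a further pre-trained English model.

```python
# A minimal sketch of sentence-pair classification over two artist biographies.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

bio_a = "Artist A studied sculpture in Berlin and exhibits with Gallery X."
bio_b = "Artist B, also represented by Gallery X, works on large-scale sculpture."

batch = tokenizer(bio_a, bio_b, truncation=True, return_tensors="pt")  # [CLS] A [SEP] B [SEP]
logits = model(**batch).logits
print(torch.softmax(logits, dim=-1))  # probability of "connected" vs "not connected"
```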
Many prior language modeling efforts have shown that pre-training on an in-domain corpus can significantly improve performance on downstream domain-specific NLP tasks. However, the difficulties associated with collecting enough in-domain data might discourage researchers from approaching this pre-training task. In this paper, we conducted a series of experiments by pre-training Bidirectional Encoder Representations from Transformers (BERT) with different sizes of biomedical corpora. The results demonstrate that pre-training on a relatively small amount of in-domain data (4GB) with limited training steps, can lead to better performance on downstream domain-specific NLP tasks compared with fine-tuning models pre-trained on general corpora.
Advances in Natural Language Processing (NLP) are spreading across various domains in the form of practical applications and academic interest. Inherently, the legal domain contains a vast amount of data in text format. Therefore, it requires the application of NLP to cater to the analytically demanding needs of the domain. Identifying important sentences, facts, and arguments in a legal case is a tedious task for legal professionals. In this study, we explore the usage of sentence embeddings to identify important sentences in a legal case, from the perspective of the main parties of the case. In addition, a task-specific loss function is defined to improve the accuracy restricted by the straightforward use of categorical cross-entropy loss.
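As one possible instantiation of a "task-specific" alternative to plain categorical cross-entropy, the sketch below uses a class-weighted cross-entropy that up-weights the rarer "important sentence" class. This is only an illustrative assumption; the exact loss used in the study above is not reproduced here, and the weights and logits are toy values.

```python
# A minimal sketch of a class-weighted cross-entropy loss in PyTorch.
import torch
import torch.nn as nn

# e.g. up-weight the rare "important" class relative to "not important"
class_weights = torch.tensor([0.3, 0.7])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.tensor([[1.2, -0.4], [0.1, 0.8]])  # sentence-level predictions
targets = torch.tensor([0, 1])                    # toy gold labels
loss = criterion(logits, targets)
print(loss.item())
```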
Neural language representation models such as BERT pre-trained on large-scale corpora can well capture rich semantic patterns from plain text, and be fine-tuned to consistently improve the performance of various NLP tasks. However, the existing pre-trained language models rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better language understanding. We argue that informative entities in KGs can enhance language representation with external knowledge. In this paper, we utilize both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE), which can take full advantage of lexical, syntactic, and knowledge information simultaneously. The experimental results have demonstrated that ERNIE achieves significant improvements on various knowledge-driven tasks, and meanwhile is comparable with the state-of-the-art model BERT on other common NLP tasks. The source code and experiment details of this paper can be obtained from https://github.com/thunlp/ERNIE.
Identifying, classifying, and analyzing arguments in legal discourse has been a prominent area of research since the inception of the argument mining field. However, there is a major discrepancy between how natural language processing (NLP) researchers model and annotate arguments in court decisions and how legal experts understand and analyze legal argumentation. While computational approaches typically simplify arguments into generic premises and claims, arguments in legal research usually exhibit a rich typology that is important for gaining insights into the particular case and the application of law in general. We address this problem and make several substantial contributions to move the field forward. First, we design a new annotation scheme for legal arguments in proceedings of the European Court of Human Rights (ECHR) that is deeply rooted in the theory and practice of legal argumentation research. Second, we compile and annotate a large corpus of 373 court decisions (2.3M tokens and 15K annotated argument spans). Finally, we train an argument mining model that outperforms state-of-the-art models in the legal NLP domain and provide a thorough expert-based evaluation. All datasets and source code are available under open licenses at https://github.com/trusthlt/mining-legal-arguments.
Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured texts. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin.