Recent work has shown that Pre-trained Language Models (PLMs) store the relational knowledge learned from data and utilize it for performing downstream tasks. However, commonsense knowledge may vary across regions. For instance, the color of a bridal dress is white in American weddings whereas it is red in Chinese weddings. In this paper, we introduce a benchmark dataset, Geo-Diverse Commonsense Multilingual Language Models Analysis (GeoMLAMA), for probing the diversity of the relational knowledge in multilingual PLMs. GeoMLAMA contains 3,125 prompts in English, Chinese, Hindi, Persian, and Swahili, with a wide coverage of concepts shared by people from American, Chinese, Indian, Iranian and Kenyan cultures. We benchmark 11 standard multilingual PLMs on GeoMLAMA. Interestingly, we find that 1) larger multilingual PLM variants do not necessarily store geo-diverse concepts better than their smaller variants; 2) multilingual PLMs are not intrinsically biased towards knowledge from Western countries (the United States); 3) the native language of a country may not be the best language to probe its knowledge; and 4) a language may better probe knowledge about a non-native country than about its native country. Code and data are released at https://github.com/WadeYin9712/GeoMLAMA.
Translated by Google Translate
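Large-scale autoregressive language models such as GPT-3 are few-shot learners that can perform a wide range of language tasks without fine-tuning. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities in a wide range of tasks. Our largest model, with 7.5 billion parameters, sets a new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in each of 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 translation directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning on some tasks, while there is still room for improvement on surface form robustness and adaptation to tasks that do not have a natural cloze form. Finally, we evaluate our models on hate speech detection in five languages and find that they have limitations similar to those of comparably sized GPT-3 models.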
Conceptual knowledge is fundamental to human cognition and knowledge bases. However, existing knowledge probing works focus only on evaluating the factual knowledge of pre-trained language models (PLMs) and ignore conceptual knowledge. Since conceptual knowledge often appears as implicit commonsense behind texts, designing probes for conceptual knowledge is hard. Inspired by knowledge representation schemata, we comprehensively evaluate the conceptual knowledge of PLMs by designing three tasks to probe whether PLMs organize entities by conceptual similarities, learn conceptual properties, and conceptualize entities in contexts, respectively. For the tasks, we collect and annotate 24k data instances covering 393 concepts, forming COPEN, a COnceptual knowledge Probing bENchmark. Extensive experiments on different sizes and types of PLMs show that existing PLMs systematically lack conceptual knowledge and suffer from various spurious correlations. We believe this is a critical bottleneck for realizing human-like cognition in PLMs. COPEN and our code are publicly released at https://github.com/THU-KEG/COPEN.
Pretrained language models (PLMs) often fail to fairly represent target users from certain world regions because of the under-representation of those regions in training datasets. With recent PLMs trained on enormous data sources, quantifying their potential biases is difficult, due to their black-box nature and the sheer scale of the data sources. In this work, we devise an approach to study the geographic bias (and knowledge) present in PLMs, proposing a Geographic-Representation Probing Framework that adopts a self-conditioning method coupled with entity-country mappings. Our findings suggest that PLMs' representations map surprisingly well to the physical world in terms of country-to-country associations, but this knowledge is unequally shared across languages. Lastly, we explain how large PLMs, despite exhibiting notions of geographical proximity, over-amplify geopolitical favouritism at inference time.
We study politeness phenomena in nine typologically diverse languages. Politeness is an important facet of communication and is sometimes argued to be culture-specific, yet existing computational linguistic studies are limited to English. We create TyDiP, a dataset containing three-way politeness annotations for 500 examples in each language, totaling 4.5K examples. We evaluate how well multilingual models can identify politeness levels: they show a fairly robust zero-shot transfer ability, yet fall significantly short of estimated human accuracy. We further study mapping the English politeness strategy lexicon into nine languages via automatic translation and lexicon induction, analyzing whether each strategy's impact stays consistent across languages. Lastly, we empirically study the complicated relationship between formality and politeness through transfer experiments. We hope our dataset will support various research questions and applications, from evaluating multilingual models to constructing polite multilingual agents.
Recent progress in pretraining language models on large textual corpora led to a surge of improvements for downstream NLP tasks. Whilst learning linguistic knowledge, these models may also be storing relational knowledge present in the training data, and may be able to answer queries structured as "fill-in-the-blank" cloze statements. Language models have many advantages over structured knowledge bases: they require no schema engineering, allow practitioners to query about an open class of relations, are easy to extend to more data, and require no human supervision to train. We present an in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models. We find that (i) without fine-tuning, BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge, (ii) BERT also does remarkably well on open-domain question answering against a supervised baseline, and (iii) certain types of factual knowledge are learned much more readily than others by standard language model pretraining approaches. The surprisingly strong ability of these models to recall factual knowledge without any fine-tuning demonstrates their potential as unsupervised open-domain QA systems. The code to reproduce our analysis is available at https://github.com/facebookresearch/LAMA.
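As a rough illustration of the cloze-style probing this abstract describes, the sketch below scores a predictor by precision at k over fill-in-the-blank statements. It is not the LAMA code: `toy_predict`, the `[MASK]` placeholder convention, and the three example queries are hypothetical stand-ins, and a real probe would replace `toy_predict` with ranked predictions from a pretrained language model.

```python
from typing import Callable, List, Sequence, Tuple

# A cloze query: a statement with one blank, plus the expected filler.
ClozeQuery = Tuple[str, str]


def precision_at_k(
    queries: Sequence[ClozeQuery],
    predict: Callable[[str], List[str]],
    k: int = 1,
) -> float:
    """Fraction of queries whose gold answer is among the top-k predictions."""
    if not queries:
        return 0.0
    hits = sum(1 for prompt, gold in queries if gold in predict(prompt)[:k])
    return hits / len(queries)


# Hypothetical stand-in for a language model: a fixed table of ranked guesses.
_TOY_RANKINGS = {
    "Dante was born in [MASK].": ["Florence", "Rome", "Italy"],
    "The capital of France is [MASK].": ["Paris", "Lyon", "France"],
    "iPod Touch is produced by [MASK].": ["Sony", "Apple", "Nintendo"],
}


def toy_predict(prompt: str) -> List[str]:
    """Return ranked candidate fillers for the blank (empty if unknown)."""
    return _TOY_RANKINGS.get(prompt, [])


# Illustrative queries only; they are not taken from any benchmark.
QUERIES: List[ClozeQuery] = [
    ("Dante was born in [MASK].", "Florence"),
    ("The capital of France is [MASK].", "Paris"),
    ("iPod Touch is produced by [MASK].", "Apple"),
]

if __name__ == "__main__":
    print("P@1:", precision_at_k(QUERIES, toy_predict, k=1))
    print("P@2:", precision_at_k(QUERIES, toy_predict, k=2))
```

Keeping the predictor behind a plain callable means the same scoring function could be reused with any model that returns a ranked list of candidate fillers.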
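Language models (LMs) trained on raw text have no direct access to the physical world. Gordon and Van Durme (2013) point out that LMs may thus suffer from reporting bias: texts rarely report common facts and instead focus on the unusual aspects of a situation. If LMs are trained only on text corpora and naively memorize local co-occurrence statistics, they would naturally learn a biased view of the physical world. While prior studies have repeatedly verified that LMs of smaller scales (e.g., RoBERTa, GPT-2) amplify reporting bias, it remains unknown whether this trend continues as models are scaled up. We investigate reporting bias from the perspective of color in larger language models (LLMs) such as PaLM and GPT-3. Specifically, we query LLMs for the typical color of objects, which is a simple type of perceptually grounded physical commonsense. Surprisingly, we find that LLMs significantly outperform smaller LMs in determining an object's typical color and track human judgments more closely, rather than overfitting to surface patterns stored in texts. This suggests that very large language models, trained on language alone, are able to overcome certain types of reporting bias that are characterized by local co-occurrences.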
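Large language models pre-trained with self-supervised learning have demonstrated impressive zero-shot capabilities on a wide variety of tasks. In this work, we present WeLM: a well-read pre-trained language model for Chinese that is able to seamlessly perform different types of tasks with zero or few-shot demonstrations. WeLM is trained with 10B parameters by "reading" a curated high-quality corpus covering a wide range of topics. We show that WeLM has broad knowledge of various domains and languages. On 18 monolingual (Chinese) tasks, WeLM can significantly outperform existing pre-trained models of similar size and match the performance of models up to 25 times larger. WeLM also exhibits strong capabilities in multilingual and code-switching understanding, outperforming existing multilingual language models pre-trained on 30 languages. Furthermore, we collected human-written prompts for a large set of supervised datasets in Chinese and fine-tuned WeLM with multi-prompted training. The resulting model attains strong generalization on unseen types of tasks and outperforms the unsupervised WeLM in zero-shot learning. Finally, we demonstrate that WeLM has basic skills at explaining and calibrating its own decisions, which may be a promising direction for future research. Our models can be applied for at https://welm.weixin.qq.com/docs/api/.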
Task-agnostic generative pretraining (GPT) has recently proved promising for zero- and few-shot learning, gradually diverting attention from the expensive supervised learning paradigm. Although the community is accumulating knowledge about the capabilities of English-language autoregressive models such as GPT-3 that adopt this generative approach, scholarship about these models remains acutely Anglocentric. Consequently, the community currently has serious gaps in its understanding of this class of models, their potential, and their societal impacts in diverse settings, linguistic traditions, and cultures. To alleviate this issue for Arabic, a collection of diverse languages and language varieties with a population of more than 400 million, we introduce JASMINE, a suite of powerful Arabic autoregressive Transformer language models ranging in size from 300 million to 13 billion parameters. We pretrain our new models with large amounts of diverse data (400GB of text) from different Arabic varieties and domains. We evaluate JASMINE extensively in both intrinsic and extrinsic settings, using a comprehensive benchmark for zero- and few-shot learning across a wide range of NLP tasks. We also carefully develop and release a novel benchmark for both automated and human evaluation of Arabic autoregressive models, focused on investigating potential social biases, harms, and toxicity in these models. We aim to responsibly release our models to interested researchers, along with code for experimenting with them.
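Petroni et al. (2019) demonstrated that it is possible to retrieve world facts from a pre-trained language model by expressing them as cloze-style prompts, and to interpret the model's prediction accuracy as a lower bound on the amount of factual information it encodes. Subsequent work has attempted to tighten the estimate by searching for better prompts, using a disjoint set of facts as training data. In this work, we make two complementary contributions to better understand these factual probing techniques. First, we propose OptiPrompt, a novel and efficient method which directly optimizes in continuous embedding space. We find that this simple method is able to predict an additional 6.4% of facts in the LAMA benchmark. Second, we raise a more important question: can we really interpret these probing results as a lower bound? Is it possible that these prompt-search methods also learn from the training data? We find, somewhat surprisingly, that the training data used by these methods contains certain regularities of the underlying fact distribution, and that all existing prompt methods, including ours, are able to exploit them for better fact prediction. We conduct a set of control experiments to disentangle "learning" from "learning to recall", providing a more detailed picture of what different prompts can reveal about pre-trained language models.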
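Large language models have shown impressive performance on many natural language processing (NLP) tasks in a zero-shot setting. We ask whether these models exhibit commonsense understanding, a critical component of NLP applications, by evaluating models against four commonsense benchmarks. We find that the impressive zero-shot performance of large language models is mostly due to dataset bias in our benchmarks. We also show that zero-shot performance is sensitive to the choice of hyperparameters and to the similarity of the benchmark to the pre-training datasets. Moreover, we did not observe substantial improvements when evaluating models in a few-shot setting. Finally, in contrast to previous work, we find that leveraging explicit commonsense knowledge does not yield substantial improvement.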
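Cross-lingual pre-training has achieved great success using monolingual and bilingual plain text corpora. However, most pre-trained models neglect multilingual knowledge, which is language-agnostic but comprises abundant cross-lingual structure alignment. In this paper, we propose XLM-K, a cross-lingual language model that incorporates multilingual knowledge in pre-training. XLM-K augments existing multilingual pre-training with two knowledge tasks, namely a Masked Entity Prediction Task and an Object Entailment Task. We evaluate XLM-K on MLQA, NER and XNLI. Experimental results clearly demonstrate significant improvements over existing multilingual language models. The results on MLQA and NER exhibit the superiority of XLM-K in knowledge-related tasks. The success on XNLI shows the better cross-lingual transferability obtained in XLM-K. Moreover, we provide a detailed probing analysis to confirm that the desired knowledge is captured in our pre-training regimen.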
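Language can be used as a means of reproducing and enforcing harmful stereotypes and biases, and has been analysed as such in numerous studies. In this paper, we present a survey of 304 papers on gender bias in natural language processing. We analyse definitions of gender and its categories within the social sciences and connect them to formal definitions of gender bias in NLP research. We survey the lexica and datasets applied in research on gender bias and then compare and contrast approaches to detecting and mitigating gender bias. We find that research on gender bias suffers from four core limitations. 1) Most research treats gender as a binary variable, neglecting its fluidity and continuity. 2) Most of the work has been conducted in monolingual setups for English or other high-resource languages. 3) Despite a myriad of papers on gender bias in NLP methods, we find that most of the newly developed algorithms do not test their models for bias and disregard possible ethical considerations of their work. 4) Finally, methodologies developed in this line of research are fundamentally flawed, covering very limited definitions of gender bias and lacking evaluation baselines and pipelines. We suggest recommendations towards overcoming these limitations as a guide for future research.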
In this work, we introduce IndicXTREME, a benchmark consisting of nine diverse tasks covering 18 languages from the Indic sub-continent belonging to four different families. Across languages and tasks, IndicXTREME contains a total of 103 evaluation sets, of which 51 are new contributions to the literature. To maintain high quality, we only use human annotators to curate or translate our datasets (for IndicXParaphrase, where an automatic translation system is used, a second human verification and correction step is done). To the best of our knowledge, this is the first effort toward creating a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models. We also release IndicCorp v2, an updated and much larger version of IndicCorp that contains 20.9 billion tokens in 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate it on IndicXTREME to show that it outperforms existing multilingual language models such as XLM-R and MuRIL.
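Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language model pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language task into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely unseen tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16 times its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6 times its size. All prompts and trained models are available at https://github.com/bigscience-workshop/promptsource/ and https://huggingface.co/bigscience/T0pp.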
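Transformer-based language models have recently achieved remarkable results in many natural language tasks. However, performance on leaderboards is generally achieved by leveraging massive amounts of training data, and rarely by encoding explicit linguistic knowledge into neural models. This has led many to question the relevance of linguistics for modern natural language processing. In this dissertation, I present several case studies to illustrate how theoretical linguistics and neural language models are still relevant to each other. First, language models are useful to linguists by providing an objective tool to measure semantic distance, which is difficult to do using traditional methods. On the other hand, linguistic theory contributes to language modelling research by providing frameworks and sources of data to probe our language models for specific aspects of language understanding. This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models. In the first part of my thesis, I apply language models to the problem of word class flexibility. Using mBERT as a source of semantic distance measurements, I present evidence in favour of analyzing word class flexibility as a directional process. In the second part of my thesis, I propose a method to measure surprisal at intermediate layers of language models. My experiments show that sentences containing morphosyntactic anomalies trigger surprisal earlier in language models than semantic and commonsense anomalies do. Finally, in the third part of my thesis, I adapt several psycholinguistic studies to show that language models contain knowledge of argument structure constructions. In summary, my thesis develops new connections between natural language processing, linguistic theory, and psycholinguistics to provide fresh perspectives for the interpretation of language models.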
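Language Models (LMs) have proven to be useful in various downstream applications, such as summarization, translation, question answering and text classification. LMs are becoming increasingly important tools in Artificial Intelligence because of the vast quantity of information they can store. In this work, we present ProP (Prompting as Probing), which utilizes GPT-3, a large language model originally proposed by OpenAI in 2020, to perform the task of Knowledge Base Construction (KBC). ProP implements a multi-step approach that combines a variety of prompting techniques to achieve this. Our results show that manual prompt curation is essential, that the LM must be encouraged to give answer sets of variable lengths, in particular including empty answer sets, that true/false questions are a useful device to increase the precision of suggestions generated by the LM, that the size of the LM is a crucial factor, and that a dictionary of entity aliases improves the LM score. Our evaluation study indicates that these proposed techniques can substantially enhance the quality of the final predictions: ProP won track 2 of the LM-KBC competition, outperforming the baseline by 36.4 percentage points. Our implementation is available at https://github.com/hemile/iswc-challenge.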
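Nowadays, many contextualized word representations are learned by intricate neural network models, such as masked neural language models (MNLMs), which are made up of huge neural network structures and trained to restore masked text. Such representations demonstrate superhuman performance in some reading comprehension (RC) tasks, which extract a proper answer from the context given a question. However, identifying the detailed knowledge trained in MNLMs is challenging owing to the large number of model parameters. This paper provides new insights and empirical analyses on the commonsense knowledge included in MNLMs. First, we use a diagnostic test to evaluate whether commonsense knowledge is properly trained in MNLMs. We observe that a large proportion of commonsense knowledge is not appropriately trained in MNLMs, and that MNLMs do not often understand the semantic meaning of relations accurately. In addition, we find that MNLM-based RC models are still vulnerable to semantic variations that require commonsense knowledge. Finally, we identify the fundamental reason why some knowledge is not trained. We further suggest that utilizing an external commonsense knowledge repository can be an effective solution. In controlled experiments, we illustrate the possibility of overcoming the limitations of MNLM-based RC models by enriching text with the required knowledge from an external commonsense knowledge repository.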
The remarkable success of pretrained language models has motivated the study of what kinds of knowledge these models learn during pretraining. Reformulating tasks as fill-in-the-blanks problems (e.g., cloze tests) is a natural approach for gauging such knowledge; however, its usage is limited by the manual effort and guesswork required to write suitable prompts. To address this, we develop AUTOPROMPT, an automated method to create prompts for a diverse set of tasks, based on a gradient-guided search. Using AUTOPROMPT, we show that masked language models (MLMs) have an inherent capability to perform sentiment analysis and natural language inference without additional parameters or finetuning, sometimes achieving performance on par with recent state-of-the-art supervised models. We also show that our prompts elicit more accurate factual knowledge from MLMs than the manually created prompts on the LAMA benchmark, and that MLMs can be used as relation extractors more effectively than supervised relation extraction models. These results demonstrate that automatically generated prompts are a viable parameter-free alternative to existing probing methods, and as pretrained LMs become more sophisticated and capable, potentially a replacement for finetuning.
Open-Domain Generative Question Answering has achieved impressive performance in English by combining document-level retrieval with answer generation. These approaches, which we refer to as GenQA, can generate complete sentences, effectively answering both factoid and non-factoid questions. In this paper, we extend GenQA to the multilingual and cross-lingual settings. For this purpose, we first introduce GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian. Based on GenTyDiQA, we design a cross-lingual generative model that produces full-sentence answers by exploiting passages written in multiple languages, including languages different from the question. Our cross-lingual generative system outperforms answer sentence selection baselines for all 5 languages and monolingual generative pipelines for three out of five languages studied.