Language Models appear to perform poorly on quantification. We ask how badly. 'Few'-type quantifiers, as in 'few children like vegetables', might pose a particular challenge for Language Models, since the sentence components without the quantifier are likely to co-occur, and because 'few'-type quantifiers are rare. We present 960 sentence stimuli from two human neurolinguistic experiments to 22 autoregressive transformer models of differing sizes. Not only do the models perform poorly on 'few'-type quantifiers, but overall the larger the model, the worse its performance. We interpret this inverse scaling as suggesting that larger models increasingly reflect online rather than offline human processing, and argue that decreasing performance of larger models may challenge uses of Language Models as the basis for Natural Language Systems.
Are the predictions of humans and language models affected by similar things? Research suggests that while comprehending language, humans make predictions about upcoming words, with more predictable words being processed more easily. However, evidence also shows that humans display a similar processing advantage for highly anomalous words when these words are semantically related to the preceding context or to the most probable continuation. Using stimuli from 3 psycholinguistic experiments, we find that this is almost always also the case for 8 contemporary transformer language models (BERT, ALBERT, RoBERTa, XLM-R, GPT-2, GPT-Neo, GPT-J, and XGLM). We then discuss the implications of this phenomenon for our understanding of both human language comprehension and the predictions made by language models.
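Some languages allow arguments to be omitted in certain contexts. Yet human language comprehenders reliably infer the intended referents of these zero pronouns, in part because they build expectations about which referents are more likely. We ask whether neural language models also extract the same expectations. We test whether 12 contemporary language models display expectations that reflect human behavior when exposed to sentences with zero pronouns from five behavioral experiments conducted in Italian by Carminati (2005). We find that three models (XGLM 2.9B, 4.5B, and 7.5B) capture the human behavior from all the experiments, while the others successfully model some of the results. This result suggests that human expectations about coreference can be derived from exposure to language, and also indicates features of language models that allow them to better reflect human behavior.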
Targeted syntactic evaluations of language models ask whether models show stable preferences for syntactically acceptable content over minimal-pair unacceptable inputs. Most targeted syntactic evaluation datasets ask models to make these judgements with just a single context-free sentence as input. This does not match language models' training regime, in which input sentences are always highly contextualized by the surrounding corpus. This mismatch raises an important question: how robust are models' syntactic judgements in different contexts? In this paper, we investigate the stability of language models' performance on targeted syntactic evaluations as we vary properties of the input context: the length of the context, the types of syntactic phenomena it contains, and whether or not there are violations of grammaticality. We find that model judgements are generally robust when placed in randomly sampled linguistic contexts. However, they are substantially unstable for contexts containing syntactic structures matching those in the critical test content. Among all tested models (GPT-2 and five variants of OPT), we significantly improve models' judgements by providing contexts with matching syntactic structures, and conversely significantly worsen them using unacceptable contexts with matching but violated syntactic structures. This effect is amplified by the length of the context, except for unrelated inputs. We show that these changes in model performance are not explainable by simple features matching the context and the test inputs, such as lexical overlap and dependency overlap. This sensitivity to highly specific syntactic features of the context can only be explained by the models' implicit in-context learning abilities.
Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022), from 125M to 175B parameters, on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model sizes, a similar subset of training tokens see the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of the model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
Task-agnostic generative pretraining (GPT) has recently proved promising for zero- and few-shot learning, gradually diverting attention from the expensive supervised learning paradigm. Although the community is accumulating knowledge as to capabilities of English-language autoregressive models such as GPT-3 adopting this generative approach, scholarship about these models remains acutely Anglocentric. Consequently, the community currently has serious gaps in its understanding of this class of models, their potential, and their societal impacts in diverse settings, linguistic traditions, and cultures. To alleviate this issue for Arabic, a collection of diverse languages and language varieties with a combined population of more than 400 million, we introduce JASMINE, a suite of powerful Arabic autoregressive Transformer language models ranging in size from 300 million to 13 billion parameters. We pretrain our new models with large amounts of diverse data (400GB of text) from different Arabic varieties and domains. We evaluate JASMINE extensively in both intrinsic and extrinsic settings, using a comprehensive benchmark for zero- and few-shot learning across a wide range of NLP tasks. We also carefully develop and release a novel benchmark for both automated and human evaluation of Arabic autoregressive models, focused on investigating potential social biases, harms, and toxicity in these models. We aim to responsibly release our models to interested researchers, along with code for experimenting with them.
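Abstract reasoning is a key ability for an intelligent system. Large language models perform well on abstract reasoning tasks, but exhibit many imperfections. However, human abstract reasoning is also imperfect, and depends on our knowledge and beliefs about the content of the reasoning problem. For example, humans reason much more reliably about logical rules that are grounded in everyday situations than about arbitrary rules concerning abstract attributes. The training experiences of language models similarly endow them with prior expectations that reflect human knowledge and beliefs. We therefore hypothesize that language models will show human-like content effects on abstract reasoning problems. We explore this hypothesis across three logical reasoning tasks: natural language inference, judging the logical validity of syllogisms, and the Wason selection task (Wason, 1968). We find that state-of-the-art large language models (with 7 or 70 billion parameters; Hoffman et al., 2022) reflect many of the same patterns observed in humans across these tasks: like humans, models reason more effectively about believable situations than about unrealistic or abstract ones. Our findings have implications for understanding both these cognitive effects and the factors that contribute to language model performance.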
Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code. However, language models are not trained to perform well at these tasks; they are trained to accurately predict the next token given previous tokens in tokenized text. It is not clear whether language models are better or worse than humans at next-token prediction. To try to answer this question, we performed two distinct experiments to directly compare humans and language models on this front: one measuring top-1 accuracy and the other measuring perplexity. In both experiments, we find humans to be consistently worse than even relatively small language models like GPT3-Ada at next-token prediction.
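The two metrics named in this abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the dictionary-based representation of predictive distributions, and the toy data are all assumptions made here for illustration only.

```python
import math

def top1_accuracy(predictions, targets):
    """Fraction of positions where the most probable predicted token
    equals the true next token."""
    hits = 0
    for dist, target in zip(predictions, targets):
        guess = max(dist, key=dist.get)  # token with the highest probability
        hits += guess == target
    return hits / len(targets)

def perplexity(predictions, targets, floor=1e-12):
    """Exponential of the mean negative log-probability assigned to the
    true next tokens. `floor` (an assumption of this sketch) avoids log(0)
    when the true token received no probability mass."""
    total_nll = 0.0
    for dist, target in zip(predictions, targets):
        p = max(dist.get(target, 0.0), floor)
        total_nll -= math.log(p)
    return math.exp(total_nll / len(targets))

# Hypothetical toy data: one predictive distribution per position.
predictions = [
    {"the": 0.5, "a": 0.25, "cat": 0.25},
    {"cat": 0.5, "dog": 0.25, "the": 0.25},
    {"sat": 0.25, "ran": 0.5, "slept": 0.25},
]
targets = ["the", "cat", "sat"]

accuracy = top1_accuracy(predictions, targets)
ppl = perplexity(predictions, targets)
```

In this toy case the predictor is right at two of three positions, and the true tokens receive probabilities 0.5, 0.5, and 0.25, so the perplexity is the inverse of their geometric mean. Lower perplexity and higher top-1 accuracy both indicate better next-token prediction, but they can diverge: a predictor can rank the right token first while spreading its probability mass poorly.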
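Large language models have shown impressive performance on many natural language processing (NLP) tasks in a zero-shot setting. We ask whether these models exhibit commonsense understanding, a critical component of NLP applications, by evaluating models against four commonsense benchmarks. We find that the impressive zero-shot performance of large language models is mostly due to dataset bias in our benchmarks. We also show that zero-shot performance is sensitive to the choice of hyperparameters and to the similarity of the benchmark to the pre-training datasets. Moreover, we do not observe substantial improvements when evaluating models in a few-shot setting. Finally, in contrast to previous work, we find that leveraging explicit commonsense knowledge does not yield substantial improvement.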
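Transformer-based language models have recently achieved remarkable results in many natural language tasks. However, leaderboard performance is generally achieved by leveraging massive amounts of training data, and rarely by encoding explicit linguistic knowledge into neural models. This has led many to question the relevance of linguistics for modern natural language processing. In this dissertation, I present several case studies to illustrate how theoretical linguistics and neural language models are still relevant to each other. On the one hand, language models are useful to linguists because they provide an objective tool for measuring semantic distance, which is difficult to do using traditional methods. On the other hand, linguistic theory contributes to language modelling research by providing frameworks and sources of data to probe our language models for specific aspects of language understanding. This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models. In the first part of my thesis, I apply language models to the problem of word class flexibility. Using mBERT as a source of semantic distance measurements, I present evidence in favour of analyzing word class flexibility as a directional process. In the second part of my thesis, I propose a method to measure surprisal at intermediate layers of language models. My experiments show that sentences containing morphosyntactic anomalies trigger surprisal earlier in language models than semantic and commonsense anomalies do. Finally, in the third part of my thesis, I adapt several psycholinguistic studies to show that language models contain knowledge of argument structure constructions. In summary, my thesis develops new connections between natural language processing, linguistic theory, and psycholinguistics to provide fresh perspectives for the interpretation of language models.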
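We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well calibrated on diverse multiple-choice and true/false questions when these are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.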
People constantly use language to learn about the world. Computational linguists have capitalized on this fact to build large language models (LLMs) that acquire co-occurrence-based knowledge from language corpora. LLMs achieve impressive performance on many tasks, but the robustness of their world knowledge has been questioned. Here, we ask: do LLMs acquire generalized knowledge about real-world events? Using curated sets of minimal sentence pairs (n=1215), we tested whether LLMs are more likely to generate plausible event descriptions compared to their implausible counterparts. We found that LLMs systematically distinguish possible and impossible events (The teacher bought the laptop vs. The laptop bought the teacher) but fall short of human performance when distinguishing likely and unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLMs generalize well across syntactic sentence variants (active vs passive) but less well across semantic sentence variants (synonymous sentences), (iii) some, but not all LLM deviations from ground-truth labels align with crowdsourced human judgments, and (iv) explicit event plausibility information emerges in middle LLM layers and remains high thereafter. Overall, our analyses reveal a gap in LLMs' event knowledge, highlighting their limitations as generalized knowledge bases. We conclude by speculating that the differential performance on impossible vs. unlikely events is not a temporary setback but an inherent property of LLMs, reflecting a fundamental difference between linguistic knowledge and world knowledge in intelligent systems.
Pragmatics is an essential part of communication, but it remains unclear what mechanisms underlie human pragmatic communication and whether NLP systems capture pragmatic language understanding. To investigate both these questions, we perform a fine-grained comparison of language models and humans on seven pragmatic phenomena, using zero-shot prompting on an expert-curated set of English materials. We ask whether models (1) select pragmatic interpretations of speaker utterances, (2) make similar error patterns as humans, and (3) use similar linguistic cues as humans to solve the tasks. We find that the largest models achieve high accuracy and match human error patterns: within incorrect responses, models favor the literal interpretation of an utterance over heuristic-based distractors. We also find evidence that models and humans are sensitive to similar linguistic cues. Our results suggest that even paradigmatic pragmatic phenomena may be solved without explicit representations of other agents' mental states, and that artificial models can be used to gain mechanistic insights into human pragmatic processing.
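Scaling language models with more data, compute, and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale model capacity while also incurring substantially less training cost than dense variants. The largest GLaM has 1.2 trillion parameters, approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.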
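We investigate the extent to which modern neural language models are susceptible to structural priming, the phenomenon whereby the structure of a sentence makes the same structure more probable in a follow-up sentence. We explore how priming can be used to study the potential of these models to learn abstract structural information, which is a prerequisite for good performance on tasks that require natural language understanding skills. We introduce a novel metric and release Prime-LM, a large corpus in which we control for various linguistic factors that interact with priming strength. We find that Transformer models indeed show evidence of structural priming, but also that the generalizations they have learned are to some extent modulated by semantic information. Our experiments also show that the representations acquired by the models may not only encode abstract sequential structure but also involve a certain level of hierarchical syntactic information. More generally, our study shows that the priming paradigm is a useful additional tool for gaining insights into the capacities of language models, and opens the door to future priming-based investigations that probe the model's internal states.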
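Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.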
We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.
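Previous work has shown that there is a scaling law between the size of language models (LMs) and their zero-shot performance on different downstream NLP tasks. In this work, we show that this phenomenon does not hold when evaluating large LMs on tasks with negated prompts; instead, they exhibit an inverse scaling law. We evaluate 9 different tasks with negated prompts on (1) pretrained LMs (OPT & GPT-3) of varying sizes (125M - 175B), (2) LMs further pretrained to generalize to novel prompts (InstructGPT), (3) LMs provided with few-shot examples, and (4) LMs fine-tuned specifically on negated prompts. All LM types perform worse on negated prompts and show a huge gap from human performance when comparing the average score on original and negated prompts. By highlighting a critical limitation of existing LMs and methods, we urge the community to develop new approaches for building LMs that actually follow the given instructions. We provide the code and the datasets to explore negated prompts at https://github.com/joeljang/negated-prompts-for-llms.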
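Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure (e.g., individual dependencies), model-generated text is substantially less novel than our baseline of human-generated text from each model's test set. For larger-scale structure (e.g., overall sentence structure), model-generated text is as novel as or even more novel than the human-generated baseline, but models still sometimes copy substantially, in some cases duplicating passages over 1,000 words long from the training set. We also perform extensive manual analysis showing that GPT-2's novel text is usually well-formed morphologically and syntactically but has reasonably frequent semantic issues (e.g., being self-contradictory).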
The recent success of large language models for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive transformers in generating machine-paraphrased plagiarism and their detection is still developing in the literature. This work explores T5 and GPT-3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia. We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software and perform a human study with 105 participants regarding their detection performance and the quality of generated examples. Our results suggest that large models can rewrite text humans have difficulty identifying as machine-paraphrased (53% mean acc.). Human experts rate the quality of paraphrases generated by GPT-3 as high as original texts (clarity 4.0/5, fluency 4.2/5, coherence 3.8/5). The best-performing detection model (GPT-3) achieves a 66% F1-score in detecting paraphrases.