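My doctoral research focuses on understanding semantic knowledge in neural network models trained to predict natural language (referred to as language models, or LMs), drawing on insights from the study of concepts and categories in cognitive science. I propose a framework inspired by "inductive reasoning," a phenomenon that sheds light on how humans use background knowledge to make inductive leaps and generalize from new information about concepts and their properties. Drawing on experiments that study inductive reasoning, I propose to analyze semantic inductive generalization in LMs using phenomena observed in the human induction literature, to investigate inductive behavior on tasks such as implicit reasoning and emergent feature recognition, and to analyze induction dynamics and relate them to the learned conceptual representation space.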
Reasoning, as an essential ability for complex problem-solving, can provide back-end support for various real-world applications, such as medical diagnosis and negotiation. This paper provides a comprehensive survey of cutting-edge research on reasoning with language model prompting. We introduce research works with comparisons and summaries and provide systematic resources to help beginners. We also discuss potential reasons for the emergence of such reasoning abilities and highlight future research directions.
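Rapid progress in machine learning for natural language processing has the potential to transform debates about how humans learn language. However, the learning environments and biases of current artificial learners and humans diverge in ways that weaken the impact of the evidence obtained from learning simulations. For example, today's most effective neural language models are trained on roughly one thousand times the amount of linguistic data available to a typical child. To increase the relevance of learnability results from computational models, we need to train model learners that have no significant advantages over humans. If a suitable model successfully acquires some target linguistic knowledge, it can provide a proof of concept that the target is learnable in a hypothesized human learning scenario. Plausible model learners will enable us to carry out experimental manipulations to make causal inferences about variables in the learning environment, and to rigorously test poverty-of-the-stimulus-style claims that argue for innate linguistic knowledge in humans on the basis of speculations about learnability. Comparable experiments will never be possible with human subjects because of practical and ethical considerations, which makes model learners an indispensable resource. So far, attempts to deprive current models of unfair advantages obtain sub-human results for key grammatical behaviors such as acceptability judgments. But before we can justifiably conclude that language learning requires more prior domain-specific knowledge than current models possess, we must first explore non-linguistic inputs, in the form of multimodal stimuli and multi-agent interaction, as ways to make our learners more efficient at learning from limited linguistic input.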
Language models (LMs) are trained on collections of documents, written by individual human agents to achieve specific goals in an outside world. During training, LMs have access only to text of these documents, with no direct evidence of the internal states of the agents that produced them -- a fact often used to argue that LMs are incapable of modeling goal-directed aspects of human language production and comprehension. Can LMs trained on text learn anything at all about the relationship between language and use? I argue that LMs are models of intentional communication in a specific, narrow sense. When performing next word prediction given a textual context, an LM can infer and represent properties of an agent likely to have produced that context. These representations can in turn influence subsequent LM generation in the same way that agents' communicative intentions influence their language. I survey findings from the recent literature showing that -- even in today's non-robust and error-prone models -- LMs infer and use representations of fine-grained communicative intentions and more abstract beliefs and goals. Despite the limited nature of their training data, they can thus serve as building blocks for systems that communicate and act intentionally.
Pre-trained language models (LMs) have shown remarkable reasoning performance using explanations (or "chain-of-thought" (CoT)) for in-context learning. On the other hand, these reasoning tasks are usually presumed to be more approachable for symbolic programming. To make progress towards understanding in-context learning, we curate synthetic datasets containing equivalent (natural, symbolic) data pairs, where symbolic examples contain first-order logic rules and predicates from knowledge bases (KBs). Then we revisit neuro-symbolic approaches and use Language Models as Logic Programmer (LMLP) that learns from demonstrations containing logic rules and corresponding examples to iteratively reason over KBs, recovering Prolog's backward chaining algorithm. Comprehensive experiments are included to systematically compare LMLP with CoT in deductive reasoning settings, showing that LMLP enjoys more than 25% higher accuracy than CoT on length generalization benchmarks even with fewer parameters.
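For readers unfamiliar with the backward chaining that the abstract above refers to, the following is a minimal sketch of the classical algorithm over a toy knowledge base. It is not the LMLP method itself: the facts, the rule, and all function names (`unify`, `solve`, `query`) are hypothetical and chosen only for illustration. As in Prolog, terms that start with an uppercase letter are treated as variables.

```python
from itertools import count

# Hypothetical toy knowledge base (illustration only, not from the paper).
# Facts are ground tuples; rules are (head, body) pairs.
FACTS = [("parent", "alice", "bob"), ("parent", "bob", "carol")]
RULES = [
    (("grandparent", "X", "Z"), [("parent", "X", "Y"), ("parent", "Y", "Z")]),
]

_fresh = count()


def is_var(term):
    # Prolog convention: variables start with an uppercase letter.
    return term[:1].isupper()


def walk(term, subst):
    # Follow variable bindings until a constant or an unbound variable is reached.
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def unify(a, b, subst):
    # Unify two atoms under subst; return the extended substitution or None.
    if len(a) != len(b):
        return None
    subst = dict(subst)
    for x, y in zip(a, b):
        x, y = walk(x, subst), walk(y, subst)
        if x == y:
            continue
        if is_var(x):
            subst[x] = y
        elif is_var(y):
            subst[y] = x
        else:
            return None
    return subst


def rename(rule):
    # Give the rule's variables fresh names so separate uses do not clash.
    n = next(_fresh)

    def fix(atom):
        return tuple(f"{t}_{n}" if is_var(t) else t for t in atom)

    head, body = rule
    return fix(head), [fix(atom) for atom in body]


def solve(goals, subst=None):
    # Depth-first backward chaining: prove the first goal, then the rest.
    subst = {} if subst is None else subst
    if not goals:
        yield subst
        return
    first, rest = goals[0], goals[1:]
    for fact in FACTS:
        s = unify(first, fact, subst)
        if s is not None:
            yield from solve(rest, s)
    for rule in RULES:
        head, body = rename(rule)
        s = unify(first, head, subst)
        if s is not None:
            # Replace the goal by the rule body (the "backward" step).
            yield from solve(body + rest, s)


def query(goal):
    # Return the goal instantiated with every solution found.
    return [tuple(walk(t, s) for t in goal) for s in solve([goal])]
```

Calling `query(("grandparent", "alice", "Who"))` works backwards from the goal to the two `parent` facts. The sketch has no occurs check and no loop detection, so left-recursive rules would not terminate.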
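Previous work has shown that there is a scaling law between the size of language models (LMs) and their zero-shot performance on different downstream NLP tasks. In this work, we show that this phenomenon does not hold when evaluating large LMs on tasks with negated prompts, which instead exhibit an inverse scaling law. We evaluate 9 different tasks with negated prompts on (1) pretrained LMs (OPT and GPT-3) of varying sizes (125M to 175B), (2) LMs further pretrained to generalize to novel prompts (InstructGPT), (3) LMs provided with few-shot examples, and (4) LMs fine-tuned specifically on negated prompts. All LM types perform worse on negated prompts as they scale, and they show a large gap to human performance when comparing the average score on original and negated prompts. By highlighting a critical limitation of existing LMs and methods, we urge the community to develop new approaches for building LMs that actually follow the given instructions. We provide the code and datasets for exploring negated prompts at https://github.com/joeljang/negated-prompts-for-llms.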
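Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, because they are largely black-box models and therefore hard to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence-function mappings, which hinders rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in extracting analytical rules from natural language data, can help build more interpretable protein LMs that are more likely to learn relevant domain-specific rules. Differences between protein sequence data and linguistic sequence data require integrating more domain-specific knowledge into protein LMs than into natural language LMs. Here, we provide a linguistics-based roadmap for protein LM choices regarding training data, tokenization, token embedding, sequence embedding, and model interpretation. Combining linguistics with protein LMs enables the development of next-generation interpretable machine learning models with the potential to uncover the biological mechanisms underlying sequence-function relationships.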
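Commonsense question answering (CQA) aims to test whether models can answer questions about commonsense knowledge that everyone knows. Prior works that incorporate external knowledge bases have shown promising results, but knowledge bases are expensive to construct and are often limited to a fixed set of relations. In this paper, we instead focus on better utilizing the implicit knowledge stored in pre-trained language models. While researchers have found that the knowledge embedded in pre-trained language models can be extracted by having them fill in the blanks of carefully designed prompts for relation extraction and text classification, it remains unclear whether this paradigm can be adopted in CQA, where the inputs and outputs take much more flexible forms. To this end, we investigate four translation methods that can translate natural questions into cloze-style sentences to better solicit commonsense knowledge from language models: a syntactic-based model, an unsupervised neural model, and two supervised neural models. In addition, to combine the different translation methods, we propose to encourage consistency among model predictions on the different translated questions using unlabeled data. We demonstrate the effectiveness of our methods on three CQA datasets in zero-shot settings. We show that our methods are complementary to a knowledge-base-improved model, and that combining them can lead to state-of-the-art zero-shot performance. Analyses also reveal distinct characteristics of the different cloze translation methods and provide insights into why combining them leads to large improvements.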
Inductive reasoning is a core component of human intelligence. In past research on inductive reasoning within computer science, logic language has been used to represent knowledge (facts and rules, more specifically). However, logic language can cause systematic problems for inductive reasoning, such as an inability to handle raw input such as natural language, sensitivity to mislabeled data, and an incapacity to handle ambiguous input. To this end, we propose a new task, which is to induce natural language rules from natural language facts, and create a dataset termed DEER containing 1.2k rule-fact pairs for the task, where rules and facts are written in natural language. New automatic metrics are also proposed and analysed for the evaluation of this task. With DEER, we investigate a modern approach to inductive reasoning in which we use natural language instead of logic language to represent knowledge and use pretrained language models as "reasoners". Moreover, we provide the first comprehensive analysis of how well pretrained language models can induce natural language rules from natural language facts. We also propose a new framework for this task that draws insights from the philosophy literature and that, as we show in the experiment section, surpasses baselines in both automatic and human evaluations.
When people think of everyday things like an "egg," they typically have a mental image associated with it. This commonsense knowledge helps us understand how these everyday things work and how to interact with them. For example, when someone tries to make a fried egg, they know that it has a shell and that it can be cracked open to reveal the egg white and yolk inside. However, if a system does not have a coherent picture of such everyday things, thinking that the egg yolk surrounds the shell, then it might have to resort to ridiculous approaches such as trying to scrape the egg yolk off the shell into the pan. Do language models have a coherent picture of such everyday things? To investigate this, we propose a benchmark dataset consisting of 100 everyday things, their parts, and the relationships between these parts. We observe that state-of-the-art pre-trained language models (LMs) like GPT-3 and Macaw have fragments of knowledge about these entities, but they fail to produce consistent parts mental models. We propose a simple extension to these LMs where we apply a constraint satisfaction layer on top of raw predictions from LMs to produce more consistent and accurate parts mental models of everyday things.
Neural language models (LMs) have achieved impressive results on various language-based reasoning tasks by utilizing latent knowledge encoded in their own pretrained parameters. To make this reasoning process more explicit, recent works retrieve a rationalizing LM's internal knowledge by training or prompting it to generate free-text rationales, which can be used to guide task predictions made by either the same LM or a separate reasoning LM. However, rationalizing LMs require expensive rationale annotation and/or computation, without any assurance that their generated rationales improve LM task performance or faithfully reflect LM decision-making. In this paper, we propose PINTO, an LM pipeline that rationalizes via prompt-based learning, and learns to faithfully reason over rationales via counterfactual regularization. First, PINTO maps out a suitable reasoning process for the task input by prompting a frozen rationalizing LM to generate a free-text rationale. Second, PINTO's reasoning LM is fine-tuned to solve the task using the generated rationale as context, while regularized to output less confident predictions when the rationale is perturbed. Across four datasets, we show that PINTO significantly improves the generalization ability of the reasoning LM, yielding higher performance on both in-distribution and out-of-distribution test sets. Also, we find that PINTO's rationales are more faithful to its task predictions than those generated by competitive baselines.
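The regularization idea in the abstract above (solve the task with the real rationale, but be less confident when the rationale is perturbed) can be illustrated with a small loss function. This is only a sketch under stated assumptions: the abstract does not give the form of the loss, so the KL-to-uniform penalty, the `weight` parameter, and every function name here are hypothetical rather than the PINTO objective.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    # Standard task loss: negative log-probability of the gold label.
    return -math.log(probs[label])


def kl_to_uniform(probs):
    # KL(uniform || probs); it is zero exactly when probs is uniform,
    # so minimizing it pushes the prediction towards low confidence.
    k = len(probs)
    return sum((1.0 / k) * math.log((1.0 / k) / p) for p in probs)


def sketch_loss(logits_clean, logits_perturbed, label, weight=1.0):
    # Assumed form: task loss on the original rationale plus a penalty
    # for confident predictions when the rationale has been perturbed.
    task = cross_entropy(softmax(logits_clean), label)
    reg = kl_to_uniform(softmax(logits_perturbed))
    return task + weight * reg
```

With the same clean logits, a model that stays confident under a perturbed rationale (for example logits `[5, 0, 0]`) receives a larger loss than one whose perturbed prediction is uniform (`[0, 0, 0]`).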
When answering a question, people often draw upon their rich world knowledge in addition to the particular context. Recent work has focused primarily on answering questions given some relevant document or context, and required very little general background. To investigate question answering with prior knowledge, we present COMMONSENSEQA: a challenging new dataset for commonsense question answering. To capture common sense beyond associations, we extract from CONCEPTNET (Speer et al., 2017) multiple target concepts that have the same semantic relation to a single source concept. Crowd-workers are asked to author multiple-choice questions that mention the source concept and discriminate in turn between each of the target concepts. This encourages workers to create questions with complex semantics that often require prior knowledge. We create 12,247 questions through this procedure and demonstrate the difficulty of our task with a large number of strong baselines. Our best baseline is based on BERT-large (Devlin et al., 2018) and obtains 56% accuracy, well below human performance, which is 89%.
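We investigate the extent to which modern neural language models are susceptible to structural priming, the phenomenon whereby the structure of a sentence makes the same structure more probable in a follow-up sentence. We explore how priming can be used to study the potential of these models to learn abstract structural information, which is a prerequisite for good performance on tasks that require natural language understanding skills. We introduce a novel metric and release Prime-LM, a large corpus in which we control for various linguistic factors that interact with priming strength. We find that Transformer models do show evidence of structural priming, but also that the generalizations they have learned are to some extent modulated by semantic information. Our experiments also show that the representations acquired by the models may not only encode abstract sequential structure but also involve a certain level of hierarchical syntactic information. More generally, our study shows that the priming paradigm is a useful additional tool for gaining insight into the capacities of language models, and it opens the door to future priming-based investigations that probe a model's internal states.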
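One simple way to quantify structural priming in a language model is to compare the log-probability of a target sentence after a prime with the same structure against its log-probability after a prime with a different structure. The sketch below assumes that formulation; the abstract does not define its metric, so this may differ from the one the paper introduces, and the function names and inputs (per-token log-probabilities obtained from some LM) are hypothetical.

```python
def sequence_logprob(token_logprobs):
    # Log-probability of a sentence as the sum of its per-token log-probabilities.
    return sum(token_logprobs)


def priming_effect(target_after_congruent, target_after_incongruent):
    # Positive values mean the target is more probable after a prime
    # that shares its structure than after a prime that does not.
    return sequence_logprob(target_after_congruent) - sequence_logprob(
        target_after_incongruent
    )
```

For example, per-token log-probabilities of `[-1.0, -2.0]` after the congruent prime and `[-1.5, -2.5]` after the incongruent prime give a positive effect, which would indicate priming under this assumed definition.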
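Transformer-based language models have recently achieved remarkable results on many natural language tasks. However, leaderboard performance is generally achieved by leveraging massive amounts of training data, and rarely by encoding explicit linguistic knowledge into neural models. This has led many to question the relevance of linguistics to modern natural language processing. In this dissertation, I present several case studies to illustrate that theoretical linguistics and neural language models are still relevant to each other. First, language models are useful to linguists because they provide an objective tool for measuring semantic distance, which is difficult to do with traditional methods. Conversely, linguistic theory contributes to language modeling research by providing frameworks and sources of data for probing language models on specific aspects of language understanding. This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models. In the first part of the thesis, I apply language models to the problem of word class flexibility. Using mBERT as a source of semantic distance measurements, I present evidence in favor of analyzing word class flexibility as a directional process. In the second part, I propose a method for measuring surprisal at intermediate layers of language models. My experiments show that sentences containing morphosyntactic anomalies trigger surprisal earlier in language models than semantic and commonsense anomalies do. Finally, in the third part, I adapt several psycholinguistic studies to show that language models contain knowledge of argument structure constructions. In summary, my thesis develops new connections between natural language processing, linguistic theory, and psycholinguistics to provide fresh perspectives for the interpretation of language models.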
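As background for the surprisal measure mentioned above, the standard definition is the negative log-probability of a token given its context, usually reported in bits. The sketch below shows only that standard output-level definition; how the thesis computes surprisal at intermediate layers is not described in the abstract and is not reproduced here. The function names and the example probabilities are hypothetical.

```python
import math


def surprisal_bits(prob):
    # Surprisal of one token: -log2 of the probability the model assigned to it.
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def sentence_surprisal(token_probs):
    # Total surprisal of a sentence is the sum over its tokens.
    return sum(surprisal_bits(p) for p in token_probs)
```

A token predicted with probability 0.5 carries 1 bit of surprisal and one with probability 0.25 carries 2 bits, so less expected tokens (such as an anomalous word) contribute more.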
Pre-trained language models, despite their rapid advancements powered by scale, still fall short of robust commonsense capabilities. And yet, scale appears to be the winning recipe; after all, the largest models seem to have acquired the largest amount of commonsense capabilities. Or is it? In this paper, we investigate the possibility of a seemingly impossible match: can smaller language models with dismal commonsense capabilities (i.e., GPT-2), ever win over models that are orders of magnitude larger and better (i.e., GPT-3), if the smaller models are powered with novel commonsense distillation algorithms? The key intellectual question we ask here is whether it is possible, if at all, to design a learning algorithm that does not benefit from scale, yet leads to a competitive level of commonsense acquisition. In this work, we study the generative models of commonsense knowledge, focusing on the task of generating generics, statements of commonsense facts about everyday concepts, e.g., birds can fly. We introduce a novel commonsense distillation framework, I2D2, that loosely follows the Symbolic Knowledge Distillation of West et al. but breaks the dependence on the extreme-scale models as the teacher model by two innovations: (1) the novel adaptation of NeuroLogic Decoding to enhance the generation quality of the weak, off-the-shelf language models, and (2) self-imitation learning to iteratively learn from the model's own enhanced commonsense acquisition capabilities. Empirical results suggest that scale is not the only way, as novel algorithms can be a promising alternative. Moreover, our study leads to a new corpus of generics, Gen-A-Tomic, that is of the largest and highest quality available to date.
We present LogiGAN, an unsupervised adversarial pre-training framework for improving the logical reasoning abilities of language models. After automatically identifying logical reasoning phenomena in a massive text corpus via detection heuristics, we train language models to predict the masked-out logical statements. Inspired by the facilitation effect of reflective thinking in human learning, we simulate the learning-thinking process with an adversarial Generator-Verifier architecture to assist logic learning. LogiGAN implements a novel sequential GAN approach that (a) circumvents the non-differentiability challenge of the sequential GAN by leveraging the Generator as a sentence-level generative likelihood scorer with a learning objective of reaching scoring consensus with the Verifier; (b) is computationally feasible for large-scale pre-training with arbitrary target length. Both base and large size language models pre-trained with LogiGAN demonstrate clear performance improvement on 12 datasets requiring general reasoning abilities, revealing the fundamental role of logic in broad reasoning, as well as the effectiveness of LogiGAN. Ablation studies on LogiGAN components reveal the relative orthogonality between linguistic and logic abilities and suggest that reflective thinking's facilitation effect might also generalize to machine learning.
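Although pre-trained language models (LMs) have brought significant improvements to many NLP tasks, there is growing interest in exploring the capabilities of LMs and interpreting their predictions. However, existing works usually focus only on a specific capability through certain downstream tasks. There is a lack of datasets for directly evaluating the masked word prediction performance and the interpretability of pre-trained LMs. To fill this gap, we propose a novel evaluation benchmark that provides annotated data in both English and Chinese. It tests LM abilities along multiple dimensions, namely grammar, semantics, knowledge, reasoning, and computation. In addition, it provides carefully annotated token-level rationales that satisfy sufficiency and compactness. It contains perturbed instances for each original instance, so that rationale consistency under perturbation can be used as the metric for faithfulness, one perspective on interpretability. We conduct experiments on several widely used pre-trained LMs. The results show that they perform poorly on the dimensions of knowledge and computation, and that their plausibility in all dimensions is far from satisfactory, especially when the rationale is short. Moreover, the pre-trained LMs we evaluated are not robust on syntax-aware data. We will release this evaluation benchmark at http://xyz and hope it can facilitate research progress on pre-trained LMs.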
Reasoning is a fundamental aspect of human intelligence that plays a crucial role in activities such as problem solving, decision making, and critical thinking. In recent years, large language models (LLMs) have made significant progress in natural language processing, and it has been observed that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to what extent LLMs are capable of reasoning. This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful discussion and future work.
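Prompted models have demonstrated impressive few-shot learning abilities. Repeated interaction at test time with a single model, or the composition of multiple models, further extends their capabilities. These compositions are probabilistic models and can be expressed in the language of graphical models with random variables whose values are complex data types such as strings. Cases with control flow and dynamic structure require techniques from probabilistic programming, which allow disparate model structures and inference strategies to be implemented in a unified language. We formalize several existing techniques from this perspective, including scratchpads / chain of thought, verifiers, STaR, selection-inference, and tool use. We refer to the resulting programs as language model cascades.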
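Humans use natural language to compose common concepts from their environment into plausible descriptions of everyday scenes. However, this generative commonsense reasoning (GCSR) skill is lacking in state-of-the-art text generation methods. Descriptive sentences about arbitrary concepts generated by neural text generation models (e.g., pre-trained text-to-text Transformers) are often grammatically fluent but may not correspond to human common sense, largely because the models lack mechanisms to capture concept relations, to identify implicit concepts, and to perform generalizable reasoning about unseen concept compositions. In this paper, we propose an Imagine-and-Verbalize (I&V) method, which learns to imagine a relational scene knowledge graph (SKG) with relations between the input concepts, and leverages the SKG as a constraint when generating a plausible scene description. We collect and harmonize a set of knowledge resources from different domains and modalities, providing rich auxiliary supervision signals for I&V. The experiments demonstrate the effectiveness of I&V in improving language models on both concept-to-sentence and concept-to-story generation tasks, while enabling the model to learn from fewer task examples and to generate SKGs that human annotators judge to be consistent with common sense.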