Reasoning, as an essential ability for complex problem-solving, can provide back-end support for various real-world applications, such as medical diagnosis and negotiation. This paper provides a comprehensive survey of cutting-edge research on reasoning with language model prompting. We introduce research works with comparisons and summaries and provide systematic resources to help beginners. We also discuss the potential reasons for the emergence of such reasoning abilities and highlight future research directions.
translated by 谷歌翻译
Mathematical reasoning is a fundamental aspect of human intelligence and is applicable in various fields, including science, engineering, finance, and everyday life. The development of artificial intelligence (AI) systems capable of solving math problems and proving theorems has garnered significant interest in the fields of machine learning and natural language processing. On the one hand, mathematics serves as a testbed for aspects of reasoning that are challenging for powerful deep learning models, driving new algorithmic and modeling advances. On the other hand, recent advances in large-scale neural language models have opened up new benchmarks and opportunities to use deep learning for mathematical reasoning. In this survey paper, we review the key tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning over the past decade. We also evaluate existing benchmarks and methods, and discuss future research directions in this domain.
Reasoning is a fundamental aspect of human intelligence that plays a crucial role in activities such as problem solving, decision making, and critical thinking. In recent years, large language models (LLMs) have made significant progress in natural language processing, and it has been observed that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to what extent LLMs are capable of reasoning. This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions for future directions. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful discussion and future work.
With the increasing ability of large language models (LLMs), in-context learning (ICL) has become a new paradigm for natural language processing (NLP), where LLMs make predictions based only on contexts augmented with a few training examples. Exploring ICL to evaluate and extrapolate the ability of LLMs has become a new trend. In this paper, we aim to survey and summarize the progress, challenges, and future work in ICL. We first present a formal definition of ICL and clarify its correlation to related studies. Then, we organize and discuss advanced techniques of ICL, including training strategies, prompting strategies, and so on. Finally, we present the challenges of ICL and provide potential directions for further research. We hope our work can encourage more research on uncovering how ICL works and on improving ICL.
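Large language models such as GPT-3 and PaLM have shown remarkable performance in few-shot learning. However, they still struggle with reasoning tasks such as the arithmetic benchmark GSM8K. Recent advances deliberately guide the language model to generate a chain of reasoning steps before producing the final answer, successfully boosting the problem-solving rate on the GSM8K benchmark from 17.9% to 58.1%. In this paper, we propose a new approach, DIVERSE (Diverse Verifier on Reasoning Step), to further advance their reasoning capability. DIVERSE first explores different prompts to enhance the diversity of reasoning paths. Second, DIVERSE introduces a verifier to distinguish good answers from bad answers for better weighted voting. Finally, DIVERSE verifies the correctness of each single step rather than all the steps as a whole. We conduct extensive experiments using the latest language model Davinci-002 and demonstrate that DIVERSE can achieve new state-of-the-art performance on six out of eight reasoning benchmarks (e.g., GSM8K 74.4% to 83.2%), outperforming the PaLM model with 540B parameters.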
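The verifier-weighted voting idea in this abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the verifier scores are hypothetical hand-written numbers standing in for a trained verifier, and `majority_vote` and `weighted_vote` are illustrative names chosen here.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the answer that appears most often among the sampled paths."""
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(scored_answers):
    """Return the answer whose reasoning paths have the highest total score.

    Each item is (final_answer, verifier_score). The score is assumed to be
    a verifier's estimate that the reasoning path is correct.
    """
    totals = defaultdict(float)
    for answer, score in scored_answers:
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical samples: five reasoning paths for one problem.
# Three paths reach "20" but the (assumed) verifier rates them poorly.
samples = [("18", 0.9), ("20", 0.2), ("20", 0.3), ("18", 0.8), ("20", 0.1)]

print(majority_vote([answer for answer, _ in samples]))  # 20
print(weighted_vote(samples))                            # 18
```

In this toy case plain majority voting picks the more frequent answer, while the weighted vote prefers the answer backed by the paths the verifier trusts more.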
Current large language models can perform reasonably well on complex tasks that require step-by-step reasoning with few-shot learning. Are these models applying reasoning skills they have learnt during pre-training and reasoning outside of their training context, or are they simply memorizing their training corpus at finer granularity and have learnt to better understand their context? To tease apart these possibilities, we introduce ALERT, a benchmark and suite of analyses for assessing language models' reasoning ability, comparing pre-trained and finetuned models on complex tasks that require reasoning skills to solve. ALERT provides a test bed to assess any language model on fine-grained reasoning skills; it spans over 20 datasets and covers 10 different reasoning skills. We leverage ALERT to further investigate the role of finetuning. With extensive empirical analysis we find that language models learn more reasoning skills, such as textual entailment, abductive reasoning, and analogical reasoning, during the finetuning stage compared to the pretraining stage. We also find that when language models are finetuned they tend to overfit to the prompt template, which hurts the robustness of models and causes generalization problems.
Language models (LMs) have demonstrated remarkable performance on downstream tasks, using in-context exemplars or human instructions. Recent works have shown that chain-of-thought (CoT) prompting can elicit models to solve complex reasoning tasks, step-by-step. However, the efficacy of prompt-based CoT methods is restricted to very large LMs such as GPT-3 (175B), thus limiting deployability. In this paper, we revisit the fine-tuning approach to enable complex reasoning in smaller LMs, optimized to efficiently perform a specific task. We propose Fine-tune-CoT, a method that leverages the capabilities of very large LMs to generate reasoning samples and teach smaller models via fine-tuning. We evaluate our method on publicly available LMs across a wide range of complex tasks and model sizes. We find that Fine-tune-CoT enables substantial reasoning capability in small models, whereas previous prompt-based baselines exhibit near-random performance. Student models can even outperform the teacher in some tasks while reducing model size requirements by several orders of magnitude. We conduct extensive ablations and sample studies to understand the reasoning capabilities of student models. We also identify several important nuances that have been overlooked in concurrent fine-tuning works on CoT and address them in our analysis.
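Bootstrapping from pre-trained language models has been proven to be an efficient approach for building foundation vision-language models (VLMs) for tasks such as image captioning or visual question answering. However, it is difficult to use it to make the model conform with the user's rationales for specific answers. To elicit and reinforce commonsense reasons, we propose an iterative sampling and tuning paradigm, called ILLUME, that executes the following loop: given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides minimal feedback via preference selection, which is used for fine-tuning. This loop increases the training data and gradually carves out the VLM's rationalization capabilities. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly less training data and requiring only minimal feedback.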
The rapid development and application of natural language generation (NLG) techniques has revolutionized the field of automatic text production. However, these techniques are still limited in their ability to produce human-like text that is truly reasonable and informative. In this paper, we explore the importance of NLG being guided by knowledge, in order to convey human-like reasoning through language generation. We propose ten goals for intelligent NLG systems to pursue, and briefly review the achievements of NLG techniques guided by knowledge and reasoning. We conclude by envisioning future directions and challenges in the pursuit of these goals.
Despite the success of large language models (LLMs) in various natural language processing (NLP) tasks, the stored knowledge in these models may inevitably be incomplete, out-of-date, or incorrect. This motivates the need to utilize external knowledge to assist LLMs. Unfortunately, current methods for incorporating external knowledge often require additional training or fine-tuning, which can be costly and may not be feasible for LLMs. To address this issue, we propose a novel post-processing approach, rethinking with retrieval (RR), which retrieves relevant external knowledge based on the decomposed reasoning steps obtained from the chain-of-thought (CoT) prompting. This lightweight approach does not require additional training or fine-tuning and is not limited by the input length of LLMs. We evaluate the effectiveness of RR through extensive experiments with GPT-3 on three complex reasoning tasks: commonsense reasoning, temporal reasoning, and tabular reasoning. Our results show that RR can produce more faithful explanations and improve the performance of LLMs.
Pre-trained language models (LMs) have shown remarkable reasoning performance using explanations (or "chain-of-thought" (CoT)) for in-context learning. On the other hand, these reasoning tasks are usually presumed to be more approachable for symbolic programming. To make progress towards understanding in-context learning, we curate synthetic datasets containing equivalent (natural, symbolic) data pairs, where symbolic examples contain first-order logic rules and predicates from knowledge bases (KBs). Then we revisit neuro-symbolic approaches and use Language Models as Logic Programmer (LMLP) that learns from demonstrations containing logic rules and corresponding examples to iteratively reason over KBs, recovering Prolog's backward chaining algorithm. Comprehensive experiments are included to systematically compare LMLP with CoT in deductive reasoning settings, showing that LMLP enjoys more than 25% higher accuracy than CoT on length generalization benchmarks even with fewer parameters.
Recently, there has been significant progress in teaching language models to perform step-by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thought prompting (CoT) is by far the state-of-the-art method for these tasks. CoT uses language models to perform both reasoning and computation in the multi-step "thought" process. To disentangle computation from reasoning, we propose "Program of Thoughts" (PoT), which uses language models (mainly Codex) to express the reasoning process as a program. The computation is relegated to an external computer, which executes the generated programs to derive the answer. We evaluate PoT on five math word problem datasets (GSM, AQuA, SVAMP, TabMWP, MultiArith) and three financial-QA datasets (FinQA, ConvFinQA, TATQA) for both few-shot and zero-shot setups. Under both few-shot and zero-shot settings, PoT shows an average performance gain over CoT of around 12% across all the evaluated datasets. By combining PoT with self-consistency decoding, we can achieve SoTA performance on all math problem datasets and near-SoTA performance on financial datasets. All of our data and code are released on GitHub at https://github.com/wenhuchen/Program-of-Thoughts.
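The separation of reasoning from computation described here can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: `generated_program` is a hand-written string standing in for model output, `run_program` and the `ans` variable convention are names chosen for this example, and emptying `__builtins__` is not a real security sandbox.

```python
def run_program(program: str, answer_var: str = "ans"):
    """Execute a generated program and return the value bound to answer_var.

    The arithmetic is done by the Python interpreter rather than by the
    language model. Clearing __builtins__ only limits accidental calls;
    untrusted code should run in a properly isolated process instead.
    """
    namespace = {}
    exec(program, {"__builtins__": {}}, namespace)
    return namespace[answer_var]


# Hypothetical program a model might emit for:
# "What is 1000 invested at 5% annual compound interest worth after 3 years?"
generated_program = """
principal = 1000
rate = 0.05
years = 3
ans = principal * (1 + rate) ** years
"""

result = run_program(generated_program)
print(round(result, 3))
```

The model only has to express the steps correctly; the exponentiation itself is delegated to the interpreter.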
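We explore how generating a chain of thought (a series of intermediate reasoning steps) significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting, where a few chain-of-thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even fine-tuned GPT-3 with a verifier.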
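A few-shot chain-of-thought prompt of the kind described above can be assembled as in the sketch below. The `Q:`/`A:` layout, the "The answer is" suffix, the exemplar text, and the function name `build_cot_prompt` are assumptions made for illustration; the abstract does not fix a template.

```python
def build_cot_prompt(exemplars, question):
    """Build a few-shot prompt whose exemplars include reasoning steps.

    Each exemplar is a dict with 'question', 'rationale' and 'answer' keys.
    The final question is left open so the model continues with its own
    reasoning before stating an answer.
    """
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['rationale']} The answer is {ex['answer']}."
        )
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# One illustrative exemplar; in practice several would be used.
exemplars = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 cans of 3 tennis "
                    "balls each. How many tennis balls does he have now?",
        "rationale": "Roger started with 5 balls. 2 cans of 3 balls each "
                     "is 6 balls. 5 + 6 = 11.",
        "answer": "11",
    }
]

prompt = build_cot_prompt(
    exemplars,
    "A baker has 12 eggs and uses 2 eggs per cake. How many cakes can she bake?",
)
print(prompt)
```

A standard few-shot prompt would contain only the final answers; the difference here is that each exemplar spells out its intermediate steps.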
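It remains an open question whether incorporating external knowledge benefits commonsense reasoning while maintaining the flexibility of pretrained sequence models. To investigate this question, we develop generated knowledge prompting, which consists of generating knowledge from a language model and then providing the knowledge as additional input when answering a question. Our method does not require task-specific supervision for knowledge integration or access to a structured knowledge base, yet it improves the performance of large-scale, state-of-the-art models on four commonsense reasoning tasks, achieving state-of-the-art results on numerical commonsense (NumerSense), general commonsense (CommonsenseQA 2.0), and scientific commonsense (QASC) benchmarks. Generated knowledge prompting highlights large-scale language models as flexible sources of external knowledge for improving commonsense reasoning. Our code is available at https://github.com/liujch1998/gkp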
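The two stages named in this abstract (generate knowledge, then answer with it as extra input) can be sketched as below. This is a simplified illustration, not the released code: `generate` is a hypothetical text-completion callable, `toy_lm` is a canned stand-in for a real model, the prompt wording is invented, and concatenating all knowledge statements into one context is an assumption of this sketch.

```python
def answer_with_generated_knowledge(question, generate, num_statements=2):
    """Answer a question after prepending model-generated knowledge.

    generate: any callable mapping a prompt string to a completion string.
    """
    knowledge_prompt = (
        "Generate a fact relevant to the question.\n"
        f"Question: {question}\nKnowledge:"
    )
    # Stage 1: elicit knowledge statements from the language model.
    knowledge = [generate(knowledge_prompt) for _ in range(num_statements)]
    # Stage 2: provide the knowledge as additional input when answering.
    context = " ".join(knowledge)
    answer_prompt = f"{context}\nQuestion: {question}\nAnswer:"
    return generate(answer_prompt).strip()


def toy_lm(prompt):
    """Canned stand-in for a language model, used only for this demo."""
    if prompt.endswith("Knowledge:"):
        return "Penguins are birds that cannot fly."
    return "no" if "cannot fly" in prompt else "yes"


print(answer_with_generated_knowledge("Can penguins fly?", toy_lm))
```

With the toy model, the answer changes once the generated statement is present in the prompt, which is the effect the method relies on.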
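Recent research has shown that rationales, or step-by-step chains of thought, can be used to improve performance in multi-step reasoning tasks. We reconsider rationale-augmented prompting for few-shot in-context learning, where (input -> output) prompts are expanded to (input, rationale -> output) prompts. For rationale-augmented prompting, we demonstrate how existing approaches, which rely on manual prompt engineering, are subject to sub-optimal rationales that may harm performance. To mitigate this brittleness, we propose a unified framework of rationale-augmented ensembles, where we identify rationale sampling in the output space as the key component for robustly improving performance. This framework is general and can easily be extended to common natural language processing tasks, even those that do not traditionally leverage intermediate steps, such as question answering, word sense disambiguation, and sentiment analysis. We demonstrate that rationale-augmented ensembles achieve more accurate and interpretable results than existing prompting approaches, including standard prompting without rationales and rationale-based chain-of-thought prompting, while simultaneously improving the interpretability of model predictions through the associated rationales.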
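Commonsense question answering (CQA) aims to test whether models can answer questions regarding commonsense knowledge that everyone knows. Prior works that incorporate external knowledge bases have shown promising results, but knowledge bases are expensive to construct and are often limited to a fixed set of relations. In this paper, we instead focus on better utilizing the implicit knowledge stored in pre-trained language models. While researchers have found that the knowledge embedded in pre-trained language models can be extracted by having them fill in the blanks of carefully designed prompts for relation extraction and text classification, it remains unclear whether we can adopt this paradigm in CQA, where the inputs and outputs take much more flexible forms. To this end, we investigate four translation methods that can translate natural questions into cloze-style sentences to better solicit commonsense knowledge from language models, including a syntactic-based model, an unsupervised neural model, and two supervised neural models. In addition, to combine the different translation methods, we propose to encourage consistency among model predictions on different translated questions using unlabeled data. We demonstrate the effectiveness of our methods on three CQA datasets in zero-shot settings. We show that our methods are complementary to a knowledge-base-improved model, and that combining them can lead to state-of-the-art zero-shot performance. Analyses also reveal distinct characteristics of the different cloze translation methods and provide insights into why combining them can lead to great improvements.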
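Foundation models have received much attention due to their effectiveness across a broad range of downstream applications. Although there is a large convergence in terms of architecture, most pretrained models are typically still developed for specific tasks or modalities. In this work, we propose to use language models as a general-purpose interface to various foundation models. A collection of pretrained encoders perceive diverse modalities (such as vision and language), and they dock with a language model that plays the role of a universal task layer. We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders. We subsume the advantages and capabilities of both causal and non-causal modeling, thereby combining the best of both worlds. Specifically, the proposed method not only inherits the capabilities of in-context learning and open-ended generation from causal language modeling, but is also conducive to finetuning because of the bidirectional encoders. More importantly, our approach seamlessly unlocks combinations of the above capabilities, e.g., enabling in-context learning or instruction following with finetuned encoders. Experimental results across various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on finetuning, zero-shot generalization, and few-shot learning.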
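In question answering that requires common sense, language models (e.g., GPT-3) have been used to generate text expressing background knowledge that helps improve performance. However, the cost of working with such models is very high. In this work, we finetune smaller language models to generate useful intermediate context, referred to here as elaborations. Our framework alternates between updating two language models, an elaboration generator and an answer predictor, allowing each to influence the other. Using less than 0.5% of the parameters of GPT-3, our model outperforms alternatives of similar size and closes the gap on GPT-3 on four commonsense question answering benchmarks. Human evaluations show that the quality of the generated elaborations is high.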
Natural Language Processing (NLP) has been revolutionized by the use of Pre-trained Language Models (PLMs) such as BERT. Despite setting new records in nearly every NLP task, PLMs still face a number of challenges including poor interpretability, weak reasoning capability, and the need for a lot of expensive annotated data when applied to downstream tasks. By integrating external knowledge into PLMs, Knowledge-Enhanced Pre-trained Language Models (KEPLMs) have the potential to overcome the above-mentioned limitations. In this paper, we examine KEPLMs systematically through a series of studies. Specifically, we outline the common types and different formats of knowledge to be integrated into KEPLMs, detail the existing methods for building and evaluating KEPLMs, present the applications of KEPLMs in downstream tasks, and discuss the future research directions. Researchers will benefit from this survey by gaining a quick and comprehensive overview of the latest developments in this field.
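Advocates for Neuro-Symbolic Artificial Intelligence (NeSy) assert that combining deep learning with symbolic reasoning will lead to stronger AI than either paradigm on its own. As successful as deep learning has been, it is generally accepted that even our best deep learning systems are not very good at abstract reasoning. And since reasoning is inextricably linked to language, it makes intuitive sense that Natural Language Processing (NLP) would be a particularly well-suited candidate for NeSy. We conduct a structured review of studies implementing NeSy for NLP, with the aim of answering the question of whether NeSy is indeed meeting its promises: reasoning, out-of-distribution generalization, interpretability, learning and reasoning from small data, and transferability to new domains. We examine the impact of knowledge representation, such as rules and semantic networks, language structure and relational structure, and whether implicit or explicit reasoning contributes to higher promise scores. We find that systems where logic is compiled into the neural network lead to the most NeSy goals being satisfied, while other factors such as knowledge representation or type of neural architecture do not exhibit a clear correlation with goals being met. We find many discrepancies in how reasoning is defined, specifically in relation to human-level reasoning, which affect decisions about model architectures and drive conclusions that are not always consistent across studies. Hence we advocate for a more methodical approach to the application of theories of human reasoning, as well as the development of appropriate benchmarks, which we hope can lead to a better understanding of progress in the field. We make our data and code available on GitHub for further analysis.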