The formalization of existing mathematical proofs is a notoriously difficult process. Despite decades of research on automation and proof assistants, writing formal proofs remains arduous and only accessible to a few experts. While previous studies to automate formalization focused on powerful search algorithms, no attempts were made to take advantage of available informal proofs. In this work, we introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier sub-problems. We investigate two relevant setups where informal proofs are either written by humans or generated by a language model. Our experiments and ablation studies show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs. Guiding an automated prover with these sketches enhances its performance from 20.9% to 39.3% on a collection of mathematical competition problems.
translated by 谷歌翻译
Mathematical reasoning is a fundamental aspect of human intelligence and is applicable in various fields, including science, engineering, finance, and everyday life. The development of artificial intelligence (AI) systems capable of solving math problems and proving theorems has garnered significant interest in the fields of machine learning and natural language processing. For example, mathematics serves as a testbed for aspects of reasoning that are challenging for powerful deep learning models, driving new algorithmic and modeling advances. On the other hand, recent advances in large-scale neural language models have opened up new benchmarks and opportunities to use deep learning for mathematical reasoning. In this survey paper, we review the key tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning over the past decade. We also evaluate existing benchmarks and methods, and discuss future research directions in this domain.
translated by 谷歌翻译
语言模型在需要自然语言理解的各种任务上取得了非凡的表现。然而,最先进的模型通常在需要定量推理的任务上挣扎,例如在大学一级解决数学,科学和工程问题。为了帮助缩小这一差距,我们介绍了Minerva,Minerva是一种在一般自然语言数据上鉴定的大型语言模型,并进一步培训了技术内容。该模型在不使用外部工具的情况下实现了技术基准测试的最新性能。我们还评估了我们在需要定量推理的物理学,生物学,化学,经济学和其他科学方面的200多个本科生问题上评估我们的模型,并发现该模型可以正确回答其中几乎三分之一。
translated by 谷歌翻译
General mathematical reasoning is computationally undecidable, but humans routinely solve new problems. Moreover, discoveries developed over centuries are taught to subsequent generations quickly. What structure enables this, and how might that inform automated mathematical reasoning? We posit that central to both puzzles is the structure of procedural abstractions underlying mathematics. We explore this idea in a case study on 5 sections of beginning algebra on the Khan Academy platform. To define a computational foundation, we introduce Peano, a theorem-proving environment where the set of valid actions at any point is finite. We use Peano to formalize introductory algebra problems and axioms, obtaining well-defined search problems. We observe existing reinforcement learning methods for symbolic reasoning to be insufficient to solve harder problems. Adding the ability to induce reusable abstractions ("tactics") from its own solutions allows an agent to make steady progress, solving all problems. Furthermore, these abstractions induce an order to the problems, seen at random during training. The recovered order has significant agreement with the expert-designed Khan Academy curriculum, and second-generation agents trained on the recovered curriculum learn significantly faster. These results illustrate the synergistic role of abstractions and curricula in the cultural transmission of mathematics.
translated by 谷歌翻译
大型语言模型,例如OpenAI的法典和DeepMind的字母,可以生成代码来解决以自然语言表达的各种问题。这项技术已经在至少一项广泛使用的编程编辑器扩展程序中进行了商业化:Github Copilot。在本文中,我们探讨了具有大型语言模型(LLM辅助编程)的编程与程序员协助的先前概念化相似,并且与众不同。我们借鉴了公开可用的经验报告,有关LLM辅助编程以及先前的可用性和设计研究。我们发现,尽管LLM辅助编程通过搜索和重用分享了一些编译,配对编程和编程的属性,但技术可能性和实践经验都存在根本差异。因此,应该将LLM辅助编程视为具有自己独特的属性和挑战的新方法。最后,我们借鉴了用户研究的观察结果,在该观察中,非专家最终用户程序员使用LLM辅助工具来求解电子表格中的数据任务。我们讨论可能出现的问题,并在将大型语言模型应用于最终用户编程时,尤其是对于几乎没有编程专业知识的用户。
translated by 谷歌翻译
我们在HOL4互动定理证明书的顶部实施了自动战术证据Tacticeoe。Tactice从人类证据中学习,数学技术适用于每个证明情况。然后在蒙特卡罗树搜索算法中使用这种知识来探索有前途的策略级证明路径。在一个CPU上,时间限制为60秒,Tactictoe在Hol4的标准图书馆中证明了7164定理的66.4%,而自动调度的电子箴言解决了34.5%。通过结合Tactice和电子证明者的结果,成功率上升至69.0%。
translated by 谷歌翻译
虽然编程是现代社会中最广泛适用的技能之一,但现代机器学习模型仍然无法对基本问题的解决方案。尽管重要的是,对评估代码生成令人惊讶的是,很少有效,并且难以准确地评估代码生成性能。为了满足这一挑战,我们介绍了一个用于代码生成的基准。与在更受限制的设置中的事先工作不同,我们的基准测试衡量模型采取任意自然语言规范的能力,并生成满意的Python代码。类似于公司如何评估候选软件开发人员,然后我们通过检查测试用例的生成代码来评估模型。我们的基准测试包括10,000个问题,从具有简单的单线解决方案来实现实质性算法挑战。我们在GitHub和我们的培训集上微调大型语言模型,我们发现语法错误的普遍性随着模型的提高而导致呈指数级递减。最近的模型如GPT-Neo可以通过大约20%的介绍性问题的测试用例,因此我们发现机器学习模型现在开始学习如何代码。随着自动代码生成的社会意义在未来几年增加,我们的基准可以提供跟踪进步的重要措施。
translated by 谷歌翻译
Despite recent success in large language model (LLM) reasoning, LLMs still struggle with hierarchical multi-step reasoning like generating complex programs. In these cases, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, based on hierarchical function descriptions in natural language. Parsel can be used across domains requiring hierarchical reasoning, e.g. code synthesis, theorem proving, and robotic planning. We demonstrate Parsel's capabilities by using it to generate complex programs that cannot currently be automatically implemented from one description and backtranslating Python programs in the APPS dataset. Beyond modeling capabilities, Parsel allows problem-solving with high-level algorithmic designs, benefiting both students and professional programmers.
translated by 谷歌翻译
这项工作表明了如何以编程难题的形式使用大规模语言模型(LMS)与经过验证的解决方案合成编程问题,然后可以将其用于微调相同的模型,从而提高其性能。这项工作以最近的两项发展为基础。首先,LMS在非平凡的推理和算法实施中取得了突破,生成可以解决某些中级竞争性编程问题的代码。但是,培训代码LMS涉及策划的一组自然语言问题描述以及源代码测试和解决方案,这些测试和解决方案的大小有限。其次,引入了一种新的编程挑战格式,称为编程难题,该格式不需要自然语言描述,并通过源代码测试直接指定。在这项工作中,我们展示了如何使用Python解释器验证的合成编程难题和解决方案,可用于改善从P3求解测试难题的性能,P3是一套Python公共基准的Python编程难题。此外,我们发布了由Codex模型生成的100万个难题和解决方案的数据集,我们证明可以通过微调改善较小的模型。
translated by 谷歌翻译
象征性推理,基于规则的符号操作,是人类智慧的标志。然而,基于规则的系统的成功有限与基于学习的系统在外面的正式域之外的竞争中,例如自动定理证明。我们假设这是由于过去尝试中的规则的手动构建。在这项工作中,我们询问我们如何构建基于规则的系统,可以推理自然语言输入,但没有手动构建规则。我们提出了Metaqnl,这是一种“准自然”语言,可以表达正式逻辑和自然语言句子,并梅多斯诱惑,一种学习算法,它从训练数据组成的训练和答案,有或没有中间推理步骤。我们的方法在多个推理基准上实现了最先进的准确性;它学习具有更少数据的紧凑型号,不仅可以答案,而且产生答案。此外,对现实世界的形态学分析基准测试的实验表明,我们可以处理噪音和歧义。代码将在https://github.com/princeton-vl/metaqnl发布。
translated by 谷歌翻译
大型语言模型已经证明了能够在自然语言和编程语言文本上进行条件和生成的能力。这样的模型打开了多语言代码生成的可能性:代码生成模型是否可以将知识从一种语言推广到另一种语言?尽管当代代码生成模型可以生成语义上正确的Python代码,但对它们使用其他语言的能力知之甚少。我们通过提出Multipl-E来促进该主题的探索,这是自然语言到代码生成的第一个多语言平行基准。 Multipl-E扩展了HumaneVal基准(Chen等,2021),以支持另外18种编程语言,涵盖了一系列编程范式和受欢迎程度。我们在Multipl-E:Codex和Incoder上评估了两个最先进的代码生成模型。我们发现,在几种语言上,法典匹配,甚至超过了其在Python上的性能。在多型E中表示的编程语言范围使我们能够探索语言频率和语言功能对模型性能的影响。最后,将代码生成基准分配给新编程语言的多重方法既可扩展又可扩展。我们描述了一种通用方法,可以轻松地增加对新基准和语言的支持。
translated by 谷歌翻译
我们探索如何产生一系列思想 - 一系列中间推理步骤 - 显着提高了大语言模型执行复杂推理的能力。特别是,我们通过一种称为“思想链”提示的简单方法在足够大的语言模型中自然出现这种推理能力,在此过程中,一些思想示范被作为提示的示例提供了。三种大语模型的实验表明,促使思想链提高了一系列算术,常识和象征性推理任务的性能。经验收益可能会引人注目。例如,仅使用八个思想范围的540B参数语言模型才能在数学单词问题的GSM8K基准上实现最新的精度,甚至超过了带有验证器的Fineted GPT-3。
translated by 谷歌翻译
自然语言处理(NLP)已成为当前人工智能繁荣中的主要应用领域之一。转移学习已经启用了大量深入学习的神经网络,接受了语言建模任务,以大大提高了所有语言任务的性能。有趣的是,当模型培训使用包含软件代码的数据培训时,它们在从自然语言规范中生成功能计算机代码时展示了显着的能力。我们认为这是一种难题,用于神经模型为生成词组结构语法提供了一种替代理论,以说明语言有效。由于编程语言的语法由短语结构语法决定,因此成功的神经模型显然是对编程语言的理论基础的理论基础,以及通过扩展,自然语言来实现。我们认为语言模型的术语模型是误导性的,因为深度学习模型不是语言的理论模型,并提出采用语料库模型,这更好地反映了模型的成因和内容。
translated by 谷歌翻译
Reasoning is a fundamental aspect of human intelligence that plays a crucial role in activities such as problem solving, decision making, and critical thinking. In recent years, large language models (LLMs) have made significant progress in natural language processing, and there is observation that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to what extent LLMs are capable of reasoning. This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful discussion and future work.
translated by 谷歌翻译
评论是源代码的重要组成部分,是文档的主要来源。这引起了人们对使用大量注释的兴趣训练或评估消耗或生产它们的工具,例如生成甲骨文,甚至是从注释中生成代码,或自动生成代码摘要。这项工作大部分对评论的结构和质量做出了强烈的假设,例如假设它们主要由适当的英语句子组成。但是,我们对这些用例的现有评论的实际质量知之甚少。评论通常包含在其他类型的文本中看不到的独特结构和元素,并且从中过滤或提取信息需要额外的谨慎。本文探讨了来自GitHub的840个最受欢迎的开源项目和Srilab数据集的8422个项目的Python评论的内容和质量,并且Na \“ Ive vs.深入过滤的影响都可以使用现有注释来用于使用现有注释。培训和评估产生评论的系统。
translated by 谷歌翻译
Pre-trained language models (LMs) have shown remarkable reasoning performance using explanations (or ``chain-of-thought'' (CoT)) for in-context learning. On the other hand, these reasoning tasks are usually presumed to be more approachable for symbolic programming. To make progress towards understanding in-context learning, we curate synthetic datasets containing equivalent (natural, symbolic) data pairs, where symbolic examples contain first-order logic rules and predicates from knowledge bases (KBs). Then we revisit neuro-symbolic approaches and use Language Models as Logic Programmer (LMLP) that learns from demonstrations containing logic rules and corresponding examples to iteratively reason over KBs, recovering Prolog's backward chaining algorithm. Comprehensive experiments are included to systematically compare LMLP with CoT in deductive reasoning settings, showing that LMLP enjoys more than 25% higher accuracy than CoT on length generalization benchmarks even with fewer parameters.
translated by 谷歌翻译
We are currently unable to specify human goals and societal values in a way that reliably directs AI behavior. Law-making and legal interpretation form a computational engine that converts opaque human values into legible directives. "Law Informs Code" is the research agenda capturing complex computational legal processes, and embedding them in AI. Similar to how parties to a legal contract cannot foresee every potential contingency of their future relationship, and legislators cannot predict all the circumstances under which their proposed bills will be applied, we cannot ex ante specify rules that provably direct good AI behavior. Legal theory and practice have developed arrays of tools to address these specification problems. For instance, legal standards allow humans to develop shared understandings and adapt them to novel situations. In contrast to more prosaic uses of the law (e.g., as a deterrent of bad behavior through the threat of sanction), leveraged as an expression of how humans communicate their goals, and what society values, Law Informs Code. We describe how data generated by legal processes (methods of law-making, statutory interpretation, contract drafting, applications of legal standards, legal reasoning, etc.) can facilitate the robust specification of inherently vague human goals. This increases human-AI alignment and the local usefulness of AI. Toward society-AI alignment, we present a framework for understanding law as the applied philosophy of multi-agent alignment. Although law is partly a reflection of historically contingent political power - and thus not a perfect aggregation of citizen preferences - if properly parsed, its distillation offers the most legitimate computational comprehension of societal values available. If law eventually informs powerful AI, engaging in the deliberative political process to improve law takes on even more meaning.
translated by 谷歌翻译
许多智力努力需要解决数学问题,但这种技能仍然超出了计算机的能力。为了测量机器学习模型中的这种能力,我们介绍了数学,这是一个12,500个挑战性竞争数学问题的新数据集。数学中的每个问题都有一个完整的逐步解决方案,可用于教授模型来生成答案派生和解释。为了促进未来的研究和提高数学准确性,我们还提供了一个大型辅助预制数据集,有助于教导模型数学的基本原则。尽管我们能够提高数学准确性,但我们的结果表明,即使有巨大的变压器模型,即使有巨大的变压器模型也是相对较低的。此外,我们发现,如果缩放趋势持续,则无法增加预算和模型参数计数对于实现强大的数学推理,这将是不切实际的。虽然缩放变压器正在自动解决大多数基于文本的任务,但缩放目前没有解决数学。为了在数学问题上进行更多牵引,我们可能需要更广泛的研究界的新算法进步。
translated by 谷歌翻译
The problem of reversing the compilation process, decompilation, is an important tool in reverse engineering of computer software. Recently, researchers have proposed using techniques from neural machine translation to automate the process in decompilation. Although such techniques hold the promise of targeting a wider range of source and assembly languages, to date they have primarily targeted C code. In this paper we argue that existing neural decompilers have achieved higher accuracy at the cost of requiring language-specific domain knowledge such as tokenizers and parsers to build an abstract syntax tree (AST) for the source language, which increases the overhead of supporting new languages. We explore a different tradeoff that, to the extent possible, treats the assembly and source languages as plain text, and show that this allows us to build a decompiler that is easily retargetable to new languages. We evaluate our prototype decompiler, Beyond The C (BTC), on Go, Fortran, OCaml, and C, and examine the impact of parameters such as tokenization and training data selection on the quality of decompilation, finding that it achieves comparable decompilation results to prior work in neural decompilation with significantly less domain knowledge. We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
translated by 谷歌翻译
The Codex model has demonstrated extraordinary competence in synthesizing code from natural language problem descriptions. However, in order to reveal unknown failure modes and hidden biases, such large-scale models must be systematically subjected to multiple and diverse evaluation studies. In this work, we evaluate the code synthesis capabilities of the Codex model based on a set of 115 Python problem statements from a popular competitive programming portal: HackerRank. Our evaluation shows that Codex is indeed proficient in Python, solving 96% of the problems in a zero-shot setting, and 100% of the problems in a few-shot setting. However, Codex exhibits clear signs of generating memorized code based on our evaluation. This is alarming, especially since the adoption and use of such models could directly impact how code is written and produced in the foreseeable future. With this in mind, we further discuss and highlight some of the prominent risks associated with large-scale models of source code. Finally, we propose a framework for code-synthesis evaluation using variations of problem statements based on mutations.
translated by 谷歌翻译