Transformers have been essential to pretraining success in NLP. Other architectures have been used, but require attention layers to match benchmark accuracy. This work explores pretraining without attention. We test recently developed routing layers based on state-space models (SSM) and model architectures based on multiplicative gating. Used together these modeling choices have a large impact on pretraining accuracy. Empirically the proposed Bidirectional Gated SSM (BiGS) replicates BERT pretraining results without attention and can be extended to long-form pretraining of 4096 tokens without approximation.
translated by 谷歌翻译
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
translated by 谷歌翻译
现在,可以使用最先进的神经语言模型通过零射门提示来解决临时语言任务,而无需进行监督培训。近年来,这种方法已广受欢迎,研究人员证明了提示在特定的NLP任务上实现强烈准确的提示。但是,找到新任务的提示需要实验。具有不同措辞选择的不同提示模板会导致明显的准确性差异。提示允许用户尝试及时变化,可视化及时性能,并迭代优化提示。我们开发了一个工作流程,该工作流程允许用户首先使用少量数据专注于模型反馈,然后再进入大型数据制度,该数据制度允许使用任务的定量度量来实现有希望的提示的经验基础。然后,该工具可以轻松部署新创建的临时模型。我们使用多种现实世界用例演示了Fackide(http://prompt.vizhub.ai)和我们的工作流程的实用性。
translated by 谷歌翻译
结构分布,即组合空间的分布,通常用于学习观察到数据的潜在概率表示。然而,缩放这些模型是由高计算和内存复杂度相对于潜在表示的大小的瓶颈。诸如隐藏的马尔可夫模型(HMMS)和概率的无内容语法(PCFG)的常见模型在隐藏状态的数量中需要时间和空间二次和立方。这项工作展示了一种简单的方法来降低大类结构化模型的计算和内存复杂性。我们展示通过将中央推理步骤视为矩阵 - 矢量产品,并使用低秩约束,我们可以通过等级进行模型表达性和速度。用神经参数化结构化模型进行语言建模,复音音乐建模,无监督语法诱导和视频建模的实验表明,我们的方法在提供实用加速度的同时匹配大状态空间的标准模型的准确性。
translated by 谷歌翻译
最近已被证明大型语言模型在各种任务集中获得合理的零射普通化(Brown等,2020)。它已经假设这是语言模型的隐式多任务学习的结果,在语言模型中的预押(Radford等,2019)。可以通过明确的多任务学习直接引起零拍常规化?为了以缩放测试这个问题,我们开发一个系统,以便轻松地将任何自然语言任务映射到人类可读的提示表单中。我们转换一组大量的监督数据集,每个数据集都有多个提示,具有不同的措辞。这些提示的数据集允许基准测试模型执行完全看不见的任务的能力。我们介绍了一个普拉克尔编码器 - 解码器模型(Raffel等,2020; Lester等,2021),覆盖各种任务。该模型在多个标准数据集中达到强大的零点性能,通常优于其尺寸的型号超过16倍。此外,我们的方法对来自Big-替补基准测试的任务子集具有强烈性能,优于其尺寸的6倍。所有提示和培训的型号都可以在https://github.com/ bigscience-workshop / protectsource / httpsource / https://huggingface.co/bigscience/t0pp。
translated by 谷歌翻译
序列模型是现代NLP系统的关键组成部分,但它们的预测难以解释。我们考虑虽然可以解释单个模型预测的基础,但是可以解释各种模型预测的上下文的模型解释。通过解决组合优化来找到顺序律师:最佳理由是输入令牌的最小子集,这些令牌将预测与完整序列相同的输出。枚举所有子集是棘手的,因此我们提出了一种高效的贪婪算法来近似这个目标。称为贪婪合理化的算法适用于任何模型。对于这种方法有效,模型应该在对上下文的不完整子集进行预测时形成兼容的条件分布。这种情况可以用短的微调步骤强制执行。我们研究语言建模与机器翻译的贪婪合理化。与现有的基线相比,贪婪合理化是最优化组合目标的,并提供最忠实的理由。在注释的顺序理由的新数据集中,贪婪的理由与人类理由最相似。
translated by 谷歌翻译
We propose a notation for tensors with named axes, which relieves the author, reader, and future implementers of machine learning models from the burden of keeping track of the order of axes and the purpose of each. The notation makes it easy to lift operations on low-order tensors to higher order ones, for example, from images to minibatches of images, or from an attention mechanism to multiple attention heads. After a brief overview and formal definition of the notation, we illustrate it through several examples from modern machine learning, from building blocks like attention and convolution to full models like Transformers and LeNet. We then discuss differential calculus in our notation and compare with some alternative notations. Our proposals build on ideas from many previous papers and software libraries. We hope that our notation will encourage more authors to use named tensors, resulting in clearer papers and more precise implementations.
translated by 谷歌翻译
Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. Transformers is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered stateof-the art Transformer architectures under a unified API. Backing this library is a curated collection of pretrained models made by and available for the community. Transformers is designed to be extensible by researchers, simple for practitioners, and fast and robust in industrial deployments. The library is available at https://github.com/ huggingface/transformers.
translated by 谷歌翻译
We describe an open-source toolkit for neural machine translation (NMT). The toolkit prioritizes efficiency, modularity, and extensibility with the goal of supporting NMT research into model architectures, feature representations, and source modalities, while maintaining competitive performance and reasonable training requirements. The toolkit consists of modeling and translation support, as well as detailed pedagogical documentation about the underlying techniques.
translated by 谷歌翻译
Summarization based on text extraction is inherently limited, but generation-style abstractive methods have proven challenging to build. In this work, we propose a fully data-driven approach to abstractive sentence summarization. Our method utilizes a local attention-based model that generates each word of the summary conditioned on the input sentence. While the model is structurally simple, it can easily be trained end-to-end and scales to a large amount of training data. The model shows significant performance gains on the DUC-2004 shared task compared with several strong baselines.
translated by 谷歌翻译