从互联网上刮下来的数据集对于大规模机器学习的成功至关重要。然而,这一成功使未来互联网衍生的数据集的实用性处于潜在的风险,因为模型输出开始取代人类注释作为监督的来源。在这项工作中,我们首先将与一个模型的交互作用记录为历史并将其作为培训数据的系统形式化。然后,我们通过跟踪对测试时间偏差统计量的变化(例如,模型预测的性别偏差)来分析其稳定性。我们发现,偏置扩增的程度与模型的输出的行为是否像训练分布中的样本一样,我们表征并将其定义为一致的校准。在三种条件预测方案中的实验 - 图像分类,视觉角色标记和语言生成 - 证明表现出样本样行为的模型更加校准,因此更稳定。基于这种见解,我们提出了一项干预措施,以帮助校准和稳定不稳定的反馈系统。代码可从https://github.com/rtaori/da​​ta_feedback获得。
translated by 谷歌翻译
我们介绍了Sparrow,这是一个寻求信息的对话代理,与提示的语言模型基线相比,训练有素,更有帮助,正确和无害。我们使用从人类反馈中的强化学习来培训我们的模型,以帮助人类评估者判断代理人的行为。首先,为了使我们的代理人更有帮助和无害,我们将良好对话的要求分解为代理人应遵循的自然语言规则,并分别向评估者询问每个规则。我们证明,这种崩溃使我们能够收集对代理行为的更多针对性的人类判断,并允许更有效的规则条件奖励模型。其次,我们的代理商在收集对模型声明的偏好判决时提供了支持事实主张的来源的证据。对于事实问题,麻雀提供的证据支持了78%的时间。比基线比基线更享受麻雀,同时对人类的对抗性探测更具弹性,在探测时只有8%的时间违反了我们的规则。最后,我们进行了广泛的分析,表明尽管我们的模型学会遵守我们的规则,但它可以表现出分布偏见。
translated by 谷歌翻译
The ability to quickly and accurately identify covariate shift at test time is a critical and often overlooked component of safe machine learning systems deployed in high-risk domains. While methods exist for detecting when predictions should not be made on out-of-distribution test examples, identifying distributional level differences between training and test time can help determine when a model should be removed from the deployment setting and retrained. In this work, we define harmful covariate shift (HCS) as a change in distribution that may weaken the generalization of a predictive model. To detect HCS, we use the discordance between an ensemble of classifiers trained to agree on training data and disagree on test data. We derive a loss function for training this ensemble and show that the disagreement rate and entropy represent powerful discriminative statistics for HCS. Empirically, we demonstrate the ability of our method to detect harmful covariate shift with statistical certainty on a variety of high-dimensional datasets. Across numerous domains and modalities, we show state-of-the-art performance compared to existing methods, particularly when the number of observed test samples is small.
translated by 谷歌翻译
We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-ofthe-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous nonsparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.
translated by 谷歌翻译
现代神经语言模型广泛用于任务中的任务,跨越培训数据记忆敏感信息。由于模型继续扩大参数,培训数据和计算,从学习理论的角度来看,培训数据和计算中的记忆既重要性也很重要,并且在现实世界应用中实际上至关重要。在语言模型中记忆的研究中的一个开放问题是如何过滤掉“常见的”记忆。事实上,大多数记忆标准与培训集的出现数量强烈关联,捕获“常见”记忆,例如熟悉的短语,公共知识或模板文本。在本文中,我们提供了由心理学中人类记忆分类的理性观点。从这个角度来看,我们制定了反事实记忆的概念,这表征了模型的预测如何改变,如果在训练期间省略了特定文件。我们在标准文本数据集中识别并研究了反复记忆培训示例。我们进一步估计每个训练示例对验证集和生成文本的影响,并显示这可以提供在测试时间的记忆源的直接证据。
translated by 谷歌翻译
当前的语言模型可以产生高质量的文本。他们只是复制他们之前看到的文本,或者他们学习了普遍的语言抽象吗?要取笑这些可能性,我们介绍了乌鸦,这是一套评估生成文本的新颖性,专注于顺序结构(n-gram)和句法结构。我们将这些分析应用于四种神经语言模型(LSTM,变压器,变换器-XL和GPT-2)。对于本地结构 - 例如,单个依赖性 - 模型生成的文本比来自每个模型的测试集的人类生成文本的基线显着不那么新颖。对于大规模结构 - 例如,总句结构 - 模型生成的文本与人生成的基线一样新颖甚至更新颖,但模型仍然有时复制,在某些情况下,在训练集中重复超过1000字超过1,000字的通道。我们还表现了广泛的手动分析,表明GPT-2的新文本通常在形态学和语法中形成良好,但具有合理的语义问题(例如,是自相矛盾)。
translated by 谷歌翻译
Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization. However, pre-training objectives tailored for abstractive text summarization have not been explored. Furthermore there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new selfsupervised objective. In PEGASUS, important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary. We evaluated our best PEGASUS model on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills. Experiments demonstrate it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores. Our model also shows surprising performance on low-resource summarization, surpassing previous state-of-the-art results on 6 datasets with only 1000 examples. Finally we validated our results using human evaluation and show that our model summaries achieve human performance on multiple datasets.
translated by 谷歌翻译
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
translated by 谷歌翻译
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
translated by 谷歌翻译
鉴于大型语言模型的广泛能力,应该有可能朝着一般的文本的助手工作,这些助手与人类价值一致,这意味着它是有帮助,诚实的和无害的。在此方向上的初始遗传,我们研究简单的基线技术和评估,例如提示。我们发现,从模型规模增加适度的干预措施的好处,概括为各种对准评估,并不会损害大型模型的性能。接下来,我们调查与对齐,比较仿制,二进制歧视和排名偏好建模相关的几个培训目标的缩放趋势。我们发现排名优先级模型比模仿学习更好地表现得多,并且通常以模型大小更有利地缩放。相比之下,二进制歧视通常与模仿学习非常类似地执行和缩放。最后,我们研究了一种“偏好模型预训练阶段的培训阶段,其目的是在对人偏好的芬明时提高样本效率。
translated by 谷歌翻译
由于在开放式文本生成中取得了重大进展,衡量机器生成的文本是如何对人类语言的关键问题。我们介绍紫红色,一个开放式文本生成的比较措施,它直接将文本生成模型的学习分布与使用发散边界的分发进行了分布到人写的文本。淡紫色通过计算量化嵌入空间中的信息分流来缩放到现代文本生成模型。通过对三个开放式发电任务的广泛实证研究,我们发现紫红色标识了所生成文本的已知属性,天然存在模型大小,并与人类判断相关,而不是现有的分布评估度量的限制较少。
translated by 谷歌翻译
随着人工智能系统变得越来越强大和普遍,人们对机器的道德或缺乏道德的关注变得越来越关注。然而,向机器讲授道德是一项艰巨的任务,因为道德仍然是人类中最激烈的争论问题之一,更不用说AI了。但是,部署到数百万用户的现有AI系统已经在做出充满道德影响的决策,这构成了一个看似不可能的挑战:教学机器的道德意义,而人类继续努力努力。为了探索这一挑战,我们介绍了Delphi,这是一个基于深层神经网络的实验框架,直接训练了描述性道德判断,例如,“帮助朋友”通常是不错的,而“帮助朋友传播假新闻”不是。经验结果提供了对机器伦理的承诺和局限性的新见解。面对新的道德情况,德尔菲(Delphi)表现出强大的概括能力,而现成的神经网络模型表现出明显差的判断,包括不公正的偏见,证实了对明确教学机器的道德意义的必要性。然而,德尔菲并不完美,表现出对普遍性偏见和不一致的敏感性。尽管如此,我们还是展示了不完美的Delphi的积极用例,包括在其他不完美的AI系统中将其用作组件模型。重要的是,我们根据著名的道德理论来解释Delphi的运营化,这使我们提出了重要的未来研究问题。
translated by 谷歌翻译
将来,强大的AI系统可能会在高风险的设置中部署,在这种情况下,单个故障可能是灾难性的。在高风险设置中改善AI安全性的一种技术是对手训练,该培训使用对手来生成示例进行训练,以实现更好的最差表现。在这项工作中,我们将语言生成任务用作测试台,以通过对抗性培训来实现高可靠性。我们创建了一系列的对抗训练技术 - 包括一种有助于人类对手的工具 - 以在分类器中找到和消除故障,该分类器过滤了发电机建议的文本完成。在简单的“避免受伤”任务中,我们确定我们可以设置非常保守的分类器阈值,而不会显着影响过滤后的输出的质量。使用我们选择的阈值,使用基线分类器进行过滤,将不安全完成的速度从分布数据的数据降低到约2.4%至0.003%,这是我们测量能力的极限。我们发现,对抗性训练可显着提高对我们训练的对抗攻击的鲁棒性,而不会影响分布性能。我们希望在高风险的可靠性环境中看到进一步的工作,包括更强大的工具来增强人类对手,以及更好的方法来衡量高水平的可靠性,直到我们可以自信地排除强大模型的灾难性部署时间失败的可能性。
translated by 谷歌翻译
虽然神经网络在平均病例的性能方面对分类任务的成功显着,但它们通常无法在某些数据组上表现良好。这样的组信息可能是昂贵的;因此,即使在培训数据不可用的组标签不可用,较稳健性和公平的最新作品也提出了改善最差组性能的方法。然而,这些方法通常在培训时间使用集团信息的表现不佳。在这项工作中,我们假设没有组标签的较大数据集一起访问少量组标签。我们提出了一个简单的两步框架,利用这个部分组信息来提高最差组性能:训练模型以预测训练数据的丢失组标签,然后在强大的优化目标中使用这些预测的组标签。从理论上讲,我们在最差的组性能方面为我们的方法提供泛化界限,展示了泛化误差如何相对于培训点总数和具有组标签的培训点的数量。凭经验,我们的方法优于不使用群组信息的基线表达,即使只有1-33%的积分都有组标签。我们提供消融研究,以支持我们框架的稳健性和可扩展性。
translated by 谷歌翻译
在过去的十年中,计算机愿景,旨在了解视觉世界的人工智能分支,从简单地识别图像中的物体来描述图片,回答有关图像的问题,以及围绕物理空间的机器人操纵甚至产生新的视觉内容。随着这些任务和应用程序的现代化,因此依赖更多数据,用于模型培训或评估。在本章中,我们展示了新颖的互动策略可以为计算机愿景提供新的数据收集和评估。首先,我们提出了一种众群界面,以通过数量级加速付费数据收集,喂养现代视觉模型的数据饥饿性质。其次,我们探索使用自动社交干预措施增加志愿者贡献的方法。第三,我们开发一个系统,以确保人类对生成视觉模型的评估是可靠的,实惠和接地在心理物理学理论中。我们结束了人机互动的未来机会,以帮助计算机愿景。
translated by 谷歌翻译
本次调查绘制了用于分析社交媒体数据的生成方法的研究状态的广泛的全景照片(Sota)。它填补了空白,因为现有的调查文章在其范围内或被约会。我们包括两个重要方面,目前正在挖掘和建模社交媒体的重要性:动态和网络。社会动态对于了解影响影响或疾病的传播,友谊的形成,友谊的形成等,另一方面,可以捕获各种复杂关系,提供额外的洞察力和识别否则将不会被注意的重要模式。
translated by 谷歌翻译
我们理论上和经验地证明,对抗性鲁棒性可以显着受益于半体验学习。从理论上讲,我们重新审视了Schmidt等人的简单高斯模型。这显示了标准和稳健分类之间的示例复杂性差距。我们证明了未标记的数据桥接这种差距:简单的半体验学习程序(自我训练)使用相同数量的达到高标准精度所需的标签实现高的强大精度。经验上,我们增强了CiFar-10,使用50万微小的图像,使用了8000万微小的图像,并使用强大的自我训练来优于最先进的鲁棒精度(i)$ \ ell_ infty $鲁棒性通过对抗培训和(ii)认证$ \ ell_2 $和$ \ ell_ \ infty $鲁棒性通过随机平滑的几个强大的攻击。在SVHN上,添加DataSet自己的额外训练集,删除的标签提供了4到10个点的增益,在使用额外标签的1点之内。
translated by 谷歌翻译
Generative AI has matured to a point where large-scale models can generate text that seems indistinguishable from human-written text and remarkably photorealistic images. Automatically measuring how close the distribution of generated data is to the target real data distribution is a key step in diagnosing existing models and developing better models. We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images. These scores are statistical summaries of divergence frontiers capturing two types of errors in generative modeling. We explore four approaches to statistically estimate these scores: vector quantization, non-parametric estimation, classifier-based estimation, and parametric Gaussian approximations. We provide statistical bounds for the vector quantization approach. Empirically, we find that the proposed scores paired with a range of $f$-divergences and statistical estimation methods can quantify the gaps between the distributions of human-written text and those of modern neural language models by correlating with human judgments and identifying known properties of the generated texts. We conclude the paper by demonstrating its applications to other AI domains and discussing practical recommendations.
translated by 谷歌翻译
We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3% -15% on CIFAR-10 and 11% -14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.
translated by 谷歌翻译
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
translated by 谷歌翻译