Calibration strengthens the trustworthiness of black-box models by producing more accurate confidence estimates on given examples. However, little is known about whether model explanations can help confidence calibration. Intuitively, humans look at important feature attributions and decide whether the model is trustworthy. Similarly, the explanations can tell us when the model may or may not know. Inspired by this, we propose a method named CME that leverages model explanations to make the model less confident with non-inductive attributions. The idea is that when the model is not highly confident, it is difficult to identify strong indications of any class, so the tokens do not have high attribution scores for any class, and vice versa. We conduct extensive experiments on six datasets with two popular pre-trained language models in the in-domain and out-of-domain settings. The results show that CME improves calibration performance in all settings. The expected calibration errors are further reduced when CME is combined with temperature scaling. Our findings highlight that model explanations can help calibrate posterior estimates.
Calibration is a popular framework to evaluate whether a classifier knows when it does not know - i.e., its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to human majority given inherent disagreements is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements - class frequency, ranking and entropy.
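Estimating the predictive uncertainty of pre-trained language models is important for increasing their trustworthiness in NLP. Although many previous works focus on quantifying prediction uncertainty, there is little work on explaining the uncertainty. This paper pushes a step further on explaining uncertain predictions of post-calibrated pre-trained language models. We adapt two perturbation-based post-hoc interpretation methods, Leave-one-out and Sampling Shapley, to identify words in inputs that cause the uncertainty in predictions. We test the proposed methods on BERT and RoBERTa with three tasks: sentiment classification, natural language inference, and paraphrase identification, in both in-domain and out-of-domain settings. Experiments show that both methods consistently capture words in inputs that cause prediction uncertainty.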
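As a rough illustration of the leave-one-out idea in the abstract above, the sketch below scores each word by how much the predictive entropy drops when that word is removed. The scoring rule (entropy difference), the function names, and the lexicon-based `toy_predict_proba` classifier are all assumptions made for this example; they are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def toy_predict_proba(tokens):
    """Hypothetical stand-in for a calibrated classifier: a tiny sentiment lexicon."""
    weights = {"great": 2.0, "terrible": -2.0}
    score = sum(weights.get(t, 0.0) for t in tokens)
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p_pos, p_pos]


def leave_one_out_uncertainty(tokens, predict_proba):
    """Score each token by the entropy drop observed when it is removed.

    A positive score means the prediction becomes more certain without the
    token, i.e. the token contributes to the uncertainty.
    """
    base = entropy(predict_proba(tokens))
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append(base - entropy(predict_proba(reduced)))
    return scores


scores = leave_one_out_uncertainty(["great", "but", "terrible"], toy_predict_proba)
```

In this toy input the two sentiment words cancel out, so removing either one makes the prediction more certain and both receive positive scores, while the neutral word "but" scores zero.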
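Neural networks lack adversarial robustness, i.e., they are vulnerable to adversarial examples that cause incorrect predictions through small perturbations to the inputs. Further, trust is undermined when models give miscalibrated predictions, i.e., when the predicted probability is not a good indicator of how much we should trust our model. In this paper, we study the connection between adversarial robustness and calibration, and find that the inputs for which the model is sensitive to small perturbations (those that are easily attacked) are more likely to have poorly calibrated predictions. Based on this insight, we examine whether calibration can be improved by addressing those adversarially unrobust inputs. To this end, we propose Adversarial Robustness based Adaptive Label Smoothing (AR-AD), which integrates the correlation between adversarial robustness and calibration into training by adaptively softening the labels of an example based on how easily it can be attacked by an adversary. We find that our method, which takes the adversarial robustness of the in-distribution data into consideration, leads to better-calibrated models even under distributional shift. In addition, it can also be applied to an ensemble model to further improve model calibration.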
Pre-trained language models (PLMs) achieve remarkable performance on many downstream tasks, but may fail to give reliable estimates of their predictive uncertainty. Given the lack of a comprehensive understanding of PLM calibration, we take a close look into this new research problem, aiming to answer two questions: (1) Do PLMs learn to become calibrated in the training process? (2) How effective are existing calibration methods? For the first question, we conduct fine-grained control experiments to study the dynamic change in PLMs' calibration performance in training. We consider six factors as control variables, including dataset difficulty, available training samples, training steps, the number of tunable parameters, model scale, and pretraining. In experiments, we observe a consistent change in calibration performance across six factors. We find that PLMs don't learn to become calibrated in training, evidenced by the continual increase in confidence, regardless of whether the predictions are correct. We highlight that our finding presents some contradiction with two established conclusions: (a) Larger PLMs are more calibrated; (b) Pretraining improves model calibration. Next, we study the effectiveness of existing calibration methods in mitigating the overconfidence issue, in both in-distribution and various out-of-distribution settings. Besides unlearnable calibration methods, we adapt two recently proposed learnable methods that directly collect data to train models to have reasonable confidence estimations. Also, we propose extended learnable methods based on existing ones to further improve or maintain PLM calibration without sacrificing the original task performance. Experimental results show that learnable methods significantly reduce PLMs' confidence in wrong predictions, and our methods exhibit superior performance compared with previous methods.
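Transformers, currently the state of the art in natural language understanding (NLU) tasks, are prone to generating uncalibrated predictions or extreme probabilities, which makes taking different decisions based on their output relatively difficult. In this paper, we propose to build several inductive Venn-ABERS predictors (IVAP), which are guaranteed to be well calibrated under minimal assumptions, based on a selection of pre-trained transformers. We test their performance over a set of diverse NLU tasks and show that they are capable of producing well-calibrated probabilistic predictions that are uniformly spread over the [0,1] interval, all while retaining the original model's predictive accuracy.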
Confidence calibration, the problem of predicting probability estimates representative of the true correctness likelihood, is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling, a single-parameter variant of Platt Scaling, is surprisingly effective at calibrating predictions.
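Temperature scaling divides the logits by a single scalar T before the softmax, and calibration quality is commonly summarized by the expected calibration error (ECE). The sketch below shows both in plain Python. The equal-width binning, the default of 10 bins, and the function names are choices made for this illustration, and fitting T on a held-out validation set is omitted.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a scalar temperature (T > 1 softens)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece


# A temperature above 1 lowers the confidence of the winning class
# without changing which class wins.
sharp = softmax([4.0, 0.0])
soft = softmax([4.0, 0.0], temperature=2.0)
```

Because dividing by a positive scalar preserves the ordering of the logits, the predicted class (and therefore accuracy) is unchanged; only the confidence moves.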
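Ensemble-based debiasing methods have been shown to be effective by exploiting the output of bias-only models to adjust the learning target. In this paper, we focus on the bias-only models in these ensemble-based methods, which play an important role but have not gained much attention in the existing literature. Theoretically, we prove that the debiasing performance can be damaged by inaccurate uncertainty estimations of the bias-only model. Empirically, we show that existing bias-only models fall short in producing accurate uncertainty estimations. Motivated by these findings, we propose to conduct calibration on the bias-only model, thus achieving a three-stage ensemble-based debiasing framework, including bias modeling, model calibration, and debiasing. Experimental results on NLI and fact verification tasks show that our proposed three-stage debiasing framework consistently outperforms the traditional two-stage one in out-of-distribution accuracy.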
Recent studies have revealed that, beyond conventional accuracy, calibration should also be considered for training modern deep neural networks. To address miscalibration during learning, some methods have explored different penalty functions as part of the learning objective, alongside a standard classification loss, with a hyper-parameter controlling the relative contribution of each term. Nevertheless, these methods share two major drawbacks: 1) the scalar balancing weight is the same for all classes, hindering the ability to address different intrinsic difficulties or imbalance among classes; and 2) the balancing weight is usually fixed without an adaptive strategy, which may prevent the model from reaching the best compromise between accuracy and calibration, and requires hyper-parameter search for each application. We propose Class Adaptive Label Smoothing (CALS) for calibrating deep networks, which learns class-wise multipliers during training, yielding a powerful alternative to common label smoothing penalties. Our method builds on a general Augmented Lagrangian approach, a well-established technique in constrained optimization, but we introduce several modifications to tailor it for large-scale, class-adaptive training. Comprehensive evaluation and multiple comparisons on a variety of benchmarks, including standard and long-tailed image classification, semantic segmentation, and text classification, demonstrate the superiority of the proposed method. The code is available at https://github.com/by-liu/CALS.
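In this paper, we study the post-hoc calibration of modern neural networks, a problem that has drawn a lot of attention in recent years. Many calibration methods of varying complexity have been proposed for the task, but there is no consensus about how expressive these should be. We focus on the task of confidence scaling, specifically on post-hoc methods that generalize temperature scaling, which we call the Adaptive Temperature Scaling family. We analyze expressive functions that improve calibration and propose interpretable methods. We show that when there is plenty of data, complex models such as neural networks yield better performance, but they are prone to fail when the amount of data is limited, a common situation in certain post-hoc calibration applications such as medical diagnosis. We study the functions that expressive methods learn under ideal conditions and design simpler methods that have a strong inductive bias towards these well-performing functions. Concretely, we propose Entropy-based Temperature Scaling, a simple method that scales the confidence of a prediction according to its entropy. Results show that our method obtains state-of-the-art performance compared with other methods and, unlike complex models, it is robust against data scarcity. Moreover, our proposed model enables a deeper interpretation of the calibration process.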
Deep neural networks (DNNs) are prone to miscalibrated predictions, often exhibiting a mismatch between the predicted output and the associated confidence scores. Contemporary model calibration techniques mitigate the problem of overconfident predictions by pushing down the confidence of the winning class while increasing the confidence of the remaining classes across all test samples. However, from a deployment perspective, an ideal model is desired to (i) generate well-calibrated predictions for high-confidence samples with predicted probability of, say, >0.95, and (ii) generate a higher proportion of legitimate high-confidence samples. To this end, we propose a novel regularization technique that can be used with classification losses, leading to state-of-the-art calibrated predictions at test time. From a deployment standpoint in safety-critical applications, only high-confidence samples from a well-calibrated model are of interest, as the remaining samples have to undergo manual inspection. Predictive confidence reduction of these potentially "high-confidence samples" is a downside of existing calibration approaches. We mitigate this by proposing a dynamic train-time data pruning strategy that prunes low-confidence samples every few epochs, providing an increase in "confident yet calibrated samples". We demonstrate state-of-the-art calibration performance across image classification benchmarks, reducing training time without much compromise in accuracy. We provide insights into why our dynamic pruning strategy that prunes low-confidence training samples leads to an increase in high-confidence samples at test time.
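We address the problem of uncertainty calibration and introduce a novel calibration method, Parametrized Temperature Scaling (PTS). Standard deep neural networks typically yield uncalibrated predictions, which can be transformed into calibrated confidence scores using post-hoc calibration methods. In this contribution, we demonstrate that the performance of accuracy-preserving state-of-the-art post-hoc calibrators is limited by their intrinsic expressive power. We generalize temperature scaling by computing prediction-specific temperatures, parameterized by a neural network. We show with extensive experiments that our novel accuracy-preserving approach consistently outperforms existing algorithms across a large number of model architectures, datasets, and metrics.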
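Although graph neural networks (GNNs) have achieved remarkable accuracy, whether their results are trustworthy is still unexplored. Previous studies suggest that many modern neural networks are over-confident in their predictions; surprisingly, however, we discover that GNNs mostly err in the opposite direction, i.e., GNNs are under-confident. Therefore, confidence calibration for GNNs is highly desired. In this paper, we propose a novel trustworthy GNN model by designing a topology-aware post-hoc calibration function. Specifically, we first verify that the confidence distribution in a graph has the homophily property, and this finding inspires us to design a calibration GNN model (CAGCN) to learn the calibration function. CAGCN is able to obtain a unique transformation from the logits of GNNs to the calibrated confidence for each node; meanwhile, this transformation preserves the order between classes, satisfying the accuracy-preserving property. Moreover, we apply the calibration GNN to a self-training framework, showing that more trustworthy pseudo labels can be obtained with the calibrated confidence, which further improves performance. Extensive experiments demonstrate the effectiveness of our proposed model in terms of both calibration and accuracy.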
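Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks such as visual question answering (VQA). However, while humans can say "I don't know" when they are uncertain (i.e., abstain from answering a question), this ability has been largely neglected in multimodal research, despite the importance of this problem to the use of VQA in real settings. In this work, we promote a problem formulation for reliable VQA, where we prefer abstention over providing an incorrect answer. We first enable abstention capabilities for several VQA models, and analyze both their coverage, the portion of questions answered, and their risk, the error on that portion. For that, we explore several abstention approaches. We find that although the best-performing models achieve over 71% accuracy on the VQA v2 dataset, introducing the option to abstain by directly using a model's softmax scores limits them to answering less than 8% of the questions in order to achieve a low risk of error (i.e., 1%). This motivates us to use a multimodal selection function to directly estimate the correctness of the predicted answers, which we show can increase the coverage by, for example, 2.4x, from 6.8% to 16.3% at 1% risk. While it is important to analyze both coverage and risk, these metrics have a trade-off that makes comparing VQA models challenging. To address this, we also propose an Effective Reliability metric for VQA that places a larger cost on incorrect answers than on abstentions. This new problem formulation, metric, and analysis for VQA provide the groundwork for building effective and reliable VQA models that have the self-awareness to abstain if and only if they don't know the answer.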
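The coverage and risk quantities defined in the abstract above can be computed for a simple confidence-threshold abstention rule as in the sketch below. The thresholding rule, the function name, and the convention of reporting zero risk when nothing is answered are assumptions for this illustration; the paper's learned selection function is not reproduced here.

```python
def coverage_and_risk(confidences, correct, threshold):
    """Coverage and risk when the model answers only if confidence >= threshold.

    coverage: fraction of all questions that are answered
    risk:     error rate among the answered questions
    """
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    if not answered:
        return coverage, 0.0  # convention chosen here: no answers, no errors
    risk = sum(1 for ok in answered if not ok) / len(answered)
    return coverage, risk


# Two of four questions are answered at threshold 0.7, and one of those is wrong.
print(coverage_and_risk([0.9, 0.8, 0.6, 0.4], [True, False, True, False], 0.7))
# prints (0.5, 0.5)
```

Raising the threshold lowers coverage and usually lowers risk, which is the trade-off the abstract describes.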
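Recent works have shown that explainability and robustness are two crucial ingredients of trustworthy and reliable text classification. However, previous works usually address only one of the two aspects: i) how to extract accurate rationales for explainability while being beneficial to prediction; ii) how to make the predictive model robust to different types of adversarial attacks. Intuitively, a model that produces helpful explanations should be more robust against adversarial attacks, because we cannot trust a model that outputs explanations but changes its prediction under small perturbations. To this end, we propose a joint classification and rationale extraction model named AT-BMC. It includes two key mechanisms: mixed Adversarial Training (AT), which is designed to use various perturbations in the discrete and embedding spaces to improve the model's robustness, and a Boundary Match Constraint (BMC), which helps to locate rationales more precisely with the guidance of boundary information. Performance on benchmark datasets demonstrates that the proposed AT-BMC outperforms baselines on both classification and rationale extraction by a large margin. Robustness analysis shows that the proposed AT-BMC decreases the attack success rate by up to 69%. The empirical results indicate that there are connections between robust models and better explanations.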
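Optimal decision making requires that classifiers produce uncertainty estimates consistent with their empirical accuracy. However, deep neural networks are often under- or over-confident in their predictions. Consequently, methods have been developed to improve the calibration of their predictive uncertainty both during training and post-hoc. In this work, we propose differentiable losses to improve calibration based on a soft (continuous) version of the binning operation underlying popular calibration-error estimators. When incorporated into training, these soft calibration losses achieve state-of-the-art single-model ECE across multiple datasets with less than 1% decrease in accuracy. For instance, we observe an 82% reduction in ECE (70% relative to the post-hoc rescaled ECE) in exchange for a 0.7% relative decrease in accuracy relative to the cross-entropy baseline on CIFAR-100. When incorporated post-training, the soft-binning-based calibration error objective improves upon temperature scaling, a popular recalibration method. Overall, experiments across losses and datasets demonstrate that using calibration-sensitive procedures yields better uncertainty estimates under dataset shift than the standard practice of using a cross-entropy loss and post-hoc recalibration methods.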
In this paper, we empirically analyze a simple, non-learnable, and nonparametric Nadaraya-Watson (NW) prediction head that can be used with any neural network architecture. In the NW head, the prediction is a weighted average of labels from a support set. The weights are computed from distances between the query feature and support features. This is in contrast to the dominant approach of using a learnable classification head (e.g., a fully-connected layer) on the features, which can be challenging to interpret and can yield poorly calibrated predictions. Our empirical results on an array of computer vision tasks demonstrate that the NW head can yield better calibration than its parametric counterpart, while having comparable accuracy and with minimal computational overhead. To further increase inference-time efficiency, we propose a simple approach that involves a clustering step run on the training set to create a relatively small distilled support set. In addition to using the weights as a means of interpreting model predictions, we further present an easy-to-compute "support influence function," which quantifies the influence of a support element on the prediction for a given query. As we demonstrate in our experiments, the influence function can allow the user to debug a trained model. We believe that the NW head is a flexible, interpretable, and highly useful building block that can be used in a range of applications.
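A minimal sketch of a Nadaraya-Watson style prediction head follows, assuming the weights are a softmax over negative squared Euclidean distances between the query feature and each support feature. The abstract only says the weights are computed from distances, so the kernel, the `temperature` parameter, and the function name are assumptions made for this example.

```python
import math


def nw_head(query, support_features, support_labels, num_classes, temperature=1.0):
    """Predict class probabilities as a distance-weighted average of support labels."""
    # Similarity: negative squared Euclidean distance (an assumed kernel choice).
    scores = [
        -sum((q - s) ** 2 for q, s in zip(query, feat)) / temperature
        for feat in support_features
    ]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]

    # The weighted average of one-hot labels is the sum of weights per class.
    probs = [0.0] * num_classes
    for w, label in zip(weights, support_labels):
        probs[label] += w
    return probs, weights


support = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
labels = [0, 0, 1]
probs, weights = nw_head([0.0, 0.5], support, labels, num_classes=2)
```

The returned `weights` show how much each support example contributed to the prediction, which is the kind of interpretability signal the abstract refers to.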
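Although deep learning prediction models have been successful in discriminating between different classes, they often suffer from poor calibration across challenging domains, including healthcare. Moreover, long-tailed distributions pose great challenges in deep learning classification problems, including clinical disease prediction. Several approaches have recently been proposed to calibrate deep predictions in computer vision, but no studies have demonstrated how the representative models work in different challenging contexts. In this paper, we bridge confidence calibration from computer vision to medical imaging with a comparative study of four high-impact calibration models. Our studies are conducted in different contexts (natural image classification and lung cancer risk estimation), including balanced vs. imbalanced training sets and computer vision vs. medical imaging. Our results support key findings: (1) We reach new conclusions that have not been studied under different learning contexts, e.g., combining two calibration models that both mitigate over-confident prediction can lead to under-confident prediction, and simpler calibration models from the computer vision domain tend to generalize more readily to medical imaging. (2) We highlight the gap between general computer vision tasks and medical imaging prediction, e.g., calibration methods that are ideal for general computer vision tasks may in fact damage the calibration of medical imaging prediction. (3) We also reinforce previous conclusions in natural image classification settings. We believe that this study can guide readers in choosing calibration models and in understanding the gaps between the general computer vision and medical imaging domains.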
In many task settings, text classification models are likely to encounter examples from novel classes on which they cannot predict correctly. Selective prediction, in which models abstain on low-confidence examples, provides a possible solution, but existing models are often overly confident on OOD examples. To remedy this overconfidence, we introduce Contrastive Novelty-Augmented Learning (CoNAL), a two-step method that generates OOD examples representative of novel classes, then trains to decrease confidence on them. First, we generate OOD examples by prompting a large language model twice: we prompt it to enumerate relevant novel labels, then generate examples from each novel class matching the task format. Second, we train our classifier with a novel contrastive objective that encourages lower confidence on generated OOD examples than training examples. When trained with CoNAL, classifiers improve in their ability to detect and abstain on OOD examples over prior methods by an average of 2.3% AUAC and 5.5% AUROC across 4 NLP datasets, with no cost to in-distribution accuracy.
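Despite the dominant performance of deep neural networks, recent works have shown that they are poorly calibrated, resulting in over-confident predictions. Miscalibration can be exacerbated by overfitting due to the minimization of the cross-entropy during training, as it promotes the predicted softmax probabilities to match the one-hot label assignments. This yields a pre-softmax activation of the correct class that is significantly larger than the remaining activations. Recent evidence from the literature suggests that loss functions that embed implicit or explicit maximization of the entropy of predictions yield state-of-the-art calibration performance. We provide a unifying constrained-optimization perspective of current state-of-the-art calibration losses. Specifically, these losses can be viewed as approximations of a linear penalty (or a Lagrangian) imposing equality constraints on logit distances. This points to an important limitation of such underlying equality constraints, whose ensuing gradients constantly push towards a non-informative solution, which might prevent the model from reaching the best compromise between discriminative performance and calibration during gradient-based optimization. Following our observations, we propose a simple and flexible generalization based on inequality constraints, which imposes a controllable margin on logit distances. Comprehensive experiments on a variety of image classification, semantic segmentation, and NLP benchmarks demonstrate that our method sets new state-of-the-art results on these tasks in terms of network calibration, without affecting the discriminative performance. The code is available at https://github.com/by-liu/mbls.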
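One way to read the inequality constraint with a controllable margin on logit distances is as a hinge penalty on how far each logit falls below the largest one. The sketch below is an illustration under that assumption; the function name and the exact form of the penalty are not taken from the abstract, and in practice such a term would be added to the cross-entropy loss with a weighting coefficient.

```python
def margin_penalty(logits, margin):
    """Sum of hinge penalties on logit distances that exceed the margin.

    For each class k the distance is max(logits) - logits[k]. Distances within
    the margin cost nothing, so the penalty stops pushing once the logits are
    close enough, rather than driving them all to be equal.
    """
    top = max(logits)
    return sum(max(0.0, top - z - margin) for z in logits)


print(margin_penalty([5.0, 1.0, 0.0], 2.0))   # prints 5.0
print(margin_penalty([5.0, 1.0, 0.0], 10.0))  # prints 0.0
```

With a margin of zero the penalty applies to every non-zero logit distance, which corresponds to the equality-constraint behavior the abstract identifies as a limitation.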