Overparameterized neural networks can be highly accurate on average on an i.i.d. test set yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups. However, we find that naively applying group DRO to overparameterized neural networks fails: these models can perfectly fit the training data, and any model with vanishing average training loss already has vanishing worst-case training loss. Instead, the poor worst-case performance arises from poor generalization on some groups. By coupling group DRO models with increased regularization (a stronger-than-typical ℓ2 penalty or early stopping), we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies. Our results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization. Finally, we introduce a stochastic optimization algorithm, with convergence guarantees, to efficiently train group DRO models.
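As a concrete illustration, below is a minimal sketch of a stochastic group DRO update (exponentiated-gradient ascent on group weights, descent on the model) paired with the strong ℓ2 penalty the abstract advocates. The model, tensor shapes, and step sizes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

n_groups = 4
eta_q = 0.01                            # step size for the adversarial group weights
q = torch.ones(n_groups) / n_groups     # weights over groups, updated online

model = torch.nn.Linear(128, 2)         # stand-in for an overparameterized network
opt = torch.optim.SGD(model.parameters(), lr=1e-3,
                      weight_decay=1.0)  # stronger-than-typical L2 penalty

def group_dro_step(x, y, g):
    """One stochastic update; g is an integer tensor of group ids for the batch."""
    global q
    losses = F.cross_entropy(model(x), y, reduction="none")
    group_losses = torch.stack([
        losses[g == k].mean() if (g == k).any() else torch.zeros(())
        for k in range(n_groups)
    ])
    q = q * torch.exp(eta_q * group_losses.detach())   # upweight the worst groups
    q = q / q.sum()
    opt.zero_grad()
    (q * group_losses).sum().backward()                # minimize the weighted loss
    opt.step()
```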
Standard training via empirical risk minimization (ERM) can produce models that achieve high accuracy on average but low accuracy on certain groups, especially in the presence of spurious correlations between the input and label. Prior approaches that achieve high worst-group accuracy, like group distributionally robust optimization (group DRO), require expensive group annotations for each training point, whereas approaches that do not use such group annotations typically achieve unsatisfactory worst-group accuracy. In this paper, we propose a simple two-stage approach, JTT, that first trains a standard ERM model for several epochs, and then trains a second model that upweights the training examples that the first model misclassified. Intuitively, this upweights examples from groups on which standard ERM models perform poorly, leading to improved worst-group performance. Averaged over four image classification and natural language processing tasks with spurious correlations, JTT closes 75% of the gap in worst-group accuracy between standard ERM and group DRO, while only requiring group annotations on a small validation set in order to tune hyperparameters.
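A hedged sketch of the two-stage JTT recipe: the estimator interface (`fit_erm`, `fit_final`) and the upweighting factor are illustrative assumptions, not the authors' exact code.

```python
import numpy as np

def jtt(train_x, train_y, fit_erm, fit_final, n_id_epochs=1, lambda_up=20):
    """Stage 1: a short ERM run to find the error set.
    Stage 2: retrain with the error set upweighted by lambda_up."""
    ident_model = fit_erm(train_x, train_y, epochs=n_id_epochs)
    error_set = ident_model.predict(train_x) != train_y   # points ERM got wrong
    weights = np.where(error_set, lambda_up, 1)           # upweighting ~ upsampling
    return fit_final(train_x, train_y, sample_weight=weights)
```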
While neural networks have shown remarkable success on classification tasks in terms of average-case performance, they often fail to perform well on certain groups of the data. Such group information may be expensive to obtain; thus, recent works in robustness and fairness have proposed ways to improve worst-group performance even when group labels are unavailable for the training data. However, these methods generally underperform approaches that use group information at training time. In this work, we assume access to a small number of group labels alongside a larger dataset without group labels. We propose a simple two-step framework that leverages this partial group information to improve worst-group performance: train a model to predict the missing group labels for the training data, and then use these predicted group labels in a robust optimization objective. Theoretically, we provide generalization bounds for our approach in terms of worst-group performance, showing how the generalization error scales with respect to both the total number of training points and the number of training points with group labels. Empirically, our method outperforms baselines that do not use group information, even when only 1-33% of points have group labels. We provide ablation studies to support the robustness and extensibility of our framework.
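The two-step framework might look like the following sketch, assuming scikit-learn-style estimators and a `robust_train` routine (e.g., the group DRO update sketched earlier); all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_step_robust_train(x_lab, g_lab, x_unlab, x_all, y_all, robust_train):
    """Step 1: predict the missing group labels from the small labeled subset.
    Step 2: pass true + predicted groups to a robust objective (e.g., group DRO)."""
    group_clf = LogisticRegression(max_iter=1000).fit(x_lab, g_lab)
    g_pred = group_clf.predict(x_unlab)
    groups = np.concatenate([g_lab, g_pred])   # assumes x_all = [x_lab; x_unlab]
    return robust_train(x_all, y_all, groups)
```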
Recently, generalization on out-of-distribution (OOD) data with correlation shift has attracted great attention. Correlation shift is caused by spurious attributes that correlate with the class label, as the correlations between them may differ between training and test data. For such a problem, we show that, given the class label, models that are conditionally independent of the spurious attributes are OOD generalizable. Based on this, a metric, Conditional Spurious Variation (CSV), which controls the OOD generalization error, is proposed to measure such conditional independence. To improve OOD generalization, we regularize the training process with the proposed CSV. Under mild assumptions, our training objective can be formulated as a nonconvex-concave mini-max problem. An algorithm with a provable convergence rate is proposed to solve the problem. Extensive empirical results verify the efficacy of our algorithm in improving OOD generalization.
Models trained via empirical risk minimization (ERM) are known to rely on spurious correlations between labels and task-independent input features, resulting in poor generalization to distributional shifts. Group distributionally robust optimization (G-DRO) can alleviate this problem by minimizing the worst-case loss over a set of pre-defined groups over training data. G-DRO successfully improves performance of the worst group, where the correlation does not hold. However, G-DRO assumes that the spurious correlations and associated worst groups are known in advance, making it challenging to apply to new tasks with potentially multiple unknown spurious correlations. We propose AGRO -- Adversarial Group discovery for Distributionally Robust Optimization -- an end-to-end approach that jointly identifies error-prone groups and improves accuracy on them. AGRO equips G-DRO with an adversarial slicing model to find a group assignment for training examples which maximizes worst-case loss over the discovered groups. On the WILDS benchmark, AGRO results in 8% higher model performance on average on known worst-groups, compared to prior group discovery approaches used with G-DRO. AGRO also improves out-of-distribution performance on SST2, QQP, and MS-COCO -- datasets where potential spurious correlations are as yet uncharacterized. Human evaluation of AGRO groups shows that they contain well-defined, yet previously unstudied spurious correlations that lead to model errors.
Learning models that gracefully handle distribution shifts is central to research on domain generalization, robust optimization, and fairness. A promising formulation is domain-invariant learning, which identifies the key issue of learning which features are domain-specific versus domain-invariant. An important assumption in this area is that the training examples are partitioned into "domains" or "environments". Our focus is on the more common setting where such partitions are not provided. We propose EIIL, a general framework for domain-invariant learning that incorporates Environment Inference to directly infer partitions that are maximally informative for downstream Invariant Learning. We show that EIIL outperforms invariant learning methods on the CMNIST benchmark without using environment labels, and significantly outperforms ERM on worst-group performance in the Waterbirds and CivilComments datasets. Finally, we establish connections between EIIL and algorithmic fairness, which enables EIIL to improve accuracy and calibration in a fair prediction problem.
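The environment-inference step can be sketched as ascent on the IRMv1 penalty of a fixed reference model with respect to soft environment assignments; the loop below is an illustrative approximation under that framing, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def infer_environments(logits, y, n_steps=500, lr=0.01):
    """logits: fixed reference-model outputs (N, C), detached; y: labels (N,).
    Returns soft assignments (N,) of each example to the first inferred environment."""
    w = torch.randn(len(y), requires_grad=True)        # per-example assignment logits
    opt = torch.optim.Adam([w], lr=lr)
    scale = torch.tensor(1.0, requires_grad=True)      # IRMv1 dummy multiplier
    per_ex = F.cross_entropy(logits * scale, y, reduction="none")
    for _ in range(n_steps):
        q = torch.sigmoid(w)                           # P(example in environment 1)
        penalty = 0.0
        for q_e in (q, 1.0 - q):                       # the two inferred environments
            risk_e = (q_e * per_ex).sum() / q_e.sum()
            grad = torch.autograd.grad(risk_e, scale, create_graph=True)[0]
            penalty = penalty + grad.pow(2)            # IRMv1 penalty per environment
        opt.zero_grad()
        (-penalty).backward(retain_graph=True)         # ascend the penalty
        opt.step()
    return torch.sigmoid(w).detach()
```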
Distributional shift is one of the major obstacles when transferring machine learning prediction systems from the lab to the real world. To tackle this problem, we assume that variation across training domains is representative of the variation we might encounter at test time, but also that shifts at test time may be more extreme in magnitude. In particular, we show that reducing differences in risk across training domains can reduce a model's sensitivity to a wide range of extreme distributional shifts, including the challenging setting where the input contains both causal and anticausal elements. We motivate this approach, Risk Extrapolation (REx), as a form of robust optimization over a perturbation set of extrapolated domains (MM-REx), and propose a penalty on the variance of training risks (V-REx) as a simpler variant. We prove that variants of REx can recover the causal mechanisms of the targets, while also providing some robustness to changes in the input distribution ("covariate shift"). By trading off robustness to causally induced distributional shifts and covariate shift, REx is able to outperform alternative methods such as Invariant Risk Minimization in situations where these types of shift co-occur.
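The V-REx variant reduces to a one-line penalty on the variance of per-domain risks; the sketch below assumes the per-domain average losses have already been computed, and the penalty weight beta is illustrative.

```python
import torch

def vrex_objective(risks, beta=10.0):
    """risks: 1-D tensor holding one average training loss per domain."""
    return risks.mean() + beta * risks.var()   # V-REx: mean risk + variance penalty
```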
Although overparameterized models have shown success on many machine learning tasks, accuracy can drop on test distributions that differ from the training distribution. This drop in accuracy still limits the application of machine learning in the wild. At the same time, importance weighting, a traditional technique to handle distribution shifts, has been demonstrated both empirically and theoretically to have little or no effect on overparameterized models. In this paper, we propose importance tempering to improve the decision boundary and achieve better results for overparameterized models. Theoretically, we show that the choice of group temperature can differ under label shift and spurious correlation settings. At the same time, we also prove that properly selected temperatures can extricate imbalanced classification from minority collapse. Empirically, we achieve state-of-the-art results on worst-group classification tasks using importance tempering.
Overparameterization has been shown to result in poor test accuracy on rare subgroups under a variety of settings where subgroup information is known. To gain a more complete picture, we consider the case where subgroup information is unknown. We investigate the effect of model size on worst-group generalization under empirical risk minimization (ERM) across a wide range of settings, varying: 1) architecture (ResNet, VGG, or BERT), 2) domain (vision or natural language processing), 3) model size (width or depth), and 4) initialization (with pre-trained or random weights). Our systematic evaluation reveals that increasing model size does not hurt, and may help, worst-group test performance under ERM across all settings. In particular, increasing pre-trained model size consistently improves performance on Waterbirds and MultiNLI. We recommend that practitioners use larger pre-trained models when subgroup labels are unknown.
Importance weighting is a classic technique to handle distribution shifts. However, prior work has presented strong empirical and theoretical evidence that importance weights can have little or no effect on overparameterized neural networks. Is importance weighting truly incompatible with the training of overparameterized neural networks? Our paper answers this in the negative. We show that importance weighting fails not because of overparameterization, but as a result of using exponentially-tailed losses such as the logistic or cross-entropy loss. As a remedy, we show that polynomially-tailed losses restore the effects of importance reweighting in correcting distribution shift in overparameterized models. We characterize the behavior of gradient descent with importance-weighted polynomially-tailed losses on overparameterized linear models, and theoretically demonstrate the advantage of using polynomially-tailed losses in a label shift setting. Surprisingly, our theory shows that using weights obtained by exponentiating the classical unbiased importance weights can improve performance. Finally, we demonstrate the practical value of our analysis with neural network experiments on a subpopulation shift dataset and a label shift dataset. When reweighted, our loss function can outperform reweighted cross-entropy by as much as 9% in test accuracy. Our loss function also gives test accuracies comparable to, or even exceeding, state-of-the-art methods for correcting distribution shifts.
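One way to realize a polynomially-tailed loss is the margin loss sketched below, which matches a logistic-type loss on negative margins but decays polynomially on positive margins so that importance weights keep influencing interpolating models. The specific tail form is an illustrative choice, not necessarily the authors' exact parameterization.

```python
import math
import torch
import torch.nn.functional as F

LOG2 = math.log(2.0)

def poly_tail_loss(margin, alpha=1.0):
    """Binary margin loss, continuous at 0: logistic-type for negative margins,
    ~ margin**(-alpha) decay for positive margins."""
    left = F.softplus(-margin)                 # log(1 + exp(-margin))
    m = margin.clamp(min=0.0)                  # clamp keeps gradients finite
    right = LOG2 / (1.0 + m) ** alpha          # polynomial right tail
    return torch.where(margin < 0, left, right)

def weighted_risk(scores, y_pm1, weights, alpha=1.0):
    """scores: model outputs; y_pm1: labels in {-1, +1}; weights: importance weights."""
    return (weights * poly_tail_loss(scores * y_pm1, alpha)).mean()
```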
Subpopulation shift exists widely in many real-world machine learning applications, referring to training and test distributions that contain the same subpopulation groups but differ in subpopulation frequencies. Importance reweighting is a standard way to handle subpopulation shift by imposing constant or adaptive sampling weights on each sample in the training dataset. However, some recent studies have recognized that most of these approaches fail to improve performance over empirical risk minimization, especially when applied to overparameterized neural networks. In this work, we propose a simple yet practical framework, called uncertainty-aware mixup (UMIX), to mitigate the overfitting issue in overparameterized models by reweighting the "mixed" samples according to the sample uncertainty. A training-trajectory-based uncertainty estimate is equipped in the proposed UMIX for each sample to flexibly characterize the subpopulation distribution. We also provide insightful theoretical analysis to verify that UMIX achieves better generalization bounds than prior works. Further, we conduct extensive empirical studies across a wide range of tasks to validate the effectiveness of our method both qualitatively and quantitatively.
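A hedged sketch of uncertainty-aware mixup: mixed samples are reweighted by a training-trajectory uncertainty score. The uncertainty input (e.g., the fraction of past epochs on which a point was misclassified) is an illustrative proxy, not the paper's exact estimator.

```python
import numpy as np

def umix_batch(x, y_onehot, uncertainty, alpha=0.2, rng=np.random):
    """Mix a batch and return per-example loss weights driven by uncertainty."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    # Upweight mixes that involve historically uncertain (likely minority) points.
    w = lam * uncertainty + (1 - lam) * uncertainty[perm]
    return x_mix, y_mix, w / w.mean()
```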
Classifiers are biased when trained on biased datasets. As a remedy, we propose Learning to Split (LS), an algorithm for automatic bias detection. Given a dataset with input-label pairs, LS learns to split the dataset so that predictors trained on the training split cannot generalize to the testing split. This performance gap suggests that the testing split is under-represented in the dataset, which is a signal of potential bias. Identifying non-generalizable splits is challenging, as we have no annotations about the bias. In this work, we show that the prediction correctness of each example in the testing split can be used as a source of weak supervision: generalization performance will drop if we move examples that are predicted correctly away from the testing split, leaving only those that are mispredicted. LS is task-agnostic and can be applied to any supervised learning problem, from natural language understanding and image classification to molecular property prediction. Empirical results show that LS is able to generate astonishingly challenging splits that correlate with human-identified biases. Moreover, we demonstrate that combining robust learning algorithms (such as group DRO) with splits identified by LS enables automatic de-biasing. Compared with the previous state-of-the-art, we substantially improve worst-group performance (by 23.4% on average) when the source of bias is unknown during training and validation.
Learning invariant representations is an important requirement when training machine learning models on datasets driven by spurious correlations. These spurious correlations between input samples and target labels wrongly direct neural network predictions, resulting in poor performance on certain groups, especially minority groups. Robust training against these spurious correlations requires knowledge of group membership for every sample. Such a requirement is significantly laborious in cases where the data-labeling effort for minority or rare groups is high, or where the individuals comprising the dataset choose to conceal sensitive information. On the other hand, the presence of such data collection efforts results in datasets that contain partially labeled group information. Recent works have addressed the fully unsupervised scenario where no labels for groups are available. We thus aim to fill the missing gap in the literature by tackling a more realistic setting that can leverage partially available sensitive or group information during training. First, we construct a constraint set and derive a high-probability bound for the group assignment to belong to this set. Second, we propose an algorithm that optimizes for the worst-off group assignments from the constraint set. Through experiments on image and tabular datasets, we show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
While unbiased machine learning models are essential for many applications, bias is a human-defined concept that can vary across tasks. Given only input-label pairs, algorithms may lack sufficient information to distinguish stable (causal) features from unstable (spurious) features. However, related tasks often share similar biases, an observation we may leverage to develop stable classifiers in the transfer setting. In this work, we explicitly inform the target classifier about unstable features in the source tasks. Specifically, we derive a representation that encodes the unstable features by contrasting different data environments in the source task. We achieve robustness by clustering the target task's data according to this representation and minimizing the worst-case risk across these clusters. We evaluate on both text and image classification. Empirical results demonstrate that our algorithm is able to maintain robustness on the target task for both synthetically generated environments and real-world environments. Our code is available at https://github.com/yujiabao/tofu.
Researchers in fair machine learning (ML) have coalesced around several fairness criteria which provide formal definitions of what it means for an ML model to be fair. However, these criteria have some serious limitations. We identify four key shortcomings of these formal fairness criteria and aim to help address them by extending performative prediction to include a distributionally robust objective.
Machine learning algorithms typically assume that training and test examples are drawn from the same distribution. However, distribution shift is a common problem in real-world applications and can cause models to perform dramatically worse at test time. In this paper, we specifically consider the problems of domain shift and subpopulation shift (e.g., imbalanced data). While prior works often seek to explicitly regularize the internal representations and predictors of the model to be domain invariant, we instead aim to regularize the whole function without restricting the model's internal representations. This leads to a simple mixup-based technique which learns invariant functions via selective augmentation, named LISA. LISA selectively interpolates samples either with the same label but different domains, or with the same domain but different labels. We analyze a linear setting and theoretically show how LISA leads to a smaller worst-group error. Empirically, we study the effectiveness of LISA on nine benchmarks ranging from subpopulation shifts to domain shifts, and we find that LISA consistently outperforms other state-of-the-art methods.
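LISA's selective pairing can be sketched as follows; pairing by rejection within a shuffled batch is an illustrative choice. The selected pairs would then be interpolated exactly as in standard mixup.

```python
import numpy as np

def lisa_pairs(y, d, same_label=True, rng=np.random):
    """Return index pairs (i, j) for selective interpolation: same label but
    different domain (intra-label), or same domain but different label (intra-domain)."""
    pairs = []
    for i in rng.permutation(len(y)):
        if same_label:
            ok = (y == y[i]) & (d != d[i])
        else:
            ok = (d == d[i]) & (y != y[i])
        candidates = np.flatnonzero(ok)
        if len(candidates):
            pairs.append((i, rng.choice(candidates)))
    return pairs
```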
While modern large-scale datasets often consist of heterogeneous subpopulations (for example, multiple demographic groups or multiple text corpora), the standard practice of minimizing average loss does not guarantee uniformly low losses across all subpopulations. We propose a convex procedure that controls the worst-case performance over all subpopulations of a given size. Our procedure comes with finite-sample (nonparametric) convergence guarantees on the worst-off subpopulation. Empirically, we observe on lexical similarity, wine quality, and recidivism prediction tasks that our worst-case procedure learns models that do well against unseen subpopulations.
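In its simplest finite-sample form, the worst-case loss over subpopulations containing at least an alpha fraction of the data is the average of the largest alpha-fraction of per-example losses (a CVaR of the loss); the paper's convex procedure optimizes a related population-level objective. A minimal sketch:

```python
import numpy as np

def worst_subpop_loss(losses, alpha=0.1):
    """Average loss over the worst-off alpha-fraction of examples."""
    k = max(1, int(np.ceil(alpha * len(losses))))
    return np.sort(losses)[-k:].mean()
```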
Many modern machine learning tasks require models with high tail performance, i.e., high performance over the worst-off samples in the dataset. This problem has been widely studied in fields such as algorithmic fairness, class imbalance, and risk-sensitive decision making. A popular approach to maximizing a model's tail performance is to minimize the CVaR (Conditional Value at Risk) loss, which computes the average risk over the tail of the loss distribution. However, for classification tasks in which models are evaluated by the zero-one loss, we show that if the classifiers are deterministic, then the minimizer of the average zero-one loss also minimizes the CVaR zero-one loss, suggesting that CVaR loss minimization does not help without additional assumptions. We circumvent this negative result by minimizing the CVaR loss over randomized classifiers, for which the minimizers of the average zero-one loss and the CVaR zero-one loss are no longer the same, so minimizing the latter can lead to better tail performance. To learn such randomized classifiers, we propose the Boosted CVaR Classification framework, which is motivated by a direct relationship between CVaR and a classical boosting algorithm called LPBoost. Based on this framework, we design an algorithm called $\alpha$-AdaLPBoost. We empirically evaluate our proposed algorithm on four benchmark datasets and show that it achieves higher tail performance than deterministic model training methods.
We propose simple active sampling and reweighting strategies for optimizing min-max fairness that can be applied to any classification or regression model learned via loss minimization. The key intuition behind our approach is to use, at each timestep, a datapoint from the group that is worst off under the current model to update the model. The ease of implementation and the generality of our robust formulation make it an attractive option for improving model performance on disadvantaged groups. For convex learning problems, such as linear or logistic regression, we provide a fine-grained analysis of our strategy, proving its rate of convergence to a min-max fair solution.
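The sampling loop admits a very short sketch; the model interface (`loss`, `sgd_step`) is a hypothetical stand-in for any learner trained by loss minimization.

```python
import numpy as np

def minmax_fair_sgd(model, data_by_group, n_steps=1000, lr=0.1, rng=np.random):
    """data_by_group: list of (X_g, y_g) pairs, one per group."""
    for _ in range(n_steps):
        risks = [model.loss(Xg, yg) for Xg, yg in data_by_group]
        g = int(np.argmax(risks))              # group the model currently serves worst
        Xg, yg = data_by_group[g]
        i = rng.randint(len(yg))
        model.sgd_step(Xg[i], yg[i], lr)       # update on a worst-group datapoint
    return model
```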
Empirical studies suggest that machine learning models trained with empirical risk minimization (ERM) often rely on attributes that may be spuriously correlated with the class labels. Such models typically lead to poor performance during inference for data lacking such correlations. In this work, we explicitly consider a situation where potential spurious correlations are present in the majority of training data. In contrast with existing approaches, which use the ERM model outputs to detect the samples without spurious correlations and either heuristically upweight or upsample those samples, we propose the logit correction (LC) loss, a simple yet effective improvement on the softmax cross-entropy loss, to correct the sample logit. We demonstrate that minimizing the LC loss is equivalent to maximizing the group-balanced accuracy, so the proposed LC could mitigate the negative impacts of spurious correlations. Our extensive experimental results further reveal that the proposed LC loss outperforms the SoTA solutions on multiple popular benchmarks by a large margin, an average 5.5% absolute improvement, without access to spurious attribute labels. LC is also competitive with oracle methods that make use of the attribute labels. Code is available at https://github.com/shengliu66/LC.
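In the spirit of the LC loss, the sketch below shifts each sample's logits by a log-prior correction before the softmax cross-entropy. The paper estimates its correction without attribute labels, so the fixed `log_prior` input here is a stand-in assumption.

```python
import torch.nn.functional as F

def logit_corrected_ce(logits, y, log_prior, tau=1.0):
    """logits: (N, C); log_prior: (N, C) per-sample correction term; y: (N,) labels.
    Adding the correction before the softmax shifts training toward
    group-balanced accuracy."""
    return F.cross_entropy(logits + tau * log_prior, y)
```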