Labeling large image datasets with attributes such as facial age or object type is tedious and sometimes infeasible. Supervised machine learning methods provide a highly accurate solution, but require manual labels which are often unavailable. Zero-shot models (e.g., CLIP) do not require manual labels but are not as accurate as supervised ones, particularly when the attribute is numeric. We propose a new approach, CLIPPR (CLIP with Priors), which adapts zero-shot models for regression and classification on unlabelled datasets. Our method does not use any annotated images. Instead, we assume a prior over the label distribution in the dataset. We then train an adapter network on top of CLIP under two competing objectives: (i) minimal change of predictions from the original CLIP model, and (ii) minimal distance between the predicted and prior distributions of labels. Additionally, we present a novel approach for selecting prompts for Vision & Language models using a distributional prior. Our method is effective and presents a significant improvement over the original model. We demonstrate an improvement of 28% in mean absolute error on the UTK age regression task. We also present promising results for classification benchmarks, improving the classification accuracy on the ImageNet dataset by 2.83%, without using any labels.
translated by 谷歌翻译
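The two competing objectives in the CLIPPR abstract can be illustrated with a small sketch for the classification case. This is not the authors' implementation: the use of KL divergence for both terms, the function names, and the `weight` parameter are assumptions made here for illustration only.

```python
import math

def label_histogram(probs):
    """Average per-image class probabilities into one dataset-level distribution."""
    n = len(probs)
    return [sum(row[k] for row in probs) / n for k in range(len(probs[0]))]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def clippr_style_loss(adapter_probs, clip_probs, prior, weight=1.0):
    """Hypothetical two-term loss; both distance choices are assumptions."""
    # (i) stay close to the frozen zero-shot predictions, image by image
    fidelity = sum(
        kl_divergence(c, a) for a, c in zip(adapter_probs, clip_probs)
    ) / len(adapter_probs)
    # (ii) pull the predicted label distribution towards the assumed prior
    prior_term = kl_divergence(prior, label_histogram(adapter_probs))
    return fidelity + weight * prior_term

clip_probs = [[0.9, 0.1], [0.1, 0.9]]
balanced = clippr_style_loss(clip_probs, clip_probs, prior=[0.5, 0.5])
skewed = clippr_style_loss(clip_probs, clip_probs, prior=[0.8, 0.2])
```

When the adapter reproduces CLIP exactly and the predicted histogram already matches the prior, both terms vanish; a mismatched prior makes the second term positive, which is what would push the adapter away from the zero-shot predictions.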
Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models. CLIP demonstrates impressive zero-shot classification and retrieval on diverse downstream tasks. However, to leverage its full potential, fine-tuning still appears to be necessary. Fine-tuning the entire CLIP model can be resource-intensive and unstable. Moreover, recent methods that aim to circumvent this need for fine-tuning still require access to images from the target distribution. In this paper, we pursue a different approach and explore the regime of training-free "name-only transfer" in which the only knowledge we possess about the downstream task comprises the names of downstream target categories. We propose a novel method, SuS-X, consisting of two key building blocks -- SuS and TIP-X, that requires neither intensive fine-tuning nor costly labelled data. SuS-X achieves state-of-the-art zero-shot classification results on 19 benchmark datasets. We further show the utility of TIP-X in the training-free few-shot setting, where we again achieve state-of-the-art results over strong training-free baselines. Code is available at https://github.com/vishaal27/SuS-X.
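Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs. It shows impressive performance on downstream tasks through zero-shot knowledge transfer. To further enhance CLIP's adaptation capability, existing methods propose fine-tuning additional learnable modules, which significantly improves few-shot performance but introduces extra training time and computational cost. In this paper, we propose a training-free adaptation method for CLIP to conduct few-shot classification, termed Tip-Adapter, which not only inherits the training-free advantage of zero-shot CLIP but also performs comparably to approaches that require training. Tip-Adapter constructs the adapter via a key-value cache model built from the few-shot training set, and updates the prior knowledge encoded in CLIP through feature retrieval. On top of that, the performance of Tip-Adapter can be further boosted to state of the art on ImageNet by fine-tuning the cache model for 10$\times$ fewer epochs than existing methods, which is both effective and efficient. We conduct extensive few-shot classification experiments on 11 datasets to demonstrate the superiority of our proposed method. Code is released at https://github.com/gaopengcuhk/tip-adapter.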
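The key-value cache idea in the Tip-Adapter abstract can be sketched as follows: few-shot image features act as keys, their one-hot labels as values, and a test feature retrieves a label vote that is added to the zero-shot logits. This is a minimal sketch under stated assumptions: the exponential affinity, the `alpha` and `beta` defaults, and the function names are illustrative choices, not details given in the abstract.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cache_logits(query, keys, values, zero_shot_weights, alpha=1.0, beta=5.5):
    """Combine zero-shot logits with a retrieval vote from a few-shot cache.

    query: L2-normalised test feature
    keys: L2-normalised few-shot features
    values: one-hot labels of the few-shot features
    zero_shot_weights: one text-derived classifier vector per class
    """
    # Affinity of the query to each cached feature; 1.0 for an exact match.
    affinities = [math.exp(-beta * (1.0 - dot(query, k))) for k in keys]
    num_classes = len(values[0])
    # Retrieved label vote: affinity-weighted sum of the cached one-hot labels.
    cache_term = [
        sum(a * v[c] for a, v in zip(affinities, values))
        for c in range(num_classes)
    ]
    # Prior knowledge from the frozen zero-shot classifier.
    clip_term = [dot(query, w) for w in zero_shot_weights]
    return [z + alpha * r for z, r in zip(clip_term, cache_term)]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1, 0], [0, 1]]
logits = cache_logits([1.0, 0.0], keys, values, [[0.0, 0.0], [0.0, 0.0]])
```

No gradient step is involved: the cache is filled once from the few-shot set, which is why the method can be called training-free. Fine-tuning, as the abstract describes, would then treat the keys as learnable parameters.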
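Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale contrastive image-text pairs. It shows impressive performance on zero-shot knowledge transfer to downstream tasks. To further enhance CLIP's few-shot capability, CLIP-Adapter proposed fine-tuning a lightweight residual feature adapter, which significantly improves few-shot classification performance. However, such a process still requires extra training and computational resources. In this paper, we propose the Training-Free CLIP-Adapter (Tip-Adapter), which not only inherits CLIP's training-free advantage but also performs comparably to or even better than CLIP-Adapter. Tip-Adapter does not require any backpropagation to train the adapter; instead, it creates the weights from a key-value cache model constructed from the few-shot training set. In this non-parametric manner, Tip-Adapter acquires well-performing adapter weights without any training, which is both efficient and effective. Moreover, the performance of Tip-Adapter can be further boosted by fine-tuning this properly initialized adapter for only a few epochs, with very fast convergence. We conduct extensive few-shot classification experiments on ImageNet and 10 other datasets to demonstrate the superiority of the proposed Tip-Adapter. The code will be released at https://github.com/gaopengcuhk/tip-adapter.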
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
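The zero-shot transfer step described in the CLIP abstract can be sketched as nearest-prompt classification in a shared embedding space. The sketch below assumes the image and text embeddings have already been produced by some pair of encoders; the toy vectors and the function names are hypothetical and do not use the released CLIP API.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_embedding, text_embeddings, class_names):
    """Pick the class whose prompt embedding is most similar to the image."""
    img = normalize(image_embedding)
    scores = [
        sum(a * b for a, b in zip(img, normalize(t))) for t in text_embeddings
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best], scores

# Toy 2-D embeddings standing in for encoder outputs (illustrative only).
label, scores = zero_shot_classify(
    [0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]], ["cat", "dog"]
)
```

Because the classifier is defined entirely by the text embeddings, adding a new class only requires embedding one more prompt, with no dataset-specific training.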
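While large pretrained foundation models (FMs) have shown remarkable zero-shot classification robustness to dataset-level distribution shifts, their robustness to subpopulation or group shifts is relatively underexplored. We study this problem and find that FMs such as CLIP may not be robust to various group shifts. Across 9 robustness benchmarks, zero-shot classification with their embeddings results in gaps of up to 80.7 percentage points (pp) between average and worst-group accuracy. Unfortunately, existing methods to improve robustness require retraining, which can be prohibitively expensive on large foundation models. We also find that efficient ways to improve model inference (e.g., via adapters, lightweight networks that take FM embeddings as inputs) do not consistently improve group robustness and can sometimes hurt it compared to zero-shot (e.g., increasing the accuracy gap by 50.1 pp on CelebA). We therefore develop an adapter training strategy to effectively and efficiently improve FM group robustness. Our motivating observation is that while poor robustness results from groups in the same class being embedded far apart in the foundation model "embedding space", standard adapter training may not bring these points closer together. We thus propose contrastive adapting, which trains adapters with contrastive learning to bring sample embeddings close to both their ground-truth class embeddings and other sample embeddings in the same class. Across the 9 benchmarks, our approach consistently improves group robustness, raising worst-group accuracy by 8.5 to 56.0 pp. Our approach is also efficient, doing so without any FM fine-tuning and with only a fixed set of frozen FM embeddings. On benchmarks such as Waterbirds and CelebA, this leads to worst-group accuracy comparable to state-of-the-art methods that retrain entire models, while training only $\leq$1% of the model parameters.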
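Few-shot classification requires deep neural networks to learn generalized representations from only limited training images, which is challenging but important in low-data regimes. Recently, CLIP-based methods have shown promising few-shot performance, benefiting from contrastive language-image pre-training. Based on this, we ask whether large-scale pre-training can alleviate the few-shot data deficiency and also assist representation learning through pre-learned knowledge. In this paper, we propose CoMo, a Collaboration of pre-trained Models that incorporates diverse prior knowledge from various pre-training paradigms for better few-shot learning. CoMo includes CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, and DALL-E's language-generative knowledge. Specifically, CoMo works in two aspects: few-shot data expansion and diverse knowledge ensembling. For the first, we generate synthetic images via zero-shot DALL-E to enrich the few-shot training data without any human effort. For the second, we introduce a learnable Multi-Knowledge Adapter (MK-Adapter) to adaptively blend the predictions from CLIP and DINO. Through this collaboration, CoMo can fully unleash the potential of different pre-training methods and unify them for few-shot classification. We conduct extensive experiments on 11 datasets to demonstrate the superiority and generalization ability of our approach.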
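Solving multi-label recognition (MLR) for images in the low-label regime is a challenging task with many real-world applications. Recent work learns an alignment between textual and visual spaces to compensate for insufficient image labels, but loses accuracy because of the limited amount of available MLR annotations. In this work, we utilize the strong alignment of textual and visual features pretrained on millions of auxiliary image-text pairs and propose Dual Context Optimization (DualCoOp) as a unified framework for partial-label MLR and zero-shot MLR. DualCoOp encodes positive and negative contexts with class names as part of the linguistic input (i.e., prompts). Since DualCoOp introduces only a very light learnable overhead on top of the pretrained vision-language framework, it can quickly adapt to multi-label recognition tasks that have limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the advantages of our approach over state-of-the-art methods.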
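Pre-trained vision-language models (e.g., CLIP) have shown promising zero-shot generalization in many downstream tasks with properly designed text prompts. Instead of relying on hand-engineered prompts, recent works learn prompts using training data from downstream tasks. While effective, training on domain-specific data reduces a model's ability to generalize to unseen new domains. In this work, we propose test-time prompt tuning (TPT), a method that can learn adaptive prompts on the fly with a single test sample. For image classification, TPT optimizes the prompt by minimizing the entropy with confidence selection so that the model has consistent predictions across different augmented views of each test sample. In evaluating generalization to natural distribution shifts, TPT improves zero-shot top-1 accuracy by 3.6% on average, surpassing previous prompt tuning approaches that require additional task-specific training data. In evaluating cross-dataset generalization with unseen categories, TPT performs on par with state-of-the-art approaches that use additional training data. Project page: https://azshue.github.io/tpt.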
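The objective described in the TPT abstract, entropy minimization with confidence selection over augmented views, can be sketched as a plain function of the per-view class probabilities. This is an illustrative sketch only: the `keep_fraction` default and the function names are assumptions, and the gradient step that would update the prompt is omitted.

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution given as a list."""
    return -sum(pi * math.log(pi + eps) for pi in p)

def tpt_objective(view_probs, keep_fraction=0.1):
    """Entropy of the averaged prediction over the most confident views.

    view_probs: one class-probability list per augmented view of a test image.
    keep_fraction: share of lowest-entropy views to keep (assumed value).
    """
    n_keep = max(1, int(len(view_probs) * keep_fraction))
    # Confidence selection: discard high-entropy (uninformative) views.
    confident = sorted(view_probs, key=entropy)[:n_keep]
    num_classes = len(confident[0])
    averaged = [
        sum(p[c] for p in confident) / n_keep for c in range(num_classes)
    ]
    # Minimising this value favours consistent, confident predictions.
    return entropy(averaged)

views = [[0.5, 0.5], [0.99, 0.01], [0.6, 0.4], [0.98, 0.02]]
selected = tpt_objective(views, keep_fraction=0.5)
unfiltered = tpt_objective(views, keep_fraction=1.0)
```

In a real setup the probabilities would depend on learnable prompt vectors, and this scalar would be differentiated with respect to them for each test sample.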
Although massive pre-trained vision-language models like CLIP show impressive generalization capabilities for many tasks, it often remains necessary to fine-tune them for improved performance on specific datasets. When doing so, it is desirable that updating the model is fast and that the model does not lose its capabilities on data outside of the dataset, a loss that classical fine-tuning approaches often incur. In this work we suggest a lightweight adapter that only updates the model's predictions close to seen datapoints. We demonstrate the effectiveness and speed of this relatively simple approach in the context of few-shot learning, where our results on classes both seen and unseen during training are comparable with or improve on the state of the art.
Contrastive Language-Image Pre-trained (CLIP) models have the zero-shot ability to classify an image as belonging to "[CLASS]" by using the similarity between the image and the prompt sentence "a [CONTEXT] of [CLASS]". Based on exhaustive text cues in "[CONTEXT]", the CLIP model is aware of different contexts, e.g. background, style, viewpoint, and exhibits unprecedented robustness against a wide range of distribution shifts. However, recent works find that further fine-tuning of CLIP models improves accuracy but sacrifices robustness on downstream tasks. We conduct an empirical investigation to show that fine-tuning corrupts the context-aware ability of pre-trained CLIP features. To solve this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT). CAR-FT regularizes the model during fine-tuning to capture the context information. Specifically, we use zero-shot prompt weights to get the context distribution contained in the image. By minimizing the Kullback-Leibler Divergence (KLD) between the context distributions induced by the original and fine-tuned CLIP models, CAR-FT lets downstream tasks inherit the context-aware ability of CLIP, and achieves both higher In-Distribution (ID) and Out-Of-Distribution (OOD) accuracy. The experimental results show that CAR-FT achieves superior robustness on five OOD test datasets of ImageNet, while also bringing accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous Domain Generalization (DG) methods and reaches 78.5% average accuracy on the DomainBed benchmark, establishing a new state of the art.
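The regularizer described in the CAR-FT abstract can be sketched as a KL divergence between two context distributions, one computed from the original image feature and one from the fine-tuned feature, both scored against the same context prompt weights. The softmax parameterisation, the toy vectors, and the function names below are assumptions for illustration; the abstract does not specify them.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def context_distribution(image_feature, context_weights):
    """Distribution over contexts from similarities to context prompt weights."""
    sims = [sum(a * b for a, b in zip(image_feature, w)) for w in context_weights]
    return softmax(sims)

def context_kl(original_feature, finetuned_feature, context_weights, eps=1e-12):
    """KL between context distributions of the original and fine-tuned models."""
    p = context_distribution(original_feature, context_weights)
    q = context_distribution(finetuned_feature, context_weights)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy weights for two contexts, e.g. "photo" and "sketch" (illustrative only).
context_weights = [[1.0, 0.0], [0.0, 1.0]]
unchanged = context_kl([1.0, 0.0], [1.0, 0.0], context_weights)
drifted = context_kl([1.0, 0.0], [0.0, 1.0], context_weights)
```

Added to the task loss during fine-tuning, a term like this penalises features whose implied context has drifted away from what the pre-trained model would have inferred.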
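Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate the transferability of these models due to the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark and toolkit for evaluating (pre-trained) language-augmented visual models. ELEVATER is composed of three components. (i) Datasets. As a downstream evaluation suite, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyperparameter tuning toolkit is developed to facilitate model evaluation on downstream tasks. (iii) Metrics. A variety of evaluation metrics are used to measure sample efficiency (zero-shot and few-shot) and parameter efficiency (linear probing and full model fine-tuning). We publicly release ELEVATER at https://computer-vision-in-the-wild.github.io/elevater/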
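Contrastive Vision-Language Pre-training (CLIP) has recently drawn increasing attention for its transferable visual representation learning. Supervised by large-scale image-text pairs, CLIP is able to align paired images and texts and thus conduct zero-shot recognition in open-vocabulary scenarios. However, there exists a semantic gap between the specific application and the generally pre-trained knowledge, which makes the matching sub-optimal on downstream tasks. In this paper, we propose VT-CLIP to enhance vision-language modeling via visual-guided texts. Specifically, we guide the text feature to adaptively explore informative regions of the image and aggregate the visual feature through a cross-attention mechanism. In this way, the visual-guided text becomes more semantically correlated with the image, which greatly benefits the matching process. In few-shot settings, we evaluate VT-CLIP on 11 well-known classification datasets and conduct extensive ablation studies to demonstrate its effectiveness. The code will be released soon.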
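We introduce compositional soft prompting (CSP), a parameter-efficient learning technique to improve the zero-shot compositionality of large-scale pretrained vision-language models (VLMs). VLMs can represent arbitrary classes as natural language prompts in their flexible text encoders, but they underperform on compositional zero-shot benchmark tasks. To improve VLMs, we propose a novel form of soft prompting. We treat the attributes and objects that are composed to define classes as learnable tokens of vocabulary and tune them on multiple prompt compositions. During inference, we recompose the learned attribute-object vocabulary in new combinations. We show that CSP outperforms the original VLM on benchmark datasets by an average of 10.9 percentage points on AUC. CSP also outperforms CoOp, a soft prompting method that tunes the prefix context, by an average of 5.8 percentage points on AUC. We perform additional experiments to show that CSP improves generalization to attribute-only classification, higher-order attribute-attribute-object compositions, and combinations of pretrained attributes and fine-tuned objects.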
Prompt Tuning, conditioning on task-specific learned prompt vectors, has emerged as a data-efficient and parameter-efficient method for adapting large pretrained vision-language models to multiple downstream tasks. However, existing approaches usually consider learning prompt vectors for each task independently from scratch, thereby failing to exploit the rich shareable knowledge across different vision-language tasks. In this paper, we propose multitask vision-language prompt tuning (MVLPT), which incorporates cross-task knowledge into prompt tuning for vision-language models. Specifically, (i) we demonstrate the effectiveness of learning a single transferable prompt from multiple source tasks to initialize the prompt for each target task; (ii) we show many target tasks can benefit each other from sharing prompt vectors and thus can be jointly learned via multitask prompt tuning. We benchmark the proposed MVLPT using three representative prompt tuning methods, namely text prompt tuning, visual prompt tuning, and the unified vision-language prompt tuning. Results in 20 vision tasks demonstrate that the proposed approach outperforms all single-task baseline prompt tuning methods, setting the new state-of-the-art on the few-shot ELEVATER benchmarks and cross-task generalization benchmarks. To understand where the cross-task knowledge is most effective, we also conduct a large-scale study on task transferability with 20 vision tasks in 400 combinations for each prompt tuning method. It shows that the most performant MVLPT for each prompt tuning method prefers different task combinations and many tasks can benefit each other, depending on their visual similarity and label similarity. Code is available at https://github.com/sIncerass/MVLPT.
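Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Unlike traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time tuning words, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models to downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while all pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin and gains significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.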
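Large Vision & Language models pretrained on web-scale data provide representations that are invaluable for numerous V&L problems. However, it is unclear how they can be used for reasoning about user-specific visual concepts in unstructured language. This problem arises in multiple domains, from personalized image retrieval to personalized interaction with smart devices. We introduce a new learning setup called Personalized Vision & Language (PerVL), with two new benchmark datasets for retrieving and segmenting user-specific "personalized" concepts "in the wild". In PerVL, personalized concepts should be learned (1) independently of the downstream task, (2) in a way that allows a pretrained model to reason about them with free language, and (3) without requiring personalized negative examples. We propose an architecture for solving PerVL that operates by extending the input vocabulary of a pretrained model with new word embeddings for the new personalized concepts. The model can then reason about them by simply using them in a sentence. We demonstrate that our approach learns personalized visual concepts from a few examples and can effectively apply them in image retrieval and semantic segmentation using rich textual queries.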
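This work is on training a generative action/video recognition model whose output is a free-form, action-specific caption describing the video (rather than an action class label). A generative approach has practical advantages, such as producing more fine-grained and human-readable output and being naturally open-world. To this end, we propose to adapt a pre-trained generative Vision & Language (V&L) foundation model for video/action recognition. While there have recently been a few attempts to adapt V&L models trained with contrastive learning (e.g., CLIP) for video/action, to the best of our knowledge we propose the first method that sets out to accomplish this goal for a generative model. We first show that directly fine-tuning a generative model to produce action classes suffers from severe overfitting. To alleviate this, we introduce REST, a training framework consisting of two key components: (a) an unsupervised method for adapting the generative model to action/video by means of pseudo-caption generation and self-training, i.e., without using any action-specific labels; and (b) a CLIP-based retrieval approach for discovering a diverse set of pseudo-captions for each video to train the model. Importantly, we show that both components are necessary to obtain high accuracy. We evaluate REST on the problem of zero-shot action recognition, where we show that our approach is very competitive compared with contrastive learning-based methods. Code will be made available.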
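Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP) have shown inspiring performance on 2D visual recognition, where the model learns to match images with their corresponding texts in open-vocabulary settings. However, it remains underexplored whether CLIP, pre-trained on large-scale image-text pairs in 2D, can be generalized to 3D recognition. In this paper, we show that such a setting is feasible by proposing PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts. Specifically, we encode a point cloud by projecting it into multi-view depth maps without rendering, and aggregate the view-wise zero-shot predictions to achieve knowledge transfer from 2D to 3D. On top of that, we design an inter-view adapter to better extract the global feature and adaptively fuse the few-shot knowledge learned from 3D into CLIP pre-trained in 2D. By just fine-tuning the lightweight adapter in few-shot settings, the performance of PointCLIP can be largely improved. In addition, we observe a complementary property between PointCLIP and classical 3D-supervised networks. Through simple ensembling, PointCLIP boosts the baseline's performance and even surpasses state-of-the-art models. Therefore, PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP under low resource cost and in low-data regimes. We conduct thorough experiments on the widely adopted ModelNet10 and ModelNet40 and the challenging ScanObjectNN to demonstrate the effectiveness of PointCLIP. The code is released at https://github.com/zrrskywalker/pointclip.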
Incidental supervision from language has become a popular approach for learning generic visual representations that can be prompted to perform many recognition tasks in computer vision. We conduct an in-depth exploration of the CLIP model and show that its visual representation is often strongly biased towards solving some tasks more than others. Moreover, which task the representation will be biased towards is unpredictable, with little consistency across images. To resolve this task bias, we show how to learn a visual prompt that guides the representation towards features relevant to their task of interest. Our results show that these visual prompts can be independent of the input image and still effectively provide a conditioning mechanism to steer visual representations towards the desired task.