预训练的视觉模型(例如,剪辑)在许多下游任务中显示出有希望的零弹性概括,并具有正确设计的文本提示。最近的作品不依赖手工设计的提示,而是使用下游任务的培训数据来学习提示。虽然有效,但针对领域数据的培训却降低了模型的概括能力,使其无法看到新领域。在这项工作中,我们提出了测试时间提示调整(TPT),该方法可以通过单个测试样本即时学习自适应提示。对于图像分类,TPT通过使用置信度选择最小化熵来优化提示,以便模型在每个测试样本的不同增强视图上都具有一致的预测。在评估对自然分布变化的概括时,TPT平均将零击的TOP-1精度提高了3.6%,超过了先前需要其他特定于任务的训练数据的迅速调整方法。在评估看不见类别的跨数据集泛化时,TPT与使用其他培训数据的最先进方法相当。项目页面:https://azshue.github.io/tpt。
translated by 谷歌翻译
诸如剪辑之类的大型预训练的视觉模型在学习表现方面表现出巨大的潜力,这些模型可以在各种下游任务中转移。与主要基于离散标签的传统表示学习不同,视觉语言预训练会使图像和文本在公共特征空间中对齐,这允许通过提示零弹性转移到下游任务,即从分类权重合成。描述兴趣类的自然语言。在这项工作中,我们表明,在实践中部署此类模型的一个重大挑战是及时的工程,它需要域专业知识,并且非常耗时 - 由于措辞的略有变化,需要花费大量时间来进行单词调整可能会对性能产生巨大影响。受到自然语言处理(NLP)迅速学习研究的最新进展的启发,我们提出了上下文优化(COP),这是一种专门用于调整类似剪辑的视觉语言模型的简单方法,用于下游图像识别。具体而言,Coop用可学习的向量建模了提示A的上下文单词,而整个预训练的参数则保持固定。为了处理不同的图像识别任务,我们提供了两个COOP的实现:统一上下文和特定于班级的上下文。通过在11个数据集上进行的大量实验,我们证明Coop只需要一两个镜头才能以相当的利润击败手工制作的提示,并且能够以16张镜头(例如16张照片)获得迅速工程的显着改进增益约为15%(最高达到45%以上)。尽管是一种基于学习的方法,但与使用手工制作的提示相比,Coop与零拍模型相比,取得了出色的域泛化性能。
translated by 谷歌翻译
Prompt learning is one of the most effective and trending ways to adapt powerful vision-language foundation models like CLIP to downstream datasets by tuning learnable prompt vectors with very few samples. However, although prompt learning achieves excellent performance over in-domain data, it still faces the major challenge of generalizing to unseen classes and domains. Some existing prompt learning methods tackle this issue by adaptively generating different prompts for different tokens or domains but neglecting the ability of learned prompts to generalize to unseen domains. In this paper, we propose a novel prompt learning paradigm that directly generates domain invariant prompt generalizable to unseen domains, called MetaPrompt. Specifically, a dual-modality prompt tuning network is proposed to generate prompts for inputs from both image and text modalities. More importantly, we propose a meta-learning-based prompt tuning algorithm that explicitly constrains the prompt tuned on a specific domain or class also to achieve good performance on another domain or class. Extensive experiments on 11 datasets for base-to-new generalization and four datasets for domain generalization demonstrate that our method consistently and significantly outperforms existing methods.
translated by 谷歌翻译
Prompt Tuning, conditioning on task-specific learned prompt vectors, has emerged as a data-efficient and parameter-efficient method for adapting large pretrained vision-language models to multiple downstream tasks. However, existing approaches usually consider learning prompt vectors for each task independently from scratch, thereby failing to exploit the rich shareable knowledge across different vision-language tasks. In this paper, we propose multitask vision-language prompt tuning (MVLPT), which incorporates cross-task knowledge into prompt tuning for vision-language models. Specifically, (i) we demonstrate the effectiveness of learning a single transferable prompt from multiple source tasks to initialize the prompt for each target task; (ii) we show many target tasks can benefit each other from sharing prompt vectors and thus can be jointly learned via multitask prompt tuning. We benchmark the proposed MVLPT using three representative prompt tuning methods, namely text prompt tuning, visual prompt tuning, and the unified vision-language prompt tuning. Results in 20 vision tasks demonstrate that the proposed approach outperforms all single-task baseline prompt tuning methods, setting the new state-of-the-art on the few-shot ELEVATER benchmarks and cross-task generalization benchmarks. To understand where the cross-task knowledge is most effective, we also conduct a large-scale study on task transferability with 20 vision tasks in 400 combinations for each prompt tuning method. It shows that the most performant MVLPT for each prompt tuning method prefers different task combinations and many tasks can benefit each other, depending on their visual similarity and label similarity. Code is available at https://github.com/sIncerass/MVLPT.
translated by 谷歌翻译
Contrastive Language-Image Pre-trained (CLIP) models have zero-shot ability of classifying an image belonging to "[CLASS]" by using similarity between the image and the prompt sentence "a [CONTEXT] of [CLASS]". Based on exhaustive text cues in "[CONTEXT]", CLIP model is aware of different contexts, e.g. background, style, viewpoint, and exhibits unprecedented robustness against a wide range of distribution shifts. However, recent works find further fine-tuning of CLIP models improves accuracy but sacrifices the robustness on downstream tasks. We conduct an empirical investigation to show fine-tuning will corrupt the context-aware ability of pre-trained CLIP features. To solve this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT). CAR-FT regularizes the model during fine-tuning to capture the context information. Specifically, we use zero-shot prompt weights to get the context distribution contained in the image. By minimizing the Kullback-Leibler Divergence (KLD) between context distributions induced by original/fine-tuned CLIP models, CAR-FT makes the context-aware ability of CLIP inherited into downstream tasks, and achieves both higher In-Distribution (ID) and Out-Of-Distribution (OOD) accuracy. The experimental results show CAR-FT achieves superior robustness on five OOD test datasets of ImageNet, and meanwhile brings accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous Domain Generalization (DG) methods and gets 78.5% averaged accuracy on DomainBed benchmark, building the new state-of-the-art.
translated by 谷歌翻译
从自然语言监督中学习视觉表示,最近在许多开创性的作品中表现出了巨大的希望。通常,这些具有语言的视觉模型表现出对各种数据集和任务的强大可传递性。但是,由于缺乏易于使用的评估工具包和公共基准,评估这些模型的可转让性仍然很具有挑战性。为了解决这个问题,我们构建了高级版(评估语言的视觉任务级传输),这是用于评估(预训练)语言增强视觉模型的第一个基准和工具包。升华由三个组成部分组成。 (i)数据集。作为下游评估套件,它由20个图像分类数据集和35个对象检测数据集组成,每个数据集都用外部知识来增强。 (ii)工具包。开发了自动高参数调谐工具包,以促进下游任务的模型评估。 (iii)指标。多种评估指标用于测量样品效率(零射击和少量)和参数效率(线性探测和完整模型微调)。我们在https://computer-vision-in-the-wild.github.io/elevater/上公开发布leverater
translated by 谷歌翻译
虽然大型审计的基础模型(FMS)对数据集级别的分布变化显示出显着的零击分类鲁棒性,但它们对亚群或组移动的稳健性相对却相对不受欢迎。我们研究了这个问题,并发现诸如剪辑之类的FMS可能对各种群体转移可能不健壮。在9个稳健性基准中,其嵌入式分类零射击分类导致平均和最差组精度之间的差距高达80.7个百分点(PP)。不幸的是,现有的改善鲁棒性的方法需要重新培训,这在大型基础模型上可能非常昂贵。我们还发现,改善模型推理的有效方法(例如,通过适配器,具有FM嵌入式作为输入的轻量级网络)不会持续改进,有时与零击相比会伤害组鲁棒性(例如,将精度差距提高到50.1 pp on 50.1 pp on On on 50.1 pp on Celeba)。因此,我们制定了一种适配器培训策略,以有效有效地改善FM组的鲁棒性。我们激励的观察是,尽管同一阶级中的群体中较差的鲁棒性在基础模型“嵌入空间”中分开,但标准适配器训练可能不会使这些要点更加紧密。因此,我们提出了对比度的适应,该适应器会通过对比度学习进行训练适配器,以使样品嵌入在同一类中的地面真相类嵌入和其他样品嵌入。在整个9个基准测试中,我们的方法始终提高组鲁棒性,使最差的组精度提高了8.5至56.0 pp。我们的方法也是有效的,这样做的方法也没有任何FM芬太尼,只有一组固定的冷冻FM嵌入。在水鸟和Celeba等基准上,这导致最差的组精度可与最先进的方法相媲美,而最先进的方法可以重新训练整个模型,而仅训练$ \ leq $ 1%的模型参数。
translated by 谷歌翻译
Prompt tuning is a new few-shot transfer learning technique that only tunes the learnable prompt for pre-trained vision and language models such as CLIP. However, existing prompt tuning methods tend to learn spurious or entangled representations, which leads to poor generalization to unseen concepts. Towards non-spurious and efficient prompt learning from limited examples, this paper presents a novel \underline{\textbf{C}}ounterfactual \underline{\textbf{P}}rompt \underline{\textbf{L}}earning (CPL) method for vision and language models, which simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework. Particularly, CPL constructs counterfactual by identifying minimal non-spurious feature change between semantically-similar positive and negative samples that causes concept change, and learns more generalizable prompt representation from both factual and counterfactual examples via contrastive learning. Extensive experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks than previous prompt tuning methods on CLIP. On image classification, we achieve 3.55\% average relative improvement on unseen classes across seven datasets; on image-text retrieval and visual question answering, we gain up to 4.09\% and 25.08\% relative improvements across three few-shot scenarios on unseen test sets respectively.
translated by 谷歌翻译
随着大型预训练的Vison语言模型(如剪辑)的出现,可以通过及时调整来调整可转让表示形式。及时调整试图从存储在预训练的视觉模型的图像和文本编码器中的常识中探索有益信息,以探索下游任务。最近提出的名为“上下文优化”(COP)的方法将一组可学习的向量从语言侧引入文本提示符,而单独调整文本提示符则不会影响图像编码器的计算视觉特征,从而导致了次级优势。在本文中,我们通过学习文本提示并同时为文本和图像编码器提供双重模式提示调整范式。此外,为了使视觉提示更多地集中在目标视觉概念上,我们提出了类感知的视觉及时调整(CAVPT),该调整是通过在模板提示和视觉类别令牌嵌入的语言描述之间进行交叉注意来动态生成的。我们的方法提供了一种新的范式来调整大型预训练的视觉模型,并在8个数据集上进行了广泛的实验结果,证明了该方法的有效性。我们的代码在补充材料中可用。
translated by 谷歌翻译
Pre-trained Vision-Language Models (VLMs) such as CLIP have shown impressive generalization capability in downstream vision tasks with appropriate text prompts. Instead of designing prompts manually, Context Optimization (CoOp) has been recently proposed to learn continuous prompts using task-specific training data. Despite the performance improvements on downstream tasks, several studies have reported that CoOp suffers from the overfitting issue in two aspects: (i) the test accuracy on base classes first gets better and then gets worse during training; (ii) the test accuracy on novel classes keeps decreasing. However, none of the existing studies can understand and mitigate such overfitting problem effectively. In this paper, we first explore the cause of overfitting by analyzing the gradient flow. Comparative experiments reveal that CoOp favors generalizable and spurious features in the early and later training stages respectively, leading to the non-overfitting and overfitting phenomenon. Given those observations, we propose Subspace Prompt Tuning (SubPT) to project the gradients in back-propagation onto the low-rank subspace spanned by the early-stage gradient flow eigenvectors during the entire training process, and successfully eliminate the overfitting problem. Besides, we equip CoOp with Novel Feature Learner (NFL) to enhance the generalization ability of the learned prompts onto novel categories beyond the training set, needless of image training data. Extensive experiments on 11 classification datasets demonstrate that SubPT+NFL consistently boost the performance of CoOp and outperform the state-of-the-art approach CoCoOp. Experiments on more challenging vision downstream tasks including open-vocabulary object detection and zero-shot semantic segmentation also verify the effectiveness of the proposed method. Codes can be found at https://tinyurl.com/mpe64f89.
translated by 谷歌翻译
诸如剪辑之类的对比视觉模型在转移学习方面已显示出巨大进展。在推理阶段,需要仔细设计适当的文本描述,也称为提示,以正确地对给定的图像进行分类。为了避免繁琐的及时工程,最近的作品,例如Coop,Clip-Audapter和Tip-Adapter,建议将视觉模型改编成下游图像识别任务,以在一小部分标记的数据上。尽管实现了有希望的改进,但是需要来自目标数据集的标记数据可能会限制可扩展性。在本文中,我们探讨了一种不同的情况,在该场景中,目标数据集的标签未经证实,并提出了一种无监督的及时学习方法(UPL)方法,以避免及时工程,同时改善类似夹子的视觉模型的传递性能。据我们所知,UPL是第一项将无监督学习引入及时学习的工作。在实验上,我们的UPL在ImageNet以及其他10个数据集上及时使用及时的工程剪辑优于原始剪辑。增强版本的UPL甚至与大多数数据集的8-Shot Coop和8-Shot Tip-Adapter都具有竞争力。代码和型号可在https://github.com/tonyhuang2022/upl上找到。
translated by 谷歌翻译
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
translated by 谷歌翻译
对比视觉语言预培训(剪辑)最近淹没了其可转让的视觉表现学习的关注。由大规模的图像文本对进行监督,剪辑能够对准配对的图像和文本,从而在开放词汇场景中进行零拍摄识别。然而,特定应用与通常预先训练的知识之间存在语义差距,这使得匹配子最优在下游任务上。在本文中,我们提出了VT-CLIP通过可视导向文本来增强视觉语言建模。具体而言,我们指导文本功能以自适应地探索图像上的信息区域,并通过跨关注的Machanism聚合视觉特征。以这种方式,视觉引导文本与图像变得更加语义相关,这极大地利益匹配过程。在几次拍摄的设置中,我们在11名知名分类数据集中评估我们的VT-CLIP,并进行实验广泛的消融研究,以证明VT-CLIP的有效性。代码将很快发布。
translated by 谷歌翻译
执行零摄像推理时(即,在特定数据集上不进行微调)时,大型预训练的模型(例如剪辑或ALIGN)在一系列数据分布中提供一致的精度。尽管现有的微调方法显着提高了给定目标分布的准确性,但它们通常会降低分配变化的稳健性。我们通过引入一种简单有效的方法来提高鲁棒性,同时进行微调:结合零拍和微调模型(Wise-ft)的重量。与标准的微调相比,Wise-FT在分配变化下提供了巨大的准确性提高,同时保留了目标分布的高精度。在Imagenet和五个派生的分布变化上,Wise-FT在先前的工作中提高了分布转移的准确性4至6个百分点(PP),同时将Imagenet精度提高1.6pp。Wise-ft的稳健性相似(2至23 pp),明智之前与七个常用的转移学习数据集的标准微调相比,在一组进一步的分配转移的各种集合中,准确性增长率为0.8至3.3 pp。这些改进在微调或推理期间没有任何额外的计算成本。
translated by 谷歌翻译
Prompt tuning has been employed as an efficient way to adapt large vision-language pre-trained models (e.g. CLIP) to various downstream tasks in data-limited or label-limited settings. Nonetheless, visual data (e.g., images) is by default prerequisite for learning prompts in existing methods. In this work, we advocate that the effectiveness of image-text contrastive learning in aligning the two modalities (for training CLIP) further makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting. In contrast to the visual data, text descriptions are easy to collect, and their class labels can be directly derived. Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning. Moreover, with TaI, double-grained prompt tuning (TaI-DPT) is further presented to extract both coarse-grained and fine-grained embeddings for enhancing the multi-label recognition performance. Experimental results show that our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks, e.g., MS-COCO, VOC2007, and NUS-WIDE, while it can be combined with existing methods of prompting from images to improve recognition performance further. Code is released at https://github.com/guozix/TaI-DPT.
translated by 谷歌翻译
我们引入了构图软提示(CSP),这是一种参数有效的学习技术,可改善大规模预处理视觉模型(VLMS)的零摄像组成性。 VLM可以在其灵活的文本编码器中代表任意类作为自然语言提示,但在组成零击基准任务上的表现不佳。为了改善VLM,我们提出了一种新颖的软提示形式。我们将构成的属性和对象视为将类定义为词汇的可学习令牌,并在多个及时的构图上调整它们。在推断期间,我们在新组合中重新组装了学习的属性对象词汇。我们表明,CSP在基准数据集上的原始VLM的表现平均为AUC上的10.9个百分点。 CSP还胜过Coop,这是一种调谐前缀上下文的软提示方法,在AUC上平均要点5.8个百分点。我们执行其他实验,以表明CSP对仅属性分类,高阶属性 - 属性对象组成以及预验证属性和微调对象的组合进行了改进。
translated by 谷歌翻译
Pretrained large-scale vision-language models like CLIP have exhibited strong generalization over unseen tasks. Yet imperceptible adversarial perturbations can significantly reduce CLIP's performance on new tasks. In this work, we identify and explore the problem of \emph{adapting large-scale models for zero-shot adversarial robustness}. We first identify two key factors during model adaption -- training losses and adaptation methods -- that affect the model's zero-shot adversarial robustness. We then propose a text-guided contrastive adversarial training loss, which aligns the text embeddings and the adversarial visual features with contrastive learning on a small set of training data. We apply this training loss to two adaption methods, model finetuning and visual prompt tuning. We find that visual prompt tuning is more effective in the absence of texts, while finetuning wins in the existence of text guidance. Overall, our approach significantly improves the zero-shot adversarial robustness over CLIP, seeing an average improvement of over 31 points over ImageNet and 15 zero-shot datasets. We hope this work can shed light on understanding the zero-shot adversarial robustness of large-scale models.
translated by 谷歌翻译
在低标签制度中,解决图像的多标签识别(MLR)是许多现实世界应用的一项艰巨任务。最近的工作学会了文本和视觉空间之间的一致性,以补偿图像标签不足,但由于可用的MLR注释量有限,因此失去了准确性。在这项工作中,我们利用数百万辅助图像文本对预测的文本和视觉特征的牢固对齐,并提出双背景优化(dualCoop)作为部分标签MLR和零发射MLR的统一框架。 DualCoop用类名来编码正面和负面的上下文,作为语言输入的一部分(即提示)。由于DualCoop仅在验证的视觉语言框架上引入了非常轻松的开销,因此它可以迅速适应具有有限的注释甚至看不见的类别的多标签识别任务。对两个挑战性低标签设置的标准多标签识别基准测试的实验证明了我们方法比最新方法的优势。
translated by 谷歌翻译
域泛化(DG)是一个难度的学习问题,旨在学习一个概念域的概念模型。最近的巨型预训练模型,如剪辑和GPT-3,即基础模型(FMS),已被证明对许多分布换档具有强大,因此应导致DG的大量改进。在这项工作中,我们研究了在图像分类中采用DG问题采用剪辑的通用方法,在那里我们评估了天真零射击学习和全DG学习设置。对于后者,我们提出了AP(摊销提示),作为迅速生成形式的域推断的新方法。在域泛化基准上使用多个标准数据集,即PACS,VLC,OfficeHome和Terraincognita,Clip提供了可比的性能而无需微调任何参数,这表明FM在DG中的适用性和重要性。此外,我们表明,组合域提示跟踪带剪辑使AP能够以大的余量越大,从71.3 \%升高到79.3 \%的精度。我们希望我们的方法的简单性和成功强调强调的重要性并导致更广泛采用和分析域泛化领域的基础模型。
translated by 谷歌翻译
Large-scale multi-modal training with image-text pairs imparts strong generalization to CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships which require meticulous design efforts. Furthermore, when the resulting models are learned on videos, they tend to overfit on the given task distribution and lack in generalization aspect. This begs the following question: How to effectively transfer image-level CLIP representations to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that the frame-level processing from CLIP image-encoder followed by feature pooling and similarity matching with corresponding text embeddings helps in implicitly modeling the temporal cues within ViFi-CLIP. Such fine-tuning helps the model to focus on scene dynamics, moving objects and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a `bridge and prompt' approach that first uses fine-tuning to bridge the domain gap and then learns prompts on language and vision side to adapt CLIP representations. We extensively evaluate this simple yet strong baseline on zero-shot, base-to-novel generalization, few-shot and fully supervised settings across five video benchmarks. Our code is available at https://github.com/muzairkhattak/ViFi-CLIP.
translated by 谷歌翻译