近年来,视觉模型(VLMS)在视觉推理任务(例如属性,位置)上表现出色。尽管这些任务衡量了必要的知识以在给定的视觉实例上进行理性和理性,但是它们却不能衡量VLMS保留和概括此类知识的能力。在这项工作中,我们评估了他们获得“可见”物理知识的能力 - 从静态场景的图像,尤其是在对象颜色,大小和空间的尺寸的范围内,可以轻松访问的信息。我们建立了自动管道,以获取综合知识资源,用于校准和探索这些模型。我们的结果表明,在所有三个任务中,模型和人类绩效之间存在严重的差距。此外,我们的字幕预测的基线(Capbert)在大小和空间任务上都大大优于VLM,这强调了,尽管有足够的视觉方式访问了地面语言,但它们仍在努力保留此类知识。数据集和代码可在https://github.com/axe--/viphy上找到。
translated by 谷歌翻译
Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts to create a general vision model are limited in the scope of assessed tasks and offer no overarching framework to perform them holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four functional domains $\unicode{x2014}$ Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation. Along with the benchmark, we provide a general encoder-decoder framework to allow for the evaluation of arbitrary visual representation on all 11 tasks. We evaluate various pre-trained visual representations with our framework and observe that (1) Transformer-based visual backbone generally outperforms CNN-based backbone on G-VUE, (2) visual representations from vision-language pre-training are superior to those with vision-only pre-training across visual tasks. With G-VUE, we provide a holistic evaluation standard to motivate research toward building general-purpose visual systems via obtaining more general-purpose visual representations.
translated by 谷歌翻译
Large-scale vision-language models such as CLIP have shown impressive performance on zero-shot image classification and image-to-text retrieval. However, such zero-shot performance of CLIP-based models does not realize in tasks that require a finer-grained correspondence between vision and language, such as Visual Question Answering (VQA). We investigate why this is the case, and report an interesting phenomenon of CLIP, which we call the Concept Association Bias (CAB), as a potential cause of the difficulty of applying CLIP to VQA and similar tasks. CAB is especially apparent when two concepts are present in the given image while a text prompt only contains a single concept. In such a case, we find that CLIP tends to treat input as a bag of concepts and attempts to fill in the other missing concept crossmodally, leading to an unexpected zero-shot prediction. For example, when asked for the color of a lemon in an image, CLIP predicts ``purple'' if the image contains a lemon and an eggplant. We demonstrate the Concept Association Bias of CLIP by showing that CLIP's zero-shot classification performance greatly suffers when there is a strong concept association between an object (e.g. lemon) and an attribute (e.g. its color). On the other hand, when the association between object and attribute is weak, we do not see this phenomenon. Furthermore, we show that CAB is significantly mitigated when we enable CLIP to learn deeper structure across image and text embeddings by adding an additional Transformer on top of CLIP and fine-tuning it on VQA. We find that across such fine-tuned variants of CLIP, the strength of CAB in a model predicts how well it performs on VQA.
translated by 谷歌翻译
视觉问题应答(VQA)任务利用视觉图像和语言分析来回回答图像的文本问题。它是一个流行的研究课题,在过去十年中越来越多的现实应用。本文介绍了我们最近对AliceMind-MMU的研究(阿里巴巴的编码器 - 解码器来自Damo Academy - 多媒体理解的机器智能实验室),其比人类在VQA上获得相似甚至略微更好的结果。这是通过系统地改善VQA流水线来实现的,包括:(1)具有全面的视觉和文本特征表示的预培训; (2)与学习参加的有效跨模型互动; (3)一个新颖的知识挖掘框架,具有专门的专业专家模块,适用于复杂的VQA任务。处理不同类型的视觉问题,需要具有相应的专业知识在提高我们的VQA架构的表现方面发挥着重要作用,这取决于人力水平。进行了广泛的实验和分析,以证明新的研究工作的有效性。
translated by 谷歌翻译
Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into $N\times N$ blocks, and identifies the objects in each block through the widely used object detector in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object, e.g. filling `P" or ``O" in aPTP ``The block P has a O". This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for ViLT \cite{vilt} baseline, and COCO Captioning (+5.3 in CIDEr) for SOTA BLIP \cite{blip} baseline. Moreover, PTP achieves comparable results with object-detector based methods, and much faster inference speed since PTP discards its object detector for inference while the later cannot. Our code and pre-trained weight will be released at \url{https://github.com/sail-sg/ptp}.
translated by 谷歌翻译
人们说:“一张照片值一千字”。那么,我们如何从图像中获取丰富的信息?我们认为,通过使用视觉线索来桥接大型的识别视觉基础模型和语言模型,我们可以无需任何额外的跨模式训练。得益于基础模型的强大零拍功能,我们首先构建图像的丰富语义表示(例如,图像标签,对象属性 /位置,字幕)作为结构化的文本提示,称为视觉线索,使用视觉基础模型。基于视觉线索,我们使用大型语言模型为视觉内容生成一系列综合描述,然后再次通过视觉模型验证,以选择与图像最合适的候选人。我们通过定量和定性测量评估生成的描述的质量。结果证明了这种结构化语义表示的有效性。
translated by 谷歌翻译
Existing object detection methods are bounded in a fixed-set vocabulary by costly labeled data. When dealing with novel categories, the model has to be retrained with more bounding box annotations. Natural language supervision is an attractive alternative for its annotation-free attributes and broader object concepts. However, learning open-vocabulary object detection from language is challenging since image-text pairs do not contain fine-grained object-language alignments. Previous solutions rely on either expensive grounding annotations or distilling classification-oriented vision models. In this paper, we propose a novel open-vocabulary object detection framework directly learning from image-text pair data. We formulate object-language alignment as a set matching problem between a set of image region features and a set of word embeddings. It enables us to train an open-vocabulary object detector on image-text pairs in a much simple and effective way. Extensive experiments on two benchmark datasets, COCO and LVIS, demonstrate our superior performance over the competing approaches on novel categories, e.g. achieving 32.0% mAP on COCO and 21.7% mask mAP on LVIS. Code is available at: https://github.com/clin1223/VLDet.
translated by 谷歌翻译
我们提出Valse(视觉和语言结构化评估),这是一种新的基准,专为测试通用净化的视觉和语言(V&L)模型而设计,用于对特定语言现象的视野 - 语言接地能力。Valse提供涵盖各种语言构建体的六种测试套件。解决这些需要模型在视觉模型中地对语言现象,允许比迄今为止更细粒度的评估。我们使用支持有效箔的构造的方法构建Valse,并通过评估五种广泛使用的V&L模型的报告结果。我们的实验表明,目前的模型有很大的困难解决了大多数现象。因此,我们预计Valse就可以作为一种重要的基准,从语言角度来衡量预训过的V&L模型的未来进展,补充规范任务为中心的V&L评价。
translated by 谷歌翻译
随着图像文本对的大量数据以及视觉和语言(V&L)任务的多样性,学者在该研究领域引入了大量的深度学习模型。此外,近年来,转移学习还显示出在计算机愿景中的巨大成功,例如图像分类,对象检测等以及在自然语言处理中以进行问答,机器翻译等的自然语言处理。继承转移学习的精神, V&L的研究工作已经在大规模数据集上设计了多种预训练技术,以增强下游任务的性能。本文的目的是提供当代V&L预审前模型的全面修订。特别是,我们对预处理的方法进行了分类和描述,以及最先进的视觉和语言预训练模型的摘要。此外,还提供了培训数据集和下游任务的列表,以进一步提高V&L预处理的观点。最后,我们决定采取进一步的一步,讨论众多未来研究的方向。
translated by 谷歌翻译
本文介绍了用于学习对象级别,语言感知和富含语义的视觉表示的接地语言图像预培训(GLIP)模型。 Glip统一对象检测和短语进行预培训。统一带来了两个好处:1)它允许GLIP从检测和接地数据中学习,以改善两个任务和引导良好的接地模型; 2)GLIP可以通过以自培训方式产生接地盒来利用大规模的图像文本对,使学习的表示是语义丰富的。在我们的实验中,我们在27M的接地数据上预先列车触胶,包括3M人的注释和24M Web爬网的图像文本对。学习的表示表明了强烈的零射击和对各种对象识别任务的可转换性。 1)直接在Coco和LVIS上评估(在训练期间没有在Coco中看到任何图像)时,Plip分别达到49.8 AP和26.9 AP,超过许多监督基线。 2)在COCO上微调后,GLIP在Val和61.5 AP上实现60.8 AP在测试开发上,超过先前的SOTA。 3)当转移到下游对象检测任务时,具有完全监控动态头的1次触发器竞争对手。代码将在https://github.com/microsoft/glip发布。
translated by 谷歌翻译
在原始文本中训练的语言模型(LMS)无法直接访问物理世界。 Gordon和Van Durme(2013)指出,LMS因此可能会遭受报告偏见的困扰:文本很少报告常见事实,而是关注情况的异常方面。如果LMS仅接受文本语料库的培训,并天真地记住当地的同时出现统计数据,那么他们自然会学会对物理世界的偏见。虽然先前的研究反复验证了较小尺度的LM(例如Roberta,GPT-2)放大了报告偏差,但在模型扩展时,这种趋势是否继续。我们从较大语言模型(LLM)(例如Palm和GPT-3)中从颜色的角度研究报告偏见。具体而言,我们查询llms对物体的典型颜色,这是一种简单的感知扎根的物理常识。令人惊讶的是,我们发现LLM在确定对象的典型颜色和更紧密地跟踪人类判断方面的表现明显优于较小的LMS,而不是过于适应文本中存储的表面图案。这表明,仅凭语言的大型语言模型就能克服以局部共发生为特征的某些类型的报告偏差。
translated by 谷歌翻译
连接视觉和语言在生成智能中起着重要作用。因此,已经致力于图像标题的大型研究工作,即用句法和语义有意义的句子描述图像。从2015年开始,该任务通常通过由Visual Encoder组成的管道和文本生成的语言模型来解决任务。在这些年来,两种组件通过对象区域,属性,介绍多模态连接,完全关注方法和伯特早期融合策略的利用而显着发展。但是,无论令人印象深刻的结果,图像标题的研究还没有达到结论性答案。这项工作旨在提供图像标题方法的全面概述,从视觉编码和文本生成到培训策略,数据集和评估度量。在这方面,我们量化地比较了许多相关的最先进的方法来确定架构和培训策略中最有影响力的技术创新。此外,讨论了问题的许多变体及其开放挑战。这项工作的最终目标是作为理解现有文献的工具,并突出显示计算机视觉和自然语言处理的研究领域的未来方向可以找到最佳的协同作用。
translated by 谷歌翻译
深度学习技术导致了通用对象检测领域的显着突破,近年来产生了很多场景理解的任务。由于其强大的语义表示和应用于场景理解,场景图一直是研究的焦点。场景图生成(SGG)是指自动将图像映射到语义结构场景图中的任务,这需要正确标记检测到的对象及其关系。虽然这是一项具有挑战性的任务,但社区已经提出了许多SGG方法并取得了良好的效果。在本文中,我们对深度学习技术带来了近期成就的全面调查。我们审查了138个代表作品,涵盖了不同的输入方式,并系统地将现有的基于图像的SGG方法从特征提取和融合的角度进行了综述。我们试图通过全面的方式对现有的视觉关系检测方法进行连接和系统化现有的视觉关系检测方法,概述和解释SGG的机制和策略。最后,我们通过深入讨论当前存在的问题和未来的研究方向来完成这项调查。本调查将帮助读者更好地了解当前的研究状况和想法。
translated by 谷歌翻译
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pretraining. However, these datasets are often collected with overrestrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pretraining data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [70] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for visionand-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks. 1
translated by 谷歌翻译
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in
translated by 谷歌翻译
通用视觉(GPV)系统是旨在解决各种视觉任务的模型,而无需进行架构更改。如今,GPV主要从大型完全监督的数据集中学习技能和概念。通过获取数据以迅速学习每个技能的每个概念,将GPV扩展到数万个概念都变得令人望而却步。这项工作提出了一种有效且廉价的替代方法:从监督数据集中学习技能,从Web图像搜索中学习概念,并利用GPV的关键特征:跨技能传递视觉知识的能力。我们使用跨越10K+视觉概念的1M+图像的数据集来演示3个基准上的两个现有GPV(GPV-1和VL-T5)的Webly Supumented概念扩展:5个基于可可的数据集(80个主要概念),这是一个新的策划系列,这是一个新的策划系列。基于OpenImages和VisualGenome存储库(〜500个概念)以及Web衍生的数据集(10K+概念)的5个数据集。我们还提出了一种新的体系结构GPV-2,该架构支持各种任务 - 从分类和本地化等视觉任务到Qu Viewer+语言任务,例如QA和字幕,再到更多的利基市场,例如人类对象互动检测。 GPV-2从Web数据中受益匪浅,并且在这些基准测试中胜过GPV-1和VL-T5。我们的数据,代码和Web演示可在https://prior.allenai.org/projects/gpv2上获得。
translated by 谷歌翻译
场景图生成(SGG)是一项基本任务,旨在检测图像中对象之间的视觉关系。流行的SGG方法要求在培训集中给出所有对象类。这样的封闭设置限制了SGG的实际应用。在本文中,我们介绍了开放式视频范围场景图生成,这是一种新颖,现实且具有挑战性的环境,其中模型在一组基本对象类上进行了训练,但需要推断出看不见的目标对象类的关系。为此,我们提出了一种两步方法,该方法首先对大量的粗粒区域捕获数据进行预先培训,然后利用两种基于及时的技术来验证预先训练的模型而无需更新其参数。此外,我们的方法可以支持对完全看不见的对象类的推论,而现有方法无法处理。在三个基准数据集(视觉基因组,GQA和开放图像)上进行的广泛实验,我们的方法在OV-SGG的设置以及常规的封闭SGG上明显优于最近的强大SGG方法。
translated by 谷歌翻译
基础模型由于在广泛的下游应用中的有效性而受到了很多关注。尽管在体系结构方面存在很大的融合,但大多数审慎的模型通常仍用于特定任务或模式。在这项工作中,我们建议将语言模型用作各种基础模型的通用接口。一系列预处理的编码者感知到了多种方式(例如视觉和语言),并与扮演通用任务层角色的语言模型对接。我们提出了一个半伴侣的语言建模目标,以共同确定界面和模块化编码器。我们从因果关系和非因果建模中涵盖了优势和能力,从而结合了两个世界的最佳状态。具体而言,所提出的方法不仅从因果语言建模中继承了内在学习和开放式生成的能力,而且由于双向编码器而有利于填补。更重要的是,我们的方法无缝地解锁了上述功能的组合,例如,通过填充编码器启用了文本学习或指导。各种仅语言和视觉语言基准的实验结果表明,我们的模型表现优于或与鉴定,零弹性概括和几乎没有的学习的专业模型竞争。
translated by 谷歌翻译
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific modelsachieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.Preprint. Under review.
translated by 谷歌翻译
Learning descriptive 3D features is crucial for understanding 3D scenes with diverse objects and complex structures. However, it is usually unknown whether important geometric attributes and scene context obtain enough emphasis in an end-to-end trained 3D scene understanding network. To guide 3D feature learning toward important geometric attributes and scene context, we explore the help of textual scene descriptions. Given some free-form descriptions paired with 3D scenes, we extract the knowledge regarding the object relationships and object attributes. We then inject the knowledge to 3D feature learning through three classification-based auxiliary tasks. This language-assisted training can be combined with modern object detection and instance segmentation methods to promote 3D semantic scene understanding, especially in a label-deficient regime. Moreover, the 3D feature learned with language assistance is better aligned with the language features, which can benefit various 3D-language multimodal tasks. Experiments on several benchmarks of 3D-only and 3D-language tasks demonstrate the effectiveness of our language-assisted 3D feature learning. Code is available at https://github.com/Asterisci/Language-Assisted-3D.
translated by 谷歌翻译