In egocentric videos, actions occur in quick succession. We capitalise on the temporal context of actions and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action-sequence context to enhance the predictions. We test our approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art performance. Our ablations showcase the advantage of utilising temporal context, as well as of incorporating the audio input modality and the language model to rescore predictions. Code and models at: https://github.com/ekazakos/mtcn.
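As a rough illustration of the rescoring idea above, the following minimal sketch combines per-segment audio-visual log-probabilities with a small action-sequence language model. It is not the authors' MTCN code; `ActionLM`, `rescore` and `lm_weight` are hypothetical names chosen for the example.

```python
# Minimal sketch (assumptions only): rescore candidate action sequences by
# combining audio-visual per-segment log-probs with an action-sequence LM.
import torch
import torch.nn as nn

class ActionLM(nn.Module):
    """Tiny autoregressive LM over discrete action IDs (illustrative)."""
    def __init__(self, num_actions, dim=128):
        super().__init__()
        self.embed = nn.Embedding(num_actions, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, num_actions)

    def sequence_logprob(self, actions):           # actions: (T,) int64
        x = self.embed(actions[:-1]).unsqueeze(0)  # predict a_t from a_<t
        h, _ = self.gru(x)
        logp = self.head(h).squeeze(0).log_softmax(-1)   # (T-1, num_actions)
        return logp.gather(1, actions[1:, None]).sum()

def rescore(av_logprobs, candidates, lm, lm_weight=0.5):
    """av_logprobs: (T, num_actions) per-segment log-probs from the AV model.
    candidates: list of (T,) candidate action sequences to rescore."""
    scores = []
    for seq in candidates:
        av_score = av_logprobs.gather(1, seq[:, None]).sum()
        scores.append(av_score + lm_weight * lm.sequence_logprob(seq))
    return candidates[int(torch.stack(scores).argmax())]

# toy usage
T, A = 5, 10
lm = ActionLM(A)
av_logprobs = torch.randn(T, A).log_softmax(-1)
cands = [torch.randint(0, A, (T,)) for _ in range(8)]
best = rescore(av_logprobs, cands, lm)
```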
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
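The sketch below illustrates the general 'fusion bottleneck' idea under simplifying assumptions (a single fusion layer, averaged bottleneck updates); it is not the released model, and `BottleneckFusionLayer` is an illustrative name.

```python
# Minimal sketch of bottleneck fusion: each modality self-attends over its own
# tokens plus a small set of shared bottleneck tokens, so cross-modal
# information must pass through the bottlenecks.
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=256, heads=4, num_bottlenecks=4):
        super().__init__()
        self.video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.bottleneck = nn.Parameter(torch.randn(1, num_bottlenecks, dim))

    def forward(self, video_tokens, audio_tokens):
        B = video_tokens.size(0)
        btl = self.bottleneck.expand(B, -1, -1)
        nb = btl.size(1)

        # Each modality only sees its own tokens and the shared bottlenecks.
        v = self.video_layer(torch.cat([video_tokens, btl], dim=1))
        a = self.audio_layer(torch.cat([audio_tokens, btl], dim=1))

        video_out, btl_v = v[:, :-nb], v[:, -nb:]
        audio_out, btl_a = a[:, :-nb], a[:, -nb:]
        fused_btl = 0.5 * (btl_v + btl_a)   # share only the condensed summary
        return video_out, audio_out, fused_btl

# toy usage
layer = BottleneckFusionLayer()
v = torch.randn(2, 196, 256)   # video patch tokens
a = torch.randn(2, 64, 256)    # audio spectrogram tokens
v2, a2, b = layer(v, a)
```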
Audio-visual generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information in order to be able to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual data in video can be exploited to learn powerful representations that generalise to unseen classes at test time. We propose a multi-modal and Temporal Cross-attention Framework (\modelname) for audio-visual generalised zero-shot learning. Its inputs are temporally aligned audio and visual features obtained from pre-trained networks. Encouraging the framework to focus on cross-modal correspondence across time, rather than on self-attention within the modalities, boosts performance significantly. We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the \ucf, \vgg and \activity benchmarks. Code to reproduce all results is available at \url{https://github.com/explainableml/tcaf-gzsl}.
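A minimal sketch of cross-modal attention across time, in the spirit of the framework described above (assumed module names; not the released TCAF code):

```python
# Audio queries attend to visual keys/values across time, and vice versa,
# instead of self-attention within each modality.
import torch
import torch.nn as nn

class TemporalCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, video):
        # audio: (B, Ta, D), video: (B, Tv, D) -- temporally ordered features
        # obtained from pre-trained unimodal networks.
        a_att, _ = self.a2v(query=audio, key=video, value=video)
        v_att, _ = self.v2a(query=video, key=audio, value=audio)
        return self.norm_a(audio + a_att), self.norm_v(video + v_att)

# toy usage
xattn = TemporalCrossAttention()
audio_ctx, video_ctx = xattn(torch.randn(4, 8, 512), torch.randn(4, 8, 512))
```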
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use VideoBERT in numerous tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, we outperform the state-of-the-art on video captioning, and quantitative results verify that the model learns high-level semantic features.
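As a hedged illustration of the input construction described above, the sketch below vector-quantises clip features against a codebook and concatenates the resulting visual tokens with text tokens; the helper names and token ids are assumptions, not the VideoBERT implementation.

```python
# Sketch: build a joint text + quantised-video token sequence for BERT-style
# masked modelling (illustrative only).
import torch

def quantise_video_features(features, centroids):
    """features: (T, D) clip features; centroids: (K, D) codebook (e.g. k-means).
    Returns (T,) visual token ids, i.e. the index of the nearest centroid."""
    return torch.cdist(features, centroids).argmin(dim=1)

def build_joint_sequence(text_ids, visual_ids, cls_id, sep_id, visual_offset):
    """[CLS] text tokens [SEP] visual tokens [SEP]; visual ids are shifted by
    `visual_offset` so that the two vocabularies do not collide."""
    return torch.cat([
        torch.tensor([cls_id]), text_ids, torch.tensor([sep_id]),
        visual_ids + visual_offset, torch.tensor([sep_id]),
    ])

# toy usage
feats = torch.randn(6, 1024)                 # pretend per-clip video features
codebook = torch.randn(2048, 1024)           # pretend learned codebook
visual_ids = quantise_video_features(feats, codebook)
text_ids = torch.tensor([7592, 2088, 999])   # pretend WordPiece token ids
seq = build_joint_sequence(text_ids, visual_ids,
                           cls_id=101, sep_id=102, visual_offset=30522)
```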
The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks that differ from the pre-training task of action classification. To this end, we introduce a new object-centric video recognition model based on a transformer architecture. The model learns object-centric summary vectors for the video, and uses these vectors to fuse the visual and spatio-temporal trajectory 'modalities' of the video clip. We also introduce a novel trajectory contrast loss to further enhance the objectness of these summary vectors. Through experiments on four datasets (SomethingSomething-V2, SomethingElse, Action Genome and EpicKitchens), we show that the object-centric model outperforms prior video representations (both object-agnostic and object-aware) when: (1) classifying actions with unseen objects and in unseen environments; (2) low-shot learning of novel classes; (3) linear probing to other downstream tasks; and (4) standard action classification.
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on the EGTEA classification and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmarks. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior on increasing pre-training data and model size.
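The contrastive video-text objective referred to above is, in spirit, a standard symmetric InfoNCE loss; the sketch below is a generic version of such a loss (assumed names, not the released LaViLa code), where `narration_emb` would come from an LLM-generated narration of the matching clip.

```python
# Generic symmetric InfoNCE loss between video and narration embeddings.
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, narration_emb, temperature=0.07):
    """video_emb, narration_emb: (B, D); row i of each side is a matched pair."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(narration_emb, dim=-1)
    logits = v @ t.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage
loss = clip_style_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```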
In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce a TaNgled Transformer block (TNT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clues extraction from contextual information. It enforces the joint video-text representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperforms the state-of-the-art, demonstrating its superiority in video-text representation learning.
We propose a self-supervised algorithm to learn representations from egocentric video data. Recently, significant efforts have been made to capture humans interacting with their own environments as they go about their daily activities. As a result, several large egocentric datasets of interaction-rich multi-modal data have emerged. However, learning representations from such videos can be challenging. First, given the uncurated nature of long-form continuous videos, learning effective representations requires focusing on the moments in time when interactions take place. Second, visual representations of daily activities should be sensitive to changes in the state of the environment, yet current successful multi-modal learning frameworks encourage representations that are invariant over time. To address these challenges, we leverage audio signals to identify moments of likely interaction that are conducive to better learning. We also propose a novel self-supervised objective that learns from audible state changes caused by interactions. We validate these contributions extensively on two large-scale egocentric datasets, EPIC-Kitchens-100 and the recently released Ego4D, and show improvements on several downstream tasks, including action recognition, long-term action anticipation and object state change classification.
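A rough sketch of using audio to pick likely interaction moments, under the simple assumption that sharp changes in short-time audio energy indicate interactions; this is an illustration, not the authors' pipeline.

```python
# Pick candidate interaction moments as the windows with the largest change
# in short-time audio energy (assumed heuristic).
import torch

def candidate_interaction_times(waveform, sample_rate, win=0.5, top_k=5):
    """waveform: (num_samples,) mono audio. Returns timestamps (seconds) of the
    top-k window boundaries with the largest change in short-time energy."""
    hop = int(win * sample_rate)
    n = waveform.numel() // hop
    frames = waveform[: n * hop].reshape(n, hop)
    energy = frames.pow(2).mean(dim=1)
    change = (energy[1:] - energy[:-1]).abs()          # energy deltas
    k = min(top_k, change.numel())
    idx = torch.topk(change, k).indices + 1            # window index of change
    return (idx.float() * win).sort().values           # seconds into the clip

# toy usage
wav = torch.randn(16000 * 10)                          # 10 s of fake audio @ 16 kHz
times = candidate_interaction_times(wav, sample_rate=16000)
```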
Recent action recognition models have achieved impressive results by integrating objects, their locations and their interactions. However, obtaining dense structured annotations for every frame is tedious and time-consuming, making these methods expensive to train and less scalable. At the same time, if a small set of annotated images is available, either within or outside the domain of interest, how can we leverage these images for a downstream video task? We propose a learning framework, StructureViT (SViT for short), which demonstrates how the structure of a small number of images, available only during training, can improve a video model. SViT relies on two key insights. First, since both images and videos contain structured information, we enrich the model with a set of \emph{object tokens} that can be used across images and videos. Second, the scene representations of individual frames in a video should "align" with those of still images. This is achieved via a \emph{frame-clip consistency} loss, which ensures the flow of structured information between images and videos. We explore a particular instantiation of scene structure, namely a \emph{hand-object graph}, consisting of hands and objects with their locations as nodes, and the physical relations of contact/no-contact as edges. SViT shows strong performance improvements on multiple video understanding tasks and datasets, and it won first place in the Ego4D CVPR'22 Object State Localization challenge. For code and pretrained models, visit the project page at \url{https://eladb3.github.io/svit/}
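A minimal sketch of a frame-clip consistency term of the kind described above, assuming per-frame and clip-level object tokens of matching shape; the exact formulation in the paper may differ.

```python
# Pull per-frame object-token representations towards the corresponding
# clip-level ones via a cosine-similarity penalty.
import torch
import torch.nn.functional as F

def frame_clip_consistency(frame_tokens, clip_tokens):
    """frame_tokens: (B, T, K, D) object tokens produced frame-by-frame (image
    pathway); clip_tokens: (B, K, D) object tokens from the video pathway."""
    frame_tokens = F.normalize(frame_tokens, dim=-1)
    clip_tokens = F.normalize(clip_tokens, dim=-1).unsqueeze(1)   # (B, 1, K, D)
    cos = (frame_tokens * clip_tokens).sum(-1)                    # (B, T, K)
    return (1.0 - cos).mean()

# toy usage
loss = frame_clip_consistency(torch.randn(2, 8, 6, 256), torch.randn(2, 6, 256))
```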
In this paper, we consider the problem of audio-visual synchronisation applied to videos 'in the wild' (i.e., of general classes beyond speech). As a new task, we identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length, while significantly reducing memory requirements during training. We further conduct an in-depth analysis on the curated dataset and define an evaluation metric for open-domain audio-visual synchronisation. We apply our method on standard lip-reading speech benchmarks, LRS2 and LRS3, with ablations over various aspects. Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes on the new VGG-Sound Sync video dataset. In all cases, our proposed model outperforms the previous state-of-the-art by a significant margin.
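For illustration, a brute-force way to score candidate audio-visual offsets is to slide one embedding sequence against the other and pick the shift with the highest similarity; the sketch below (assumed function name, not the paper's model) does exactly that.

```python
# Score candidate offsets between audio and visual embedding sequences by
# average cosine similarity over the overlapping region.
import torch
import torch.nn.functional as F

def best_av_offset(video_emb, audio_emb, max_shift=15):
    """video_emb, audio_emb: (T, D) frame-rate embedding sequences.
    Returns the integer frame offset (audio relative to video) with the
    highest average cosine similarity."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    T = v.size(0)
    best, best_shift = -float("inf"), 0
    for s in range(-max_shift, max_shift + 1):
        lo, hi = max(0, s), min(T, T + s)          # overlapping region
        if hi - lo < 1:
            continue
        sim = (v[lo:hi] * a[lo - s:hi - s]).sum(-1).mean()
        best, best_shift = (sim, s) if sim > best else (best, best_shift)
    return best_shift

# toy usage
shift = best_av_offset(torch.randn(100, 512), torch.randn(100, 512))
```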
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
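The sketch below is a compact, assumption-laden version of masked autoencoding over concatenated audio and video tokens (positional embeddings and other details omitted); it illustrates the pretraining objective rather than reproducing the paper's architecture.

```python
# Mask a random subset of joint audio-video tokens, encode the visible ones,
# and reconstruct the masked tokens with an MSE loss.
import torch
import torch.nn as nn

class TinyAVMae(nn.Module):
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), 1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, dim)

    def forward(self, video_tokens, audio_tokens, mask_ratio=0.75):
        x = torch.cat([video_tokens, audio_tokens], dim=1)      # (B, N, D)
        B, N, D = x.shape
        num_keep = int(N * (1 - mask_ratio))
        perm = torch.rand(B, N, device=x.device).argsort(dim=1) # random order
        keep, masked = perm[:, :num_keep], perm[:, num_keep:]

        visible = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)

        # Decode visible latents plus learned mask tokens at the masked slots
        # (positional embeddings omitted for brevity).
        mask_tok = self.mask_token.expand(B, masked.size(1), D)
        dec = self.decoder(torch.cat([latent, mask_tok], dim=1))
        pred = self.head(dec[:, num_keep:])                     # masked slots only

        target = torch.gather(x, 1, masked.unsqueeze(-1).expand(-1, -1, D))
        return ((pred - target) ** 2).mean()

# toy usage
loss = TinyAVMae()(torch.randn(2, 196, 256), torch.randn(2, 64, 256))
```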
Pre-training on large-scale unlabelled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the advent of large-scale instructional video datasets, a common strategy for pre-training video encoders is to use the accompanying speech as weak supervision. However, because the speech is used to supervise the pre-training, it is never seen by the video encoder, which therefore does not learn to process that modality. We address this shortcoming of current pre-training methods, which fail to exploit the rich cues present in spoken language. Our proposal is to pre-train a video encoder using all of the available video modalities as supervision, namely appearance, sound and transcribed speech. We mask an entire modality in the input and predict it using the other two modalities. This encourages each modality to collaborate with the others, and our video encoder learns to process appearance and audio as well as speech. We show the superior performance of our 'modality masking' pre-training approach for video retrieval on the How2R, YouCook2 and Condensed Movies datasets.
Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an \emph{object-centric} approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric representations starting from early layers and to propagate them into the transformer layers, thus affecting the spatio-temporal representations throughout the network. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an 'Object-Region Attention' module applies self-attention over the patch tokens and \emph{object regions}. In this way, visual object regions interact with the uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate 'Object-Dynamics Module', which captures trajectory interactions, and we show how to integrate the two streams. We evaluate our model on four tasks and five datasets: compositional and few-shot action recognition on SomethingElse, spatio-temporal action detection on AVA, and standard action recognition on Something-Something-V2, Diving48 and Epic-Kitchens100. We show strong performance improvements across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. For code and pretrained models, visit the project page at \url{https://roeiherz.github.io/orvit/}
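A simplified sketch of turning per-frame object boxes into object tokens with RoIAlign and letting them interact with patch tokens through self-attention; module and variable names are assumptions and this is not the released ORViT code.

```python
# Extract object tokens from a feature map with RoIAlign and run joint
# self-attention over patch tokens + object tokens.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ObjectRegionAttention(nn.Module):
    def __init__(self, channels=256, heads=4):
        super().__init__()
        self.proj = nn.Linear(channels, channels)
        self.attn = nn.TransformerEncoderLayer(channels, heads, batch_first=True)

    def forward(self, feat_map, boxes):
        """feat_map: (B, C, H, W) frame features; boxes: list of (num_obj, 4)
        boxes per image in (x1, y1, x2, y2) feature-map coordinates."""
        B, C, H, W = feat_map.shape
        obj = roi_align(feat_map, boxes, output_size=1)      # (sum_obj, C, 1, 1)
        obj = self.proj(obj.flatten(1)).view(B, -1, C)       # assumes equal num_obj
        patches = feat_map.flatten(2).transpose(1, 2)        # (B, H*W, C)
        return self.attn(torch.cat([patches, obj], dim=1))   # enriched tokens

# toy usage
m = ObjectRegionAttention()
feats = torch.randn(2, 256, 14, 14)
boxes = [torch.tensor([[0., 0., 6., 6.], [4., 4., 13., 13.]]) for _ in range(2)]
tokens = m(feats, boxes)                                     # (2, 196 + 2, 256)
```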
Anticipating future events is an essential capability of intelligent systems and embodied AI. However, compared with traditional recognition tasks, the uncertainty of the future and the reasoning ability required make anticipation tasks very challenging and far from solved. In this field, previous methods usually care more about model architecture design, and pay little attention to how to train an anticipation model with a proper learning policy. To this end, in this work we propose a novel training scheme called Dynamic Context Removal (DCR), which dynamically schedules the visibility of the observed context during learning. It follows a human-like curriculum learning process, i.e., it gradually removes the event context to increase the anticipation difficulty until the final anticipation target is met. Our learning scheme is plug-and-play and easy to integrate into any reasoning model, including transformers and LSTMs, with advantages in both effectiveness and efficiency. In extensive experiments, the proposed method achieves state-of-the-art results on four widely used benchmarks. Our code and models will be publicly released at https://github.com/allenxuuu/dcr.
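A tiny sketch of a dynamic context-removal style schedule, assuming a linear annealing of the number of visible context frames; the paper's actual scheduling policy may differ.

```python
# Linearly shrink the visible context as training progresses, masking out the
# remaining frames of the input clip.
import torch

def visible_context_length(step, total_steps, max_context, min_context=0):
    """Linearly anneal the number of visible context frames from max to min."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return int(round(max_context - frac * (max_context - min_context)))

def mask_hidden_context(frames, num_visible):
    """frames: (B, T, ...); zero out all frames beyond the first `num_visible`."""
    masked = frames.clone()
    masked[:, num_visible:] = 0
    return masked

# toy usage: at step 0 the model sees 8 context frames, at the end none
frames = torch.randn(2, 8, 3, 224, 224)
k = visible_context_length(step=2500, total_steps=10000, max_context=8)
inputs = mask_hidden_context(frames, k)
```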
The goal of this paper is to learn strong lip-reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features. Instead, in this paper we focus on the unique challenges encountered in lip reading and propose tailored solutions. To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time, and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip-reading network. Following the above, we obtain state-of-the-art results on the LRS2 and LRS3 benchmarks when training on public datasets, and even surpass models trained on large-scale industrial datasets using an order of magnitude less data. Our best model achieves a 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip-reading models, significantly reducing the performance gap between lip reading and automatic speech recognition. Moreover, on the AVA-ActiveSpeaker benchmark, our VSD model surpasses all visual-only baselines and even outperforms several recent audio-visual methods.
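A small sketch of attention-based pooling over per-frame visual speech features, written as a generic learned-query attention module (assumed names, not the paper's exact design):

```python
# A learned query attends over frame features to produce a single pooled vector.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_features):                 # (B, T, D) lip features
        q = self.query.expand(frame_features.size(0), -1, -1)
        pooled, _ = self.attn(q, frame_features, frame_features)
        return pooled.squeeze(1)                        # (B, D)

# toy usage
pool = AttentionPool()
clip_repr = pool(torch.randn(4, 29, 512))               # 29 video frames
```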
This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformers for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M&M) models that vary in backbone size and input modalities. Our approach achieved 52.8% Top-1 accuracy on the test set for action classes, which is 4.1% higher than last year's winning entry.
The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires great effort, which is especially challenging for videos. Moreover, the use of human-generated annotations leads to models with biased learning and poor domain generalization and robustness. As an alternative, self-supervised learning provides a way for representation learning which does not require annotations and has shown promise in both image and video domains. Different from the image domain, learning video representations is more challenging due to the temporal dimension, bringing in motion and other environmental dynamics. This also provides opportunities for video-exclusive ideas that advance self-supervised learning in the video and multimodal domain. In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain. We summarize these methods into four different categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets, downstream evaluation tasks, insights into the limitations of existing works, and the potential future directions in this area.
Multimodal classification is a core task in human-centric machine learning. We observe that information is highly complementary across modalities, and thus unimodal representations can be drastically sparsified prior to multimodal fusion. To this end, we present Sparse Fusion Transformers (SFT), a novel multimodal fusion method for transformers that performs comparably to existing state-of-the-art methods while having a greatly reduced memory footprint and computation cost. Key to our idea is a sparse-pooling block that reduces the unimodal token sets prior to cross-modality modeling. Evaluations are conducted on multiple multimodal benchmark datasets for a wide range of classification tasks. State-of-the-art performance is obtained on multiple benchmarks under similar experimental conditions, while reporting up to a six-fold reduction in computational cost and memory requirements. Extensive ablation studies showcase the benefits of combining sparsification and multimodal learning over naive approaches. This paves the way for enabling multimodal learning on low-resource devices.
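An illustrative sketch of sparsifying each modality's token set before fusion, here with a learned top-k token selector; the scoring rule and names are assumptions rather than the released SFT code.

```python
# Keep only the top-k highest-scoring tokens per modality, then fuse the
# reduced token sets with a joint transformer layer.
import torch
import torch.nn as nn

class SparsePool(nn.Module):
    def __init__(self, dim=256, keep=8):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # learned per-token importance score
        self.keep = keep

    def forward(self, tokens):                          # (B, N, D)
        s = self.score(tokens).squeeze(-1)               # (B, N)
        idx = s.topk(self.keep, dim=1).indices           # (B, keep)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return torch.gather(tokens, 1, idx)              # (B, keep, D)

# toy usage: sparsify each modality, then run a joint transformer layer
pool_v, pool_a = SparsePool(), SparsePool()
fusion = nn.TransformerEncoderLayer(256, 4, batch_first=True)
video, audio = torch.randn(2, 196, 256), torch.randn(2, 128, 256)
fused = fusion(torch.cat([pool_v(video), pool_a(audio)], dim=1))   # (2, 16, 256)
```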
We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach learns with three complementary forms of self-supervision: (1) reconstruction of masked audio and video input data, (2) intra- and inter-modal contrastive learning with masking, and (3) self-training by reconstructing joint audio-video contextualized features learned from the first two objectives. Pre-training with MAViL not only enables the model to perform well in audio-visual classification and retrieval tasks but also improves representations of each modality in isolation, without using information from the other modality for fine-tuning or inference. Empirically, MAViL sets a new state-of-the-art on AudioSet (53.1 mAP) and VGGSound (67.1% accuracy). For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on these benchmarks. Code will be available soon.
Fine-grained action recognition is a challenging task in computer vision. As fine-grained datasets have small inter-class variations in both space and time, fine-grained action recognition models require good temporal reasoning and discrimination of attribute action semantics. Leveraging the ability of CNNs to capture high-level spatio-temporal feature representations, and the modeling efficiency of transformers in capturing latent semantics and global dependencies, we investigate two frameworks that combine a CNN vision backbone with a transformer encoder to enhance fine-grained action recognition: 1) a vision-based encoder to learn latent temporal semantics, and 2) a multimodal video-text cross-encoder to exploit additional text input and learn cross-associations between visual and text semantics. Our experimental results show that both of our transformer encoder frameworks effectively learn latent temporal semantics and cross-modality associations, and improve recognition performance over the CNN vision model. We achieve new state-of-the-art performance on the FineGym benchmark dataset with both proposed architectures.
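A generic sketch of the CNN-backbone-plus-transformer-encoder pattern discussed above, using a ResNet-18 frame encoder and a small transformer; the components are assumptions for illustration, not the paper's models.

```python
# Per-frame CNN features -> transformer encoder for temporal reasoning ->
# linear classifier over the averaged clip representation.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CnnTransformerClassifier(nn.Module):
    def __init__(self, num_classes, dim=512, heads=8, depth=2):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # pooled features
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                     # (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).flatten(1)    # (B*T, 512)
        encoded = self.encoder(feats.view(B, T, -1))        # temporal reasoning
        return self.head(encoded.mean(dim=1))               # clip-level logits

# toy usage
model = CnnTransformerClassifier(num_classes=99)   # e.g. a fine-grained label set
logits = model(torch.randn(2, 8, 3, 224, 224))
```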