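Image-text retrieval (ITR) is challenging because it must bridge the visual and linguistic modalities. Most prior work adopts contrastive learning. Beyond the limited number of negative image-text pairs, the capability of contrastive learning is restricted by manually weighted negative pairs and by unawareness of external knowledge. In this paper, we propose a novel Coupled Diversity-Sensitive Momentum Contrastive Learning (CODER) approach for improving cross-modal representation. First, we devise a novel diversity-sensitive contrastive learning (DCL) architecture. We introduce dynamic dictionaries for both modalities to enlarge the scale of image-text pairs, and diversity sensitivity is achieved through adaptive negative pair weighting. Furthermore, CODER is designed with two branches. One branch learns instance-level embeddings from the image/text and also generates pseudo online clustering labels for its input image/text based on those embeddings. Meanwhile, the other branch learns to query a commonsense knowledge graph to form concept-level descriptors for both modalities. Afterwards, both branches leverage DCL to align the cross-modal embedding spaces, while an extra pseudo clustering label prediction loss is used to promote concept-level representation learning in the second branch. Extensive experiments on two popular benchmarks, MSCOCO and Flickr30K, show that CODER remarkably outperforms state-of-the-art approaches.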
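The dictionary-plus-weighting idea in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the CODER implementation: the function names, the softmax-based weighting rule with its `beta` parameter, the momentum value, and the temperature are all assumptions, since the abstract does not specify them.

```python
import math
from collections import deque


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def momentum_update(key_params, query_params, m=0.999):
    """MoCo-style update: the key encoder slowly tracks the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def weighted_info_nce(query, positive, negatives, temperature=0.07, beta=0.0):
    """Contrastive loss with per-negative weights.

    Assumed weighting rule (hypothetical): weights are a softmax over
    beta * similarity, rescaled so that they average to 1. With beta = 0
    every weight is 1 and the loss is the standard InfoNCE loss.
    """
    pos_logit = dot(query, positive) / temperature
    sims = [dot(query, n) for n in negatives]
    if not sims:
        return 0.0
    shift = max(beta * s for s in sims)
    exps = [math.exp(beta * s - shift) for s in sims]
    total = sum(exps)
    weights = [len(sims) * e / total for e in exps]
    # Fold each weight into its logit, then apply a stable log-sum-exp.
    terms = [pos_logit] + [
        math.log(w) + s / temperature for w, s in zip(weights, sims)
    ]
    top = max(terms)
    log_sum = top + math.log(sum(math.exp(t - top) for t in terms))
    return log_sum - pos_logit


# A fixed-size FIFO queue stands in for the dynamic dictionary of one modality:
# appending beyond maxlen silently drops the oldest embedding.
text_dictionary = deque(maxlen=2)
for embedding in ([0.0, 1.0], [0.6, 0.8], [0.8, 0.6]):
    text_dictionary.append(embedding)

loss = weighted_info_nce([1.0, 0.0], [1.0, 0.0], list(text_dictionary), beta=2.0)
```

Under this assumed rule, a larger `beta` puts more weight on negatives that are close to the query, which is one plausible reading of "adaptive negative pair weighting".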
translated by Google Translate
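It is challenging for artificial intelligence systems to achieve accurate video recognition at low computational cost. Efficient video recognition methods based on adaptive inference typically preview the video and focus on its salient parts to reduce computation. Most existing works focus on learning complex networks with video-classification-based objectives. Because they treat all frames as positive samples, few of them pay attention to the discrimination between positive samples (salient frames) and negative samples (non-salient frames). To fill this gap, in this paper we propose a novel Non-saliency Suppression Network (NSNet), which effectively suppresses the responses of non-salient frames. Specifically, at the frame level, effective pseudo labels that distinguish salient frames from non-salient frames are generated to guide frame saliency learning. At the video level, a temporal attention module is learned under dual video-level supervision on both the salient and the non-salient representations. The saliency measurements from the two levels are combined to exploit multi-granularity complementary information. Extensive experiments on four well-known benchmarks verify that NSNet not only achieves a state-of-the-art accuracy-efficiency trade-off but also delivers significantly faster (2.4 to 4.3 times) practical inference speed than state-of-the-art methods. Our project page is at https://lawrencexia2008.github.io/projects/nsnet.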
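Efficient video recognition is a hot research topic, driven by the explosive growth of multimedia data on the Internet and mobile devices. Most existing methods select salient frames without awareness of class-specific saliency scores, which neglects the implicit association between the saliency of a frame and the category it belongs to. To alleviate this issue, we devise a novel Temporal Saliency Query (TSQ) mechanism, which introduces class-specific information to provide fine-grained cues for saliency measurement. Specifically, we model the class-specific saliency measurement process as a query-response task. For each category, its common pattern is used as a query, and the most salient frames respond to it. The computed similarities are then adopted as the frame saliency scores. To achieve this, we propose a Temporal Saliency Query Network (TSQNet) that includes two instantiations of the TSQ mechanism, based on visual appearance similarities and on textual event-object relations. Afterwards, cross-modality interactions are imposed to promote information exchange between them. Finally, we use the class-specific saliencies of the most confident categories generated by the two modalities to select the salient frames. Extensive experiments demonstrate the effectiveness of our method, which achieves state-of-the-art results on the ActivityNet, FCVID and Mini-Kinetics datasets. Our project page is at https://lawrencexia2008.github.io/projects/tsqnet.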
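The query-response view of saliency described above can be illustrated with a small sketch. It assumes that a class query and each frame are already represented as plain feature vectors and that cosine similarity is the similarity measure; the function names and the top-k selection rule are hypothetical and are not taken from TSQNet.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def frame_saliency(class_query, frame_features):
    """Score each frame by its similarity to the class query."""
    return [cosine(class_query, f) for f in frame_features]


def select_salient_frames(class_query, frame_features, k):
    """Return the indices of the k highest-scoring frames, in temporal order."""
    scores = frame_saliency(class_query, frame_features)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy example: four 2-D frame features and one class query.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]
selected = select_salient_frames(query, frames, k=2)
```

In this toy example the first and third frames are the most similar to the query, so they are the ones kept.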
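Temporal action proposal generation (TAPG) is a challenging task that aims to locate action instances, with their temporal boundaries, in untrimmed videos. To evaluate the confidence of proposals, existing works typically predict an action score for each proposal, supervised by the temporal Intersection-over-Union (tIoU) between the proposal and the ground truth. In this paper, we propose a general auxiliary Background Constraint idea that further suppresses low-quality proposals by using the background prediction score to restrict proposal confidence. In this way, the Background Constraint concept can be easily plugged into existing TAPG methods (e.g., BMN, GTAD). From this perspective, we propose the Background Constraint Network (BCNet) to further exploit the rich information of action and background. Specifically, we introduce an Action-Background Interaction module for reliable confidence evaluation, which models the inconsistency between action and background through attention mechanisms at the frame and clip levels. Extensive experiments are conducted on two popular benchmarks, ActivityNet-1.3 and THUMOS14. The results demonstrate that our method outperforms state-of-the-art methods. Equipped with an existing action classifier, our method also achieves remarkable performance on the temporal action localization task.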
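For readers unfamiliar with the supervision signal mentioned above, the following sketch computes tIoU between two temporal segments. The second function is only an assumed illustration of how a background score might restrict a proposal's confidence; the multiplicative rule and the name `constrained_confidence` are hypothetical and are not taken from BCNet.

```python
def temporal_iou(proposal, ground_truth):
    """tIoU of two (start, end) segments given in the same time unit."""
    p_start, p_end = proposal
    g_start, g_end = ground_truth
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


def constrained_confidence(action_score, background_scores):
    """Assumed rule: scale the action score by (1 - mean background score).

    background_scores holds per-frame background probabilities in [0, 1]
    for the frames covered by the proposal.
    """
    if not background_scores:
        return action_score
    mean_background = sum(background_scores) / len(background_scores)
    return action_score * (1.0 - mean_background)


# A proposal covering 0-10 s against a ground-truth action at 5-15 s:
# the overlap is 5 s and the union is 15 s.
overlap = temporal_iou((0.0, 10.0), (5.0, 15.0))
# A proposal whose frames look half like background loses half its confidence.
confidence = constrained_confidence(0.8, [0.5, 0.5])
```

The point of the second function is only that a proposal dominated by background frames ends up with a lower confidence, which matches the suppression effect the abstract describes.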
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
Recent studies have shown that using an external Language Model (LM) benefits end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. Long-tail prediction problems have been widely studied in many applications, but have been addressed by only a few studies for ASR and LMs. In this paper, we propose a new memory-augmented, lookup-dictionary-based Transformer architecture for LMs. The newly introduced lookup dictionary incorporates rich contextual information from the training set, which is vital for correctly predicting long-tail tokens. In intensive experiments on Chinese and English data sets, our proposed method is shown to outperform the baseline Transformer LM by a large margin on both word/character error rate and tail-token error rate. This is achieved without impact on decoding efficiency. Overall, we demonstrate the effectiveness of our proposed method in boosting ASR decoding performance, especially for long-tail tokens.
In cluster analysis, it is crucial to evaluate clustering quality and to determine the optimal number of clusters. In this paper, a multi-granularity characterization of the data set is carried out to obtain hyper-balls. A hyper-ball-based internal cluster evaluation index (HCVI) is defined. Moreover, a general method for determining the optimal number of clusters based on HCVI is proposed. The proposed methods can evaluate the clustering results produced by several classic methods and determine the optimal cluster number for data sets containing noise and clusters with arbitrary shapes. Experimental results on synthetic and real data sets indicate that the new index outperforms existing ones.
Generalizability to unseen forgery types is crucial for face forgery detectors. Recent works have made significant progress in generalization through synthetic forgery data augmentation. In this work, we explore another path for improving generalization. Our goal is to reduce the features that are easy to learn in the training phase, so as to reduce the risk of overfitting to specific forgery types. Specifically, in our method, a teacher network takes face images as input and generates an attention map of the deep features using a diverse multi-head attention ViT. The attention map is used to guide a student network to focus on the low-attended features by reducing the highly attended deep features. A deep feature mixup strategy is also proposed to synthesize forgeries in the feature domain. Experiments demonstrate that, without data augmentation, our method achieves promising performance on unseen forgeries and highly compressed data.
This paper presents a novel framework for planning in unknown and occluded urban spaces. We specifically focus on turns and intersections where occlusions significantly impact navigability. Our approach uses an inpainting model to fill in a sparse, occluded, semantic lidar point cloud and plans dynamically feasible paths for a vehicle to traverse the open and inpainted spaces. We demonstrate our approach using a car's lidar data with real-time occlusions, and show that by inpainting occluded areas, we can plan longer paths with more turn options than without inpainting. In addition, our approach follows paths derived from a planner with no occlusions (called the ground truth) more closely than other state-of-the-art approaches.
In this work, we investigate improving the generalizability of GAN-generated image detectors by performing data augmentation in the fingerprint domain. Specifically, we first separate the fingerprints and contents of the GAN-generated images using an autoencoder based GAN fingerprint extractor, followed by random perturbations of the fingerprints. Then the original fingerprints are substituted with the perturbed fingerprints and added to the original contents, to produce images that are visually invariant but with distinct fingerprints. The perturbed images can successfully imitate images generated by different GANs to improve the generalization of the detectors, which is demonstrated by the spectra visualization. To our knowledge, we are the first to conduct data augmentation in the fingerprint domain. Our work explores a novel prospect that is distinct from previous works on spatial and frequency domain augmentation. Extensive cross-GAN experiments demonstrate the effectiveness of our method compared to the state-of-the-art methods in detecting fake images generated by unknown GANs.