预期人类行为的问题是固有的不确定问题。但是,如果我们对演员试图实现的目标有一种了解,我们可以减少这种不确定性。在这里,我们提出了一个动作预期模型,该模型利用目标信息,以减少未来预测中的不确定性。由于我们在推理过程中没有目标信息或观察到的动作,因此我们诉诸于视觉表示,以封装有关动作和目标的信息。通过此,我们得出了一个名为“抽象目标”的新颖概念,该概念基于观察到的视觉特征序列进行动作预期。我们将抽象目标设计为分布,其参数是使用变异复发网络估算的。我们为下一个动作采样了多个候选人,并引入目标一致性度量,以确定从抽象目标中遵循的最佳候选人。我们的方法对非常具有挑战性的Epic-Kitchens55(EK55),EK100和EGTEA凝视+数据集获得了令人印象深刻的结果。对于TOP-1动词,TOP-1名词和TOP-1动作预期精度,我们获得了+13.69,+11.24和+5.19的绝对改进,分别在先前的最新厨房(S1)的方法上获得了预期精度。 EK55。同样,我们还可以在Top-1动词(+10.75),名词(+5.84)和Action(+2.87)预期的Top-1动词(+10.75)设置的Uney厨房(S2)方面得到显着改进。对于EGTEA凝视 +数据集,观察到类似的趋势,在该数据中,对于名词,动词和动作预期,获得了+9.9,+13.1和+6.8的绝对改进。通过本文提交,我们的方法目前是EK55和EGTEA凝视+ https://competitions.codalab.org/competitions/20071#Results代码的新的最先进的预期。 //github.com/debadityaroy/abstract_goal
translated by 谷歌翻译
培训语义细分模型的现实世界注释收集是一个昂贵的过程。无监督的域适应性(UDA)试图通过研究如何使用更多可访问的数据(例如合成数据)来训练和适应现实世界图像而无需其注释,以解决此问题。最近的UDA方法通过使用学生和教师网络对像素的分类损失进行培训,适用于自学习。在本文中,我们建议通过对网络输出中元素之间的像素间关系进行建模,将一致性正则项添加到半监督UDA中。我们通过将其应用于最先进的涂抹式框架并将GTA5上的MIOU1绩效应用于CityScapes Benchmark,并在Synthia上的MIOU16绩效提高了MIOU19在Synthia上的效果,并将MIOU19上的MIOU1上的性能提高到CityScapes基准,将其应用于CityScapes Benchmark,并将MIOU19上的MIOU1上的性能提高到CityScapes基准,从而证明了拟议的一致性正规化项的有效性。
translated by 谷歌翻译
我们提出了Locommer,一种基于变压器的视频接地模型,其在恒定的存储空间中运行,无论视频长度如何,即帧数。 Locommer专为任务而设计,在那里需要处理整个长视频,并在其核心贴上两个主要贡献。首先,我们的模型包含一种新的采样技术,将输入要素序列分成固定数量的部分,并使用随机方法选择每个部分的单个特征,这允许我们获得代表视频内容的特征样本集在手中的任务,同时保持内存占用空间。其次,我们提出了一种模块化设计,将功能分开,使我们能够通过监督自我关注头来学习归纳偏差,同时还有效利用预先接受训练的文本和视频编码器。我们在相关的基准数据集中测试我们的建议,以进行视频接地,表明该表现形式不仅可以实现优异的结果,包括在YouCookii上的最先进的性能,也可以比竞争对手更有效,并且它一直有效在平均工作的情况下,最新工作的表现,均值较大,最终导致Chardes-STA的新的最先进的性能。
translated by 谷歌翻译
场景图生成(SGG)旨在捕获对物体对之间的各种相互作用,这对于完整的场景了解至关重要。在整个关系集上培训的现有SGG方法未能由于培训数据中的各种偏差而导致视觉和文本相关性的复杂原理。学习表明像“ON”这样的通用空间配置的琐碎关系,而不是“停放”,例如“停放”,不执行这种复杂的推理,伤害泛化。为了解决这个问题,我们提出了一种新颖的SGG培训框架,以利用基于其信息的关系标签。我们的模型 - 不可知论培训程序对培训数据中的较少信息样本造成缺失的信息关系,并在算标签上培训算法的SGG模型以及现有的注释。我们表明,这种方法可以成功地与最先进的SGG方法结合使用,并在标准视觉基因组基准测试中显着提高它们的性能。此外,我们在更具挑战性的零射击设置中获得了看不见的三胞胎的相当大的改进。
translated by 谷歌翻译
卷积神经网络(CNNS)的注意力模块是增强网络对多个计算机视觉任务的性能的有效方法。虽然许多作品专注于通过适当的渠道,空间和自我关注建立更有效的模块,但它们主要以供给送出方式运作。因此,注意机制强烈取决于单个输入特征激活的代表能力,并且可以从语义上更丰富的更高级别激活中受益,该激活可以通过自上而下信息流指定“有什么和位置”。这种反馈连接在灵长类动物视觉皮层中也普遍存在,并且神经科学家被认为是灵长类动物视觉关注的关键组成部分。因此,在这项工作中,我们提出了一种轻量级的自上而下(TD)注意模块,其迭代地产生“视觉探照灯”以执行自上而下的信道和其输入的空间调制,从而在每个计算步骤中输出更多的选择性特征激活。我们的实验表明,集成CNNS中的TD在Imagenet-1K分类上增强了它们的性能,并且优于突出的注意模块,同时具有更多参数和记忆力。此外,我们的模型在推理期间更改输入分辨率更加强大,并通过在没有任何显式监督的情况下本地化各个对象或特征来学习“转移注意”。除了改进细粒度和多标签分类的情况下,这种能力在弱监督对象定位上导致RESET50改进了5%。
translated by 谷歌翻译
There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs coined SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as which caption-generator best understands colors? and can caption-generators count?
translated by 谷歌翻译
Advances in computer vision and machine learning techniques have led to significant development in 2D and 3D human pose estimation from RGB cameras, LiDAR, and radars. However, human pose estimation from images is adversely affected by occlusion and lighting, which are common in many scenarios of interest. Radar and LiDAR technologies, on the other hand, need specialized hardware that is expensive and power-intensive. Furthermore, placing these sensors in non-public areas raises significant privacy concerns. To address these limitations, recent research has explored the use of WiFi antennas (1D sensors) for body segmentation and key-point body detection. This paper further expands on the use of the WiFi signal in combination with deep learning architectures, commonly used in computer vision, to estimate dense human pose correspondence. We developed a deep neural network that maps the phase and amplitude of WiFi signals to UV coordinates within 24 human regions. The results of the study reveal that our model can estimate the dense pose of multiple subjects, with comparable performance to image-based approaches, by utilizing WiFi signals as the only input. This paves the way for low-cost, broadly accessible, and privacy-preserving algorithms for human sensing.
translated by 谷歌翻译
Periocular refers to the region of the face that surrounds the eye socket. This is a feature-rich area that can be used by itself to determine the identity of an individual. It is especially useful when the iris or the face cannot be reliably acquired. This can be the case of unconstrained or uncooperative scenarios, where the face may appear partially occluded, or the subject-to-camera distance may be high. However, it has received revived attention during the pandemic due to masked faces, leaving the ocular region as the only visible facial area, even in controlled scenarios. This paper discusses the state-of-the-art of periocular biometrics, giving an overall framework of its most significant research aspects.
translated by 谷歌翻译
Traditionally, data analysis and theory have been viewed as separate disciplines, each feeding into fundamentally different types of models. Modern deep learning technology is beginning to unify these two disciplines and will produce a new class of predictively powerful space weather models that combine the physical insights gained by data and theory. We call on NASA to invest in the research and infrastructure necessary for the heliophysics' community to take advantage of these advances.
translated by 谷歌翻译
Multi-class ensemble classification remains a popular focus of investigation within the research community. The popularization of cloud services has sped up their adoption due to the ease of deploying large-scale machine-learning models. It has also drawn the attention of the industrial sector because of its ability to identify common problems in production. However, there are challenges to conform an ensemble classifier, namely a proper selection and effective training of the pool of classifiers, the definition of a proper architecture for multi-class classification, and uncertainty quantification of the ensemble classifier. The robustness and effectiveness of the ensemble classifier lie in the selection of the pool of classifiers, as well as in the learning process. Hence, the selection and the training procedure of the pool of classifiers play a crucial role. An (ensemble) classifier learns to detect the classes that were used during the supervised training. However, when injecting data with unknown conditions, the trained classifier will intend to predict the classes learned during the training. To this end, the uncertainty of the individual and ensemble classifier could be used to assess the learning capability. We present a novel approach for novel detection using ensemble classification and evidence theory. A pool selection strategy is presented to build a solid ensemble classifier. We present an architecture for multi-class ensemble classification and an approach to quantify the uncertainty of the individual classifiers and the ensemble classifier. We use uncertainty for the anomaly detection approach. Finally, we use the benchmark Tennessee Eastman to perform experiments to test the ensemble classifier's prediction and anomaly detection capabilities.
translated by 谷歌翻译