基于对比的学习的预培训的目标是利用大量的未标记数据来产生可以容易地调整下游的模型。电流方法围绕求解图像辨别任务:给定锚图像,该图像的增强对应物和一些其他图像,该模型必须产生表示,使得锚和其对应物之间的距离很小,并且锚和其他图像很大。这种方法存在两个重要问题:(i)通过对比图像级别的表示,很难生成有利于下游对象级任务(如实例分段)的详细对象敏感功能; (ii)制造增强对应的增强策略是固定的,在预培训的后期阶段做出更低的学习。在这项工作中,我们引入课程对比对象级预培训(CCOP)来解决这些问题:(i)我们使用选择性搜索来查找粗略对象区域并使用它们构建图像间对象级对比度损耗和一个图像内对象级别歧视损失进入我们的预训练目标; (ii)我们提出了一种课程学习机制,其自适应地增强所生成的区域,这允许模型一致地获取有用的学习信号,即使在预训练的后期阶段也是如此。我们的实验表明,当在多对象场景图像数据集上进行预训练时,我们的方法通过大量对象级任务的大幅度提高了MoCo V2基线。代码可在https://github.com/chenhongyiyang/ccop中找到。
translated by 谷歌翻译
复杂知识库问题回答是过去十年的一个流行的研究领域。最近的公共数据集导致这一领域的令人鼓舞的结果,但主要涉及英语,只涉及少数问题类型和关系,在更现实的环境和英语以外的语言中妨碍研究。此外,很少有最先进的KBQA模型在Wikidata上培训,是最受欢迎的真实知识库之一。我们提出了CLC-Quad,这是Wikidata的第一个大规模复杂的中文语义解析数据集,以解决这些挑战。我们与数据集一起介绍了一个文本到SPARQL基线模型,可以有效地应答多种类型的复杂问题,例如事实上的问题,双重意图问题,布尔问题和计数问题,以及Wikidata作为背景知识。我们终于分析了SOTA KBQA模型在此数据集中的表现,并确定了中国KBQA面临的挑战。
translated by 谷歌翻译
Letting a deep network be aware of the quality of its own predictions is an interesting yet important problem. In the task of instance segmentation, the confidence of instance classification is used as mask quality score in most instance segmentation frameworks. However, the mask quality, quantified as the IoU between the instance mask and its ground truth, is usually not well correlated with classification score. In this paper, we study this problem and propose Mask Scoring R-CNN which contains a network block to learn the quality of the predicted instance masks. The proposed network block takes the instance feature and the corresponding predicted mask together to regress the mask IoU. The mask scoring strategy calibrates the misalignment between mask quality and mask score, and improves instance segmentation performance by prioritizing more accurate mask predictions during COCO AP evaluation. By extensive evaluations on the COCO dataset, Mask Scoring R-CNN brings consistent and noticeable gain with different models, and outperforms the state-of-the-art Mask R-CNN. We hope our simple and effective approach will provide a new direction for improving instance segmentation. The source code of our method is available at https:// github.com/zjhuang22/maskscoring_rcnn. * The work was done when Zhaojin Huang was an intern in Horizon Robotics Inc.
translated by 谷歌翻译
Contextual information is vital in visual understanding problems, such as semantic segmentation and object detection. We propose a Criss-Cross Network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Besides, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet is with the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11× less GPU memory usage. 2) High computational efficiency. The recurrent criss-cross attention significantly reduces FLOPs by about 85% of the non-local block. 3) The state-of-the-art performance. We conduct extensive experiments on semantic segmentation benchmarks including Cityscapes, ADE20K, human parsing benchmark LIP, instance segmentation benchmark COCO, video segmentation benchmark CamVid. In particular, our CCNet achieves the mIoU scores of 81.9%, 45.76% and 55.47% on the Cityscapes test set, the ADE20K validation set and the LIP validation set respectively, which are the new state-of-the-art results. The source codes are available at https://github.com/speedinghzl/CCNet.
translated by 谷歌翻译
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to the reasoning bias during frame-query interaction, reducing the generalization ability of model. To alleviate above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
translated by 谷歌翻译
Data heterogeneity across clients in federated learning (FL) settings is a widely acknowledged challenge. In response, personalized federated learning (PFL) emerged as a framework to curate local models for clients' tasks. In PFL, a common strategy is to develop local and global models jointly - the global model (for generalization) informs the local models, and the local models (for personalization) are aggregated to update the global model. A key observation is that if we can improve the generalization ability of local models, then we can improve the generalization of global models, which in turn builds better personalized models. In this work, we consider class imbalance, an overlooked type of data heterogeneity, in the classification setting. We propose FedNH, a novel method that improves the local models' performance for both personalization and generalization by combining the uniformity and semantics of class prototypes. FedNH initially distributes class prototypes uniformly in the latent space and smoothly infuses the class semantics into class prototypes. We show that imposing uniformity helps to combat prototype collapse while infusing class semantics improves local models. Extensive experiments were conducted on popular classification datasets under the cross-device setting. Our results demonstrate the effectiveness and stability of our method over recent works.
translated by 谷歌翻译
Point cloud completion, as the upstream procedure of 3D recognition and segmentation, has become an essential part of many tasks such as navigation and scene understanding. While various point cloud completion models have demonstrated their powerful capabilities, their robustness against adversarial attacks, which have been proven to be fatally malicious towards deep neural networks, remains unknown. In addition, existing attack approaches towards point cloud classifiers cannot be applied to the completion models due to different output forms and attack purposes. In order to evaluate the robustness of the completion models, we propose PointCA, the first adversarial attack against 3D point cloud completion models. PointCA can generate adversarial point clouds that maintain high similarity with the original ones, while being completed as another object with totally different semantic information. Specifically, we minimize the representation discrepancy between the adversarial example and the target point set to jointly explore the adversarial point clouds in the geometry space and the feature space. Furthermore, to launch a stealthier attack, we innovatively employ the neighbourhood density information to tailor the perturbation constraint, leading to geometry-aware and distribution-adaptive modifications for each point. Extensive experiments against different premier point cloud completion networks show that PointCA can cause a performance degradation from 77.9% to 16.7%, with the structure chamfer distance kept below 0.01. We conclude that existing completion models are severely vulnerable to adversarial examples, and state-of-the-art defenses for point cloud classification will be partially invalid when applied to incomplete and uneven point cloud data.
translated by 谷歌翻译
无人驾驶飞机(UAV)通过低成本,大型覆盖,实时和高分辨率数据采集能力而广泛应用于检查,搜索和救援行动的目的。在这些过程中产生了大量航空视频,在这些过程中,正常事件通常占压倒性的比例。本地化和提取异常事件非常困难,这些事件包含手动从长视频流中的潜在有价值的信息。因此,我们致力于开发用于解决此问题的异常检测方法。在本文中,我们创建了一个新的数据集,名为Droneanomaly,用于空中视频中的异常检测。该数据集提供了37个培训视频序列和22个测试视频序列,这些视频序列来自7个不同的现实场景,其中包括各种异常事件。有87,488个彩色视频框架(训练51,635,测试35,853),每秒30帧的尺寸为640美元\ times 640美元。基于此数据集,我们评估现有方法并为此任务提供基准。此外,我们提出了一种新的基线模型,即变压器(ANDT)的异常检测,该模型将连续的视频帧视为一系列小管,它利用变压器编码器从序列中学习特征表示,并利用解码器来预测下一帧。我们的网络模型在训练阶段模型正常,并确定了具有不可预测的时间动力学的事件,作为测试阶段的异常。此外,为了全面评估我们提出的方法的性能,我们不仅使用无人机 - 异常数据集,而且使用另一个数据集。我们将使我们的数据集和代码公开可用。可以在https://youtu.be/ancczyryoby上获得演示视频。我们使数据集和代码公开可用。
translated by 谷歌翻译
由于其低成本和快速移动性,无人驾驶汽车(UAV)现在已广泛应用于数据获取。随着航空视频量的增加,对这些视频自动解析的需求正在激增。为了实现这一目标,当前的研究主要集中于在空间和时间维度沿着卷积的整体特征提取整体特征。但是,这些方法受到小时接收场的限制,无法充分捕获长期的时间依赖性,这对于描述复杂动力学很重要。在本文中,我们提出了一个新颖的深神经网络,称为futh-net,不仅为整体特征建模,而且还模拟了空中视频分类的时间关系。此外,在新型融合模块中,多尺度的时间关系可以完善整体特征,以产生更具歧视性的视频表示。更特别地,FUTH-NET采用了两条道路架构:(1)学习框架外观和短期时间变化的一般特征的整体代表途径,以及(2)捕获跨任意跨越任意时间关系的时间关系途径框架,提供长期的时间依赖性。之后,提出了一个新型的融合模块,以时空整合从这两种途径中学到的两个特征。我们的模型对两个航空视频分类数据集进行了评估,即ERA和无人机操作,并实现了最新结果。这表明了其在不同识别任务(事件分类和人类行动识别)之间的有效性和良好的概括能力。为了促进进一步的研究,我们在https://gitlab.lrz.de/ai4eo/reasoning/futh-net上发布该代码。
translated by 谷歌翻译
数据质量是发展医疗保健中值得信赖的AI的关键因素。大量具有控制混杂因素的策划数据集可以帮助提高下游AI算法的准确性,鲁棒性和隐私性。但是,访问高质量的数据集受数据获取的技术难度的限制,并且严格的道德限制阻碍了医疗保健数据的大规模共享。数据合成算法生成具有与真实临床数据相似的分布的数据,可以作为解决可信度AI的发展过程中缺乏优质数据的潜在解决方案。然而,最新的数据合成算法,尤其是深度学习算法,更多地集中于成像数据,同时忽略了非成像医疗保健数据的综合,包括临床测量,医疗信号和波形以及电子保健记录(EHRS)(EHRS) 。因此,在本文中,我们将回顾合成算法,尤其是对于非成像医学数据,目的是在该领域提供可信赖的AI。本教程风格的审查论文将对包括算法,评估,局限性和未来研究方向在内的各个方面进行全面描述。
translated by 谷歌翻译