The task of assigning a semantic class and a track identity to every pixel in a video is called video panoptic segmentation. Our work is the first to target this task in a real-world setting requiring dense interpretation in both the spatial and temporal domains. As ground truth for this task is difficult to obtain, however, existing datasets are either constructed synthetically or only sparsely annotated within short video clips. To overcome this, we introduce a new benchmark comprising two datasets, KITTI-STEP and MOTChallenge-STEP. The datasets contain long video sequences, providing challenging examples and a test-bed for studying long-term, pixel-precise segmentation and tracking under real-world conditions. We further propose a novel evaluation metric, Segmentation and Tracking Quality (STQ), which fairly balances the semantic and tracking aspects of this task and is better suited to evaluating sequences of arbitrary length. Finally, we provide several baselines to evaluate the status of existing methods on this new and challenging dataset. We have made our datasets, metric, benchmark servers, and baselines publicly available, and hope this will inspire future research.
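The abstract only names STQ; as a reading aid, here is a minimal sketch of how the metric combines its two factors, assuming an association-quality score AQ and per-class IoUs have already been computed (the paper's full track-level AQ computation is omitted).

```python
import numpy as np

def segmentation_quality(per_class_iou):
    """SQ: mean IoU over the semantic classes (a simplification of the paper's definition)."""
    return float(np.mean(per_class_iou))

def stq(association_quality, per_class_iou):
    """STQ is the geometric mean of association quality (AQ) and segmentation quality (SQ),
    so a method must perform well on both tracking and semantics to score highly."""
    return float(np.sqrt(association_quality * segmentation_quality(per_class_iou)))

# Hypothetical values, for illustration only.
print(stq(association_quality=0.62, per_class_iou=[0.81, 0.55, 0.70]))
```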
Panoptic image segmentation is the computer vision task of finding groups of pixels in an image and assigning semantic classes and object instance identifiers to them. Research in image segmentation has become increasingly popular due to its critical applications in robotics and autonomous driving. The research community therefore relies on publicly available benchmark datasets to advance the state of the art in computer vision. Due to the high cost of densely labeling images, however, there is a shortage of publicly available ground-truth labels suitable for panoptic segmentation. The high labeling cost also makes it challenging to extend existing datasets to the video domain and to multi-camera setups. We therefore present the Waymo Open Dataset: Panoramic Video Panoptic Segmentation Dataset, a large-scale dataset that offers high-quality panoptic segmentation labels for autonomous driving. We generate our dataset from the publicly available Waymo Open Dataset, leveraging its diverse set of camera images. Our labels are consistent over time for video processing and consistent across multiple cameras mounted on the vehicles for full panoramic scene understanding. Specifically, we provide labels for 28 semantic categories and 2,860 temporal sequences captured by five cameras mounted on autonomous vehicles driving in three different geographical locations, leading to a total of 100k labeled camera images. To the best of our knowledge, this makes our dataset an order of magnitude larger than existing datasets offering video panoptic segmentation labels. We further propose a new benchmark for panoramic video panoptic segmentation and establish a number of strong baselines based on the DeepLab family of models. We will make the benchmark and the code publicly available. Find the dataset at https://waymo.com/open.
Panoptic scene understanding and tracking of dynamic agents are essential for robots and automated vehicles to navigate in urban environments. As LiDAR provides accurate, illumination-independent geometric depictions of the scene, performing these tasks on LiDAR point clouds provides reliable predictions. However, existing datasets lack diversity in the types of urban scenes and have a limited number of dynamic object instances, which hinders both learning of these tasks and credible benchmarking of the developed methods. In this paper, we introduce the large-scale Panoptic nuScenes benchmark dataset, which extends our popular nuScenes dataset with point-wise ground-truth annotations for semantic segmentation, panoptic segmentation, and panoptic tracking tasks. To facilitate comparison, we provide strong baselines for each of these tasks on our proposed dataset. Moreover, we analyze the drawbacks of existing metrics for panoptic tracking and propose a novel instance-centric PAT metric that addresses these concerns. We present exhaustive experiments demonstrating the utility of Panoptic nuScenes compared to existing datasets, and make the online evaluation server available at nuscenes.org. We believe this extension will accelerate research on novel methods for scene understanding of dynamic urban environments.
We propose and study a task we name panoptic segmentation (PS). Panoptic segmentation unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance). The proposed task requires generating a coherent scene segmentation that is rich and complete, an important step toward real-world vision systems. While early work in computer vision addressed related image/scene parsing tasks, these are not currently popular, possibly due to lack of appropriate metrics or associated recognition challenges. To address this, we propose a novel panoptic quality (PQ) metric that captures performance for all classes (stuff and things) in an interpretable and unified manner. Using the proposed metric, we perform a rigorous study of both human and machine performance for PS on three existing datasets, revealing interesting insights about the task. The aim of our work is to revive the interest of the community in a more unified view of image segmentation.
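As a concrete reading of the PQ definition, the sketch below computes it from boolean segment masks, matching a prediction to a ground-truth segment of the same class when their IoU exceeds 0.5; void regions and the per-class averaging used in the paper are left out for brevity.

```python
import numpy as np

def iou(a, b):
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def panoptic_quality(pred_segments, gt_segments):
    """PQ = (sum of IoUs over matched pairs) / (|TP| + 0.5*|FP| + 0.5*|FN|).
    Segments are (class_id, boolean_mask) pairs; an IoU > 0.5 match is necessarily unique."""
    matched_gt, tp_iou_sum, tp = set(), 0.0, 0
    for p_cls, p_mask in pred_segments:
        for gi, (g_cls, g_mask) in enumerate(gt_segments):
            if gi in matched_gt or p_cls != g_cls:
                continue
            ov = iou(p_mask, g_mask)
            if ov > 0.5:
                matched_gt.add(gi)
                tp_iou_sum += ov
                tp += 1
                break
    fp = len(pred_segments) - tp
    fn = len(gt_segments) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return tp_iou_sum / denom if denom > 0 else 0.0
```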
Video segmentation, i.e., grouping video frames into multiple segments or objects, plays a critical role in a broad range of practical applications, such as visual effects assistance in movies, scene understanding in autonomous driving, and virtual background creation in video conferencing, to name a few. Recently, with the renaissance of connectionism in computer vision, there has been an abundance of deep learning based approaches dedicated to video segmentation that deliver compelling performance. In this survey, we comprehensively review the two basic lines of research in this area, namely generic object segmentation (of unknown categories) in videos and video semantic segmentation, by introducing their respective task settings, background concepts, perceived need, development history, and main challenges. We also provide a detailed overview of representative literature on both methods and datasets. In addition, we present quantitative performance comparisons of the reviewed methods on benchmark datasets. Finally, we point out a set of unsolved open issues in this field and suggest possible opportunities for further research.
Image segmentation for video analysis plays an important role in different research fields such as smart cities, healthcare, computer vision and geoscience, and remote sensing applications. In this regard, significant effort has recently been devoted to developing new segmentation strategies; one of the latest outstanding achievements is panoptic segmentation, which arose from the fusion of semantic and instance segmentation. Specifically, panoptic segmentation is currently being studied to help obtain a more nuanced knowledge of image scenes for video surveillance, crowd counting, autonomous driving, and medical image analysis, and a deeper understanding of scenes in general. To that end, we present in this paper the first comprehensive review of existing panoptic segmentation methods, to the best of the authors' knowledge. Accordingly, a well-defined taxonomy of existing panoptic techniques is provided based on the nature of the adopted algorithms, the application scenarios, and the primary objectives. In addition, the use of panoptic segmentation for annotating new datasets via pseudo-labeling is discussed. Moving on, ablation studies are carried out to understand the panoptic methods from different perspectives. Furthermore, evaluation metrics suitable for panoptic segmentation are discussed, and a comparison of the performance of existing solutions is provided to report the state of the art and identify its limitations and strengths. Finally, the current challenges facing the technology and the future trends expected to attract considerable interest in the near future are elaborated, which can serve as a starting point for upcoming research studies. The papers providing code are available at: https://github.com/elharroussomar/awesome-panoptic-egation
Can our video understanding systems perceive objects when heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, simultaneously detecting, segmenting, and tracking instances in occluded scenes. OVIS consists of 296k high-quality masks from 25 semantic categories, where object occlusions commonly occur. While our human vision system can understand those occluded instances through contextual reasoning and association, our experiments suggest that current video understanding systems cannot. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 16.3, which reveals that we are still at a nascent stage of understanding objects, instances, and videos under occlusion. We also present a simple plug-and-play module that performs temporal feature calibration to complement missing object cues caused by occlusion. Built upon MaskTrack R-CNN and SipMask, we obtain a remarkable AP improvement on the OVIS dataset. The OVIS dataset and project code are available at http://songbai.site/ovis.
[Cityscapes dataset (www.cityscapes-dataset.net) annotation splits: 3,475 train/val images with fine annotations, 20,000 train images with coarse annotations, and 1,525 test images with fine annotations.]
In this paper we present a new computer vision task, named video instance segmentation. The goal of this new task is simultaneous detection, segmentation and tracking of instances in videos. In other words, it is the first time that the image instance segmentation problem is extended to the video domain. To facilitate research on this new task, we propose a large-scale benchmark called YouTube-VIS, which consists of 2,883 high-resolution YouTube videos, a 40-category label set and 131k high-quality instance masks. In addition, we propose a novel algorithm called Mask-Track R-CNN for this task. Our new method introduces a new tracking branch to Mask R-CNN to jointly perform the detection, segmentation and tracking tasks simultaneously. Finally, we evaluate the proposed method and several strong baselines on our new dataset. Experimental results clearly demonstrate the advantages of the proposed algorithm and reveal insight for future improvement. We believe the video instance segmentation task will motivate the community along the line of research for video understanding.
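The tracking branch mentioned above is, to my understanding, an embedding head whose outputs are matched against a memory of embeddings from previously identified instances, with an extra option of starting a new track; the sketch below shows only that matching step and omits the detection-confidence, box-IoU, and class-consistency terms the paper folds into the final assignment score.

```python
import numpy as np

def associate(det_embeddings, memory_embeddings):
    """For each detection embedding, compute similarity logits against the stored identity
    embeddings plus a constant 0 logit standing for 'start a new track', then take the argmax.
    Returns the matched memory index per detection, or None for a new identity."""
    assignments = []
    for e in det_embeddings:
        sims = memory_embeddings @ e                 # dot-product similarity to each stored identity
        logits = np.concatenate(([0.0], sims))       # index 0 = new identity
        choice = int(np.argmax(logits))
        assignments.append(None if choice == 0 else choice - 1)
    return assignments

# Hypothetical 4-d embeddings, for illustration only.
memory = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
detections = np.array([[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 0.7, 0.1]])
print(associate(detections, memory))                 # -> [0, None]
```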
Multiple existing benchmarks involve tracking and segmenting objects in video, e.g., Video Object Segmentation (VOS) and Multi-Object Tracking and Segmentation (MOTS), but there is little interaction between them due to the use of disparate benchmark datasets and metrics (e.g., J&F, mAP, sMOTSA). As a result, published works usually target a particular benchmark and are not easily comparable to one another. We believe that the development of generalized methods that can tackle multiple tasks requires greater cohesion among these research sub-communities. In this paper, we aim to facilitate this by proposing BURST, a dataset containing thousands of diverse videos with high-quality object masks, and an associated benchmark with six tasks involving object tracking and segmentation in video. All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison and hence more effectively pool knowledge from different methods across different tasks. Additionally, we demonstrate several baselines for all tasks and show that approaches for one task can be applied to another with a quantifiable and explainable performance difference. Dataset annotations and evaluation code are available at: https://github.com/ali2500/burst-benchmark.
Current multi-category multi-object tracking (MOT) metrics use class labels to group tracking results for per-class evaluation. Similarly, MOT methods typically only associate objects with the same class prediction. These two prevalent strategies in MOT implicitly assume that classification performance is near-perfect. However, this is far from the case in recent large-scale MOT datasets, which contain many classes that are rare or semantically similar. The resulting incorrect classifications therefore lead to sub-optimal tracking and inadequate benchmarking of trackers. We address these issues by disentangling classification from tracking. We introduce a new metric, Track Every Thing Accuracy (TETA), which breaks tracking measurement into three sub-factors: localization, association, and classification, allowing comprehensive benchmarking of tracking performance even under inaccurate classification. TETA also deals with the challenging problem of incomplete annotations in large-scale tracking datasets. We further introduce a Track Every Thing tracker (TETer) that performs association using Class Exemplar Matching (CEM). Our experiments show that TETA evaluates trackers more comprehensively, and that TETer achieves significant improvements on the challenging large-scale datasets BDD100K and TAO compared to the state of the art.
Datasets drive vision progress, yet existing driving datasets are impoverished in terms of visual content and supported tasks to study multitask learning for autonomous driving. Researchers are usually constrained to study a small set of problems on one dataset, while real-world computer vision applications require performing tasks of various complexities. We construct BDD100K, the largest driving video dataset with 100K videos and 10 tasks to evaluate the exciting progress of image recognition algorithms on autonomous driving. The dataset possesses geographic, environmental, and weather diversity, which is useful for training models that are less likely to be surprised by new conditions. Based on this diverse dataset, we build a benchmark for heterogeneous multitask learning and study how to solve the tasks together. Our experiments show that special training strategies are needed for existing models to perform such heterogeneous tasks. BDD100K opens the door for future studies in this important venue.
Although deep learning methods have achieved advanced video object recognition performance in recent years, perceiving heavily occluded objects in a video remains a very challenging task. To promote the development of occlusion understanding, we collect a large-scale dataset called OVIS for video instance segmentation in occluded scenarios. OVIS consists of 296k high-quality instance masks and 901 occluded scenes. While our human vision system can perceive those occluded objects through contextual reasoning and association, our experiments suggest that current video understanding systems cannot. On the OVIS dataset, all baseline methods suffer a performance degradation of about 80%, which indicates that there is still a long way to go toward understanding occluded objects and videos in complex real-world scenarios. To promote research on new paradigms for video understanding systems, we launched a challenge based on the OVIS dataset. The top-performing submitted algorithms have achieved considerably higher performance than our baselines. In this paper, we introduce the OVIS dataset and further dissect it by analyzing the results of the baselines and the submitted methods. The OVIS dataset and challenge information can be found at http://songbai.site/ovis.
Multi-animal tracking (MAT), a multi-object tracking (MOT) problem, is crucial for animal motion and behavior analysis and has many important applications in biology, ecology, and animal conservation. Despite its importance, MAT is largely under-explored compared to other MOT problems such as multi-human tracking due to the scarcity of dedicated benchmarks. To address this problem, we introduce AnimalTrack, a dedicated benchmark for multi-animal tracking in the wild. Specifically, AnimalTrack consists of 58 sequences from a diverse selection of 10 common animal categories. On average, each sequence comprises 33 target objects for tracking. In order to ensure high quality, every frame in AnimalTrack is manually labeled with careful inspection and refinement. To the best of our knowledge, AnimalTrack is the first benchmark dedicated to multi-animal tracking. In addition, to understand how existing MOT algorithms perform on AnimalTrack and provide baselines for future comparison, we extensively evaluate 14 state-of-the-art representative trackers. The evaluation results demonstrate that, not surprisingly, the performance of most of these trackers degrades due to the differences between pedestrians and animals in various aspects (e.g., pose, motion, and appearance), and more effort is needed to improve multi-animal tracking. We hope that AnimalTrack, together with our evaluation and analysis, will foster further progress on multi-animal tracking. The dataset and evaluation as well as our analysis will be made available at https://hengfan2010.github.io/projects/AnimalTrack/.
A typical pipeline for multi-object tracking (MOT) is to use a detector for object localization followed by re-identification (re-ID) for object association. This pipeline is partially motivated by recent progress in both object detection and re-ID, and partially motivated by biases in existing tracking datasets, where most objects tend to have distinguishing appearance and re-ID models are sufficient for establishing associations. In response to such bias, we would like to re-emphasize that methods for multi-object tracking should also work when object appearance is not sufficiently discriminative. To this end, we propose a large-scale dataset for multi-human tracking in which humans have similar appearance, diverse motion, and extreme articulation. As the dataset mostly contains group dancing videos, we name it "DanceTrack". We expect DanceTrack to provide a better platform for developing MOT algorithms that rely less on visual discrimination and depend more on motion analysis. We benchmark several state-of-the-art trackers on our dataset and observe a significant performance drop on DanceTrack compared with existing benchmarks. The dataset, project code, and competition server are released at: https://github.com/danceTrack.
Tracking has traditionally been the art of following interest points through space and time. This changed with the rise of powerful deep networks. Nowadays, tracking is dominated by pipelines that perform object detection followed by temporal association, also known as tracking-by-detection. We present a simultaneous detection and tracking algorithm that is simpler, faster, and more accurate than the state of the art. Our tracker, CenterTrack, applies a detection model to a pair of images and detections from the prior frame. Given this minimal input, CenterTrack localizes objects and predicts their associations with the previous frame. That's it. CenterTrack is simple, online (no peeking into the future), and real-time. It achieves 67.8% MOTA on the MOT17 challenge at 22 FPS and 89.4% MOTA on the KITTI tracking benchmark at 15 FPS, setting a new state of the art on both datasets. CenterTrack is easily extended to monocular 3D tracking by regressing additional 3D attributes. Using monocular video input, it achieves 28.3% AMOTA@0.2 on the newly released nuScenes 3D tracking benchmark, substantially outperforming the monocular baseline on this benchmark while running at 28 FPS.
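To make the association step concrete, here is a rough sketch of the greedy matching that CenterTrack's outputs lend themselves to: each current detection carries a regressed 2D offset pointing toward its previous-frame position and is matched to the closest unclaimed previous detection. The distance threshold and processing order here are illustrative assumptions (the tracker itself orders detections by confidence and relates the allowed distance to object size).

```python
import numpy as np

def greedy_associate(curr_centers, curr_offsets, prev_centers, max_dist=50.0):
    """curr_offsets point from each current center back toward its previous-frame location.
    Each displaced center is greedily matched to the nearest unclaimed previous detection;
    unmatched detections start new tracks (returned as None)."""
    prev_used = set()
    matches = []
    for c, off in zip(curr_centers, curr_offsets):
        target = c + off                               # estimated previous-frame position
        best_j, best_d = None, max_dist
        for j, p in enumerate(prev_centers):
            if j in prev_used:
                continue
            d = np.linalg.norm(target - p)
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None:
            prev_used.add(best_j)
        matches.append(best_j)
    return matches

# Tiny illustrative example: two current detections, one previous detection.
curr = np.array([[100.0, 50.0], [200.0, 80.0]])
offs = np.array([[-5.0, 0.0], [0.0, 0.0]])
prev = np.array([[96.0, 50.0]])
print(greedy_associate(curr, offs, prev))              # -> [0, None]
```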
We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which brings new challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g., an onion is peeled, diced, and cooked, where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, and 67K hand-object relations, covering 36 hours across 179 untrimmed videos. Along with the annotations, we introduce three challenges in video object segmentation, interaction understanding, and long-term reasoning. For data, code, and leaderboards: http://epic-kitchens.github.io/visor
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works utilize separate approaches to handle thing, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework named Panoptic-PartFormer. Moreover, we find that the previous metric PartPQ is biased toward PQ. To handle both issues, we make the following contributions: Firstly, we design a meta-architecture that decouples part feature and things/stuff feature, respectively. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model as Panoptic-PartFormer. Secondly, we propose a new metric Part-Whole Quality (PWQ) to better measure such task from both pixel-region and part-whole perspectives. It can also decouple the error for part segmentation and panoptic segmentation. Thirdly, inspired by Mask2Former, based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross attention scheme to further boost part segmentation qualities. We design a new part-whole interaction method using masked cross attention. Finally, the extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost drop of 70% on GFlops and 50% on parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
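Because the abstract credits a Mask2Former-style masked cross attention for the part-whole interaction, a generic sketch of that masking idea may help; how Panoptic-PartFormer++ specifically wires part and whole queries together is not spelled out here, so this only shows attention restricted to each query's predicted foreground region.

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(queries, keys, values, attn_mask):
    """queries: (Q, d); keys/values: (HW, d); attn_mask: (Q, HW) boolean, True where a query
    may attend (e.g. inside its foreground mask predicted by the previous decoder layer).
    Locations outside the mask are suppressed before the softmax."""
    d = queries.shape[-1]
    raw = queries @ keys.t() / d ** 0.5                   # (Q, HW) attention logits
    logits = raw.masked_fill(~attn_mask, float('-inf'))
    # Queries whose mask is empty fall back to unrestricted attention to avoid NaNs.
    empty = ~attn_mask.any(dim=-1, keepdim=True)
    logits = torch.where(empty, raw, logits)
    attn = F.softmax(logits, dim=-1)
    return attn @ values                                  # (Q, d) updated query features

# Hypothetical shapes: 3 queries attending over a 4x4 feature map with 8 channels.
q, k, v = torch.randn(3, 8), torch.randn(16, 8), torch.randn(16, 8)
mask = torch.rand(3, 16) > 0.5
out = masked_cross_attention(q, k, v, mask)
```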
We present the Caltech Fish Counting Dataset (CFC), a large-scale dataset for detecting, tracking, and counting fish in sonar videos. We identify sonar videos as a rich source of data for advancing low signal-to-noise computer vision applications and tackling domain generalization in multi-object tracking (MOT) and counting. In contrast to existing MOT and counting datasets, which are largely restricted to videos of people and vehicles in cities, CFC is sourced from a natural-world domain in which targets are not easily resolvable and appearance features cannot be easily leveraged for target re-identification. CFC allows researchers to train MOT and counting algorithms and evaluate their generalization performance at unseen test locations. We perform extensive baseline experiments and identify key challenges and opportunities for advancing the state of the art in generalization for MOT and counting.
In this paper, we propose a unified panoptic segmentation network (UPSNet) for tackling the newly proposed panoptic segmentation task. On top of a single backbone residual network, we first design a deformable convolution based semantic segmentation head and a Mask R-CNN style instance segmentation head which solve these two subtasks simultaneously. More importantly, we introduce a parameter-free panoptic head which solves the panoptic segmentation via pixel-wise classification. It first leverages the logits from the previous two heads and then innovatively expands the representation for enabling prediction of an extra unknown class which helps better resolve the conflicts between semantic and instance segmentation. Additionally, it handles the challenge caused by the varying number of instances and permits back propagation to the bottom modules in an end-to-end manner. Extensive experimental results on Cityscapes, COCO and our internal dataset demonstrate that our UPSNet achieves state-of-the-art performance with much faster inference. Code has been made available at: https://github.com/uber-research/UPSNet.
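The parameter-free panoptic head described above essentially stacks stuff logits and per-instance logits into one tensor and performs a per-pixel classification; the sketch below shows only that final step, assuming each instance channel already combines the instance mask logits with the corresponding thing-class semantic logits, and leaving out the extra unknown-class logit the paper adds.

```python
import numpy as np

def panoptic_labels(stuff_logits, instance_logits):
    """stuff_logits: (N_stuff, H, W) from the semantic head.
    instance_logits: (N_inst, H, W), one channel per detected instance.
    Returns a per-pixel label map where 0..N_stuff-1 are stuff classes and the
    remaining indices are instance ids, obtained by pixel-wise classification."""
    logits = np.concatenate([stuff_logits, instance_logits], axis=0)
    return np.argmax(logits, axis=0)

# Hypothetical tiny example: 2 stuff classes and 1 instance on a 2x2 image.
stuff = np.random.randn(2, 2, 2)
inst = np.random.randn(1, 2, 2)
print(panoptic_labels(stuff, inst))
```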