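Semantic scene understanding is essential for mobile agents acting in various environments. Although semantic segmentation already provides a lot of information, details about individual objects and about the general scene are missing, yet they are required for many real-world applications. However, solving multiple tasks separately is expensive and cannot be accomplished in real time given the limited computing and battery capabilities of a mobile platform. In this paper, we propose an efficient multi-task approach for RGB-D scene analysis (EMSANet) that simultaneously performs semantic and instance segmentation (panoptic segmentation), instance orientation estimation, and scene classification. We show that all tasks can be accomplished using a single neural network in real time on a mobile platform without diminishing performance; on the contrary, the individual tasks are able to benefit from each other. To evaluate our multi-task approach, we extend the annotations of the common RGB-D indoor datasets NYUv2 and SUNRGB-D for instance segmentation and orientation estimation. To the best of our knowledge, we are the first to provide results in such a comprehensive multi-task setting for indoor scene analysis on NYUv2 and SUNRGB-D.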
Translated by Google Translate
In this work, we introduce Panoptic-DeepLab, a simple, strong, and fast system for panoptic segmentation, aiming to establish a solid baseline for bottom-up methods that can achieve performance comparable to two-stage methods while yielding fast inference speed. In particular, Panoptic-DeepLab adopts dual-ASPP and dual-decoder structures specific to semantic and instance segmentation, respectively. The semantic segmentation branch is the same as the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression. As a result, our single Panoptic-DeepLab simultaneously ranks first at all three Cityscapes benchmarks, setting the new state-of-the-art of 84.2% mIoU, 39.0% AP, and 65.5% PQ on the test set. Additionally, equipped with MobileNetV3, Panoptic-DeepLab runs nearly in real time with a single 1025 × 2049 image (15.8 frames per second), while achieving competitive performance on Cityscapes (54.1% PQ on the test set). On the Mapillary Vistas test set, our ensemble of six models attains 42.7% PQ, outperforming the challenge winner in 2018 by a healthy margin of 1.5%. Finally, our Panoptic-DeepLab also performs on par with several top-down approaches on the challenging COCO dataset. For the first time, we demonstrate that a bottom-up approach can deliver state-of-the-art results on panoptic segmentation.
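The class-agnostic instance branch based on center regression can be illustrated with a small sketch. It assumes that instance centers have already been detected and that each pixel carries a predicted offset pointing toward its center; the function name `group_pixels` and the list-based data layout are hypothetical and chosen only for illustration, not taken from the Panoptic-DeepLab code.

```python
def group_pixels(centers, offsets, thing_mask):
    """Assign every 'thing' pixel to the nearest predicted instance center.

    centers:    list of (row, col) center locations (assumed already detected)
    offsets:    offsets[r][c] = (d_row, d_col), predicted vector from pixel to its center
    thing_mask: thing_mask[r][c] is True where the semantic branch predicts a thing class
    Returns an instance-id map; 0 means "no instance", ids start at 1.
    """
    height, width = len(thing_mask), len(thing_mask[0])
    ids = [[0] * width for _ in range(height)]
    if not centers:
        return ids
    for r in range(height):
        for c in range(width):
            if not thing_mask[r][c]:
                continue
            d_row, d_col = offsets[r][c]
            voted_r, voted_c = r + d_row, c + d_col  # location the pixel votes for
            nearest = min(
                range(len(centers)),
                key=lambda k: (centers[k][0] - voted_r) ** 2 + (centers[k][1] - voted_c) ** 2,
            )
            ids[r][c] = nearest + 1
    return ids


# Toy 1 x 4 image with two centers; the last pixel is "stuff" and stays unassigned.
instance_ids = group_pixels(
    centers=[(0, 0), (0, 3)],
    offsets=[[(0, 0), (0, -1), (0, 1), (0, 0)]],
    thing_mask=[[True, True, True, False]],
)
```

Because the grouping itself is class-agnostic, the semantic label of each resulting instance would have to come from the semantic branch, for example by a majority vote over the instance's pixels.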
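Image segmentation for video analysis plays an essential role in different research fields such as smart cities, healthcare, computer vision, geoscience, and remote sensing applications. In this regard, significant effort has recently been devoted to developing novel segmentation strategies; one of the latest outstanding achievements is panoptic segmentation, which results from the fusion of semantic and instance segmentation. Specifically, panoptic segmentation is currently under study to help gain a more nuanced knowledge of image scenes for video surveillance, crowd counting, autonomous driving, and medical image analysis, and a deeper understanding of scenes in general. To that end, we present in this paper what is, to the best of the authors' knowledge, the first comprehensive review of existing panoptic segmentation methods. Accordingly, a well-defined taxonomy of existing panoptic techniques is established based on the nature of the adopted algorithms, the application scenarios, and the primary objectives. Moreover, the use of panoptic segmentation for annotating new datasets by pseudo-labeling is discussed. Ablation studies are then carried out to understand the panoptic methods from different perspectives. In addition, evaluation metrics suitable for panoptic segmentation are discussed, and a comparison of the performance of existing solutions is provided to present the state-of-the-art and identify its limitations and strengths. Lastly, the current challenges facing the technology and the future trends expected to attract considerable interest in the near future are elaborated, which can serve as a starting point for upcoming research. The papers provided with code are available at: https://github.com/elharroussomar/awesome-panoptic-egation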
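The most widely used panoptic metric, Panoptic Quality (PQ), can be sketched for a single class as follows. The sketch assumes that segments are given as sets of pixel indices and that segments within each list do not overlap, so a predicted and a ground-truth segment with IoU above 0.5 form a unique match; the function name `panoptic_quality` and the data layout are illustrative assumptions, not an official implementation.

```python
def panoptic_quality(pred_segments, gt_segments):
    """PQ for one class: sum of matched IoUs / (TP + 0.5 * FP + 0.5 * FN).

    Both arguments are lists of sets of pixel indices. Segments inside one
    list are assumed not to overlap, as panoptic segmentation requires.
    """
    matched_pred = set()
    iou_sum = 0.0
    true_positives = 0
    for gt in gt_segments:
        for j, pred in enumerate(pred_segments):
            if j in matched_pred:
                continue
            union = len(gt | pred)
            iou = len(gt & pred) / union if union else 0.0
            if iou > 0.5:  # a match above 0.5 IoU is unique for non-overlapping segments
                matched_pred.add(j)
                iou_sum += iou
                true_positives += 1
                break
    false_positives = len(pred_segments) - true_positives
    false_negatives = len(gt_segments) - true_positives
    denominator = true_positives + 0.5 * false_positives + 0.5 * false_negatives
    return iou_sum / denominator if denominator else 0.0


# One matched pair (IoU 0.75), one unmatched prediction, one unmatched ground truth.
pq = panoptic_quality(
    pred_segments=[{0, 1, 2}, {20, 21}],
    gt_segments=[{0, 1, 2, 3}, {10, 11}],
)
```

The overall PQ reported on benchmarks is usually the average of this per-class value over all classes.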
The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, top-performing method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.
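We introduce MGNet, a multi-task framework for monocular geometric scene understanding. We define monocular geometric scene understanding as the combination of two known tasks: panoptic segmentation and self-supervised monocular depth estimation. Panoptic segmentation captures the full scene not only semantically, but also on an instance basis. Self-supervised monocular depth estimation uses geometric constraints derived from the camera measurement model to measure depth from monocular video sequences only. To the best of our knowledge, we are the first to propose the combination of these two tasks in one single model. Our model is designed with a focus on low latency to provide fast inference in real time on a single consumer-grade GPU. During deployment, our model produces dense 3D point clouds with instance-aware semantic labels from single high-resolution camera images. We evaluate our model on two popular autonomous driving benchmarks, Cityscapes and KITTI, and show competitive performance among other real-time capable methods. Source code is available at https://github.com/markusschoen/mgnet.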
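How a labeled 3D point cloud can be obtained from a depth map and a panoptic label map may be sketched with the standard pinhole back-projection. This is a minimal illustration under the assumption of an ideal pinhole camera with known intrinsics; the function name `backproject` and the parameters `fx`, `fy`, `cx`, `cy` are hypothetical and not taken from the MGNet code.

```python
def backproject(depth, labels, fx, fy, cx, cy):
    """Turn a depth map plus a per-pixel label map into labeled 3D points.

    depth[v][u]  : metric depth of the pixel in row v, column u
    labels[v][u] : panoptic label of the same pixel
    fx, fy       : focal lengths in pixels (assumed pinhole intrinsics)
    cx, cy       : principal point in pixels
    Returns a list of (x, y, z, label) tuples in camera coordinates.
    """
    points = []
    for v, (depth_row, label_row) in enumerate(zip(depth, labels)):
        for u, (z, label) in enumerate(zip(depth_row, label_row)):
            if z <= 0:
                continue  # skip pixels without a valid depth estimate
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, label))
    return points


# Two pixels at depth 2.0 with different labels and unit focal length.
cloud = backproject([[2.0, 2.0]], [[1, 2]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```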
In this paper, we propose a unified panoptic segmentation network (UPSNet) for tackling the newly proposed panoptic segmentation task. On top of a single backbone residual network, we first design a deformable convolution based semantic segmentation head and a Mask R-CNN style instance segmentation head which solve these two subtasks simultaneously. More importantly, we introduce a parameter-free panoptic head which solves the panoptic segmentation via pixel-wise classification. It first leverages the logits from the previous two heads and then innovatively expands the representation to enable prediction of an extra unknown class, which helps better resolve the conflicts between semantic and instance segmentation. Additionally, it handles the challenge caused by the varying number of instances and permits backpropagation to the bottom modules in an end-to-end manner. Extensive experimental results on Cityscapes, COCO and our internal dataset demonstrate that our UPSNet achieves state-of-the-art performance with much faster inference. Code has been made available at: https://github.com/uber-research/UPSNet. * Equal contribution. † This work was done when Hengshuang Zhao was an intern at Uber ATG.
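Panoptic segmentation of point clouds is a crucial task that enables autonomous vehicles to comprehend their vicinity using their highly accurate and reliable LiDAR sensors. Existing top-down approaches tackle this problem either by combining independent task-specific networks or by translating methods from the image domain, ignoring the intricacies of LiDAR data and thus often resulting in sub-optimal performance. In this paper, we present the novel top-down Efficient LiDAR Panoptic Segmentation (EfficientLPS) architecture, which addresses multiple challenges in segmenting LiDAR point clouds, including distance-dependent sparsity, severe occlusions, large scale variations, and re-projection errors. EfficientLPS comprises a novel shared backbone that encodes with strengthened geometric transformation modeling capacity and aggregates semantically rich range-aware multi-scale features. It incorporates new scale-invariant semantic and instance segmentation heads along with a panoptic fusion module that is supervised by our proposed panoptic periphery loss function. Additionally, we formulate a regularized pseudo-labeling framework to further improve the performance of EfficientLPS by training on unlabeled data. We benchmark our proposed model on two large-scale LiDAR datasets: nuScenes, for which we also provide ground truth annotations, and SemanticKITTI. Notably, EfficientLPS sets the new state-of-the-art on both datasets.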
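In this work, we introduce panoramic panoptic segmentation as the most holistic form of scene understanding, both in terms of field of view (FoV) and image-level understanding, for standard camera-based input. A complete understanding of the surroundings provides a maximum of information to a mobile agent, which is essential for any intelligent vehicle to make informed decisions in a safety-critical dynamic environment such as real-world traffic. To overcome the lack of annotated panoramic images, we propose a framework that allows model training on standard pinhole images and transfers the learned features to a different domain in a cost-minimizing way. Using our proposed method with dense contrastive learning, we achieve significant improvements over a non-adapted approach. Depending on the efficient panoptic segmentation architecture, we improve Panoptic Quality (PQ) by 3.5-6.5% over non-adapted models on our established Wild Panoramic Panoptic Segmentation (WildPPS) dataset. Furthermore, our efficient framework does not need access to images of the target domain, making it a feasible domain generalization approach suitable for limited hardware settings. As additional contributions, we publish WildPPS, the first panoramic panoptic image dataset, to foster progress in surrounding perception, and we explore a novel training procedure combining supervised and contrastive training.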
Understanding 3D environments semantically is pivotal in autonomous driving applications where multiple computer vision tasks are involved. Multi-task models provide different types of outputs for a given scene, yielding a more holistic representation while keeping the computational cost low. We propose a multi-task model for panoptic segmentation and depth completion using RGB images and sparse depth maps. Our model successfully predicts fully dense depth maps and performs semantic segmentation, instance segmentation, and panoptic segmentation for every input frame. Extensive experiments were done on the Virtual KITTI 2 dataset and we demonstrate that our model solves multiple tasks, without a significant increase in computational cost, while keeping high accuracy performance. Code is available at https://github.com/juanb09111/PanDepth.git
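Panoptic scene understanding and tracking of dynamic agents are essential for robots and automated vehicles to navigate in urban environments. As LiDARs provide accurate, illumination-independent geometric depictions of the scene, performing these tasks using LiDAR point clouds provides reliable predictions. However, existing datasets lack diversity in the type of urban scenes and have a limited number of dynamic object instances, which hinders both the learning of these tasks and credible benchmarking of the developed methods. In this paper, we introduce the large-scale Panoptic nuScenes benchmark dataset, which extends our popular nuScenes dataset with point-wise ground-truth annotations for semantic segmentation, panoptic segmentation, and panoptic tracking tasks. To facilitate comparison, we provide several strong baselines for each of these tasks on our proposed dataset. Moreover, we analyze the drawbacks of the existing metrics for panoptic tracking and propose the novel instance-centric PAT metric that addresses these concerns. We present exhaustive experiments that demonstrate the utility of Panoptic nuScenes compared to existing datasets and make the online evaluation server available at nuScenes.org. We believe that this extension will accelerate research on novel methods for scene understanding of dynamic urban environments.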
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works use separate approaches to handle thing, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework named Panoptic-PartFormer. Moreover, we find that the previous metric PartPQ is biased toward PQ. To handle both issues, we make the following contributions: Firstly, we design a meta-architecture that decouples part features from things/stuff features. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model Panoptic-PartFormer. Secondly, we propose a new metric, Part-Whole Quality (PWQ), to better measure this task from both pixel-region and part-whole perspectives. It can also decouple the errors of part segmentation and panoptic segmentation. Thirdly, inspired by Mask2Former and based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross-attention scheme, which uses masked cross attention for part-whole interaction, to further boost part segmentation quality. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost drop of 70% on GFlops and 50% on parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
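Understanding 3D scenes from a single image is fundamental to a wide variety of tasks, such as robotics, motion planning, and augmented reality. Existing work on 3D perception from a single RGB image tends to focus on geometric reconstruction only, or on geometric reconstruction combined with semantic segmentation or instance segmentation. Inspired by 2D panoptic segmentation, we propose to unify the tasks of geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation into the task of panoptic 3D scene reconstruction: from a single RGB image, predict the complete geometric reconstruction of the scene within the camera frustum of the image, along with semantic and instance segmentations. We thus propose a new approach for holistic 3D scene understanding from a single RGB image, which learns to lift and propagate 2D features from an input image to a 3D volumetric scene representation. We demonstrate that this holistic view of joint scene reconstruction, semantic segmentation, and instance segmentation is beneficial compared with treating the tasks independently, and thus outperforms alternative approaches.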
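The task of assigning semantic classes and track identities to every pixel in a video is called video panoptic segmentation. Our work is the first that targets this task in a real-world setting requiring dense interpretation in both the spatial and temporal domains. As the ground truth for this task is difficult to obtain, existing datasets are either constructed synthetically or only sparsely annotated within short video clips. To overcome this, we introduce a new benchmark encompassing two datasets, KITTI-STEP and MOTChallenge-STEP. The datasets contain long video sequences, providing challenging examples and a test bed for studying long-term pixel-precise segmentation and tracking under real-world conditions. We further propose a novel evaluation metric, Segmentation and Tracking Quality (STQ), that fairly balances the semantic and tracking aspects of this task and is more appropriate for evaluating sequences of arbitrary length. Finally, we provide several baselines to evaluate the status of existing methods on this new challenging dataset. We have made our datasets, metric, benchmark servers, and baselines publicly available, and hope this will inspire future research.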
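One way to see how STQ balances the two aspects: in the STEP benchmark definition, as far as we are aware, STQ is the geometric mean of an association quality term (AQ, tracking) and a segmentation quality term (SQ, semantics). The abstract above does not spell this out, so the sketch below treats that combination rule as an assumption; the function name `stq` is hypothetical, and the computation of AQ and SQ themselves is not reproduced here.

```python
import math


def stq(association_quality, segmentation_quality):
    """Combine tracking and semantic quality via a geometric mean.

    Both inputs are assumed to lie in [0, 1]. Computing AQ and SQ from
    predictions is outside the scope of this sketch.
    """
    for value in (association_quality, segmentation_quality):
        if not 0.0 <= value <= 1.0:
            raise ValueError("quality terms must lie in [0, 1]")
    return math.sqrt(association_quality * segmentation_quality)


# A method that tracks well but segments poorly is pulled down by the weak term.
balanced_score = stq(0.64, 0.25)
```

A geometric mean drops to zero when either term is zero, so neither tracking nor semantics can be neglected without hurting the overall score.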
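Bird's-Eye-View (BEV) maps have emerged as one of the most powerful representations for scene understanding due to their ability to provide rich spatial context while being easy to interpret and process. Such maps have found use in many real-world tasks that rely extensively on accurate scene segmentation as well as object instance identification in the BEV space. However, existing segmentation algorithms only predict the semantics in the BEV space, which limits their use in applications where the notion of object instances is also critical. In this work, we present the first BEV panoptic segmentation approach for directly predicting dense panoptic segmentation maps in the BEV, given a single monocular image in the frontal view (FV). Our architecture follows the top-down paradigm and incorporates a novel dense transformer module consisting of two distinct transformers that learn to independently map vertical and flat regions in the input image from the FV to the BEV. Additionally, we derive a mathematical formulation for the sensitivity of the FV-BEV transformation, which allows us to intelligently weight pixels in the BEV space to account for the varying descriptiveness across the FV image. Extensive evaluations on the KITTI-360 and nuScenes datasets demonstrate that our approach exceeds the state-of-the-art in the PQ metric by 3.61 pp and 4.93 pp, respectively.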
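In this paper, we focus on exploring effective methods for faster, accurate, and domain-agnostic semantic segmentation. Inspired by optical flow for motion alignment between adjacent video frames, we propose a Flow Alignment Module (FAM) to learn Semantic Flow between feature maps of adjacent levels and to broadcast high-level features to high-resolution features effectively and efficiently. Furthermore, integrating our FAM into a common feature pyramid structure yields superior performance over other real-time methods, even on lightweight backbone networks such as ResNet-18 and DFNet. Then, to further speed up inference, we also present a novel Gated Dual Flow Alignment Module to directly align high-resolution and low-resolution feature maps; we term the improved network SFNet-Lite. Extensive experiments are conducted on several challenging datasets, and the results show the effectiveness of both SFNet and SFNet-Lite. In particular, the proposed SFNet-Lite series achieves 80.1 mIoU while running at 60 FPS using a ResNet-18 backbone, and 78.8 mIoU while running at 120 FPS using an STDC backbone, on an RTX-3090. Moreover, we unify four challenging driving datasets (Cityscapes, Mapillary, IDD, and BDD) into one large dataset, which we name the Unified Driving Segmentation (UDS) dataset. It contains diverse domain and style information. We benchmark several representative works on UDS. Both SFNet and SFNet-Lite still achieve the best speed and accuracy trade-off on UDS, serving as a strong baseline in this new and challenging setting. All code and models are publicly available at https://github.com/lxtgh/sfsegnets.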
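Unprecedented access to multi-temporal satellite imagery has opened new perspectives for a variety of Earth observation tasks. Among them, pixel-precise panoptic segmentation of agricultural parcels has major economic and environmental implications. While researchers have explored this problem for single images, we argue that the complex temporal patterns of crop phenology are better addressed with temporal sequences of images. In this paper, we present the first end-to-end, single-stage method for panoptic segmentation of Satellite Image Time Series (SITS). This module can be combined with our novel image sequence encoding network, which relies on temporal self-attention to extract rich and adaptive multi-scale spatio-temporal features. We also introduce PASTIS, the first open-access SITS dataset with panoptic annotations. We demonstrate the superiority of our encoder for semantic segmentation against multiple competing architectures, and establish the first state-of-the-art for panoptic segmentation of SITS. Our implementation and PASTIS are publicly available.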
Image segmentation is a key topic in image processing and computer vision with applications such as scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among many others. Various algorithms for image segmentation have been developed in the literature. Recently, due to the success of deep learning models in a wide range of vision applications, there has been a substantial amount of work aimed at developing image segmentation approaches using deep learning models. In this survey, we provide a comprehensive review of the literature at the time of this writing, covering a broad spectrum of pioneering works for semantic and instance-level segmentation, including fully convolutional pixel-labeling networks, encoder-decoder architectures, multi-scale and pyramid-based approaches, recurrent networks, visual attention models, and generative models in adversarial settings. We investigate the similarities, strengths, and challenges of these deep learning models, examine the most widely used datasets, report performances, and discuss promising future research directions in this area.
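A fast and accurate panoptic segmentation system for LiDAR point clouds is crucial for autonomous driving vehicles to understand the surrounding objects and scenes. Existing approaches usually rely on proposals or clustering to segment foreground instances. As a result, they struggle to achieve real-time performance. In this paper, we propose a novel real-time end-to-end panoptic segmentation network for LiDAR point clouds, called CPSeg. In particular, CPSeg comprises a shared encoder, a dual decoder, a task-aware attention module (TAM), and a cluster-free instance segmentation head. The TAM is designed to force the two decoders to learn rich task-aware features for semantic and instance embedding. Moreover, CPSeg incorporates a new cluster-free instance segmentation head that dynamically pillarizes foreground points according to the learned embedding. It then acquires instance labels by finding connected pillars with a pairwise embedding comparison. Thus, conventional proposal-based or clustering-based instance segmentation is transformed into a binary segmentation problem on the pairwise embedding comparison matrix. To help the network regress the instance embedding, a fast and deterministic depth completion algorithm is proposed to calculate the surface normals of each point cloud in real time. The proposed method is benchmarked on two large-scale autonomous driving datasets, SemanticKITTI and nuScenes. Notably, extensive experimental results show that CPSeg achieves state-of-the-art results among real-time approaches on both datasets.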
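Reliable scene understanding is indispensable for modern autonomous systems. Current learning-based methods typically try to maximize their performance based on segmentation metrics that only consider the quality of the segmentation. However, for the safe operation of a system in the real world, it is crucial to consider the uncertainty in the prediction as well. In this work, we introduce the novel task of uncertainty-aware panoptic segmentation, which aims to predict per-pixel semantic and instance segmentations together with per-pixel uncertainty estimates. We define two novel metrics to facilitate its quantitative analysis, the uncertainty-aware Panoptic Quality (uPQ) and the panoptic Expected Calibration Error (pECE). We further propose the novel top-down Evidential Panoptic Segmentation Network (EvPSNet) to solve this task. Our architecture employs a simple yet effective probabilistic fusion module that leverages the predicted uncertainties. Additionally, we propose a new Lovász evidential loss function to optimize the IoU of the segmentation using the probabilities provided by deep evidential learning. Furthermore, we provide several strong baselines combining state-of-the-art panoptic segmentation networks with sampling-free uncertainty estimation techniques. Extensive evaluations show that our EvPSNet achieves the new state-of-the-art for the standard Panoptic Quality (PQ), as well as for our uncertainty-aware panoptic metrics.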
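As background for the calibration metric, the standard (non-panoptic) Expected Calibration Error can be sketched as follows. Note that this is the generic binned ECE, not the pECE variant defined in the paper; the function name `expected_calibration_error`, the equal-width binning, and the default of 10 bins are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Standard binned ECE: weighted average of |accuracy - confidence| per bin.

    confidences: predicted probabilities in [0, 1], one per pixel or sample
    correct:     booleans, True where the prediction matches the ground truth
    """
    total = len(confidences)
    if total == 0:
        return 0.0
    confidence_sum = [0.0] * num_bins
    hit_count = [0] * num_bins
    sample_count = [0] * num_bins
    for confidence, is_correct in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 falls into the last bin.
        bin_index = min(int(confidence * num_bins), num_bins - 1)
        confidence_sum[bin_index] += confidence
        hit_count[bin_index] += 1 if is_correct else 0
        sample_count[bin_index] += 1
    ece = 0.0
    for bin_index in range(num_bins):
        if sample_count[bin_index] == 0:
            continue
        accuracy = hit_count[bin_index] / sample_count[bin_index]
        mean_confidence = confidence_sum[bin_index] / sample_count[bin_index]
        ece += sample_count[bin_index] / total * abs(accuracy - mean_confidence)
    return ece


# Two bins, each off by 0.1: under-confident at 0.9, over-confident at 0.6.
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
```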
We present MaX-DeepLab, the first end-to-end model for panoptic segmentation. Our approach simplifies the current pipeline that depends heavily on surrogate sub-tasks and hand-designed components, such as box detection, non-maximum suppression, thing-stuff merging, etc. Although these sub-tasks are tackled by area experts, they fail to comprehensively solve the target task. By contrast, our MaX-DeepLab directly predicts class-labeled masks with a mask transformer, and is trained with a panoptic quality inspired loss via bipartite matching. Our mask transformer employs a dual-path architecture that introduces a global memory path in addition to a CNN path, allowing direct communication with any CNN layers. As a result, MaX-DeepLab shows a significant 7.1% PQ gain in the box-free regime on the challenging COCO dataset, closing the gap between box-based and box-free methods for the first time. A small variant of MaX-DeepLab improves 3.0% PQ over DETR with similar parameters and M-Adds. Furthermore, MaX-DeepLab, without test time augmentation, achieves new state-of-the-art 51.3% PQ on COCO test-dev set.
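The bipartite matching step used during training can be illustrated with a toy sketch that pairs predicted masks with ground-truth masks so that the total similarity is maximal. For readability it enumerates all assignments by brute force, which is only feasible for a handful of masks; practical implementations typically use the Hungarian algorithm instead. The function name `best_bipartite_match` and the plain similarity matrix are assumptions for illustration and do not reproduce the PQ-style similarity used in the paper.

```python
from itertools import permutations


def best_bipartite_match(similarity):
    """Find the one-to-one assignment of predictions to ground truths with maximal total similarity.

    similarity[p][g] is the similarity between prediction p and ground truth g.
    Assumes at least as many predictions as ground truths; surplus predictions stay unmatched.
    Returns (list of (prediction, ground_truth) pairs, total similarity).
    """
    num_predictions = len(similarity)
    num_ground_truths = len(similarity[0]) if similarity else 0
    if num_ground_truths == 0 or num_predictions < num_ground_truths:
        raise ValueError("need a non-empty matrix with predictions >= ground truths")
    best_total = float("-inf")
    best_pairs = []
    # Brute force: try every ordered choice of predictions for the ground truths.
    for chosen in permutations(range(num_predictions), num_ground_truths):
        total = sum(similarity[p][g] for g, p in enumerate(chosen))
        if total > best_total:
            best_total = total
            best_pairs = [(p, g) for g, p in enumerate(chosen)]
    return best_pairs, best_total


# Three predicted masks, two ground-truth masks; prediction 2 remains unmatched.
pairs, total = best_bipartite_match([[0.9, 0.1], [0.2, 0.8], [0.3, 0.3]])
```

Matched pairs would then receive the mask and class losses, while unmatched predictions are usually trained to predict an empty or "no object" output.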