Australian Centre for Robotic Vision {guosheng.lin;anton.milan;chunhua.shen;
translated by 谷歌翻译
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
translated by 谷歌翻译
Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-regionbased context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixellevel prediction. The proposed approach achieves state-ofthe-art performance on various datasets. It came first in Im-ageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields the new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.
translated by 谷歌翻译
One of recent trends [31,32,14] in network architecture design is stacking small filters (e.g., 1x1 or 3x3) in the entire network because the stacked small filters is more efficient than a large kernel, given the same computational complexity. However, in the field of semantic segmentation, where we need to perform dense per-pixel prediction, we find that the large kernel (and effective receptive field) plays an important role when we have to perform the classification and localization tasks simultaneously. Following our design principle, we propose a Global Convolutional Network to address both the classification and localization issues for the semantic segmentation. We also suggest a residual-based boundary refinement to further refine the object boundaries. Our approach achieves state-of-art performance on two public benchmarks and significantly outperforms previous results, 82.2% (vs 80.2%) on PASCAL VOC 2012 dataset and 76.9% (vs 71.8%) on Cityscapes dataset.
translated by 谷歌翻译
Deep Convolutional Neural Networks (DCNNs) have recently shown state of the art performance in high level vision tasks, such as image classification and object detection. This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification (also called "semantic image segmentation"). We show that responses at the final layer of DCNNs are not sufficiently localized for accurate object segmentation. This is due to the very invariance properties that make DCNNs good for high level tasks. We overcome this poor localization property of deep networks by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF). Qualitatively, our "DeepLab" system is able to localize segment boundaries at a level of accuracy which is beyond previous methods. Quantitatively, our method sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 71.6% IOU accuracy in the test set. We show how these results can be obtained efficiently: Careful network re-purposing and a novel application of the 'hole' algorithm from the wavelet community allow dense computation of neural net responses at 8 frames per second on a modern GPU.
translated by 谷歌翻译
Image segmentation is a key topic in image processing and computer vision with applications such as scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among many others. Various algorithms for image segmentation have been developed in the literature. Recently, due to the success of deep learning models in a wide range of vision applications, there has been a substantial amount of works aimed at developing image segmentation approaches using deep learning models. In this survey, we provide a comprehensive review of the literature at the time of this writing, covering a broad spectrum of pioneering works for semantic and instance-level segmentation, including fully convolutional pixel-labeling networks, encoder-decoder architectures, multi-scale and pyramid based approaches, recurrent networks, visual attention models, and generative models in adversarial settings. We investigate the similarity, strengths and challenges of these deep learning models, examine the most widely used datasets, report performances, and discuss promising future research directions in this area.
translated by 谷歌翻译
语义分割是计算机视觉中的关键任务之一,它是为图像中的每个像素分配类别标签。尽管最近取得了重大进展,但大多数现有方法仍然遇到两个具有挑战性的问题:1)图像中的物体和东西的大小可能非常多样化,要求将多规模特征纳入完全卷积网络(FCN); 2)由于卷积网络的固有弱点,很难分类靠近物体/物体的边界的像素。为了解决第一个问题,我们提出了一个新的多受感受性现场模块(MRFM),明确考虑了多尺度功能。对于第二期,我们设计了一个边缘感知损失,可有效区分对象/物体的边界。通过这两种设计,我们的多种接收场网络在两个广泛使用的语义分割基准数据集上实现了新的最先进的结果。具体来说,我们在CityScapes数据集上实现了83.0的平均值,在Pascal VOC2012数据集中达到了88.4的平均值。
translated by 谷歌翻译
In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale feature fusion, we propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of dilated FCN, which model the semantic interdependencies in spatial and channel dimensions respectively. The position attention module selectively aggregates the feature at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. We sum the outputs of the two attention modules to further improve feature representation which contributes to more precise segmentation results. We achieve new state-of-theart segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff dataset. In particular, a Mean IoU score of 81.5% on Cityscapes test set is achieved without using coarse data. 1 .
translated by 谷歌翻译
Incorporating multi-scale features in fully convolutional neural networks (FCNs) has been a key element to achieving state-of-the-art performance on semantic image segmentation. One common way to extract multi-scale features is to feed multiple resized input images to a shared deep network and then merge the resulting features for pixelwise classification. In this work, we propose an attention mechanism that learns to softly weight the multi-scale features at each pixel location. We adapt a state-of-the-art semantic image segmentation model, which we jointly train with multi-scale input images and the attention model. The proposed attention model not only outperforms averageand max-pooling, but allows us to diagnostically visualize the importance of features at different positions and scales. Moreover, we show that adding extra supervision to the output at each scale is essential to achieving excellent performance when merging multi-scale features. We demonstrate the effectiveness of our model with extensive experiments on three challenging datasets, including PASCAL-Person-Part,
translated by 谷歌翻译
语义分割是图像的像素明智标记。由于在像素级别定义了问题,因此确定图像类标签是不可接受的,而是在原始图像像素分辨率下本地化它们是必要的。通过卷积神经网络(CNN)在创建语义,高级和分层图像特征方面的非凡能力推动;在过去十年中提出了几种基于深入的学习的2D语义分割方法。在本调查中,我们主要关注最近的语义细分科学发展,特别是在使用2D图像的基于深度学习的方法。我们开始分析了对2D语义分割的公共图像集和排行榜,概述了性能评估中使用的技术。在研究现场的演变时,我们按时间顺序分类为三个主要时期,即预先和早期的深度学习时代,完全卷积的时代和后FCN时代。我们在技术上分析了解决领域的基本问题的解决方案,例如细粒度的本地化和规模不变性。在借阅我们的结论之前,我们提出了一张来自所有提到的时代的方法表,每个方法都概述了他们对该领域的贡献。我们通过讨论现场当前的挑战以及他们已经解决的程度来结束调查。
translated by 谷歌翻译
跨不同层的特征的聚合信息是密集预测模型的基本操作。尽管表现力有限,但功能级联占主导地位聚合运营的选择。在本文中,我们引入了细分特征聚合(AFA),以融合不同的网络层,具有更具表现力的非线性操作。 AFA利用空间和渠道注意,以计算层激活的加权平均值。灵感来自神经体积渲染,我们将AFA扩展到规模空间渲染(SSR),以执行多尺度预测的后期融合。 AFA适用于各种现有网络设计。我们的实验表明了对挑战性的语义细分基准,包括城市景观,BDD100K和Mapillary Vistas的一致而显着的改进,可忽略不计的计算和参数开销。特别是,AFA改善了深层聚集(DLA)模型在城市景观上的近6%Miou的性能。我们的实验分析表明,AFA学会逐步改进分割地图并改善边界细节,导致新的最先进结果对BSDS500和NYUDV2上的边界检测基准。在http://vis.xyz/pub/dla-afa上提供代码和视频资源。
translated by 谷歌翻译
我们展示了一个下一代神经网络架构,马赛克,用于移动设备上的高效和准确的语义图像分割。MOSAIC是通过各种移动硬件平台使用常用的神经操作设计,以灵活地部署各种移动平台。利用简单的非对称编码器 - 解码器结构,该解码器结构由有效的多尺度上下文编码器和轻量级混合解码器组成,以从聚合信息中恢复空间细节,Mosaic在平衡准确度和计算成本的同时实现了新的最先进的性能。基于搜索的分类网络,马赛克部署在定制的特征提取骨架顶部,达到目前行业标准MLPerf型号和最先进的架构,达到5%的绝对精度增益。
translated by 谷歌翻译
Spatial pyramid pooling module or encode-decoder structure are used in deep neural networks for semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on PASCAL VOC 2012 and Cityscapes datasets, achieving the test set performance of 89.0% and 82.1% without any post-processing. Our paper is accompanied with a publicly available reference implementation of the proposed models in Tensorflow at https: //github.com/tensorflow/models/tree/master/research/deeplab.
translated by 谷歌翻译
语义分割是将类标签分配给图像中每个像素的问题,并且是自动车辆视觉堆栈的重要组成部分,可促进场景的理解和对象检测。但是,许多表现最高的语义分割模型非常复杂且笨拙,因此不适合在计算资源有限且低延迟操作的板载自动驾驶汽车平台上部署。在这项调查中,我们彻底研究了旨在通过更紧凑,更有效的模型来解决这种未对准的作品,该模型能够在低内存嵌入式系统上部署,同时满足实时推理的限制。我们讨论了该领域中最杰出的作品,根据其主要贡献将它们置于分类法中,最后我们评估了在一致的硬件和软件设置下,所讨论模型的推理速度,这些模型代表了具有高端的典型研究环境GPU和使用低内存嵌入式GPU硬件的现实部署方案。我们的实验结果表明,许多作品能够在资源受限的硬件上实时性能,同时说明延迟和准确性之间的一致权衡。
translated by 谷歌翻译
语义分割是自主车辆了解周围场景的关键技术。当代模型的吸引力表现通常以牺牲重计算和冗长的推理时间为代价,这对于自行车来说是无法忍受的。在低分辨率图像上使用轻量级架构(编码器 - 解码器或双路)或推理,最近的方法实现了非常快的场景解析,即使在单个1080TI GPU上以100多件FPS运行。然而,这些实时方法与基于扩张骨架的模型之间的性能仍有显着差距。为了解决这个问题,我们提出了一家专门为实时语义细分设计的高效底座。所提出的深层双分辨率网络(DDRNET)由两个深部分支组成,之间进行多个双边融合。此外,我们设计了一个名为Deep聚合金字塔池(DAPPM)的新上下文信息提取器,以基于低分辨率特征映射放大有效的接收字段和熔丝多尺度上下文。我们的方法在城市景观和Camvid数据集上的准确性和速度之间实现了新的最先进的权衡。特别是,在单一的2080Ti GPU上,DDRNET-23-Slim在Camvid测试组上的Citycapes试验组102 FPS上的102 FPS,74.7%Miou。通过广泛使用的测试增强,我们的方法优于最先进的模型,需要计算得多。 CODES和培训的型号在线提供。
translated by 谷歌翻译
现代的高性能语义分割方法采用沉重的主链和扩张的卷积来提取相关特征。尽管使用上下文和语义信息提取功能对于分割任务至关重要,但它为实时应用程序带来了内存足迹和高计算成本。本文提出了一种新模型,以实现实时道路场景语义细分的准确性/速度之间的权衡。具体来说,我们提出了一个名为“比例吸引的条带引导特征金字塔网络”(s \ textsuperscript {2} -fpn)的轻巧模型。我们的网络由三个主要模块组成:注意金字塔融合(APF)模块,比例吸引条带注意模块(SSAM)和全局特征Upsample(GFU)模块。 APF采用了注意力机制来学习判别性多尺度特征,并有助于缩小不同级别之间的语义差距。 APF使用量表感知的关注来用垂直剥离操作编码全局上下文,并建模长期依赖性,这有助于将像素与类似的语义标签相关联。此外,APF还采用频道重新加权块(CRB)来强调频道功能。最后,S \ TextSuperScript {2} -fpn的解码器然后采用GFU,该GFU用于融合APF和编码器的功能。已经对两个具有挑战性的语义分割基准进行了广泛的实验,这表明我们的方法通过不同的模型设置实现了更好的准确性/速度权衡。提出的模型已在CityScapes Dataset上实现了76.2 \%miou/87.3fps,77.4 \%miou/67fps和77.8 \%miou/30.5fps,以及69.6 \%miou,71.0 miou,71.0 \%miou,和74.2 \%\%\%\%\%\%。 miou在Camvid数据集上。这项工作的代码将在\ url {https://github.com/mohamedac29/s2-fpn提供。
translated by 谷歌翻译
Semantic segmentation works on the computer vision algorithm for assigning each pixel of an image into a class. The task of semantic segmentation should be performed with both accuracy and efficiency. Most of the existing deep FCNs yield to heavy computations and these networks are very power hungry, unsuitable for real-time applications on portable devices. This project analyzes current semantic segmentation models to explore the feasibility of applying these models for emergency response during catastrophic events. We compare the performance of real-time semantic segmentation models with non-real-time counterparts constrained by aerial images under oppositional settings. Furthermore, we train several models on the Flood-Net dataset, containing UAV images captured after Hurricane Harvey, and benchmark their execution on special classes such as flooded buildings vs. non-flooded buildings or flooded roads vs. non-flooded roads. In this project, we developed a real-time UNet based model and deployed that network on Jetson AGX Xavier module.
translated by 谷歌翻译
We focus on the challenging task of real-time semantic segmentation in this paper. It finds many practical applications and yet is with fundamental difficulty of reducing a large portion of computation for pixel-wise label inference. We propose an image cascade network (ICNet) that incorporates multi-resolution branches under proper label guidance to address this challenge. We provide in-depth analysis of our framework and introduce the cascade feature fusion unit to quickly achieve highquality segmentation. Our system yields real-time inference on a single GPU card with decent quality results evaluated on challenging datasets like Cityscapes, CamVid and COCO-Stuff.
translated by 谷歌翻译
Semantic segmentation is a challenging task that addresses most of the perception needs of Intelligent Vehicles (IV) in an unified way. Deep Neural Networks excel at this task, as they can be trained end-to-end to accurately classify multiple object categories in an image at pixel level. However, a good trade-off between high quality and computational resources is yet not present in state-of-the-art semantic segmentation approaches, limiting their application in real vehicles. In this paper, we propose a deep architecture that is able to run in real-time while providing accurate semantic segmentation. The core of our architecture is a novel layer that uses residual connections and factorized convolutions in order to remain efficient while retaining remarkable accuracy. Our approach is able to run at over 83 FPS in a single Titan X, and 7 FPS in a Jetson TX1 (embedded GPU). A comprehensive set of experiments on the publicly available Cityscapes dataset demonstrates that our system achieves an accuracy that is similar to the state of the art, while being orders of magnitude faster to compute than other architectures that achieve top precision. The resulting trade-off makes our model an ideal approach for scene understanding in IV applications. The code is publicly available at: https://github.com/Eromera/erfnet
translated by 谷歌翻译
Semantic image segmentation is a basic street scene understanding task in autonomous driving, where each pixel in a high resolution image is categorized into a set of semantic labels. Unlike other scenarios, objects in autonomous driving scene exhibit very large scale changes, which poses great challenges for high-level feature representation in a sense that multi-scale information must be correctly encoded. To remedy this problem, atrous convolution [14] was introduced to generate features with larger receptive fields without sacrificing spatial resolution. Built upon atrous convolution, Atrous Spatial Pyramid Pooling (ASPP) [2] was proposed to concatenate multiple atrous-convolved features using different dilation rates into a final feature representation. Although ASPP is able to generate multi-scale features, we argue the feature resolution in the scale-axis is not dense enough for the autonomous driving scenario. To this end, we propose Densely connected Atrous Spatial Pyramid Pooling (DenseASPP), which connects a set of atrous convolutional layers in a dense way, such that it generates multi-scale features that not only cover a larger scale range, but also cover that scale range densely, without significantly increasing the model size. We evaluate DenseASPP on the street scene benchmark Cityscapes [4] and achieve state-of-the-art performance.
translated by 谷歌翻译