Spatiotemporal predictive learning predicts future frames from historical prior knowledge. Previous work improves performance by making networks wider and deeper, but this brings a huge memory overhead, which seriously hinders the development and application of the technique. Scale is another dimension for improving model performance in common computer vision tasks; it can reduce computational requirements and provide a better sense of context. Such an important dimension has not been considered or explored by recent RNN models. In this paper, learning from the benefits of multi-scale processing, we propose a general framework named Multi-Scale RNN (MS-RNN) to boost recent RNN models. We verify the MS-RNN framework through exhaustive experiments with 6 popular RNN models (ConvLSTM, TrajGRU, PredRNN, PredRNN++, MIM, and MotionRNN) on 4 different datasets. The results show that RNN models incorporating our framework have much lower memory cost but better performance than before. Our code is released at \url{https://github.com/mazhf/ms-rnn}.
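The abstract above describes the framework only at a high level. As a minimal sketch of the general multi-scale idea (not the authors' MS-RNN implementation), the PyTorch code below stacks simple convolutional recurrent cells so that deeper layers run on downsampled feature maps, trading resolution for context and memory; every module and parameter name here is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleConvRNNCell(nn.Module):
    """Vanilla convolutional RNN cell: h = tanh(conv([x, h]))."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)

    def forward(self, x, h):
        return torch.tanh(self.conv(torch.cat([x, h], dim=1)))

class MultiScaleConvRNN(nn.Module):
    """Layer l processes features at 1/2**l of the input resolution (illustrative)."""
    def __init__(self, in_ch=1, hid_ch=32, num_layers=3):
        super().__init__()
        self.cells = nn.ModuleList(
            [SimpleConvRNNCell(in_ch if l == 0 else hid_ch, hid_ch)
             for l in range(num_layers)]
        )

    def forward(self, frames):                      # frames: (B, T, C, H, W)
        B, T, C, H, W = frames.shape
        hiddens = [None] * len(self.cells)
        outputs = []
        for t in range(T):
            x = frames[:, t]
            for l, cell in enumerate(self.cells):
                if l > 0:                           # go down one scale per layer
                    x = F.avg_pool2d(x, 2)
                if hiddens[l] is None:
                    hiddens[l] = x.new_zeros(B, cell.conv.out_channels,
                                             x.shape[-2], x.shape[-1])
                hiddens[l] = cell(x, hiddens[l])
                x = hiddens[l]
            # bring the coarsest hidden state back to full resolution
            outputs.append(F.interpolate(x, size=(H, W), mode="bilinear",
                                         align_corners=False))
        return torch.stack(outputs, dim=1)          # (B, T, hid_ch, H, W)
```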
The goal of precipitation nowcasting is to predict the future rainfall intensity in a local region over a relatively short period of time. Very few previous studies have examined this crucial and challenging weather forecasting problem from the machine learning perspective. In this paper, we formulate precipitation nowcasting as a spatiotemporal sequence forecasting problem in which both the input and the prediction target are spatiotemporal sequences. By extending the fully connected LSTM (FC-LSTM) to have convolutional structures in both the input-to-state and state-to-state transitions, we propose the convolutional LSTM (ConvLSTM) and use it to build an end-to-end trainable model for the precipitation nowcasting problem. Experiments show that our ConvLSTM network captures spatiotemporal correlations better and consistently outperforms FC-LSTM and the state-of-the-art operational ROVER algorithm for precipitation nowcasting.
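As a compact illustration of the idea described above (replacing the fully connected input-to-state and state-to-state transitions of an LSTM with convolutions), here is a minimal ConvLSTM cell in PyTorch. It is a sketch of the general mechanism rather than the authors' exact implementation; for instance, peephole connections are omitted and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # one convolution produces all four gates at once
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size, padding=pad)
        self.hid_ch = hid_ch

    def forward(self, x, state):
        h, c = state                                   # hidden and cell state, (B, hid_ch, H, W)
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g                              # convolutional state-to-state update
        h = o * torch.tanh(c)
        return h, (h, c)

# usage: roll the cell over a (B, T, C, H, W) clip
B, T, C, H, W = 2, 10, 1, 64, 64
cell = ConvLSTMCell(in_ch=C, hid_ch=32)
x = torch.randn(B, T, C, H, W)
h = torch.zeros(B, 32, H, W)
c = torch.zeros(B, 32, H, W)
for t in range(T):
    h, (h, c) = cell(x[:, t], (h, c))
```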
We are introducing a multi-scale predictive model for video prediction here, whose design is inspired by the "Predictive Coding" theories and the "Coarse to Fine" approach. As a predictive coding model, it is updated by a combination of bottom-up and top-down information flows, which differs from the traditional bottom-up training style. Its advantage is to reduce the dependence on input information and improve its ability to predict and generate images. Importantly, we achieve this with a multi-scale approach: higher-level neurons generate coarser predictions (lower resolution), while lower-level neurons generate finer predictions (higher resolution). This differs from the traditional predictive coding framework, in which higher levels predict the activity of neurons in lower levels. To improve the predictive ability, we integrate an encoder-decoder network into the LSTM architecture and share the final encoded high-level semantic information between different levels. Additionally, since the output of each network level is an RGB image, a smaller LSTM hidden state can be used to retain and update only the necessary hidden information, avoiding mapping it to an overly discrete and complex space. In this way, we can reduce the difficulty of prediction and the computational overhead. Finally, we further explore training strategies to address the instability of adversarial training and the mismatch between training and testing in long-term prediction. Code is available at https://github.com/Ling-CF/MSPN.
From CNN and RNN to ViT, we have witnessed remarkable advances in video prediction that incorporate auxiliary inputs, elaborate neural architectures, and sophisticated training strategies. We admire this progress but are puzzled about the necessity: is there a simple method that performs comparably well? This paper proposes SimVP, a simple video prediction model that is built entirely on CNNs and trained with an MSE loss in an end-to-end fashion. Without introducing any additional tricks or complicated strategies, we achieve state-of-the-art performance on five benchmark datasets. Through extended experiments, we demonstrate that SimVP has strong generalization and scalability on real-world datasets. The significant reduction in training cost makes it easier to scale to complex scenarios. We believe SimVP can serve as a solid baseline to stimulate the further development of video prediction. The code is available at \href{https://github.com/gaozhangyang/simvp-simpler-yet-better-video-prediction}{GitHub}.
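To make the "built entirely on CNNs and trained with an MSE loss" recipe concrete, below is a heavily simplified encoder-translator-decoder sketch in the spirit of SimVP together with one training step. Layer counts, channel widths, and the frame-stacking trick are our own illustrative assumptions, not the released SimVP code.

```python
import torch
import torch.nn as nn

class TinySimVPLike(nn.Module):
    """Stacks T input frames along channels and maps them to T future frames with pure convs."""
    def __init__(self, t_in=10, t_out=10, c=1, hid=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(t_in * c, hid, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(hid, hid, 3, stride=2, padding=1), nn.GELU(),
        )
        self.translator = nn.Sequential(                # temporal mixing via channel mixing
            nn.Conv2d(hid, hid, 3, padding=1), nn.GELU(),
            nn.Conv2d(hid, hid, 3, padding=1), nn.GELU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hid, hid, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(hid, t_out * c, 4, stride=2, padding=1),
        )
        self.t_out, self.c = t_out, c

    def forward(self, x):                               # x: (B, T_in, C, H, W)
        B, T, C, H, W = x.shape
        y = self.decoder(self.translator(self.encoder(x.reshape(B, T * C, H, W))))
        return y.reshape(B, self.t_out, self.c, H, W)

model = TinySimVPLike()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
past = torch.rand(4, 10, 1, 64, 64)
future = torch.rand(4, 10, 1, 64, 64)
loss = nn.functional.mse_loss(model(past), future)      # plain MSE, end to end
opt.zero_grad(); loss.backward(); opt.step()
```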
Although recurrent neural network (RNN) based video prediction methods have achieved significant success, their performance on high-resolution datasets is still far from satisfactory because of the information loss problem and the perception-insensitive mean squared error (MSE) based loss functions. In this paper, we propose a Spatiotemporal Information-Preserving and Perception-Augmented model (STIP) to solve these two problems. To address the information loss problem, the proposed model aims to preserve the spatiotemporal information of the video during feature extraction and state transitions, respectively. First, a Multi-Grained Spatiotemporal Auto-Encoder (MGST-AE) is designed based on the X-Net structure. The proposed MGST-AE helps the decoders recall multi-grained information from the encoders in both the temporal and spatial domains, so that more spatiotemporal information can be preserved during feature extraction for high-resolution videos. Second, a Spatiotemporal Gated Recurrent Unit (STGRU) is designed based on the standard Gated Recurrent Unit (GRU) structure, which can efficiently preserve spatiotemporal information during state transitions. Compared with the popular Long Short-Term Memory (LSTM) based predictive memories, the proposed STGRU achieves more satisfactory performance with a much lower computational load. Furthermore, to improve the traditional MSE loss function, a Learned Perceptual loss (LP-loss) is designed based on Generative Adversarial Networks (GANs), which helps obtain a satisfactory trade-off between objective quality and perceptual quality. Experimental results show that the proposed STIP can predict videos with more satisfactory visual quality than a variety of state-of-the-art methods. The source code is available at \url{https://github.com/zhengchang467/stiphr}.
The mainstream of the existing approaches for video prediction builds up their models based on a Single-In-Single-Out (SISO) architecture, which takes the current frame as input to predict the next frame in a recursive manner. This often leads to severe performance degradation when extrapolating a longer period of the future, thus limiting the practical use of the prediction model. Alternatively, a Multi-In-Multi-Out (MIMO) architecture that outputs all the future frames in one shot naturally breaks the recursive manner and therefore prevents error accumulation. However, only a few MIMO models for video prediction have been proposed, and they achieve only inferior performance to date. The real strength of the MIMO model in this area is not well noticed and is largely under-explored. Motivated by that, we conduct a comprehensive investigation in this paper to thoroughly exploit how far a simple MIMO architecture can go. Surprisingly, our empirical studies reveal that a simple MIMO model can outperform the state-of-the-art work by a large margin, much more than expected, especially in dealing with long-term error accumulation. After exploring a number of ways and designs, we propose a new MIMO architecture based on extending the pure Transformer with local spatio-temporal blocks and a new multi-output decoder, namely MIMO-VP, to establish a new standard in video prediction. We evaluate our model on four highly competitive benchmarks (Moving MNIST, Human3.6M, Weather, KITTI). Extensive experiments show that our model wins 1st place on all the benchmarks with remarkable performance gains and surpasses the best SISO model in all aspects including efficiency, quantity, and quality. We believe our model can serve as a new baseline to facilitate future research on video prediction tasks. The code will be released.
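The SISO-versus-MIMO distinction above can be summarized at the interface level: a recursive model feeds its own outputs back in, while a MIMO model emits the whole horizon at once. The sketch below contrasts the two rollout styles with placeholder convolutions; it is purely illustrative and is not the MIMO-VP architecture.

```python
import torch
import torch.nn as nn

# placeholder "models": simple convolutions, just to show the two interfaces
siso_step = nn.Conv2d(1, 1, 3, padding=1)          # predicts ONE next frame from one frame
mimo_head = nn.Conv2d(10, 10, 3, padding=1)        # predicts ALL 10 future frames at once

past = torch.rand(2, 10, 64, 64)                   # (B, T_in, H, W), channel dim folded into T

# SISO: recursive rollout; each prediction is fed back, so errors accumulate
frame = past[:, -1:]                               # last observed frame, (B, 1, H, W)
siso_preds = []
for _ in range(10):
    frame = siso_step(frame)                       # next frame conditioned on previous output
    siso_preds.append(frame)
siso_preds = torch.cat(siso_preds, dim=1)          # (B, 10, H, W)

# MIMO: one-shot rollout; the whole future is decoded from the observed clip only
mimo_preds = mimo_head(past)                       # (B, 10, H, W), no feedback of predictions

print(siso_preds.shape, mimo_preds.shape)
```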
Spatiotemporal predictive learning aims to generate future frames by learning from historical frames. In this paper, we investigate existing methods and present a general framework for spatiotemporal predictive learning, in which a spatial encoder and decoder capture intra-frame features and a middle temporal module captures inter-frame correlations. While mainstream methods employ recurrent units to capture long-term temporal dependencies, they suffer from low computational efficiency due to their unparallelizable architectures. To parallelize the temporal module, we propose the Temporal Attention Unit (TAU), which decomposes temporal attention into intra-frame statical attention and inter-frame dynamical attention. Moreover, while the mean squared error loss focuses on intra-frame errors, we introduce a novel differential divergence regularization to take inter-frame variations into account. Extensive experiments demonstrate that the proposed method enables the derived model to achieve competitive performance on various spatiotemporal prediction benchmarks.
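The decomposition mentioned above can be illustrated with a small module that combines an intra-frame "static" branch (a large-kernel depthwise convolution over spatial positions) with an inter-frame "dynamic" branch (channel-wise gating from pooled statistics), plus a toy frame-difference regularizer. This is a sketch of the general idea under our own assumptions, not the published TAU module or its exact differential divergence loss.

```python
import torch
import torch.nn as nn

class ToyTemporalAttentionUnit(nn.Module):
    """Static (spatial, per-frame) attention times dynamic (channel, across-frame) attention."""
    def __init__(self, ch, kernel=7):
        super().__init__()
        self.static_attn = nn.Conv2d(ch, ch, kernel, padding=kernel // 2, groups=ch)
        self.dynamic_attn = nn.Sequential(            # squeeze-and-excitation style gating
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                             # x: (B, C, H, W), C mixes time and channels
        return x * self.static_attn(x) * self.dynamic_attn(x)

# toy regularizer in the spirit of penalizing inter-frame variation mismatch
def frame_difference_loss(pred, target):              # (B, T, C, H, W)
    dp = pred[:, 1:] - pred[:, :-1]
    dt = target[:, 1:] - target[:, :-1]
    return nn.functional.mse_loss(dp, dt)             # penalizes mismatched inter-frame change

x = torch.randn(2, 64, 32, 32)
print(ToyTemporalAttentionUnit(64)(x).shape)
```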
In single image deblurring, the "coarse-to-fine" scheme, i.e. gradually restoring the sharp image at different resolutions in a pyramid, is very successful in both traditional optimization-based methods and recent neural-network-based approaches. In this paper, we investigate this strategy and propose a Scale-recurrent Network (SRN-DeblurNet) for this deblurring task. Compared with the many recent learning-based approaches in [25], it has a simpler network structure, a smaller number of parameters and is easier to train. We evaluate our method on large-scale deblurring datasets with complex motion. Results show that our method can produce better quality results than the state-of-the-art, both quantitatively and qualitatively.
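The coarse-to-fine, scale-recurrent strategy described above can be sketched as a loop over a resolution pyramid: restore the coarsest image first, then upsample that estimate and feed it, together with the blurry input at the next scale, into the same weight-shared network. The code below is a minimal illustration with a placeholder restoration network, not the SRN-DeblurNet architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedRestorer(nn.Module):
    """Placeholder restoration net, shared across scales (hence 'scale-recurrent')."""
    def __init__(self, ch=3, hid=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * ch, hid, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hid, ch, 3, padding=1),
        )

    def forward(self, blurry, prev_estimate):
        return self.net(torch.cat([blurry, prev_estimate], dim=1))

def coarse_to_fine_deblur(blurry, restorer, num_scales=3):
    H, W = blurry.shape[-2:]
    # build a pyramid: coarsest scale first
    pyramid = [F.interpolate(blurry, size=(H // 2 ** s, W // 2 ** s), mode="bilinear",
                             align_corners=False) for s in reversed(range(num_scales))]
    estimate = pyramid[0]                              # initialize with the coarsest input
    for level in pyramid:
        estimate = F.interpolate(estimate, size=level.shape[-2:], mode="bilinear",
                                 align_corners=False)  # upsample the previous result
        estimate = restorer(level, estimate)           # refine at the current resolution
    return estimate

out = coarse_to_fine_deblur(torch.rand(1, 3, 128, 128), SharedRestorer())
print(out.shape)                                       # torch.Size([1, 3, 128, 128])
```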
Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multi-modal tasks, and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them into channel attention, spatial attention, temporal attention, and branch attention. A related repository, https://github.com/MenghaoGuo/Awesome-Vision-Attentions, is dedicated to collecting related work. We also suggest future directions for attention mechanism research.
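To ground two of the categories named in this survey, the sketch below shows a squeeze-and-excitation style channel attention block and a very simple spatial attention block in PyTorch. These are generic textbook examples rather than any specific method covered by the survey.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style: reweight channels from globally pooled statistics."""
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class SpatialAttention(nn.Module):
    """Reweight spatial positions from channel-pooled statistics."""
    def __init__(self, kernel=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

x = torch.randn(2, 64, 32, 32)
print(ChannelAttention(64)(x).shape, SpatialAttention()(x).shape)
```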
Recent breakthroughs in adversarial generative modeling have led to models capable of producing high-quality video samples, even on large and complex datasets of real-world video. In this work, we focus on the task of video prediction, where, given a sequence of frames extracted from a video, the goal is to generate a plausible future sequence. We first improve the state of the art by performing a systematic empirical study of discriminator decompositions and proposing an architecture that yields faster convergence and higher performance. We then analyze the recurrent units in the generator and propose a novel recurrent unit that transforms its past hidden state according to predicted motion-like features and refines it to handle dis-occlusions, scene changes, and other complex behavior. We show that this recurrent unit consistently outperforms previous designs. Our final model leads to a leap in state-of-the-art performance, obtaining a test-set Frechet Video Distance of 25.7, down from 69.2, on the large-scale Kinetics-600 dataset.
With the advent of deep learning (DL), super-resolution (SR) has also become a thriving research area. However, despite promising results, the field still faces challenges that require further research, such as allowing flexible upsampling, more effective loss functions, and better evaluation metrics. We review the domain of SR in light of recent advances and examine state-of-the-art models such as diffusion (DDPM) and transformer-based SR models. We critically discuss contemporary strategies used in SR and identify promising yet unexplored research directions. We complement previous surveys by incorporating the latest developments in the field, such as uncertainty-driven losses, wavelet networks, neural architecture search, novel normalization methods, and the latest evaluation techniques. We also include several visualizations of the models and methods throughout each chapter to facilitate a global understanding of the trends in the field. This review ultimately aims to help researchers push the boundaries of DL applied to SR.
Video prediction is a challenging computer vision task that has a wide range of applications. In this work, we present a new family of Transformer-based models for video prediction. Firstly, an efficient local spatial-temporal separation attention mechanism is proposed to reduce the complexity of standard Transformers. Then, a full autoregressive model, a partial autoregressive model and a non-autoregressive model are developed based on the new efficient Transformer. The partial autoregressive model has similar performance to the full autoregressive model but a faster inference speed. The non-autoregressive model not only achieves a faster inference speed but also mitigates the quality degradation problem of its autoregressive counterparts, though it requires additional parameters and an extra loss function for learning. Given the same attention mechanism, we conducted a comprehensive study to compare the three proposed video prediction variants. Experiments show that the proposed video prediction models are competitive with more complex state-of-the-art convolutional-LSTM based models. The source code is available at https://github.com/XiYe20/VPTR.
With recent advances in CNNs, exceptional improvements have been made in semantic segmentation of high resolution images in terms of accuracy and latency. However, challenges still remain in detecting objects in crowded scenes, large scale variations, partial occlusion, and distortions, while still maintaining mobility and latency. We introduce a fast and efficient convolutional neural network, ASBU-Net, for semantic segmentation of high resolution images that addresses these problems and uses no novelty layers for ease of quantization and embedded hardware support. ASBU-Net is based on a new feature extraction module, atrous space bender layer (ASBL), which is efficient in terms of computation and memory. The ASB layers form a building block that is used to make ASBNet. Since this network does not use any special layers it can be easily implemented, quantized and deployed on FPGAs and other hardware with limited memory. We present experiments on resource and accuracy trade-offs and show strong performance compared to other popular models.
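The atrous space bender layer itself is not described in enough detail here to reproduce, but the atrous (dilated) convolution it builds on is standard. The hedged sketch below only illustrates how parallel dilations enlarge the receptive field at a fixed parameter count per branch; the block and its names are our own assumptions, not ASBU-Net.

```python
import torch
import torch.nn as nn

class AtrousBlock(nn.Module):
    """Parallel dilated 3x3 convolutions: same parameters per branch, growing receptive field."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d) for d in dilations]
        )
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.rand(1, 32, 128, 256)
print(AtrousBlock(32, 64)(x).shape)   # spatial size preserved: padding == dilation for 3x3 kernels
```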
We focus on the challenging task of real-time semantic segmentation in this paper. It finds many practical applications, yet comes with the fundamental difficulty of reducing a large portion of the computation for pixel-wise label inference. We propose an image cascade network (ICNet) that incorporates multi-resolution branches under proper label guidance to address this challenge. We provide in-depth analysis of our framework and introduce the cascade feature fusion unit to quickly achieve high-quality segmentation. Our system yields real-time inference on a single GPU card with decent quality results evaluated on challenging datasets like Cityscapes, CamVid and COCO-Stuff.
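A cascade feature fusion step of the kind referred to above can be sketched as: upsample the low-resolution branch, widen its receptive field with a dilated convolution, project the high-resolution branch with a 1x1 convolution, then add and activate. The code below is a hedged approximation of that pattern (auxiliary supervision and exact hyperparameters omitted), not the reference ICNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeFeatureFusion(nn.Module):
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.low_conv = nn.Sequential(                 # dilated conv on the upsampled low-res branch
            nn.Conv2d(low_ch, out_ch, 3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.high_conv = nn.Sequential(                # cheap projection of the high-res branch
            nn.Conv2d(high_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, low, high):
        low = F.interpolate(low, size=high.shape[-2:], mode="bilinear", align_corners=False)
        return F.relu(self.low_conv(low) + self.high_conv(high))

low = torch.rand(1, 256, 16, 32)       # coarse branch features
high = torch.rand(1, 64, 32, 64)       # finer branch features
print(CascadeFeatureFusion(256, 64, 128)(low, high).shape)
```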
Image segmentation is a key topic in image processing and computer vision with applications such as scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among many others. Various algorithms for image segmentation have been developed in the literature. Recently, due to the success of deep learning models in a wide range of vision applications, there has been a substantial amount of works aimed at developing image segmentation approaches using deep learning models. In this survey, we provide a comprehensive review of the literature at the time of this writing, covering a broad spectrum of pioneering works for semantic and instance-level segmentation, including fully convolutional pixel-labeling networks, encoder-decoder architectures, multi-scale and pyramid based approaches, recurrent networks, visual attention models, and generative models in adversarial settings. We investigate the similarity, strengths and challenges of these deep learning models, examine the most widely used datasets, report performances, and discuss promising future research directions in this area.
Video super-resolution (VSR) methods based on recurrent convolutional networks have strong temporal modeling capability for video sequences. However, the input information received by different recurrent units in a unidirectional recurrent convolutional network is unbalanced: early reconstructed frames receive less temporal information, resulting in blur or artifacts. Although bidirectional recurrent convolutional networks can alleviate this problem, they greatly increase reconstruction time and computational complexity, and they are also unsuitable for many application scenarios, such as online super-resolution. To solve these problems, we propose an end-to-end information-prebuilt recurrent reconstruction network (IPRRN), consisting of an information-prebuilt network (IPNet) and a recurrent reconstruction network (RRNet). By integrating sufficient information from the front of the video to build the hidden state needed by the initial recurrent unit and thereby help restore the earlier frames, the information-prebuilt network balances the input-information difference between early and late frames without backward propagation. In addition, we present a compact recurrent reconstruction network that brings significant improvements in restoration quality and time efficiency. Extensive experiments verify the effectiveness of our proposed network; compared with existing state-of-the-art methods, our method achieves better quantitative and qualitative evaluation performance.
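The core idea stated above, building the initial hidden state from the first few frames so that early frames are not under-informed, can be sketched as follows. The prebuild network and the recurrent reconstruction step below are deliberately simplified placeholders under our own assumptions, not the IPNet/RRNet described in the paper.

```python
import torch
import torch.nn as nn

class ToyPrebuiltRecurrentSR(nn.Module):
    def __init__(self, c=3, hid=32, prebuild_frames=3):
        super().__init__()
        self.k = prebuild_frames
        # IPNet-like: squeeze the first k frames into an initial hidden state
        self.prebuild = nn.Conv2d(self.k * c, hid, 3, padding=1)
        # RRNet-like: one unidirectional recurrent reconstruction step
        self.step = nn.Conv2d(c + hid, hid, 3, padding=1)
        self.to_frame = nn.Conv2d(hid, c, 3, padding=1)

    def forward(self, frames):                        # frames: (B, T, C, H, W), low resolution
        B, T, C, H, W = frames.shape
        h = torch.relu(self.prebuild(frames[:, :self.k].reshape(B, self.k * C, H, W)))
        outputs = []
        for t in range(T):                            # early frames now start from an informed state
            h = torch.relu(self.step(torch.cat([frames[:, t], h], dim=1)))
            outputs.append(self.to_frame(h))          # upsampling head omitted for brevity
        return torch.stack(outputs, dim=1)

print(ToyPrebuiltRecurrentSR()(torch.rand(1, 7, 3, 32, 32)).shape)
```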
Accurate and reliable lane detection is vital for the safe performance of lane-keeping assistance and lane departure warning systems. However, under certain challenging circumstances, it is difficult to obtain satisfactory performance when detecting lanes from a single image, as is mostly done in the current literature. Since lane markings are continuous lines, lanes that are difficult to detect accurately in the current single image can be better deduced if information from previous frames is incorporated. This study proposes a novel hybrid spatial-temporal (ST) sequence-to-one deep learning architecture, which makes full use of the ST information in multiple continuous image frames to detect lane markings in the last frame. Specifically, the hybrid model integrates the following aspects: (a) a single-image feature extraction module equipped with a spatial convolutional neural network; (b) an ST feature integration module constructed by an ST recurrent neural network; (c) an encoder-decoder structure, which makes this image segmentation problem work in an end-to-end supervised learning format. Extensive experiments show that the proposed model architecture can effectively handle challenging driving scenes and outperforms available state-of-the-art methods.
Spatiotemporal predictive learning generates future frames given a sequence of historical frames. Conventional algorithms are mostly based on recurrent neural networks (RNNs). However, RNNs suffer from a heavy computational burden, such as a time-consuming training process and long back-propagation paths, due to the sequential nature of the recurrent structure. Recently, Transformer-based methods have also been investigated in the form of an encoder-decoder or a plain encoder, but the encoder-decoder form requires an overly deep network and the plain encoder lacks short-term dependencies. To tackle these problems, we propose an algorithm named 3D-temporal convolutional transformer network (TCTN), in which a Transformer-based encoder with temporal convolutional layers is employed to capture both short-term and long-term dependencies. Thanks to the parallel mechanism of the Transformer, our proposed algorithm is much easier to implement and train than RNN-based methods. To validate our algorithm, we conduct experiments on the Moving MNIST and KTH datasets and show that TCTN outperforms state-of-the-art (SOTA) methods in both performance and training speed.
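As a rough illustration of coupling temporal convolution (for short-term structure) with self-attention (for long-range structure), the block below applies a causal temporal convolution over the frame axis before a standard multi-head attention layer. It is a hedged sketch of that general combination, not the TCTN architecture.

```python
import torch
import torch.nn as nn

class TemporalConvAttentionBlock(nn.Module):
    """Per-token sequence model: causal 1D conv over time, then self-attention, with residuals."""
    def __init__(self, dim=128, heads=4, kernel=3):
        super().__init__()
        self.kernel = kernel
        self.tconv = nn.Conv1d(dim, dim, kernel)                 # made causal via left padding below
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                                        # x: (B, T, dim) token sequence
        h = nn.functional.pad(x.transpose(1, 2), (self.kernel - 1, 0))  # pad the past only
        x = self.norm1(x + self.tconv(h).transpose(1, 2))        # short-term dependencies
        attn_out, _ = self.attn(x, x, x, need_weights=False)     # long-term dependencies
        return self.norm2(x + attn_out)

tokens = torch.randn(2, 16, 128)       # e.g., 16 frames embedded into 128-d tokens
print(TemporalConvAttentionBlock()(tokens).shape)
```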
With the increase of traffic big data, traffic forecasting has gradually attracted the attention of researchers, and how to mine the complex spatio-temporal correlations in traffic data to predict traffic conditions more accurately has become a difficult problem. Previous works combine graph convolutional networks (GCNs) and self-attention mechanisms with deep time-series models (e.g., recurrent neural networks) to capture spatial and temporal correlations separately, ignoring the relationships across time and space. Moreover, GCNs are limited by the over-smoothing problem and self-attention by its quadratic complexity, so GCNs lack global representation capability and self-attention captures global spatial dependence inefficiently. In this paper, we propose a novel deep learning model for traffic forecasting, named Multi-Context-Aware Spatio-Temporal Joint Linear Attention (STJLA), which applies linear attention to the spatio-temporal joint graph to efficiently capture the global dependence between all spatio-temporal nodes. More specifically, STJLA utilizes static structural context and dynamic semantic context to improve model performance. The static structural context, based on node2vec and one-hot encoding, enriches the spatio-temporal position information. Furthermore, the dynamic spatial context, based on a multi-head diffusion convolutional network, enhances the local spatial perception ability, while the dynamic temporal context, based on a GRU, stabilizes the sequence position information for the linear attention. Experiments on two real-world traffic datasets, England and PEMSD7, demonstrate that our STJLA achieves up to 9.83% and 3.08% improvements in accuracy metrics over state-of-the-art baselines.
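The key computational trick referenced above, linear attention, replaces the softmax with a kernel feature map so that keys and values can be aggregated once, reducing the cost from quadratic to linear in the number of spatio-temporal nodes. Below is a generic single-head linear attention sketch; it demonstrates only the mechanism and is not the STJLA model.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (B, N, d) for N spatio-temporal nodes. Cost is O(N * d^2) instead of O(N^2 * d)."""
    q = F.elu(q) + 1                      # positive kernel feature map phi(.)
    k = F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)                         # aggregate keys and values once
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)   # per-query normalization
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

B, N, d = 2, 1024, 64                     # e.g., N = num_sensors * num_time_steps joint nodes
q = torch.randn(B, N, d); k = torch.randn(B, N, d); v = torch.randn(B, N, d)
print(linear_attention(q, k, v).shape)    # torch.Size([2, 1024, 64])
```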
Video anomaly detection is currently one of the hottest research topics in computer vision, because abnormal events contain a large amount of information. Anomalies are one of the main detection targets in surveillance systems and usually require real-time action. Regarding the availability of labeled training data (i.e., there is not enough labeled data for anomalies), semi-supervised anomaly detection approaches have recently gained interest. This paper introduces researchers in the field to a new perspective and reviews recent deep-learning-based semi-supervised video anomaly detection approaches according to the common strategies they use for anomaly detection. Our goal is to help researchers develop more effective video anomaly detection methods. Since the choice of the right deep neural network plays an important role in several parts of this task, a quick comparative review of DNNs is prepared first. Unlike previous surveys, DNNs are reviewed from a spatiotemporal feature extraction viewpoint, customized for video anomaly detection. This part of the review can help researchers in the field select suitable networks for different parts of their methods. Moreover, some state-of-the-art anomaly detection methods are critically surveyed according to their detection strategy. The review provides a novel and deep look at existing methods and ends by stating their shortcomings, which may serve as hints for future work.