How to effectively explore the colors of reference exemplars and propagate them to colorize each frame is vital for exemplar-based video colorization. In this paper, we present an effective BiSTNet that explores the colors of reference exemplars and utilizes them to help video colorization via bidirectional temporal feature fusion guided by a semantic image prior. We first establish the semantic correspondence between each frame and the reference exemplars in deep feature space to explore color information from the reference exemplars. Then, to better propagate the colors of the reference exemplars into each frame and avoid inaccurately matched colors from the exemplars, we develop a simple yet effective bidirectional temporal feature fusion module to better colorize each frame. We note that color-bleeding artifacts usually exist around the boundaries of important objects in videos. To overcome this problem, we further develop a mixed expert block to extract semantic information for modeling the object boundaries of frames, so that the semantic image prior can better guide the colorization process. In addition, we develop a multi-scale recurrent block to progressively colorize frames in a coarse-to-fine manner. Extensive experimental results demonstrate that the proposed BiSTNet performs favorably against state-of-the-art methods on the benchmark datasets. Our code will be made available at \url{https://yyang181.github.io/BiSTNet/}
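The semantic correspondence step described above can be sketched as nearest-neighbour matching in feature space: each frame location copies the color of its most similar exemplar location. This is only a toy illustration, not the BiSTNet implementation; the function and variable names are hypothetical.

```python
import numpy as np

def transfer_colors(frame_feat, ref_feat, ref_colors):
    """For each frame location, find the most similar exemplar location in
    feature space (cosine similarity) and copy its color."""
    f = frame_feat / (np.linalg.norm(frame_feat, axis=1, keepdims=True) + 1e-8)
    r = ref_feat / (np.linalg.norm(ref_feat, axis=1, keepdims=True) + 1e-8)
    sim = f @ r.T                    # (N_frame, N_ref) cosine similarity
    idx = sim.argmax(axis=1)         # best-matching exemplar position
    return ref_colors[idx], sim.max(axis=1)   # colors + match confidence
```

In the actual method the matches are computed on deep features and then refined by the bidirectional temporal fusion, precisely because such hard nearest-neighbour matches can be inaccurate.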
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs, which have many computational and memory constraints. In this Mobile AI challenge, we address this problem and task the participants with designing an efficient quantized image super-resolution solution that can demonstrate real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to perform high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating up to a 60 FPS rate when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
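Efficient mobile super-resolution networks typically produce the upscaled image with a depth-to-space (pixel-shuffle) layer, which is cheap and NPU-friendly. A minimal NumPy sketch of that operation for a 3X factor (not tied to any particular challenge entry):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Depth-to-space: rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r),
    turning channel groups into r x r spatial blocks."""
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)   # (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)
```

In a quantized INT8 network this rearrangement costs no multiplications, which is one reason it is preferred over transposed convolutions on edge NPUs.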
Real-world image denoising is a practical image restoration problem that aims to obtain clean images from noisy inputs captured in the wild. Recently, the Vision Transformer (ViT) has shown a strong ability to capture long-range dependencies, and many researchers have attempted to apply ViT to image denoising tasks. However, a real-world image is an isolated frame, so ViT builds long-range dependencies over internal patches; splitting the image into patches disrupts the noise pattern and gradient continuity. In this paper, we propose to address this problem with a continuous Wavelet Sliding Transformer that builds frequency correspondences in the real world, called DnSwin. Specifically, we first extract bottom-level features from the noisy input image with a CNN encoder. The key to DnSwin is to separate high- and low-frequency information from the features and to build frequency dependencies. To this end, we propose a Wavelet Sliding Window Transformer that exploits the discrete wavelet transform, self-attention, and the inverse discrete wavelet transform to extract deep features. Finally, we reconstruct the deep features into a denoised image with a CNN decoder. Both quantitative and qualitative evaluations on real-world denoising benchmarks demonstrate that the proposed DnSwin performs favorably against state-of-the-art methods.
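The frequency separation at the heart of this design can be illustrated with a one-level 2D Haar transform, which splits a feature map into a low-frequency sub-band and three high-frequency sub-bands and reconstructs it exactly. This is a hand-rolled sketch of the DWT/IDWT pair the abstract mentions, not the DnSwin code:

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar transform: returns the low-frequency band (LL)
    and the three high-frequency bands (LH, HL, HH)."""
    a = (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 2
    h = (x[0::2, 0::2] - x[0::2, 1::2] + x[1::2, 0::2] - x[1::2, 1::2]) / 2
    v = (x[0::2, 0::2] + x[0::2, 1::2] - x[1::2, 0::2] - x[1::2, 1::2]) / 2
    d = (x[0::2, 0::2] - x[0::2, 1::2] - x[1::2, 0::2] + x[1::2, 1::2]) / 2
    return a, h, v, d

def haar_idwt2(a, h, v, d):
    """Inverse of haar_dwt2 (exact reconstruction)."""
    H, W = a.shape
    x = np.zeros((2 * H, 2 * W))
    x[0::2, 0::2] = (a + h + v + d) / 2
    x[0::2, 1::2] = (a - h + v - d) / 2
    x[1::2, 0::2] = (a + h - v - d) / 2
    x[1::2, 1::2] = (a - h - v + d) / 2
    return x
```

In DnSwin, self-attention is applied between these two stages so that noise (mostly high-frequency) and content (mostly low-frequency) are modeled with separate dependencies.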
The success of existing video super-resolution (VSR) algorithms stems mainly from exploiting temporal information from adjacent frames. However, none of these methods discuss the influence of temporal redundancy in patches with stationary objects and backgrounds, and they typically use all the information from adjacent frames without any discrimination. In this paper, we observe that temporal redundancy has an adverse impact on information propagation, which limits the performance of most existing VSR methods. Motivated by this observation, we aim to improve existing VSR algorithms by handling temporally redundant patches in an optimized way. We develop two simple yet effective plug-and-play methods to improve the performance of existing local and non-local propagation algorithms on widely used public videos. To more comprehensively evaluate the robustness and performance of existing VSR algorithms, we also collect a new dataset containing a variety of public videos as a test set. Extensive evaluations show that the proposed methods can significantly improve the performance of existing VSR methods on videos collected from wild scenes while maintaining their performance on existing commonly used datasets. The code is available at https://github.com/hyhsimon/boosted-vsr.
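The idea of discriminating temporally redundant patches can be sketched as a simple change-detection mask over co-located patches of adjacent frames; static patches would then be skipped or down-weighted during propagation. This is an illustrative sketch with hypothetical names and a hand-picked threshold, not the paper's plug-and-play method:

```python
import numpy as np

def redundant_patch_mask(prev, cur, patch=4, thresh=1.0):
    """Mark patches whose content barely changes between adjacent frames;
    such temporally redundant patches are candidates to skip during
    information propagation."""
    H, W = cur.shape
    mask = np.zeros((H // patch, W // patch), dtype=bool)
    for i in range(H // patch):
        for j in range(W // patch):
            sl = (slice(i * patch, (i + 1) * patch),
                  slice(j * patch, (j + 1) * patch))
            mask[i, j] = np.abs(cur[sl] - prev[sl]).mean() < thresh
    return mask
```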
We propose an effective Structural Prior guided Generative Adversarial Transformer (SPGAT) for low-light image enhancement. Our SPGAT mainly contains a generator with two discriminators and a structural prior estimator (SPE). The generator is based on a U-shaped Transformer, which is used to explore non-local information for clearer image restoration. The SPE is used to explore useful structures in images to guide the generator toward better structural detail estimation. To generate more realistic images, we develop a new structural prior guided adversarial learning method by building skip connections between the generator and the discriminators, so that the discriminators can better distinguish real features from fake ones. Finally, we propose a window-based Swin Transformer block to aggregate hierarchical features at different levels for high-quality image restoration. Experimental results demonstrate that the proposed SPGAT performs favorably against state-of-the-art methods on both synthetic and real-world datasets.
The success of state-of-the-art video deblurring methods stems mainly from the implicit or explicit estimation of alignment among adjacent frames for latent video restoration. However, due to the influence of blur, estimating alignment information from blurry adjacent frames is not a trivial task, and inaccurate estimates will interfere with the restoration of subsequent frames. Instead of estimating alignment information, we propose a simple yet effective deep Recurrent Neural Network with Multi-scale Bi-directional Propagation (RNN-MBP) to effectively propagate and gather information from unaligned neighboring frames for better video deblurring. Specifically, we build a Multi-scale Bi-directional Propagation (MBP) module with two U-Net-based RNN cells, which can directly exploit inter-frame information from unaligned neighboring hidden states by integrating them at different scales. Moreover, to better evaluate the proposed algorithm and existing state-of-the-art methods on real-world blurry scenes, we also create a Real-World Blurry Video Dataset (RBVD) with a well-designed Digital Video Acquisition system (DVA) and use it as a training and evaluation dataset. Extensive experimental results demonstrate that the proposed RBVD dataset effectively improves the performance of existing algorithms on real-world blurry videos, and that the proposed algorithm performs favorably against state-of-the-art methods on three typical benchmarks. The code is available at https://github.com/XJTU-CVLAB-LOWLEVEL/RNN-MBP.
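Alignment-free multi-scale bi-directional propagation can be caricatured as running a leaky recurrent state over the frame sequence in both directions and at two resolutions, then fusing the results. This is a toy numerical sketch of the propagation pattern only (the real MBP cells are U-Net RNNs); all names are hypothetical:

```python
import numpy as np

def propagate(feats, decay=0.7, reverse=False):
    """Toy hidden-state propagation along one temporal direction."""
    seq = feats[::-1] if reverse else feats
    h = np.zeros_like(seq[0])
    out = []
    for f in seq:
        h = decay * h + (1 - decay) * f   # leaky integration of frame info
        out.append(h)
    return out[::-1] if reverse else out

def multiscale_bidirectional(feats):
    """Fuse forward/backward states at full and half resolution."""
    fwd = propagate(feats)
    bwd = propagate(feats, reverse=True)
    half = [f[::2, ::2] for f in feats]
    fwd_h = propagate(half)
    bwd_h = propagate(half, reverse=True)
    fused = []
    for t in range(len(feats)):
        # nearest-neighbour upsample of the coarse-scale states
        coarse = np.kron(fwd_h[t] + bwd_h[t], np.ones((2, 2))) / 2
        fused.append((fwd[t] + bwd[t]) / 2 + coarse)
    return fused
```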
Non-blind deconvolution is an ill-posed problem. Most existing methods formulate it within a maximum-a-posteriori framework and solve it by designing various regularization terms and data terms for the latent clear image. In this paper, we propose an effective non-blind deconvolution approach that learns discriminative shrinkage functions to implicitly model these terms. In contrast to most existing methods that use deep convolutional neural networks (CNNs) or radial basis functions to simply learn the regularization term, we formulate both the data term and the regularization term, and split the deconvolution model into data-related and regularization-related sub-problems according to the alternating direction method of multipliers (ADMM). We explore the properties of the Maxout function and develop a deep CNN model with Maxout layers to learn discriminative shrinkage functions that directly approximate the solutions of these two sub-problems. Moreover, since fast-Fourier-transform-based image restoration usually leads to ringing artifacts while conjugate-gradient-based restoration is time-consuming, we develop a Conjugate Gradient Network to restore the latent clear image effectively and efficiently. Experimental results demonstrate that the proposed method performs favorably against state-of-the-art methods in terms of efficiency and accuracy.
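The ADMM-style data sub-problem mentioned above, min_x ||k*x - y||^2 + rho*||x - z||^2, has a classical closed-form FFT solution under circular boundary conditions. This standard textbook step (not the paper's learned Conjugate Gradient Network, which exists precisely to avoid the ringing this FFT solution can introduce) can be sketched as:

```python
import numpy as np

def fft_data_step(y, kernel, z, rho):
    """Closed-form FFT solution of min_x ||k*x - y||^2 + rho*||x - z||^2,
    assuming circular convolution: X = (conj(K)*Y + rho*Z) / (|K|^2 + rho)."""
    K = np.fft.fft2(kernel, s=y.shape)
    num = np.conj(K) * np.fft.fft2(y) + rho * np.fft.fft2(z)
    den = np.abs(K) ** 2 + rho
    return np.real(np.fft.ifft2(num / den))
```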
Blind image quality assessment (BIQA) remains challenging due to the diversity of distortions and the variation of image content, which complicate the distortion patterns across different scales and aggravate the difficulty of the regression problem in BIQA. However, existing BIQA methods often fail to consider multi-scale distortion patterns and image content, and little research has been done on learning strategies that make the regression model perform better. In this paper, we propose a simple yet effective Progressive Multi-Task Image Quality Assessment (PMT-IQA) model, which contains a multi-scale feature extraction module (MS) and a progressive multi-task learning module (PMT), to help the model learn complex distortion patterns and better optimize the regression problem, in line with the easy-to-hard nature of human learning. To verify the effectiveness of the proposed PMT-IQA model, we conduct experiments on four widely used public datasets; the results indicate that PMT-IQA is superior to the comparison approaches and that both the MS and PMT modules improve the model's performance.
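An easy-to-hard progressive multi-task schedule can be pictured as gradually shifting the loss weight from an easier auxiliary task to the harder quality-regression task over training. This is a minimal sketch of one plausible schedule, not the PMT module's actual weighting:

```python
def progressive_weights(epoch, total_epochs):
    """Easy-to-hard schedule: the loss weight moves linearly from the easier
    auxiliary task to the harder quality-regression task as training proceeds."""
    lam = min(1.0, epoch / total_epochs)
    return (1 - lam), lam   # (easy-task weight, hard-task weight)
```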
In this paper, we study the problem of knowledge-intensive text-to-SQL, in which domain knowledge is necessary to parse expert questions into SQL queries over domain-specific tables. We formalize this scenario by building a new Chinese benchmark KnowSQL consisting of domain-specific questions covering various domains. We then address this problem by presenting formulaic knowledge, rather than by annotating additional data examples. More concretely, we construct a formulaic knowledge bank as a domain knowledge base and propose a framework (ReGrouP) to leverage this formulaic knowledge during parsing. Experiments using ReGrouP demonstrate a significant 28.2% improvement overall on KnowSQL.
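The "formulaic knowledge bank" idea can be pictured as a lookup from domain terms to computable formula fragments that ground expert jargon before parsing. The bank entries below are invented placeholders purely for illustration; they are not from KnowSQL:

```python
# Hypothetical formulaic knowledge bank: domain term -> formula fragment.
BANK = {
    "return on equity": "net_income / shareholder_equity",
    "gross margin": "(revenue - cogs) / revenue",
}

def retrieve_formulas(question, bank=BANK):
    """Retrieve every formula whose trigger term appears in the question,
    so a parser can ground domain jargon into computable expressions."""
    q = question.lower()
    return {term: f for term, f in bank.items() if term in q}
```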
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment in an untrimmed video given a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary bias: the annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take adjacent irrelevant frames as the new boundaries. 2) Reasoning bias: such incorrect new boundary frames also lead to reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationships among these frames and generate soft labels on the boundaries for more accurate frame-query reasoning. Such a mechanism is also able to supplement the absent consecutive visual semantics for the sampled sparse frames, enabling fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
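Soft boundary labels, as opposed to hard one-hot start/end targets, can be sketched as Gaussian-smoothed distributions over the sampled frames, so that frames near the true boundary still receive probability mass even if downsampling dropped the exact boundary frame. This is a generic sketch of soft-label construction, not SSRN's learned label-generation strategy:

```python
import numpy as np

def soft_boundary_labels(num_frames, start, end, sigma=1.0):
    """Gaussian-smoothed start/end labels over sampled frames: frames near
    each annotated boundary receive decayed probability mass."""
    t = np.arange(num_frames)
    s = np.exp(-((t - start) ** 2) / (2 * sigma ** 2))
    e = np.exp(-((t - end) ** 2) / (2 * sigma ** 2))
    return s / s.sum(), e / e.sum()
```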