Most CNN-based super-resolution (SR) methods assume the degradation is known (e.g., bicubic). When the actual degradation differs from this assumption, these methods suffer a severe performance drop. Some methods therefore try to train SR networks on complex combinations of multiple degradations to cover the real degradation space. To adapt to multiple unknown degradations, introducing an explicit degradation estimator can indeed improve SR performance. However, previous explicit degradation estimation methods typically predict Gaussian blur kernels under supervision from ground-truth blur kernels, and estimation errors may cause SR to fail. It is therefore necessary to design a method that extracts implicit, discriminative degradation representations. To this end, we propose a Meta-learning based Region Degradation Aware SR network (MRDA), consisting of a Meta-Learning Network (MLN), a Degradation Extraction Network (DEN), and a Region Degradation Aware SR Network (RDAN). To cope with the lack of ground-truth degradation, we use the MLN to rapidly adapt to a specific complex degradation within a few iterations and extract implicit degradation information. Subsequently, a teacher network MRDA$_{T}$ is designed to further exploit the degradation information extracted by the MLN for SR. However, the MLN requires iterating on paired low-resolution (LR) and corresponding high-resolution (HR) images, which are unavailable at the inference stage. Therefore, we adopt knowledge distillation (KD) so that the student network learns to extract the same implicit degradation representation (IDR) as the teacher directly from LR images.
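A minimal sketch of the knowledge-distillation step described above, under assumed shapes and module names (TinyEncoder stands in for the degradation extraction network): a frozen teacher produces an implicit degradation representation from the LR image plus a downsampled HR image, and a student learns to reproduce it from the LR image alone.

```python
# Illustrative sketch, not the authors' code; all names and shapes are assumptions.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the degradation extraction network (DEN)."""
    def __init__(self, in_ch, dim=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),            # pool to a global degradation vector
        )
    def forward(self, x):
        return self.body(x).flatten(1)          # (B, dim) implicit degradation repr.

teacher = TinyEncoder(in_ch=6)   # teacher sees LR + (downsampled) HR, 3+3 channels
student = TinyEncoder(in_ch=3)   # student sees the LR image only

lr = torch.randn(4, 3, 32, 32)
hr_down = torch.randn(4, 3, 32, 32)             # HR resized to LR spatial size

with torch.no_grad():                            # teacher is frozen during distillation
    idr_teacher = teacher(torch.cat([lr, hr_down], dim=1))
idr_student = student(lr)

kd_loss = nn.functional.l1_loss(idr_student, idr_teacher)
kd_loss.backward()                               # combined with the SR loss in practice
```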
Lighter and faster models are crucial for deploying video super-resolution (VSR) on resource-limited devices such as smartphones and wearables. In this paper, we develop Residual Sparsity Connection Learning (RSCL), a structured pruning scheme that reduces the redundancy of convolution kernels and obtains a compact VSR network with minimal performance drop. However, residual blocks require the pruned filter indices of the skip and residual connections to be identical, which is tricky for pruning. Therefore, to relax the pruning restriction on residual blocks, we design a Residual Sparsity Connection (RSC) scheme that preserves all feature channels and operates only on the important ones. Furthermore, for the pixel-shuffle operation, we design a dedicated pruning scheme that groups several filters into one pruning unit to guarantee the accuracy of the channel-to-space transformation after pruning. In addition, we introduce Temporal Finetuning (TF) to reduce the amplification of pruning errors in hidden states under temporal propagation. Extensive experiments show that the proposed RSCL significantly outperforms state-of-the-art methods both quantitatively and qualitatively. Code and models will be released.
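The pixel-shuffle pruning idea lends itself to a short sketch. The code below is an illustrative assumption, not the paper's implementation: for a pixel-shuffle layer with upscale factor r, every r*r output filters of the preceding convolution form one post-shuffle channel, so filters are scored and pruned in whole groups to keep the channel-to-space transform valid.

```python
import torch
import torch.nn as nn

r = 2                                   # pixel-shuffle upscale factor
out_feat, in_ch = 16, 64                # 16 post-shuffle channels -> 16*r*r filters
conv = nn.Conv2d(in_ch, out_feat * r * r, 3, padding=1)

# Importance of each pruning unit = L1 norm of its r*r filters taken together.
w = conv.weight.detach()                              # (out_feat*r*r, in_ch, 3, 3)
group_score = w.abs().view(out_feat, r * r, -1).sum(dim=(1, 2))   # (out_feat,)

keep = 12                                             # channels to keep after pruning
kept_groups = torch.topk(group_score, keep).indices.sort().values
# Expand group indices back to filter indices (all r*r filters of each kept group).
kept_filters = (kept_groups[:, None] * r * r + torch.arange(r * r)).flatten()

pruned = nn.Conv2d(in_ch, keep * r * r, 3, padding=1)
pruned.weight.data = conv.weight.data[kept_filters].clone()
pruned.bias.data = conv.bias.data[kept_filters].clone()
# After nn.PixelShuffle(r), the pruned conv yields `keep` spatially upscaled channels.
```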
Reference-based super-resolution (RefSR) has made significant progress in producing realistic textures by using an external reference (Ref) image. However, existing RefSR methods obtain high-quality correspondence matching at the cost of computational resources that grow quadratically with input size, limiting their application. Moreover, these methods usually suffer from scale misalignment between the low-resolution (LR) image and the Ref image. In this paper, we propose an Accelerated Multi-Scale Aggregation network (AMSA) for reference-based super-resolution, consisting of a Coarse-to-Fine Embedded PatchMatch (CFE-PatchMatch) module and a Multi-Scale Dynamic Aggregation (MSDA) module. To improve matching efficiency, we design a novel Embedded PatchMatch scheme with random sample propagation that supports end-to-end training with asymptotically linear computational cost with respect to input size. To further reduce computational cost and speed up convergence, we apply a coarse-to-fine strategy on top of Embedded PatchMatch, forming CFE-PatchMatch. To fully exploit reference information across multiple scales and enhance robustness to scale misalignment, we develop the MSDA module, which consists of Dynamic Aggregation and Multi-Scale Aggregation. Dynamic Aggregation corrects minor scale misalignment by dynamically aggregating features, and Multi-Scale Aggregation brings robustness to large scale misalignment by fusing multi-scale information. Experimental results show that the proposed AMSA achieves superior performance over state-of-the-art methods in both quantitative and qualitative evaluations.
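For readers unfamiliar with PatchMatch, the toy NumPy function below sketches the classic propagation-plus-random-search scheme over per-pixel features, which runs in time roughly linear in image size. The paper's CFE-PatchMatch embeds this idea in an end-to-end trainable, coarse-to-fine form, so this is only an approximation of the underlying mechanism, not the proposed module.

```python
import numpy as np

def patchmatch_nnf(src, ref, iters=4, seed=0):
    """For every src position, find a ref position with a similar feature vector
    using propagation + random search (PatchMatch-style, roughly linear cost)."""
    rng = np.random.default_rng(seed)
    H, W, _ = src.shape
    h, w, _ = ref.shape
    nnf = np.stack([rng.integers(0, h, (H, W)), rng.integers(0, w, (H, W))], -1)

    def cost(y, x, ny, nx):
        d = src[y, x] - ref[ny, nx]
        return float(d @ d)

    for it in range(iters):
        ys = range(H) if it % 2 == 0 else range(H - 1, -1, -1)
        xs = range(W) if it % 2 == 0 else range(W - 1, -1, -1)
        step = 1 if it % 2 == 0 else -1
        for y in ys:
            for x in xs:
                best = nnf[y, x].copy(); best_c = cost(y, x, *best)
                # Propagation: adopt the (shifted) match of the previous neighbor.
                for dy, dx in ((step, 0), (0, step)):
                    py, px = y - dy, x - dx
                    if 0 <= py < H and 0 <= px < W:
                        ny = min(max(nnf[py, px][0] + dy, 0), h - 1)
                        nx = min(max(nnf[py, px][1] + dx, 0), w - 1)
                        c = cost(y, x, ny, nx)
                        if c < best_c:
                            best, best_c = np.array([ny, nx]), c
                # Random search around the current best match.
                radius = max(h, w)
                while radius >= 1:
                    ny = int(np.clip(best[0] + rng.integers(-radius, radius + 1), 0, h - 1))
                    nx = int(np.clip(best[1] + rng.integers(-radius, radius + 1), 0, w - 1))
                    c = cost(y, x, ny, nx)
                    if c < best_c:
                        best, best_c = np.array([ny, nx]), c
                    radius //= 2
                nnf[y, x] = best
    return nnf

src = np.random.rand(20, 20, 8); ref = np.random.rand(24, 24, 8)
nnf = patchmatch_nnf(src, ref)       # (20, 20, 2) matched ref coordinates
```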
Non-Local Attention (NLA) brings significant improvements to single image super-resolution (SISR) by exploiting the intrinsic feature correlations in natural images. However, NLA assigns large weights to noisy information and consumes computational resources that grow quadratically with input size, limiting its performance and application. In this paper, we propose a novel Efficient Non-Local Contrastive Attention (ENLCA) to perform long-range visual modeling and exploit more relevant non-local features. Specifically, ENLCA consists of two parts: Efficient Non-Local Attention (ENLA) and Sparse Aggregation. ENLA adopts a kernel method to approximate the exponential function and achieves linear computational complexity. For Sparse Aggregation, we multiply the input by an amplification factor to focus on informative features, but the variance of the approximation then grows exponentially. Therefore, contrastive learning is applied to further separate relevant and irrelevant features. To demonstrate the effectiveness of ENLCA, we construct an architecture called the Efficient Non-Local Contrastive Network (ENLCN) by adding a few of our modules to a simple backbone. Extensive experimental results show that ENLCN achieves superior performance over state-of-the-art methods in both quantitative and qualitative evaluations.
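The kernel-method idea behind ENLA can be illustrated with Performer-style positive random features, which approximate exp(q·k) so attention can be aggregated in linear time. This is a sketch under our own assumptions, not the authors' code; the amplification factor and contrastive loss from the abstract are omitted.

```python
import torch

def random_feature_map(x, proj):
    # Positive random features for the exponential (softmax) kernel.
    # x: (N, d), proj: (d, m); E[phi(q) . phi(k)] approximates exp(q . k).
    norm = (x ** 2).sum(-1, keepdim=True) / 2
    return torch.exp(x @ proj - norm) / proj.shape[1] ** 0.5

N, d, m = 1024, 64, 128
q = torch.randn(N, d) / d ** 0.25           # scaling keeps the kernel well-behaved
k = torch.randn(N, d) / d ** 0.25
v = torch.randn(N, d)
proj = torch.randn(d, m)                    # shared random projection for q and k

q_f, k_f = random_feature_map(q, proj), random_feature_map(k, proj)
kv = k_f.T @ v                              # (m, d): aggregate keys and values once
z = q_f @ k_f.sum(0)                        # (N,):  normalization term
out = (q_f @ kv) / z.unsqueeze(-1)          # (N, d) linear-complexity attention output
```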
An increasing number of public datasets have shown a marked clinical impact on assessing anatomical structures. However, each of the datasets is small, partially labeled, and rarely investigates severe tumor subjects. Moreover, current models are limited to segmenting specific organs/tumors and cannot be extended to novel domains and classes. To tackle these limitations, we introduce embedding learned from Contrastive Language-Image Pre-training (CLIP) to segmentation models, dubbed the CLIP-Driven Universal Model. The Universal Model can better segment 25 organs and 6 types of tumors by exploiting the semantic relationship between abdominal structures. The model is developed from an assembly of 14 datasets with 3,410 CT scans and evaluated on 6,162 external CT scans from 3 datasets. We rank first on the public leaderboard of the Medical Segmentation Decathlon (MSD) and achieve state-of-the-art results on Beyond The Cranial Vault (BTCV). Compared with dataset-specific models, the Universal Model is computationally more efficient (6x faster), generalizes better to CT scans from varying sites, and shows stronger transfer learning performance on novel tasks. The design of CLIP embedding enables the Universal Model to be easily extended to new classes without catastrophically forgetting the previously learned classes.
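A minimal sketch of how text embeddings of class names can condition per-class segmentation heads, roughly in the spirit described above; all shapes and the controller design are illustrative assumptions, and the CLIP text encoder is replaced by random stand-in embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, txt_dim, feat_dim = 31, 512, 48        # 25 organs + 6 tumors (illustrative)
text_emb = torch.randn(num_classes, txt_dim)        # stand-in for frozen CLIP embeddings
feats = torch.randn(1, feat_dim, 32, 32, 32)        # decoder features of one CT volume

# Controller maps [class text embedding, global image feature] to a 1x1x1 conv kernel,
# so every class gets its own lightweight binary-segmentation head.
controller = nn.Linear(txt_dim + feat_dim, feat_dim + 1)
global_feat = feats.mean(dim=(2, 3, 4))             # (1, feat_dim)

masks = []
for c in range(num_classes):
    params = controller(torch.cat([text_emb[c], global_feat[0]]))
    weight = params[:feat_dim].view(1, feat_dim, 1, 1, 1)
    bias = params[feat_dim:]
    masks.append(torch.sigmoid(F.conv3d(feats, weight, bias)))   # (1, 1, D, H, W)
masks = torch.cat(masks, dim=1)                     # one sigmoid mask per class
```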
Machine Reading Comprehension has become one of the most advanced and popular research topics in the field of Natural Language Processing in recent years. Classifying question answerability is a significant sub-task in machine reading comprehension; however, it has not been studied extensively. Retro-Reader is one of the studies that has addressed this problem effectively. However, the encoders of most traditional machine reading comprehension models in general, and Retro-Reader in particular, have not been able to fully exploit the contextual semantic information of the context. Inspired by SemBERT, we use semantic role labels from the SRL task to add semantics to pre-trained language models such as mBERT, XLM-R, and PhoBERT. This experiment was conducted to compare the influence of semantics on answerability classification for Vietnamese machine reading comprehension. We also expect this experiment to enhance the encoder of the Retro-Reader model's Sketchy Reading Module. The Retro-Reader encoder improved with semantics was applied to the Vietnamese Machine Reading Comprehension task for the first time and obtained positive results.
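A minimal sketch of the SemBERT-style fusion described above, with assumed shapes and stand-in encoder outputs: per-token SRL tag embeddings are concatenated with contextual token states and projected back to the encoder width.

```python
import torch
import torch.nn as nn

num_srl_tags, lm_dim, srl_dim = 30, 768, 32
B, T = 2, 16

token_states = torch.randn(B, T, lm_dim)           # stand-in for mBERT/XLM-R/PhoBERT outputs
srl_tags = torch.randint(0, num_srl_tags, (B, T))  # per-token SRL tag ids from an SRL tagger

srl_embed = nn.Embedding(num_srl_tags, srl_dim)
fuse = nn.Linear(lm_dim + srl_dim, lm_dim)          # project back to the encoder width

semantic_states = fuse(torch.cat([token_states, srl_embed(srl_tags)], dim=-1))
# `semantic_states` then feeds the answerability classifier / Sketchy Reading Module.
```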
While the rollout of the fifth-generation mobile network (5G) is underway across the globe with the intention to deliver 4K/8K UHD videos, Augmented Reality (AR), and Virtual Reality (VR) content to massive numbers of users, coverage and throughput remain among the most significant issues, especially in rural areas where only low-frequency-band 5G is being deployed. This calls for a high-performance adaptive bitrate (ABR) algorithm that can maximize user quality of experience (QoE) given 5G network characteristics and the data rate of UHD content. Recently, many newly proposed ABR techniques have been machine-learning based. Among them, Pensieve is one of the state-of-the-art techniques, which uses reinforcement learning to generate an ABR algorithm based on observations of past decision performance. By incorporating the context of the 5G network and UHD content, Pensieve has been optimized into Pensieve 5G. New QoE metrics that more accurately represent the QoE of UHD video streaming on different types of devices were proposed and used to evaluate Pensieve 5G against other ABR techniques, including the original Pensieve. The results of simulations based on real 5G Standalone (SA) network throughput show that Pensieve 5G outperforms both conventional algorithms and Pensieve, with average QoE improvements of 8.8% and 14.2%, respectively. Additionally, Pensieve 5G also performs well on the commercial 5G NR-NR Dual Connectivity (NR-DC) network, despite being trained solely on data from the 5G Standalone (SA) network.
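For concreteness, the snippet below sketches the standard Pensieve-style QoE objective (per-chunk quality minus rebuffering and bitrate-switch penalties). The UHD- and device-specific QoE metrics proposed in the paper may weight these terms differently, so the coefficients here are illustrative.

```python
def qoe(bitrates_mbps, rebuffer_s, rebuf_penalty=4.3, smooth_penalty=1.0):
    """Sum of per-chunk quality, minus rebuffering and bitrate-switch penalties."""
    quality = sum(bitrates_mbps)
    rebuffering = rebuf_penalty * sum(rebuffer_s)
    smoothness = smooth_penalty * sum(
        abs(b1 - b0) for b0, b1 in zip(bitrates_mbps, bitrates_mbps[1:])
    )
    return quality - rebuffering - smoothness

# Example: five chunks streamed at varying bitrates with a short stall on chunk 3.
print(qoe([10, 16, 16, 25, 25], [0.0, 0.0, 1.2, 0.0, 0.0]))
```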
This paper introduces a learned hierarchical B-frame coding scheme in response to the Grand Challenge on Neural Network-based Video Coding at ISCAS 2023. We address specifically three issues, including (1) B-frame coding, (2) YUV 4:2:0 coding, and (3) content-adaptive variable-rate coding with only one single model. Most learned video codecs operate internally in the RGB domain for P-frame coding. B-frame coding for YUV 4:2:0 content is largely under-explored. In addition, while there have been prior works on variable-rate coding with conditional convolution, most of them fail to consider the content information. We build our scheme on conditional augmented normalized flows (CANF). It features conditional motion and inter-frame codecs for efficient B-frame coding. To cope with YUV 4:2:0 content, two conditional inter-frame codecs are used to process the Y and UV components separately, with the coding of the UV components conditioned additionally on the Y component. Moreover, we introduce adaptive feature modulation in every convolutional layer, taking into account both the content information and the coding levels of B-frames to achieve content-adaptive variable-rate coding. Experimental results show that our model outperforms x265 and the winner of last year's challenge on commonly used datasets in terms of PSNR-YUV.
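A minimal sketch, under our own assumptions, of the adaptive feature modulation idea: each convolutional layer's features are rescaled and shifted by factors predicted from a condition vector that combines the coding (rate) level with a content descriptor of the current B-frame.

```python
import torch
import torch.nn as nn

class AdaptiveFeatureModulation(nn.Module):
    """FiLM-style modulation conditioned on rate level and content (illustrative)."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * channels)

    def forward(self, feat, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)      # (B, C, 1, 1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        return feat * (1 + scale) + shift

channels, cond_dim = 64, 8
afm = AdaptiveFeatureModulation(channels, cond_dim)

feat = torch.randn(1, channels, 32, 32)               # features inside the codec
rate_level = torch.tensor([[0.75]])                    # normalized quality/lambda level
content = torch.randn(1, cond_dim - 1)                 # content descriptor (assumed)
out = afm(feat, torch.cat([rate_level, content], dim=-1))
```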
Classification using supervised learning requires annotating a large amount of class-balanced data for model training and testing. This has practically limited the scope of applications with supervised learning, in particular deep learning. To address the issues associated with limited and imbalanced data, this paper introduces a sample-efficient co-supervised learning paradigm (SEC-CGAN), in which a conditional generative adversarial network (CGAN) is trained alongside the classifier and supplements semantics-conditioned, confidence-aware synthesized examples to the annotated data during the training process. In this setting, the CGAN not only serves as a co-supervisor but also provides complementary quality examples to aid the classifier training in an end-to-end fashion. Experiments demonstrate that the proposed SEC-CGAN outperforms the external classifier GAN (EC-GAN) and a baseline ResNet-18 classifier. For the comparison, all classifiers in the above methods adopt the ResNet-18 architecture as the backbone. In particular, for the Street View House Numbers dataset, using 5% of the training data, SEC-CGAN achieves a test accuracy of 90.26%, as opposed to 88.59% by EC-GAN and 87.17% by the baseline classifier; for the highway image dataset, using 10% of the training data, SEC-CGAN achieves a test accuracy of 98.27%, compared to 97.84% by EC-GAN and 95.52% by the baseline classifier.
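A minimal sketch of the co-supervision step with stand-in modules and illustrative thresholds: class-conditioned synthetic samples are appended to the classifier's batch only when the classifier is confident they match their intended class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, z_dim = 10, 100
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, num_classes))
generator = lambda z, y: torch.randn(z.shape[0], 3, 32, 32)   # stand-in CGAN generator

real_x, real_y = torch.randn(64, 3, 32, 32), torch.randint(0, num_classes, (64,))
fake_y = torch.randint(0, num_classes, (64,))                 # conditioning labels
fake_x = generator(torch.randn(64, z_dim), fake_y)

with torch.no_grad():
    probs = F.softmax(classifier(fake_x), dim=-1)
    conf, pred = probs.max(dim=-1)
    keep = (pred == fake_y) & (conf > 0.9)        # confidence-aware filtering

x = torch.cat([real_x, fake_x[keep]])
y = torch.cat([real_y, fake_y[keep]])
loss = F.cross_entropy(classifier(x), y)          # classifier update on the mixed batch
loss.backward()
```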
A major goal of multimodal research is to improve machine understanding of images and text. Tasks include image captioning, text-to-image generation, and vision-language representation learning. So far, research has focused on the relationships between images and text. For example, captioning models attempt to understand the semantics of images, which are then transformed into text. An important question is: which annotation best reflects a deep understanding of image content? Similarly, given a text, what is the best image that presents the semantics of the text? In this work, we argue that the best text or caption for a given image is the text that would generate the image most similar to that image. Likewise, the best image for a given text is the image whose caption is best aligned with the original text. To this end, we propose a unified framework that includes both a text-to-image generative model and an image-to-text generative model. Extensive experiments validate our approach.
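The selection criterion argued for above can be sketched with stand-in models: the best caption for an image is the candidate whose generated image is most similar to the original, here measured by cosine similarity of a placeholder embedding.

```python
import torch

def image_embed(img):      # stand-in for any image encoder
    return torch.tanh(img.flatten()[:128])

def text_to_image(text):   # stand-in for the text-to-image generative model
    torch.manual_seed(abs(hash(text)) % (2 ** 31))
    return torch.randn(3, 64, 64)

def best_caption(image, candidate_captions):
    target = image_embed(image)
    scores = []
    for caption in candidate_captions:
        recon = image_embed(text_to_image(caption))
        scores.append(torch.cosine_similarity(target, recon, dim=0))
    return candidate_captions[int(torch.stack(scores).argmax())]

image = torch.randn(3, 64, 64)
print(best_caption(image, ["a dog on grass", "a red car", "two people at a beach"]))
```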