As one of the prevalent methods for building automation systems, Imitation Learning (IL) has shown promising performance across a wide range of domains. However, despite considerable improvements in policy performance, research on the explainability of IL models remains limited. Inspired by recent approaches in explainable artificial intelligence, we propose a model-agnostic explanation framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in the demonstrations. It iteratively retrains the black-box IL model on randomly masked demonstrations and uses the conventional evaluation outcome, i.e., the environment return, as the coefficient to build an importance map. We also conducted experiments to investigate three major questions: whether frames are equally important, how effective the importance map is, and how importance maps from different IL models relate. The results show that R2RISE successfully distinguishes important frames from the demonstrations.
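As a rough illustration of the masking-and-retraining loop described above, the sketch below estimates per-frame importance by repeatedly retraining on randomly masked demonstrations and weighting each mask by the resulting environment return. `train_il_policy` and `evaluate_return` are hypothetical stand-ins for a black-box IL trainer and an environment rollout, and the averaging scheme is an assumption rather than the authors' exact formulation.

```python
# Minimal sketch of the masking-and-retraining idea (not the authors' code).
import numpy as np

def r2rise_importance(demo_frames, train_il_policy, evaluate_return,
                      n_iters=100, keep_prob=0.5, rng=None):
    """Estimate per-frame importance by retraining on randomly masked demos."""
    rng = rng or np.random.default_rng(0)
    n = len(demo_frames)
    importance = np.zeros(n)
    mask_sum = np.zeros(n)
    for _ in range(n_iters):
        mask = rng.random(n) < keep_prob           # which frames to keep
        masked_demo = [f for f, m in zip(demo_frames, mask) if m]
        policy = train_il_policy(masked_demo)      # retrain the black-box IL model
        ret = evaluate_return(policy)              # environment return of the new policy
        importance += ret * mask                   # weight each kept frame by the return
        mask_sum += mask
    return importance / np.maximum(mask_sum, 1)    # normalize per frame
```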
With the rapid deployment of graph neural network (GNN) based techniques in a wide range of applications such as link prediction, node classification, and graph classification, the explainability of GNNs has become an indispensable component of predictive and trustworthy decision-making. It is therefore critical to explain why a GNN makes particular predictions if it is to be trusted in many applications. Several GNN explainers have been proposed recently; however, they fail to generate accurate and faithful explanations. To mitigate these limitations, we propose GANExplainer, based on the Generative Adversarial Network (GAN) architecture. GANExplainer is composed of a generator that creates explanations and a discriminator that assists the generator's development. We investigate the explanation accuracy of our model by comparing the performance of GANExplainer with other state-of-the-art methods. Our empirical results on synthetic datasets indicate that GANExplainer improves explanation accuracy by up to 35% compared to its alternatives.
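A minimal sketch of how such an adversarial explainer could be trained is given below, assuming a generator that outputs a soft edge mask for a graph and a discriminator that scores explanation masks with a single logit; the module interfaces and the use of ground-truth masks from synthetic data as "real" samples are assumptions, not the paper's implementation.

```python
# Schematic adversarial training loop for an explanation generator/discriminator pair.
import torch
import torch.nn as nn

def train_gan_explainer(generator, discriminator, graphs, epochs=10, lr=1e-3):
    """generator(graph) -> soft edge mask; discriminator(graph, mask) -> (1,) logit."""
    g_opt = torch.optim.Adam(generator.parameters(), lr=lr)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for graph, true_mask in graphs:            # true_mask: reference explanation (synthetic data)
            # Discriminator step: separate reference masks from generated ones.
            fake_mask = generator(graph).detach()
            d_loss = bce(discriminator(graph, true_mask), torch.ones(1)) + \
                     bce(discriminator(graph, fake_mask), torch.zeros(1))
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()
            # Generator step: produce masks the discriminator accepts as real.
            g_loss = bce(discriminator(graph, generator(graph)), torch.ones(1))
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return generator
```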
Graph neural networks (GNNs) have demonstrated significantly improved predictive performance on graph data. At the same time, the predictions of these models are often hard to interpret. In this regard, many efforts have been made to explain the prediction mechanisms of these models through approaches such as GNNExplainer, XGNN, and PGExplainer. Although such works present systematic frameworks for explaining GNNs, a holistic review of explainable GNNs is not yet available. In this survey, we present a comprehensive overview of the explainability techniques developed for GNNs. We focus on explainable graph neural networks and categorize them according to the explanation methods they use. We further provide common performance metrics for GNN explanations and point out several future research directions.
Video Super-Resolution (VSR) aims to restore high-resolution (HR) videos from low-resolution (LR) videos. Existing VSR techniques usually recover HR frames by extracting pertinent textures from nearby frames with known degradation processes. Despite significant progress, grand challenges remain in effectively extracting and transmitting high-quality textures from heavily degraded low-quality sequences affected by blur, additive noise, and compression artifacts. In this work, a novel Frequency-Transformer (FTVSR) is proposed for handling low-quality videos; it carries out self-attention in a combined space-time-frequency domain. First, video frames are split into patches, and each patch is transformed into spectral maps in which each channel represents a frequency band. This permits fine-grained self-attention on each frequency band, so that real visual texture can be distinguished from artifacts. Second, a novel dual frequency attention (DFA) mechanism is proposed to capture both global and local frequency relations, which can handle the various complicated degradation processes in real-world scenarios. Third, we explore different self-attention schemes for video processing in the frequency domain and discover that a "divided attention", which conducts a joint space-frequency attention before applying temporal-frequency attention, leads to the best video enhancement quality. Extensive experiments on three widely-used VSR datasets show that FTVSR outperforms state-of-the-art methods on different low-quality videos with clear visual margins. Code and pre-trained models are available at https://github.com/researchmm/FTVSR.
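To make the "divided attention" ordering concrete, here is a minimal sketch assuming a (batch, time, space, frequency, channel) token layout and standard multi-head attention; it only illustrates the attention ordering, not the full FTVSR architecture.

```python
# Sketch of divided attention: joint space-frequency attention per frame,
# followed by temporal-frequency attention per spatial location.
import torch
import torch.nn as nn

class DividedFreqAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.space_freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, S, F, C) = batch, frames, spatial patches, frequency bands, channels
        B, T, S, F, C = x.shape
        # Joint space-frequency attention within each frame.
        sf = x.reshape(B * T, S * F, C)
        sf, _ = self.space_freq_attn(sf, sf, sf)
        x = sf.reshape(B, T, S, F, C)
        # Temporal-frequency attention at each spatial location.
        tf = x.permute(0, 2, 1, 3, 4).reshape(B * S, T * F, C)
        tf, _ = self.time_freq_attn(tf, tf, tf)
        return tf.reshape(B, S, T, F, C).permute(0, 2, 1, 3, 4)
```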
We propose the first joint audio-video generation framework that delivers engaging watching and listening experiences simultaneously, toward high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process. Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian noise. To ensure semantic consistency across modalities, we propose a novel random-shift-based attention block bridging the two subnets, which enables efficient cross-modal alignment and thus mutually reinforces audio-video fidelity. Extensive experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on the Landscape and AIST++ dancing datasets. Turing tests with 10k votes further demonstrate dominant preferences for our model. The code and pre-trained models can be downloaded at https://github.com/researchmm/MM-Diffusion.
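The sketch below illustrates one plausible reading of a random-shift attention block: each audio token attends only to a small, randomly shifted window of video tokens, which keeps cross-modal attention cheap. The window size, wrap-around indexing, and residual connection are assumptions, not the released implementation.

```python
# Illustrative (not official) random-shift cross-modal attention.
import torch
import torch.nn as nn

class RandomShiftCrossAttention(nn.Module):
    def __init__(self, dim, heads=4, window=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens):
        # audio_tokens: (B, La, C), video_tokens: (B, Lv, C)
        Lv = video_tokens.shape[1]
        shift = torch.randint(0, Lv, (1,)).item()           # random start offset per call
        idx = (torch.arange(self.window) + shift) % Lv      # wrap-around window of video tokens
        window_kv = video_tokens[:, idx]                    # (B, window, C)
        out, _ = self.attn(audio_tokens, window_kv, window_kv)
        return out + audio_tokens                           # residual connection
```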
Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances with only category labels. Most methods adopt off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, the different optimization objectives of classification and localization cause the temporally localized results to suffer from severe incompleteness. To tackle this issue without additional annotations, this paper considers distilling free action knowledge from Vision-Language Pre-training (VLP), since we surprisingly observe that the localization results of vanilla VLP have an over-completeness issue, which is exactly complementary to the CBP results. To fuse this complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP, respectively. The framework is optimized through a dual-branch alternate training strategy: during the B step, we distill confident background pseudo-labels from the CBP branch, while during the F step, confident foreground pseudo-labels are distilled from the VLP branch. As a result, the dual-branch complementarity is effectively fused to form a strong alliance. Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 show that our method significantly outperforms state-of-the-art methods.
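A high-level sketch of the alternate B/F schedule might look like the following, with `cbp_branch` and `vlp_branch` as hypothetical objects exposing pseudo-label prediction and update methods; the epoch-level alternation and the confidence threshold are placeholders rather than the paper's exact procedure.

```python
# Alternate B/F training sketch for the two branches (interfaces are assumed).
def alternate_training(cbp_branch, vlp_branch, videos, epochs=10, thr=0.7):
    for epoch in range(epochs):
        for video in videos:
            if epoch % 2 == 0:
                # B step: distill confident background pseudo-labels from the CBP branch.
                bg_labels = cbp_branch.predict_background(video, confidence=thr)
                vlp_branch.update(video, background_pseudo_labels=bg_labels)
            else:
                # F step: distill confident foreground pseudo-labels from the VLP branch.
                fg_labels = vlp_branch.predict_foreground(video, confidence=thr)
                cbp_branch.update(video, foreground_pseudo_labels=fg_labels)
    return cbp_branch, vlp_branch
```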
Pre-trained image-text models such as CLIP have demonstrated the strong power of visual representations learned from large-scale web-collected image-text data. Given the well-learned visual features, some existing works transfer the image representations to the video domain and achieve good results. However, how to utilize image-language pre-trained models (e.g., CLIP) for video-language pre-training (post-pretraining) is still under-explored. In this paper, we investigate two questions: 1) what factors hinder post-pretrained CLIP from further improving performance on video-language tasks, and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have large impacts. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on top of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.
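One way to picture a video proxy mechanism is sketched below: a few learnable proxy tokens are prepended to the flattened frame tokens so that a transformer encoder can aggregate video-level information. The wrapper interface and the pooling at the end are assumptions, not the released CLIP-ViP code.

```python
# Hypothetical video-proxy wrapper around a generic token-level transformer encoder.
import torch
import torch.nn as nn

class VideoProxyWrapper(nn.Module):
    def __init__(self, encoder, dim, num_proxies=4):
        super().__init__()
        self.encoder = encoder                                   # any token-level transformer encoder
        self.proxies = nn.Parameter(torch.randn(num_proxies, dim) * 0.02)

    def forward(self, frame_tokens):
        # frame_tokens: (B, T, N, C) -> flatten frames into one token sequence
        B, T, N, C = frame_tokens.shape
        tokens = frame_tokens.reshape(B, T * N, C)
        proxies = self.proxies.unsqueeze(0).expand(B, -1, -1)    # (B, P, C)
        out = self.encoder(torch.cat([proxies, tokens], dim=1))  # joint encoding of proxies + frames
        return out[:, :self.proxies.shape[0]].mean(dim=1)        # pooled video-level embedding
```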
AI Illustrator aims to automatically design visually appealing images that inspire rich thoughts and emotions. To achieve this goal, we propose a framework that translates raw descriptions with complex semantics into semantically corresponding images. The main challenge lies in the complexity of the semantics of raw descriptions, which may be hard to visualize (e.g., ...); such descriptions usually pose a challenge for existing methods. To address this issue, we propose a Prompt-based Cross-Modal Generation Framework (PCM-Frame) that leverages two powerful pre-trained models, CLIP and StyleGAN. Our framework consists of two components: a projection module from text embeddings to image embeddings based on prompts, and an adapted image generation module that takes image embeddings as input and is trained with a combined semantic consistency loss. To bridge the gap between realistic images and illustration designs, we further adopt a stylization model as post-processing for better visual effects. Benefiting from the pre-trained models, our method can handle complex descriptions and does not require external paired data for training. Furthermore, we have built a benchmark consisting of 200 raw descriptions. We conducted a user study to demonstrate our superiority over competing methods on complex texts. We release our code at https://github.com/researchmm/ai_illustrator.
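The overall pipeline can be summarized by the schematic below, where `clip_text_encoder`, `projector`, `stylegan`, and `stylizer` are hypothetical callables standing in for the pre-trained CLIP text encoder, the prompt-based projection module, the adapted StyleGAN-based generator, and the post-processing stylization model.

```python
# Schematic of the description-to-illustration pipeline (placeholder callables).
def illustrate(description, clip_text_encoder, projector, stylegan, stylizer=None):
    text_emb = clip_text_encoder(description)   # CLIP text embedding of the raw description
    image_emb = projector(text_emb)             # prompt-based projection: text -> image embedding
    image = stylegan(image_emb)                 # generate an image from the image embedding
    if stylizer is not None:
        image = stylizer(image)                 # post-process toward illustration style
    return image
```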
Image enhancement aims to improve the aesthetic visual quality of photos by retouching color and tone, and is an essential technology for professional digital photography. In recent years, learning-based image enhancement algorithms have achieved promising performance and attracted increasing popularity. However, typical efforts attempt to build a uniform enhancer that applies the same color transformation to all pixels. They ignore the differences between pixels belonging to different contents (e.g., sky, ocean, etc.) that matter in a photo, leading to unsatisfactory results. In this paper, we propose a novel learnable Context-aware 4-Dimensional Lookup Table (4D LUT), which achieves content-dependent enhancement of different contents in each image by adaptively learning the photo's context. In particular, we first introduce a lightweight context encoder and a parameter encoder to learn a pixel-level context map and a group of image-adaptive coefficients, respectively. Then, the context-aware 4D LUT is generated by integrating multiple basis 4D LUTs via the coefficients. Finally, the enhanced image can be obtained by feeding the source image and the context map into the fused context-aware 4D LUT. Compared with the traditional 3D LUT (i.e., RGB mapping to RGB) commonly used in camera imaging pipelines and retouching tools, the 4D LUT (i.e., RGBC, RGB + Context, mapping to RGB) enables finer control of color transformations for pixels with different contents in each image, even if they have the same RGB values. Experimental results demonstrate that our method outperforms other state-of-the-art methods on widely-used benchmarks.
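A toy sketch of the RGBC lookup is shown below: basis 4D LUTs are fused with image-adaptive coefficients and then indexed per pixel by (R, G, B, Context). Nearest-neighbor indexing is used here purely for brevity, whereas a real implementation would interpolate, and the array shapes are assumptions.

```python
# Toy RGBC lookup with fused basis 4D LUTs (nearest-neighbor, no interpolation).
import numpy as np

def enhance_with_4d_lut(image, context_map, basis_luts, coeffs, bins=17):
    """image: (H, W, 3) in [0, 1]; context_map: (H, W) in [0, 1];
    basis_luts: (K, bins, bins, bins, bins, 3); coeffs: (K,) image-adaptive weights."""
    # Fuse basis LUTs into one context-aware 4D LUT using the coefficients.
    fused_lut = np.tensordot(coeffs, basis_luts, axes=1)     # (bins, bins, bins, bins, 3)
    # Build per-pixel RGBC coordinates and quantize them to LUT indices.
    rgbc = np.concatenate([image, context_map[..., None]], axis=-1)
    idx = np.clip((rgbc * (bins - 1)).round().astype(int), 0, bins - 1)
    # Look up the enhanced RGB value for every pixel.
    return fused_lut[idx[..., 0], idx[..., 1], idx[..., 2], idx[..., 3]]
```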
In this work, we explore data augmentation for knowledge distillation in semantic segmentation. To avoid overfitting to the noise in the teacher network, a large number of training examples is essential for knowledge distillation. Image-level augmentation techniques such as flipping, translation, or rotation are widely used in previous knowledge distillation frameworks. Inspired by recent progress on semantic directions in feature space, we propose to include augmentations in feature space for effective distillation. Specifically, given a semantic direction, an infinite number of augmentations can be obtained for the student in the feature space. Furthermore, our analysis shows that these augmentations can be optimized simultaneously by minimizing an upper bound of the augmentation loss. Based on this observation, a new algorithm is developed for knowledge distillation in semantic segmentation. Extensive experiments on four semantic segmentation benchmarks demonstrate that the proposed method can boost the performance of current knowledge distillation methods without any significant overhead. Code is available at: https://github.com/jianlong-yuan/FAKD.
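A minimal sketch of sampling feature-space augmentations along semantic directions is given below; it does not reproduce the closed-form upper-bound optimization mentioned above, and the use of class-wise covariance matrices as the source of semantic directions is an assumption borrowed from related feature-augmentation work.

```python
# Sample feature-space augmentations by perturbing student features along
# directions drawn from class-conditional covariance (illustrative only).
import torch

def augment_features(student_feats, labels, class_covariances, strength=0.5):
    """student_feats: (N, D); labels: (N,); class_covariances: dict label -> (D, D)."""
    augmented = []
    for feat, label in zip(student_feats, labels):
        cov = class_covariances[int(label)]
        direction = torch.distributions.MultivariateNormal(
            torch.zeros_like(feat), covariance_matrix=cov).sample()
        augmented.append(feat + strength * direction)   # move along a sampled semantic direction
    return torch.stack(augmented)
```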