Dialogue state tracking (DST) aims to convert the dialogue history into dialogue states which consist of slot-value pairs. As condensed structural information memorizing all history information, the dialogue state in the last turn is typically adopted as the input for predicting the current state by DST models. However, these models tend to keep the predicted slot values unchanged, which is defined as state momentum in this paper. Specifically, the models struggle to update slot values that need to be changed and correct wrongly predicted slot values in the last turn. To this end, we propose MoNET to tackle state momentum via noise-enhanced training. First, the previous state of each turn in the training data is noised via replacing some of its slot values. Then, the noised previous state is used as the input to learn to predict the current state, improving the model's ability to update and correct slot values. Furthermore, a contrastive context matching framework is designed to narrow the representation distance between a state and its corresponding noised variant, which reduces the impact of noised state and makes the model better understand the dialogue history. Experimental results on MultiWOZ datasets show that MoNET outperforms previous DST methods. Ablations and analysis verify the effectiveness of MoNET in alleviating state momentum and improving anti-noise ability.
translated by 谷歌翻译
User-generated-content (UGC) videos have dominated the Internet during recent years. While many methods attempt to objectively assess the quality of these UGC videos, the mechanisms of human quality perception in the UGC-VQA problem is still yet to be explored. To better explain the quality perception mechanisms and learn more robust representations, we aim to disentangle the effects of aesthetic quality issues and technical quality issues risen by the complicated video generation processes in the UGC-VQA problem. To overcome the absence of respective supervisions during disentanglement, we propose the Limited View Biased Supervisions (LVBS) scheme where two separate evaluators are trained with decomposed views specifically designed for each issue. Composed of an Aesthetic Quality Evaluator (AQE) and a Technical Quality Evaluator (TQE) under the LVBS scheme, the proposed Disentangled Objective Video Quality Evaluator (DOVER) reach excellent performance (0.91 SRCC for KoNViD-1k, 0.89 SRCC for LSVQ, 0.88 SRCC for YouTube-UGC) in the UGC-VQA problem. More importantly, our blind subjective studies prove that the separate evaluators in DOVER can effectively match human perception on respective disentangled quality issues. Codes and demos are released in https://github.com/teowu/dover.
translated by 谷歌翻译
随着非专家们拍摄的野外视频的快速增长,盲目视频质量评估(VQA)已成为一个具有挑战性且苛刻的问题。尽管已经做出了许多努力来解决这个问题,但尚不清楚人类视觉系统(HVS)与视频的时间质量有何关系。同时,最近的工作发现,自然视频的框架变成了HV的感知领域,往往会形成表示形式的直线轨迹。通过获得的洞察力,即失真会损害感知的视频质量并导致感知表示的弯曲轨迹,我们提出了一个时间感知质量指数(TPQI),以通过描述表示形式的图形形态来测量时间失真。具体而言,我们首先从HVS的横向基因核(LGN)和主要视觉区域(V1)中提取视频感知表示,然后测量其轨迹的直率和紧凑性,以量化视频的自然性和内容连续性的降解。实验表明,HVS中的感知表示是一种预测主观时间质量的有效方法,因此TPQI首次可以实现与空间质量度量的可比性能,并且在评估具有较大时间变化的视频方面更加有效。我们进一步证明,通过与NIQE(空间质量指标)结合使用,TPQI可以在流行的野外视频数据集中实现最佳性能。更重要的是,除了要评估的视频之外,TPQI不需要任何其他信息,因此可以将其应用于任何数据集,而无需参数调整。源代码可在https://github.com/uolmm/tpqi-vqa上找到。
translated by 谷歌翻译
当前的深度视频质量评估(VQA)方法通常在评估高分辨率视频时具有高计算成本。这使他们无法通过端到端培训学习更好的视频质量相关表示。现有方法通常考虑幼稚的采样以降低计算成本,例如调整大小和裁剪。但是,它们显然在视频中损坏了与质量相关的信息,因此并不是学习VQA的良好表示形式的最佳选择。因此,渴望为VQA设计一种新的质量保留抽样方案。在本文中,我们提出了网格迷你斑点采样(GMS),该采样允许通过在原始分辨率下采样贴片来考虑局部质量,并通过以统一网格采样的迷你绘制来涵盖全球质量。这些迷你斑点是剪接和对齐的,称为片段。我们进一步构建了专门设计的碎片注意网络(粉丝),以适应碎片作为输入。由片段和粉丝组成,VQA(快速VQA)提出的片段样品变压器可实现有效的端到端深VQA,并学习有效的与视频质量相关的表示。它可以提高最新准确性约10%,同时减少1080p高分辨率视频的99.5%的失败。新学习的与视频质量相关的表示形式也可以转移到较小的VQA数据集中,从而在这些情况下提高性能。广泛的实验表明,Fast-VQA在各种分辨率的输入方面具有良好的性能,同时保持高效率。我们在https://github.com/timothyhtimothy/fast-vqa上发布代码。
translated by 谷歌翻译
在现有作品中,框架及其对视频质量评估(VQA)的影响之间的时间关系仍然不足。这些关系导致视频质量的两种重要效果类型。首先,某些时间变化(例如摇动,闪烁和突然的场景过渡)会导致时间扭曲并导致额外的质量降解,而其他变化(例如,与有意义的事件相关的变化)却没有。其次,人类视觉系统通常对具有不同内容的框架有不同的关注,从而导致其对整体视频质量的重要性不同。基于变压器的突出时间序列建模能力,我们提出了一种新颖有效的基于变压器的VQA方法来解决这两个问题。为了更好地区分时间变化,从而捕获了时间变形,我们设计了一个基于变压器的时空扭曲提取(STDE)模块。为了解决时间质量的关注,我们提出了类似编码器的时间含量变压器(TCT)。我们还介绍了功能上的时间抽样,以减少TCT的输入长度,以提高该模块的学习效率和效率。由STDE和TCT组成,用于视频质量评估(DISCOVQA)的拟议的时间失真符合变压器(DISCOVQA)在几个VQA基准上达到了最新的性能,而无需任何额外的预训练数据集,多达10%的概括能力提高了10%比现有方法。我们还进行了广泛的消融实验,以证明我们提出的模型中每个部分的有效性,并提供可视化以证明所提出的模块实现了我们对这些时间问题进行建模的意图。我们将在以后发布我们的代码和预算权重。
translated by 谷歌翻译
由于客户端的通信资源有限和大量的模型参数,大规模分布式学习任务遭受通信瓶颈。梯度压缩是通过传输压缩梯度来减少通信负载的有效方法。由于在随机梯度下降的情况下,相邻轮的梯度可能具有高相关,因为他们希望学习相同的模型,提出了一种用于联合学习的实用梯度压缩方案,它使用历史梯度来压缩梯度并且基于Wyner-Ziv编码但没有任何概率的假设。我们还在实时数据集上实现了我们的渐变量化方法,我们的方法的性能优于前一个方案。
translated by 谷歌翻译
合奏方法是将多种模型相结合以实现卓越性能的可靠方法。但是,关于集合方法在遥感对象检测方案中的应用的研究大多被忽略了。出现了两个问题。首先,遥感对象检测的一个独特特征是对象的定向边界框(OBB)和多个OBB的融合需要进一步的研究注意。其次,广泛使用的深度学习对象检测器为每个检测到的对象提供了一个分数作为置信度的指标,但是如何在集合方法中有效使用这些指标仍然是一个问题。试图解决这些问题,本文提出了与OBB兼容的合奏方法,并以学习的方式结合了检测结果。这种合奏方法有助于在挑战轨道\ textit {高分辨率光学图像中的细粒对象识别}中排名第一,该{\ textit {2021 Gaofen挑战在自动化高分辨率的地球观测图像}中均具有特征。 DOTA数据集和FAIR1M数据集的实验表明,分析了Obbstacking的性能以及Obbstacking的功能。
translated by 谷歌翻译
近年来,基于脑电图的情绪识别的进步已受到人机相互作用和认知科学领域的广泛关注。但是,如何用有限的标签识别情绪已成为一种新的研究和应用瓶颈。为了解决这个问题,本文提出了一个基于人类中刺激一致的脑电图信号的自我监督组减数分裂对比学习框架(SGMC)。在SGMC中,开发了一种新型遗传学启发的数据增强方法,称为减数分裂。它利用了组中脑电图样品之间的刺激对齐,通过配对,交换和分离来生成增强组。该模型采用组投影仪,从相同的情感视频刺激触发的脑电图样本中提取组级特征表示。然后,使用对比度学习来最大程度地提高具有相同刺激的增强群体的组级表示的相似性。 SGMC在公开可用的DEAP数据集上实现了最先进的情感识别结果,其价值为94.72%和95.68%的价和唤醒维度,并且在公共种子数据集上的竞争性能也具有94.04的竞争性能。 %。值得注意的是,即使使用有限的标签,SGMC也会显示出明显的性能。此外,功能可视化的结果表明,该模型可能已经学习了与情感相关的特征表示,以改善情绪识别。在超级参数分析中进一步评估了组大小的影响。最后,进行了对照实验和消融研究以检查建筑的合理性。该代码是在线公开提供的。
translated by 谷歌翻译
A recent study has shown a phenomenon called neural collapse in that the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
translated by 谷歌翻译
Weakly-supervised object localization aims to indicate the category as well as the scope of an object in an image given only the image-level labels. Most of the existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map to perceive the whole object, yet ignore the co-occurrence confounder of the object and context (e.g., fish and water), which makes the model inspection hard to distinguish object boundaries. Besides, the use of CAM also brings a dilemma problem that the classification and localization always suffer from a performance gap and can not reach their highest accuracy simultaneously. In this paper, we propose a casual knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object feature, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and addressing the dilemma problem between classification and localization performance.
translated by 谷歌翻译