Diffusion models have achieved state-of-the-art synthesis quality on visual and audio tasks, and recent works adapt them to textual data by diffusing in the embedding space. However, the differences between the continuous data space and the embedding space raise challenges for the diffusion model that have not been carefully explored. In this paper, we conduct systematic studies of three such challenges. First, the embedding distribution is learnable rather than fixed, which may cause the loss function to collapse. Second, since the norm of an embedding varies between popular and rare words, adding noise at the same scale leads to sub-optimal results. Third, we find that noise sampled from a standard Gaussian distribution may distract the diffusion process. To address these challenges, we propose Difformer, a Transformer-based denoising diffusion probabilistic model that combines three techniques: an anchor loss function, a layer normalization module for embeddings, and a norm factor applied to the Gaussian noise. The techniques are complementary and jointly critical to the model's performance. Experiments on benchmark datasets for two representative text generation tasks, machine translation and text summarization, show that Difformer significantly outperforms embedding diffusion baselines while achieving results competitive with strong autoregressive baselines.
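The abstract does not give implementation details; as a rough illustration of the two embedding-space fixes it names, the sketch below applies layer normalization to token embeddings and rescales the Gaussian noise by a norm factor inside an otherwise standard DDPM forward step. The class name, interface, and default factor are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class EmbeddingDiffusionForward(nn.Module):
    """Illustrative forward-diffusion step on token embeddings (not the official Difformer code)."""

    def __init__(self, vocab_size, dim, noise_factor=1.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        # Layer normalization keeps popular and rare word embeddings on a comparable scale.
        self.emb_norm = nn.LayerNorm(dim)
        # Scalar factor applied to the standard Gaussian noise (assumed hyper-parameter).
        self.noise_factor = noise_factor

    def forward(self, token_ids, alpha_bar_t):
        """alpha_bar_t: cumulative noise-schedule term for timestep t, shape (batch, 1, 1)."""
        x0 = self.emb_norm(self.embedding(token_ids))          # normalized embeddings
        noise = self.noise_factor * torch.randn_like(x0)       # rescaled Gaussian noise
        # Standard DDPM forward process: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
        xt = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1.0 - alpha_bar_t) * noise
        return xt, noise
```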
Facial Expression Recognition (FER) in the wild is an extremely challenging task. Recently, Vision Transformers (ViTs) have been explored for FER, but most of them underperform Convolutional Neural Networks (CNNs). This is mainly because the newly proposed modules, lacking inductive bias, are difficult to train well from scratch and tend to focus on occluded and noisy regions. TransFER, a representative transformer-based method for FER, alleviates this with multi-branch attention dropping but incurs excessive computation. In contrast, we present two attentive pooling (AP) modules that pool noisy features directly. The AP modules comprise Attentive Patch Pooling (APP) and Attentive Token Pooling (ATP). They guide the model to emphasize the most discriminative features while reducing the impact of less relevant ones. APP selects the most informative patches on CNN features, and ATP discards unimportant tokens in the ViT. Being simple to implement and free of learnable parameters, APP and ATP reduce computational cost while boosting performance by pursuing only the most discriminative features. Qualitative results illustrate the motivation and effectiveness of our attentive pooling modules, and quantitative results on six in-the-wild datasets show that our method outperforms other state-of-the-art approaches.
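The ATP module is described only as a parameter-free step that discards unimportant ViT tokens; below is a minimal sketch of one plausible reading, keeping the top-k patch tokens ranked by their attention from the class token. The ranking criterion and keep ratio are assumptions.

```python
import torch

def attentive_token_pooling(tokens, cls_attention, keep_ratio=0.5):
    """Parameter-free token pooling sketch: keep the tokens most attended to by the [CLS] token.

    tokens:        (batch, num_tokens, dim) patch tokens (excluding the class token)
    cls_attention: (batch, num_tokens) attention weights from the class token to each patch token
    """
    batch, num_tokens, dim = tokens.shape
    k = max(1, int(num_tokens * keep_ratio))
    # Indices of the k most-attended tokens per example.
    topk_idx = cls_attention.topk(k, dim=1).indices                 # (batch, k)
    gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, dim)         # (batch, k, dim)
    return tokens.gather(1, gather_idx)                             # (batch, k, dim)
```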
Video dubbing aims to translate the original speech in a film or television program into speech in a target language, which can be achieved with a cascaded system consisting of speech recognition, machine translation, and speech synthesis. To ensure that the translated speech is well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech, which requires strict length control. Previous works usually constrain the number of words or characters generated by the machine translation model to be similar to that of the source sentence, without accounting for speech isochrony, since the spoken duration of words/characters varies across languages. In this paper, we propose a machine translation system tailored for video dubbing that directly considers the speech duration of each token in translation, so as to match the lengths of the source and target speech. Specifically, we control the speech length of the generated sentence by guiding the prediction of each word with duration information, including its own speech duration and how much duration is left for the remaining words. We design experiments on four language directions (German -> English, Spanish -> English, Chinese <-> English), and the results show that the proposed method achieves better length control over the generated speech than baseline methods. To make up for the lack of real-world datasets, we also construct a real-world test set collected from films to provide a comprehensive evaluation of the video dubbing task.
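The abstract says each prediction is guided by the token's own speech duration and the duration left for the remaining words, but gives no formulas; the sketch below shows one plausible way to inject a remaining-duration feature into a decoder step. The bucketing scheme and module names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DurationGuidedDecoderStep(nn.Module):
    """Sketch: condition each decoding step on how much speech duration remains (illustrative only)."""

    def __init__(self, vocab_size, dim, num_duration_bins=64, max_duration=20.0):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.duration_emb = nn.Embedding(num_duration_bins, dim)  # bucketed remaining duration
        self.proj = nn.Linear(dim, vocab_size)
        self.num_bins = num_duration_bins
        self.max_duration = max_duration

    def forward(self, prev_token, remaining_duration):
        """prev_token: (batch,) last generated token; remaining_duration: (batch,) seconds left."""
        bins = (remaining_duration / self.max_duration * (self.num_bins - 1)).clamp(
            0, self.num_bins - 1).long()
        h = self.token_emb(prev_token) + self.duration_emb(bins)   # fuse token and duration info
        return self.proj(h)                                        # next-token logits
```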
The emerging sixth generation (6G) is an integration of heterogeneous wireless networks that can seamlessly support networking anywhere and at any time. However, 6G should provide high-quality trust to meet the expectations of mobile users. Artificial intelligence (AI) is considered one of the most important components of 6G, and AI-based trust management is therefore a promising paradigm for providing trusted and reliable services. In this paper, a generative adversarial learning-based trust management method is presented for 6G wireless networks. Some typical AI-based trust management schemes are first reviewed, and a potential heterogeneous and intelligent 6G architecture is introduced. Next, the integration of AI and trust management is developed to optimize both intelligence and security. Finally, the proposed AI-based trust management method is applied to secure clustering to achieve reliable and real-time communication. Simulation results demonstrate its excellent performance in guaranteeing network security and service quality.
Owing to their unique characteristics and constraints, trusted and reliable data transmission is a challenging task for wireless sensor networks (WSNs). To achieve secure data transmission and resolve the conflict between security and energy, in this paper we propose an evolutionary game-based secure clustering protocol with fuzzy trust evaluation and outlier detection for WSNs. First, a fuzzy trust evaluation method is presented to transform transmission evidence into trust values while effectively alleviating trust uncertainty. Then, a K-means-based outlier detection scheme is proposed to further analyze the large number of trust values obtained through fuzzy trust evaluation or trust recommendation. It can discover the commonalities and differences among sensor nodes while improving the accuracy of outlier detection. Finally, we propose an evolutionary game-based secure clustering protocol to strike a trade-off between security assurance and energy saving when electing cluster heads. Sensor nodes that fail in the election can securely choose their own heads by isolating suspicious nodes. Simulation results verify that our secure clustering protocol can effectively defend the network against attacks from internal selfish or compromised nodes, and the timely data transmission rate can be significantly improved accordingly.
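The abstract only names a K-means-based outlier detection scheme over trust values; below is a minimal sketch of one such scheme, flagging values that lie far from their assigned cluster centroid. The number of clusters and the distance threshold are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def detect_trust_outliers(trust_values, n_clusters=2, z_threshold=2.0):
    """Flag trust values lying unusually far from their K-means centroid (illustrative sketch).

    trust_values: 1-D array of trust scores collected for the sensor nodes.
    Returns a boolean mask, True where a value is treated as an outlier.
    """
    x = np.asarray(trust_values, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(x)
    centroids = km.cluster_centers_[km.labels_]            # centroid assigned to each point
    dist = np.abs(x - centroids).ravel()                   # distance to own centroid
    # A point is an outlier if its distance exceeds mean + z * std of all distances.
    return dist > dist.mean() + z_threshold * dist.std()
```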
Security is one of the major concerns in industrial wireless sensor networks (IWSNs). To ensure security in clustered IWSNs, this paper presents a secure clustering protocol with fuzzy trust evaluation and outlier detection (SCFTO). First, to handle transmission uncertainty in the open wireless medium, an interval type-2 fuzzy logic controller is adopted to estimate trust. Then, a density-based outlier detection mechanism is introduced to obtain an adaptive trust threshold used for isolating cluster heads. Finally, a fuzzy-based cluster head election method is proposed to strike a balance between energy saving and security assurance, so that a normal sensor node with more residual energy or higher confidence from other nodes has a higher probability of becoming a cluster head. Extensive experiments verify that our secure clustering protocol can effectively defend the network against attacks from internal malicious or compromised nodes.
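The density-based outlier detection mechanism and its adaptive trust threshold are only named, not specified; the sketch below illustrates one simple density-based reading in which trust values with few neighbors within a radius are treated as outliers and the threshold sits just below the lowest dense value. The radius and neighbor count are assumptions.

```python
import numpy as np

def adaptive_trust_threshold(trust_values, radius=0.05, min_neighbors=3):
    """Density-based sketch: low-density trust values are outliers; the threshold separates them.

    trust_values: 1-D array of cluster-head trust scores in [0, 1].
    Returns (threshold, outlier_mask).
    """
    x = np.asarray(trust_values, dtype=float)
    # Count, for each value, how many other values fall within the radius.
    density = np.array([np.sum(np.abs(x - v) <= radius) - 1 for v in x])
    outliers = density < min_neighbors                      # sparse points are suspicious
    inliers = x[~outliers]
    # Adaptive threshold: the lowest "dense" trust value (fall back to the median).
    threshold = inliers.min() if inliers.size else float(np.median(x))
    return threshold, outliers
```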
5G edge computing-enabled Internet of Medical Things (IoMT) is an efficient technology for providing decentralized medical services, while device-to-device (D2D) communication is a promising paradigm for future 5G networks. To ensure secure and reliable communication in 5G edge computing and D2D-enabled IoMT systems, this paper presents an intelligent trust cloud management method. First, an active training mechanism is proposed to construct the standard trust clouds. Second, individual trust clouds of IoMT devices can be established through trust inference and recommendation. Third, a trust classification scheme is proposed to determine whether an IoMT device is malicious. Finally, a trust cloud update mechanism is presented to make the proposed trust management method adaptive and intelligent in the open wireless medium. Simulation results show that the proposed method can effectively address the trust uncertainty problem and improve the detection accuracy of malicious devices.
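The abstract relies on the cloud-model notion of a trust cloud but gives no formulas; assuming the standard backward cloud generator from cloud model theory, a device's trust cloud can be summarized by its expectation, entropy, and hyper-entropy as sketched below. The paper may use a different estimator.

```python
import numpy as np

def trust_cloud_features(trust_samples):
    """Backward cloud generator (standard cloud-model formulas) applied to trust samples.

    Returns (Ex, En, He): expectation, entropy, and hyper-entropy of the trust cloud.
    Illustrative sketch only; not necessarily the construction used in the paper.
    """
    x = np.asarray(trust_samples, dtype=float)
    ex = x.mean()                                            # expectation Ex
    en = np.sqrt(np.pi / 2.0) * np.mean(np.abs(x - ex))      # entropy En
    he = np.sqrt(abs(x.var(ddof=1) - en ** 2))               # hyper-entropy He
    return ex, en, he
```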
Edge-enabled Industrial Internet of Things (IIoT) platforms are of great significance for accelerating the development of smart industry. However, with the dramatic increase in real-time IIoT applications, supporting fast response times, low latency, and efficient bandwidth utilization is a major challenge. To address this, Time-Sensitive Networking (TSN) has recently been studied to enable low-latency communication through deterministic scheduling. To the best of our knowledge, the composability of multiple flows, which can seriously affect scheduling performance, has never been systematically analyzed before. In this paper, we first analyze the composability problem. We then propose a non-collision theory based deterministic scheduling (NDS) method to achieve ultra-low-latency communication for time-sensitive flows. Furthermore, to improve bandwidth utilization, a dynamic queue scheduling (DQS) method is presented for best-effort flows. Experimental results demonstrate that NDS/DQS can well support deterministic ultra-low-latency services while ensuring efficient bandwidth utilization.
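The abstract gives no scheduling details; a minimal sketch of the non-collision idea it alludes to is shown below: two periodic flows do not collide if their transmission windows never overlap across the hyperperiod. The offsets, periods, and frame lengths are hypothetical inputs, not parameters from the paper.

```python
from math import gcd

def flows_collide(offset_a, period_a, len_a, offset_b, period_b, len_b):
    """Check whether two periodic transmissions ever overlap within their hyperperiod (sketch).

    Offsets, periods, and lengths are in the same time unit (e.g. microseconds).
    """
    hyperperiod = period_a * period_b // gcd(period_a, period_b)
    for sa in range(offset_a, hyperperiod, period_a):
        for sb in range(offset_b, hyperperiod, period_b):
            # Windows [sa, sa+len_a) and [sb, sb+len_b) overlap if neither ends before the other starts.
            if sa < sb + len_b and sb < sa + len_a:
                return True
    return False

# Example: a 1000 us-period flow at offset 0 and a 2000 us-period flow at offset 500, 100 us frames.
print(flows_collide(0, 1000, 100, 500, 2000, 100))   # False: their windows never overlap
```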
Deep neural networks have exhibited remarkable performance on image super-resolution (SR) tasks by learning a mapping from low-resolution (LR) images to high-resolution (HR) images. However, the SR problem is typically ill-posed, and existing methods suffer from several limitations. First, the possible mapping space of SR can be extremely large, since many different HR images can be downsampled to the same LR image. As a result, it is hard to directly learn a promising SR mapping from such a large space. Second, it is often inevitable to develop very large models with extremely high computational cost to achieve promising SR performance. In practice, model compression techniques can be used to obtain compact models by reducing model redundancy; however, due to the extremely large SR mapping space, it is hard for existing compression methods to accurately identify the redundant components. To alleviate the first challenge, we propose a dual regression learning scheme to reduce the space of possible SR mappings. Specifically, in addition to the mapping from LR to HR images, we learn an additional dual regression mapping that estimates the downsampling kernel and reconstructs the LR images. In this way, the dual mapping acts as a constraint that reduces the space of possible mappings. To address the second challenge, we propose a lightweight dual regression compression method that reduces model redundancy at both the layer level and the channel level based on channel pruning. Specifically, we first develop a channel number search method that minimizes the dual regression loss to determine the redundancy of each layer. Given the searched channel numbers, we further exploit the dual regression scheme to evaluate the importance of channels and prune the redundant ones. Extensive experiments show the effectiveness of our method in obtaining accurate and efficient SR models.
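The abstract describes the dual regression constraint only in words; the sketch below shows the usual form of such a loss, in which the primal network maps LR to HR, a dual network maps the prediction back to LR, and the two reconstruction errors are combined. The network definitions and the weighting factor are assumptions.

```python
import torch
import torch.nn as nn

def dual_regression_loss(primal_net, dual_net, lr, hr, lam=0.1):
    """Combine the primal SR loss with a dual LR-reconstruction loss (illustrative sketch).

    primal_net: maps LR -> predicted HR
    dual_net:   maps HR -> reconstructed LR (a closed loop constraining the SR mapping)
    """
    sr = primal_net(lr)                                  # predicted HR image
    lr_rec = dual_net(sr)                                # map the prediction back down to LR
    primal_loss = nn.functional.l1_loss(sr, hr)          # usual SR objective
    dual_loss = nn.functional.l1_loss(lr_rec, lr)        # cycle constraint on the mapping space
    return primal_loss + lam * dual_loss
```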
Due to their conditional independence assumption, non-autoregressive translation (NAT) models have difficulty capturing the multi-modal distribution of target translations, which is known as the "multi-modality problem" and includes lexical multi-modality and syntactic multi-modality. While the former has been well studied, syntactic multi-modality brings a severe challenge to the standard cross-entropy (XE) loss in NAT and remains under-studied. In this paper, we conduct a systematic study of the syntactic multi-modality problem. Specifically, we decompose it into short-range and long-range syntactic multi-modality and evaluate several NAT algorithms with advanced loss functions on carefully designed synthetic datasets as well as real datasets. We find that the Connectionist Temporal Classification (CTC) loss and the Order-Agnostic Cross-Entropy (OAXE) loss can better handle short-range and long-range syntactic multi-modality, respectively. Furthermore, we combine the strengths of both and design new loss functions to better handle the complicated syntactic multi-modality in real-world datasets. To facilitate practical usage, we provide a guide for choosing different loss functions for different kinds of syntactic multi-modality.
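The abstract names the CTC and OAXE losses without implementation details; as a small illustration, the sketch below shows how a CTC loss over an upsampled NAT decoder output can be computed with PyTorch's built-in nn.CTCLoss. The tensor shapes and blank index are hypothetical.

```python
import torch
import torch.nn as nn

# CTC requires the decoder output to be longer than the target, so NAT decoders typically
# upsample the source length; here we assume length-20 outputs for length-10 targets.
batch, out_len, tgt_len, vocab = 4, 20, 10, 1000
blank_id = 0

log_probs = torch.randn(out_len, batch, vocab).log_softmax(-1)   # (T, N, V), as nn.CTCLoss expects
targets = torch.randint(1, vocab, (batch, tgt_len))              # reference tokens (no blanks)
input_lengths = torch.full((batch,), out_len, dtype=torch.long)
target_lengths = torch.full((batch,), tgt_len, dtype=torch.long)

ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)    # marginalizes over alignments
print(loss.item())
```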