Subjective image-quality measurement plays a critical role in the development of image-processing applications. The purpose of a visual-quality metric is to approximate the results of subjective assessment. In this regard, more and more metrics are under development, but little research has considered their limitations. This paper addresses that deficiency: we show how image preprocessing before compression can artificially increase the quality scores provided by the popular metrics DISTS, LPIPS, HaarPSI, and VIF as well as how these scores are inconsistent with subjective-quality scores. We propose a series of neural-network preprocessing models that increase DISTS by up to 34.5%, LPIPS by up to 36.8%, VIF by up to 98.0%, and HaarPSI by up to 22.6% in the case of JPEG-compressed images. A subjective comparison of preprocessed images showed that for most of the metrics we examined, visual quality drops or stays unchanged, limiting the applicability of these metrics.
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, has an advantage over cascaded approaches: it achieves fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and subsequently predicts discrete acoustic units. We enhance model performance through subword prediction in the first-pass decoder, an advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder on a self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with a 2.83x decoding speed-up. We show that the proposed methods boost performance even when predicting a spectrogram in the second pass. However, predicting discrete units achieves a 2.51x decoding speed-up compared to that case.
Denoising diffusion models (DDMs) have led to staggering performance leaps in image generation, editing and restoration. However, existing DDMs use very large datasets for training. Here, we introduce a framework for training a DDM on a single image. Our method, which we coin SinDDM, learns the internal statistics of the training image by using a multi-scale diffusion process. To drive the reverse diffusion process, we use a fully-convolutional light-weight denoiser, which is conditioned on both the noise level and the scale. This architecture allows generating samples of arbitrary dimensions, in a coarse-to-fine manner. As we illustrate, SinDDM generates diverse high-quality samples, and is applicable in a wide array of tasks, including style transfer and harmonization. Furthermore, it can be easily guided by external supervision. Particularly, we demonstrate text-guided generation from a single image using a pre-trained CLIP model.
In recent years, display intensity and contrast have increased considerably. Many displays support high dynamic range (HDR) and 10-bit color depth. Since high bit-depth is an emerging technology, video content is still largely shot and transmitted with a bit depth of 8 bits or less per color component. Insufficient bit-depths produce distortions called false contours or banding, and they are visible on high contrast screens. To deal with such distortions, researchers have proposed algorithms for bit-depth enhancement (dequantization). Such techniques convert videos with low bit-depth (LBD) to videos with high bit-depth (HBD). The quality of converted LBD video, however, is usually lower than that of the original HBD video, and many consumers prefer to keep the original HBD versions. In this paper, we propose an algorithm to determine whether a video has undergone conversion before compression. This problem is complex; it involves detecting outcomes of different dequantization algorithms in the presence of compression that strongly affects the least-significant bits (LSBs) in the video frames. Our algorithm can detect bit-depth enhancement and demonstrates good generalization capability, as it is able to determine whether a video has undergone processing by dequantization algorithms absent from the training dataset.
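As a rough illustration of why low bit depth causes banding and why conversion leaves traces in the least-significant bits, the sketch below quantizes a smooth 10-bit ramp to 8 bits and restores it by zero-padding. The function names and the zero-padding scheme are assumptions chosen for illustration; this is not the detection algorithm proposed in the paper, and it ignores compression entirely.

```python
def quantize_to_8bit(sample_10bit: int) -> int:
    """Drop the two least-significant bits of a 10-bit sample (0..1023)."""
    return sample_10bit >> 2


def naive_dequantize(sample_8bit: int) -> int:
    """Zero-padding dequantization: shift back to 10 bits; the two LSBs stay zero."""
    return sample_8bit << 2


def distinct_levels(samples) -> int:
    """Count how many different code values appear in a run of samples."""
    return len(set(samples))


# A smooth 10-bit gradient made of 64 consecutive code values.
gradient_hbd = list(range(512, 576))

# The same gradient stored at 8 bits, then converted back to 10 bits.
gradient_lbd = [quantize_to_8bit(v) for v in gradient_hbd]
gradient_restored = [naive_dequantize(v) for v in gradient_lbd]

# Fewer distinct levels over the same ramp means visible steps (banding).
print(distinct_levels(gradient_hbd), distinct_levels(gradient_restored))
```

In this toy case every restored sample is a multiple of 4, so the conversion is trivially visible in the LSBs. More elaborate dequantization methods and subsequent compression alter those bits, which is what makes the detection problem described above hard.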
The meaningful use of electronic health records (EHR) continues to progress in the digital era with clinical decision support systems augmented by artificial intelligence. A priority in improving provider experience is to overcome information overload and reduce the cognitive burden so fewer medical errors and cognitive biases are introduced during patient care. One major type of medical error is diagnostic error due to systematic or predictable errors in judgment that rely on heuristics. The potential for clinical natural language processing (cNLP) to model diagnostic reasoning in humans with forward reasoning from data to diagnosis and potentially reduce the cognitive burden and medical error has not been investigated. Existing tasks to advance the science in cNLP have largely focused on information extraction and named entity recognition through classification tasks. We introduce a novel suite of tasks called Diagnostic Reasoning Benchmarks, DR.BENCH, as a new benchmark for developing and evaluating cNLP models with clinical diagnostic reasoning ability. The suite includes six tasks from ten publicly available datasets addressing clinical text understanding, medical knowledge reasoning, and diagnosis generation. DR.BENCH is the first clinical suite of tasks designed to be a natural language generation framework to evaluate pre-trained language models. Experiments with state-of-the-art pre-trained generative language models, using large general-domain models and models that were continually trained on a medical corpus, demonstrate opportunities for improvement when evaluated in DR.BENCH. We share DR.BENCH as a publicly available GitLab repository with a systematic approach to load and evaluate models for the cNLP community.
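Recent research has shown that reducing temporal redundancy and reducing spatial redundancy are both effective approaches to efficient video recognition, e.g., allocating most of the computation to task-relevant frames or to the most valuable image regions of each frame. However, in most existing works, either type of redundancy is typically modeled with the other absent. This paper explores a unified formulation of spatial-temporal dynamic computation on top of the recently proposed AdaFocusV2 algorithm, contributing to an improved AdaFocusV3 framework. Our method reduces computational cost by activating the expensive high-capacity network only on a few small but informative 3D video cubes. These cubes are cropped from the space formed by frame height, width, and video duration, while their locations are adaptively determined by a lightweight policy network on a per-sample basis. At test time, the number of cubes corresponding to each video is dynamically configured, i.e., video cubes are processed sequentially until a sufficiently reliable prediction is produced. Notably, AdaFocusV3 can be trained effectively by approximating the non-differentiable cropping operation with interpolation of deep features. Extensive empirical results on six benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, and Diving48) demonstrate that our model is considerably more efficient than competitive baselines.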
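The test-time behavior described above, processing cubes one at a time until the prediction is reliable enough, can be sketched as a simple early-exit loop. Everything here is an assumption made for illustration: the names `predict_with_early_exit` and `classify_cube`, the averaging of logits across cubes, and the use of the top softmax probability as the reliability measure. The paper may aggregate and threshold differently.

```python
import math
from typing import Callable, Sequence, Tuple


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_early_exit(
    cubes: Sequence,
    classify_cube: Callable[[object], Sequence[float]],  # hypothetical per-cube model
    confidence_threshold: float = 0.9,
) -> Tuple[int, int]:
    """Process cubes in order and stop once the top class probability of the
    averaged logits reaches the threshold. Returns (class index, cubes used)."""
    if not cubes:
        raise ValueError("at least one cube is required")
    running = None
    best, used = 0, 0
    for cube in cubes:
        logits = list(classify_cube(cube))
        used += 1
        # Accumulate logits so each new cube refines the running estimate.
        running = logits if running is None else [a + b for a, b in zip(running, logits)]
        probs = softmax([x / used for x in running])
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= confidence_threshold:
            break  # reliable enough: skip the remaining cubes
    return best, used


# Toy usage: each "cube" is already a pair of logits, so the model is the identity.
toy_cubes = [[0.1, 0.0], [3.0, 0.0], [5.0, 0.0], [5.0, 0.0]]
print(predict_with_early_exit(toy_cubes, lambda c: c, confidence_threshold=0.9))
```

With a lower threshold the loop exits earlier, which is the compute-versus-confidence trade-off that a dynamically configured cube count exposes.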
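We analyze a stochastic approximation algorithm for decision-dependent problems, in which the data distribution used by the algorithm evolves along the iterate sequence. The primary examples of such problems appear in performative prediction and its multiplayer extensions. We show that, under mild assumptions, the deviation between the average iterate of the algorithm and the solution is asymptotically normal, with a covariance that cleanly decouples the effects of gradient noise and distributional shift. Moreover, building on the work of Hájek and Le Cam, we show that the asymptotic performance of the algorithm is locally minimax optimal.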
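State-of-the-art neural methods for open information extraction (OpenIE) usually extract triplets (or tuples) iteratively, in an autoregressive or predicate-based manner, in order not to produce duplicates. In this work, we propose a different approach to the problem that can be equally or more successful. Namely, we present a novel single-pass method for OpenIE inspired by object detection algorithms from computer vision. We use an order-agnostic loss based on bipartite matching that forces unique predictions, together with a Transformer-based encoder-only architecture for sequence labeling. Compared with state-of-the-art models on standard benchmarks, the proposed approach is faster and shows superior or similar performance in terms of both quality metrics and inference time. Our model sets a new state of the art on CaRB evaluated as OIE2016 while running inference faster than the previous state of the art. We also evaluate a multilingual version of our model in the zero-shot setting for two languages and introduce a strategy for generating synthetic multilingual data to fine-tune the model for each specific language. In this setting, we show a 15% performance improvement on multilingual Re-OIE2016, reaching 75% F1 for both Portuguese and Spanish. Code and models are available at https://github.com/sberbank-ai/detie.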
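The idea of an order-agnostic loss based on bipartite matching can be sketched as follows: each predicted extraction is assigned to exactly one reference extraction, and the loss is the total cost of the cheapest assignment, so the order in which extractions are emitted does not matter. The per-pair cost (`pair_cost`, a count of mismatched token labels), the label alphabet, and the brute-force search over permutations are all assumptions for illustration. A practical implementation would use a proper assignment solver such as the Hungarian algorithm and a differentiable cost.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def pair_cost(pred: Sequence[str], target: Sequence[str]) -> int:
    """Hypothetical per-pair cost: number of token positions whose labels differ."""
    return sum(p != t for p, t in zip(pred, target))


def order_agnostic_loss(
    predictions: List[Sequence[str]], targets: List[Sequence[str]]
) -> Tuple[int, Tuple[int, ...]]:
    """Minimum total cost over all one-to-one assignments of predictions to targets.

    Returns (cost, assignment), where assignment[i] is the index of the target
    matched to predictions[i]. Brute force, so only suitable for small sets.
    """
    if len(predictions) != len(targets):
        raise ValueError("this sketch assumes equally many predictions and targets")
    best_cost, best_perm = None, None
    for perm in permutations(range(len(targets))):
        cost = sum(pair_cost(predictions[i], targets[j]) for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm


# Two reference extractions over a 4-token sentence (S = subject, P = predicate,
# O = object, "-" = outside), and two predictions emitted in the opposite order.
targets = [["S", "P", "O", "-"], ["-", "S", "P", "O"]]
predictions = [["-", "S", "P", "O"], ["S", "P", "O", "O"]]
print(order_agnostic_loss(predictions, targets))
```

Because each target can be matched only once, two identical predictions cannot both score well against the same reference, which is how a matching-based loss discourages duplicates.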
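The development of large and super-large language models, such as GPT-3, T5, Switch Transformer, and ERNIE, has significantly improved the performance of text generation. One important research direction in this area is the generation of texts with arguments. A solution to this problem can be used in business meetings, political debates, and dialogue systems, as well as for preparing student essays. One of the main domains for these applications is economics. The key problem for argumentative text generation in Russian is the lack of annotated argumentation corpora. In this paper, we use translated versions of the Argumentative Microtext, Persuasive Essays, and UKP Sentential corpora to fine-tune a RuBERT model. This model is then used to annotate a corpus of economic news with argumentation. The annotated corpus is in turn used to fine-tune a ruGPT-3 model, which generates argumentative texts. The results show that this approach improves the accuracy of argument generation by more than 20 percentage points (63.2% vs. 42.5%) compared to the original ruGPT-3 model.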
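Much of computer-generated animation is created by manipulating meshes with rigs. While this approach works well for animating articulated objects such as animals, it has limited flexibility for animating less structured, free-form objects. We introduce Wassersplines, a novel trajectory inference method for animating unstructured densities, based on recent advances in continuous normalizing flows and optimal transport. The key idea is to train a neurally parameterized velocity field that represents the motion between keyframes. Trajectories are then computed by advecting keyframes through the velocity field. We solve an additional Wasserstein barycenter interpolation problem to guarantee strict adherence to keyframes. Our tool can stylize trajectories through a variety of PDE-based regularizers to create different visual effects. We demonstrate our tool on various keyframe interpolation problems to produce temporally coherent animations without meshing or rigging.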
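Computing trajectories by advecting keyframe samples through a velocity field amounts to integrating an ordinary differential equation for each point. The sketch below does this with forward Euler steps in 2D. The function name `advect`, the hand-written rotation field, and the choice of Euler integration are assumptions for illustration. In the method described above the velocity field is a trained neural network, and the integrator may differ.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
# A velocity field maps (x, y, t) to a 2D velocity.
VelocityField = Callable[[float, float, float], Point]


def advect(
    points: Sequence[Point],
    velocity: VelocityField,
    t0: float,
    t1: float,
    steps: int = 100,
) -> List[Point]:
    """Move each point through the velocity field from time t0 to t1
    using forward Euler integration with a fixed step size."""
    dt = (t1 - t0) / steps
    current = [(float(x), float(y)) for x, y in points]
    t = t0
    for _ in range(steps):
        moved = []
        for x, y in current:
            vx, vy = velocity(x, y, t)
            moved.append((x + dt * vx, y + dt * vy))
        current = moved
        t += dt
    return current


def rotation_field(x: float, y: float, t: float) -> Point:
    """Hypothetical stand-in for a learned field: rigid rotation about the origin."""
    return (-y, x)


# Samples from a "keyframe" (here just three points) advected for one time unit.
keyframe_samples = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
print(advect(keyframe_samples, rotation_field, t0=0.0, t1=1.0, steps=1000))
```

Evaluating `advect` at intermediate end times yields the in-between frames of the animation. Forward Euler drifts slightly outward on the rotation field, so a real pipeline would likely prefer a higher-order integrator.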