This paper proposes a novel training method for end-to-end scene text recognition. End-to-end scene text recognition offers high recognition accuracy, especially when a Transformer-based encoder-decoder model is used. To train an accurate end-to-end model, we need to prepare a large image-to-text paired dataset for the target language. However, it is difficult to collect such data, especially for resource-poor languages. To overcome this difficulty, our proposed method utilizes well-prepared large datasets in resource-rich languages such as English to train the encoder-decoder model for the resource-poor language. Our key idea is to build a model whose encoder reflects knowledge of multiple languages while the decoder specializes in the resource-poor language. To this end, the proposed method pre-trains the encoder with a multilingual dataset that combines the resource-poor language's dataset and the resource-rich language's dataset, so that it learns language-invariant knowledge for scene text recognition. The proposed method also pre-trains the decoder with the resource-poor language's dataset, making the decoder better suited to the resource-poor language. Experiments on Japanese scene text recognition with a small public dataset demonstrate the effectiveness of the proposed method.
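A minimal sketch of the three-stage schedule described above, with a placeholder `fit` training loop and generic module names; freezing the encoder during decoder pre-training is our assumption, not stated in the abstract:

```python
import torch.nn as nn

def pretrain_and_finetune(encoder: nn.Module, decoder: nn.Module,
                          multilingual_data, poor_language_data, fit):
    # 1) Pre-train the encoder on the combined resource-rich + resource-poor
    #    dataset so it learns language-invariant scene-text features.
    fit(encoder, decoder, multilingual_data)
    # 2) Pre-train the decoder on the resource-poor language only (encoder
    #    frozen, by assumption) so it specializes in that language.
    for p in encoder.parameters():
        p.requires_grad = False
    fit(encoder, decoder, poor_language_data)
    # 3) Fine-tune the full model end-to-end on the resource-poor language.
    for p in encoder.parameters():
        p.requires_grad = True
    fit(encoder, decoder, poor_language_data)
```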
This paper proposes a novel knowledge distillation method for dialogue sequence labeling. Dialogue sequence labeling is a supervised learning task that estimates a label for each utterance in a target dialogue document, and it is useful for many applications such as dialogue act estimation. Accurate labeling is often achieved by hierarchically-structured large models consisting of utterance-level and dialogue-level networks that capture the contexts within an utterance and between utterances, respectively. However, due to their large model size, such models cannot be deployed on resource-constrained devices. To overcome this difficulty, we focus on knowledge distillation, which trains a small model by distilling the knowledge of a large, high-performance teacher model. Our key idea is to distill the knowledge while keeping the complex contexts captured by the teacher model. To this end, the proposed method, hierarchical knowledge distillation, trains the small model by distilling the knowledge of the utterance-level and dialogue-level contexts learned by the teacher model, training the small model to mimic the teacher model's output at each level. Experiments on dialogue act estimation and call scene segmentation demonstrate the effectiveness of the proposed method.
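A minimal sketch of what such a hierarchical distillation objective could look like, assuming teacher and student expose logits plus utterance-level and dialogue-level hidden states; the dictionary keys, loss weights, and temperature are illustrative assumptions, not the paper's actual loss:

```python
import torch
import torch.nn.functional as F

def hierarchical_kd_loss(student_out, teacher_out, alpha=0.5, beta=0.5, T=2.0):
    """Each *_out is a dict with 'logits', 'utt_hidden', 'dlg_hidden' tensors."""
    # Label-level distillation: KL divergence between softened distributions.
    kd = F.kl_div(
        F.log_softmax(student_out["logits"] / T, dim=-1),
        F.softmax(teacher_out["logits"] / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    # Context-level distillation at each level of the hierarchy.
    utt = F.mse_loss(student_out["utt_hidden"], teacher_out["utt_hidden"])
    dlg = F.mse_loss(student_out["dlg_hidden"], teacher_out["dlg_hidden"])
    return kd + alpha * utt + beta * dlg
```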
The ability to record high-fidelity videos at high acquisition rates is central to the study of fast-moving phenomena. The difficulty of imaging fast-moving scenes lies in a trade-off between motion blur and underexposure noise: on the one hand, recordings with long exposure times suffer from motion blur caused by movement in the recorded scene. On the other hand, the amount of light reaching the camera photosensors decreases with exposure time, so short-exposure recordings suffer from underexposure noise. In this paper, we propose to address this trade-off by treating high-speed imaging as an underexposed image denoising problem. We combine recent advances in underexposed image denoising using deep learning and adapt these methods to the specifics of the high-speed imaging problem. Leveraging large external datasets with a sensor-specific noise model, our method is able to speed up the acquisition rate of a high-speed camera by over one order of magnitude while maintaining similar image quality.
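As a rough illustration of a sensor-specific noise model of the kind mentioned above, the following sketch synthesizes a short-exposure frame from a clean one with Poisson shot noise and Gaussian read noise; the exposure ratio, gain, and read-noise level are illustrative, not calibrated to any particular camera:

```python
import numpy as np

def synthesize_underexposed(clean, exposure_ratio=0.1, gain=0.01,
                            read_std=0.002, rng=None):
    """clean: float array in [0, 1]; returns a simulated short-exposure frame."""
    if rng is None:
        rng = np.random.default_rng()
    dark = clean * exposure_ratio                       # fewer photons collected
    shot = rng.poisson(dark / gain) * gain              # signal-dependent shot noise
    read = rng.normal(0.0, read_std, size=clean.shape)  # signal-independent read noise
    return np.clip(shot + read, 0.0, 1.0).astype(np.float32)
```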
Robotic hands with soft surfaces can perform stable grasping, but the high friction of the soft surfaces makes it difficult to release objects or to perform operations that require sliding. To solve this issue, we previously developed a contact area variable surface (CAVS), whose friction changes according to the load. However, we previously presented only fundamental results, without detailed analyses. In this study, we first investigated the friction anisotropy of the CAVS and demonstrated that the longitudinal direction exhibits a larger ratio of friction change. Next, we proposed a sensible CAVS capable of providing a variable-friction mechanism, and tested its sensing and control systems in operations requiring switching between sliding and stable-grasping modes. Friction sensing was performed using an embedded camera, and we developed a gripper using the sensible CAVS, taking the CAVS friction anisotropy into account. In the CAVS, the low-friction mode corresponds to a small grasping force, while the high-friction mode corresponds to a larger grasping force. Therefore, by controlling only the friction mode, the gripper can be set to either the sliding or the stable-grasping mode. Based on this feature, a methodology for controlling the contact mode was constructed. We demonstrated a manipulation task involving both sliding and stable grasping, and thus verified the efficacy of the developed sensible CAVS.
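A minimal sketch of the mode-switching idea, assuming a gripper where the commanded grasping force alone selects the friction mode; the force values and interface are illustrative assumptions, not the authors' controller:

```python
from enum import Enum

class ContactMode(Enum):
    SLIDING = "sliding"        # small force -> low-friction contact
    STABLE_GRASP = "grasp"     # large force -> high-friction contact

def command_force(mode: ContactMode,
                  sliding_force_n: float = 1.0,
                  grasp_force_n: float = 8.0) -> float:
    """Return the grasping force (N) that realizes the requested contact mode."""
    return sliding_force_n if mode is ContactMode.SLIDING else grasp_force_n
```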
Recent works have shown that implicit neural representations (INRs) can meaningfully represent signal derivatives. In this work, we leverage this property to perform video frame interpolation (VFI) by explicitly constraining the derivatives of the INR to satisfy the optical flow constraint equation. Using only the target video and its optical flow, without learning an interpolation operator from additional training data, we achieve state-of-the-art VFI over limited motion ranges. We further show that constraining the INR's derivatives not only enables better interpolation of intermediate frames, but also improves the ability of narrow networks to fit the observed frames, which suggests potential applications in video compression and INR optimization.
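A minimal sketch of how the optical flow constraint equation, I_t + u·I_x + v·I_y = 0, could be imposed on an INR's derivatives via automatic differentiation; the `inr` module and flow inputs are assumptions, not the authors' code:

```python
import torch

def flow_constraint_loss(inr, coords, flow_uv):
    """coords: (N, 3) tensor of (x, y, t); flow_uv: (N, 2) tensor of (u, v)."""
    coords = coords.clone().requires_grad_(True)
    intensity = inr(coords)  # (N, 1) predicted intensity
    # Differentiate the INR output w.r.t. its input coordinates.
    grads = torch.autograd.grad(
        intensity, coords,
        grad_outputs=torch.ones_like(intensity),
        create_graph=True,  # keep the graph so the loss itself is differentiable
    )[0]  # (N, 3): columns are (I_x, I_y, I_t)
    I_x, I_y, I_t = grads[:, 0], grads[:, 1], grads[:, 2]
    u, v = flow_uv[:, 0], flow_uv[:, 1]
    residual = I_t + u * I_x + v * I_y  # optical flow constraint residual
    return (residual ** 2).mean()
```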
In the present work, we show that the performance of formula-driven supervised learning (FDSL) can match or even exceed that of ImageNet-21k pre-training, without using any real images, human supervision, or self-supervision during the pre-training of Vision Transformers (ViTs). For example, ViT-Base pre-trained on ImageNet-21k shows 81.8% top-1 accuracy when fine-tuned on ImageNet-1k, and FDSL matches or exceeds this when pre-trained under the same conditions (number of images, hyperparameters, and number of epochs). Images generated by formulas avoid the privacy/copyright issues, labeling costs and errors, and biases from which real images suffer, and thus have great potential for pre-training general-purpose models. To understand the performance of synthetic images, we tested two hypotheses: (i) object contours are what matter in FDSL datasets, and (ii) an increased number of parameters in label creation affects the performance improvement of FDSL pre-training. To test the former hypothesis, we constructed a dataset consisting of simple combinations of object contours and found that it can match the performance of fractals. For the latter hypothesis, we found that increasing the difficulty of the pre-training task generally leads to better fine-tuning accuracy.
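As a rough illustration of formula-generated contour images of the kind the contour hypothesis tests, the following sketch draws random polygon outlines; the parameter ranges are illustrative, and the actual FDSL datasets derive category labels from the generation parameters:

```python
import numpy as np
from PIL import Image, ImageDraw

def random_contour_image(n_polygons=4, size=224, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    img = Image.new("L", (size, size), color=0)
    draw = ImageDraw.Draw(img)
    for _ in range(n_polygons):
        n_vertices = int(rng.integers(3, 9))           # 3 to 8 vertices
        cx, cy = rng.uniform(0.2 * size, 0.8 * size, size=2)
        radius = rng.uniform(0.05 * size, 0.3 * size)
        angles = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=n_vertices))
        points = [(cx + radius * np.cos(a), cy + radius * np.sin(a))
                  for a in angles]
        draw.polygon(points, outline=255)              # draw the contour only, no fill
    return img

random_contour_image().save("contour_sample.png")
```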
A method to perform offline and online speaker diarization for an unlimited number of speakers is described in this paper. End-to-end neural diarization (EEND) has achieved overlap-aware speaker diarization by formulating it as a multi-label classification problem. It has also been extended for a flexible number of speakers by introducing speaker-wise attractors. However, the output number of speakers of attractor-based EEND is empirically capped; it cannot deal with cases where the number of speakers appearing during inference is higher than that during training because its speaker counting is trained in a fully supervised manner. Our method, EEND-GLA, solves this problem by introducing unsupervised clustering into attractor-based EEND. In the method, the input audio is first divided into short blocks, then attractor-based diarization is performed for each block, and finally, the results of each block are clustered on the basis of the similarity between locally-calculated attractors. While the number of output speakers is limited within each block, the total number of speakers estimated for the entire input can be higher than the limitation. To use EEND-GLA in an online manner, our method also extends the speaker-tracing buffer, which was originally proposed to enable online inference of conventional EEND. We introduce a block-wise buffer update to make the speaker-tracing buffer compatible with EEND-GLA. Finally, to improve online diarization, our method improves the buffer update method and revisits the variable chunk-size training of EEND. The experimental results demonstrate that EEND-GLA can perform speaker diarization of an unseen number of speakers in both offline and online inferences.
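A minimal sketch of the final clustering step, linking speakers across blocks by agglomerative clustering of locally-calculated attractors under cosine distance; the threshold is illustrative, and within-block cannot-link constraints that the full method may enforce are ignored here:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_block_attractors(attractors, distance_threshold=0.5):
    """attractors: list of (n_speakers_in_block, dim) arrays, one per block."""
    stacked = np.vstack(attractors)              # (total_attractors, dim)
    dists = pdist(stacked, metric="cosine")      # condensed distance matrix
    tree = linkage(dists, method="average")
    labels = fcluster(tree, t=distance_threshold, criterion="distance")
    # Map global speaker labels back to each block's local attractors.
    out, start = [], 0
    for block in attractors:
        out.append(labels[start:start + len(block)])
        start += len(block)
    return out  # one label array per block; equal labels = same speaker
```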
We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following three principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we develop a training procedure, based on a new weakly supervised ranking loss, to learn parameters of the architecture in an end-to-end manner from images depicting the same places over time downloaded from Google Street View Time Machine. Finally, we show that the proposed architecture significantly outperforms non-learnt image representations and off-the-shelf CNN descriptors on two challenging place recognition benchmarks, and improves over current state-of-the-art compact image representations on standard image retrieval benchmarks.
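A minimal sketch of a NetVLAD layer following the paper's formulation: local descriptors are softly assigned to K learned centroids by a 1x1 convolution, residuals are aggregated per cluster, then intra-normalized and L2-normalized into a single (K*D)-dimensional descriptor. Hyperparameters and initialization here are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    def __init__(self, num_clusters=64, dim=512):
        super().__init__()
        self.conv = nn.Conv2d(dim, num_clusters, kernel_size=1)  # soft-assignment
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                      # x: (B, D, H, W) CNN feature map
        soft_assign = self.conv(x).flatten(2).softmax(dim=1)     # (B, K, H*W)
        x_flat = x.flatten(2)                                    # (B, D, H*W)
        # Residuals between each descriptor and each centroid, weighted
        # by the soft assignments: (B, K, D, N).
        residual = x_flat.unsqueeze(1) - self.centroids.unsqueeze(0).unsqueeze(-1)
        vlad = (soft_assign.unsqueeze(2) * residual).sum(dim=-1)  # (B, K, D)
        vlad = F.normalize(vlad, p=2, dim=2)             # intra-normalization
        return F.normalize(vlad.flatten(1), p=2, dim=1)  # (B, K*D) descriptor
```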