Quantifying the perceptual similarity of two images is a long-standing problem in low-level computer vision. The natural image domain commonly relies on supervised learning, e.g., a pre-trained VGG, to obtain a latent representation. However, due to domain shift, pre-trained models from the natural image domain might not apply to other image domains, such as medical imaging. Notably, in medical imaging, perceptual similarity is evaluated exclusively by specialists trained extensively in diverse medical fields. Thus, medical imaging remains devoid of task-specific, objective perceptual measures. This work answers the question: Is it necessary to rely on supervised learning to obtain an effective representation that can measure perceptual similarity, or is self-supervision sufficient? To understand whether recent contrastive self-supervised representations (CSR) can fill this gap, we start with natural images, systematically evaluate CSR as a metric across numerous contemporary architectures and tasks, and compare it with existing methods. We find that in the natural image domain, CSR used as a metric performs on par with supervised representations on several perceptual tests, and in the medical domain, CSR better quantifies perceptual similarity with respect to the experts' ratings. We also demonstrate that CSR can significantly improve image quality in two image synthesis tasks. Finally, our extensive results suggest that perceptuality is an emergent property of CSR, which can be adapted to many image domains without requiring annotations.
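To make the idea of using a latent representation as a perceptual metric concrete, here is a minimal sketch. It assumes the encoder (supervised or contrastive) has already mapped each image to a feature vector; the vectors below are hypothetical stand-ins, and cosine distance is only one plausible choice of distance, not necessarily the one used in the paper.

```python
import math

def cosine_distance(feat_a, feat_b):
    """Return 1 - cosine similarity between two equal-length feature vectors."""
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(a * a for a in feat_a))
    norm_b = math.sqrt(sum(b * b for b in feat_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical embeddings standing in for encoder outputs (assumption):
reference = [0.9, 0.1, 0.3]   # reference image
mild = [0.8, 0.2, 0.3]        # mildly distorted version
severe = [0.1, 0.9, 0.2]      # heavily distorted version

d_mild = cosine_distance(reference, mild)
d_severe = cosine_distance(reference, severe)
print(d_mild < d_severe)  # a smaller distance is read as "perceptually closer"
```

A metric of this kind can be compared against human or expert ratings by checking whether the ordering of distances agrees with the ordering of the ratings.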
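Medical image synthesis has attracted increasing attention because it can generate missing image data, improve diagnosis, and benefit many downstream tasks. However, the synthesis models developed so far do not adapt to unseen data distributions that exhibit domain shift, which limits their applicability in clinical routine. This work explores domain adaptation (DA) of 3D image-to-image synthesis models. First, we highlight the technical differences in DA between classification, segmentation, and synthesis models. Second, we present a novel, efficient adaptation approach based on a 2D variational autoencoder that approximates 3D distributions. Third, we present empirical studies on the effects of the amount of adaptation data and of the key hyper-parameters. Our results show that the proposed approach can significantly improve synthesis accuracy on unseen domains in a 3D setting. The code is publicly available at https://github.com/winstonhutiger/2d_vae_uda_for_3d_sythesis.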
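Movement and pose assessment of newborns lets experienced pediatricians predict neurodevelopmental disorders, allowing early intervention for related diseases. However, most state-of-the-art AI approaches to human pose estimation focus on adults, and publicly available benchmarks for infant pose estimation are lacking. In this paper, we fill this gap by proposing an infant pose dataset and a Deep Aggregation Vision Transformer for human pose estimation, which introduces a fast-trained, fully transformer-based framework that does not use convolution operations to extract features in the early stages. It generalizes Transformer + MLP to high-resolution deep layer aggregation within feature maps, thus enabling information fusion between different vision levels. We pre-train AggPose on the COCO pose dataset and apply it to our newly released large-scale infant pose estimation dataset. The results show that AggPose can effectively learn multi-scale features across different resolutions and significantly improve the performance of infant pose estimation. We show that AggPose outperforms the hybrid models HRFormer and TokenPose on the infant pose estimation dataset. Moreover, AggPose shows a 0.8 AP advantage on COCO val pose estimation. Our code is available at github.com/szar-lab/aggpose.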
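Subgraph-based graph representation learning (SGRL) has recently been proposed to address some fundamental challenges encountered by canonical graph neural networks (GNNs), and has demonstrated advantages in many important data science applications such as link, relation, and motif prediction. However, current SGRL approaches suffer from scalability issues because they require extracting a subgraph for every training or test query. Recent solutions for scaling up canonical GNNs may not apply to SGRL. Here, we propose SUREL, a novel framework for scalable SGRL, by co-designing the learning algorithm and its system support. SUREL adopts a walk-based decomposition of subgraphs and reuses the walks to form subgraphs, which substantially reduces the redundancy of subgraph extraction and supports parallel computation. Experiments on six homogeneous, heterogeneous, and higher-order graphs with millions of nodes and edges demonstrate the effectiveness and scalability of SUREL. In particular, compared with SGRL baselines, SUREL achieves a 10× speed-up with comparable or even better prediction performance; compared with canonical GNNs, SUREL achieves a 50% improvement in prediction accuracy.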
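The walk-based decomposition can be illustrated with a small sketch: random walks are sampled once per node, and the subgraph for any query is then assembled by reusing the stored walks of the query nodes instead of extracting a fresh subgraph each time. This is a simplified illustration under my own assumptions (uniform random walks, a plain adjacency dictionary, hypothetical function names); it is not the SUREL implementation.

```python
import random

def sample_walks(adj, num_walks, walk_len, rng):
    """Pre-sample `num_walks` uniform random walks of up to `walk_len` steps from every node."""
    walks = {}
    for node in adj:
        node_walks = []
        for _ in range(num_walks):
            walk = [node]
            for _ in range(walk_len):
                neighbours = adj[walk[-1]]
                if not neighbours:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbours))
            node_walks.append(walk)
        walks[node] = node_walks
    return walks

def query_subgraph_nodes(walks, query_nodes):
    """Form a query's subgraph node set by joining the stored walks of its nodes."""
    nodes = set()
    for q in query_nodes:
        for walk in walks[q]:
            nodes.update(walk)
    return nodes

# Toy path graph 0-1-2-3 plus an isolated node 4 (hypothetical data).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
walks = sample_walks(adj, num_walks=4, walk_len=2, rng=random.Random(0))

# A link query (0, 3) reuses the walks of nodes 0 and 3; nothing is re-extracted.
link_nodes = query_subgraph_nodes(walks, (0, 3))
print(sorted(link_nodes))
```

Because the walks are stored per node, two queries that share a node share its walks, which is where the reduction in redundant extraction comes from in this sketch.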
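Face detection searches an image for all possible regions that contain faces and, if any are present, locates them. Many applications, including face recognition, facial expression recognition, face tracking, and head pose estimation, assume that the position and size of the face in the image are known. In recent decades, researchers have created many typical and efficient face detectors, from the Viola-Jones face detector to current CNN-based ones. However, with the tremendous growth of images and videos, in which faces vary in scale, appearance, expression, occlusion, and pose, traditional face detectors are challenged to detect the wide variety of faces "in the wild". The emergence of deep learning techniques has brought remarkable breakthroughs in face detection, at the price of a considerable increase in computation. This paper introduces representative deep learning-based methods and presents a deep and thorough analysis in terms of accuracy and efficiency. We further compare and discuss popular and challenging datasets and their evaluation metrics. A comprehensive comparison of several successful deep learning-based face detectors is conducted to uncover their efficiency using two metrics: FLOPs and latency. This paper can guide the choice of appropriate face detectors for different applications and the development of more efficient and accurate detectors.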
Temporal modeling is key for action recognition in videos. It normally considers both short-range motions and long-range aggregations. In this paper, we propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, specifically designed to capture both short- and long-range temporal evolution. In particular, for short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features. It then utilizes the differences to excite the motion-sensitive channels of the features. The long-range temporal aggregations in previous works are typically achieved by stacking a large number of local temporal convolutions. Each convolution processes a local temporal window at a time. In contrast, the MTA module proposes to deform the local convolution to a group of sub-convolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-convolutions, and each frame could complete multiple temporal aggregations with neighborhoods. The final equivalent receptive field of the temporal dimension is accordingly enlarged, which makes it capable of modeling the long-range temporal relationship over distant frames. The two components of the TEA block are complementary in temporal modeling. Finally, our approach achieves impressive results at low FLOPs on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB51, and UCF101, which confirms its effectiveness and efficiency.
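The motion excitation idea (take feature-level differences between adjacent frames, then use them to emphasise motion-sensitive channels) can be sketched in a few lines. The sketch below is a deliberate simplification under my own assumptions: features are already pooled to one value per channel per frame, the gate is a plain sigmoid of the difference, and the excitation is applied residually. The abstract does not specify the pooling, convolutions, or gating used in the actual ME module.

```python
import math

def sigmoid(v):
    # Numerically stable logistic function written via tanh.
    return 0.5 * (1.0 + math.tanh(0.5 * v))

def motion_excitation(frames):
    """frames: list of T per-frame channel vectors (spatially pooled features).

    Returns features of the same shape where channels that change between
    frame t and frame t+1 receive a larger multiplicative boost.
    """
    num_frames = len(frames)
    excited = []
    for t in range(num_frames):
        if t + 1 < num_frames:
            # Feature-level temporal difference to the next frame.
            diff = [nxt - cur for cur, nxt in zip(frames[t], frames[t + 1])]
        else:
            # The last frame has no successor: use a zero difference.
            diff = [0.0] * len(frames[t])
        gate = [sigmoid(d) for d in diff]
        # Residual excitation: keep the original feature and add the gated copy.
        excited.append([x + x * g for x, g in zip(frames[t], gate)])
    return excited

# Toy input: channel 0 is static, channel 1 changes between frames 0 and 1.
frames = [[1.0, 1.0], [1.0, 3.0], [1.0, 3.0]]
excited = motion_excitation(frames)
print(excited[0])  # channel 1 of frame 0 is boosted more than channel 0
```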
Deep neural networks are vulnerable to adversarial examples, which raises security concerns about these algorithms due to the potentially severe consequences. Adversarial attacks serve as an important surrogate to evaluate the robustness of deep learning models before they are deployed. However, most existing adversarial attacks can only fool a black-box model with a low success rate. To address this issue, we propose a broad class of momentum-based iterative algorithms to boost adversarial attacks. By integrating the momentum term into the iterative process for attacks, our methods can stabilize update directions and escape from poor local maxima during the iterations, resulting in more transferable adversarial examples. To further improve the success rates for black-box attacks, we apply momentum iterative algorithms to an ensemble of models, and show that the adversarially trained models with a strong defense ability are also vulnerable to our black-box attacks. We hope that the proposed methods will serve as a benchmark for evaluating the robustness of various deep models and defense methods. With this method, we won first place in both the NIPS 2017 Non-targeted Adversarial Attack and Targeted Adversarial Attack competitions.
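As an illustration of adding a momentum term to an iterative attack, the sketch below accumulates L1-normalised gradients with a decay factor and steps along the sign of the accumulated direction while staying inside an L-infinity ball. The normalisation, the sign step, and all names (`grad_fn`, `mu`, `eps`) are assumptions made for this example, since the abstract does not spell out the update rule; the toy linear loss stands in for a real model.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def momentum_iterative_attack(x, grad_fn, eps, steps=10, mu=1.0):
    """Maximise a loss within an L-infinity ball of radius `eps` around `x`.

    grad_fn(x_adv) must return the gradient of the loss w.r.t. the input.
    `mu` is the momentum decay factor; mu = 0 reduces to a plain iterative
    sign-gradient attack.
    """
    alpha = eps / steps               # per-step size
    x_adv = list(x)
    velocity = [0.0] * len(x)         # accumulated gradient direction
    for _ in range(steps):
        grad = grad_fn(x_adv)
        l1 = sum(abs(v) for v in grad) or 1.0   # avoid division by zero
        velocity = [mu * m + v / l1 for m, v in zip(velocity, grad)]
        x_adv = [xa + alpha * sign(m) for xa, m in zip(x_adv, velocity)]
        # Project back into the eps-ball around the original input.
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
    return x_adv

# Toy loss(x) = w . x, whose gradient is simply w (hypothetical stand-in for a model).
w = [2.0, -1.0, 0.5]
x_adv = momentum_iterative_attack([0.0, 0.0, 0.0], lambda x_cur: w, eps=0.3)
print(x_adv)  # each coordinate moves to the ball boundary in the direction of sign(w)
```

The accumulated `velocity` is what smooths the update direction across iterations; with a real network, `grad_fn` would be a backward pass through the model.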
The deployment of deep convolutional neural networks (CNNs) in many real world applications is largely hindered by their high computational cost. In this paper, we propose a novel learning scheme for CNNs to simultaneously 1) reduce the model size; 2) decrease the run-time memory footprint; and 3) lower the number of computing operations, without compromising accuracy. This is achieved by enforcing channel-level sparsity in the network in a simple but effective way. Different from many existing approaches, the proposed method directly applies to modern CNN architectures, introduces minimum overhead to the training process, and requires no special software/hardware accelerators for the resulting models. We call our approach network slimming, which takes wide and large networks as input models, but during training insignificant channels are automatically identified and pruned afterwards, yielding thin and compact models with comparable accuracy. We empirically demonstrate the effectiveness of our approach with several state-of-the-art CNN models, including VGGNet, ResNet and DenseNet, on various image classification datasets. For VGGNet, a multi-pass version of network slimming gives a 20× reduction in model size and a 5× reduction in computing operations.
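A rough sketch of the channel selection step follows. It assumes each channel carries a learned scaling factor that a sparsity penalty has pushed towards zero during training, and that channels are pruned against a single global threshold; the abstract itself only says that insignificant channels are identified during training and pruned afterwards, so both the per-channel quantity and the thresholding rule here are assumptions.

```python
def select_channels(scales_per_layer, prune_ratio):
    """Return, for each layer, the indices of channels to keep.

    scales_per_layer: list of lists of per-channel scaling factors (assumed
    to come from sparsity-regularised training).
    prune_ratio: fraction of all channels, across layers, to remove.
    """
    all_scales = sorted(abs(s) for layer in scales_per_layer for s in layer)
    n_prune = int(len(all_scales) * prune_ratio)
    if n_prune == 0:
        return [list(range(len(layer))) for layer in scales_per_layer]
    # Global threshold: the largest magnitude among the channels to be pruned.
    threshold = all_scales[n_prune - 1]
    # Channels tied with the threshold are pruned too, so slightly more than
    # `prune_ratio` may be removed; a layer can also end up empty.
    return [
        [i for i, s in enumerate(layer) if abs(s) > threshold]
        for layer in scales_per_layer
    ]

# Hypothetical scaling factors for two layers of four channels each.
scales = [[0.9, 0.01, 0.5, 0.02], [0.03, 0.7, 0.6, 0.001]]
kept = select_channels(scales, prune_ratio=0.5)
print(kept)  # [[0, 2], [1, 2]]
```

After selection, the kept indices would be used to copy the surviving filters into a thinner network, which is typically fine-tuned to recover accuracy.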
Dense retrievers have made significant strides in obtaining state-of-the-art results on text retrieval and open-domain question answering (ODQA). Yet most of these achievements were made possible by large annotated datasets, and unsupervised learning for dense retrieval models remains an open problem. In this work, we explore two categories of methods for creating pseudo query-document pairs, named query extraction (QExt) and transferred query generation (TQGen), to augment the retriever training in an annotation-free and scalable manner. Specifically, QExt extracts pseudo queries by document structures or selecting salient random spans, and TQGen utilizes generation models trained for other NLP tasks (e.g., summarization) to produce pseudo queries. Extensive experiments show that dense retrievers trained with individual augmentation methods can perform comparably well with multiple strong baselines, and combining them leads to further improvements, achieving state-of-the-art performance of unsupervised dense retrieval on both BEIR and ODQA datasets.
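The span-based variant of query extraction can be sketched as follows: a contiguous span is cut out of a document and paired with that document as a pseudo query-document training example. This minimal version picks spans uniformly at random and applies no salience criterion; the function name, whitespace tokenisation, and span-length bounds are assumptions for illustration only.

```python
import random

def extract_pseudo_query(document, min_len=3, max_len=8, rng=None):
    """Cut a random contiguous word span out of `document` to use as a pseudo query."""
    rng = rng or random.Random()
    tokens = document.split()
    if not tokens:
        raise ValueError("document must contain at least one token")
    # randint is inclusive on both ends; cap the span at the document length.
    span_len = min(rng.randint(min_len, max_len), len(tokens))
    start = rng.randint(0, len(tokens) - span_len)
    return " ".join(tokens[start:start + span_len])

doc = ("Dense retrievers encode queries and documents into a shared "
       "vector space for nearest neighbour search")
query = extract_pseudo_query(doc, rng=random.Random(0))

# The (query, document) pair serves as a positive example for retriever training.
pair = (query, doc)
print(pair[0])
```

A salience filter (for example, preferring spans with rare terms) could be layered on top by scoring several candidate spans and keeping the best one.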
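Automated analysis of electron microscopy datasets poses multiple challenges, such as limits on the size of training datasets and shifts in data distribution caused by variations in sample quality and experimental conditions. A trained model must continue to provide acceptable segmentation/classification performance on new data and quantify the uncertainty associated with its predictions. Across the broad applications of machine learning, various approaches have been adopted to quantify uncertainty, such as Bayesian modeling, Monte Carlo dropout, and ensembles. To address the challenges specific to the electron microscopy data domain, this work implements two different types of ensembles of pre-trained neural networks. The ensembles perform semantic segmentation of ice crystals within a two-phase mixture, thereby tracking their phase transformation to water. The first ensemble (EA) is composed of U-Net style networks with different underlying architectures, whereas the second series of ensembles (ER-i) is composed of randomly initialized U-Net style networks in which every base learner has the same underlying architecture 'i'. The encoders of the base learners were pre-trained on the ImageNet dataset. The performance of EA and ER was evaluated on three different metrics: accuracy, calibration, and uncertainty. EA shows higher classification accuracy and is better calibrated than ER. Although the uncertainty quantification of the two types of ensembles is comparable, the uncertainty scores exhibited by ER depend on the specific architecture of its base member ('i') and are not consistently better than those of EA. Hence, the challenges posed by the analysis of electron microscopy datasets appear to be better addressed by an ensemble design like EA than by one like ER.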
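One common way to turn an ensemble into an uncertainty estimate is to average the members' class probabilities and take the entropy of the averaged distribution. The sketch below shows this for a single pixel; predictive entropy is my assumption here, as the abstract names uncertainty as an evaluation metric without saying which score is used, and the probabilities are made-up values.

```python
import math

def ensemble_predict(member_probs):
    """Combine per-member class probabilities for one pixel.

    member_probs: list (one entry per ensemble member) of class-probability lists.
    Returns (mean probabilities, predictive entropy in nats).
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]
    # Entropy of the averaged distribution; zero-probability classes contribute nothing.
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    return mean, entropy

# Hypothetical softmax outputs (ice vs. water) from three base learners.
agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagree = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

_, entropy_agree = ensemble_predict(agree)
mean_disagree, entropy_disagree = ensemble_predict(disagree)
print(entropy_agree < entropy_disagree)  # disagreement between members raises the entropy
```

Computed per pixel, this score gives an uncertainty map that can be inspected alongside the segmentation mask.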