The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about common practice and the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, and algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the prospect of prize money played only a minor role (16%). Although a median of 80 working hours was spent on method development, a large portion of participants (32%) stated that they did not have enough time for it, and 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
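To illustrate the patch-based strategy that respondents most often used for oversized samples, here is a minimal sketch of how patch origins might be laid out over an image or volume. The function names (`axis_starts`, `patch_origins`) and the choice to add one extra patch flush with the far edge are assumptions made for illustration; the survey does not describe any specific tiling scheme.

```python
from itertools import product


def axis_starts(size, patch, stride):
    """Start offsets along one axis so that the patches cover [0, size)."""
    if patch >= size:
        # One patch spans the whole axis (the caller would pad if needed).
        return [0]
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        # Add a final patch aligned with the far edge so nothing is skipped.
        starts.append(size - patch)
    return starts


def patch_origins(shape, patch_shape, stride):
    """Return the origin coordinates of every patch, for any number of axes."""
    per_axis = [
        axis_starts(size, patch, step)
        for size, patch, step in zip(shape, patch_shape, stride)
    ]
    return list(product(*per_axis))


if __name__ == "__main__":
    # A hypothetical 3D volume processed in 64-voxel cubes with 50% overlap.
    origins = patch_origins((128, 128, 96), (64, 64, 64), (32, 32, 32))
    print(len(origins), "patches; first origin:", origins[0])
```

Each origin, together with the patch shape, identifies one crop that fits in memory; predictions on overlapping crops would then have to be merged, for example by averaging.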
translated by 谷歌翻译
Recently, the success of pre-training in the text domain has been fully extended to vision, audio, and cross-modal scenarios. The pre-training models proposed for different modalities show a rising trend of homogeneity in their model structures, which creates an opportunity to implement different pre-training models within a uniform framework. In this paper, we present TencentPretrain, a toolkit supporting pre-training models of different modalities. The core feature of TencentPretrain is its modular design. The toolkit uniformly divides pre-training models into 5 components: embedding, encoder, target embedding, decoder, and target. Because almost all common modules are provided in each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new ones. We test the toolkit on text, vision, and audio benchmarks and show that it can match the performance of the original implementations.
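The five-component decomposition can be pictured with a small toy sketch. This is not TencentPretrain's actual API: the class `PretrainModel`, the registries, the module names, and the `build_model` helper are all hypothetical and use plain Python lists in place of tensors. It only shows how picking one module per component could yield a complete model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Vector = List[float]

# Hypothetical registries: each component offers several interchangeable modules.
EMBEDDINGS = {"word": lambda ids: [[float(i)] for i in ids]}
ENCODERS = {"identity": lambda xs: xs, "reverse": lambda xs: xs[::-1]}
DECODERS = {"concat": lambda tgt, memory: memory + tgt}
TARGETS = {"sum": lambda xs: [sum(v[0] for v in xs)]}


@dataclass
class PretrainModel:
    """Toy model assembled from the five components named in the abstract."""
    embedding: Callable[[Sequence[int]], List[Vector]]
    encoder: Callable[[List[Vector]], List[Vector]]
    target: Callable[[List[Vector]], List[float]]
    target_embedding: Optional[Callable[[Sequence[int]], List[Vector]]] = None
    decoder: Optional[Callable[[List[Vector], List[Vector]], List[Vector]]] = None

    def forward(self, src, tgt=None):
        hidden = self.encoder(self.embedding(src))
        # The decoder side is optional, as in encoder-only models.
        if self.decoder and self.target_embedding and tgt is not None:
            hidden = self.decoder(self.target_embedding(tgt), hidden)
        return self.target(hidden)


def build_model(embedding, encoder, target, target_embedding=None, decoder=None):
    """Look up one module per component by name and compose them."""
    return PretrainModel(
        embedding=EMBEDDINGS[embedding],
        encoder=ENCODERS[encoder],
        target=TARGETS[target],
        target_embedding=EMBEDDINGS[target_embedding] if target_embedding else None,
        decoder=DECODERS[decoder] if decoder else None,
    )


if __name__ == "__main__":
    encoder_only = build_model("word", "identity", "sum")
    print(encoder_only.forward([1, 2, 3]))
```

Swapping a single name (for example the encoder) changes the model without touching the other components, which is the property the abstract attributes to the modular design.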
High-fidelity facial avatar reconstruction from a monocular video is a significant research problem in computer graphics and computer vision. Recently, Neural Radiance Field (NeRF) has shown impressive novel view rendering results and has been considered for facial avatar reconstruction. However, the complex facial dynamics and missing 3D information in monocular videos raise significant challenges for faithful facial reconstruction. In this work, we propose a new method for NeRF-based facial avatar reconstruction that utilizes a 3D-aware generative prior. Unlike existing works that depend on a conditional deformation field for dynamic modeling, we propose to learn a personalized generative prior, which is formulated as a local and low-dimensional subspace in the latent space of a 3D-GAN. We propose an efficient method to construct the personalized generative prior from a small set of facial images of a given individual. Once learned, the prior allows photo-realistic rendering from novel views, and face reenactment can be realized by navigating the latent space. Our proposed method is applicable to different driving signals, including RGB images, 3DMM coefficients, and audio. Compared with existing works, we obtain superior novel view synthesis results and faithful face reenactment performance.
Recent cross-lingual cross-modal works attempt to extend Vision-Language Pre-training (VLP) models to non-English inputs and achieve impressive performance. However, these models focus only on understanding tasks and use an encoder-only architecture. In this paper, we propose ERNIE-UniX2, a unified cross-lingual cross-modal pre-training framework for both generation and understanding tasks. ERNIE-UniX2 integrates multiple pre-training paradigms (e.g., contrastive learning and language modeling) based on an encoder-decoder architecture and attempts to learn a better joint representation across languages and modalities. Furthermore, ERNIE-UniX2 can be seamlessly fine-tuned for a variety of generation and understanding downstream tasks. Pre-trained on both multilingual text-only and image-text datasets, ERNIE-UniX2 achieves SOTA results on various cross-lingual cross-modal generation and understanding tasks such as multimodal machine translation and multilingual visual question answering.
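Distinguishing computer-generated (CG) images from natural photographic (PG) images is essential for verifying the authenticity and originality of digital images. However, recent cutting-edge generation methods achieve high synthesis quality in CG images, which makes this already challenging task even harder. To address this problem, a joint learning strategy with deep texture and high-frequency features is proposed for CG image detection. We first formulate and analyze in depth the different acquisition processes of CG and PG images. Based on the finding that the multiple different modules in image acquisition lead to different, inconsistent sensitivities to convolutional neural network (CNN)-based rendering in images, we propose a deep texture rendering module for texture difference enhancement and discriminative texture representation. Specifically, a semantic segmentation map is generated to guide an affine transformation operation, which is used to recover the texture in different regions of the input image. The original image, combined with the high-frequency components of the original and rendered images, is then fed into a multi-branch neural network equipped with attention mechanisms, which refine intermediate features and facilitate trace exploration in the spatial and channel dimensions, respectively. Extensive experiments on two public datasets and a newly constructed dataset with more realistic and diverse images show that the proposed method outperforms existing methods by a clear margin. In addition, the results demonstrate the detection robustness and generalization ability of the proposed method with respect to postprocessing operations and images generated by generative adversarial networks (GANs).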
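Risk scoring systems have been widely deployed in many applications, where they assign risk scores to users according to their behavior sequences. Although many deep learning methods with sophisticated designs have achieved promising results, their black-box nature hinders their application because of fairness, explainability, and compliance considerations. Rule-based systems are considered reliable in these sensitive scenarios. However, building a rule system is labor-intensive: experts need to find informative statistics in user behavior sequences, design rules based on those statistics, and assign a weight to each rule. In this paper, we bridge the gap between effective but black-box models and transparent rule models. We propose a two-stage method, Rudi, that distills the knowledge of a black-box teacher model into a rule-based student model. We design a statistics generation method based on Monte Carlo tree search that provides a set of informative statistics in the first stage. The statistics are then composed into logical rules by our proposed neural logical networks, which mimic the outputs of the teacher model. We evaluate Rudi on three real-world public datasets and one industrial dataset to demonstrate its effectiveness.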
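In recent years, significant progress has been made in biphasic face photo-sketch synthesis with the development of generative adversarial networks (GANs). Biphasic face photo-sketch synthesis can be applied in a wide range of fields such as digital entertainment and law enforcement. However, generating realistic photos and distinct sketches remains very challenging owing to the sketches and complex photo variations encountered in real-world scenes. To this end, we propose a novel semantic-driven generative adversarial network, combined with graph representation learning, to address these issues. Specifically, we inject class-wise semantic layouts into the generator to provide style-based spatial supervision for the synthesized face photos and sketches. In addition, to improve the fidelity of the generated results, we use the semantic layouts to construct two types of representational graphs, which indicate the intra-class semantic features and the inter-class structural features of the synthesized images. We further design two types of constraints based on the proposed representational graphs, which help preserve details in the generated face photos and sketches. Moreover, to further enhance the perceptual quality of the synthesized images, we propose a new biphasic training strategy dedicated to refining the generated results through iterative cycle training. Extensive experiments on the CUFS and CUFSF datasets demonstrate that our proposed method achieves state-of-the-art performance.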
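A learner corpus collects language data produced by L2 learners, that is, second or foreign language learners. Such a resource is relevant to second language acquisition research, foreign language teaching, and automatic grammatical error correction. However, few learner corpora focus on learners of Chinese as a Foreign Language (CFL). We therefore propose to construct a large-scale, multidimensionally annotated Chinese learner corpus. To build the corpus, we first obtain a large number of topic-rich texts written by CFL learners. We then design an annotation scheme that includes a sentence acceptability score as well as grammatical error and fluency-based corrections. We build a crowdsourcing platform to carry out the annotation efficiently (https://yaclc.wenmind.net). We name the corpus YACLC (Yet Another Chinese Learner Corpus) and release it as part of the CUGE benchmark (http://cuge.baai.ac.cn). By analyzing the original sentences and annotations in the corpus, we find that YACLC is of considerable size and has very high annotation quality. We hope this corpus will further advance research on Chinese international education and Chinese automatic grammatical error correction.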
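The rapid progress of photorealistic synthesis techniques has reached a critical point where the boundary between real and manipulated images starts to blur. Recently, ForgeryNet, a mega-scale deep face forgery dataset comprising 2.9 million images and 221,247 videos, has been released. It is the largest to date in terms of data scale, manipulations (7 image-level approaches, 8 video-level approaches), perturbations (36 independent and more mixed perturbations), and annotations (6.3 million classification labels, 2.9 million manipulated area annotations, and 221,247 temporal forgery segment labels). This paper reports the methods and results of the ForgeryNet - Face Forgery Analysis Challenge 2021, which employs the ForgeryNet benchmark. Model evaluation is conducted offline on a private test set. A total of 186 participants entered the competition, and 11 teams made valid submissions. We analyze the top-ranked solutions and present some discussion of future work directions.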
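Traffic optimization challenges, such as load balancing, flow scheduling, and improving packet delivery time, are difficult online decision-making problems in wide area networks (WANs). For example, complex heuristics are needed to find optimal paths that improve packet delivery time and minimize interruptions that may be caused by link failures or congestion. The recent success of reinforcement learning (RL) algorithms can provide useful solutions for building more robust systems that learn in model-free settings. In this work, we consider a path optimization problem, specifically packet routing, in large complex networks. We develop and evaluate a model-free approach that applies multi-agent meta reinforcement learning (MAMRL) to determine the next hop of each packet so that it is delivered to its destination in the minimum overall time. Specifically, we propose to leverage and compare deep policy optimization RL algorithms to enable distributed model-free control in communication networks, and we present a novel meta-learning-based framework, MAMRL, for fast adaptation to topology changes. To evaluate the proposed framework, we run simulations with various WAN topologies. Our extensive packet-level simulation results show that, compared with classical shortest-path and traditional reinforcement learning approaches, MAMRL significantly reduces the average packet delivery time even when network demand increases. Compared with a non-meta deep policy optimization algorithm, our results show reduced packet loss in far fewer episodes when link failures occur, with comparable average packet delivery time.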
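For context, the classical shortest-path baseline against which the learned policy is compared can be sketched as a next-hop lookup using Dijkstra's algorithm. This is only an illustration of that baseline, not the MAMRL method; the graph representation (a dict of per-link delays), the node names, and the function `next_hop` are assumptions introduced here.

```python
import heapq


def next_hop(graph, source, destination):
    """Return the first hop on a minimum-delay path, or None if unreachable.

    graph maps each node to a dict {neighbor: link_delay} with delays >= 0.
    """
    if source == destination:
        return source
    best = {source: 0.0}
    # Heap entries: (delay so far, node, first hop used to leave the source).
    heap = [(0.0, source, None)]
    while heap:
        delay, node, hop = heapq.heappop(heap)
        if delay > best.get(node, float("inf")):
            continue  # stale entry superseded by a shorter path
        if node == destination:
            return hop
        for neighbor, link_delay in graph.get(node, {}).items():
            candidate = delay + link_delay
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                first = neighbor if hop is None else hop
                heapq.heappush(heap, (candidate, neighbor, first))
    return None


if __name__ == "__main__":
    # Hypothetical three-node topology with per-link delays.
    topology = {"A": {"B": 1.0, "C": 5.0}, "B": {"C": 1.0}, "C": {}}
    print(next_hop(topology, "A", "C"))

    # After the B-C link fails, the route has to be recomputed from scratch.
    topology["B"].pop("C")
    print(next_hop(topology, "A", "C"))
```

A baseline like this needs an accurate, current view of the topology and link delays at every recomputation, which is the dependency a model-free approach tries to avoid.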