Tables present important information concisely in many scientific documents. Visual features such as mathematical symbols, equations, and spanning cells make structure and content extraction from tables embedded in research documents difficult. This paper discusses the dataset, tasks, participants' methods, and results of the ICDAR 2021 Competition on Scientific Table Image Recognition to LaTeX. Specifically, the task of the competition is to convert a tabular image into its corresponding LaTeX source code. We proposed two subtasks: in Subtask 1, participants reconstruct the LaTeX structure code from an image; in Subtask 2, participants reconstruct the LaTeX content code from an image. This report describes the datasets and their ground truth specification, details the performance evaluation metrics used, presents the final results, and summarizes the participating methods. Submissions by team VCGroup achieved the highest exact match accuracy scores of 74% for Subtask 1 and 55% for Subtask 2, beating previous baselines by 5% and 12%, respectively. Although there is still room to improve the recognition capabilities of the models, this competition contributes to the development of fully automated table recognition systems by challenging practitioners to solve a problem under specific constraints and to share their approaches. The platform will remain open for post-challenge submissions at https://competitions.codalab.org/competitions/26979.
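For concreteness, the exact match metric used above can be sketched as follows; this is a minimal illustration, assuming a whitespace-insensitive LaTeX tokenization, and is not the competition's official scoring code.

```python
import re

def latex_tokens(code: str) -> list[str]:
    """Split LaTeX source into commands, braces, and plain characters."""
    return re.findall(r"\\[a-zA-Z]+|\S", code)

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of samples whose token sequence matches the reference exactly."""
    hits = sum(latex_tokens(p) == latex_tokens(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

# Whitespace differences do not break the match; any token difference does.
print(exact_match_accuracy([r"\begin{tabular} {ll}"], [r"\begin{tabular}{ll}"]))  # 1.0
```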
This paper presents a systematic literature review of image datasets for document image analysis, focusing on historical documents such as handwritten manuscripts and early prints. Finding appropriate datasets for historical document analysis is a crucial prerequisite for research using different machine learning algorithms. However, because of the wide variety of the actual data (e.g., scripts, tasks, dates, support systems, and amount of deterioration), the different formats for data and label representation, and the different evaluation processes and benchmarks, finding appropriate datasets is a difficult task. This work fills that gap, presenting a meta-study on existing datasets. Following a systematic selection process (according to the PRISMA guidelines), we selected 56 studies based on different factors, such as the year of publication, the number of methods implemented in the article, the reliability of the chosen algorithms, the dataset size, and the journal outlet. We summarize each study by assigning it to one of three pre-defined tasks: document classification, layout structure, or semantic analysis. For every dataset, we provide statistics, the document type, language, tasks, input visual aspects, and ground truth information. In addition, we provide benchmark tasks and results from these papers or from recent competitions. We further discuss gaps and challenges in this domain. We advocate providing conversion tools to common formats (e.g., the COCO format for computer vision tasks) and always reporting a set of evaluation metrics, rather than just one, to make results comparable across studies.
Unconstrained handwritten text recognition remains challenging for computer vision systems. Paragraph recognition is traditionally achieved with two models: the first for line segmentation and the second for text line recognition. We propose a unified end-to-end model using hybrid attention to tackle this task. The model is designed to process a paragraph image iteratively, line by line, and can be split into three modules. An encoder generates feature maps from the whole paragraph image. An attention module then recurrently generates a vertical weighted mask that focuses on the features of the current text line, thereby performing a kind of implicit line segmentation. For each set of text line features, a decoder module recognizes the associated character sequence, leading to the recognition of the whole paragraph. We achieve state-of-the-art character error rates at the paragraph level on three popular datasets: 1.91% on RIMES, 4.45% on IAM, and 3.59% on READ 2016. Our code and trained model weights are available at https://github.com/FactoDeepLearning/VerticalAttentionOCR.
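A minimal PyTorch sketch of the recurrent vertical attention step described above; the shapes, module layout, and conditioning on the previous line's mask are simplifying assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class VerticalAttention(nn.Module):
    """Collapse a (B, C, H, W) feature map into per-line (B, C, W) features."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, features, prev_mask):
        # Score each position, conditioned on the previous line's mask (B, 1, H, W).
        s = self.score(features + prev_mask)             # (B, 1, H, W)
        weights = torch.softmax(s.mean(dim=3), dim=2)    # (B, 1, H): one weight per row
        # Weighted sum over the vertical axis yields one text-line feature map.
        line = (features * weights.unsqueeze(3)).sum(dim=2)  # (B, C, W)
        return line, weights
```

Calling the module once per decoding step, feeding back the previous weights as the new mask, is what makes the line segmentation implicit: no line boxes are ever supplied.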
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.
Unconstrained handwritten text recognition is a challenging computer vision task. It is traditionally handled by a two-step approach combining line segmentation followed by text line recognition. For the first time, we propose an end-to-end segmentation-free architecture for the task of handwritten document recognition: the Document Attention Network. In addition to text recognition, the model is trained to label text parts with begin and end tags in an XML-like fashion. The model consists of an FCN encoder for feature extraction and a stack of transformer decoder layers for a recurrent token-by-token prediction process. It takes whole text documents as input and sequentially outputs characters as well as logical layout tokens. Contrary to existing segmentation-based approaches, the model is trained without any segmentation labels. We achieve competitive results on the READ 2016 dataset at the page level as well as at the double-page level, with CERs of 3.43% and 3.70%, respectively. We also provide results for the RIMES 2009 dataset at the page level, reaching a 4.54% CER. All source code and pre-trained model weights are available at https://github.com/FactoDeepLearning/DAN.
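The XML-like target serialization can be pictured as follows; the tag names and page structure here are illustrative assumptions, not the model's actual token vocabulary.

```python
# Hypothetical serialization of a two-region page into a DAN-style target
# sequence: layout tokens wrap plain character streams, so the decoder emits
# both in a single pass with no segmentation labels.
regions = [("section", "Results"), ("paragraph", "We reach a 3.43% CER at page level.")]
target = "<page>" + "".join(f"<{kind}>{text}</{kind}>" for kind, text in regions) + "</page>"
print(target)
# <page><section>Results</section><paragraph>We reach a 3.43% CER at page level.</paragraph></page>
```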
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 2nd International Workshop on Reading Music Systems, held in Delft on the 2nd of November 2019.
We introduce Dessurt, a relatively simple document understanding transformer capable of being fine-tuned on a greater variety of document tasks than prior methods. It receives a document image and a task string as input and autoregressively generates arbitrary text as output. Because Dessurt is an end-to-end architecture that performs text recognition in addition to document understanding, it does not require an external recognition model as prior methods do. Dessurt is more flexible than prior approaches and is able to handle a variety of document domains and tasks. We show that the model is effective on nine different dataset-task combinations.
Text recognition is a long-standing research problem in document digitization. Existing approaches are usually built on a CNN for image understanding and an RNN for character-level text generation; in addition, another language model is usually needed as a post-processing step to improve overall accuracy. In this paper, we propose TrOCR, an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained on large-scale synthetic data and fine-tuned on human-labeled datasets. Experiments show that TrOCR outperforms the current state-of-the-art models on printed, handwritten, and scene text recognition tasks. The TrOCR models and code are publicly available at \url{https://aka.ms/trocr}.
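TrOCR checkpoints are also distributed through Hugging Face, so inference can be sketched as below; the checkpoint name and image path are illustrative, and this is a usage sketch rather than the authors' evaluation pipeline.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load a publicly released TrOCR checkpoint (name assumed for illustration).
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("line_image.png").convert("RGB")  # a cropped text-line image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```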
Handwritten Text Recognition (HTR) is more interesting and challenging than printed text due to uneven variations in the handwriting style of the writers, content, and time. HTR becomes more challenging for the Indic languages because of (i) multiple characters combined to form conjuncts which increase the number of characters of respective languages, and (ii) near to 100 unique basic Unicode characters in each Indic script. Recently, many recognition methods based on the encoder-decoder framework have been proposed to handle such problems. They still face many challenges, such as image blur and incomplete characters due to varying writing styles and ink density. We argue that most encoder-decoder methods are based on local visual features without explicit global semantic information. In this work, we enhance the performance of Indic handwritten text recognizers using global semantic information. We use a semantic module in an encoder-decoder framework for extracting global semantic information to recognize the Indic handwritten texts. The semantic information is used in both the encoder for supervision and the decoder for initialization. The semantic information is predicted from the word embedding of a pre-trained language model. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art results on handwritten texts of ten Indic languages.
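One way to picture the semantic supervision described above is the sketch below; the pooling, projection, and cosine loss are simplifying assumptions rather than the paper's exact formulation. The predicted vector is regressed toward the pre-trained language model's word embedding on the encoder side, and the same vector can initialize the decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticModule(nn.Module):
    """Predict a global semantic vector from pooled encoder features."""
    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, encoder_features):          # (B, T, feat_dim)
        pooled = encoder_features.mean(dim=1)     # global average over positions
        return self.proj(pooled)                  # (B, embed_dim)

def semantic_loss(pred_embedding, word_embedding):
    """Align the predicted vector with the language-model word embedding."""
    return 1.0 - F.cosine_similarity(pred_embedding, word_embedding, dim=-1).mean()
```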
This paper presents the final results of the Out of Vocabulary 2022 (OOV) challenge. The OOV contest introduces an important aspect that is not commonly studied by Optical Character Recognition (OCR) models, namely the recognition of scene text instances unseen at training time. The competition compiles a collection of public scene text datasets comprising 326,385 images with 4,864,405 scene text instances, thus covering a wide range of data distributions. A new and independent validation and test set is formed from scene text instances that are out of vocabulary at training time. The competition was structured around two tasks: end-to-end recognition and cropped text recognition. A thorough analysis of the results of the baselines and the different participants is presented. Interestingly, current state-of-the-art models show a significant performance gap under the newly studied setting. We conclude that the OOV dataset proposed in this challenge will be an important area to explore in order to develop scene text models that achieve more robust and generalized predictions.
Tables are widely used in several types of documents since they can convey important information in a structured way. In scientific papers, tables can sum up novel discoveries and summarize experimental results, making research comparable and easily understandable by scholars. Several methods perform table analysis on document images, losing useful information during the conversion from PDF files, since OCR tools can be prone to recognition errors, in particular for the text inside tables. The main contribution of this work is to tackle the problem of table extraction by exploiting Graph Neural Networks. Node features are enriched with suitably designed representation embeddings. These representations help to better distinguish not only tables from the other parts of the paper, but also table cells from table headers. We experimentally evaluate the proposed approach on a new dataset obtained by merging the information provided in the PubLayNet and PubTables-1M datasets.
Offline handwritten mathematical expression recognition (HMER) is a major area within mathematical expression recognition. Compared with online HMER, offline HMER is often considered a harder problem due to the lack of temporal information and the variability of writing styles. In this paper, we apply paired adversarial learning to an encoder-decoder model: semantic-invariant features are extracted in the encoder from handwritten mathematical expression images and their printed counterparts. Learning semantic-invariant features, combined with a DenseNet encoder and a Transformer decoder, allows us to improve the expression recognition rate over previous studies. Evaluated on the CROHME datasets, we improve the state-of-the-art result on the CROHME 2019 test set by 4%.
The active consumption of digital documents has opened up avenues for research in a variety of applications, including search. Traditionally, search within a document is cast as a text matching problem, ignoring the rich layout and visual cues commonly present in structured documents, forms, etc. To that end, we pose a mostly unexplored question: "Given a single query instance of a document snippet, can we search for other similar snippets present in a target document page?". We propose Monomer to solve this as a one-shot snippet detection task. Monomer fuses context from the visual, textual, and spatial modalities of the snippet and the document to find query snippets in target documents. We conduct extensive ablations and experiments showing that Monomer outperforms several baselines from one-shot object detection (BHRL), template matching, and document understanding (LayoutLMv3). Since relevant data for the task at hand is lacking, we train Monomer on programmatically generated data containing many visually similar query snippets and target document pairs from two datasets, Flamingo Forms and PubLayNet. We also conduct a human study to validate the generated data.
Transformers are widely used in NLP tasks. However, current approaches to leveraging transformers to understand language expose one weak spot: Number understanding. In some scenarios, numbers frequently occur, especially in semi-structured data like tables. But current approaches to rich-number tasks with transformer-based language models abandon or lose some of the numeracy information - e.g., breaking numbers into sub-word tokens - which leads to many number-related errors. In this paper, we propose the LUNA framework which improves the numerical reasoning and calculation capabilities of transformer-based language models. With the number plugin of NumTok and NumBed, LUNA represents each number as a whole to model input. With number pre-training, including regression loss and model distillation, LUNA bridges the gap between number and vocabulary embeddings. To the best of our knowledge, this is the first work that explicitly injects numeracy capability into language models using Number Plugins. Besides evaluating toy models on toy tasks, we evaluate LUNA on three large-scale transformer models (RoBERTa, BERT, TabBERT) over three different downstream tasks (TATQA, TabFact, CrediTrans), and observe the performances of language models are constantly improved by LUNA. The augmented models also improve the official baseline of TAT-QA (EM: 50.15 -> 59.58) and achieve SOTA performance on CrediTrans (F1 = 86.17).
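The "number as a whole" idea behind the NumTok plugin can be sketched as follows; the regex, placeholder token, and interface are toy assumptions for illustration, not LUNA's actual implementation. Lifting each literal number out of the subword stream avoids the fragmentation (e.g., "3.45" becoming several sub-word tokens) that the paper identifies as a source of number-related errors.

```python
import re

NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def lift_numbers(text: str):
    """Replace each literal number with a single [NUM] placeholder and
    return the numeric values so they can be embedded as whole numbers."""
    values = [float(m.group()) for m in NUM_RE.finditer(text)]
    return NUM_RE.sub("[NUM]", text), values

tokens, values = lift_numbers("Revenue grew from 1.2 to 3.45 billion")
print(tokens)   # Revenue grew from [NUM] to [NUM] billion
print(values)   # [1.2, 3.45]
```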
We propose a neural-network-based handwritten text recognition (HTR) model architecture that can be trained to recognize full pages of handwritten or printed text without image segmentation. Based on an image-to-sequence architecture, it can extract the text present in an image and sequence it correctly without imposing any constraints on the orientation, layout, or size of the text and non-text content. Furthermore, it can also be trained to generate auxiliary markup related to formatting, layout, and content. We use a character-level vocabulary, thereby supporting language and terminology from any subject. The model achieves a new state of the art in paragraph-level recognition on the IAM dataset. When evaluated on real-world handwritten free-form test answers, with curved and slanted lines, drawings, tables, math, chemistry, and other symbols, it performs better than all commercially available HTR cloud APIs. It is deployed in production as part of a commercial web application.
We introduce two data augmentation techniques which, used together with a ResNet-BiLSTM-CTC network, significantly reduce the word error rate (WER) and character error rate (CER) beyond the best reported results on handwritten text recognition (HTR) tasks. We apply a novel augmentation that simulates strikethrough text (HandWritten Blots) and a handwritten-text generation method based on printed text (StackMix), both of which proved to be highly effective for HTR. StackMix uses a weakly supervised framework to obtain character boundaries. Because these data augmentation techniques are independent of the network used, they can also be applied to enhance the performance of other networks and approaches to HTR. Extensive experiments on ten handwritten text datasets show that the HandWritten Blots augmentation and StackMix significantly improve the quality of HTR models.
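A rough sketch of a blot-style augmentation is given below; the stroke geometry and opacity are assumptions for illustration, not the paper's HandWritten Blots implementation.

```python
import random
from PIL import Image, ImageDraw

def handwritten_blot(img: Image.Image) -> Image.Image:
    """Overlay a random semi-transparent ink stroke, imitating strikethrough."""
    out = img.convert("RGBA")
    overlay = Image.new("RGBA", out.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    w, h = out.size
    # Jittered horizontal stroke through the middle third of the line image.
    y = random.randint(h // 3, 2 * h // 3)
    points = [(x, y + random.randint(-h // 10, h // 10))
              for x in range(0, w, max(w // 8, 1))]
    draw.line(points, fill=(0, 0, 0, 160), width=max(h // 15, 2))
    return Image.alpha_composite(out, overlay).convert(img.mode)
```

Because the augmentation operates on the input image only, it composes freely with any HTR network, which is the independence property the abstract emphasizes.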
Code completion aims to help improve developers' productivity by suggesting the next code tokens from a given context. Various approaches have been proposed to incorporate abstract syntax tree (AST) information for model training, ensuring that code completion is aware of the syntax of the programming languages. However, existing syntax-aware code completion approaches are not on-the-fly, as we found that for every two-thirds of characters that developers type, AST fails to be extracted because it requires the syntactically correct source code, limiting its practicality in real-world scenarios. On the other hand, existing on-the-fly code completion does not consider syntactic information yet. In this paper, we propose PyCoder to leverage token types, a kind of lightweight syntactic information, which is readily available and aligns with the natural order of source code. Our PyCoder is trained in a multi-task training manner so that by learning the supporting task of predicting token types during the training phase, the models achieve better performance on predicting tokens and lines of code without the need for token types in the inference phase. Comprehensive experiments show that PyCoder achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12% for the token-level predictions, which is 0.43%-24.25% more accurate than baselines. In addition, PyCoder achieves an exact match of 43.37% for the line-level predictions, which is 3.63%-84.73% more accurate than baselines. These results lead us to conclude that token type information (an alternative to syntactic information) that is rarely used in the past can greatly improve the performance of code completion approaches, without requiring the syntactically correct source code like AST-based approaches do. Our PyCoder is publicly available on HuggingFace.
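Token types of the kind PyCoder leverages are cheap to obtain and follow the natural order of the source; the sketch below uses Python's standard tokenizer as an illustration and is not PyCoder's actual pipeline.

```python
import io
import tokenize

def token_type_stream(source: str):
    """Yield (token_text, token_type_name) pairs in source order."""
    reader = io.BytesIO(source.encode("utf-8")).readline
    for tok in tokenize.tokenize(reader):
        if tok.type not in (tokenize.ENCODING, tokenize.ENDMARKER):
            yield tok.string, tokenize.tok_name[tok.type]

for text, kind in token_type_stream("total = price * 2\n"):
    print(f"{text!r:10} {kind}")
# 'total'    NAME
# '='        OP
# 'price'    NAME
# '*'        OP
# '2'        NUMBER
# '\n'       NEWLINE
```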
Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition rather than pure visual classification. However, how to effectively model the linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting. Firstly, the autonomous suggests enforcing explicitly language modeling by decoupling the recognizer into vision model and language model and blocking gradient flow between both models. Secondly, a novel bidirectional cloze network (BCN) as the language model is proposed based on bidirectional feature representation. Thirdly, we propose an execution manner of iterative correction for the language model which can effectively alleviate the impact of noise input. Finally, to polish ABINet++ in long text recognition, we propose to aggregate horizontal features by embedding Transformer units inside a U-Net, and design a position and content attention module which integrates character order and content to attend to character features precisely. ABINet++ achieves state-of-the-art performance on both scene text recognition and scene text spotting benchmarks, which consistently demonstrates the superiority of our method in various environments especially on low-quality images. Besides, extensive experiments including in English and Chinese also prove that, a text spotter that incorporates our language modeling method can significantly improve its performance both in accuracy and speed compared with commonly used attention-based recognizers.
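The iterative correction scheme for the language model can be sketched as a short refinement loop; the module interfaces below are assumptions standing in for ABINet++'s vision model, language model, and fusion components, not the released code.

```python
import torch

def iterative_correction(vision_logits, language_model, fusion, n_iters: int = 3):
    """Refine vision predictions by repeatedly passing the current best
    guess through the (gradient-blocked) language model."""
    logits = vision_logits
    for _ in range(n_iters):
        tokens = logits.argmax(dim=-1).detach()    # block gradient flow into the LM
        lm_logits = language_model(tokens)         # bidirectional cloze-style refinement
        logits = fusion(vision_logits, lm_logits)  # fuse vision and language evidence
    return logits
```

Each pass hands the language model a cleaner input than the last, which is how the scheme alleviates the noise-input problem the abstract identifies.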
Document AI, or Document Intelligence, is a relatively new research topic that refers to techniques for automatically reading, understanding, and analyzing business documents. It is an important research direction for both natural language processing and computer vision. In recent years, the popularity of deep learning technology has greatly advanced the development of Document AI in tasks such as document layout analysis, visual information extraction, document visual question answering, and document image classification. This paper briefly reviews some of the representative models, tasks, and benchmark datasets. Furthermore, we also introduce early-stage heuristic rule-based document analysis, statistical machine learning algorithms, and deep learning approaches, especially the pre-training methods. Finally, we look at future directions for Document AI research.
Almost all scene text spotting (detection and recognition) methods rely on costly box annotations (e.g., text-line boxes, word-level boxes, and character-level boxes). For the first time, we demonstrate that scene text spotting models can be trained with an extremely low-cost annotation of a single point per instance. We propose an end-to-end method that treats scene text spotting as a sequence prediction task, much like language modeling. Given an image as input, we formulate the desired detection and recognition results as a sequence of discrete tokens and use an auto-regressive Transformer to predict the sequence. We achieve promising results on several horizontal, multi-oriented, and arbitrarily shaped scene text benchmarks. Most significantly, we show that performance is not very sensitive to the position of the point annotation, which means points can be annotated, or even automatically generated, far more easily than bounding boxes requiring precise positions. We believe this pioneering attempt indicates a significant opportunity for scene text spotting applications at a much larger scale than was previously possible.
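The token-sequence formulation can be pictured as below; the bin count, token layout, and helper name are assumptions modeled on the description above, not the authors' released code.

```python
def serialize_instance(x: float, y: float, text: str,
                       img_w: int, img_h: int, bins: int = 1000):
    """Turn one (point, transcription) pair into discrete tokens:
    two coordinate-bin tokens followed by character tokens."""
    bx = min(int(x / img_w * bins), bins - 1)
    by = min(int(y / img_h * bins), bins - 1)
    return [f"<x_{bx}>", f"<y_{by}>"] + list(text)

print(serialize_instance(320, 240, "EXIT", img_w=640, img_h=480))
# ['<x_500>', '<y_500>', 'E', 'X', 'I', 'T']
```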