Data-centric AI has shed light on the significance of data within the machine learning (ML) pipeline. Acknowledging its importance, academia, industry, and government departments have proposed various research directions and policies. Although the ability to use existing data is essential, the ability to build a dataset has become more important than ever. In light of this trend, we propose "Data Management Operation and Recipes" (DMOps), which is intended to guide the industry regardless of task or domain. In other words, this paper presents the concept of DMOps, derived from real-world experience. By offering a baseline for building data, we aim to help the industry streamline its data operations.
With recent advances in neural machine translation demonstrating its importance, research on quality estimation (QE) has been steadily progressing. QE aims to automatically predict the quality of machine translation (MT) output without reference sentences. Despite its high utility in the real world, manual QE data creation has several limitations: non-trivial costs that are unavoidable because translation experts are needed, and difficulties with data scaling and language expansion. To tackle these limitations, we present QUAK, a Korean-English synthetic QE dataset generated in a fully automatic manner. It consists of three sub-datasets, QUAK-M, QUAK-P, and QUAK-H, produced through three strategies that are relatively free from language constraints. Since each strategy requires no human effort, which facilitates scalability, we scale our data up to 1.58M for QUAK-P and QUAK-H and 6.58M for QUAK-M. As an experiment, we quantitatively analyze word-level QE results in various ways while performing statistical analysis. Moreover, we show that efficiently scaled datasets also contribute to performance improvements, observing meaningful performance gains on QUAK-M and QUAK-P when adding data up to 1.58M.
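As pre-trained language models become more resource-demanding, the inequality between resource-rich languages such as English and resource-scarce languages is worsening. This can be attributed to the fact that the amount of available training data in each language follows a power-law distribution, and most languages belong to the long tail of that distribution. Some research areas attempt to mitigate this problem. For example, in cross-lingual transfer learning and multilingual training, the goal is to benefit long-tail languages through the knowledge acquired from resource-rich languages. Although successful, existing work has mainly focused on experimenting with as many languages as possible. As a result, targeted in-depth analysis is mostly absent. In this study, we focus on a single low-resource language and perform extensive evaluation and probing experiments using cross-lingual post-training (XPT). To make the transfer scenario challenging, we choose Korean as the target language, as it is a language isolate and thus shares almost no typology with English. Results show that XPT not only outperforms or performs on par with monolingual models trained with orders of magnitude more data, but is also highly efficient in the transfer process.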
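BlenderBot 2.0 is a dialogue model that represents open-domain chatbots by reflecting real-time information and remembering user information over an extended period, using an internet search module and multi-session conversations. Nonetheless, the model still has room for improvement. To this end, we examine the limitations and errors of BlenderBot 2.0 from three perspectives: model, data, and user. From the data perspective, we highlight the unclear guidelines provided to workers during the crowdsourcing process, as well as the lack of a process for filtering hate speech in the collected data and for verifying the accuracy of internet-based information. From the user perspective, we identify nine types of problems with BlenderBot 2.0 and thoroughly investigate their causes. Furthermore, for each perspective, we propose practical improvement methods and discuss several potential directions for future research.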
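We propose a deep learning-based foreign language learning platform, named FreeTalky, for people who experience anxiety when dealing with foreign languages. It employs the humanoid robot NAO and various deep learning models. A persona-based dialogue system embedded in NAO provides users with an engaging and consistent multi-turn dialogue. In addition, a grammatical error correction system promotes improvement in the users' grammar skills. Thus, our system supports personalized learning based on persona dialogue and facilitates grammar learning through grammatical error feedback. Furthermore, we verified through human evaluation whether FreeTalky provides practical help in alleviating foreign-language anxiety by replacing the real human in the conversation with a NAO robot.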
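Building data for automatic post-editing (APE) requires extensive, expert-level human effort, as it involves an elaborate process of identifying errors in sentences and providing suitable revisions. Hence, we develop a self-supervised data generation tool, deployable as a web application, that minimizes human supervision and constructs personalized APE data from a parallel corpus for several language pairs with English as the target language. This tool enables data-centric APE research involving many language pairs that have not yet been studied owing to the lack of suitable data.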
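Building data for quality estimation (QE) training is expensive and requires significant human labor. In this study, we focus on a data-centric approach to QE and propose a fully automatic pseudo-QE dataset generation tool that produces QE datasets from only a monolingual or parallel corpus as input. Consequently, QE performance can be enhanced either through data augmentation or by extending the applicability of QE to multiple language pairs. Furthermore, we intend to publicly release this user-friendly QE dataset generation tool, as we believe it offers the community a new, inexpensive way to develop QE datasets.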
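Abstractive dialogue summarization is a challenging task for several reasons. First, most of the important information in a conversation is scattered across utterances through multi-party interactions with different textual styles. Second, dialogues are often informally structured, with different individuals expressing personal perspectives, unlike text summarization, which usually targets formal documents such as news articles. To address these issues, we focus on the association between utterances from individual speakers and their unique syntactic structures. Speakers have unique textual styles that can carry linguistic information, much like a voiceprint. Therefore, we construct a syntax-aware model by leveraging linguistic information (i.e., POS tagging), which alleviates the above issues by naturally distinguishing sentences uttered by individual speakers. We employ multi-task learning of both syntax-aware information and dialogue summarization. To the best of our knowledge, our approach is the first to apply multi-task learning to the dialogue summarization task. Experiments on the SAMSum corpus (a large-scale dialogue summarization corpus) demonstrate that our method improves upon the vanilla model. We further analyze the costs and benefits of our approach relative to baseline models.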
The automated segmentation and tracking of macrophages during their migration are challenging tasks due to their dynamically changing shapes and motions. This paper proposes a new algorithm for automatic cell tracking in time-lapse microscopy macrophage data. First, we design a segmentation method employing space-time filtering, local Otsu's thresholding, and the SUBSURF (subjective surface segmentation) method. Next, partial trajectories are extracted from the segmented images for cells that overlap in the temporal direction. Finally, the extracted trajectories are linked by considering their direction of movement. The segmented images and trajectories obtained with the proposed method are compared with those from semi-automatic segmentation and manual tracking. The proposed tracking achieved 97.4% accuracy on macrophage data under challenging conditions, including weak fluorescence intensity and the irregular shapes and motion of macrophages. We expect that the automatically extracted trajectories can provide evidence of how macrophages migrate depending on their polarization modes in situations such as wound healing.
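As an illustration of the thresholding step named above, the following is a minimal sketch of Otsu's method applied locally. It is not the authors' implementation: the abstract does not say how the local neighborhoods are defined, so the non-overlapping square tiles, the `tile` parameter, and the function names `otsu_threshold` and `local_otsu` are assumptions made here for illustration only. Images are assumed to be 8-bit grayscale, given as lists of rows.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int], levels: int = 256) -> int:
    """Return the threshold t that maximizes the between-class variance.

    Pixels with value <= t form the background class, the rest the foreground.
    Assumes a non-empty sequence of integers in [0, levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))

    # Fallback for a single-intensity input: no pixel exceeds max(pixels),
    # so nothing is marked as foreground.
    best_t, best_var = max(pixels), -1.0
    w0 = 0    # number of pixels in the background class
    sum0 = 0  # intensity sum of the background class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # background class still empty
        if w1 == 0:
            break     # foreground class empty from here on
        mean0 = sum0 / w0
        mean1 = (total_sum - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def local_otsu(image: List[List[int]], tile: int) -> List[List[int]]:
    """Binarize an image by applying Otsu's threshold per tile.

    Hypothetical tiling scheme: non-overlapping tile x tile blocks.
    """
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    for r0 in range(0, height, tile):
        for c0 in range(0, width, tile):
            rows = range(r0, min(r0 + tile, height))
            cols = range(c0, min(c0 + tile, width))
            block = [image[r][c] for r in rows for c in cols]
            t = otsu_threshold(block)
            for r in rows:
                for c in cols:
                    mask[r][c] = 1 if image[r][c] > t else 0
    return mask
```

Thresholding per tile rather than over the whole frame lets dim cells in one region be separated from their local background even when another region is much brighter, which is one plausible reason to prefer a local variant when fluorescence intensity is weak and uneven.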
With the rapid development of drone technologies, drones are widely used in many applications, including military domains. This paper proposes a novel situation-aware, deep reinforcement learning (DRL)-based autonomous nonlinear drone mobility control algorithm for cyber-physical loitering munition applications. On the battlefield, designing a DRL-based autonomous control algorithm is not straightforward because real-world data gathering is generally not possible. Therefore, the approach in this paper is to construct a cyber-physical virtual environment with Unity. Based on the virtual cyber-physical battlefield scenarios, a DRL-based automated nonlinear drone mobility control algorithm can be designed, evaluated, and visualized. Moreover, real-world battlefield scenarios contain many obstacles, which are harmful to linear trajectory control. Thus, our proposed autonomous nonlinear drone mobility control algorithm uses situation-aware components that are implemented with a Raycast function in Unity virtual scenarios. Based on the gathered situation-aware information, the drone can autonomously and nonlinearly adjust its trajectory during flight. This approach is therefore beneficial for avoiding obstacles in obstacle-deployed battlefields. Our visualization-based performance evaluation shows that the proposed algorithm is superior to other linear mobility control algorithms.