Smart Paper Notes

Learnings from Technological Interventions in a Low Resource Language: Enhancing Information Access in Gondi

Devansh Mehta , Harshita Diddee , Ananya Saxena , Anurag Shukla , Sebastin Santy , Ramaravind Kommiya Mothilal , Brij Mohan Lal Srivastava , Alok Sharma , Vishnu Prasad , Venkanna U

Category: Natural Language Processing

2022-11-29

The primary obstacle to developing technologies for low-resource languages is the lack of representative, usable data. In this paper, we report the deployment of technology-driven data collection methods for creating a corpus of more than 60,000 translations from Hindi to Gondi, a low-resource, vulnerable language spoken by around 2.3 million tribal people in south and central India. During this process, we help expand information access in Gondi along two dimensions: (a) the creation of linguistic resources that can be used by the community, such as a dictionary, children's stories, Gondi translations from multiple sources, and an Interactive Voice Response (IVR) based mass awareness platform; and (b) enabling its use in the digital domain by developing a Hindi-Gondi machine translation model, which is compressed by nearly 4 times to enable its deployment on low-resource edge devices and in areas with little to no internet connectivity. We also present preliminary evaluations of using the developed machine translation model to assist volunteers who are collecting more data for the target language. Through these interventions, we not only created a refined and evaluated corpus of 26,240 Hindi-Gondi translations that was used for building the translation model, but also engaged nearly 850 community members who can help take Gondi onto the internet.


Taking a Language Detour: How International Migrants Speaking a Minority Language Seek COVID-Related Information in Their Host Countries

Ge Gao , Jian Zheng , Eun Kyoung Choe , Naomi Yamashita

Category: Natural Language Processing

2022-09-07

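Information seeking is crucial for people's self-care and wellbeing in times of public crisis. Extensive research has investigated empirical understandings and technical solutions to facilitate information seeking by domestic citizens of affected regions. However, limited knowledge has been established to support international migrants who need to get through a crisis in their host countries. This paper presents an interview study with two cohorts of Chinese migrants living in Japan (n = 14) and the United States (n = 14). Participants reflected on their information-seeking experiences during the COVID pandemic. The reflection was supplemented by two weeks of self-tracking, during which participants kept records of their COVID-related information-seeking practices. Our data indicate that participants often took language detours, that is, visits to Mandarin resources for information about the COVID outbreak in their host countries. They also made strategic use of the Mandarin information to perform selective reading, cross-checking, and contextualized interpretation of COVID-related information in Japanese or English. While such practices enhanced participants' perceived effectiveness of COVID-related information gathering and sensemaking, they sometimes disadvantaged people in ways that went unrecognized. Further, participants lacked the awareness of, or the preference for, reviewing migrant-oriented information issued by the host country's public authorities, despite its availability. Building on these findings, we discuss solutions to improve international migrants' COVID-related information seeking in their non-native language and cultural environment. We advocate inclusive crisis infrastructures that engage people with diverse levels of local language fluency, information literacy, and experience in using public services.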


Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks

Barack Wanjawa , Lilian Wanzare , Florence Indede , Owen McOnyango , Edward Ombui , Lawrence Muchemi

Category: Natural Language Processing

2022-08-25

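Indigenous African languages are categorized as under-served in artificial intelligence and suffer from poor digital inclusivity and information access. The challenge has been how to use machine learning and deep learning models without the requisite data. Kencorpus is a Kenyan language corpus that intends to bridge the gap on how to collect and store text and speech data that is good enough to enable data-driven solutions such as machine translation, question answering, and transcription in multilingual communities. Kencorpus is a corpus (text and speech) for three languages predominantly spoken in Kenya: Swahili, Dholuo, and Luhya (dialects Lumarachi, Lulogooli, and Lubukusu). The corpus intends to fill the gap of developing a dataset that can be used for natural language processing and machine learning tasks for low-resource languages. Each of these languages contributed text and speech data to the corpus. Data collection was done by researchers from communities, schools, and collaborating partners (media, publishers). Kencorpus has a collection of 5,594 items, comprising 4,442 texts (5.6 million words) and 1,152 speech files (177 hours). Based on this data, other datasets were also developed, such as POS tagging sets for Dholuo and Luhya (50,000 and 93,000 words tagged, respectively), question-answer pairs from Swahili texts (7,537 QA pairs), and translations of texts into Swahili (12,400 sentences). The datasets are useful for machine learning tasks such as text processing, annotation, and translation. The project also built proof-of-concept systems for speech-to-text and for machine learning on the QA task, with initial results confirming the usability of Kencorpus to the machine learning community. Kencorpus is the first corpus of its kind for these low-resource languages and forms a basis for learning from and sharing experiences with similar work.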


Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

Ritesh Kumar , Siddharth Singh , Shyam Ratan , Mohit Raj , Sonal Sinha , bornini lahiri , Vivek Seshadri , Kalika Bali , Atul Kr. Ojha

Category: Natural Language Processing

2022-06-26

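In this paper, we discuss in-progress work on the development of a speech corpus for four low-resource Indian languages (Awadhi, Bhojpuri, Braj, and Magahi) using the field methods of linguistic data collection. The total size of the corpus currently stands at approximately 18 hours (about 4-5 hours per language), and it is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features, and Universal Dependencies relations. We discuss our methodology for data collection in these languages, most of which was done in the middle of the COVID-19 pandemic, with one of the aims being to generate some additional income for low-income groups who speak these languages. We also discuss the results of baseline experiments for automatic speech recognition systems in these languages.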


Resources for Turkish Natural Language Processing: A critical survey

Çağrı Çöltekin , A. Seza Doğruöz , Özlem Çetinoğlu

Category: Natural Language Processing

2022-04-11

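This paper presents a comprehensive survey of corpora and lexical resources available for Turkish. We review a broad range of resources, focusing on those that are publicly available. In addition to providing information about the available linguistic resources, we present a set of recommendations and identify gaps in the data available for conducting research and building applications in Turkish linguistics and natural language processing.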


Building Machine Translation Systems for the Next Thousand Languages

Ankur Bapna , Isaac Caswell , Julia Kreutzer , Orhan Firat , Daan van Esch , Aditya Siddhant , Mengmeng Niu , Pallavi Baljekar , Xavier Garcia , Wolfgang Macherey

Categories: Natural Language Processing | Artificial Intelligence | Machine Learning

2022-05-09

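In this paper, we share findings from our effort to build practical machine translation (MT) systems capable of translating across more than one thousand languages. We describe results in three research domains: (i) building clean, web-mined datasets for 1,500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages and monolingual datasets for an additional 1,000+ languages; and (iii) studying the limitations of evaluation metrics for these languages and conducting a qualitative analysis of the outputs of our MT models, highlighting several frequent error modes of these types of models. We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.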


Democratizing Machine Translation with OPUS-MT

Jörg Tiedemann , Mikko Aulamo , Daria Bakshandaeva , Michele Boggia , Stig-Arne Grönroos , Tommi Nieminen , Alessandro Raganato , Yves Scherrer , Raul Vazquez , Sami Virpioja

Category: Natural Language Processing

2022-12-04

This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.


NLP for Language Varieties of Italy: Challenges and the Path Forward

Alan Ramponi

Category: Natural Language Processing

2022-09-20

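Italy is characterized by a one-of-a-kind linguistic diversity landscape in Europe, which implicitly encodes the local knowledge, cultural traditions, artistic expressions, and history of its speakers. However, more than 30 of Italy's language varieties are at risk of disappearing within a few generations. Language technology has a major role to play in preserving endangered languages, but it currently struggles with such varieties because they are under-resourced, mostly lack a standardized orthography, and are mainly used in spoken settings. In this paper, we introduce the linguistic context of Italy and discuss the challenges facing the development of NLP technologies for Italy's language varieties. We provide potential directions and advocate a paradigm shift from machine-centric to speaker-centric NLP. Finally, we propose building a local community aimed at the responsible, participatory development of speech and language technologies for the languages and dialects of Italy.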


A Systematic Review and Thematic Analysis of Community-Collaborative Approaches to Computing Research

Ned Cooper , Tiffanie Horne , Gillian Hayes , Courtney Heldreth , Michal Lahav , Jess Scon Holbrook , Lauren Wilcox

Category: Artificial Intelligence

2022-07-09

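HCI researchers have been gradually shifting their attention from individual users to communities when engaging in research, design, and system development. However, our field has yet to establish a cohesive, systematic understanding of the challenges, benefits, and commitments of community-collaborative approaches to research. We conducted a systematic review and thematic analysis of 47 computing research papers, published over the last two decades, that discuss participatory research with communities for the development of technological artifacts and systems. From this review, we identified seven themes associated with the evolution of a project, from establishing community partnerships to sustaining results. Our findings suggest that several tensions characterize these projects, many of which relate to the power and position of researchers, and of the computing research environment, relative to community partners. We discuss the implications of our findings and offer methodological proposals to guide HCI, and computing research more broadly, towards practices that center communities.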


Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources

Xinyan Velocity Yu , Akari Asai , Trina Chatterjee , Junjie Hu , Eunsol Choi

Categories: Natural Language Processing | Artificial Intelligence

2022-11-28

While the NLP community is generally aware of resource disparities among languages, we lack research that quantifies the extent and types of such disparity. Prior surveys estimating the availability of resources based on the number of datasets can be misleading as dataset quality varies: many datasets are automatically induced or translated from English data. To provide a more comprehensive picture of language resources, we examine the characteristics of 156 publicly available NLP datasets. We manually annotate how they are created, including input text and label sources and tools used to build them, and what they study, tasks they address and motivations for their creation. After quantifying the qualitative NLP resource gap across languages, we discuss how to improve data collection in low-resource languages. We survey language-proficient NLP researchers and crowd workers per language, finding that their estimated availability correlates with dataset availability. Through crowdsourcing experiments, we identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform. We conclude by making macro and micro-level suggestions to the NLP community and individual researchers for future multilingual data development.


Rebuilding Trust: Queer in AI Approach to Artificial Intelligence Risk Management

Ashwin , William Agnew , Umut Pajaro , Hetvi Jethwani , Arjun Subramonian

Category: Artificial Intelligence

2021-09-21

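Trustworthy artificial intelligence (AI) has become an important topic because trust in AI systems and their creators has been lost. Researchers, corporations, and governments have long and painful histories of excluding marginalized groups from technology development, deployment, and oversight. As a result, these technologies are less useful, and even harmful, to minoritized groups. We argue that any AI development, deployment, and monitoring framework that aspires to trust must incorporate both feminist, non-exploitative participatory design principles and strong, external, and continual monitoring and testing. We also explain the importance of considering aspects of trustworthiness beyond transparency, fairness, and accountability, specifically, treating justice and the shifting of power as core values of any trustworthy AI system. Creating trustworthy AI starts by funding, supporting, and empowering grassroots organizations such as Queer in AI, so that the field of AI has the diversity and inclusion needed to develop trustworthy AI credibly and effectively. We leverage the expert knowledge Queer in AI has developed through its years of work and advocacy to discuss whether and how aspects of queer identity should be used in datasets and AI systems, and how harms along these lines should be mitigated. Based on this, we share a gendered approach to AI, further propose a queer epistemology, and analyze the benefits it can bring to AI. We also discuss how to regulate AI with this queer epistemology in view, proposing frameworks for policies related to AI and gender diversity and to privacy and queer data protection.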


Dataset for Identification of Homophobia and Transophobia in Multilingual YouTube Comments

Bharathi Raja Chakravarthi , Ruba Priyadharshini , Rahul Ponnusamy , Prasanna Kumar Kumaresan , Kayalvizhi Sampath , Durairaj Thenmozhi , Sathiyaraj Thangasamy , Rajendran Nallathambi , John Phillip McCrae

Category: Natural Language Processing

2021-09-01

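The growing proliferation of abusive content on social media platforms has a negative impact on online users. The dread, dislike, discomfort, or mistrust of lesbian, gay, transgender, or bisexual people is defined as homophobia/transphobia. Homophobic/transphobic speech is a type of offensive language that can be summarized as hate speech directed at LGBT+ people, and it has attracted growing interest in recent years. Online homophobia/transphobia is a severe societal problem that can make online platforms toxic and unwelcoming to LGBT+ people, while also attempting to eliminate equality, diversity, and inclusion. We provide a new hierarchical taxonomy for online homophobia and transphobia, as well as an expert-labelled dataset that will allow homophobic/transphobic content to be identified automatically. We trained annotators and supplied them with comprehensive annotation rules, because this is a sensitive issue and we previously found that untrained crowdsourcing annotators struggle to identify homophobia due to cultural and other biases. The dataset comprises 15,141 annotated multilingual comments. This paper describes the process of building the dataset, a qualitative analysis of the data, and inter-annotator agreement. In addition, we create baseline models for the dataset. To the best of our knowledge, our dataset is the first of its kind to be created. Warning: this paper contains explicit statements of homophobia, transphobia, and stereotypes, which may be distressing to some readers.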


Beyond "Fairness:" Structural (In)justice Lenses on AI for Education

Michael Madaio , Su Lin Blodgett , Elijah Mayfield , Ezekiel Dixon-Román

Category: Artificial Intelligence

2021-05-18

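Educational technologies, and the systems of schooling in which they are deployed, enact particular ideologies about what is important to know and how learners should learn. Because artificial intelligence technologies, in education and beyond, may lead to inequitable outcomes for marginalized communities, various approaches have been developed to evaluate and mitigate the harmful impacts of AI. However, we argue in this paper that the dominant paradigm of evaluating fairness on the basis of performance disparities in AI models is inadequate for confronting the systemic inequities that educational AI systems (re)produce. We draw on a lens of structural injustice, informed by critical theory and Black feminist scholarship, to critically interrogate several widely studied and widely adopted categories of educational AI, and to explore how they are bound up in and reproduce historical legacies of structural injustice and inequity, regardless of the parity of their models' performance. We close with alternative visions for a more equitable future for educational AI.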


KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language

Barack W. Wanjawa , Lilian D. A. Wanzare , Florence Indede , Owen McOnyango , Lawrence Muchemi , Edward Ombui

Categories: Natural Language Processing | Machine Learning

2022-05-04

The need for Question Answering datasets in low-resource languages is the motivation for this research, which led to the development of the Kencorpus Swahili Question Answering Dataset, KenSwQuAD. The dataset is annotated from raw story texts in Swahili, a low-resource language that is predominantly spoken in Eastern Africa and in other parts of the world. Question Answering (QA) datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold-standard Question Answering set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 of the total 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance check on 12.5% of the annotated texts confirmed that the QA pairs were all correctly annotated. A proof of concept applying the set to the QA task confirmed that the dataset is usable for such tasks. KenSwQuAD has also contributed to the resourcing of the Swahili language.


When Creators Meet the Metaverse: A Survey on Computational Arts

Lik-Hang Lee , Zijun Lin , Rui Hu , Zhengya Gong , Abhishek Kumar , Tangyao Li , Sijia Li , Pan Hui

Categories: Artificial Intelligence | Machine Learning

2021-11-26

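The metaverse, an enormous virtual-physical cyberspace, has brought artists unprecedented opportunities to blend every corner of our physical surroundings with digital creativity. This article presents a comprehensive survey of computational arts, in which seven critical topics relevant to the metaverse describe novel artworks in blended virtual-physical realities. The topics first cover the building elements of the metaverse, such as virtual scenes and characters, and auditory and textual elements. Next, several remarkable types of novel creation are reflected upon, such as immersive arts, robotic arts, and other user-centric approaches. Finally, we propose several research agendas: democratizing computational arts, digital privacy and safety for metaverse artists, ownership recognition for digital artworks, technological challenges, and so on. The survey also serves as introductory material for artists and metaverse technologists who want to begin creating in the realm of surrealistic cyberspace.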


Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Resource MT Models

Harshita Diddee , Sandipan Dandapat , Monojit Choudhury , Tanuja Ganu , Kalika Bali

Category: Natural Language Processing

2022-10-27

Leveraging shared learning through Massively Multilingual Models, state-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages. However, this performance comes at the cost of significantly bloated models which are not practically deployable. Knowledge Distillation is one popular technique to develop competitive, lightweight models: In this work, we first evaluate its use to compress MT models focusing on languages with extremely limited training data. Through our analysis across 8 languages, we find that the variance in the performance of the distilled models due to their dependence on priors including the amount of synthetic data used for distillation, the student architecture, training hyperparameters and confidence of the teacher models, makes distillation a brittle compression mechanism. To mitigate this, we explore the use of post-training quantization for the compression of these models. Here, we find that while distillation provides gains across some low-resource languages, quantization provides more consistent performance trends for the entire range of languages, especially the lowest-resource languages in our target set.


Conversational AI Systems for Social Good: Opportunities and Challenges

Peng Qi , Jing Huang , Youzheng Wu , Xiaodong He , Bowen Zhou

Category: Natural Language Processing

2021-05-13

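Conversational artificial intelligence (ConvAI) systems have recently attracted much academic and commercial attention, with significant progress on both fronts. However, little existing work discusses how these systems can be developed and deployed for social good in real-world applications, with comprehensive case studies and analyses of pros and cons. In this paper, we briefly review the progress the community has made towards better ConvAI systems, and reflect on how existing technologies can help advance social good initiatives from various angles that are not yet common knowledge in the community. We further discuss the challenges ahead for ConvAI systems to better help us achieve these goals, and highlight the risks involved in their development and deployment in the real world.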


AI in HCI Design and User Experience

Wei Xu

Category: Artificial Intelligence

2023-01-03

In this chapter, we review and discuss the transformation of AI technology in HCI/UX work and assess how AI technology will change how we do the work. We first discuss how AI can be used to enhance the result of user research and design evaluation. We then discuss how AI technology can be used to enhance HCI/UX design. Finally, we discuss how AI-enabled capabilities can improve UX when users interact with computing systems, applications, and services.


Robots as Mental Well-being Coaches: Design and Ethical Recommendations

Minja Axelsson , Micol Spitale , Hatice Gunes

Category: Robotics

2022-08-31

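The last decade has shown a growing interest in robots as well-being coaches. However, cohesive and comprehensive guidelines for designing robots as coaches that promote mental well-being have not yet been proposed. This paper details design and ethical recommendations based on a qualitative meta-analysis, drawing on a grounded theory approach, of three distinct user-centered design studies involving robotic well-being coaches, namely: (1) a participatory design study conducted with 11 participants, consisting of both prospective users who had taken part in a Brief Solution-Focused Practice study with a human coach and coaches from different disciplines; (2) semi-structured individual interview data gathered from 20 participants in a Positive Psychology intervention study with the robotic well-being coach Pepper; and (3) a participatory design study conducted with 3 participants from the Positive Psychology study and 2 relevant well-being coaches. After conducting a thematic analysis and a qualitative meta-analysis, we collated the gathered data into convergent and divergent themes, and distilled from those results a set of design guidelines and ethical considerations. Our findings can inform researchers and roboticists about the key aspects to take into account when designing robotic mental well-being coaches.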


Thread With Caution: Proactively Helping Users Assess and Deescalate Tension in Their Online Discussions

Jonathan P. Chang , Charlotte Schluger , Cristian Danescu-Niculescu-Mizil

Categories: Artificial Intelligence | Natural Language Processing

2022-12-02

Incivility remains a major challenge for online discussion platforms, to such an extent that even conversations between well-intentioned users can often derail into uncivil behavior. Traditionally, platforms have relied on moderators to -- with or without algorithmic assistance -- take corrective actions such as removing comments or banning users. In this work we propose a complementary paradigm that directly empowers users by proactively enhancing their awareness about existing tension in the conversation they are engaging in and actively guides them as they are drafting their replies to avoid further escalation. As a proof of concept for this paradigm, we design an algorithmic tool that provides such proactive information directly to users, and conduct a user study in a popular discussion platform. Through a mixed methods approach combining surveys with a randomized controlled experiment, we uncover qualitative and quantitative insights regarding how the participants utilize and react to this information. Most participants report finding this proactive paradigm valuable, noting that it helps them to identify tension that they may have otherwise missed and prompts them to further reflect on their own replies and to revise them. These effects are corroborated by a comparison of how the participants draft their reply when our tool warns them that their conversation is at risk of derailing into uncivil behavior versus in a control condition where the tool is disabled. These preliminary findings highlight the potential of this user-centered paradigm and point to concrete directions for future implementations.
