The ReprGesture entry to the GENEA Challenge 2022

Sicheng Yang (yangsc21@mails.tsinghua.edu.cn), Tsinghua University, Shenzhen, China; Zhiyong Wu (zywu@sz.tsinghua.edu.cn, ORCID 0000-0001-8533-0524), Tsinghua University, Shenzhen, China, and The Chinese University of Hong Kong, Hong Kong SAR, China; Minglei Li (liminglei29@huawei.com), Huawei Cloud Computing Technologies Co., Ltd, Shenzhen, China; Mengchen Zhao (zhaomengchen@huawei.com), Huawei Noah’s Ark Lab, Shenzhen, China; Jiuxin Lin (linjx21@mails.tsinghua.edu.cn), Liyang Chen (cly21@mails.tsinghua.edu.cn) and Weihong Bao (bwh21@mails.tsinghua.edu.cn), Tsinghua University, Shenzhen, China

2022

Abstract.

This paper describes the ReprGesture entry to the Generation and Evaluation of Non-verbal Behaviour for Embodied Agents (GENEA) challenge 2022. The GENEA challenge provides the processed datasets and performs crowdsourced evaluations to compare the performance of different gesture generation systems. In this paper, we explore an automatic gesture generation system based on multimodal representation learning. We use WavLM features for audio, FastText features for text, and position and rotation matrix features for gesture. Each modality is projected to two distinct subspaces: modality-invariant and modality-specific. To learn inter-modality-invariant commonalities and capture the characteristics of modality-specific representations, a gradient reversal layer based adversarial classifier and modality reconstruction decoders are used during training. The gesture decoder generates proper gestures using all representations and features related to the rhythm in the audio. Our code, pre-trained models and demo are available at https://github.com/YoungSeng/ReprGesture.

gesture generation, data-driven animation, modality-invariant, modality-specific, representation learning, deep learning

Published in: International Conference on Multimodal Interaction (ICMI ’22), November 7–11, 2022, Bengaluru, India. Copyright: rights retained. DOI: 10.1145/3536221.3558066. ISBN: 978-1-4503-9390-4/22/11. CCS concepts: Computing methodologies, Artificial intelligence; Human-centered computing, Human computer interaction (HCI); Computing methodologies, Natural language processing.

1. Introduction

Nonverbal behavior, including facial expressions, hand gestures and body gestures, plays a key role in conveying messages in human communication (Kucherenko et al., 2021b). Co-speech gestures help speakers express themselves better. In the virtual world, they help to present a more realistic digital avatar. Gesture generation studies how to generate human-like, natural, speech-oriented gestures. There are many different techniques for gesture generation. In this paper, we focus on the task of speech-driven gesture generation. Representative speech-driven gesture generation approaches are either rule-based or data-driven (Yoon et al., 2020).

Many data-driven works for gesture generation are based on multimodal fusion and representation learning. Kucherenko et al. map speech acoustic and semantic features into continuous 3D gestures (Kucherenko et al., 2020). Yoon et al. propose an end-to-end model to generate co-speech gestures using text, audio, and speaker identity (Yoon et al., 2020). Xu et al. sample gestures in a variational autoencoder (VAE) latent space and infer rhythmic motion from speech prosody to address the non-deterministic mapping from speech to gesture (Xu et al., 2022). Kucherenko et al. propose a speech-driven gesture-production method based on representation learning (Kucherenko et al., 2021a). Liu et al. propose a hierarchical audio feature extractor and pose inferrer to learn discriminative representations (Liu et al., 2022b). Li et al. present a co-speech gesture generation model whose latent space is split into a shared code and a motion-specific code (Li et al., 2021).

However, gesture generation is a challenging task because of the cross-modality learning issue and the weak correlation between speech and gestures. The inherent heterogeneity of the representations creates a gap among different modalities. It is necessary to address the weak correlation among different modalities and provide a holistic view of the multimodal data during gesture generation.

Inspired by (Yoon et al., 2020) and (Hazarika et al., 2020), we propose a gesture generation system based on multimodal representation learning. In particular, we first extract features of audio, text and gestures. Then, a system consisting of four components is proposed: (1) Each modality is projected to two distinct representations: modality-invariant and modality-specific. (2) A gradient reversal layer based adversarial classifier is used to reduce the discrepancy between the modality-invariant representations of each modality. (3) Modality decoders are used to reconstruct each modality, allowing modality-specific representations to capture the details of their respective modality. (4) The gesture decoder takes six modality representations (two per modality) and rhythm-related features in audio as its input and generates proper gestures.

The main contributions of our work are: (1) A multimodal representation learning approach is proposed for gesture generation, which ensures comprehensive decoupling of multimodal data. (2) To solve the problem of heterogeneity of different modalities in feature fusion, each modality is projected to two subspaces (modality-invariant and modality-specific) to get multimodal representations using domain learning and modality reconstruction. (3) Ablation studies demonstrate the role of different components in the system.

The task of the GENEA 2022 challenge is to generate corresponding gestures from the given audio and text. A complete task description can be accessed in (Yoon et al., 2022). We submitted our system to the GENEA 2022 challenge to be evaluated with other gesture generation systems in a large user study.

2. Method

Figure 1. Gesture generation through modality -invariant and -specific subspaces.

2.1. The architecture of the proposed system

As shown in Figure 1, the system generates a sequence of human gestures from a sequence of $u_{m}$ ($m \in \{t, a, g\}$) that contains the features of text, audio and seed gestures. The architecture of the proposed model consists of five modules: feature extraction, modality representation, modality reconstruction, domain learning and gesture generation. The following describes each of these modules in detail.

2.1.1. Feature extraction

For each modality, the feature extraction pipeline is as follows:

Text: We first use FastText (Bojanowski et al., 2017) to get the word embeddings. Padding tokens are inserted to make the words temporally match the gestures, following (Yoon et al., 2020). One-dimensional (1D) convolutional layers are then adopted to generate a 32-D text feature sequence $U_{t}$ (‘ $t$ ’ for ‘text’) from the 300-D word embeddings.
Audio: All audio recordings are downsampled to 16 kHz, and features are generated from the pre-trained WavLM Large model (Chen et al., 2021). We further adjust the sizes, strides and padding of the 1D convolutional layers to reduce the feature dimension from 1024 to 128, forming the final audio feature sequence $U_{a}$ (‘ $a$ ’ for ‘audio’).
Gesture: Due to the poor quality of the hand motion capture, we only use 18 joints corresponding to the upper body without hands or fingers. Root normalization is used to make all subjects face the same direction. We apply standard normalization (zero mean and unit variance) to all joints. Seed gestures for the first few frames are utilized for better continuity between consecutive syntheses, as in (Yoon et al., 2020). On top of these, position and 3 × 3 rotation matrix features are computed, and the feature size of the final gesture sequence $U_{g}$ (‘ $g$ ’ for ‘gesture’) is 216.
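The gesture feature size of 216 is consistent with 18 joints that each contribute a 3-D position and a flattened 3 × 3 rotation matrix (18 × 12 = 216). The following is a minimal Python sketch of that arithmetic and of zero-mean, unit-variance normalization. The function names, the per-feature normalization over all frames and the `eps` term are illustrative assumptions, not details taken from the released code.

```python
import math

NUM_JOINTS = 18      # upper-body joints kept (from the text)
POS_DIMS = 3         # x, y, z position per joint
ROT_DIMS = 3 * 3     # flattened 3 x 3 rotation matrix per joint


def gesture_feature_size(num_joints=NUM_JOINTS):
    """Per-frame feature size: position plus rotation matrix for every joint."""
    return num_joints * (POS_DIMS + ROT_DIMS)


def standardize(frames, eps=1e-8):
    """Zero-mean, unit-variance normalization of each feature over all frames.

    `eps` is an assumed safeguard against division by zero.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    normalized = [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dims)]
        for f in frames
    ]
    return normalized, means, stds
```

The returned means and standard deviations would be kept so that generated poses can be mapped back to the original scale.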

2.1.2. Modality representation

First, for each modality $m \in \{t, a, g\}$ , we use a linear layer with leaky ReLU activation and layer normalization to map its feature sequence $U_{m}$ into a new feature sequence $u_{m} \in \mathbb{R}^{T \times d_{h}}$ , so that all modalities share the same feature dimension $d_{h}$ . Then, we project each sequence $u_{m}$ to two distinct representations: modality-invariant $h_{m}^{c}$ and modality-specific $h_{m}^{p}$ . Here, $h_{m}^{c}$ learns a shared representation in a common subspace with distributional similarity constraints (Guo et al., 2019), while $h_{m}^{p}$ captures the unique characteristics of that modality. We derive the representations using simple feed-forward neural encoding functions:

$$
h_{m}^{c} = E_{c}(u_{m}; \theta^{c}), \quad h_{m}^{p} = E_{p}(u_{m}; \theta_{m}^{p}) \tag{1}
$$

Encoder $E_{c}$ shares the parameters $θ^{c}$ across all three modalities, whereas $E_{p}$ assigns separate parameters $θ_{m}^{p}$ for each modality.
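The parameter sharing in Equation (1) can be illustrated with a small Python sketch: one weight matrix is reused for every modality (standing in for $θ^{c}$), while each modality also has its own matrix (standing in for $θ_{m}^{p}$). The single linear map per encoder, the random weights, the toy dimension and all function names are assumptions made for illustration only.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a trained feed-forward encoder."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, frame):
    """Multiply one d_h-dimensional frame by a weight matrix."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def encode(u, shared, private):
    """Return (h_c, h_p) for every modality in u.

    `u` maps a modality name to a sequence of frames. `shared` is one weight
    matrix reused by all modalities (theta^c); `private` holds one matrix
    per modality (theta_m^p).
    """
    h_c = {m: [apply_linear(shared, f) for f in seq] for m, seq in u.items()}
    h_p = {m: [apply_linear(private[m], f) for f in seq] for m, seq in u.items()}
    return h_c, h_p


rng = random.Random(0)
d_h = 4  # toy size for the sketch; the paper sets d_h = 48
u = {m: [[rng.random() for _ in range(d_h)] for _ in range(3)] for m in "tag"}
shared = make_linear(d_h, d_h, rng)
private = {m: make_linear(d_h, d_h, rng) for m in u}
h_c, h_p = encode(u, shared, private)
```

With this layout, identical inputs from two modalities yield identical invariant codes but different specific codes, which is the behaviour the shared and separate parameters are meant to provide.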

2.1.3. Representation learning

Domain learning can improve a model’s ability to extract domain-invariant features (Bousmalis et al., 2016). We use an adversarial classifier and minimize a domain loss that reduces the discrepancy among the shared representations of the modalities.

The domain loss can be formulated as:

$$
L_{domain} = - \sum_{i=1}^{3} E\left[\log\left(D_{repr}(d_{m})\right)\right] \tag{2}
$$

where $D_{repr}$ is a feed-forward neural discriminator, $d_{m}$ is the result of applying gradient reversal to the modality-invariant representation $h_{m}^{c}$ , and the sum runs over the three modalities $m \in \{t, a, g\}$ .
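A gradient reversal layer acts as the identity in the forward pass and flips the sign of the gradient in the backward pass, so that the encoder is pushed to confuse the discriminator while the discriminator itself is trained normally. The sketch below shows this behaviour with explicit `forward` and `backward` methods, together with one possible reading of Equation (2) in which the discriminator outputs a softmax over the three modality labels. That reading, the `scale` factor and all names are assumptions; a real implementation would rely on an automatic differentiation framework.

```python
import math


class GradientReversal:
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    def __init__(self, scale=1.0):
        self.scale = scale  # hypothetical scaling factor, not given in the text

    def forward(self, x):
        return list(x)

    def backward(self, grad_output):
        return [-self.scale * g for g in grad_output]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def domain_loss(logits_per_modality):
    """Sum of -log p(correct modality) over the three modalities.

    logits_per_modality[k] holds the discriminator logits for the
    representation of modality k, whose true label is assumed to be k.
    """
    loss = 0.0
    for label, logits in enumerate(logits_per_modality):
        loss -= math.log(softmax(logits)[label])
    return loss
```

When the discriminator cannot tell the modalities apart, every probability is 1/3 and the loss equals 3 log 3.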

The modality reconstruction loss $L_{recon}$ is computed between the reconstructed modality and the original input $u_{m}$ . It ensures that the hidden representations capture the details of their respective modality.

Specifically, a modality decoder $D$ is proposed to reconstruct $u_{m}$ :

$$
\hat{u}_{m} = D(h_{m}^{c} + h_{m}^{p}; \theta^{d}) \tag{3}
$$

where $θ^{d}$ are the parameters of the modality decoder. The modality reconstruction loss can then be computed as:

$$
L_{recon} = \frac{1}{3} \left( \sum_{m \in \{t, a, g\}} \frac{\| u_{m} - \hat{u}_{m} \|_{2}^{2}}{d_{h}} \right) \tag{4}
$$

where $∥ \cdot ∥_{2}^{2}$ is the squared $L_{2}$ -norm.
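As a small worked example of Equation (4), the Python sketch below averages, over the modalities, the squared L2 error divided by the feature dimension. For brevity it treats each $u_{m}$ as a single flat vector of length $d_{h}$ rather than a full sequence; this simplification and the function name are assumptions.

```python
def reconstruction_loss(original, reconstructed):
    """Equation (4): mean over modalities of squared L2 error divided by d_h.

    Both arguments map a modality name to a flat feature vector of length d_h.
    """
    total = 0.0
    for m, u_m in original.items():
        u_hat = reconstructed[m]
        squared_error = sum((a - b) ** 2 for a, b in zip(u_m, u_hat))
        total += squared_error / len(u_m)
    return total / len(original)
```

With three modalities the final division is the factor 1/3 of Equation (4).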

2.1.4. Gesture generation

Figure 2. Architecture of the gesture generation module.

We use a generative adversarial network (GAN) based gesture decoder to generate gestures. Gestures are directly related to rhythm and beat, so we concatenate the audio rhythm-related features (pitch, energy and volume) with the six stacked modality representations and send them to Transformer encoders with multi-head self-attention, which serve as the generator, as shown in Figure 2. The generator is trained using $L_{gesture}$ , which consists of the Huber loss and the MSE loss, and the discriminator is trained with $L_{GAN}$ .

$$
L_{gesture} = \alpha \cdot E\left[\frac{1}{t} \sum_{i=1}^{t} \mathrm{HuberLoss}(g_{i}, \hat{g}_{i})\right] + \beta \cdot E\left[\frac{1}{t} \sum_{i=1}^{t} \| g_{i} - \hat{g}_{i} \|_{2}^{2}\right] \tag{5}
$$

$$
L_{GAN} = - E\left[\log\left(D_{gesture}(g)\right)\right] - E\left[\log\left(1 - D_{gesture}(\hat{g})\right)\right] \tag{6}
$$

where $D_{gesture}$ is the gesture discriminator, a multilayered bidirectional gated recurrent unit (GRU) (Cho et al., 2014) that produces a binary output for each time step, $t$ is the length of the gesture sequence, $g_{i}$ is the $i$ -th human gesture, and $\hat{g}_{i}$ is the $i$ -th generated gesture.
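Equations (5) and (6) can be sketched for a single sequence in plain Python as follows. The Huber threshold `delta`, the averaging of the Huber term over feature dimensions and the `eps` guard inside the logarithms are assumptions that the text does not specify; the default weights are the values of $α$ and $β$ reported in the experiment setup.

```python
import math


def huber(a, b, delta=1.0):
    """Huber loss on one value; delta=1.0 is an assumed threshold."""
    diff = abs(a - b)
    if diff <= delta:
        return 0.5 * diff ** 2
    return delta * (diff - 0.5 * delta)


def gesture_loss(real, generated, alpha=300.0, beta=50.0):
    """Equation (5) for one sequence: weighted Huber and squared-error terms,
    each averaged over the t frames."""
    t = len(real)
    huber_term = sum(
        sum(huber(a, b) for a, b in zip(g, g_hat)) / len(g)
        for g, g_hat in zip(real, generated)
    ) / t
    mse_term = sum(
        sum((a - b) ** 2 for a, b in zip(g, g_hat))
        for g, g_hat in zip(real, generated)
    ) / t
    return alpha * huber_term + beta * mse_term


def gan_loss(d_real, d_fake, eps=1e-12):
    """Equation (6): discriminator loss from per-step outputs in (0, 1)."""
    real_term = -sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

The Huber term grows linearly for large errors, which limits the influence of outlier poses, while the squared-error term penalizes large deviations more strongly.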

The loss of the proposed system can be computed as:

$$
L_{total} = L_{gesture} + \gamma \cdot L_{GAN} + \delta \cdot L_{domain} + \epsilon \cdot L_{recon} \tag{7}
$$
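The weighted sum in Equation (7) can be written as a short Python helper. The default weights and the warm-up period in which the GAN term is switched off are the values reported in the experiment setup; the zero-based epoch counter and the function name are assumptions.

```python
def total_loss(l_gesture, l_gan, l_domain, l_recon, epoch,
               gamma=5.0, delta=0.1, epsilon=0.1, warmup_epochs=10):
    """Equation (7), with gamma forced to 0 while the GAN term is warming up."""
    gamma_now = 0.0 if epoch < warmup_epochs else gamma
    return l_gesture + gamma_now * l_gan + delta * l_domain + epsilon * l_recon
```

For example, with component losses 1, 2, 3 and 4, the total is 1.7 during warm-up and 11.7 afterwards.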

2.2. Data processing and experiment setup

2.2.1. Data and data processing

In the challenge, the Talking With Hands 16.2M dataset (Lee et al., 2019) is used as the standard dataset. Each video is separated into two independent sides with one speaker each. The audio and text in the dataset have been aligned. For more details, please refer to the challenge paper (Yoon et al., 2022). We note that the data in the training, validation and test sets are extremely unbalanced, so we only use the data from the speaker with identity “1” for training. We also believe that if the model is trained on speech and gesture data from the same person, the gesture behavior will match the speech.

2.2.2. Experiment setup

The proposed system is trained on the training data only, using the Adam (Kingma and Ba, 2014) optimizer (learning rate $10^{-4}$ , $β_{1}$ = 0.5, $β_{2}$ = 0.98) with a batch size of 128 for 100 steps. We set $α = 300$ , $β = 50$ for Equation (5) and $γ = 5, δ = 0.1, ϵ = 0.1$ for Equation (7); we noticed in our experiments that overly large $δ$ and $ϵ$ lead to non-convergence. There is a warm-up period of 10 epochs in which $L_{GAN}$ is not used ( $γ$ = 0). The feature dimension $d_{h}$ of the sequence $u_{m}$ is 48. During training, samples of 100 frames each are drawn with a stride of 10 from the valid motion sections; the initial 10 frames are used as seed gesture poses and the model is trained to generate the remaining 90 poses (3 seconds).
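The windowing described above (100-frame samples, stride 10, 10 seed frames followed by 90 target frames) can be sketched in Python as below. The function name and the half-open index ranges are illustrative choices, and windows that do not fit completely into a motion section are assumed to be dropped.

```python
def make_training_samples(num_frames, window=100, stride=10, seed_len=10):
    """Return (seed_range, target_range) frame-index pairs for one motion section.

    Ranges are half-open: (start, end) covers frames start .. end - 1.
    """
    samples = []
    for start in range(0, num_frames - window + 1, stride):
        seed = (start, start + seed_len)
        target = (start + seed_len, start + window)
        samples.append((seed, target))
    return samples
```

A section of 120 frames, for instance, yields three samples starting at frames 0, 10 and 20. Ninety target poses covering 3 seconds also imply a frame rate of 30 frames per second.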

3. Evaluation

3.1. Evaluation setup

The GENEA Challenge 2022 evaluation is divided into two tiers, and we participated in the upper-body motion tier. The challenge organizers conducted a detailed evaluation comparing all submitted systems (Yoon et al., 2022). The challenge evaluates human-likeness to assess motion quality, and appropriateness to assess how well the gestures match the speech. The evaluation is based on the HEMVIP methodology (Jonell et al., 2021) and Mean Opinion Score (MOS) (ITU-T, 1996). In total, 11 systems are compared in the upper-body tier. The following abbreviations are used to represent each model in the evaluation:

UNA: Ground truth (‘U’ for the upper-body tier, ‘NA’ for ‘natural’).
UBT: The official text-based baseline (Yoon et al., 2019), which takes transcribed speech text with word-level timing information as the input modality (‘B’ for ‘baseline’, ‘T’ for ‘text’).
UBA: The official audio-based baseline (Kucherenko et al., 2019), which takes speech audio into account when generating output (‘A’ for ‘audio’).
USJ–USQ: 8 participants’ submissions to the upper-body tier (ours is USN).

For more details about the evaluation studies, please refer to the challenge paper (Yoon et al., 2022).

3.2. Subjective evaluation results and discussion

3.2.1. Human-likeness Evaluation

Figure 3. (a) Box plot visualizing the ratings distribution in the upper-body study. Red bars are the median ratings (each with a 0.05 confidence interval); yellow diamonds are mean ratings (also with a 0.05 confidence interval). Box edges are at the 25th and 75th percentiles, while whiskers cover 95% of all ratings for each condition. (b) Significance comparisons. White means that the condition listed on the …

In this evaluation, study participants are asked to rate “How human-like does the gesture motion appear?” on a scale from 0 (worst) to 100 (best). Box plots and significance comparisons are shown in Figure 3. Our system (USN) receives a median score of 44 and a mean score of 44.2, and is ranked fourth among the participating systems.

3.2.2. Appropriateness evaluation

Figure 4. Bar plots visualizing the response distribution in the appropriateness studies. The blue bar (bottom) represents responses where subjects preferred the matched motion, the light grey bar (middle) represents tied (“They are equal”) responses, and the red bar (top) represents responses preferring mismatched motion, with the height of each bar being proportional to the fraction of responses in each category. The black horizontal line bisecting the light grey bar shows the proportion of matched responses after splitting ties, each with a 0.05 confidence interval. The dashed black line indicates chance-level performance.

In this evaluation, participants are asked to choose the character on the left, the character on the right, or indicate that the two are equally well matched, in response to the prompt “Please indicate which character’s motion best matches the speech, both in terms of rhythm and intonation and in terms of meaning.” Bar plots are shown in Figure 4. Our system (USN) receives a “Percent matched” score of 54.6, which measures how often participants preferred matched over mismatched motion in terms of appropriateness. Our system is ranked seventh in appropriateness among the participants’ submissions. It should be noted that the difference between our system and the five higher-ranked systems (USL, UBA, USO, USK and USJ) is not significant. Furthermore, if we only consider the ratio of matched motion, i.e., the blue bar in Figure 4, our system is ranked fifth among the participating systems.
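The caption of Figure 4 describes the reported proportion as the share of matched responses after splitting ties. A minimal Python sketch of that computation is given below, assuming that ties are divided evenly between the two alternatives; the function name and any counts passed to it are hypothetical and are not taken from the challenge data.

```python
def percent_matched(matched, tied, mismatched):
    """Share of responses preferring matched motion, with ties split evenly."""
    total = matched + tied + mismatched
    if total == 0:
        raise ValueError("no responses")
    return 100.0 * (matched + 0.5 * tied) / total
```

Under this definition a value of 50 corresponds to chance level, which is the dashed line in Figure 4.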

3.3. Ablation studies

| Name | Average jerk | Average acceleration | Global CCA | CCA for each sequence | Hellinger distance average ↓ | FGD on feature space ↓ | FGD on raw data space ↓ |
|---|---|---|---|---|---|---|---|
| Ground Truth (GT) | 18149.74 ± 2252.61 | 401.24 ± 67.57 | 1.000 | 1.00 ± 0.00 | 0.0 | | |
| ReprGesture | 2647.59 ± 1200.05 | 146.90 ± 46.09 | 0.726 | 0.95 ± 0.02 | 0.155 | 0.86 | 184.753 |
| w/o WavLM | 1775.09 ± 512.08 | 77.53 ± 21.92 | 0.761 | 0.94 ± 0.03 | 0.353 | 3.054 | 321.383 |
| w/o $L_{GAN}$ | 9731.54 ± 3636.06 | 242.15 ± 81.81 | 0.664 | 0.93 ± 0.03 | 0.342 | 2.053 | 277.539 |
| w/o $L_{recon}$ | 533.95 ± 193.18 | 39.49 ± 12.23 | 0.710 | 0.93 ± 0.03 | 0.283 | 0.731 | 659.150 |
| w/o $L_{domain}$ | 2794.79 ± 1153.75 | 135.62 ± 25.13 | 0.707 | 0.94 ± 0.03 | 0.267 | 0.653 | 874.209 |
| w/o Repr | 2534.34 ± 1151.38 | 123.02 ± 40.90 | 0.723 | 0.94 ± 0.04 | 0.298 | 0.829 | 514.706 |

Table 1. Ablation studies results. ‘w/o’ is short for ‘without’. Bold indicates the best metric, i.e. the one closest to the ground truth.

Moreover, we conduct ablation studies to assess the contribution of the different components in the system. The GENEA challenge computes several objective metrics of motion quality using the GENEA numerical evaluations (https://github.com/genea-workshop/genea_numerical_evaluations). For the calculation and meaning of these objective evaluation metrics, please refer to the challenge paper (Yoon et al., 2022). A perfectly natural system should have average jerk and acceleration very similar to natural motion. The closer the canonical correlation analysis (CCA) value is to 1, the better. Lower Hellinger distance and Fréchet gesture distance (FGD) are better. To compute the FGD, we train an autoencoder using the training set of the challenge.
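To give an intuition for these metrics, the Python sketch below estimates acceleration and jerk as the second and third finite-difference derivatives of a one-dimensional joint trajectory, and computes the Hellinger distance between two normalized histograms. It is not the official GENEA implementation, whose details may differ; the frame rate of 30 frames per second is an assumption inferred from 90 poses spanning 3 seconds, and all function names are illustrative.

```python
import math


def finite_difference(series, dt):
    """First-order difference quotient of a 1D position series."""
    return [(b - a) / dt for a, b in zip(series, series[1:])]


def mean_abs_derivative(series, order, fps=30):
    """Mean absolute value of the order-th time derivative
    (order 2 = acceleration, order 3 = jerk)."""
    dt = 1.0 / fps
    values = list(series)
    for _ in range(order):
        values = finite_difference(values, dt)
    return sum(abs(v) for v in values) / len(values)


def hellinger(p, q):
    """Hellinger distance between two normalized histograms of equal length."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)
```

A trajectory with constant acceleration has zero jerk, and the Hellinger distance ranges from 0 for identical histograms to 1 for histograms with no overlap.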

The results of our ablation studies are summarized in Table 1. When we do not use WavLM to extract audio features but use 1D convolution instead, the Hellinger distance average and the FGD on feature space are the worst among all configurations. When the model is trained without the GAN loss, the average jerk and average acceleration are closer to the ground truth, but the global CCA and the CCA for each sequence decrease. When the reconstruction loss is removed, the average jerk and average acceleration are the worst; the generated gesture movements are few and of small range. When the model is trained using the Central Moment Discrepancy (CMD) loss (Hazarika et al., 2020) instead of the domain loss, the best FGD on feature space and the worst FGD on raw data space are obtained. When the modality representations are removed (w/o Repr), we feed the modality sequences $u_{t}$ , $u_{a}$ and $u_{g}$ directly to the gesture decoder and only use the $L_{task}$ loss; all metrics deteriorate except for the FGD on feature space.

4. Conclusions and discussion

In this paper, we propose a gesture generation system based on multimodal representation learning, where the considered modalities include text, audio and gesture. Each modality is projected into two different subspaces: modality-invariant and modality-specific. To learn the commonality among different modalities, an adversarial classifier based on a gradient reversal layer is used. To capture the features of modality-specific representations, we adopt a modality reconstruction decoder. The gesture decoder utilizes all representations and audio rhythmic features to generate appropriate gestures. In the subjective evaluation, our system is ranked fourth among the participating systems in human-likeness and seventh in appropriateness. For appropriateness, however, the differences between our system and the five higher-ranked systems are not significant.

For the appropriateness evaluation, it is worth investigating whether there is a relationship between the subjective ratings and the segment duration. The segments are around 8 to 10 seconds long during evaluation (Yoon et al., 2022). We believe that longer segments (e.g. 20-30 seconds) might produce more pronounced and convincing appropriateness results.

There is room for improvement in this research. First, we only use data from one person to learn gestures because of the unbalanced dataset. Such a one-to-one mapping could produce boring and homogeneous gestures during inference. Second, finger motions are not considered because of the low motion-capture quality. They could be included in the future if data cleanup procedures are applied. Third, besides text and audio, more modalities (e.g. emotions, facial expressions and semantic meaning of gestures (Liu et al., 2022a)) could be taken into consideration to generate more appropriate gestures.

Acknowledgements.

This work is supported by Shenzhen Science and Technology Innovation Committee (WDZC20200818121348001), National Natural Science Foundation of China (62076144) and Shenzhen Key Laboratory of next generation interactive media innovative technology (ZDSYS20210623092001004).

References

P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. ISSN 2307-387X. https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00051/1567442/tacl_a_00051.pdf
K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan (2016) Domain separation networks. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29.
S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, and F. Wei (2021) WavLM: large-scale self-supervised pre-training for full stack speech processing. CoRR abs/2110.13900.
K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Empirical Methods in Natural Language Processing.
W. Guo, J. Wang, and S. Wang (2019) Deep multimodal representation learning: a survey. IEEE Access 7, pp. 63373–63394.
D. Hazarika, R. Zimmermann, and S. Poria (2020) MISA: modality-invariant and -specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 1122–1131. ISBN 9781450379885.
ITU-T (1996) Methods for subjective determination of transmission quality. ITU-T Recommendation P.800.
P. Jonell, Y. Yoon, P. Wolfert, T. Kucherenko, and G. E. Henter (2021) HEMVIP: human evaluation of multiple videos in parallel. In Proceedings of the 2021 International Conference on Multimodal Interaction, ICMI ’21, New York, NY, USA, pp. 707–711. ISBN 9781450384810.
D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. Computer Science.
T. Kucherenko, D. Hasegawa, G. E. Henter, N. Kaneko, and H. Kjellström (2019) Analyzing input and output representations for speech-driven gesture generation. In Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, IVA ’19, New York, NY, USA, pp. 97–104. ISBN 9781450366724.
T. Kucherenko, D. Hasegawa, N. Kaneko, G. E. Henter, and H. Kjellström (2021a) Moving fast and slow: analysis of representations and post-processing in speech-driven automatic gesture generation. International Journal of Human–Computer Interaction 37 (14), pp. 1300–1316. https://doi.org/10.1080/10447318.2021.1883883
T. Kucherenko, P. Jonell, S. van Waveren, G. E. Henter, S. Alexandersson, I. Leite, and H. Kjellström (2020) Gesticulator: a framework for semantically-aware speech-driven gesture generation. In Proceedings of the 2020 International Conference on Multimodal Interaction, pp. 242–250. ISBN 9781450375818.
T. Kucherenko, P. Jonell, Y. Yoon, P. Wolfert, and G. E. Henter (2021b) A large, crowdsourced evaluation of gesture generation systems on common data: the GENEA Challenge 2020. In 26th International Conference on Intelligent User Interfaces, IUI ’21, New York, NY, USA, pp. 11–21. ISBN 9781450380171.
G. Lee, Z. Deng, S. Ma, T. Shiratori, S. Srinivasa, and Y. Sheikh (2019) Talking With Hands 16.2M: a large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 763–772.
J. Li, D. Kang, W. Pei, X. Zhe, Y. Zhang, Z. He, and L. Bao (2021) Audio2Gestures: generating diverse gestures from speech audio with conditional variational autoencoders. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11273–11282.
H. Liu, Z. Zhu, N. Iwamoto, Y. Peng, Z. Li, Y. Zhou, E. Bozkurt, and B. Zheng (2022a) BEAT: a large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. ArXiv abs/2203.05297.
X. Liu, Q. Wu, H. Zhou, Y. Xu, R. Qian, X. Lin, X. Zhou, W. Wu, B. Dai, and B. Zhou (2022b) Learning hierarchical cross-modal association for co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10462–10472.
J. Xu, W. Zhang, Y. Bai, Q. Sun, and T. Mei (2022) Freeform body motion generation from speech. ArXiv abs/2203.02291.
Y. Yoon, B. Cha, J. Lee, M. Jang, J. Lee, J. Kim, and G. Lee (2020) Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Trans. Graph. 39 (6). ISSN 0730-0301.
Y. Yoon, W. Ko, M. Jang, J. Lee, J. Kim, and G. Lee (2019) Robots learn social skills: end-to-end learning of co-speech gesture generation for humanoid robots. In 2019 International Conference on Robotics and Automation (ICRA), pp. 4303–4309.
Y. Yoon, P. Wolfert, T. Kucherenko, C. Viegas, T. Nikolov, M. Tsakov, and G. E. Henter (2022) The GENEA Challenge 2022: a large evaluation of data-driven co-speech gesture generation. In Proceedings of the ACM International Conference on Multimodal Interaction, ICMI ’22.