The Glass Ceiling of Automatic Evaluation
in Natural Language Generation

Pierre Colombo^{1}, Maxime Peyrard^{2}, Nathan Noiry^{3}, Robert West^{2}, Pablo Piantanida^{4}

^{1} MICS, CentraleSupélec, Université Paris-Saclay, ^{2} EPFL, ^{3} althiqa.io, ^{4} ILLS, McGill - ETS - MILA - CNRS - Université Paris-Saclay - CentraleSupélec
colombo.pierre@centralesupelec.fr

Abstract

Automatic evaluation metrics capable of replacing human judgments are critical to allowing fast development of new methods. Thus, numerous research efforts have focused on crafting such metrics. In this work, we take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics altogether. As metrics are used based on how they rank systems, we compare metrics in the space of system rankings. Our extensive statistical analysis reveals surprising findings: automatic metrics – old and new – are much more similar to each other than to humans. Automatic metrics are not complementary and rank systems similarly. Strikingly, human metrics predict each other much better than the combination of all automatic metrics used to predict a human metric. It is surprising because human metrics are often designed to be independent, to capture different aspects of quality, e.g. content fidelity or readability. We provide a discussion of these findings and recommendations for future work in the field of evaluation.

1 Introduction

Crafting automatic evaluation metrics (AEM) able to replace human judgments is critical to guide progress in natural language generation (NLG), as such automatic metrics allow for cheap, fast, and large-scale development of new ideas. The NLG fields are then heavily influenced by the set of AEM used to decide which systems are valuable. Therefore, a large body of work has focused on improving the ability of AEM to predict human judgments.

Figure 1: Correlation with humans over time considering all existing metrics combined. On the x-axis: evaluation metrics ordered by their release time; on the y-axis: utterance-level Kendall’s $τ$ with humans when training a model to fit human judgments with all metrics available at the time (5-fold cross-validation with an XGBoost regressor). The dotted lines represent different human annotations and datasets. Different variants of the same metric (like ROUGE-1 and ROUGE-2) are averaged. The datasets and metrics are described in Sec. 2.

Human judgment data is typically employed to decide which metric to select based on correlation analysis with human annotations Rankel et al. (2011); Owczarzak et al. (2012); Graham (2015). In this work, we take a step back and investigate the relationship between existing AEM and human judgments globally. We do not make metric recommendations but reflect upon the global progress in the field of automatic evaluation. Our work is motivated by the findings of Fig. 1. It depicts the improvement over time, as new metrics were introduced, in the ability to fit human judgments when using all existing metrics as features. The fit is measured by the correlation with humans of a trained regressor in a 5-fold cross-validation setup. Surprisingly, we observe small marginal improvements and little progress over the years.

Recent works emphasized the importance of viewing metrics in terms of how they rank systems instead of just comparing score values Novikova et al. (2018); Peyrard et al. (2021); Colombo et al. (2022). Indeed, not only is ranking a more robust framework of comparison, but it is also more aligned with the way metrics are used: identifying and extracting the "best system". Thus, we perform our analysis in the space of rankings, i.e., we ask how metrics rank systems. By analyzing 9 datasets covering 4 tasks and 270k scores, we made the following observations:

Findings. (i) Automatic metrics are much more similar to each other, in terms of how they rank systems, than they are to human metrics. It means that AEM, even the more recent transformer-based ones, are similar to the older ones (ROUGE and BLEU) when used in practice. (ii) This lack of complementarity results in the inability to fit human judgments even when all these metrics are taken together as features for a model predicting humans. (iii) Quite surprisingly, different human dimensions – different annotation guidelines such as readability or content fidelity – are very predictive of each other, whereas AEM are much less predictive of humans. This finding is striking because human metrics are designed to capture different and independent aspects of quality whereas AEM have been selected precisely for their ability to match humans. We would expect human metrics to be uncorrelated and automatic metrics to be highly correlated with humans, but we observe the opposite. First, it casts serious doubt on the ability of AEM to replace human judgments. Second, the correlation between independent human annotations of quality hints at some latent inherent goodness of systems: good systems are good in different aspects whereas bad systems are bad across all aspects.

Our findings have several consequences that can inform future research. Newly introduced metrics are not complementary to previous ones, resulting in small global improvements. As a way forward, we propose that research, instead of crafting metrics that maximize correlation with humans, focus on making metrics that also aim to be explicitly complementary to the set of existing metrics. This would enforce maximal marginal gain and ensure that the field, as a whole, makes progress towards capturing the complexity of human annotations.

For practitioners, it is common practice to report several AEM in the hope of getting a better view of system performance. However, reporting several metrics that all produce similar rankings does not bring useful additional information. With our proposal, reporting a set of complementary metrics would better serve the intended purpose.

To help researchers build upon our work and use our measure of complementarity, we make our code available on GitHub.

2 Methodology

Terminology. Let $X$ be the space of possible outputs for an NLG task. An NLG metric is a function $m : X \times X \to R_{+}$ which, from a given textual candidate $C \in X$ and corresponding reference $R \in X$, computes a score $m (C, R)$ reflecting the properties that $C$ should satisfy (e.g. fluency, fidelity…). Of course, it is illusory to summarize subtle semantic properties by a single scalar, and one rather seeks metrics that are able to discriminate between different systems. In fact, crafted AEM are evaluated by comparison to human judgments: one usually computes ranking correlations such as Kendall’s $τ$, with higher correlations indicating that the AEM is a better replacement for the human metrics.

Encoding metrics with rankings. Since NLG metrics are used to rank systems, we choose to represent an NLG metric, automatic or human, by the ranking it induces on a set of systems or of utterances. More formally, for $N \geq 1$ NLG systems evaluated on a dataset made of $K \geq 1$ utterances, there exist two natural ranking representations of $m$:

Each utterance $k \in \{1, \dots, K\}$ induces a ranking of the $N$ systems, seen as a vector $σ_{k}^{m} \in R^{N}$, where $σ_{k}^{m} (S)$ is the rank of system $S \in \{1, \dots, N\}$. For a system $S$, the representation of a metric $m$, noted $σ^{m, S}$, is the sum of rankings over the utterances:

$$σ^{m, S} := \sum_{k = 1}^{K} σ_{k}^{m} (S) \in R^{N} . \tag{1}$$

We call this System level representation.

Symmetrically, each system $n \in \{1, \dots, N\}$ induces a ranking $σ_{n}^{m} \in R^{K}$ of the $K$ utterances, where $σ_{n}^{m} (k)$ is the rank of utterance $k$. The Utterance level representation of $m$ is the sum of rankings over the systems:

$$σ_{utt}^{m} := \sum_{n = 1}^{N} σ_{n}^{m} \in R^{K} . \tag{2}$$

Using the space of rankings has been shown to be more robust than using the raw scores, as it is less sensitive to outliers and statistical variations Novikova et al. (2017); Peyrard et al. (2021); Colombo et al. (2022). Furthermore, this representation is closely tied to the Borda count, which enjoys theoretical properties: the ranking induced by $σ^{m, S}$ is a $5$-approximation of the Kemeny consensus, which is a good notion of average in the symmetric group Kemeny (1959); Young and Levenglick (1978); Coppersmith et al. (2006). It is moreover the fastest approximation of the Kemeny consensus, whose computation is NP-hard Ali and Meilă (2012).
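As an illustration of the two representations, the sketch below sums per-utterance ranks into a system-level vector and per-system ranks into an utterance-level vector. The function names, the score-table layout, and the convention that the highest score receives rank 1 (ties broken by position) are assumptions made for this example, not details taken from the released code.

```python
def ranks_from_scores(scores):
    """Rank items by score; rank 1 goes to the highest score.

    Assumed convention: ties are broken by position (stable sort).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, item in enumerate(order, start=1):
        ranks[item] = position
    return ranks


def system_level_representation(score_table):
    """Sum, over utterances, of the ranks of the N systems (length-N vector).

    score_table[k][n] is the score a metric gives to system n on utterance k.
    """
    n_systems = len(score_table[0])
    totals = [0] * n_systems
    for utterance_scores in score_table:
        for system, rank in enumerate(ranks_from_scores(utterance_scores)):
            totals[system] += rank
    return totals


def utterance_level_representation(score_table):
    """Sum, over systems, of the ranks of the K utterances (length-K vector)."""
    n_utterances = len(score_table)
    n_systems = len(score_table[0])
    totals = [0] * n_utterances
    for system in range(n_systems):
        column = [score_table[k][system] for k in range(n_utterances)]
        for utterance, rank in enumerate(ranks_from_scores(column)):
            totals[utterance] += rank
    return totals


# Toy table: 2 utterances (rows) scored for 3 systems (columns).
score_table = [
    [0.9, 0.5, 0.1],
    [0.8, 0.6, 0.7],
]
system_vector = system_level_representation(score_table)        # [2, 5, 5]
utterance_vector = utterance_level_representation(score_table)  # [5, 4]
```

A smaller entry in the system-level vector means the system was ranked near the top on more utterances, which is the Borda-style aggregation described above.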

Complementarity. We measure the complementarity between two metrics – human or automatic – by the average over utterances of the distance between their rankings of systems. Formally, for two metrics $m_{0}$ and $m_{1}$, complementarity is given by:

$$C (m_{0}, m_{1}) := \frac{1}{K} \sum_{k = 1}^{K} d_{τ} (σ_{k}^{m_{0}}, σ_{k}^{m_{1}}), \tag{3}$$

where $d_{τ}$ is the normalized Kendall distance between the rank vectors. It is related to Kendall’s rank correlation $τ$ by: $τ = 1 - 2 d_{τ}$.

Similarly, we define the complementarity between a metric $m_{0}$ and a set of other metrics $m := \{m_{i}\}_{i = 1, \dots, l}$ as the average pairwise complementarity:

$$C (m_{0}, m) = \frac{1}{l} \sum_{i = 1}^{l} C (m_{0}, m_{i}) . \tag{4}$$

Complementarity measures the extent to which a metric ranks systems differently than another metric or a set of other metrics. Whether comparing two metrics or a metric with a set, it is a number between 0 and 1, where 0 indicates that the metrics rank systems in the exact same order and 1 indicates the exact opposite order. In between, it counts the number of inversions between the two rank lists, normalized by the number of possible pairs of systems.
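The definitions in Eq. 3 and Eq. 4 can be sketched in a few lines. This is a minimal illustration that assumes rankings without ties and represents each metric as a list of per-utterance rank vectors; the function names and the toy rankings are hypothetical.

```python
from itertools import combinations


def kendall_distance(rank_a, rank_b):
    """Normalized Kendall distance: fraction of system pairs ordered differently.

    Assumes both rank vectors have the same length (at least 2) and no ties.
    """
    pairs = list(combinations(range(len(rank_a)), 2))
    inversions = sum(
        1
        for i, j in pairs
        if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0
    )
    return inversions / len(pairs)


def complementarity(ranks_m0, ranks_m1):
    """Eq. 3: average over utterances of the distance between system rankings."""
    distances = [kendall_distance(a, b) for a, b in zip(ranks_m0, ranks_m1)]
    return sum(distances) / len(distances)


def complementarity_to_set(ranks_m0, other_metrics):
    """Eq. 4: average pairwise complementarity between m0 and a set of metrics."""
    values = [complementarity(ranks_m0, ranks) for ranks in other_metrics]
    return sum(values) / len(values)


# Toy example: 2 utterances, 3 systems, one rank vector per utterance.
m0 = [[1, 2, 3], [1, 2, 3]]
m1 = [[1, 3, 2], [3, 2, 1]]  # one inversion, then a fully reversed order

c_same = complementarity(m0, m0)   # 0.0: identical rankings
c_pair = complementarity(m0, m1)   # (1/3 + 1) / 2 = 2/3
c_set = complementarity_to_set(m0, [m0, m1])
tau_first_utterance = 1 - 2 * kendall_distance(m0[0], m1[0])  # τ = 1 - 2 d
```

With ties, the inversion count above would no longer match the usual tie-corrected Kendall statistics, so a tie-handling rule would have to be chosen explicitly.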

2.1 Dataset description

To ensure a wide coverage of NLG we focus on four different problems i.e., dialogue generation (using PersonaChat (PC) and TopicalChat (TC) Mehri and Eskenazi (2020)), image description (relying on FLICKR Young et al. (2014)), summary evaluation (via TAC08 Dang and Owczarzak (2008), TAC10, TAC11 Owczarzak and Dang (2011), RSUM Bhandari et al. (2020) and SEVAL Fabbri et al. (2021)), and translation (focusing on multilingual quality estimation (MLQE) Ranasinghe et al. (2021)).

For each task, we gather datasets and rely on AEM such as JS [1-2] Lin et al. (2006), BLEU Papineni et al. (2002); Post (2018), Chrfpp Popović (2017), S3 (both variants, pyr and resp) Peyrard et al. (2017), ROUGE Lin (2004) (including 5 of its variants Ng and Abrecht (2015)), BERTScore Zhang et al. (2019), and MoverScore Zhao et al. (2019). For MLQE, we solely consider several versions of BERTScore, MoverScore and ContrastScore. The human evaluation criteria are specific to each dataset and are identified by the prefix H:. Overall, our final datasets gather over 270k scores.

3 Experiments

Figure 2: Complementarity: For each dataset, the pairwise complementarity between each pair of metrics, both human and automatic, as computed by Eq. 3. In these matrix plots, symmetric by design, we ordered metrics to have the human ones first and the automatic ones after; the red lines trace the limit between humans and AEM.

Finding 1: Automatic metrics are much more similar to each other than they are to human metrics. In Fig. 2, we report the pairwise complementarity between each pair of metrics as computed by Eq. 3 for both human metrics and AEM. When aggregated over pairs and over datasets, we obtain an average complementarity between: (i) two human metrics of $.16 \pm .01$, (ii) two AEM of $.20 \pm .01$ and (iii) a human and an automatic metric of $.35 \pm .02$.

Importantly, we observe across datasets low complementarity, i.e., strong similarity, between AEM, low complementarity between human metrics but high complementarity, i.e., low similarity, between automatic and human metrics.

We draw two conclusions from this analysis: (i) AEM rank systems similarly but (ii) differently from humans. There are some nuances across datasets. The effect described above is particularly strong in the Dialog, MLQE and SUM-Eval datasets. In contrast, we notice that TAC datasets, from the summarization task, have lower complementarity in general, meaning that all metrics, human and automatic, are more similar. Indeed, a lot of works have relied on these datasets to develop new metrics. Interestingly, the more recent REAL-SUM and SUM-Eval reveal much lower metric similarity.

Figure 3: Human metrics are significantly more predictive of each other than AEM. On this plot, we report the 5-fold cross-validated result of fitting an XGBoost regressor on various feature sets: (i) all available AEM, (ii) other human metrics when available, and (iii) both automatic and human metrics. The fit is measured as the average instance-level correlation in the test set.

Finding 2: Automatic metrics, even all combined, do not explain human metrics. If AEM are rather different from human metrics, we might wonder whether it is possible to get a good approximation of human judgments by combining existing AEM together. To account for possible correlations, we rely on XGBoost regressors with 5-fold cross-validation to predict human judgments. The training is performed on three different feature spaces: (i) AEM only, (ii) other human metrics only and (iii) both sets of metrics combined. We compute Kendall’s $τ$ between predictions and ground truths and report the results in Fig. 3.
The plot confirms that AEM struggle to capture the subtlety of human judgments: correlation rarely exceeds .4 on held-out data. In contrast, human metrics are much more predictive of each other, even if they are often supposed to capture different concepts. Finally, it is worth noting that adding AEM to human metrics brings no marginal improvement in predictive power.
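The evaluation protocol (fit on training folds, then measure Kendall’s $τ$ between predictions and human scores on the held-out fold) can be sketched as follows. To keep the example runnable with only the standard library, the XGBoost regressor is replaced by a hypothetical nearest-neighbour stand-in passed in as a callable; the function names, the interleaved fold assignment, the tau-a variant of Kendall’s $τ$, and the toy data are all assumptions of this sketch.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    return score / len(pairs)


def cross_validated_tau(features, targets, fit_predict, n_folds=5):
    """Average held-out Kendall's tau over n_folds interleaved folds.

    fit_predict(train_x, train_y, test_x) must return one prediction per
    test row; any regressor can be wrapped this way.
    """
    n = len(targets)
    taus = []
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        predictions = fit_predict(
            [features[i] for i in train_idx],
            [targets[i] for i in train_idx],
            [features[i] for i in test_idx],
        )
        taus.append(kendall_tau(predictions, [targets[i] for i in test_idx]))
    return sum(taus) / n_folds


def nearest_neighbour(train_x, train_y, test_x):
    """Stand-in regressor (not XGBoost): copy the target of the closest row."""
    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    predictions = []
    for row in test_x:
        closest = min(
            range(len(train_x)),
            key=lambda i: squared_distance(train_x[i], row),
        )
        predictions.append(train_y[closest])
    return predictions


# Toy data: one "automatic metric" feature that is monotone in the "human" score.
features = [[float(i)] for i in range(20)]
targets = [2.0 * i for i in range(20)]
tau = cross_validated_tau(features, targets, nearest_neighbour)
```

Swapping the feature columns between automatic metrics only, human metrics only, and both corresponds to the three feature spaces compared in Fig. 3.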

These findings cast a shadow over recent progress in the field. In the next section, we discuss the implications and make a proposal for future work.

4 Discussion

Our analysis reveals that automatic metrics are not complementary, and recent automatic metrics actually capture the same properties of human judgments as older ones. Furthermore, the existing metrics are not strong predictors of human judgments. Quite surprisingly, human metrics, which are often designed to be independent of each other, end up being more predictive of each other than automatic metrics are. This predictability of human metrics from one another can be explained by the available datasets: when a system is good at extracting content, it is also often good at making the content readable; when a system is bad, it is often bad across the board in all human metrics. However, the fact that automatic metrics are less predictive than other human dimensions casts some shadow over recent progress in the field. It shows that the current strategy of crafting metrics with slightly better correlation than baselines with one of the human metrics has reached its limit and that some qualitative change is needed.

A promising strategy to address the limitations of automatic metrics is to report several of them, hoping that they will together give a more robust overview of system performance. However, this makes sense only if automatic metrics measure different aspects of human judgments, i.e., if they are complementary. In this work, we have seen that metrics are in fact not complementary, as they produce similar rankings of systems.

Proposition for future work. To foster meaningful progress in the field of automatic evaluation, we propose that future research craft new metrics not only to maximize correlation with human judgments but also to minimize the similarity with the body of existing automatic metrics. This would ensure that the field progresses as a whole by focusing on capturing aspects of human judgments that are not already captured by existing metrics. Furthermore, the reporting of several metrics that have been demonstrated to be complementary could again become a valid heuristic to get a robust overview of model performance. In practice, researchers could re-use our code and analysis to enforce complementarity by, for example, requiring new metrics to have high complementarity with existing ones as measured by Eq. 3.

5 Limitations

Even though we have considered a representative set of automatic evaluation metrics, new ones are constantly introduced and could be added to such an analysis. Similarly, new datasets could be added to the analysis and impact the results. In an effort to make our findings relevant in the long run, we release an easy-to-use code base to replicate our analysis with new metrics and datasets.

Like the majority of analyses of automatic evaluation metrics, ours relies on the assumption that human judgments are valid and meaningful. However, some works have questioned the quality of human judgments in standard datasets.

References

A. Ali and M. Meilă (2012) Experiments with kemeny ranking: what works when? Mathematical Social Sciences 64 (1), pp. 28–40.
J. J. Bartholdi, C. A. Tovey, and M. A. Trick (1989) The computational difficulty of manipulating an election. Social Choice and Welfare 6 (3), pp. 227–241.
M. Bhandari, P. Gour, A. Ashfaq, P. Liu, and G. Neubig (2020) Re-evaluating evaluation in text summarization.
V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre (2008) Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment 2008 (10), pp. P10008.
E. Chapuis, P. Colombo, M. Labeau, and C. Clavel (2021) Code-switched inspired losses for generic spoken dialog representations. arXiv preprint arXiv:2108.12465.
E. Chapuis, P. Colombo, M. Manica, M. Labeau, and C. Clavel (2020) Hierarchical pre-training for sequence labelling in spoken dialog. arXiv preprint arXiv:2009.11152.
C. Chhun, P. Colombo, C. Clavel, and F. M. Suchanek (2022) Of human criteria and automatic metrics: a benchmark of the evaluation of story generation. arXiv preprint arXiv:2208.11646.
P. Colombo, E. Chapuis, M. Labeau, and C. Clavel (2021a) Improving multimodal fusion via mutual dependency maximisation. arXiv preprint arXiv:2109.00922.
P. Colombo, E. Chapuis, M. Manica, E. Vignon, G. Varni, and C. Clavel (2020) Guiding attention in sequence-to-sequence models for dialogue act prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 7594–7601.
P. Colombo, C. Clavel, and P. Piantanida (2021b) InfoLM: a new metric to evaluate summarization & data2text generation. arXiv preprint arXiv:2112.01589.
P. Colombo, C. Clavel, C. Yack, and G. Varni (2021) Beam search with bidirectional strategies for neural response generation. In Proceedings of The Fourth International Conference on Natural Language and Speech Processing (ICNLSP 2021), Trento, Italy, pp. 139–146.
P. Colombo, N. Noiry, E. Irurozki, and S. Clémençon (2022) What are the best systems? new perspectives on nlp benchmarking. arXiv preprint arXiv:2202.03799.
P. Colombo, P. Piantanida, and C. Clavel (2021) A novel estimator of mutual information for learning to disentangle textual representations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 6539–6550.
P. Colombo, G. Staerman, C. Clavel, and P. Piantanida (2021c) Automatic text evaluation through the lens of wasserstein barycenters. arXiv preprint arXiv:2108.12463.
P. Colombo, G. Staerman, N. Noiry, and P. Piantanida (2022) Learning disentangled textual representations via statistical measures of similarity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, pp. 2614–2630.
P. Colombo, W. Witon, A. Modi, J. Kennedy, and M. Kapadia (2019) Affect-Driven Dialog Generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3734–3743.
P. Colombo (2021) Learning to represent and generate text using information measures. Ph.D. Thesis, Institut polytechnique de Paris.
D. Coppersmith, L. Fleischer, and A. Rudra (2006) Ordering by weighted number of wins gives a good ranking for weighted tournaments. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pp. 776–782.
H. T. Dang and K. Owczarzak (2008) Overview of the tac 2008 update summarization task. In TAC.
C. Dwork, R. Kumar, M. Naor, and D. Sivakumar (2001) Rank aggregation methods for the web. In Proceedings of the 10th international conference on World Wide Web, pp. 613–622.
A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, and D. Radev (2021) Summeval: re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics 9, pp. 391–409.
Y. Graham (2015) Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 128–137.
I. T. Jolliffe and J. Cadima (2016) Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374 (2065), pp. 20150202.
J. G. Kemeny (1959) Mathematics without numbers. Daedalus 88 (4), pp. 577–591.
C. Lin, G. Cao, J. Gao, and J. Nie (2006) An information-theoretic approach to automatic evaluation of summaries. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pp. 463–470.
C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81.
S. Mehri and M. Eskenazi (2020) Usr: an unsupervised and reference free evaluation metric for dialog generation. arXiv preprint arXiv:2005.00456.
J. Ng and V. Abrecht (2015) Better summarization evaluation with word embeddings for rouge. arXiv preprint arXiv:1508.06034.
J. Novikova, O. Dušek, A. Cercas Curry, and V. Rieser (2017) Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2241–2252.
J. Novikova, O. Dušek, and V. Rieser (2018) RankME: reliable human ratings for natural language generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 72–78.
K. Owczarzak, J. M. Conroy, H. T. Dang, and A. Nenkova (2012) An Assessment of the Accuracy of Automatic Evaluation in Summarization. In Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization, pp. 1–9.
K. Owczarzak and H. T. Dang (2011) Overview of the tac 2011 summarization track: guided task and aesop task. In Proceedings of the Text Analysis Conference (TAC 2011), Gaithersburg, Maryland, USA, November.
K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th ACL, Philadelphia, Pennsylvania, USA, pp. 311–318.
M. Peyrard, T. Botschen, and I. Gurevych (2017) Learning to score system summaries for better content selection evaluation. In Proceedings of the Workshop on New Frontiers in Summarization, pp. 74–84.
M. Peyrard, W. Zhao, S. Eger, and R. West (2021) Better than average: paired evaluation of nlp systems. arXiv preprint arXiv:2110.10746.
M. Popović (2017) ChrF++: words helping character n-grams. In Proceedings of the second WMT, pp. 612–618.
M. Post (2018) A call for clarity in reporting bleu scores. arXiv preprint arXiv:1804.08771.
T. Ranasinghe, C. Orasan, and R. Mitkov (2021) An exploratory analysis of multilingual word-level quality estimation with cross-lingual transformers. arXiv preprint arXiv:2106.00143.
G. Staerman, P. Mozharovskyi, S. Clémençon, and F. d’Alché-Buc (2021) Depth-based pseudo-metrics between probability distributions.
W. Witon, P. Colombo, A. Modi, and M. Kapadia (2018) Disney at IEST 2018: predicting Emotions Using an Ensemble. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Brussels, Belgium, pp. 248–253.
H. P. Young and A. Levenglick (1978) A consistent extension of condorcet’s election principle. SIAM Journal on applied Mathematics 35 (2), pp. 285–300.
P. Young, A. Lai, M. Hodosh, and J. Hockenmaier (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, pp. 67–78.
T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2019) Bertscore: evaluating text generation with bert.
W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger (2019) Moverscore: text generation evaluating with contextualized embeddings and earth mover distance.

Appendix A Extended Methodology

A.1 Utterance level Representation

In the main paper, we focus on System level representation. Each utterance $k \in \{1, \dots, K\}$ induces a ranking $σ_{k}^{m} \in R^{N}$ of the $N$ systems, where $σ_{k}^{m} (n)$ is the rank of system $n$. The system level representation of $m$ is the sum of rankings over the utterances:

$$σ_{sys}^{m} := \sum_{k = 1}^{K} σ_{k}^{m} \in R^{N} . \tag{5}$$

In the supplementary material, we also provide an analysis using the Utterance level representation. Each system $n \in \{1, \dots, N\}$ induces a ranking $σ_{n}^{m} \in R^{K}$ of the $K$ utterances, where $σ_{n}^{m} (k)$ is the rank of utterance $k$. The utterance level representation of $m$ is the sum of rankings over the systems:

$$σ_{utt}^{m} := \sum_{n = 1}^{N} σ_{n}^{m} \in R^{K} . \tag{6}$$

A.2 A remark on the rank representations

For a given family of $l \geq 1$ objects, the formal mathematical object describing a ranking is a permutation $σ \in S_{l}$ which describes how the objects must be interchanged to be ordered. The set of permutations is a group where the notion of mean is not straightforward since the addition of two permutations is not a well defined object. For a given family $σ_{1}, \dots, σ_{p}$ , the classical surrogate is called a Kemeny consensus, defined by:

$$σ^{*} \in \operatorname{argmin}_{σ \in S_{l}} \sum_{i = 1}^{p} d (σ_{i}, σ), \tag{7}$$

where $d$ is the Kendall distance, given by:

$$d (η, τ) := \sum_{1 \leq i, j \leq N} 1_{(η_{i} - η_{j}) (τ_{i} - τ_{j}) < 0} . \tag{8}$$

However, computing a Kemeny consensus is an NP-hard problem Bartholdi et al. (1989); Dwork et al. (2001). It turns out that the Borda count, defined as the sum of ranks induced by the permutations, is a very good approximation of the Kemeny consensus Ali and Meilă (2012), justifying our choices (5) and (6).
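To make the relationship between the two aggregation rules concrete, the sketch below computes a Kemeny consensus by brute force, following Eq. 7, and compares it with the Borda ranking on a toy example. The function names and toy rankings are hypothetical; a ranking is encoded as a tuple whose entry `i` is the rank of object `i`, and the distance counts each unordered pair once, which only rescales Eq. 8 by a constant and leaves the argmin unchanged.

```python
from itertools import combinations, permutations


def kendall_distance(eta, tau):
    """Number of object pairs that the two rank vectors order differently."""
    return sum(
        1
        for i, j in combinations(range(len(eta)), 2)
        if (eta[i] - eta[j]) * (tau[i] - tau[j]) < 0
    )


def kemeny_consensus(rankings):
    """Brute-force search over all permutations (Eq. 7).

    The cost grows factorially with the number of objects, so this is only
    usable for a handful of objects.
    """
    n_objects = len(rankings[0])
    return min(
        permutations(range(n_objects)),
        key=lambda candidate: sum(kendall_distance(r, candidate) for r in rankings),
    )


def borda_ranking(rankings):
    """Sum the ranks of each object, then rank objects by that total.

    Ties between totals are broken by object index (an assumption).
    """
    n_objects = len(rankings[0])
    totals = [sum(r[obj] for r in rankings) for obj in range(n_objects)]
    order = sorted(range(n_objects), key=lambda obj: totals[obj])
    ranks = [0] * n_objects
    for position, obj in enumerate(order):
        ranks[obj] = position
    return tuple(ranks)


# Three voters rank three objects; two agree and one swaps the top two.
rankings = [(0, 1, 2), (0, 1, 2), (1, 0, 2)]
kemeny = kemeny_consensus(rankings)
borda = borda_ranking(rankings)
```

On this toy input both rules return the same ranking; in general the Borda ranking is only an approximation, but it needs a single pass over the rankings instead of a search over all permutations.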

Appendix B Extending Finding 1 using clustering analysis

In this section, we want to obtain a visual and interpretable representation of both automatic and human metrics to understand their relationships better. Formally, we study the abstract space of metrics when encoded at the System or Utterance level. We ask the two following questions:

What is the effective dimension of this space?
Do clusters of metrics exist?

B.1 Representing the metrics in a 2D space

In (a) and (b), we report the variance analysis given by a PCA Jolliffe and Cadima (2016) for each dataset at the System and Utterance levels, respectively.
Analysis: We observe that only a few components (fewer than 6) are needed to explain over 80% of the variance. This behavior is typical of all considered datasets and can be observed when studying the ranks at the System and Utterance levels.
Takeaways: Automatic and human metrics present in our datasets can be represented in a low-dimensional space. This confirms the low complementarity already observed in the main paper: the effective dimension of metrics is small. We will use the first two components in the next experiments to represent the metrics in a 2D space.

B.2 Finding similar groups of metrics

In Figure 5 and Figure 6, we represent all the considered metrics (both human and automatic) in the 2D space corresponding to the first two components of the PCA. We cluster the metrics with the Louvain algorithm Blondel et al. (2008), applied to the similarity matrix between metrics.

Analysis: From the figures, we observe a low number of clusters, i.e., two in most cases and at most three in the case of utterance level representations. When using the system-level representation, the human metrics have their own cluster in all the configurations except for FLICKR, where H:overall is in the same cluster as $J S_{2}$. We observe a similar trend when studying the utterance level representation: human metrics often belong to the same cluster, which contains a low number of automatic metrics. It is also worth noting that in most figures, human metrics are isolated.

Takeaways: This experiment further validates Finding 1: automatic metrics are much more similar to each other than they are to human metrics. The proposed procedure could be used in the future to find properties of newly introduced metrics and obtain visual representations of the metrics.

B.3 Extension to other types of tasks

In the future, we would like to incorporate more metrics such as BaryScore Colombo et al. (2021c), InfoLM Colombo et al. (2021b), DepthScore Staerman et al. (2021) and apply our methodology to other tasks such as affect-driven sentence generation Colombo et al. (2019, 2021); Colombo (2021); Colombo et al. (2021, 2020, 2021a); Witon et al. (2018); Colombo et al. (2022); Chapuis et al. (2020, 2021) or story generation Chhun et al. (2022).

Appendix C Further results for Findings 2

In this section, we provide further experiments that validate Finding 2 and provide a method for future research to understand newly introduced metrics better. Specifically, we aim to answer the following research questions:

In Finding 1, we show that human metrics carry different information than automatic metrics. How can we measure the amount of information missing between the automatic and human metrics?
Which metric or group of metrics is the most useful to predict a given human metric?

C.1 Measuring the information missing in automatic metrics

In this subsection, we extend the result provided by Figure 3. We measure the ratio between the MSE-error of a linear regression trained with automatic metrics together with human metrics and a linear regression trained only with automatic metrics for varying regularization coefficient. For each dataset, we provide mean and variance corresponding to the prediction of available human metrics. When solely one human metric is available, the dataset is not considered.

Observations: From Figure 7, we observe a strong decrease in error when adding human metrics to predict another human metric. As $α$ increases, all the coefficients are eventually set to 0, and the relative MSE is thus 0. It is worth noting that these observations hold for both system and utterance level representations. When observing the details per dataset, we observe a similar trend for all human metrics.

Takeaways: When predicting a specific human metric, other human metrics contain useful predictive information that is not present in the automatic metrics.

(a) Aggregated score for each task when using System Level Representation.

C.2 Which metrics are the most useful to predict human judgment at the System level?

For this experiment, we rely on a Lasso regression and denote the multiplier of the L1 term by $α$. For several values of $α$ (x-axis), we report each metric’s weight (y-axis) in Figures 8 and 9.
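For reference, a Lasso regression with L1 multiplier $α$ is commonly written as the objective below. The $1/(2n)$ scaling, the symbols $x_{i}$ (the vector of the other metrics' values for instance $i$), $y_{i}$ (the human score to predict), $w$, $n$, $p$, and the omission of an intercept follow one common convention and are assumptions here, not details taken from the experiments.

```latex
\hat{w}(\alpha) \in \arg\min_{w \in R^{p}} \; \frac{1}{2 n} \sum_{i = 1}^{n} \left( y_{i} - w^{\top} x_{i} \right)^{2} + \alpha \, \| w \|_{1}
```

Because the L1 penalty drives individual coordinates of $w$ exactly to zero as $α$ grows, the order in which the weights vanish can be read as a rough ordering of how useful each metric is for the prediction.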

Observations: When increasing the weight given to the L1 penalization term, we observe that the regression weights of the human metrics are the last to be set to 0. Human metrics thus contain the most relevant information. It is worth noting that this phenomenon is generic across the datasets and human criteria.

Takeaways: Human metrics are the most useful metrics when predicting other human metrics.