Multi-modal Contrastive Representation Learning for Entity Alignment

Zhenxi Lin¹, Ziheng Zhang¹, Meng Wang², Yinghui Shi³, Xian Wu¹, Yefeng Zheng¹
¹Tencent Jarvis Lab, Shenzhen, China
²School of Computer Science and Engineering, Southeast University, Nanjing, China
³School of Cyber Science and Engineering, Southeast University, Nanjing, China
{chalerislin,zihengzhang}@tencent.com
{meng.wang,shiyinghui}@seu.edu.cn
{kevinxwu,yefengzheng}@tencent.com

Abstract

Multi-modal entity alignment aims to identify equivalent entities between two different multi-modal knowledge graphs, which consist of structural triples and images associated with entities. Most previous works focus on how to utilize and encode information from different modalities, while it is not trivial to leverage multi-modal knowledge in entity alignment because of the modality heterogeneity. In this paper, we propose MCLEA, a Multi-modal Contrastive Learning based Entity Alignment model, to obtain effective joint representations for multi-modal entity alignment. Different from previous works, MCLEA considers task-oriented modality and models the inter-modal relationships for each entity representation. In particular, MCLEA first learns multiple individual representations from multiple modalities, and then performs contrastive learning to jointly model intra-modal and inter-modal interactions. Extensive experimental results show that MCLEA outperforms state-of-the-art baselines on public datasets under both supervised and unsupervised settings. The source code is available at https://github.com/lzxlin/MCLEA.

1 Introduction

Knowledge graphs (KGs) such as DBpedia lehmann2015dbpedia and YAGO mahdisoltani2014yago3 employ the graph structure to represent real-world knowledge, where the concepts are represented as nodes and the relationships among concepts are represented as edges. KGs have been widely applied to knowledge-driven applications to boost their performance, such as recommender systems cao2019unifying, information extraction han2018neural, and question answering ijcai2021p611. In recent years, an increasing amount of knowledge has been represented in multi-modal formats, such as MMKG liu2019mmkg and Richpedia wang2020richpedia. These multi-modal KGs usually contain images as the visual modality, such as profile photos, thumbnails, or posters. The augmented visual modality has shown a significant capability to improve KG-based applications chen2020mmea. It has also been shown that incorporating the visual modality can enhance the contextual semantics of entities and yield improved KG embeddings wang2021visual.

Due to the large scope of real-world knowledge, most KGs are incomplete, and multiple different KGs are usually complementary to one another. As a result, integrating multiple KGs into a unified one can enlarge the knowledge coverage and can also assist in refining KGs by discovering potential flaws chen2020mmea. To integrate heterogeneous multi-modal knowledge graphs, the task of multi-modal entity alignment (MMEA) has therefore been proposed, which aims to discover equivalent entities referring to the same real-world object. Several previous MMEA works have shown that the inclusion of the visual modality in modeling helps to improve the performance of entity alignment. For instance, MMEA chen2020mmea and EVA liu2021visual proposed distinct multi-modal fusion modules to integrate entity representations from multiple modalities into joint embeddings, and they achieved state-of-the-art performance. (To distinguish the model MMEA from the task MMEA, we use EA to denote multi-modal entity alignment for the rest of the paper.) However, these methods mainly utilize existing representations from different modalities, and customized representation learning for EA is not fully explored. In addition, existing methods only explore the use of diverse multi-modal representations to enhance the contextual embedding of entities, while the inter-modal interactions are often neglected in modeling.

To address the aforementioned problems, we propose MCLEA, a Multi-modal Contrastive Learning based Entity Alignment model, which effectively integrates multi-modal information into joint representations for EA. MCLEA first utilizes multiple individual encoders to obtain modality-specific representations for each entity. The individually encoded information includes neighborhood structures, relations, attributes, surface forms (i.e., entity names), and images. Furthermore, we introduce contrastive learning into EA with an intra-modal contrastive loss and an inter-modal alignment loss. Specifically, the intra-modal contrastive loss aims at distinguishing the embeddings of equivalent entities from those of other entities for each modality, thus generating more appropriate representations for EA. The inter-modal alignment loss, on the other hand, aims at modeling inter-modal interactions and reducing the gaps between modalities for each entity. With these two losses, MCLEA can learn discriminative cross-modal latent embeddings and keep potentially equivalent entities close in the joint embedding space, regardless of the modality. MCLEA is also generic as it can support a wide variety of modalities. Moreover, it combines multiple losses and simultaneously learns multiple objectives using task-dependent uncertainty.

The contributions of this paper are three-fold: (i) We propose a novel method, called MCLEA, to embed information from different modalities into a unified vector space and then obtain discriminative entity representations based on contrastive learning for entity alignment. (ii) We propose two novel losses to explore intra-modal relationships and inter-modal interactions, ensuring that to-be-aligned entities between different KGs are semantically close with minimum gaps between modalities. (iii) We experimentally validate the effectiveness and superiority of MCLEA as it achieves state-of-the-art performance on several public datasets in both supervised and unsupervised settings. The overall results also suggest that our MCLEA is capable of learning more discriminative embedding space for multi-modal entity alignment.

2 Related Work

2.1 Multi-modal Knowledge Graph

While many efforts mahdisoltani2014yago3; lehmann2015dbpedia have been made to achieve large-scale KGs, there are just a few attempts to enrich KGs with multiple modalities. For example, MMKG liu2019mmkg and Richpedia wang2020richpedia utilized the rich visual resources (mainly images) to construct multi-modal knowledge graphs. Their target was to enrich the KG information via appending sufficient and diverse images to textual entities but they also brought challenges to the KG embedding methods. Unlike traditional KG embedding methods, multi-modal KG embedding methods model the textual and visual modalities at the same time zhang2019multi; wang2021visual. For example, zhang2019multi proposed MKHAN to exploit multi-modal KGs with hierarchical attention networks on question answering, and wang2021visual proposed RSME to selectively incorporate visual information during the KG embedding learning process.

Figure 1: The overall architecture of MCLEA, which combines multiple modalities (§ 3.1) and learns through two proposed losses (§ 3.2), intra-modal contrastive loss (ICL) and inter-modal alignment loss (IAL).

2.2 Multi-modal Entity Alignment

Recent studies for entity alignment mostly focused on exploring the symbolic similarities based on various features, including entity names wu2019relation; zhang2019multi, attributes liu2020exploring, descriptions zhang2019multi; tang2021bert and ontologies xiang2021ontoea. Most of them started with transforming entities from different KGs into a unified low-dimensional vector space by translation-based models or graph neural networks and then discovered their counterparts based on the similarity metrics of entity embeddings. Some surveys summarized that additional KG information, if appropriately encoded, could further improve the performance of EA methods ZQW2020VLDB; ZJX2020COLING. Some previous attempts even proposed to guide these embedding-based EA models with probabilistic reasoning qi2021unsupervised. With such findings and the increasing popularity of multi-modal knowledge graphs, how to incorporate visual modality in EA, namely multi-modal entity alignment, has begun to draw research attention but the attempts are limited. The pioneer method PoE liu2019mmkg combined the multi-modal features and measured the credibility of facts by matching the underlying semantics of entities. Afterward, MMEA chen2020mmea was proposed to integrate knowledge from different modalities (relational, visual, and numerical) and obtain the joint entity representations. Another method termed EVA liu2021visual leveraged visual knowledge and other auxiliary information to achieve EA in both supervised and unsupervised manner. Alternatively, the method HMEA guo2021multi modeled structural and visual representations in the hyperbolic space, while Masked-MMEA shi2022prob discussed the impacts of visual modality and proposed to incorporate a selectively masking technique to filter potential visual noises. 
These methods mainly utilize multi-modal representations to enhance the contextual embedding of entities; nevertheless, customized entity representations for EA and inter-modal interactions are often neglected in modeling. Different from previous methods, our proposed MCLEA learns both intra-modal and inter-modal dynamics simultaneously through the proposed contrastive objectives, with the aim of learning more discriminative and richer entity representations for EA.

3 Proposed Method

We start with the problem formulation and the notations. A multi-modal KG is denoted as $G = (E, R, A, V, T)$, where $E, R, A, V, T$ are the sets of entities, relations, attributes, images, and triples, respectively. Given $G_{1} = (E_{1}, R_{1}, A_{1}, V_{1}, T_{1})$ and $G_{2} = (E_{2}, R_{2}, A_{2}, V_{2}, T_{2})$ as two KGs to be aligned, the aim of EA is to find aligned entity pairs $A = \{(e_{1}, e_{2}) \mid e_{1} \equiv e_{2}, e_{1} \in E_{1}, e_{2} \in E_{2}\}$, where we assume a small set of entity pairs $S$ (seeds) is given as training data. The overall architecture of the proposed MCLEA is shown in Figure 1, and its two primary components, multi-modal embeddings and contrastive representation learning, are detailed in the following sections.

3.1 Multi-Modal Embeddings

Multi-modal KGs often depict various features with multiple modalities (or views), which are complementary to each other. We investigate different embeddings from different modalities for MCLEA, including neighborhood structures, relations, attributes, names (often termed surface forms in previous work liu2021visual), and images. Each modality is processed with an individual encoder network adapted to the nature of the signal. These uni-modal embeddings are then aggregated with a simple weighting mechanism to form a joint embedding. In principle, MCLEA can support more modalities, e.g., numerical values chen2020mmea, which we leave for future work.

3.1.1 Neighborhood Structure Embedding

The graph attention network (GAT) is a typical neural network that directly deals with structured data velivckovic2018graph. Hence, we leverage GAT to model the structural information of $G_{1}$ and $G_{2}$, shown as “Structure Encoder” in Figure 1. Specifically, the hidden state $h_{i} \in R^{d}$ ($d$ is the hidden size) of entity $e_{i}$ is obtained by aggregating its one-hop neighbors $N_{i}$ (including a self-loop):

$$h_{i} = σ \Big( \sum_{j \in N_{i}} α_{i j} h_{j} \Big), \tag{1}$$

where $h_{j}$ is the hidden state of entity $e_{j}$ , $σ (\cdot)$ denotes the ReLU nonlinearity, and $α_{i j}$ denotes the importance of entity $e_{j}$ to entity $e_{i}$ , which is calculated with the self-attention:

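$$α_{i j} = \frac{\exp (η (a^{T} [W h_{i} \oplus W h_{j}]))}{\sum_{u \in N_{i}} \exp (η (a^{T} [W h_{i} \oplus W h_{u}]))}, \tag{2}$$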

where $W \in R^{d \times d}$ denotes the weight matrix, $a \in R^{2 d}$ is a learnable parameter, $\oplus$ is the concatenation operation and $η$ is the LeakyReLU nonlinearity. Motivated by li2019semi, we restrict $W$ to be a diagonal matrix to reduce computations, thus increasing the scalability of the model. To stabilize the learning process of self-attention velivckovic2018graph, we perform $K$ ( $K = 2$ ) heads of independent attention of Eq. (1) in parallel, and concatenate these features to obtain the structure embedding of entity $e_{i}$ :

$$h_{i}^{g} = \bigoplus_{k = 1}^{K} σ \Big( \sum_{j \in N_{i}} α_{i j}^{k} h_{j} \Big), \tag{3}$$

where $α_{i j}^{k}$ is the normalized attention coefficient computed by the $k$ -th attention. In practice, we apply a two-layer GAT model to aggregate the neighborhood information within multiple hops, and use the output of the last GAT layer as the neighborhood structure embedding.
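As a rough illustration of the attention-based aggregation described above, the following NumPy sketch implements one attention head with a diagonal $W$ and concatenates $K = 2$ heads. It is not the authors' implementation: the function names, the LeakyReLU slope of 0.2, the toy graph, and the random parameters are all assumptions made for the example.

```python
import numpy as np

def leaky_relu(x, slope=0.2):  # the slope is an assumed value
    return np.where(x > 0, x, slope * x)

def gat_head(h, adj, w_diag, a):
    """One attention head over one-hop neighbors with self-loops.

    h: (n, d) hidden states, adj: (n, n) 0/1 adjacency,
    w_diag: (d,) diagonal of W, a: (2d,) attention vector.
    """
    n, d = h.shape
    mask = (adj + np.eye(n)) > 0               # N_i including the self-loop
    wh = h * w_diag                            # W h for a diagonal W
    # a^T [W h_i (+) W h_j] decomposes into a term for i plus a term for j
    scores = leaky_relu((wh @ a[:d])[:, None] + (wh @ a[d:])[None, :])
    scores = np.where(mask, scores, -np.inf)   # non-neighbors get zero weight
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return np.maximum(alpha @ h, 0.0), alpha   # ReLU(sum_j alpha_ij h_j)

def gat_layer(h, adj, heads):
    """Concatenate the outputs of K independent heads."""
    return np.concatenate([gat_head(h, adj, w, a)[0] for w, a in heads], axis=1)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))                    # 4 entities, d = 3
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
heads = [(rng.normal(size=3), rng.normal(size=6)) for _ in range(2)]
h_g = gat_layer(h, adj, heads)
_, alpha = gat_head(h, adj, *heads[0])
print(h_g.shape)                               # (4, 6)
```

Because $W$ is diagonal, the projection reduces to an element-wise product, which is where the reduction in computation comes from. Stacking two such layers would propagate information across two hops.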

3.1.2 Relation, Attribute, and Name Embeddings

Because the vanilla GAT operates on unlabeled graphs, it is unable to properly model relational information in multi-relational KGs. To alleviate this issue, we follow yang2019aligning and regard the relations of entity $e_{i}$ as a bag-of-words feature, which is fed into a simple feed-forward layer to obtain the relation embedding $h_{i}^{r}$. For the simplicity and consistency of MCLEA, we adopt the same approach for the attribute embedding $h_{i}^{a}$ and the name embedding $h_{i}^{n}$ of entity $e_{i}$. Therefore, these embeddings are calculated as:

$$h_{i}^{l} = W_{l} u_{i}^{l} + b_{l}, \quad l \in \{r, a, n\}, \tag{4}$$

where $W_{l}$ and $b_{l}$ are learnable parameters, $u_{i}^{r}$ is the bag-of-words relation feature, $u_{i}^{a}$ is the bag-of-words attribute feature, and $u_{i}^{n}$ is the name feature obtained by averaging the pre-trained GloVe pennington2014glove vectors of name strings. To avoid the out-of-vocabulary challenges brought by the extensive proper nouns (e.g., person names) and the limited vocabulary of word vectors, we further incorporate the character bigrams mao2021alignment of entity names as auxiliary features for name embedding.
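The bag-of-words features and the feed-forward layer of Eq. (4) can be sketched as follows. This is an illustrative example under stated assumptions, not the released code: the helper names, the tiny relation vocabulary, the use of raw counts (rather than binary indicators), and the toy embedding size are all hypothetical.

```python
import numpy as np

def bag_of_words(items, vocab):
    """Count vector over a fixed vocabulary; unknown items are ignored."""
    u = np.zeros(len(vocab))
    for item in items:
        if item in vocab:
            u[vocab[item]] += 1.0
    return u

def char_bigrams(name):
    """Character bigrams of a name, usable as auxiliary bag-of-words items."""
    return [name[i:i + 2] for i in range(len(name) - 1)]

def linear_embed(u, W, b):
    """Eq. (4): h = W u + b."""
    return W @ u + b

rel_vocab = {"birthPlace": 0, "spouse": 1, "team": 2}   # hypothetical relations
u_r = bag_of_words(["spouse", "birthPlace", "spouse"], rel_vocab)

rng = np.random.default_rng(0)
W_r, b_r = rng.normal(size=(4, 3)), np.zeros(4)         # toy embedding size 4
h_r = linear_embed(u_r, W_r, b_r)

print(u_r)                       # [1. 2. 0.]
print(char_bigrams("Nanjing"))   # ['Na', 'an', 'nj', 'ji', 'in', 'ng']
```

Character bigrams remain available for any name string, which is why they help with proper nouns that a fixed word-vector vocabulary does not cover.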

3.1.3 Visual Embedding

We adopt a pre-trained visual model (PVM), e.g., ResNet-152 he2016deep, to learn the visual embedding, shown as “Visual Encoder” in Figure 1. We feed the image $v_{i}$ of entity $e_{i}$ into the pre-trained visual model and use the output of the final layer before the logits as the image feature. The feature is sent through a feed-forward layer to get the visual embedding:

$$h_{i}^{v} = W_{v} \cdot \mathrm{PVM} (v_{i}) + b_{v}. \tag{5}$$

3.1.4 Joint Embedding

Next, we implement a simple weighted concatenation by integrating the multi-modal features into a single compact representation $\hat{h}_{i}$ for entity $e_{i}$:

$$\hat{h}_{i} = \bigoplus_{m \in M} \Big[ \frac{\exp (w_{m})}{\sum_{j \in M} \exp (w_{j})} h_{i}^{m} \Big], \tag{6}$$

where $M = \{g, r, a, n, v\}$ and $w_{m}$ is a trainable attention weight for modality $m$. L2-normalization is performed on the input embeddings before the weighted concatenation.
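A minimal NumPy sketch of the weighted concatenation follows. The function names, the toy dimensions, and the zero-initialized weights are assumptions for illustration; in the model the weights $w_{m}$ are trained.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def joint_embedding(embs, w):
    """embs: modality -> (n, d_m) array; w: modality -> scalar weight w_m."""
    keys = list(embs)
    logits = np.array([w[m] for m in keys])
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()          # softmax over modalities
    parts = [wt * l2_normalize(embs[m]) for m, wt in zip(keys, weights)]
    return np.concatenate(parts, axis=-1)      # concatenation, not a sum

rng = np.random.default_rng(0)
dims = [("g", 6), ("r", 4), ("a", 4), ("n", 4), ("v", 4)]   # toy sizes
embs = {m: rng.normal(size=(2, d)) for m, d in dims}
w = {m: 0.0 for m in embs}                     # equal weights before training
joint = joint_embedding(embs, w)
print(joint.shape)                             # (2, 22)
```

Since each block is unit-normalized before scaling, the inner product of two joint embeddings is a weighted sum of per-modality cosine similarities, with weights given by the squared softmax coefficients.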

The current joint embeddings are coarse-grained and there are no interactions between modalities. Therefore, two training strategies are designed for learning the dynamics within (intra-) and between (inter-) modalities.

3.2 Contrastive Representation Learning

As the core of MCLEA, we propose two novel losses on the uni-modal and joint representations to sufficiently capture the dynamics within and between modalities while preserving semantic proximity and minimizing the modality gap.

3.2.1 Intra-modal Contrastive Loss (ICL)

Inspired by recent work on contrastive learning (CL) chen2020simple; khosla2020supervised, we devise an intra-modal contrastive loss (ICL) that enforces the input embedding to respect the similarity of entities in the original embedding space. Meanwhile, ICL allows MCLEA to distinguish the embeddings of the same entities in different KGs from those of other entities for each modality.

The seed set $S$ can be naturally regarded as positive samples, whereas any non-aligned pairs can be regarded as negative samples due to the convention of the 1-to-1 alignment constraint sun2018bootstrapping. Formally, for the $i$-th entity $e_{1}^{i} \in E_{1}$ of minibatch $B$, the positive set is defined as $P_{i} = \{e_{2}^{i} \mid e_{2}^{i} \in E_{2}\}$, where $(e_{1}^{i}, e_{2}^{i})$ is an aligned pair. The negative set includes two types, inner-graph unaligned pairs from the source KG $G_{1}$ and cross-graph unaligned pairs from the target KG $G_{2}$, defined as $N_{1}^{i} = \{e_{1}^{j} \mid \forall e_{1}^{j} \in E_{1}, j \neq i\}$ and $N_{2}^{i} = \{e_{2}^{j} \mid \forall e_{2}^{j} \in E_{2}, j \neq i\}$, respectively. Both $N_{1}^{i}$ and $N_{2}^{i}$ come from minibatch $B$. These two types of negative samples are designed to constrain the joint embedding space, in which semantically similar entities from the same KG stay close by and the aligned entities from two KGs map to proximate points. Overall, we define the alignment probability distribution $q_{m} (e_{1}^{i}, e_{2}^{i})$ of modality $m$ for each positive pair $(e_{1}^{i}, e_{2}^{i})$ as:

$$q_{m} (e_{1}^{i}, e_{2}^{i}) = \frac{δ_{m} (e_{1}^{i}, e_{2}^{i})}{δ_{m} (e_{1}^{i}, e_{2}^{i}) + \sum_{e_{1}^{j} \in N_{1}^{i}} δ_{m} (e_{1}^{i}, e_{1}^{j}) + \sum_{e_{2}^{j} \in N_{2}^{i}} δ_{m} (e_{1}^{i}, e_{2}^{j})}, \tag{7}$$

where $δ_{m} (u, v) = \exp (f_{m} (u)^{T} f_{m} (v) / τ_{1})$, $f_{m} (\cdot)$ is the encoder of modality $m$, and $τ_{1}$ is a temperature parameter. In particular, L2-normalization is performed on the input feature embeddings before computing the inner product. Notably, the distribution of Eq. (7) is directional and asymmetric for each input; the distribution for the other direction, $q_{m} (e_{2}^{i}, e_{1}^{i})$, is defined analogously. The ICL can be formulated as:

$$L_{m}^{ICL} = - E_{i \in B} \log \Big[ \frac{1}{2} \big( q_{m} (e_{1}^{i}, e_{2}^{i}) + q_{m} (e_{2}^{i}, e_{1}^{i}) \big) \Big]. \tag{8}$$

We apply ICL on each modality separately and also on the combined multi-modal representation as specified in Eq. (6). ICL is performed in contrastive supervised learning to learn intra-modal dynamics for more discriminative boundaries for each modality in the embedding space.
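The following NumPy sketch computes the ICL of Eqs. (7) and (8) for one modality on a minibatch of aligned pairs, where row $i$ of `z1` and row $i$ of `z2` are assumed to be the encoder outputs of an aligned pair. Function names and the toy inputs are hypothetical, and no gradients are computed here; a real implementation would use an autodiff framework.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def alignment_prob(z_src, z_tgt, tau):
    """q(e_src^i, e_tgt^i) of Eq. (7) for every i in the minibatch."""
    cross = np.exp(z_src @ z_tgt.T / tau)   # delta(e_src^i, e_tgt^j), all j
    inner = np.exp(z_src @ z_src.T / tau)   # delta(e_src^i, e_src^j)
    np.fill_diagonal(inner, 0.0)            # inner-graph negatives exclude j = i
    pos = np.diag(cross)                    # the aligned pair (j = i)
    # cross.sum already contains the positive term plus cross-graph negatives
    return pos / (cross.sum(axis=1) + inner.sum(axis=1))

def icl_loss(z1, z2, tau=0.1):
    """Eq. (8): average the two directions inside the log, then the batch."""
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    q12 = alignment_prob(z1, z2, tau)
    q21 = alignment_prob(z2, z1, tau)
    return -np.mean(np.log(0.5 * (q12 + q21)))

z = np.eye(4)                                    # four well-separated entities
loss_aligned = icl_loss(z, z)                    # counterparts coincide
loss_shuffled = icl_loss(z, np.roll(z, 1, axis=0))   # counterparts mismatched
print(loss_aligned < loss_shuffled)              # True
```

In the toy example the loss is close to zero when each entity coincides with its counterpart and is far from all negatives, and it grows to roughly $1 / τ_{1}$ when the counterparts are mismatched.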

3.2.2 Inter-modal Alignment Loss (IAL)

Since the embeddings of different modalities are separately trained with ICL, their representations are not aligned, and it is difficult to model the complex interaction between modalities solely with the fusion module. To alleviate this, we further propose an inter-modal alignment loss (IAL), which aims to reduce the gap between the output distributions of different modalities, so that MCLEA can model inter-modal interactions and obtain more meaningful representations.

We consider the joint embedding as the comprehensive representation due to its fusion of multi-modal features; therefore, we attempt to transfer the knowledge from the joint embedding back to uni-modal embedding so that the uni-modal embedding could better utilize the complementary information from others. Concretely, we minimize the bidirectional KL divergence over the output distribution between joint embedding and uni-modal embedding:

$$L_{m}^{IAL} = E_{i \in B} \frac{1}{2} \Big[ KL \big( q_{o}^{'} (e_{1}^{i}, e_{2}^{i}) \,\|\, q_{m}^{'} (e_{1}^{i}, e_{2}^{i}) \big) + KL \big( q_{o}^{'} (e_{2}^{i}, e_{1}^{i}) \,\|\, q_{m}^{'} (e_{2}^{i}, e_{1}^{i}) \big) \Big], \tag{9}$$

where $q_{o}^{'} (e_{1}^{i}, e_{2}^{i})$, $q_{o}^{'} (e_{2}^{i}, e_{1}^{i})$ and $q_{m}^{'} (e_{1}^{i}, e_{2}^{i})$, $q_{m}^{'} (e_{2}^{i}, e_{1}^{i})$ represent the output predictions in the two directions for the joint embedding and the uni-modal embedding of modality $m$, respectively. They are calculated similarly to Eq. (7) but with a temperature parameter $τ_{2}$. We only back-propagate through $q_{m}^{'} (e_{1}^{i}, e_{2}^{i})$ and $q_{m}^{'} (e_{2}^{i}, e_{1}^{i})$ in Eq. (9), as in knowledge distillation hinton2015distilling.

IAL aims at learning interactions between different modalities within each entity; it pulls the output distributions of different modalities together and thus reduces the modality gap. Some approaches zhang2019multi; chen2020mmea attempt to learn a common space by imposing alignment constraints on the features of different modalities, but they introduce noise due to semantic heterogeneity. Different from these approaches, we distill the useful knowledge from the output prediction of the multi-modal representation to the uni-modal representation, while largely preserving the modality-specific features of each modality.
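The bidirectional KL term of Eq. (9) can be sketched in NumPy as below. One interpretive assumption is made and should be read as such: the "output distribution" of a source entity is taken to be the softmax over all in-batch candidates that appear in the denominator of Eq. (7) (all target entities, then the other source entities). Function names and toy inputs are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def candidate_distribution(z_src, z_tgt, tau):
    """Row i: softmax over [all targets, other sources] for source entity i."""
    z_src, z_tgt = l2_normalize(z_src), l2_normalize(z_tgt)
    cross = z_src @ z_tgt.T / tau
    inner = z_src @ z_src.T / tau
    np.fill_diagonal(inner, -np.inf)       # an entity is not its own candidate
    logits = np.concatenate([cross, inner], axis=1)
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def kl(p, q, eps=1e-12):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)

def ial_loss(joint1, joint2, uni1, uni2, tau=4.0):
    # With autodiff, the joint ("teacher") distributions would be detached so
    # that gradients flow only through the uni-modal ("student") side.
    fwd = kl(candidate_distribution(joint1, joint2, tau),
             candidate_distribution(uni1, uni2, tau))
    bwd = kl(candidate_distribution(joint2, joint1, tau),
             candidate_distribution(uni2, uni1, tau))
    return np.mean(0.5 * (fwd + bwd))

rng = np.random.default_rng(0)
j1, j2 = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
u1, u2 = rng.normal(size=(6, 5)), rng.normal(size=(6, 5))
loss_same = ial_loss(j1, j2, j1, j2)   # student equals teacher
loss_diff = ial_loss(j1, j2, u1, u2)   # unrelated uni-modal embeddings
print(loss_same < loss_diff)           # True
```

Note that the joint and uni-modal embeddings may have different dimensions; only their similarity distributions over the same candidates are compared, which is what lets each modality keep its own feature space.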

3.3 Optimization Objective

The overall loss of the MCLEA is as follows,

$$L = L_{o}^{ICL} + \sum_{m \in M} α_{m} L_{m}^{ICL} + \sum_{m \in M} β_{m} L_{m}^{IAL}, \tag{10}$$

where $M = \{g, r, a, n, v\}$, $L_{o}^{ICL}$ denotes the ICL operated on the joint embedding, and $α_{m}$ and $β_{m}$ are hyper-parameters that balance the importance of the losses. However, manually tuning these hyper-parameters is expensive and intractable. Instead, we treat the training of MCLEA as a multi-task learning problem and use homoscedastic uncertainty kendall2018multi to weigh each loss automatically during model training. We adjust the relative weight of each task in the loss function by deriving a multi-task loss function based on maximizing the Gaussian likelihood with task-dependent uncertainty. Due to space limits, we only show the derived result and leave the detailed derivation to the Appendix. The loss in Eq. (10) can be rewritten as:

$$L = L_{o}^{ICL} + \sum_{m \in M} \Big( \frac{1}{α_{m}^{2}} L_{m}^{ICL} + \frac{1}{β_{m}^{2}} L_{m}^{IAL} + \log α_{m} + \log β_{m} \Big), \tag{11}$$

where $α_{m}$ and $β_{m}$ are automatically learned during training.
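For concreteness, Eq. (11) can be evaluated as in the sketch below, given per-modality loss values. The function name and all numbers are made up for illustration; in training, $α_{m}$ and $β_{m}$ would be learnable parameters updated by the optimizer rather than fixed floats.

```python
import math

def uncertainty_weighted_loss(joint_icl, icl, ial, alpha, beta):
    """Eq. (11). icl, ial, alpha, beta are dicts keyed by modality."""
    total = joint_icl
    for m in icl:
        total += icl[m] / alpha[m] ** 2 + ial[m] / beta[m] ** 2
        total += math.log(alpha[m]) + math.log(beta[m])   # regularizer
    return total

icl = {"g": 1.0, "v": 2.0}          # made-up per-modality ICL values
ial = {"g": 0.1, "v": 0.2}          # made-up per-modality IAL values
ones = {"g": 1.0, "v": 1.0}

# With alpha = beta = 1 the log terms vanish and the losses are simply summed.
plain = uncertainty_weighted_loss(0.5, icl, ial, ones, ones)

# Raising alpha_v to 2 shrinks the weight of the visual ICL to 1/4 but pays log 2.
down_weighted = uncertainty_weighted_loss(0.5, icl, ial, {"g": 1.0, "v": 2.0}, ones)
```

The logarithmic terms keep the learned scales from growing without bound: without them, the optimizer could drive every weighted loss to zero by inflating $α_{m}$ and $β_{m}$.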

To overcome the lack of training data, we incorporate the bi-directional iterative strategy used in liu2021visual to iteratively add new aligned seeds during training. At inference time, we use the cosine similarity between the joint embeddings of entities to determine the counterparts of entities.
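A small sketch of similarity-based inference is given below, together with one plausible reading of the bi-directional seed selection: a pair is proposed as a new seed only if each entity is the other's nearest neighbor. That mutual-nearest-neighbor criterion is an assumption made here for illustration; the text only refers to liu2021visual for the actual procedure. All function names are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(x, y, eps=1e-12):
    x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)
    y = y / np.maximum(np.linalg.norm(y, axis=1, keepdims=True), eps)
    return x @ y.T

def predict_counterparts(sim):
    """Index of the most similar target entity for each source entity."""
    return sim.argmax(axis=1)

def mutual_nearest_pairs(sim):
    """Pairs (i, j) where j is i's nearest target and i is j's nearest source."""
    fwd = sim.argmax(axis=1)
    bwd = sim.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(fwd) if bwd[j] == i]

sim = np.array([[0.90, 0.80],
                [0.95, 0.10]])          # toy similarity matrix
print(predict_counterparts(sim))        # [0 0]
print(mutual_nearest_pairs(sim))        # [(1, 0)]
```

In the toy matrix both source entities prefer target 0, but target 0 prefers source 1, so only the pair (1, 0) would be proposed as a new seed under this criterion.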

MCLEA can be extended to the unsupervised setting, in which the pseudo-alignment seeds are discovered based on feature similarities of entity names mao2020mraea; ge2021make or entity images liu2021visual, accordingly resulting in different unsupervised versions of MCLEA.

4 Experiments

4.1 Experimental Setup

Datasets. Five EA datasets are adopted for evaluation, including three bilingual datasets, the ZH/JA/FR-EN versions of DBP15K liu2021visual, and two cross-KG datasets, FB15K-DB15K/YAGO15K liu2019mmkg. The detailed dataset statistics are listed in Table 5 in the Appendix. Note that not all entities have corresponding images; for those without images, MCLEA assigns random vectors for the visual modality, following the setting of EVA liu2021visual. For DBP15K, 30% of the reference entity alignments are given as $S$, while for the cross-KG datasets, 20%, 50%, and 80% of the reference entity alignments are given liu2019mmkg.

Baselines. We compare the proposed MCLEA against 19 state-of-the-art EA methods, which can be classified into four categories: 1) structure-based methods that solely rely on structural information for aligning entities, including BootEA sun2018bootstrapping, MUGNN cao2019multi, KECG li2019semi, NAEA zhu2019neighborhood, and AliNet sun2020knowledge; 2) auxiliary-enhanced methods that utilize auxiliary information to improve the performance, including MultiKE zhang2019multi, HMAN yang2019aligning, RDGCN wu2019relation, AttrGNN liu2020exploring, BERT-INT tang2021bert and ERMC yang2021entity; 3) multi-modal methods that combine the multi-modal features to generate entity representations, including PoE liu2019mmkg, MMEA chen2020mmea, HMEA guo2021multi, and EVA liu2021visual; 4) unsupervised methods, including RREA mao2020relational, MRAEA mao2020mraea, EASY ge2021make, and SEU mao2021alignment.

Implementation Details. The hidden size of each layer of GAT is 300, while the embedding size of the other modalities is 100. We use the AdamW optimizer with a learning rate of $5 \times 10^{- 4}$ to update the parameters. The number of training epochs is 1000 with early-stopping and the batch size is 512. The hyper-parameters $τ_{1}, τ_{2}$ are set to 0.1 and 4.0, respectively. To keep in line with previous works, we use the same entity name translations and word vectors provided by xu2019cross for bilingual datasets. As for cross-KG datasets, we do not consider surface forms for fair comparison. For visual embedding, we adopt the preprocessed image features provided by liu2021visual and chen2020mmea for bilingual datasets and cross-KG datasets, where the former uses ResNet-152 as the pre-trained backbone, while the latter uses VGG-16. Previous work has revealed that surface forms are quite helpful for entity alignment liu2020exploring. For fair comparison, we divide the supervised methods on bilingual datasets into two groups based on whether surface forms are used, and we implement an MCLEA variant (MCLEA $†$ ) where the name embedding is removed. For the unsupervised setting, we implement two variants, MCLEA-V and MCLEA-N, which generate pseudo-alignment seeds based on the similarities of images and names, respectively.

Evaluation. We rank matching candidates of each to-be-aligned entity and use the metrics of Hits@1 (H@1), Hits@10 (H@10), and mean reciprocal rank (MRR). In the following tables, the best results are in bold with the second best results underlined, and “Improv. best %” denotes the relative improvement of MCLEA over the best baseline.
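The metrics can be computed from a similarity matrix as in the following sketch. The helper names and the toy similarity matrix are hypothetical; the last line reproduces one "Improv. best %" entry from the H@1 values reported in Table 1 (MCLEA$†$ at .816 versus EVA at .761 on ZH-EN).

```python
import numpy as np

def ranks_of_gold(sim, gold):
    """1-based rank of the gold counterpart among all candidates."""
    gold_scores = sim[np.arange(len(gold)), gold]
    return 1 + np.sum(sim > gold_scores[:, None], axis=1)

def hits_at_k(ranks, k):
    return float(np.mean(ranks <= k))

def mrr(ranks):
    return float(np.mean(1.0 / ranks))

def relative_improvement(ours, best_baseline):
    """Relative gain over the best baseline, in percent."""
    return 100.0 * (ours - best_baseline) / best_baseline

sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],
                [0.5, 0.6, 0.1]])       # toy scores, gold counterpart i for row i
ranks = ranks_of_gold(sim, np.array([0, 1, 2]))
print(ranks)                                       # [1 2 3]
print(hits_at_k(ranks, 1))                         # 1 of 3 ranked first
print(round(relative_improvement(0.816, 0.761), 1))   # 7.2
```

Since every reciprocal rank is at least as large as the Hits@1 indicator for the same entity, MRR can never fall below H@1.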

	Models	DBP15K$_{ZH-EN}$			DBP15K$_{JA-EN}$			DBP15K$_{FR-EN}$
	Models	H@1	H@10	MRR	H@1	H@10	MRR	H@1	H@10	MRR
w/o SF	BootEA sun2018bootstrapping	.629	.847	.703	.622	.854	.701	.653	.874	.731
	KECG li2019semi	.478	.835	.598	.490	.844	.610	.486	.851	.610
	MUGNN cao2019multi	.494	.844	.611	.501	.857	.621	.495	.870	.621
	NAEA zhu2019neighborhood	.650	.867	.720	.641	.873	.718	.673	.894	.752
	AliNet sun2020knowledge	.539	.826	.628	.549	.831	.645	.552	.852	.657
	EVA liu2021visual	.761	.907	.814	.762	.913	.817	.793	.942	.847
	MCLEA $†$ (Ours)	.816	.948	.865	.812	.952	.865	.834	.975	.885
	Improv. best %	7.2	4.5	6.3	6.6	4.3	5.9	5.2	3.5	4.5
w/ SF	MultiKE zhang2019multi	.437	.516	.466	.570	.643	.596	.714	.761	.733
	HMAN yang2019aligning	.562	.851	–	.567	.969	–	.540	.871	–
	RDGCN wu2019relation	.708	.846	–	.767	.895	–	.886	.957	–
	AttrGNN liu2020exploring	.777	.920	.829	.763	.909	.816	.942	.987	.959
	BERT-INT tang2021bert	.968	.990	.977	.964	.991	.975	.992	.998	.995
	ERMC yang2021entity	.903	.946	.899	.942	.944	.925	.962	.982	.973
	MCLEA (Ours)	.972	.996	.981	.986	.999	.991	.997	1.00	.998
	Improv. best %	0.4	0.6	0.4	2.3	0.8	1.6	0.5	0.2	0.3

Table 1: Comparative results of MCLEA without (w/o) and with (w/) surface forms (SF) against strong supervised methods on three bilingual datasets; $†$ denotes an MCLEA variant without name embedding.

4.2 Overall Results

Table 1, Table 2, and Table 3 report the performance of MCLEA against different baselines on different datasets with different settings. Overall, MCLEA and its variants achieve the best results on nearly all datasets and metrics.

Table 1 reports the performance of MCLEA against the supervised baselines on bilingual datasets in the settings with and without surface forms. In the first group, which does not use surface forms, MCLEA$†$ brings about 5.2% to 7.2% relative improvement in H@1 over the best baseline EVA. The superiority of MCLEA confirms that the proposed contrastive representation learning substantially improves the performance. In the second group, which uses surface forms, there are two notable observations. On the one hand, MCLEA shows a clear improvement when combined with name embedding, suggesting that entity names provide useful clues for entity alignment, as revealed in previous work zhang2019multi; liu2020exploring; ge2021make. On the other hand, MCLEA still shows slightly better performance than the best baseline BERT-INT, with 0.4% to 2.3% relative improvement in H@1, while using far fewer parameters (13M vs. 110M). This also reveals that MCLEA can effectively model robust entity representations without relying on over-parameterized encoders. Notably, BERT-INT relies heavily on entity descriptions to fine-tune BERT, but entity descriptions may not be available for every entity, and collecting them is labor-intensive, which limits the scope of its application.

Table 2 shows the comparison of multi-modal methods on two cross-KG datasets, which provides direct evidence of the effectiveness of MCLEA. When 20% of the training seeds are given, MCLEA outperforms the best baseline MMEA on FB15K-DB15K by 67.9% in H@1, 30.3% in H@10, and 49.6% in MRR (relative improvement). The performance gains are still significant when 50% and 80% of the alignment seeds are given. It is worth noting that the performance gains are the highest in the 20% setting and that MCLEA (20%) obtains comparable results to EVA (80%), indicating that MCLEA can better utilize a small number of alignment seeds to obtain effective representations. We also find that MMEA greatly outperforms EVA. We speculate that this is because the cross-KG datasets are quite heterogeneous (w.r.t. the number of relations) compared to the bilingual datasets, as shown in Table 5 in the Appendix: the structure encoder of EVA struggles to model heterogeneous information, and EVA cannot utilize the numerical knowledge in cross-KG datasets, which is well exploited in MMEA.

	Models	FB15K-DB15K			FB15K-YAGO15K
	Models	H@1	H@10	MRR	H@1	H@10	MRR
20%	PoE	.126	.251	.170	.113	.229	.154
	HMEA	.127	.369	–	.105	.313	–
	MMEA	.265	.541	.357	.234	.480	.317
	EVA $*$	.134	.338	.201	.098	.276	.158
	MCLEA (Ours)	.445	.705	.534	.388	.641	.474
	Improv. best %	67.9	30.3	49.6	65.8	33.5	49.5
50%	PoE	.464	.658	.533	.347	.536	.414
	HMEA	.262	.581	–	.265	.581	–
	MMEA	.417	.703	.512	.403	.645	.486
	EVA $*$	.223	.471	.307	.240	.477	.321
	MCLEA (Ours)	.573	.800	.652	.543	.759	.616
	Improv. best %	23.5	13.8	22.3	34.7	17.7	26.7
80%	PoE	.666	.820	.721	.573	.746	.635
	HMEA	.417	.786	–	.433	.801	–
	MMEA	.590	.869	.685	.598	.839	.682
	EVA $*$	.370	.585	.444	.394	.613	.471
	MCLEA (Ours)	.730	.883	.784	.653	.835	.715
	Improv. best %	9.6	1.6	8.7	9.2	-0.4	4.8

Table 2: Experimental results on two cross-KG datasets, where X% represents the percentage of reference entity alignments used for training. The symbol $*$ denotes reproduced results.

When compared to the unsupervised methods in Table 3, both MCLEA variants perform slightly better than the best baseline with performance gain in H@1 varying from 0.7% to 6.7%. Note that using image (-V) or name (-N) similarities to produce seeds leads to almost identical results, demonstrating the effectiveness of such simple rules to enable MCLEA in the unsupervised setting.

Models	DBP15K$_{ZH-EN}$			DBP15K$_{JA-EN}$			DBP15K$_{FR-EN}$
Models	H@1	H@10	MRR	H@1	H@10	MRR	H@1	H@10	MRR
MRAEA mao2020mraea	.778	.832	–	.889	.927	–	.950	.970	–
RREA mao2020relational	.822	.964	–	.918	.978	–	.963	.992	–
EASY ge2021make	.898	.979	.930	.943	.990	.960	.980	.998	.990
SEU mao2021alignment	.900	.965	.924	.956	.991	.969	.988	.999	.992
MCLEA-V (Ours)	.959	.995	.974	.977	.999	.987	.990	1.00	.994
MCLEA-N (Ours)	.960	.994	.974	.983	.999	.990	.995	1.00	.997
Improv. best %	6.7	1.6	4.7	2.8	0.8	2.2	0.7	0.1	0.5

Table 3: Unsupervised experimental results on three bilingual datasets, where -V and -N denote different methods to generate pseudo-alignment seeds.

4.3 Model Analysis

Ablation study. The ablation experiments are performed on two bilingual datasets and the results are presented in Table 4. We first examine the individual contribution of different modalities. Removing different modalities causes varying degrees of performance drop, and entity names are the most important, with the most significant drop, which is in line with previous findings mao2020mraea; ge2021make. The structural information is also consistently effective across datasets, while the other modalities make a slight contribution to MCLEA. In particular, visual information can play a more pronounced role in the absence of surface forms chen2020mmea; liu2021visual. Furthermore, we inspect various training strategies in MCLEA. Removing ICL from MCLEA dramatically degrades the performance, which indicates the importance of ICL in learning the intra-modal proximity. The IAL learns the interdependence between different modalities and is also beneficial to our model. Training MCLEA without the iterative strategy, or replacing the uncertainty mechanism with uniform weights (i.e., w/o uncertainty), also decreases the performance. Overall, the ablation experiments provide empirical support for including these modalities and training strategies.

	Models	DBP15K$_{ZH-EN}$			DBP15K$_{JA-EN}$
		H@1	H@10	MRR	H@1	H@10	MRR
	MCLEA	.972	.996	.981	.986	.999	.991
Modalities	w/o structure	.883	.956	.909	.947	.980	.959
	w/o relation	.967	.995	.978	.985	.999	.991
	w/o attribute	.961	.994	.974	.983	.999	.991
	w/o name	.816	.948	.865	.812	.952	.865
	w/o visual	.968	.994	.978	.985	.999	.991
Training	w/o ICL	.782	.892	.818	.813	.909	.844
	w/o IAL	.966	.995	.977	.980	.998	.987
	w/o iter. strategy	.942	.991	.960	.964	.995	.976
	w/o uncertainty	.969	.996	.980	.984	.999	.990

Table 4: Ablation study on two bilingual datasets.

Impact of hyper-parameters $τ_{1}, τ_{2}$. We investigate the effects of hyper-parameters $τ_{1}, τ_{2}$ on DBP15K$_{ZH-EN}$. As shown in Figure 2, the value of $τ_{1}$ has a drastic effect on MCLEA, especially in terms of H@1 and MRR. This is because $τ_{1}$ controls the strength of penalties on hard negative samples, and an appropriate $τ_{1}$ is conducive to learning discriminative entity embeddings. In contrast, the performance varies less with $τ_{2}$ and saturates at $τ_{2} = 4.0$. The KL divergence establishes the associations between different modalities, within which $τ_{2}$ regulates the softness of the alignment distribution produced by the input embedding and transfers the generalization capability of the joint embedding to the uni-modal embeddings.
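The role of a temperature can be illustrated with a temperature-scaled softmax. The sketch below is not the MCLEA implementation: the similarity scores and the small temperature 0.1 are made-up illustrative values, and `softmax_with_temperature` is a hypothetical helper. It only shows that a small temperature concentrates probability mass on the top-scoring candidates, so a hard negative is penalized strongly, while a large temperature such as 4.0 yields a soft, near-uniform distribution.

```python
import math


def softmax_with_temperature(scores, tau):
    """Softmax over `scores` after dividing each score by the temperature `tau`."""
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical similarities: positive pair, hard negative, easy negative.
sims = [0.9, 0.8, 0.1]
sharp = softmax_with_temperature(sims, tau=0.1)  # peaked distribution
soft = softmax_with_temperature(sims, tau=4.0)   # softened distribution
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```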

Figure 2: Performance comparison with different values of $τ_{1}, τ_{2}$.

Similarity Distribution of Representations. To investigate the effectiveness of the entity representations, we run MCLEA with and without ICL/IAL on DBP15K$_{ZH-EN}$ and produce the visualization in Figure 3 by averaging, for each modality, the similarity distribution between the test entities and their predicted counterparts. It can be observed that every modality, especially structure and name, exhibits a high top-1 similarity and a large similarity variance. More importantly, as expected, contrastive learning (ICL and IAL) yields more discriminative entity representations in the joint embedding.

Figure 3: Similarity visualization of representations of test entities and their top-10 predicted counterparts. The vertical axis represents different modalities with ($+$) and without ($-$) ICL/IAL and the horizontal axis represents the index of ranked predictions.
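One way to produce such an averaged similarity profile is sketched below. This is an assumption about the procedure, not the authors' script: it uses cosine similarity, toy 2-d embeddings invented for illustration, and a hypothetical helper `mean_topk_similarity`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_topk_similarity(sources, targets, k):
    """Average, over source entities, of the k highest similarities to the targets.

    Requires k <= len(targets). Position r of the result is the mean similarity
    between a source entity and its rank-(r+1) predicted counterpart.
    """
    totals = [0.0] * k
    for s in sources:
        ranked = sorted((cosine(s, t) for t in targets), reverse=True)[:k]
        for rank, sim in enumerate(ranked):
            totals[rank] += sim
    return [t / len(sources) for t in totals]


# Toy embeddings (made up for illustration).
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]
profile = mean_topk_similarity(src, tgt, k=3)
print([round(x, 3) for x in profile])
```

A discriminative representation shows a high first entry followed by a steep decline along the ranks.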

5 Conclusion

This paper presented MCLEA, a novel method for multi-modal entity alignment. MCLEA utilizes multi-modal information to obtain joint entity representations and is trained with two losses, an intra-modal contrastive loss and an inter-modal alignment loss, which explore the intra-modal relationships and cross-modal interactions, respectively. We experimentally validated the state-of-the-art performance of MCLEA on several public datasets and its capability of learning a more discriminative embedding space for entity alignment. For future work, we plan to explore more side information, such as entity descriptions, to boost alignment performance.

References

Appendix

Appendix A Derivation for Adaptively Weighted Multi-task Loss

Dataset	KG	#Ent.	#Rel.	#Attr.	#Rel tr.	#Attr tr.	#Image	#Ref.
DBP15K$_{ZH-EN}$ liu2021visual	ZH	19,388	1,701	8,111	70,414	248,035	15,912	15,000
DBP15K$_{ZH-EN}$ liu2021visual	EN	19,572	1,323	7,173	95,142	343,218	14,125	15,000
DBP15K$_{JA-EN}$ liu2021visual	JA	19,814	1,299	5,882	77,214	248,991	12,739	15,000
DBP15K$_{JA-EN}$ liu2021visual	EN	19,780	1,153	6,066	93,484	320,616	13,741	15,000
DBP15K$_{FR-EN}$ liu2021visual	FR	19,661	903	4,547	105,998	273,825	14,174	15,000
DBP15K$_{FR-EN}$ liu2021visual	EN	19,993	1,208	6,422	115,722	351,094	13,858	15,000
FB15K-DB15K liu2019mmkg	FB15K	14,951	1,345	116	592,213	29,395	13,444	12,846
FB15K-DB15K liu2019mmkg	DB15K	12,842	279	225	89,197	48,080	12,837	12,846
FB15K-YAGO15K liu2019mmkg	FB15K	14,951	1,345	116	592,213	29,395	13,444	11,199
FB15K-YAGO15K liu2019mmkg	YAGO15K	15,404	32	7	122,886	23,532	11,194	11,199

Table 5: Dataset Statistics.

In this section, we treat Eq. (10) as a multi-task loss function and combine multiple objectives using homoscedastic uncertainty kendall2018multi, allowing us to automatically learn the relative weights of each loss.

First, the ICL can be regarded as a classification loss with negative log-likelihood, i.e., predicting whether two entities are equivalent. We rewrite the loss function of ICL as follows (for simplicity, we omit the modality index and the inner-graph negative samples, and only consider the unidirectional version):

$$
L^{ICL} = -\mathbb{E}_{i \in B} \log q(e_{1}^{i}, e_{2}^{i}) = -\mathbb{E}_{i \in B} \log P(c = 1 \mid f^{W}(e_{1}^{i}, e_{2}^{i})), \tag{12}
$$

where $c = 1$ means that the two input entities are equivalent, otherwise $c = 0$; $f^{W}(\cdot, \cdot)$ is the model output with parameter $W$. Following kendall2018multi, we adapt the negative log-likelihood to squash a scaled version of the model output with an uncertainty scalar $σ$ through a softmax function:

$$
\begin{aligned}
-\log P(c = 1 \mid f^{W}(e_{1}^{i}, e_{2}^{i}), σ) &= -\log \mathrm{Softmax}\left(\frac{1}{σ^{2}} f^{W}(e_{1}^{i}, e_{2}^{i})\right) \\
&= -\frac{1}{σ^{2}} f^{W}(e_{1}^{i}, e_{2}^{i}) + \log \sum_{j \neq i} \exp\left(\frac{1}{σ^{2}} f^{W}(e_{1}^{i}, e_{2}^{j})\right),
\end{aligned} \tag{13}
$$

where $e_{2}^{j}$ with $j \neq i$ denotes the cross-graph negative samples defined in the main paper.
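The right-hand side of Eq. (13) can be evaluated directly for a single anchor entity. The sketch below is a plain-Python illustration under the same simplifications as the text (one direction, cross-graph negatives only); `scaled_nll` and its arguments are hypothetical names, and the scores are arbitrary numbers rather than model outputs.

```python
import math


def scaled_nll(pos_score, neg_scores, sigma):
    """Right-hand side of Eq. (13) for one anchor entity.

    Computes -f_pos / sigma^2 + log(sum_j exp(f_neg_j / sigma^2)).
    As in Eq. (13), the sum runs over the negatives only, so the value
    may be negative.
    """
    scale = 1.0 / sigma ** 2
    scaled_negs = [scale * s for s in neg_scores]
    m = max(scaled_negs)  # log-sum-exp with the maximum factored out
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled_negs))
    return -scale * pos_score + log_sum_exp


# A larger sigma shrinks the scores before the softmax and flattens the loss.
print(scaled_nll(2.0, [0.0, 0.0], sigma=1.0))
print(scaled_nll(2.0, [0.0, 0.0], sigma=2.0))
```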

Applying the same assumption as in kendall2018multi:

$$
\frac{1}{σ} \sum_{j \neq i} \exp\left(\frac{1}{σ^{2}} f^{W}(e_{1}^{i}, e_{2}^{j})\right) \approx \left(\sum_{j \neq i} \exp\left(f^{W}(e_{1}^{i}, e_{2}^{j})\right)\right)^{\frac{1}{σ^{2}}}, \tag{14}
$$

we can simplify Eq. (13) to:

$$
\begin{aligned}
-\log P(c = 1 \mid f^{W}(e_{1}^{i}, e_{2}^{i}), σ) &\approx -\frac{1}{σ^{2}} f^{W}(e_{1}^{i}, e_{2}^{i}) + \frac{1}{σ^{2}} \log \sum_{j \neq i} \exp\left(f^{W}(e_{1}^{i}, e_{2}^{j})\right) + \log(σ) \\
&= -\frac{1}{σ^{2}} \log P(c = 1 \mid f^{W}(e_{1}^{i}, e_{2}^{i})) + \log(σ),
\end{aligned} \tag{15}
$$

where $σ$ determines the relative weight $1/σ^{2}$ of the loss and is learned automatically with stochastic gradient descent.
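The step from Eq. (14) to Eq. (15) can be made explicit by taking the logarithm of both sides of Eq. (14) and moving $\log(σ)$ to the right-hand side. The following is a worked intermediate step under the same approximation, not an additional result:

```latex
\log \sum_{j \neq i} \exp\left(\frac{1}{σ^{2}} f^{W}(e_{1}^{i}, e_{2}^{j})\right)
\approx \frac{1}{σ^{2}} \log \sum_{j \neq i} \exp\left(f^{W}(e_{1}^{i}, e_{2}^{j})\right) + \log(σ)
```

Substituting this expression for the log-sum term of Eq. (13) gives the first line of Eq. (15). For $σ = 1$ the approximation is exact and the $\log(σ)$ term vanishes.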

On the other hand, the IAL defines the KL divergence over the output distribution between joint embedding and uni-modal embedding (we omit the modality index and only consider the unidirectional version for simplicity):

$$
\begin{aligned}
L^{IAL} &= \mathbb{E}_{i \in B} \, KL\left(q_{o}^{'}(e_{1}^{i}, e_{2}^{i}) \,\|\, q^{'}(e_{1}^{i}, e_{2}^{i})\right) \\
&= \mathbb{E}_{i \in B} \, q_{o}^{'}(e_{1}^{i}, e_{2}^{i}) \log \frac{q_{o}^{'}(e_{1}^{i}, e_{2}^{i})}{q^{'}(e_{1}^{i}, e_{2}^{i})} \\
&= \mathbb{E}_{i \in B} \left[ q_{o}^{'}(e_{1}^{i}, e_{2}^{i}) \log q_{o}^{'}(e_{1}^{i}, e_{2}^{i}) - q_{o}^{'}(e_{1}^{i}, e_{2}^{i}) \log q^{'}(e_{1}^{i}, e_{2}^{i}) \right],
\end{aligned} \tag{16}
$$

where $q_{o}^{'}(e_{1}^{i}, e_{2}^{i})$ and $q^{'}(e_{1}^{i}, e_{2}^{i})$ represent the output predictions of the joint embedding and the uni-modal embedding, respectively. Since we only back-propagate through $q^{'}(e_{1}^{i}, e_{2}^{i})$ in Eq. (9), the first term in Eq. (16) is a constant with respect to the gradient, and minimizing $L^{IAL}$ is equivalent to minimizing the cross-entropy loss over the two distributions:

$$
L^{IAL} = -\mathbb{E}_{i \in B} \, q_{o}^{'}(e_{1}^{i}, e_{2}^{i}) \log q^{'}(e_{1}^{i}, e_{2}^{i}). \tag{17}
$$
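The equivalence rests on the identity KL(p || q) = CE(p, q) - H(p): when the target p is held fixed, the KL divergence and the cross-entropy differ only by the constant entropy H(p). The sketch below checks this numerically on made-up distributions; `joint`, `uni_a` and `uni_b` are hypothetical stand-ins for $q_{o}^{'}$ and two candidate $q^{'}$, not outputs of MCLEA.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy -sum_i p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


joint = [0.7, 0.2, 0.1]   # fixed target distribution (no gradient flows here)
uni_a = [0.5, 0.3, 0.2]   # two different uni-modal predictions
uni_b = [0.6, 0.3, 0.1]

# The gap between KL and cross-entropy does not depend on the prediction.
gap_a = kl_divergence(joint, uni_a) - cross_entropy(joint, uni_a)
gap_b = kl_divergence(joint, uni_b) - cross_entropy(joint, uni_b)
print(gap_a, gap_b, -entropy(joint))
```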

Therefore, similar to ICL, we can automatically learn the relative weight of IAL for each modality through task-dependent uncertainty. As mentioned above, the total loss in Eq. (10) can be rewritten as:

$$
L = L_{o}^{ICL} + \sum_{m \in M} \left(\frac{1}{α_{m}^{2}} L_{m}^{ICL} + \log α_{m}\right) + \sum_{m \in M} \left(\frac{1}{β_{m}^{2}} L_{m}^{IAL} + \log β_{m}\right), \tag{18}
$$

where $α_{m}, β_{m}$ are learnable parameters. Large $α_{m}$ ($β_{m}$) will decrease the contribution of $L_{m}^{ICL}$ ($L_{m}^{IAL}$) for the $m$-th modality, whereas small $α_{m}$ ($β_{m}$) will increase its contribution.
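Eq. (18) can be sketched as a small function. This is an illustration only: the weights are passed in as plain positive floats rather than learned parameters, the loss values are invented, and `total_loss` with its dictionary arguments is a hypothetical interface, not the released implementation.

```python
import math


def total_loss(joint_icl, icl, ial, alpha, beta):
    """Eq. (18): joint ICL plus uncertainty-weighted per-modality ICL and IAL.

    `icl`, `ial`, `alpha` and `beta` map a modality name to a float;
    every alpha and beta must be strictly positive.
    """
    loss = joint_icl
    for m in icl:
        loss += icl[m] / alpha[m] ** 2 + math.log(alpha[m])
        loss += ial[m] / beta[m] ** 2 + math.log(beta[m])
    return loss


# Made-up loss values for two modalities.
icl = {"name": 0.5, "visual": 2.0}
ial = {"name": 0.1, "visual": 0.4}
ones = {"name": 1.0, "visual": 1.0}

# With all weights equal to 1, Eq. (18) reduces to a plain sum of the losses.
uniform = total_loss(1.0, icl, ial, ones, ones)

# A larger alpha for the visual modality reduces the share of its ICL term.
down = total_loss(1.0, icl, ial, {"name": 1.0, "visual": 2.0}, ones)
print(uniform, down)
```

The $\log α_{m}$ and $\log β_{m}$ terms keep the weights from growing without bound, since a larger weight parameter lowers the scaled loss but raises the logarithmic penalty.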

Appendix B Dataset Statistics

The detailed dataset statistics are listed in Table 5, including the number of entities (#Ent.), relations (#Rel.), attributes (#Attr.), relation triples (#Rel tr.), attribute triples (#Attr tr.), images (#Image), and reference entity alignments (#Ref.). It is worth noting that not all entities have associated images or equivalent counterparts in the other KG.
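As a rough illustration of the incomplete image coverage, the ratio of #Image to #Ent. can be computed from Table 5. The sketch assumes that each image belongs to a distinct entity, which the table itself does not state, so the ratio is best read as an upper bound on coverage; `image_coverage` is a hypothetical helper.

```python
def image_coverage(num_entities: int, num_images: int) -> float:
    """Ratio of images to entities (an upper bound on the share of entities with an image)."""
    return num_images / num_entities


# (#Ent., #Image) pairs copied from Table 5 for DBP15K ZH-EN.
stats = {"ZH": (19388, 15912), "EN": (19572, 14125)}
for kg, (entities, images) in stats.items():
    print(kg, round(image_coverage(entities, images), 3))
```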