EViT: Privacy-Preserving Image Retrieval via Encrypted Vision Transformer in Cloud Computing

Qihua Feng (fengqh2020@gmail.com), Jinan University, Guangzhou 510000, China; Peiya Li (lpy0303@jnu.edu.cn), Jinan University, Guangzhou 510000, China; Zhixun Lu, Jinan University, Guangzhou 510000, China; Chaozhuo Li, Beihang University, Beijing 100000, China; Zefang Wang; Zhiquan Liu, Jinan University, Guangzhou 510000, China; Chunhui Duan, Beijing Institute of Technology, Beijing 100000, China; and Feiran Huang, Jinan University, Guangzhou 510000, China

2022

Abstract.

Image retrieval systems help users to browse and search among extensive images in real-time. With the rise of cloud computing, retrieval tasks are usually outsourced to cloud servers. However, the cloud scenario brings a daunting challenge of privacy protection as cloud servers cannot be fully trusted. To this end, image-encryption-based privacy-preserving image retrieval schemes have been developed, which first extract features from cipher-images, and then build retrieval models based on these features. Yet, most existing approaches extract shallow features and design trivial retrieval models, resulting in insufficient expressiveness for the cipher-images. In this paper, we propose a novel paradigm named Encrypted Vision Transformer (EViT), which advances the discriminative representations capability of cipher-images. First, in order to capture comprehensive ruled information, we extract multi-level local length sequence and global Huffman-code frequency features from the cipher-images which are encrypted by stream cipher during JPEG compression process. Second, we design the Vision Transformer-based retrieval model to couple with the multi-level features, and propose two adaptive data augmentation methods to improve representation power of the retrieval model. Our proposal can be easily adapted to unsupervised and supervised settings via self-supervised contrastive learning manner. Extensive experiments reveal that EViT achieves both excellent encryption and retrieval performance, outperforming current schemes in terms of retrieval accuracy by large margins while protecting image privacy effectively. Code is publicly available at https://github.com/onlinehuazai/EViT.

Image retrieval, privacy-preserving, deep learning, JPEG, Vision Transformer, self-supervised learning.


1. Introduction

Image retrieval is an important research field in the computer vision community, which draws increasing attention due to its significant research impacts and tremendous practical values (Chen et al., 2021; Liu et al., 2016; Gao et al., 2017). Given a query image, the objective of image retrieval is to search similar images in an extensive image database. With the rise of cloud computing, traditional local storage mode has been changed to cloud storage, which satisfies user demand for storing massive image data on the server and enables users to access the data from any location at any time. Although cloud computing contributes to alleviating the challenge of limited local storage space and provides great convenience to users, the images are at danger of privacy leakage since the cloud server cannot be fully trusted and is vulnerable to hacking (Xia et al., 2021). A typical strategy is to encrypt the images prior to uploading them to the cloud server, but conventional image encryption algorithms may hinder the subsequent encrypted image data retrieval operation (Li and Situ, 2019; Zhang and Cheng, 2014). Therefore, it is urgent to develop privacy-preserving image retrieval technology that can provide both privacy protection and accurate retrieval simultaneously.

Schemes	Image encryption & feature extraction			Retrieval model		
	Visual security	Adaptive encryption key	Multi-level features	Unsupervised	Supervised	Deep learning
Zhang (Zhang and Cheng, 2014)	–	✓	✘	✓	✘	✘
Liang (Liang et al., 2019)	$↑$	✘	✘	✓	✘	✘
Xia (Xia et al., 2019)	$↑$	✘	✘	✓	✘	✘
Li (Li and Situ, 2019)	$↑$	✓	✘	✓	✘	✘
Xia (Xia et al., 2021)	$↑$	✘	✘	✓	✘	✘
Xia (Xia et al., 2022b)	$↑$	✘	✘	✓	✘	✘
Cheng (Cheng et al., 2015)	$↑$	✓	✘	✘	✓	✘
Feng (Feng et al., 2021)	$↑$	✘	✘	✘	✓	✓
Ours	$↑$	✓	✓	✓	✓	✓

“✓” indicates that the corresponding requirement is supported; otherwise, “✘” is used.
“ $↑$ ” indicates better visual security than Zhang (Zhang and Cheng, 2014) (mentioned in Section 5).

Table 1. Our scheme and the current schemes.

Existing privacy-preserving image retrieval schemes can be roughly classified into two categories: feature-encryption-based and image-encryption-based schemes (Xia et al., 2021). In the first category, the content owner first extracts features (e.g., scale-invariant feature transform) from plain-images, then separately encrypts images and features (Lu et al., 2009b; Xia et al., 2016, 2017; Janani and Brindha, 2021). In order to match cipher-images, the encrypted features of similar images should keep distance similarity (Xia et al., 2022a), and both encrypted images and features are uploaded to the server. This type of schemes can effectively protect privacy with standard cryptographic technologies, however, it performs image encryption and feature extraction/encryption separately, which may incur additional computational workload and inconvenience for the owner and users (Xia et al., 2021; Li and Situ, 2019; Zhang and Cheng, 2014). To address it, image-encryption-based schemes were proposed by extracting features from encrypted images directly (Xia et al., 2021, 2022b; Li and Situ, 2019; Xia et al., 2019; Liang et al., 2019; Cheng et al., 2016b; Zhang and Cheng, 2014; Cheng et al., 2016a, 2015). In such schemes, the content owner only needs to encrypt images, and then uploads encrypted images to the server. The task of extracting features from cipher-images can be outsourced to the server, which can decrease computational workload for owner and users. There are mainly three modules in the system of the second type of schemes, namely image encryption algorithm, feature extraction method and retrieval model. The mentioned three modules are inherently correlated to each other. The image encryption algorithm is the foundation, which retains effective ruled features in cipher-images and provides desirable image security. 
The feature extraction method is the bridge: it supplies comprehensive inputs to the retrieval model and determines whether the adaptive encryption key (He et al., 2018) is supported (the same image can be encrypted with different secret keys, and the extracted features are unchanged before and after encryption). The retrieval model is responsible for retrieval accuracy: it must couple with the extracted features and learn discriminative representations of the cipher-images.

Our work follows the idea of image-encryption-based privacy-preserving image retrieval, which can reduce additional computational workload. Zhang et al. (Zhang and Cheng, 2014) proposed the first image-encryption-based privacy-preserving scheme, while they had poor visual security with simple permutation encryption (Liang et al., 2019). In order to achieve better visual security, state-of-the-art schemes (Liang et al., 2019; Xia et al., 2019, 2021, 2022b; Li and Situ, 2019; Cheng et al., 2015; Feng et al., 2021) introduced sophisticated encryption operations (e.g., value replacement and stream cipher) to encrypt images, but still have some deficiencies in feature extraction methods and retrieval models. For example, (Zhang and Cheng, 2014; Liang et al., 2019; Xia et al., 2019, 2021, 2022b; Li and Situ, 2019; Cheng et al., 2015) just extract shallow features (e.g., histogram features) from cipher-images, which are unable to express enough information. Furthermore, (Liang et al., 2019; Xia et al., 2019, 2021, 2022b) fail to support the adaptive encryption key (He et al., 2018) because the feature spaces are randomly changed with different secret keys. Feng et al. (Feng et al., 2021) divided images into $8 \times 8$ blocks and proposed Vision Transformer (Dosovitskiy et al., 2021) (ViT) based supervised retrieval model, which achieves better retrieval performance than convolutional neural network (CNN), but it is painstaking to assign labels to images. Schemes (Liang et al., 2019; Xia et al., 2019, 2021, 2022b; Zhang and Cheng, 2014; Li and Situ, 2019) utilized trivial models (e.g., K-means (MacQueen and others, 1967) and Bag-of-Words (BOW) (Sivic and Zisserman, 2003)) to build unsupervised retrieval models, but it is difficult for these trivial models to learn non-linear embedding of complex image datasets (Xie et al., 2016), while deep neural network (DNN) is skilled in it. 
To capture richer information from cipher-images, our feature extraction method derives multi-level features from the $8 \times 8$ blocks and the global Huffman codes of cipher-images. Because the length of a variable-length integer (VLI) code is unchanged by the stream cipher, we use it to support the adaptive encryption key. Self-supervised learning is a typical unsupervised framework (Chen et al., 2020a; He et al., 2020; Tian et al., 2020; Gansbeke et al., 2020; Dang et al., 2021; Jang and Cho, 2021), and ViT (Dosovitskiy et al., 2021) divides an image into non-overlapping blocks to learn global dependencies with the self-attention mechanism (Vaswani et al., 2017). Hence, inspired by Feng et al. (Feng et al., 2021), and because ViT couples naturally with our multi-level features from the $8 \times 8$ blocks of cipher-images, we build a ViT-based unsupervised retrieval model in a self-supervised manner.

In this paper, we propose a novel privacy-preserving image retrieval scheme named Encrypted Vision Transformer (EViT). First, images are encrypted during the JPEG compression process, in which the VLI codes of the Discrete Cosine Transform (DCT) coefficients are encrypted by a stream cipher. Second, EViT extracts multi-level features from cipher-images: the length sequence of the DCT coefficients’ VLI codes in each $8 \times 8$ block (local features) and the global Huffman-code frequency features, which together express richer information about the cipher-images. VLI code encryption with a stream cipher has an inherent advantage: the length of a VLI code is the same before and after encryption, so the multi-level features remain invariant whichever secret keys are used; that is, EViT supports the adaptive encryption key. Finally, EViT adopts self-supervised contrastive learning (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b) to build the unsupervised retrieval model, which reduces the labeling overhead. To learn discriminative representations of the cipher-images, EViT proposes a modified Vision Transformer-based retrieval model that replaces the original $Cls\_Token$ (Dosovitskiy et al., 2021) of ViT with a learnable global Huffman-code frequency feature. Conventional data augmentations in the plain-image retrieval domain (e.g., random crop) would entirely change the multi-level features of cipher-images (Section 7), so EViT applies two adaptive data augmentations directly to the multi-level features to improve the representation ability of the retrieval model. EViT can also yield a supervised retrieval model by fine-tuning the unsupervised model.

As shown in Tab. 1, EViT can satisfy all requirements (e.g., adaptive encryption key, multi-level features, deep learning) simultaneously, and our experimental results show that our EViT can improve retrieval performance by large margins. The main contributions are summarized as follows:

Ingenious multi-level features, local length sequence and global Huffman-code frequency, are extracted from cipher-images to express more abundant features of cipher-images. Our feature extraction method also enables EViT to use the adaptive encryption key since the length of VLI code is unaffected by stream cipher.

To the best of our knowledge, EViT is the first to propose self-supervised learning for privacy-preserving image retrieval. In order to enhance model’s representation ability, EViT proposes two adaptive data augmentations for retrieval model, and uses learnable global Huffman-Code frequency to improve existing ViT.

Experimental results demonstrate that our EViT outperforms other state-of-the-art schemes in retrieval performance for both unsupervised and supervised learning models. EViT can not only protect image privacy but also complete image retrieval efficiently.

The rest of the paper is organized as follows. Section 2 reviews related work on privacy-preserving image retrieval. Preliminaries are introduced in Section 3. Section 4 describes the proposed scheme. Section 5 presents the experimental results. Finally, Section 6 concludes the paper.

2. Related work

In recent years, researchers have paid more attention to privacy-preserving image retrieval (Lu et al., 2009b, a; Xia et al., 2016; Xu et al., 2017; Xia et al., 2022b; Liu et al., 2017) and have applied these techniques to boost the performance of real-life applications (Weng et al., 2015; Osadchy et al., 2010; Weng et al., 2016). Current privacy-preserving image retrieval schemes can mainly be divided into two categories (Xia et al., 2021): feature-encryption-based and image-encryption-based schemes.

Feature-encryption-based schemes: In this type of work, the content owner first extracts features from plain-images, then encrypts these features and the images separately. Note that the features of plain-images also need to be protected because they often contain sensitive information about the plain-images (Xia et al., 2022a; Benhamouda et al., 2019). Lu et al. (Lu et al., 2009b) proposed the first privacy-preserving image retrieval scheme, and they improved retrieval speed with safe search indexes in (Lu et al., 2009a). Wong et al. (Wong et al., 2009) developed asymmetric scalar-product-preserving encryption (ASPE), which preserves a special type of scalar product and thereby supports k-nearest neighbor (kNN) computation on encrypted features. Xia et al. (Xia et al., 2016) utilized a stable kNN algorithm to protect image feature vectors and designed a watermark-based protocol to prevent authorized query users from illegally distributing the retrieved images. Zhang et al. (Zhang et al., 2017) used fully homomorphic encryption to encrypt the features and proposed to speed up privacy-preserving search with distributed and parallel computation. Xia et al. (Xia et al., 2018) proposed a scheme that extracts features with the scale-invariant feature transform (SIFT) and the BOW model. They calculated the distance between image features with the Earth Mover’s Distance (EMD) and applied a linear transformation to the EMD to protect parameter information. Zhang et al. (Zhang et al., 2020) used hash codes to encrypt features learned by deep neural networks (DNN), and they utilized an S-Tree to increase search efficiency. Feature-encryption-based schemes can achieve strong image security, but they perform image encryption and feature extraction/encryption separately, resulting in additional computational workload and inconvenience for the content owner and users (Li and Situ, 2019; Liang et al., 2019).

Image-encryption-based schemes: This type of work extracts features directly from cipher-images. Content owner only needs to encrypt images before uploading cipher-images to the server, the task of extracting features from cipher-images can be outsourced to the server, which solves the problem of separation of image encryption and feature extraction/encryption. Zhang et al. (Zhang and Cheng, 2014) proposed a scheme with encryption of JPEG images by permuting DCT coefficients and extracting features from these coefficients’ histogram invariance. Cheng et al. (Cheng et al., 2016b) proposed to encrypt the DC coefficients by using stream cipher, encrypt the AC coefficients by using scrambling encryption, and conduct retrieval based on the histogram of the AC coefficients. However, Cheng et al. (Cheng et al., 2016b) could not ensure JPEG format compliance. Xia et al. (Xia et al., 2019) encrypted DC coefficients by stream cipher on the Y component and encrypted U and V components by value replacement and permutation encryption, then extracted AC histograms features of Y component and color histograms features of U and V components. Liang et al. (Liang et al., 2019) extracted Huffman-code histograms from cipher-images which were encrypted by stream cipher and permutation encryption. The work (Xia et al., 2022b) encrypted images by color value substitution and permutation encryption, and extracted local color histogram features from cipher-images, then they built unsupervised bag-of-encrypted-words (BOEW) model to achieve retrieval. Li et al. (Li and Situ, 2019) proposed a new block transform encryption method using orthogonal transforms rather than $8 \times 8$ DCT. (Xia et al., 2021) extracted secure Local Binary Pattern (LBP) features from cipher-images, then built BOW model to conduct retrieval. 
The above schemes built unsupervised retrieval models, but they do not use deep learning to learn non-linear embeddings of cipher-images and extract only shallow features from them. Cheng et al. (Cheng et al., 2015) used a stream cipher and permutation encryption to encrypt JPEG images, extracted features from cipher-images with a Markov process, and then built a supervised support vector machine (SVM) model to conduct retrieval. However, assigning labels to images is laborious, and SVM is a linear model with limited ability to learn discriminative representations of cipher-images.

Our EViT follows the idea of image-encryption-based privacy-preserving image retrieval which can decrease additional computational workload. EViT extracts multi-level features from cipher-images and introduces DNN based unsupervised retrieval model in a self-supervised manner, which can significantly improve retrieval performance.

3. Preliminaries

3.1. JPEG Compression

We encrypt images during JPEG compression process like some privacy-preserving image retrieval schemes (Zhang and Cheng, 2014; Cheng et al., 2016b; Xia et al., 2019; Liang et al., 2019; Li and Situ, 2019; Xia et al., 2022a). Here, we briefly introduce the JPEG compression process, whose overview is shown in Fig. 1. According to the JPEG compression standard (Pennebaker and Mitchell, 1992; Christopoulos et al., 2000), images are converted to YUV color space. After $8 \times 8$ DCT and quantization, each block has a total of 64 coefficients, of which the first coefficient is the direct current (DC) coefficient, and the remaining 63 coefficients are the alternating current (AC) coefficients. The DC coefficient is encoded by differential pulse code modulation (DPCM), and the remaining 63 AC coefficients in the same block are converted into a sequence using zig-zag scanning. AC coefficients are encoded with run-length encoding (RLE), which are converted to the $(r, v)$ pairs (Pennebaker and Mitchell, 1992; Christopoulos et al., 2000).

Figure 1. Overview of JPEG compression process.

The lossless Huffman variable-length entropy coding technique is used to further compress the DC differentials ($\Delta DC$) and the AC coefficient $(r, v)$ pairs, and all coefficients are coded into a binary sequence. Specifically, each $\Delta DC$ is encoded into two parts: a DC Huffman code (DCH) and a DC VLI code (DCV); each $(r, v)$ pair is encoded into two parts: an AC Huffman code (ACH) and an AC VLI code (ACV). Fig. 2 gives an example of lossless Huffman variable-length entropy coding, where the category is the range of amplitudes of the coefficients (Pennebaker and Mitchell, 1992), and $r$ corresponds to the run value in the AC Huffman table (Pennebaker and Mitchell, 1992). For example, when $\Delta DC = 2$, its $DCH = 011$ and $DCV = 10$, so it is coded into $01110$; when $(r, v) = (0, 6)$, its $ACH = 100$ and $ACV = 110$, so it is coded into $100110$. Each coefficient can be coded into a binary sequence through Huffman-code table mapping; for more details, please refer to (Pennebaker and Mitchell, 1992; Christopoulos et al., 2000).
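The two worked examples above can be reproduced with a short Python sketch. It assumes the usual JPEG convention that the category equals the number of VLI bits and that negative amplitudes are stored offset by $2^{size} - 1$; the function names are illustrative, and the Huffman dictionaries contain only the two entries quoted in the text, not the full standard tables.

```python
def vli_code(value: int) -> str:
    """Return the VLI bits of a coefficient ('' for zero)."""
    if value == 0:
        return ""  # category 0 carries no VLI bits
    size = abs(value).bit_length()  # category = number of VLI bits
    if value < 0:
        # Negative amplitudes are stored as value + (2**size - 1).
        value += (1 << size) - 1
    return format(value, f"0{size}b")

# Assumption: only the entries used in the text's example are listed here.
DC_HUFFMAN = {2: "011"}        # category -> DCH
AC_HUFFMAN = {(0, 3): "100"}   # (run, category) -> ACH

def encode_dc(delta_dc: int) -> str:
    bits = vli_code(delta_dc)
    return DC_HUFFMAN[len(bits)] + bits

def encode_ac(run: int, value: int) -> str:
    bits = vli_code(value)
    return AC_HUFFMAN[(run, len(bits))] + bits

print(encode_dc(2))     # 01110
print(encode_ac(0, 6))  # 100110
```

Only the VLI part of each codeword depends on the exact amplitude; the Huffman part depends on the category (and the run), which is what later makes VLI-only encryption format-compliant.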

Figure 2. An example of lossless Huffman variable-length entropy coding.

3.2. Unsupervised Contrastive Learning

Supervised learning has been widely used in many image retrieval tasks (Liu et al., 2019; Deng et al., 2020; Husain and Bober, 2019; Yang et al., 2018; Lv et al., 2018; Yates et al., 2021), but assigning labels to images is laborious. To decrease the overhead of human annotation, researchers have explored unsupervised algorithms such as K-means (MacQueen and others, 1967) and the BOW model (Sivic and Zisserman, 2003). Both K-means and BOW rely on clustering, but on many image databases they learn image representations poorly because clustering with Euclidean distance cannot learn non-linear embeddings (Xie et al., 2016). Deep neural networks (DNN) are good at learning non-linear embeddings, which are indispensable for complex image databases. Supervised learning based on DNN generally uses target labels to train the model and therefore inherits the labeling cost. Self-supervised learning can build a DNN model without target labels: it generates virtually unlimited labels from existing images and uses them to learn representations. In recent years, many self-supervised learning methods have been proposed (Gidaris et al., 2018; Ren and Lee, 2018; Noroozi and Favaro, 2016; Bojanowski and Joulin, 2017; Wu et al., 2018; He et al., 2020; Chen et al., 2020a; Tian et al., 2020; Misra and van der Maaten, 2020). Because it is simple and effective, self-supervised contrastive learning (Wu et al., 2018; Chen et al., 2020a; He et al., 2020; Tian et al., 2020) has been used in many unsupervised computer vision tasks (Gansbeke et al., 2020; Dang et al., 2021; Jang and Cho, 2021; Deng et al., 2021; Xie et al., 2021; Ahmed et al., 2021).
SimCLR (Chen et al., 2020a) is a popular self-supervised contrastive learning method, whose process is roughly as follows: given an image $x$, we obtain two different images $x_{i}$ and $x_{j}$ by stochastic data augmentations such as random color distortion and random Gaussian blur; the two images are passed through the same encoder to generate the corresponding embeddings $z_{i}$ and $z_{j}$, and SimCLR learns image representations by maximizing the similarity between $z_{i}$ and $z_{j}$. SimCLR obtains the positive samples of $x$ by data augmentation; the other images are negative samples. However, SimCLR (Chen et al., 2020a) needs a very large batch size to build negative samples during training, which requires GPUs with large memory and is expensive for most researchers. MoCo (He et al., 2020; Chen et al., 2020b) therefore proposed momentum contrast, which removes the need for a large batch size by building a dynamic dictionary with a queue and momentum updating. In this paper, we utilize self-supervised contrastive learning based on MoCo to build our unsupervised retrieval model.
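The MoCo ingredients mentioned above (a momentum-updated key encoder, a queue of negatives, and a contrastive loss) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the training code of EViT: encoder parameters are represented as plain arrays, the loss is the standard InfoNCE form, and the momentum `m` and `temperature` values are illustrative defaults.

```python
import numpy as np

def momentum_update(theta_q, theta_k, m=0.999):
    """Key encoder follows the query encoder: theta_k <- m*theta_k + (1-m)*theta_q."""
    return m * theta_k + (1.0 - m) * theta_q

def _normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(q, k_pos, queue, temperature=0.07):
    """InfoNCE loss for queries q (N, D), positives k_pos (N, D), negatives queue (K, D)."""
    q, k_pos, queue = _normalize(q), _normalize(k_pos), _normalize(queue)
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)        # (N, 1)
    l_neg = q @ queue.T                                     # (N, K)
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    # The positive key sits in column 0 of every row.
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob.mean()

def update_queue(queue, new_keys):
    """Enqueue the newest keys and drop the oldest, keeping the queue size fixed."""
    return np.concatenate([queue, new_keys], axis=0)[len(new_keys):]
```

Because negatives come from the queue rather than from the current batch, the number of negatives is decoupled from the batch size, which is the property the text attributes to MoCo.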

3.3. Vision Transformer

Since Google proposed the Transformer (Vaswani et al., 2017) in 2017, it has come to dominate natural language processing tasks (Devlin et al., 2019; Raffel et al., 2020; Farahani et al., 2021). The Transformer is a sequence model based on the self-attention mechanism, which can capture global dependencies between words. In 2021, Google proposed applying the Transformer to computer vision (Dosovitskiy et al., 2021), yielding the Vision Transformer (ViT). Many researchers have since explored ViT and found that it achieves excellent performance in many computer vision tasks (Lin et al., 2021; Kim et al., 2021; Prakash et al., 2021; Li et al., 2021; Sun et al., 2021; Cheng et al., 2022), in some cases surpassing CNN models (Liu et al., 2021; Xie et al., 2021; Graham et al., 2021; Wang et al., 2021; Arnab et al., 2021). The Vision Transformer not only makes breakthroughs in computer vision but also provides a unified model for computer vision and natural language processing.

ViT (Dosovitskiy et al., 2021) divides images into non-overlapping blocks, and each block plays the role of a word in natural language processing. Suppose the image size is $H \times W$ and the size of each block is $P \times P$; the number of blocks is then $\frac{H \times W}{P^{2}}$. Assuming that the dimension of the word embedding is $D$, ViT uses a linear projection (Dosovitskiy et al., 2021) to map each block to dimension $D$, called the block embedding. Like the $[class]$ token in BERT (Devlin et al., 2019), ViT prepends a learnable embedding $Cls\_Token$ to the block embeddings. The goal of $Cls\_Token$ is to learn the representation of the image, and ViT initializes $Cls\_Token$ with ones (dimension $D$). The sums of the block and position embeddings (Dosovitskiy et al., 2021; Vaswani et al., 2017) are then used as the input to the Transformer encoder. The Transformer encoder of ViT is similar to the standard Transformer (Vaswani et al., 2017) and includes self-attention (Vaswani et al., 2017), layer normalization (Ba et al., 2016) and a residual module (He et al., 2016). Self-attention is an important part of the Transformer encoder. Calculating self-attention requires three matrices: query (Q), key (K) and value (V). Suppose the input of self-attention is $X$; self-attention is then defined as:

$$Q = W_{Q} X, \quad K = W_{K} X, \quad V = W_{V} X \tag{1}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \tag{2}$$

where $W_{Q}, W_{K}, W_{V}$ are linear projection matrices, and $d_{k}$ is the dimension of $K$ . In order to capture more information, Transformer uses multi-head self-attention to learn different subspaces (Vaswani et al., 2017). Multi-head self-attention uses $h$ different linear projections to map $Q, K, V$ , and then concatenates different results of self-attention and does a linear projection. The definition of multi-head self-attention is as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_{1}, \dots, head_{h}) W^{O} \tag{3}$$

$$head_{i} = \mathrm{Attention}(Q_{i}, K_{i}, V_{i}), \quad i \in [1, h] \tag{4}$$

where $W^{O}$ is a linear projection matrix.
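Equations (1) to (4) can be illustrated with a small NumPy sketch. It is a minimal example under stated assumptions: tokens are stored as rows, so the projections are written as `X @ W` rather than the column form $W X$ used above, the heads are obtained by splitting the projected matrices into `h` equal slices, and all weights are random placeholders rather than trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (2); each row is one token."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Eqs. (3)-(4): project, split into h heads, attend, concatenate, project."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v  # each (n_tokens, D)
    heads = [
        attention(Qi, Ki, Vi)
        for Qi, Ki, Vi in zip(
            np.split(Q, h, axis=1), np.split(K, h, axis=1), np.split(V, h, axis=1)
        )
    ]
    return np.concatenate(heads, axis=1) @ W_o

rng = np.random.default_rng(0)
n_tokens, D, h = 5, 8, 2  # D must be divisible by h
X = rng.normal(size=(n_tokens, D))
W_q, W_k, W_v, W_o = (rng.normal(size=(D, D)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (5, 8)
```

Every output row is a weighted mixture of all value rows, which is how self-attention captures dependencies between any pair of blocks regardless of their distance in the image.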

In this paper, we use ViT as the backbone of our retrieval model for the following reasons: a) ViT is popular and has achieved excellent performance in many recent computer vision tasks (Liu et al., 2021; Xie et al., 2021; Graham et al., 2021; Wang et al., 2021; Arnab et al., 2021; Lin et al., 2021; Kim et al., 2021; Prakash et al., 2021; Li et al., 2021; Sun et al., 2021). b) Our feature extraction method couples with the inputs of ViT. We extract features from the $8 \times 8$ blocks of cipher-images, and ViT divides an image into non-overlapping blocks, so the features of each $8 \times 8$ block play the role of a word embedding in ViT. c) ViT can learn global dependencies with the self-attention mechanism (Vaswani et al., 2017), in contrast to the local receptive fields of CNN (Naseer et al., 2021). A previous privacy-preserving image retrieval scheme (Feng et al., 2021) also used ViT as the backbone and showed experimentally that ViT fits encrypted images better than CNN.

Figure 3. System model of image-encryption-based privacy-preserving image retrieval.

4. Proposed Scheme

Generally, the system of image-encryption-based privacy-preserving image retrieval includes three parts: content owner, server, and authorized users. As shown in Fig. 3, the content owner first encrypts images and uploads encrypted images to the server. Then authorized users encrypt the query image with the same encryption algorithm and upload it to the server. The server extracts features from the query image, and the retrieval model searches similar cipher-images in encrypted image database according to the features. Finally, these similar cipher-images are returned to authorized users, and authorized users decrypt images with encryption keys which are shared from the content owner through a secure channel (Liang et al., 2019; Cheng et al., 2015). In this system, three modules need to be designed: image encryption algorithm, feature extraction method, and image retrieval model. We specify these modules of EViT in detail below.

4.1. Image Encryption

Images are encrypted during the JPEG compression process, which is explained briefly in Section 3. Specifically, in the entropy coding stage, we apply a stream exclusive-or operation to the VLI codes of the DC and AC coefficients (Qian et al., 2014). The corresponding encryption keys are $k_{DC}$ and $k_{AC}$, and we encrypt DCV and ACV as follows:

$$DCV' \leftarrow DCV \oplus k_{DC} \tag{5}$$

$$ACV' \leftarrow ACV \oplus k_{AC} \tag{6}$$

where $DCV'$ and $ACV'$ are the encrypted VLI codes and $\oplus$ is the exclusive-or operator. Our encryption keys are generated by the adaptive encryption key generation method (He et al., 2018), which uses the image as the input of the hash function BLAKE2 (Aumasson et al., 2013); therefore, different images have different encryption keys. During encryption, the original image $I$ is compressed into an encrypted JPEG bitstream, and different color components of $I$ have different secret keys. Stream exclusive-or on VLI codes theoretically does not increase the storage size of cipher-images (Christopoulos et al., 2000). Algorithm 1 presents the encryption algorithm of the proposed EViT. The encrypted bitstream is decodable because the file structure and the Huffman codes are unchanged (Qian et al., 2014).

input: $I$, $k_{DC}$, $k_{AC}$
output: Encrypted JPEG bitstream

Convert $I$ from RGB to YUV color space;
Denote the width and height of the image $I$ as $W$ and $H$;
for $I_{i}$ in $I$, $i \in \{Y, U, V\}$ do
    Divide $I_{i}$ into several $8 \times 8$ non-overlapping blocks $B_{j}^{i}$, $j \in [1, \dots, blknum]$, where $blknum = \frac{W \times H}{8 \times 8}$;
    Conduct DCT, quantization and entropy coding;
    for $j = 1$ to $blknum$ do
        Encrypt the VLI code of $B_{j}^{i}$:
        $DCV'_{B_{j}^{i}} \leftarrow DCV_{B_{j}^{i}} \oplus k_{DC}^{B_{j}^{i}}$;
        $ACV'_{B_{j}^{i}} \leftarrow ACV_{B_{j}^{i}} \oplus k_{AC}^{B_{j}^{i}}$;
    end for
end for

Algorithm 1. Encryption algorithm.
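The inner loop of Algorithm 1 can be sketched in Python on bit strings. This is a hedged illustration only: the keystream construction (BLAKE2b over a seed plus a counter) and the names `keystream_bits` and `encrypt_vli_codes` are assumptions made for this example, not the key generation method of (He et al., 2018), and the VLI codes are given as strings rather than parsed from a real JPEG bitstream.

```python
import hashlib

def keystream_bits(seed: bytes, n_bits: int) -> str:
    """Hypothetical keystream: BLAKE2b over seed || counter, expanded to n_bits."""
    out, counter = "", 0
    while len(out) < n_bits:
        block = hashlib.blake2b(seed + counter.to_bytes(4, "big")).digest()
        out += "".join(f"{byte:08b}" for byte in block)
        counter += 1
    return out[:n_bits]

def xor_bits(bits: str, key: str) -> str:
    """Bitwise XOR of two equal-length bit strings."""
    return "".join("1" if a != b else "0" for a, b in zip(bits, key))

def encrypt_vli_codes(vli_codes, seed: bytes):
    """XOR every VLI code with its own slice of the keystream.

    The Huffman codes are left untouched, so only amplitudes change;
    applying the function again with the same seed decrypts.
    """
    stream = keystream_bits(seed, sum(len(c) for c in vli_codes))
    encrypted, pos = [], 0
    for code in vli_codes:
        encrypted.append(xor_bits(code, stream[pos:pos + len(code)]))
        pos += len(code)
    return encrypted

codes = ["10", "110", "", "01"]  # '' stands for a zero coefficient
cipher = encrypt_vli_codes(codes, b"image-derived seed")
print([len(c) for c in cipher])  # [2, 3, 0, 2]
```

Since XOR maps an n-bit string to another n-bit string, every encrypted VLI code stays in the same category as the original, which is why the Huffman part of the bitstream remains valid.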

4.2. Feature Extraction

Image-encryption-based schemes extract features directly from cipher-images. Existing schemes extract only shallow features (e.g., DCT histograms), which cannot express the rich information in cipher-images. EViT extracts multi-level features from our cipher-images: local length sequence features and global Huffman-code frequency features.

Local length sequence features: EViT extracts the local features of each $8 \times 8$ block. As shown in Fig. 4, EViT calculates the VLI code length of each DCT coefficient and builds the length sequence features of each $8 \times 8$ block. For example, when $\Delta DC = 5$, its VLI code is ‘101’, hence the length is 3. The length sequence features are generated in zig-zag scanning order (Pennebaker and Mitchell, 1992). If a coefficient is zero, EViT denotes its length as zero. Because each block has three components (Y/U/V), EViT concatenates the three length sequence features of the different components. Note that when extracting length sequence features for the U and V components, EViT keeps only the first $32$ coefficients of the length sequence, because the later coefficients of the U and V components are mostly zeros (Christopoulos et al., 2000).
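A minimal Python sketch of the local feature extraction described above follows. It assumes that each block is given as an $8 \times 8$ list of plain Python integers holding quantized coefficients, with position (0, 0) already holding the DC differential; the function names are illustrative. The VLI code length of a nonzero coefficient is the bit length of its absolute value.

```python
def zigzag_order(n: int = 8):
    """(row, col) positions of an n x n block in JPEG zig-zag order."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Anti-diagonals in order; odd ones run top-down, even ones bottom-up.
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]),
    )

def length_sequence(block, keep: int = 64):
    """VLI code length of each coefficient in zig-zag order (0 for zero)."""
    lengths = [abs(block[r][c]).bit_length() for r, c in zigzag_order(len(block))]
    return lengths[:keep]

def block_feature(y_block, u_block, v_block):
    """Concatenate 64 values for Y with the first 32 values of U and of V."""
    return (
        length_sequence(y_block)
        + length_sequence(u_block, keep=32)
        + length_sequence(v_block, keep=32)
    )

block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0] = 5, -2, 1
print(length_sequence(block)[:4])  # [3, 2, 1, 0]
```

Under these assumptions each block yields a vector of 64 + 32 + 32 = 128 lengths, which is the per-block input that the retrieval model treats like a word embedding.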

Figure 4. Sketch of extracting local features from our encrypted images.

Global Huffman-code frequency features: EViT extracts global Huffman-code frequency features from the cipher-images. We take a simple example, based on Fig. 2, to describe the Huffman-code frequency features. If the second row of the DC Huffman table is used 10 times during the entropy coding stage of an encrypted image, the corresponding Huffman-code frequency feature is 10. The DC Huffman table has 12 rows and the AC Huffman table has 162 rows (Pennebaker and Mitchell, 1992). Therefore, the dimension of the Huffman-code frequency features is $(12 + 162) \times 3 = 522$, where $3$ represents the three components (YUV).
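The counting step can be sketched as follows. The layout of the 522-dimensional vector (components in Y, U, V order, DC rows before AC rows) and the event format `(component, table, row)` are assumptions chosen for illustration; the text only fixes the table sizes and the total dimension.

```python
from collections import Counter

DC_ROWS, AC_ROWS = 12, 162
COMPONENTS = ("Y", "U", "V")
ROWS_PER_COMPONENT = DC_ROWS + AC_ROWS  # 174

def huffman_frequency(events):
    """Count how often each Huffman-table row is used.

    events: iterable of (component, table, row), with table in {"DC", "AC"}
    and a 0-based row index into that table (hypothetical input format).
    """
    feature = [0] * (ROWS_PER_COMPONENT * len(COMPONENTS))
    for (component, table, row), count in Counter(events).items():
        offset = COMPONENTS.index(component) * ROWS_PER_COMPONENT
        if table == "AC":
            offset += DC_ROWS  # AC rows follow the DC rows of the component
        feature[offset + row] = count
    return feature

# The second DC row (index 1) of the Y component used 10 times, as in the text.
events = [("Y", "DC", 1)] * 10 + [("V", "AC", 161)]
feature = huffman_frequency(events)
print(len(feature), feature[1])  # 522 10
```

Because the Huffman codes are not encrypted, these counts can be read directly from the cipher-image bitstream.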

Global and local features are extracted from cipher-images and are used to train our retrieval model. Different encryption keys can be used for different images, since the stream exclusive-or operation does not change the length of the VLI code. For example, suppose the DCV is ‘101’ and two different encryption keys are ‘100’ and ‘011’; then the $DCV^{'}$ values are ‘001’ and ‘110’ respectively. Hence the length of the binary VLI code remains unchanged no matter which encryption key is used, so the features extracted from cipher-images are also unchanged. This means that we can use different encryption keys for different images, and the authorized user can encrypt the query image with the same encryption algorithm but different keys.
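The key-independence argument can be checked with a few lines of Python. The sketch below treats VLI codes and key-stream segments as bit strings; `xor_bits` is a hypothetical helper, not part of the released code.

```python
def xor_bits(code, key):
    """Bitwise XOR of two equal-length bit strings (VLI code and key segment)."""
    if len(code) != len(key):
        raise ValueError("code and key segment must have the same length")
    return "".join("1" if a != b else "0" for a, b in zip(code, key))


dcv = "101"
for key in ("100", "011"):
    encrypted = xor_bits(dcv, key)
    # The encrypted code differs per key, but its length is always len(dcv).
    print(key, encrypted, len(encrypted) == len(dcv))
```

Applying the same key a second time restores the original code, which is why an authorized user holding the key can decrypt.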

4.3. Unsupervised-learning Retrieval Model

After extracting features from encrypted images, EViT uses these features to train retrieval models. EViT proposes both an unsupervised-learning and a supervised-learning retrieval model based on deep learning. A deep image retrieval model (Chen et al., 2021) is a typical form of deep metric learning (Kaya and Bilge, 2019), whose aim is to learn representations of images. Given an image, we first use a learnable deep neural network $f (\cdot)$, generally called the backbone, to learn its representation $h$. The representations are then used to calculate the similarity (e.g., cosine or Euclidean distance) between the query and the targets.

Figure 5. Overview of our unsupervised-learning retrieval model.

4.3.1. Loss

The unsupervised framework uses MoCo (see Section 3) because it is simple and effective (Chen et al., 2020a; He et al., 2020; Tian et al., 2020) and does not need target labels to learn the representations of cipher-images. As shown in Fig. 5, given a cipher-image, EViT extracts features from it and denotes these features as $x$. Using random data augmentations $t^{'}$ and $t$, we obtain $x_{i}$ and $x_{j}$ respectively. Through the backbone $f_{θ}$, EViT learns the representation $h_{i}$ of $x_{i}$. The process for $x_{j}$ is the same, except that the backbone $f_{θ}^{'}$ receives no gradient (stop-gradient) and is instead updated by momentum from $f_{θ}$ (He et al., 2020). The forward propagation can be defined as:

$$x_{i} = t^{'} (x), \quad x_{j} = t (x) \tag{7}$$

$$h_{i} = f_{θ} (x_{i}), \quad h_{j} = f_{θ}^{'} (x_{j}) . \tag{8}$$

The backbone $f_{θ}^{'}$ is updated in a momentum manner (He et al., 2020), described as:

$$f_{θ}^{'} = m f_{θ}^{'} + (1 - m) f_{θ} \tag{9}$$

where $m$ is the momentum factor, set to $0.99$ following MoCo (He et al., 2020). $f_{θ}^{'}$ and $f_{θ}$ have the same structure but different parameters.
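As a minimal sketch of Eq. (9), assuming the parameters of both backbones are flattened into plain lists of floats (an illustrative simplification; a real model would update tensors layer by layer):

```python
def momentum_update(key_params, query_params, m=0.99):
    """Return the updated momentum-backbone parameters: m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


key = [0.0, 0.0]     # parameters of the momentum backbone (hypothetical values)
query = [1.0, -2.0]  # parameters of the gradient-trained backbone
for _ in range(500):
    key = momentum_update(key, query)
print(key)  # slowly drifts towards the query parameters
```

With $m = 0.99$ the momentum backbone moves only 1% of the way towards $f_{θ}$ at each step, so it changes smoothly over training.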

The loss function, as in MoCo, is InfoNCE (van den Oord et al., 2018). MoCo proposed momentum contrast to avoid the need for a large batch size by building a dynamic dictionary with a queue and a momentum update. The dynamic dictionary is a queue in which the current batch is enqueued and the oldest batch is dequeued. For a sample $x_{i}$ in the current batch, $x_{j}$ is the positive sample, and the other samples in the current batch and in the queue are negative samples. The loss can be defined as:

$$ℓ = - \log \frac{\exp (h_{i} \cdot h_{j} / τ)}{\exp (h_{i} \cdot h_{j} / τ) + \sum_{k^{-}} \exp (h_{i} \cdot h_{k^{-}} / τ)} \tag{10}$$

where $τ$ is a temperature hyper-parameter proposed by (Wu et al., 2018), which we set to be 0.1 like MoCo. $h_{i}$ and $h_{j}$ are representations of $x_{i}$ and $x_{j}$ respectively (Fig. 5). $h_{k^{-}}$ are representations of negative samples, and “ $\cdot$ ” is dot product.
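Eq. (10) can be evaluated for a single query with the following sketch. Representations are plain Python lists here, and the list of negatives stands in for the other batch samples and the queue; both are simplifying assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(h_i, h_j, negatives, tau=0.1):
    """InfoNCE loss for one query h_i, its positive h_j and a list of negatives."""
    logits = [dot(h_i, h_j) / tau] + [dot(h_i, h_k) / tau for h_k in negatives]
    m = max(logits)  # subtract the maximum for numerical stability
    log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_denominator - logits[0]


aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned < misaligned)  # True: an aligned positive pair gives a lower loss
```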

4.3.2. Backbone

Our backbone $f_{θ}$ adopts the structure of ViT (see Section 3). EViT extracts two kinds of features from encrypted images: local length sequence features and global Huffman-code frequency features. As shown in Fig. 6, suppose the number of blocks of a cipher-image is $\frac{H \times W}{8 \times 8}$ ($H$ is the height, $W$ is the width). The local features of these blocks pass through a linear projection (Dosovitskiy et al., 2021) to produce the corresponding block embeddings. The original $Cls\_Token$ (mentioned in Section 3) of ViT is all ones for each image, so it fails to express information specific to different images. Hence, different from the standard ViT (Dosovitskiy et al., 2021), EViT replaces $Cls\_Token$ with a Huffman embedding, which improves retrieval performance in our experiments. The Huffman embedding ($He$) is learned from the global Huffman-code frequency features ($gHff$), and can be defined as:

$$He = FC (σ (LN (FC (gHff)))) \tag{11}$$

where FC is a fully-connected layer, LN is layer normalization (Ba et al., 2016), and $σ$ is the ReLU activation function (Agarap, 2018). To keep position information, we also add position embeddings (Dosovitskiy et al., 2021) to the block embeddings and the Huffman embedding. The resulting embeddings are denoted as $v_{0}$; then, through $L$ stacked Transformer encoders (Dosovitskiy et al., 2021), EViT learns the representation $v_{L}^{0}$ of the cipher-image. The $l$-th Transformer encoder can be defined as:

$$v_{l}^{'} = MSA (LN (v_{l - 1})) + v_{l - 1}, \quad l = 1, 2, \dots, L \tag{12}$$

$$v_{l} = MLP (LN (v_{l}^{'})) + v_{l}^{'}, \quad l = 1, 2, \dots, L \tag{13}$$

where $MSA$ is multi-head self-attention (Vaswani et al., 2017) (Eq. 3) and $MLP$ is a multi-layer perceptron block (Dosovitskiy et al., 2021). The output of the $L$ stacked encoders is $v_{L}$, and the representation is its first embedding $v_{L}^{0}$.
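To make Eq. (11) and the token layout concrete, the following pure-Python sketch computes a Huffman embedding and places it where the class token would normally sit. The embedding dimension, the random weights, the omission of learnable LN scale and shift, and the zero-valued block embeddings are all placeholders chosen for illustration; none of these values comes from the paper.

```python
import math
import random


def linear(x, weight, bias):
    """Fully-connected layer: weight is a list of rows, one per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def layer_norm(x, eps=1e-5):
    """Layer normalization without learnable scale and shift (simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x):
    return [max(0.0, v) for v in x]


def init_linear(rng, n_in, n_out):
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def huffman_embedding(ghff, fc1, fc2):
    """He = FC(ReLU(LN(FC(gHff)))), following Eq. (11)."""
    hidden = relu(layer_norm(linear(ghff, *fc1)))
    return linear(hidden, *fc2)


rng = random.Random(0)
EMBED_DIM = 16  # hypothetical embedding size
fc1 = init_linear(rng, 522, EMBED_DIM)
fc2 = init_linear(rng, EMBED_DIM, EMBED_DIM)
ghff = [float(rng.randint(0, 50)) for _ in range(522)]  # dummy frequency feature
he = huffman_embedding(ghff, fc1, fc2)

num_blocks = (128 * 192) // (8 * 8)  # blocks of a 128 x 192 image
block_embeddings = [[0.0] * EMBED_DIM for _ in range(num_blocks)]  # placeholders
tokens = [he] + block_embeddings  # Huffman embedding takes the class-token slot
print(len(tokens), len(he))
```

For a $128 \times 192$ image this gives $384$ block tokens plus one Huffman token, so the first output embedding $v_{L}^{0}$ corresponds to the Huffman slot.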

ViT (Dosovitskiy et al., 2021) builds each block from the spatial pixels of a plain-image, but spatial pixels are randomly encrypted in cipher-images, so EViT uses the length sequence feature of a block in place of its spatial pixels. Plain-image retrieval can use a ViT pre-trained on ImageNet (Krizhevsky et al., 2012) as the model’s backbone; our task is privacy-preserving image retrieval, for which no pre-trained model is available, so the retrieval model needs to be trained from scratch. Generally, a small learning rate and warm-up (He et al., 2016) are necessary for training a new model with the structure of ViT (Touvron et al., 2021). EViT uses a cosine learning rate schedule with warm-up (Wolf et al., 2020). Our experimental results also show that the cosine learning rate with warm-up is helpful for retrieval performance.
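A cosine schedule with linear warm-up can be sketched as below. The peak learning rate ($1 \times 10^{-3}$) and the 20 warm-up epochs follow the values reported in Section 5.1.1; the total number of epochs is not stated in this section, so `total_epochs=200` is purely an assumption, as is the exact shape of the warm-up ramp.

```python
import math


def lr_at_epoch(epoch, total_epochs=200, warmup_epochs=20, peak_lr=1e-3):
    """Linear warm-up to peak_lr, followed by cosine decay towards zero."""
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 10, 19, 20, 110, 199):
    print(epoch, lr_at_epoch(epoch))
```

The warm-up phase keeps early updates small while the randomly initialised backbone stabilises; the cosine phase then decays the rate smoothly.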

4.3.3. Data augmentation

EViT uses random data augmentations $t$ and $t^{'}$ on the encrypted features in our unsupervised-learning retrieval model (Fig. 5). Common plain-image data augmentations, such as random crop, are not suitable for our task. For example, suppose the plain-image shows a dog and we randomly crop a region from it. The cropped image keeps the spatial structure of the plain-image, i.e. it still shows the dog, perhaps without its tail. For encrypted images, however, the cropped image no longer shows a dog after decryption. Because our encryption algorithm is conducted during the JPEG compression process, the DCT coefficients of adjacent $8 \times 8$ blocks in an encrypted image are correlated. If some blocks are missing, the DCT coefficients of the encrypted cropped image all change during decryption. Once the DCT coefficients, in particular the DC coefficients, are changed, the spatial pixels and structures of the decrypted image change correspondingly. Other plain-image data augmentations are also unsuitable for our task, because they all change the DCT coefficients of the augmented image. Thus EViT conducts data augmentations directly on the length sequence features extracted from cipher-images.

In our retrieval model, we propose two kinds of adaptive data augmentations for length sequence features: random swap and random splice. Fig. 7 gives examples of both. Our data augmentations are motivated by two observations: a) the length sequence features are like different words, and swapping two words is a typical augmentation in natural language processing (Wei and Zou, 2019); b) ViT can still perform well under simple block permutation (Naseer et al., 2021), hence EViT can shuffle the length sequence features with random splice. In addition, a model with random dropout (Srivastava et al., 2014) also acts as data augmentation (Gao et al., 2021), because passing an image through a model with random dropout twice yields different embeddings in the training stage (Gao et al., 2021). ViT (Dosovitskiy et al., 2021) itself has random dropout, and EViT sets the dropout in our backbone as in ViT (Dosovitskiy et al., 2021). Our experimental results show that the two adaptive data augmentations are vital for the model and can significantly improve retrieval performance.

Figure 7. Examples for our data augmentations with length sequence features.
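Since Fig. 7 is not reproduced here, the following sketch shows only one plausible reading of the two augmentations: random swap exchanges the length sequence features of two randomly chosen blocks, and random splice cuts the block sequence at a random position and re-joins the two parts in the opposite order. Both interpretations, and the function names, are assumptions rather than the authors' exact procedure.

```python
import random


def random_swap(blocks, rng=random):
    """Swap the length-sequence features of two randomly chosen blocks."""
    out = list(blocks)
    if len(out) >= 2:
        a, b = rng.sample(range(len(out)), 2)
        out[a], out[b] = out[b], out[a]
    return out


def random_splice(blocks, rng=random):
    """Cut the block sequence at a random point and re-join the parts in reverse order."""
    if len(blocks) < 2:
        return list(blocks)
    cut = rng.randrange(1, len(blocks))
    return list(blocks[cut:]) + list(blocks[:cut])


rng = random.Random(0)
blocks = [[i] for i in range(6)]  # six dummy per-block features
print(random_swap(blocks, rng))
print(random_splice(blocks, rng))
```

Both operations only reorder blocks, so every per-block length sequence is preserved and no DCT coefficient is modified.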

4.4. Supervised-learning Retrieval Model

EViT also provides a simple supervised-learning retrieval model, which uses the same backbone structure as the unsupervised-learning model. As shown in Fig. 8, the supervised model obtains the representation $h$ of a cipher-image from the backbone $f_{θ}$ (Fig. 6). Here we can use the pre-trained backbone from the unsupervised-learning model. The supervised loss function ArcFace (Deng et al., 2019) is used to train our model; it operates on the normalised representation, defined as:

$$h = f_{θ} (x), \quad h^{'} = l_{2} (h) \tag{14}$$

where $l_{2}$ is short for $l_{2}$ normalisation (Deng et al., 2019; Wang et al., 2017).

Figure 8. Overview of supervised-learning retrieval model.

ArcFace (Deng et al., 2019) is a common deep metric learning loss function that has been widely used in retrieval tasks (Ozaki and Yokoo, 2019; Ha et al., 2020; Deng and Zafeiriou, 2019). Compared with Triplet loss (Schroff et al., 2015), ArcFace is simpler and more effective (Deng et al., 2019), as it avoids disadvantages such as hard sample mining and the combinatorial explosion in the number of triplets. ArcFace adds an additive angular margin within the softmax loss (Deng et al., 2019), which can be defined as:

$$ℓ_{ArcFace} = - \frac{1}{N} \sum_{i = 1}^{N} \log \frac{e^{s \cos (θ_{y_{i}} + α)}}{e^{s \cos (θ_{y_{i}} + α)} + \sum_{j = 1, j \neq y_{i}}^{n} e^{s \cos θ_{j}}} \tag{15}$$

where $N$ and $n$ are the batch size and the number of classes, and $θ_{j}$ is the angle between the representation of the $i$-th sample and the $j$-th class center. $y_{i}$ is the ground-truth label of the $i$-th sample, and $θ_{y_{i}}$ is the angle between the $i$-th sample and its ground-truth class center. $s$ and $α$ are hyper-parameters that represent the feature re-scale and the angular margin, respectively. For more details, please refer to (Deng et al., 2019).
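Eq. (15) can be evaluated directly from the cosines between each normalised representation and the class centers. The sketch below assumes these cosines are already available as nested lists; the default values $s = 32$ and $α = 0.1$ are simply one of the settings explored in Tab. 6.

```python
import math


def arcface_loss(cosines, labels, s=32.0, alpha=0.1):
    """ArcFace loss; cosines[i][j] is cos(theta_j) for sample i and class center j."""
    total = 0.0
    for cos_row, y in zip(cosines, labels):
        logits = []
        for j, c in enumerate(cos_row):
            c = max(-1.0, min(1.0, c))  # guard acos against rounding errors
            if j == y:
                c = math.cos(math.acos(c) + alpha)  # additive angular margin
            logits.append(s * c)
        m = max(logits)  # log-sum-exp with the maximum subtracted for stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_sum - logits[y]
    return total / len(labels)


row = [0.9, 0.1, -0.2]
print(arcface_loss([row], [0], alpha=0.0) < arcface_loss([row], [0], alpha=0.4))
```

With $α = 0$ the expression reduces to a softmax cross-entropy on the scaled cosines; a positive margin makes the target class harder to satisfy and therefore increases the loss for the same input.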

In the inference stage, we only need to calculate the cosine distances between the representations $h$ of cipher-images, rank these distances, and return the top-$K$ results. Built on the unsupervised-learning model, our supervised-learning model is straightforward; in the future we will explore it further, for example by combining Triplet loss (Schroff et al., 2015) and Center loss (Wen et al., 2016) to learn more discriminative representations. Algorithm 2 presents the main processes of EViT.

input : plain-images, secret keys, labels = None

// first module: image encryption algorithm (Section 4.1)

cipher-images $\leftarrow$ Image encryption (plain-images, secret keys) ;

// second module: feature extraction method (Section 4.2)

features $\leftarrow$ feature extraction (cipher-images) ;

// third module: retrieval models

unsupervised $\leftarrow$ retrieval model (features) ; // Section 4.3

if labels is not None then

supervised $\leftarrow$ Fine-Tuning (unsupervised, features, labels) ; // Section 4.4

return supervised

else

return unsupervised

end if

Algorithm 2 EViT’s main algorithm
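The inference step described in Section 4.4 (cosine ranking of the representations followed by a top-$K$ cut) can be sketched in a few lines. Representations are plain lists here, and ranking by the highest cosine similarity is used as the equivalent of ranking by the smallest cosine distance; the function names are hypothetical.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def top_k(query, database, k=100):
    """Indices of the k database representations most similar to the query."""
    q = l2_normalize(query)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(h))) for h in database]
    ranked = sorted(range(len(database)), key=lambda idx: scores[idx], reverse=True)
    return ranked[:k]


database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(top_k([2.0, 0.1], database, k=2))  # [0, 2]
```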

5. Experiments And Analysis

In this section, we present the experimental results of our proposed scheme. We evaluate the performance on the Corel10K dataset (Li and Wang, 2003), which is widely used in related research. The Corel10K dataset contains 10000 images in 100 categories, with 100 images in each category. The image sizes are $128 \times 192$ or $192 \times 128$. Our programming language is Python. In the following, we first describe the retrieval performance, then analyze the time consumption of searching, and finally present the encryption performance of EViT.

Datasets	Corel10K-a	Corel10K-b
Training set	7000	7000
Testing set	3000	3000
Classes of Training set	70	100
Classes of Testing set	30	100

Table 2. Descriptions of Corel10K-a and Corel10K-b datasets.

5.1. Retrieval Performance

EViT proposes both an unsupervised-learning and a supervised-learning retrieval model. We compare the retrieval performance of EViT with current image-encryption-based schemes, whose retrieval models are likewise divided into unsupervised and supervised models. To compare retrieval performance more thoroughly, we split the data into training and testing sets in two different ways, denoted Corel10K-a and Corel10K-b and described in Tab. 2. For Corel10K-a, we train the retrieval model on 70 classes and test on the remaining 30, so the testing set shares no classes with the training set (open-set classification (Deng et al., 2019)). For Corel10K-b, we train the retrieval model on all 100 classes, selecting 70 images from each class; the remaining 30 images of each class form the testing set (closed-set classification (Deng et al., 2019)). We use stochastic gradient descent (SGD) as the optimizer, with a weight decay of $5 \times 10^{-5}$ and an SGD momentum of 0.9. We set the batch size to 14 and 35 for the unsupervised-learning and supervised-learning models respectively. We train the retrieval models using the PyTorch framework (Paszke et al., 2019) on a machine with an Nvidia RTX2080Ti 11G GPU.

The evaluation metric we use for retrieval performance is mean Average Precision (mAP) (Manning et al., 2008), which is widely used in retrieval tasks. When returning the top-$K$ results, mAP is calculated as follows:

$$mAP@K = \frac{1}{Q} \sum_{q = 1}^{Q} AP@K (q) \tag{16}$$

$$AP@K (q) = \frac{1}{R_{q}} \sum_{k = 1}^{K} p_{q} (k) \, rel_{q} (k) \tag{17}$$

where $Q$ is the number of query images, $R_{q}$ is the number of similar images for the query $q$, $p_{q} (k)$ is the precision at rank $k$ for the query $q$, and $rel_{q} (k)$ is 1 if the result at rank $k$ is similar to $q$ and 0 otherwise. In this paper, we use $mAP@100$ to evaluate retrieval performance. The higher the $mAP@100$, the better the retrieval performance.
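Eqs. (16) and (17) translate into the following sketch. Each query is assumed to be described by a ranked list of 0/1 relevance flags and by its value of $R_{q}$, which is passed in explicitly; the function names are hypothetical.

```python
def average_precision_at_k(relevance, num_similar, k=100):
    """AP@K; relevance[i] is 1 if the result at rank i + 1 is similar, else 0."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / num_similar if num_similar else 0.0


def mean_average_precision_at_k(all_relevance, all_num_similar, k=100):
    """mAP@K averaged over all queries."""
    aps = [average_precision_at_k(r, n, k)
           for r, n in zip(all_relevance, all_num_similar)]
    return sum(aps) / len(aps)


# Hits at ranks 1 and 3 with R_q = 2: (1/1 + 2/3) / 2
print(average_precision_at_k([1, 0, 1], num_similar=2, k=3))
```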

Schemes		Corel10K-a	Corel10K-b
Unsupervised	Xia(Xia et al., 2019)	0.378	0.230
	Liang(Liang et al., 2019)	0.321	0.217
	Xia(Xia et al., 2022b)	0.383	0.235
	Xia(Xia et al., 2021)	0.301	0.190
	Zhang (Zhang and Cheng, 2014)	0.396	0.269
	Li(Li and Situ, 2019)	0.410	0.269
	Ours-unsupervised	0.466	0.295
Supervised	Cheng(Cheng et al., 2015)	—	0.407
	Feng(Feng et al., 2021)	0.423	0.528
	Ours-supervised	0.554	0.759

Table 3. Comparison of retrieval performance with current schemes.

Here, we compare retrieval performance with current image-encryption-based schemes. All schemes are evaluated on the same testing set, and the comparison results are shown in Tab. 3. Our retrieval performance is better than that of the other schemes. Specifically, on the open-set Corel10K-a, our unsupervised-learning model achieves 0.466 $mAP@100$, about $5.6\%$ higher (in absolute terms) than the state-of-the-art retrieval performance; on the closed-set Corel10K-b, it achieves 0.295 $mAP@100$, about $2.6\%$ higher than the state of the art. Our supervised-learning model significantly improves retrieval performance over the other supervised schemes. Note that Cheng (Cheng et al., 2015) used classification probabilities as the representations of cipher-images, which is unsuitable for the open-set Corel10K-a because the testing set shares no classes with the training set. So far there are few image-encryption-based supervised-learning schemes, and our supervised-learning model can serve as a strong baseline for exploring them. Next, we describe more experimental details about the unsupervised-learning and supervised-learning retrieval performance respectively.

5.1.1. Unsupervised retrieval performance

Our backbone has $L$ stacked Transformer encoders (Fig. 6), so we evaluate retrieval performance with different values of $L$ on the Corel10K-a and Corel10K-b datasets. As shown in Tab. 4, $L = 6$ is the most suitable for our unsupervised-learning retrieval model.

$L$	4	5	6	7
Corel10K-a (mAP@100)	0.457	0.462	0.466	0.465
Corel10K-b (mAP@100)	0.289	0.291	0.295	0.292

Table 4. Unsupervised-learning retrieval performance with different $L$ values on Corel10K-a and Corel10K-b datasets.

Figure 9. Comparison of learning rate schedules.

We propose two adaptive data augmentations for EViT, and have mentioned that warm-up and the Huffman embedding are helpful for retrieval performance (Section 4.3.2). Here, we use ablation experiments to verify how data augmentations, warm-up, and the Huffman embedding influence the unsupervised-learning retrieval performance on the Corel10K-a and Corel10K-b datasets. Our learning rate is $1 \times 10^{-3}$ with cosine warm-up. As shown in Fig. 9, the “blue” line is the cosine learning rate with warm-up, which increases linearly to $1 \times 10^{-3}$ over the first 20 epochs; the “red” line is the cosine learning rate without warm-up. We add data augmentations, warm-up, and the Huffman embedding one by one ($L = 6$); the results of the ablation experiments are shown in Tab. 5. Without data augmentations, warm-up, and the Huffman embedding, the retrieval performance only reaches 0.397 and 0.243 $mAP@100$ on Corel10K-a and Corel10K-b respectively. Our two adaptive data augmentations enhance retrieval performance, improving $mAP@100$ by about $1.4\%$ on both datasets. Warm-up improves retrieval performance by $1.2\%$ and $1.5\%$ on Corel10K-a and Corel10K-b respectively. The Huffman embedding significantly improves retrieval performance, by $4.3\%$ and $2.3\%$ on Corel10K-a and Corel10K-b respectively. The Huffman embedding is learned from the global Huffman-code frequency, which is one of our multi-level features. The ablation experiments show that multi-level features express richer information about cipher-images, which directly improves retrieval performance.

Data augmentations		✓	✓	✓
Warm up			✓	✓
Huffman embedding				✓
Corel10K-a (mAP@100)	0.397	0.411	0.423	0.466
Corel10K-b (mAP@100)	0.243	0.257	0.272	0.295

Table 5. Ablation experiments with warm up and huffman embedding for unsupervised-learning retrieval model.

5.1.2. Supervised retrieval performance

Our supervised model is obtained by fine-tuning the unsupervised model. For example, when training the supervised-learning model on the Corel10K-a dataset, our backbone is initialised with the parameters of the unsupervised-learning model that was also trained on Corel10K-a. Using an unsupervised-learning model as the pre-trained model for supervised learning is common (Gansbeke et al., 2020; Chen et al., 2020a; He et al., 2020) and can accelerate model convergence and achieve better performance. Because the backbone of the supervised-learning model is the same as that of the unsupervised-learning model, most strategies, such as warm-up and data augmentations, are also used in the supervised-learning model. Apart from the loss function, the supervised-learning model largely follows our unsupervised-learning model.

$α$ \ $s$	16	32	64
0.1	0.539 / 0.690	0.554 / 0.658	0.540 / 0.588
0.2	0.526 / 0.721	0.533 / 0.681	0.546 / 0.620
0.3	0.529 / 0.746	0.501 / 0.725	0.534 / 0.453
0.4	0.527 / 0.747	0.497 / 0.759	0.518 / 0.520

Table 6. Retrieval performance with different $α$ and $s$ on Corel10K-a/b datasets.

We mentioned that ArcFace (Deng et al., 2019) has two hyper-parameters, $s$ and $α$ (Eq. 15). We now vary $s$ and $α$ to observe their influence on retrieval performance with $L = 6$ on the Corel10K-a and Corel10K-b (Corel10K-a/b) datasets. As shown in Tab. 6, on the open-set dataset Corel10K-a a small $α$ suits the supervised-learning model better, while on the closed-set dataset Corel10K-b $s$ should not be set too large (e.g., 64), which decreases retrieval performance. Because the training and testing sets of Corel10K-b share the same classes, it is easier for the supervised-learning model to learn the representations. Here, we use t-SNE (Van der Maaten and Hinton, 2008) to visualize the semantic spaces of different supervised-learning schemes (Cheng et al., 2015; Feng et al., 2021) on the Corel10K-b dataset. Specifically, 10 classes are chosen from the testing set, with 30 instances in each class. Fig. 10 shows that the semantic space of Cheng (Cheng et al., 2015) is somewhat disordered. Although the semantic space of Feng (Feng et al., 2021) separates different classes, it is not compact within each class; moreover, some classes are not pulled apart. EViT is both more distinguishable between classes and more compact within classes. In summary, our supervised-learning retrieval model significantly improves retrieval performance, and there may still be room for improvement (Section 4.4). We also hope our supervised-learning model can be a strong baseline for exploring privacy-preserving image retrieval.

5.1.3. Why not CNN and non-end-to-end

Current deep plain-image retrieval works (Chen et al., 2021) are end-to-end: they use images directly as the model’s inputs and extract features automatically (e.g., CNN features) rather than relying on hand-crafted features. However, ruled features cannot be extracted in this way from our cipher-images, because their spatial structure information (e.g., pixel values) is randomly changed and disordered by the secret keys. Therefore, EViT adopts a non-end-to-end manner, because the hand-crafted features (e.g., local length sequence and global Huffman-code frequency) are ruled and can be used to effectively learn representations of cipher-images. In Section 3.3, we mentioned that our backbone uses ViT rather than a CNN; here we compare its retrieval performance with ResNet50 (He et al., 2016), a classical CNN backbone, on the Corel10K-b dataset. As shown in Tab. 7, the non-end-to-end manner far outperforms the end-to-end manner for both backbones, and ViT surpasses ResNet50 by about $10\%$ $mAP@100$ in the non-end-to-end unsupervised manner. Tab. 7 thus shows that both the non-end-to-end manner and ViT are vital for improving retrieval performance.

Backbone	ResNet50 (ETE)	ViT (ETE)	ResNet50 (NETE)	ViT (NETE)
Unsupervised	failed	failed	0.196	0.295
Supervised	0.102	0.156	0.598	0.759

Table 7. Retrieval performance ($mAP@100$) with different backbones in end-to-end (ETE) and non-end-to-end (NETE) manner on Corel10K-b dataset; “failed” represents mAP@100 less than 0.1.

5.2. Time consumption

We test the time consumption of searching and average the results over the entire Corel10K dataset. When searching for similar cipher-images, we assume that the retrieval model is already trained. As shown in Tab. 8, our searching time is $0.09$ s, which is acceptable. Because our unsupervised-learning and supervised-learning models use the same backbone, their search times are the same. For a fair comparison, all schemes are implemented in Python, and Tab. 8 shows that EViT is faster than the other schemes in terms of searching. In the future, we will deploy the deep retrieval model with C++ and use Faiss (Johnson et al., 2019) to further decrease searching time.

Schemes	Zhang (Zhang and Cheng, 2014)	Liang (Liang et al., 2019)	Xia (Xia et al., 2019)	Li (Li and Situ, 2019)	Xia (Xia et al., 2021)	Xia (Xia et al., 2022b)	Feng (Feng et al., 2021)	Ours
Searching time (s)	1.01	0.11	0.12	1.01	0.36	0.35	0.15	0.09
PSNR (dB)	16.40	9.81	10.17	13.18	13.12	13.21	13.65	12.01

Table 8. Comparison of searching time and PSNR for different schemes.

5.3. Encryption performance

In the proposed EViT, images are encrypted during JPEG compression. The adopted encryption operations do not destroy the format information of JPEG, hence our encryption scheme is format-compliant to JPEG. In Fig. 11, we take five plain-images as examples to demonstrate the encryption performance of our encryption algorithm. Here, we analyze the encryption performance from four aspects: visual security, statistical attack, differential cryptanalysis, and key security.

Figure 11. Encryption examples (the first row is cipher-images, the second row is corresponding plain-images).

Visual security: As the examples in Fig. 11 show, the encrypted images are sufficiently disordered and do not disclose any information about the plain-images. Besides visual inspection, we use the Peak Signal-to-Noise Ratio (PSNR) to evaluate visual safety. We compare our encryption algorithm with those of current privacy-preserving image retrieval schemes and calculate the average PSNR over 10000 images, where a smaller PSNR indicates better visual safety (Li and Lo, 2018). All schemes are evaluated in the YUV color space. As shown in Tab. 8, some schemes (Liang et al., 2019; Xia et al., 2019) have a smaller PSNR than ours, because they use extra encryption steps in addition to the stream cipher. However, our retrieval performance is significantly better than theirs (Tab. 3).
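The PSNR used above follows the standard definition $10 \log_{10} (255^{2} / MSE)$. A minimal sketch, assuming a single 8-bit channel stored as nested lists (how the three YUV channels are combined is not specified in this section and is left out):

```python
import math


def psnr(plain, cipher, peak=255.0):
    """PSNR in dB between two equally sized single-channel images."""
    diffs = [(p - c) ** 2
             for row_p, row_c in zip(plain, cipher)
             for p, c in zip(row_p, row_c)]
    mse = sum(diffs) / len(diffs)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


plain = [[10, 20], [30, 40]]
print(psnr(plain, [[12, 18], [33, 37]]) > psnr(plain, [[200, 5], [90, 250]]))  # True
```

The more a cipher-image deviates from its plain-image, the larger the mean squared error and the smaller the PSNR, which is why a smaller value indicates better visual safety.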

Statistical attack: To defeat attacks based on statistical models, the histograms of plain-images and those of cipher-images should be different. As shown in Fig. 12, taking the plain-image (Fig. 11 (e)) and its cipher-image as an example, there is no statistical correlation between the histogram of the encrypted image and that of the plain-image. Compared with the histogram of the plain-image, the frequencies of different pixel values in the cipher-image differ less, so our scheme has a certain resistance to statistical attack.

Differential cryptanalysis: To resist differential cryptanalysis, a minor change to the plain-image, such as modifying a single pixel, should result in a significant change in the corresponding cipher-image (He et al., 2018). For example, given a plain-image $P^{1}$, we slightly change a single pixel and obtain plain-image $P^{2}$. Encrypting $P^{1}$ and $P^{2}$ with the encryption algorithm yields $C^{1}$ and $C^{2}$ respectively. Differential cryptanalysis attacks a cryptosystem by comparing and analyzing the differences between the cipher-images $C^{1}$ and $C^{2}$, so we want them to differ significantly. NPCR (number of pixels change rate) and UACI (unified average changing intensity) (Chen et al., 2004) are two common metrics for evaluating the ability of an encryption algorithm to resist differential attack, and are calculated as follows:

$$D (i, j) = \begin{cases} 0, & C^{1} (i, j) = C^{2} (i, j) \\ 1, & C^{1} (i, j) \neq C^{2} (i, j) \end{cases} \tag{18}$$

$$NPCR : N (C^{1}, C^{2}) = \frac{\sum_{i, j} D (i, j)}{H \times W} \times 100\% \tag{19}$$

$$UACI : U (C^{1}, C^{2}) = \frac{\sum_{i, j} \frac{\left| C^{1} (i, j) - C^{2} (i, j) \right|}{255}}{H \times W} \times 100\% \tag{20}$$

where $C^{1}$ and $C^{2}$ are the cipher-images corresponding to plain-images $P^{1}$ and $P^{2}$, and $C (i, j)$ is the pixel value at coordinates $(i, j)$ ($1 \leq i \leq H, 1 \leq j \leq W$). Here, we select 6 categories (church, girl, sky, architecture, painting, Africa), containing 600 images in total, from Corel10K, and change a single pixel of each plain-image to calculate the NPCR and UACI. The results are averaged within each category. The closer the NPCR is to 100% and the UACI is to 33%, the stronger the ability to resist differential attack (Chen et al., 2004). As shown in Tab. 9, our encryption algorithm has a certain resistance to differential attack.

Category	Church	Girl	Sky	Architecture	Painting	Africa
NPCR	95.55%	96.09%	96.51%	96.78%	96.41%	95.84%
UACI	47.28%	48.13%	47.52%	48.00%	48.48%	48.74%

Table 9. Mean NPCR and UACI of cipher-images with one pixel changing.
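Eqs. (18) to (20) can be computed with the sketch below, which assumes the two cipher-images are single-channel nested lists of 8-bit pixel values of equal size; the function name is hypothetical.

```python
def npcr_uaci(c1, c2):
    """NPCR and UACI (both in percent) of two equally sized cipher-images."""
    pairs = [(a, b) for row1, row2 in zip(c1, c2) for a, b in zip(row1, row2)]
    changed = sum(1 for a, b in pairs if a != b)          # sum of D(i, j)
    intensity = sum(abs(a - b) / 255 for a, b in pairs)   # normalised differences
    return 100.0 * changed / len(pairs), 100.0 * intensity / len(pairs)


c1 = [[0, 255], [10, 10]]
c2 = [[0, 0], [10, 20]]
print(npcr_uaci(c1, c2))  # NPCR is 50.0 because two of the four pixels differ
```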

Key security: Brute-force attack is a standard ciphertext-only attack strategy in which attackers only have access to the encrypted data. Our encryption method has six encryption keys (${k_{DC}}_{*}, {k_{AC}}_{*}, * \in \{Y, U, V\}$), i.e. each component has its own keys. Each of these keys is 256 bits long, so the key space of our encryption algorithm is $(2^{256})^{6}$. It is extremely difficult to restore the plain-images by brute-force cracking. Therefore, our scheme is secure against ciphertext-only attack.
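The size of this key space can be made tangible with Python's arbitrary-precision integers; the variable names below are illustrative only.

```python
KEY_BITS = 256  # length of each key in bits
NUM_KEYS = 6    # DC and AC keys for each of Y, U and V

key_space = (2 ** KEY_BITS) ** NUM_KEYS

print(key_space == 2 ** (KEY_BITS * NUM_KEYS))  # True: (2^256)^6 = 2^1536
print(len(str(key_space)))                      # number of decimal digits
```

An exhaustive search would therefore have to cover $2^{1536}$ candidate key combinations.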

EViT uses an adaptive encryption key generation method (He et al., 2018) to encrypt images, i.e. different images are encrypted with different keys. As explained in Section 4.2, EViT can do so because different keys generate the same cipher-image features. Some schemes, such as Zhang (Zhang and Cheng, 2014) and Li (Li and Situ, 2019), can also use different keys to encrypt different images: they calculate the DCT histogram at each frequency, whose position is unchanged, hence the features of cipher-images are unchanged. But some schemes, such as Liang (Liang et al., 2019) and Xia (Xia et al., 2022b), can only use the same key to encrypt different images. Liang (Liang et al., 2019) permuted the AC coefficients in each block, so different keys change the $(r, v)$ pairs, which changes the features extracted from cipher-images. Xia (Xia et al., 2022b) used color value replacement to encrypt images; different keys generate different replacement tables, so the features of cipher-images also change. Other schemes (Xia et al., 2019, 2021; Feng et al., 2021) are in a similar situation. Once the features change with the keys, the feature spaces of the cipher-images differ, which compromises retrieval performance.

Our encryption algorithm cannot achieve absolute security, and its visual safety is slightly lower than that of some other schemes. An absolutely secure encryption algorithm would not allow us to extract ruled features from cipher-images, and current image-encryption-based schemes generally fail to ensure absolute security. For example, Zhang (Zhang and Cheng, 2014) leaked the DCT histograms of cipher-images. Liang (Liang et al., 2019) and Xia (Xia et al., 2019) achieved a smaller PSNR, i.e. better visual safety, than ours, but their retrieval performance is about $10\%$ $mAP@100$ lower than ours. Moreover, they used the same keys to encrypt all images, so they fail to resist differential attack. EViT achieves the best retrieval performance and good visual security among the compared schemes. Current privacy-preserving image retrieval schemes seek a balance among different aspects of performance, for example slightly compromising visual security to significantly improve retrieval performance. This is also the purpose of our work.

6. Conclusion

In this paper, we propose a novel privacy-preserving image retrieval scheme named EViT, which improves retrieval performance by large margins over current schemes while effectively protecting the security of images. First, we design multi-level features (local length sequence and global Huffman-code frequency) extracted from cipher-images that are encrypted by a stream cipher on VLI codes during the JPEG compression process; EViT supports adaptive encryption keys because the multi-level features are unchanged under different secret keys. Second, EViT proposes an unsupervised retrieval model trained in a self-supervised manner, and adopts the structure of ViT as the model’s backbone to couple with the multi-level features. To improve retrieval performance, EViT adopts two adaptive data augmentations for the retrieval model and enhances ViT with a learnable global Huffman-code frequency embedding. The supervised model can be easily obtained by fine-tuning the trained unsupervised retrieval model. Experimental results show that EViT not only effectively protects image privacy but also significantly improves retrieval performance over current schemes. In future work, we will try to further improve the encryption performance while maintaining retrieval accuracy.

Acknowledgment

The authors would like to thank the anonymous reviewers for their constructive comments and suggestions.

References

A. Paszke, S. Gross, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 8024–8035. Cited by: §5.1.
A. F. Agarap (2018) Deep learning using rectified linear units (ReLU). CoRR abs/1803.08375. Cited by: §4.3.2.
S. A. A. Ahmed, M. Awais, and J. Kittler (2021) SiT: self-supervised vision transformer. CoRR abs/2104.03602. Cited by: §3.2.
A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lucic, and C. Schmid (2021) ViViT: A video vision transformer. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, pp. 6816–6826. Cited by: §3.3, §3.3.
J. Aumasson, S. Neves, Z. Wilcox-O’Hearn, and C. Winnerlein (2013) BLAKE2: simpler, smaller, fast as MD5. In Applied Cryptography and Network Security - 11th International Conference, ACNS 2013, Lecture Notes in Computer Science, Vol. 7954, pp. 119–135. Cited by: §4.1.
L. J. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. CoRR abs/1607.06450. Cited by: §3.3, §4.3.2.
F. Benhamouda, S. Halevi, and T. Halevi (2019) Supporting private data on hyperledger fabric with secure multiparty computation. IBM Journal of Research and Development 63 (2/3), pp. 3:1–3:8. Cited by: §2.
P. Bojanowski and A. Joulin (2017) Unsupervised learning by predicting noise. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 517–526. Cited by: §3.2.
G. Chen, Y. Mao, and C. K. Chui (2004) A symmetric image encryption scheme based on 3D chaotic cat maps. Chaos, Solitons & Fractals 21 (3), pp. 749–761. Cited by: §5.3.
T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton (2020a) A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Proceedings of Machine Learning Research, Vol. 119, pp. 1597–1607. Cited by: §1, §1, §3.2, §4.3.1, §5.1.2.
W. Chen, Y. Liu, W. Wang, E. M. Bakker, T. Georgiou, P. W. Fieguth, L. Liu, and M. S. Lew (2021) Deep image retrieval: A survey. CoRR abs/2101.11282. Cited by: §1, §4.3, §5.1.3.
X. Chen, H. Fan, R. B. Girshick, and K. He (2020b) Improved baselines with momentum contrastive learning. CoRR abs/2003.04297. Cited by: §1, §3.2.
H. Cheng, X. Zhang, J. Yu, and F. Li (2015) Markov process based retrieval for encrypted JPEG images. In 10th International Conference on Availability, Reliability and Security, ARES 2015, pp. 417–421. Cited by: Table 1, §1, §1, §2, §4, 9(a), §5.1.2, §5.1, Table 3.
H. Cheng, X. Zhang, J. Yu, and Y. Zhang (2016a) Encrypted JPEG image retrieval using block-wise feature comparison. Journal of Visual Communication and Image Representation 40, pp. 111–117. Cited by: §1.
H. Cheng, X. Zhang, and J. Yu (2016b) AC-coefficient histogram-based retrieval for encrypted JPEG images. Multimedia Tools and Applications 75 (21), pp. 13791–13803. Cited by: §1, §2, §3.1.
Z. Cheng, X. Su, X. Wang, S. You, and C. Xu (2022) Sufficient vision transformer. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, New York, NY, USA, pp. 190–200. ISBN 9781450393850. Cited by: §3.3.
C. A. Christopoulos, T. Ebrahimi, and A. N. Skodras (2000) JPEG2000: the new still picture compression standard. In Proceedings of the ACM Multimedia 2000 Workshops 2000, S. Ghandeharizadeh, S. Chang, S. Fischer, J. A. Konstan, and K. Nahrstedt (Eds.), pp. 45–49. Cited by: §3.1, §3.1, §4.1, §4.2.
Z. Dang, C. Deng, X. Yang, K. Wei, and H. Huang (2021) Nearest neighbor matching for deep clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, pp. 13693–13702. Cited by: §1, §3.2.
C. Deng, E. Yang, T. Liu, and D. Tao (2020) Two-stream deep hashing with class-specific centers for supervised image search. IEEE Transactions on Neural Networks and Learning Systems 31 (6), pp. 2189–2201. Cited by: §3.2.
J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) ArcFace: additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, pp. 4690–4699. Cited by: §4.4, §4.4, §4.4, §5.1.2, §5.1.
J. Deng and S. Zafeiriou (2019) ArcFace for disguised face recognition. In 2019 IEEE/CVF International Conference on Computer Vision Workshops, ICCV Workshops 2019, pp. 485–493. Cited by: §4.4.
Z. Deng, Y. Zhong, S. Guo, and W. Huang (2021) InsCLR: improving instance retrieval with self-supervision. CoRR abs/2112.01390. Cited by: §3.2.
J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, pp. 4171–4186. Cited by: §3.3, §3.3.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021. Cited by: §1, §1, §3.3, §3.3, §4.3.2, §4.3.2, §4.3.3.
M. Farahani, M. Gharachorloo, M. Farahani, and M. Manthouri (2021) ParsBERT: transformer-based model for persian language understanding. Neural Processing Letters 53 (6), pp. 3831–3847. Cited by: §3.3.
Q. Feng, P. Li, Z. Lu, G. Liu, and F. Huang (2021) End-to-end learning for encrypted image retrieval. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021, pp. 1839–1845. Cited by: Table 1, §1, §3.3, 9(b), §5.1.2, §5.3, Table 3, Table 8.
W. V. Gansbeke, S. Vandenhende, S. Georgoulis, M. Proesmans, and L. V. Gool (2020) SCAN: learning to classify images without labels. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part X, Lecture Notes in Computer Science, Vol. 12355, pp. 268–285. Cited by: §1, §3.2, §5.1.2.
T. Gao, X. Yao, and D. Chen (2021) SimCSE: simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, pp. 6894–6910. Cited by: §4.3.3.
X. Gao, S. C. H. Hoi, Y. Zhang, J. Zhou, J. Wan, Z. Chen, J. Li, and J. Zhu (2017) Sparse online learning of image similarity. ACM Transactions on Intelligent Systems and Technology 8 (5). ISSN 2157-6904. Cited by: §1.
S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In 6th International Conference on Learning Representations, ICLR 2018. Cited by: §3.2.
B. Graham, A. El-Nouby, H. Touvron, P. Stock, A. Joulin, H. Jégou, and M. Douze (2021) LeViT: a vision transformer in convnet’s clothing for faster inference. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, pp. 12239–12249. Cited by: §3.3, §3.3.
Q. Ha, B. Liu, F. Liu, and P. Liao (2020) Google landmark recognition 2020 competition third place solution. CoRR abs/2010.05350. Cited by: §4.4.
J. He, S. Huang, S. Tang, and J. Huang (2018) JPEG image encryption with improved format compatibility and file size preservation. IEEE Transactions on Multimedia 20 (10), pp. 2645–2658. Cited by: §1, §1, §4.1, §5.3, §5.3.
K. He, H. Fan, Y. Wu, S. Xie, and R. B. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, pp. 9726–9735. Cited by: §1, §1, §3.2, §4.3.1, §5.1.2.
K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pp. 770–778. Cited by: §3.3, §4.3.2, §5.1.3.
S. S. Husain and M. Bober (2019) REMAP: multi-layer entropy-guided pooling of dense CNN features for image retrieval. IEEE Transactions on Image Processing 28 (10), pp. 5201–5213. Cited by: §3.2.
T. Janani and M. Brindha (2021) SEcure similar image matching (SESIM): an improved privacy preserving image retrieval protocol over encrypted cloud database. IEEE Transactions on Multimedia. Cited by: §1.
Y. K. Jang and N. I. Cho (2021) Self-supervised product quantization for deep unsupervised image retrieval. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, pp. 12065–12074. Cited by: §1, §3.2.
J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7 (3), pp. 535–547. Cited by: §5.2.
M. Kaya and H. S. Bilge (2019) Deep metric learning: A survey. Symmetry 11 (9), pp. 1066–1092. Cited by: §4.3.
B. Kim, J. Lee, J. Kang, E. Kim, and H. J. Kim (2021) HOTR: end-to-end human-object interaction detection with transformers. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, pp. 74–83. Cited by: §3.3, §3.3.
A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1106–1114. Cited by: §4.3.2.
J. Li and J. Z. Wang (2003) Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (9), pp. 1075–1088. Cited by: §5.
K. Li, S. Wang, X. Zhang, Y. Xu, W. Xu, and Z. Tu (2021) Pose recognition with cascade transformers. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, pp. 1944–1953. Cited by: §3.3, §3.3.
P. Li and K. Lo (2018) A content-adaptive joint image compression and encryption scheme. IEEE Transactions on Multimedia 20 (8), pp. 1960–1972. Cited by: §5.3.
P. Li and Z. Situ (2019) Encrypted JPEG image retrieval using histograms of transformed coefficients. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019, pp. 1140–1144. Cited by: Table 1, §1, §1, §1, §2, §2, §3.1, §5.3, Table 3, Table 8.
H. Liang, X. Zhang, and H. Cheng (2019) Huffman-code based retrieval for encrypted JPEG images. Journal of Visual Communication and Image Representation 61, pp. 149–156. Cited by: Table 1, §1, §1, §2, §2, §3.1, §4, §5.3, §5.3, §5.3, Table 3, Table 8.
K. Lin, L. Wang, and Z. Liu (2021) End-to-end human pose and mesh reconstruction with transformers. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, pp. 1954–1963. Cited by: §3.3, §3.3.
C. Liu, G. W. Yu, M. Volkovs, C. Chang, H. Rai, J. Ma, and S. K. Gorti (2019) Guided similarity separation for image retrieval. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pp. 1554–1564. Cited by: §3.2.
D. Liu, J. Shen, Z. Xia, and X. Sun (2017) A content-based image retrieval scheme using an encrypted difference histogram in cloud computing. Information 8 (3), pp. 96–109. Cited by: §2.
Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, pp. 9992–10002. Cited by: §3.3, §3.3.
Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, pp. 1096–1104. Cited by: §1.
W. Lu, A. Swaminathan, A. L. Varna, and M. Wu (2009a) Enabling search over encrypted multimedia databases. In Media Forensics and Security I, part of the IS&T-SPIE Electronic Imaging Symposium, San Jose, CA, USA, January 19-21, 2009, Proceedings, SPIE Proceedings, Vol. 7254, pp. 725418. Cited by: §2, §2.
W. Lu, A. L. Varna, A. Swaminathan, and M. Wu (2009b) Secure image retrieval through feature protection. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2009, 19-24 April 2009, Taipei, Taiwan, pp. 1533–1536. Cited by: §1, §2, §2.
Y. Lv, W. Zhou, Q. Tian, S. Sun, and H. Li (2018) Retrieval oriented deep feature learning with complementary supervision mining. IEEE Transactions on Image Processing 27 (10), pp. 4945–4957. Cited by: §3.2.
J. MacQueen et al. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1, pp. 281–297. Cited by: §1, §3.2.
C. D. Manning, P. Raghavan, and H. Schütze (2008) Introduction to information retrieval. Cambridge University Press. ISBN 978-0-521-86571-5. Cited by: §5.1.
I. Misra and L. van der Maaten (2020) Self-supervised learning of pretext-invariant representations. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, pp. 6706–6716. Cited by: §3.2.
M. Naseer, K. Ranasinghe, S. H. Khan, M. Hayat, F. S. Khan, and M. Yang (2021) Intriguing properties of vision transformers. CoRR abs/2105.10497. Cited by: §3.3, §4.3.3.
M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI, Lecture Notes in Computer Science, Vol. 9910, pp. 69–84. Cited by: §3.2.
M. Osadchy, B. Pinkas, A. Jarrous, and B. Moskovich (2010) SCiFI - A system for secure face identification. In 31st IEEE Symposium on Security and Privacy, S&P 2010, 16-19 May 2010, Berkeley/Oakland, California, USA, pp. 239–254. Cited by: §2.
K. Ozaki and S. Yokoo (2019) Large-scale landmark retrieval/recognition under a noisy and diverse dataset. CoRR abs/1906.04087. Cited by: §4.4.
W. B. Pennebaker and J. L. Mitchell (1992) JPEG: still image data compression standard. Springer Science & Business Media. Cited by: §3.1, §3.1, §4.2, §4.2.
A. Prakash, K. Chitta, and A. Geiger (2021) Multi-modal fusion transformer for end-to-end autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, pp. 7077–7087. Cited by: §3.3, §3.3.
Z. Qian, X. Zhang, and S. Wang (2014) Reversible data hiding in encrypted JPEG bitstream. IEEE Transactions on Multimedia 16 (5), pp. 1486–1491. Cited by: §4.1.
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, pp. 140:1–140:67. Cited by: §3.3.
Z. Ren and Y. J. Lee (2018) Cross-domain self-supervised multi-task feature learning using synthetic imagery. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, pp. 762–771. Cited by: §3.2.
F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pp. 815–823. Cited by: §4.4, §4.4.
J. Sivic and A. Zisserman (2003) Video google: A text retrieval approach to object matching in videos. In 9th IEEE International Conference on Computer Vision (ICCV 2003), pp. 1470–1477. Cited by: §1, §3.2.
N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §4.3.3.
J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou (2021) LoFTR: detector-free local feature matching with transformers. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, pp. 8922–8931. Cited by: §3.3, §3.3.
Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive multiview coding. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XI, Lecture Notes in Computer Science, Vol. 12356, pp. 776–794. Cited by: §1, §3.2, §4.3.1.
H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou (2021) Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, M. Meila and T. Zhang (Eds.), Proceedings of Machine Learning Research, Vol. 139, pp. 10347–10357. Cited by: §4.3.2.
A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. CoRR abs/1807.03748. Cited by: §4.3.1.
L. Van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (11), pp. 2579–2605. Cited by: §5.1.2.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pp. 5998–6008. Cited by: §1, §3.3, §3.3, §3.3, §4.3.2.
F. Wang, X. Xiang, J. Cheng, and A. L. Yuille (2017) NormFace: $L_{2}$ hypersphere embedding for face verification. In Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, October 23-27, 2017, Q. Liu, R. Lienhart, H. Wang, S. Chen, S. Boll, Y. P. Chen, G. Friedland, J. Li, and S. Yan (Eds.), pp. 1041–1049. Cited by: §4.4.
W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao (2021) Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, pp. 548–558. Cited by: §3.3, §3.3.
J. W. Wei and K. Zou (2019) EDA: easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 6381–6387. Cited by: §4.3.3.
Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In Computer Vision - ECCV 2016 - 14th European Conference, Proceedings, Part VII, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Lecture Notes in Computer Science, Vol. 9911, pp. 499–515. Cited by: §4.4.
L. Weng, L. Amsaleg, and T. Furon (2016) Privacy-preserving outsourced media search. IEEE Transactions on Knowledge and Data Engineering 28 (10), pp. 2738–2751. Cited by: §2.
L. Weng, L. Amsaleg, A. Morton, and S. Marchand-Maillet (2015) A privacy-preserving framework for large-scale content-based information retrieval. IEEE Transactions on Information Forensics and Security 10 (1), pp. 152–167. Cited by: §2.
T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020, Q. Liu and D. Schlangen (Eds.), pp. 38–45. Cited by: §4.3.2.
W. K. Wong, D. W. Cheung, B. Kao, and N. Mamoulis (2009) Secure kNN computation on encrypted databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, U. Çetintemel, S. B. Zdonik, D. Kossmann, and N. Tatbul (Eds.), pp. 139–152. Cited by: §2.
Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, pp. 3733–3742. Cited by: §3.2, §4.3.1.
Z. Xia, Q. Ji, Q. Gu, C. Yuan, and F. Xiao (2022a) A format-compatible searchable encryption scheme for JPEG images using bag-of-words. ACM Transactions on Multimedia Computing, Communications, and Applications 18 (3), pp. 85:1–85:18. Cited by: §1, §2, §3.1.
Z. Xia, L. Jiang, D. Liu, L. Lu, and B. Jeon (2022b) BOEW: A content-based image retrieval scheme using bag-of-encrypted-words in cloud computing. IEEE Transactions on Services Computing 15 (1), pp. 202–214. Cited by: Table 1, §1, §1, §2, §2, §5.3, Table 3, Table 8.
Z. Xia, L. Lu, T. Qiu, H. Shim, X. Chen, and B. Jeon (2019) A privacy-preserving image retrieval based on AC-coefficients and color histograms in cloud environment. Computers, Materials & Continua 58 (1), pp. 27–44. Cited by: Table 1, §1, §1, §2, §3.1, §5.3, §5.3, §5.3, Table 3, Table 8.
Z. Xia, L. Wang, J. Tang, N. N. Xiong, and J. Weng (2021) A privacy-preserving image retrieval scheme using secure local binary pattern in cloud computing. IEEE Transactions on Network Science and Engineering 8 (1), pp. 318–330. Cited by: Table 1, §1, §1, §1, §2, §2, §5.3, Table 3, Table 8.
Z. Xia, X. Wang, L. Zhang, Z. Qin, X. Sun, and K. Ren (2016) A privacy-preserving and copy-deterrence content-based image retrieval scheme in cloud computing. IEEE Transactions on Information Forensics and Security 11 (11), pp. 2594–2608. Cited by: §1, §2, §2.
Z. Xia, N. N. Xiong, A. V. Vasilakos, and X. Sun (2017) EPCBIR: an efficient and privacy-preserving content-based image retrieval scheme in cloud computing. Information Sciences 387, pp. 195–204. Cited by: §1.
Z. Xia, Y. Zhu, X. Sun, Z. Qin, and K. Ren (2018) Towards privacy-preserving content-based image retrieval in cloud computing. IEEE Transactions on Cloud Computing 6 (1), pp. 276–286. Cited by: §2.
J. Xie, R. B. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, M. Balcan and K. Q. Weinberger (Eds.), JMLR Workshop and Conference Proceedings, Vol. 48, pp. 478–487. Cited by: §1, §3.2.
Z. Xie, Y. Lin, Z. Yao, Z. Zhang, Q. Dai, Y. Cao, and H. Hu (2021) Self-supervised learning with swin transformers. CoRR abs/2105.04553. Cited by: §3.2, §3.3, §3.3.
Y. Xu, J. Gong, L. Xiong, Z. Xu, J. Wang, and Y. Shi (2017) A privacy-preserving content-based image retrieval method in cloud environment. Journal of Visual Communication and Image Representation 43, pp. 164–172. Cited by: §2.
J. Yang, J. Liang, H. Shen, K. Wang, P. L. Rosin, and M. Yang (2018) Dynamic match kernel with deep convolutional features for image retrieval. IEEE Transactions on Image Processing 27 (11), pp. 5288–5302. Cited by: §3.2.
A. Yates, R. Nogueira, and J. Lin (2021) Pretrained transformers for text ranking: BERT and beyond. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, New York, NY, USA, pp. 2666–2668. ISBN 9781450380379. Cited by: §3.2.
C. Zhang, L. Zhu, S. Zhang, and W. Yu (2020) TDHPPIR: an efficient deep hashing based privacy-preserving image retrieval method. Neurocomputing 406, pp. 386–398. Cited by: §2.
L. Zhang, T. Jung, K. Liu, X. Li, X. Ding, J. Gu, and Y. Liu (2017) PIC: enable large-scale privacy preserving content-based image search on cloud. IEEE Transactions on Parallel and Distributed Systems 28 (11), pp. 3258–3271. Cited by: §2.
X. Zhang and H. Cheng (2014) Histogram-based retrieval for encrypted JPEG images. In 2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP), pp. 446–449. Cited by: item 2, Table 1, §1, §1, §1, §2, §3.1, §5.3, §5.3, Table 3, Table 8.