Vision-Language Adaptive Mutual Decoder for OOV-STR
Technical Report

Jinshui Hu^{1,2}, Chenyu Liu^{1,2}, Qiandong Yan^{1}, Xuyang Zhu^{1}, Pengcheng Xia^{1}, Fengli Yu^{1}, Jiajia Wu^{1,2}, Bing Yin^{1,2}

^{1} IFLYTEK Research
^{2} University of Science and Technology of China
{jshu, cyliu7, qdyan, xyzhu8, pcxia, flyu, jjwu, bingyin}@iflytek.com

Abstract

Recent works have shown the great success of deep learning models on common in-vocabulary (IV) scene text recognition. However, in real-world scenarios, out-of-vocabulary (OOV) words are of great importance, and SOTA recognition models usually perform poorly in OOV settings. Inspired by the intuition that a learned language prior limits OOV performance, we design a framework named Vision Language Adaptive Mutual Decoder (VLAMD) to partly tackle OOV problems. VLAMD consists of three main components. Firstly, we build an attention-based LSTM decoder with two adaptively merged visual-only modules, yielding a vision-language balanced main branch. Secondly, we add an auxiliary query-based autoregressive transformer decoding head for common visual and language prior representation learning. Finally, we couple these two designs with bidirectional training for more diverse language modeling, and perform mutual sequential decoding to obtain more robust results. Our approach achieved 70.31% and 59.61% word accuracy on the IV+OOV and OOV settings respectively on the Cropped Word Recognition Task of the OOV-ST Challenge at the ECCV 2022 TiE Workshop, where we took 1st place on both settings.

1 Introduction

Scene text recognition (STR) plays an important part in general visual understanding. Despite the fast development of computer vision [resnet, fasterrcnn, maskrcnn, vit] and text recognition [hmm, crnn, aster, sar, abinet], some problems remain unsolved in real-world scenarios, e.g., the OOV problem. [onvoc] reveals that SOTA methods perform well on images with words within the vocabulary but generalize poorly to images with words outside the vocabulary. However, in real-world scenarios OOV words are common and of great importance, for example, toponyms, business names, URLs, random strings, etc. Hence, adapting current systems to recognize OOV instances is a crucial next step in terms of both research and application.

To help explore this problem, [oovchallenge] first proposes a new benchmark that can specifically measure performance over OOV instances. Alongside it, a challenge named ECCV 2022 Challenge on Out of Vocabulary Scene Text Understanding (OOV-ST) was launched, held at the ECCV 2022 Workshop on Text in Everything (TiE). This benchmark combines common scene-text datasets as the training set, i.e., Syn90k [syntext], ICDAR 2015 [icdar15], TextOCR [textocr], MLT-19 [mlt19], HierText [hiertext], and OpenTextImages [openimagev5text]. The final test set contains images with text instances that never occur in the above-mentioned training datasets. This benchmark will serve as a cornerstone of OOV research for the OCR community, and we build our work on it.

In this paper, we aim to build an adaptive and unified recognition framework for both IV and OOV instances. Our motivation is that the recognizer should be able to decide how much visual and linguistic information to use when handling different inputs and decoding steps. Our contributions can be summarized as follows:

We put forward the VLAD module, which adaptively fuses visual and linguistic features at every decoding step, leading to a great improvement on OOV instances.
Besides, we propose a mutual decoding method together with a TransD head and a bidirectional modeling mechanism, which prevents the recognizer from overfitting a unidirectional language prior and yields more robust co-decoding results in both OOV and IV settings.
We carefully evaluate our proposed VLAMD on the official OOV-ST Challenge, where it achieves 1st place on both the OOV and IV+OOV metrics, with word accuracies of 59.61% and 70.31% respectively. Experimental results demonstrate the effectiveness of our method.

2 Related Works

With the development of computer vision and deep learning, the performance of STR has improved significantly. In this section we give a brief review of related works on STR and OOV recognition.
STR. In the past ten years, STR methods have evolved from HMM based [hmm] and CTC based [crnn] to attention based [LAS, aster, sar, robustscan, abinet]. Limited by the conditional independence assumption, HMM and CTC methods perform a language-free fragmented prediction, and usually need an external language model (LM). On the other hand, attention based approaches adopt an autoregressive framework, bringing an implicit LM and a more flexible spatial recognition ability. Moreover, ASTER [aster] combines a rectification module with an attention decoder to tackle irregular STR, and SAR [sar] illustrates that a decoder with 2D attention is a strong baseline that achieves SOTA performance on irregular STR. Recently, query based parallel decoders have shown comparable performance on STR, with a dynamic position enhancement branch [robustscan] or iterative language modeling [abinet].
OOV Recognition. Few works aim to tackle OOV problems. [onvoc] first pointed out the problem and proposed a joint learning baseline of attention based and segmentation based methods. RobustScanner [robustscan] designs a position queried module to dynamically fuse positional and hybrid features. These two works relieve OOV troubles from different views of feature enhancement. In addition, [openset] proposes context decoupling modules to solve open-set recognition, while [seqclr, siman, readwrite] study self-supervised learning in STR. We point out that these directions will also benefit further OOV research.

3 Approach

3.1 Overview

The overall training framework is shown in Fig. 0(a). Given an image $I \in R^{3 \times H \times W}$, the backbone first encodes it into a downsampled visual contextual feature $F \in R^{C \times \frac{H}{4} \times \frac{W}{4}}$. Then, the VLAD module and the TransD module each decode two results, i.e., the left-to-right (L2R) and the right-to-left (R2L) target strings. These decoding modules are jointly trained together with the backbone, where the loss contains four cross entropy losses guided by the GT targets and four mutual KL losses between the L2R and R2L sequences, see Eq. 12.

3.2 Backbone

For simplicity, a small CNN plus plain ViT backbone is adopted: 1) we use two Conv blocks, each with stride 2, for fast downsampling, obtaining a feature map of size $512 \times \frac{H}{4} \times \frac{W}{4}$; 2) we flatten the feature map to $512 \times \frac{H W}{16}$ and send it to a standard transformer encoder as used in [atten]; 3) finally, the feature map is reshaped to $512 \times \frac{H}{4} \times \frac{W}{4}$ for decoding.
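The three backbone steps can be traced as a shape calculation. The following is a minimal sketch, not the authors' code: the function name `backbone_shapes` is hypothetical, and it assumes each stride-2 block halves the height and width with ceiling rounding.

```python
def backbone_shapes(height, width, channels=512, num_stride2_blocks=2):
    """Trace feature shapes through the Conv blocks and the transformer encoder."""
    h, w = height, width
    for _ in range(num_stride2_blocks):
        # Assumption: padding is chosen so each block gives ceil(size / 2).
        h, w = (h + 1) // 2, (w + 1) // 2
    return {
        "conv_map": (channels, h, w),          # step 1: 512 x H/4 x W/4
        "encoder_tokens": (channels, h * w),   # step 2: flattened to 512 x HW/16
        "output_map": (channels, h, w),        # step 3: reshaped for decoding
    }

shapes = backbone_shapes(32, 100)
print(shapes)
```

For the 32x100 input size used in the experiments, this gives an 8x25 map, i.e. 200 tokens for the transformer encoder.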

3.3 VLAMD

As mentioned before, our vision-language adaptive mutual decoders for text recognition are made up of three building blocks. One is the key Vision Language Adaptive Decoder (VLAD), one is a query based transformer decoder (TransD), and another is the bidirectional training and the mutual decoding strategy.
VLAD. The main decoding branch is based on the LAS [LAS] architecture and the coverage attention mechanism [COVERAGE]. Our key insight for vision-language balanced recognition is that the ability to choose adaptively between vision and language should be a basic component of the system. To achieve this, we additionally design a position query branch named the Positional Aware Attention (PAA) module and reuse the Visual Aware Attention (VAA) module to extract visual-only context. Besides, an Adaptive Gated Fusion module is proposed to adaptively fuse the linguistic enhanced feature and the visual-only features. Fig. 2 illustrates VLAD in detail.

Figure 2: Detailed structure of VLAD. In addition to the main LSTM decoder, VLAD consists of three sub-modules: a) the VAA module in light blue, which outputs a visual-only feature $a_{t}$; b) the PAA module in light orange red, which outputs a positional aware visual-only feature $q_{t}$; c) the Adaptive Fusion module in green, which forms the key dynamic visual-linguistic fusion strategy.

Mathematically, during one decoding step $t$ , a visual-only context is firstly computed using last step’s result $y_{t - 1}$ , hidden state $h_{t - 1}$ , image feature $F$ :

$$a_{t} = \mathrm{Attention}([y_{t-1}; h_{t-1}], F, F), \qquad (1)$$

where $\mathrm{Attention}(query, key, value)$ denotes a basic attention layer and $a_{t}$ is the obtained visual-only feature. For simplicity, we omit coverage attention here; please refer to [COVERAGE] for details.
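To make the (query, key, value) interface concrete, here is a minimal sketch of a basic attention layer over a flattened feature map. It assumes scaled dot-product scoring and a query already projected to the key dimension; the paper does not specify the scoring function, and coverage attention is omitted as in the text. Passing different `keys` and `values` mirrors the case where a position enhanced feature is the key and the original feature is the value.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """query: vector of dim d; keys: N vectors of dim d; values: N vectors.

    Returns the weighted sum of values and the attention weights.
    Assumption: scaled dot-product scoring (not confirmed by the paper).
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return context, weights

# Toy feature map with two spatial locations.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[5.0, 5.0, 5.0], [-1.0, -1.0, -1.0]]
context, weights = attention([10.0, 0.0], keys, values)
print(weights, context)
```

The query aligns with the first key, so almost all attention mass lands on the first location and the context is close to the first value vector.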

Then, the hidden state $h_{t}$ for LSTM cell and the linguistic enhanced feature $r_{t}$ are updated by

$$h_{t} = \mathrm{LSTM}([y_{t-1}; a_{t-1}], h_{t-1}), \qquad (2)$$

$$r_{t} = \mathrm{Concat}(h_{t}, y_{t-1}). \qquad (3)$$

For the position aware attention module, a position embedding layer $P = [p_{1}, p_{2}, \dots, p_{t^{'}}]$ is learnt for querying, where the index $t^{'}$ denotes the decoding token id. Following [robustscan], a position enhanced feature $F^{'}$ is used as the key, and the original $F$ is used as the value:

$$q_{t} = \mathrm{Attention}(p_{t}, F^{'}, F), \qquad (4)$$

in which $q_{t}$ is the expected position queried visual-only feature.

Now, for time step $t$, we have the linguistic enhanced feature $r_{t}$, the original visual-only context $a_{t}$, and the position queried visual-only feature $q_{t}$. Finally, an Adaptive Gated Fusion (AGF) block is proposed to dynamically merge and balance the visual and linguistic information. Specifically, for each decoding step, AGF automatically learns a channel-wise gate $g_{t}$ to merge these features:

$$g_{t} = \mathrm{Sigmoid}(W_{m} [r_{t}; a_{t}; q_{t}]), \qquad (5)$$

(6)

in which $W_{m}$ and $W_{o}$ are learnable FC layers. Then, the final output token for step $t$ can be obtained by

$$y_{t} = \mathrm{Softmax}(\mathrm{MLP}(o_{t})), \qquad (7)$$

Here $\mathrm{MLP}(\cdot)$ denotes a single-layer or two-layer fully connected neural network.
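The gating step can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the gate follows Eq. 5, but the exact form of the fused feature $o_{t}$ is not recoverable from this text, so the sketch assumes $o_{t} = W_{o}(g_{t} \odot [r_{t}; a_{t}; q_{t}])$. The helper names (`linear`, `random_matrix`, `agf_step`), the bias-free layers, and the single-layer classifier `w_cls` standing in for the MLP are all hypothetical.

```python
import math
import random

def linear(weight, vec):
    """Bias-free fully connected layer; weight is a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def agf_step(r_t, a_t, q_t, w_m, w_o, w_cls):
    feats = r_t + a_t + q_t                           # concatenation [r_t; a_t; q_t]
    gate = [sigmoid(z) for z in linear(w_m, feats)]   # Eq. 5: channel-wise gate
    # ASSUMPTION: the gated features are projected by W_o to form o_t.
    o_t = linear(w_o, [g * f for g, f in zip(gate, feats)])
    y_t = softmax(linear(w_cls, o_t))                 # Eq. 7 with a single-layer MLP
    return gate, o_t, y_t

rng = random.Random(0)
r_t, a_t, q_t = [0.2, -0.1], [0.5, 0.3], [-0.4, 0.1]
dim = len(r_t) + len(a_t) + len(q_t)
w_m = random_matrix(dim, dim, rng)   # square so the gate has one value per channel
w_o = random_matrix(4, dim, rng)
w_cls = random_matrix(5, 4, rng)     # 5 hypothetical output classes
gate, o_t, y_t = agf_step(r_t, a_t, q_t, w_m, w_o, w_cls)
print(gate, y_t)
```

Each gate value lies in (0, 1), so a channel from the linguistic feature or from either visual-only feature can be suppressed or passed through independently at every decoding step.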
TransD. The second branch in our framework is formed by a naive transformer decoder [atten]. Given the encoded map $F$ and a learned position embedding $Q^{'}$ similar to Eq. 4, the final features $O^{'} = [o_{1}^{'}, \dots, o_{t}^{'}]$ and the outputs $Y^{'} = [y_{1}^{'}, \dots, y_{t}^{'}]$ can be formulated as:

$$O^{'} = \mathrm{TransD}(Q^{'}, F), \qquad (8)$$

$$Y^{'} = \mathrm{Softmax}(\mathrm{MLP}(O^{'})), \qquad (9)$$

where $\mathrm{TransD}(\cdot)$ represents a stack of transformer decoder layers, each with self attention and cross attention.
Mutual Decoding. As mentioned before, we duplicate each branch and construct two decoding targets during training and testing. Specifically, for one target string sequence $S_{L2R} = [s_{1}, s_{2}, \dots, s_{L}]$, we reverse it to $S_{R2L} = [s_{L}, \dots, s_{2}, s_{1}]$, which is used as the other target. As shown in Fig. 0(a), both VLAD and TransD are instantiated twice and supervised by $S_{L2R}$ and $S_{R2L}$, respectively. We define VLAD's output distributions as $Y_{L2R}$ and $Y_{R2L}$, and TransD's output distributions as $Y_{L2R}^{'}$ and $Y_{R2L}^{'}$. Our main loss is:

$$L_{main} = \mathrm{CE}(S_{L2R}, Y_{L2R}) + \mathrm{CE}(S_{R2L}, Y_{R2L}) + \mathrm{CE}(S_{L2R}, Y_{L2R}^{'}) + \mathrm{CE}(S_{R2L}, Y_{R2L}^{'}), \qquad (10)$$

where $\mathrm{CE}(gt, pred)$ is a standard cross entropy loss.
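The target construction and the four-term main loss can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, end-of-sequence tokens are omitted, and the cross entropy is averaged over tokens (the reduction is not stated in the paper).

```python
import math

def make_targets(word, charset):
    """Return the L2R token ids and their reversal as the R2L target."""
    l2r = [charset.index(c) for c in word]
    return l2r, l2r[::-1]

def cross_entropy(target_ids, step_probs):
    """Mean negative log-likelihood of the targets under per-step distributions."""
    return -sum(math.log(p[t]) for t, p in zip(target_ids, step_probs)) / len(target_ids)

def main_loss(s_l2r, s_r2l, y_l2r, y_r2l, y_l2r_td, y_r2l_td):
    """Eq. 10: two CE terms for the VLAD heads plus two for the TransD heads."""
    return (cross_entropy(s_l2r, y_l2r) + cross_entropy(s_r2l, y_r2l)
            + cross_entropy(s_l2r, y_l2r_td) + cross_entropy(s_r2l, y_r2l_td))

charset = "abc"
s_l2r, s_r2l = make_targets("cab", charset)
uniform = [[1.0 / 3] * 3 for _ in range(3)]   # untrained heads: uniform predictions
loss = main_loss(s_l2r, s_r2l, uniform, uniform, uniform, uniform)
print(s_l2r, s_r2l, loss)
```

With uniform predictions over three characters, each cross entropy term equals $\log 3$, so the main loss is $4 \log 3$.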

In order to prevent our model from overfitting a single unidirectional language prior, we apply a bidirectional mutual learning strategy: within the same branch, the L2R and R2L heads distill from each other on every shared token. Hence, two cross Kullback-Leibler Divergence (KLD) losses with stop gradient are used:

	$L_{m u t} (Y_{L 2 R}, Y_{R 2 L})$
				(11)

where $KL(p \| q) = \sum_{i=1}^{N} p(x_{i}) \log \frac{p(x_{i})}{q(x_{i})}$ denotes the KLD function, and $RS(\cdot)$ denotes a sequence reverse operation followed by a stop gradient layer. The overall loss of our system is:

$$L_{total} = L_{main} + λ \cdot L_{mut}(Y_{L2R}, Y_{R2L}) + λ \cdot L_{mut}(Y_{L2R}^{'}, Y_{R2L}^{'}), \qquad (12)$$

and $λ$ is a hyper parameter.
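A small numeric sketch of the mutual loss and the total loss follows. It is illustrative only and rests on assumptions: the reversed sequence of the other head is treated as the teacher and placed first in the KL term (the argument order is not recoverable from this text), the loss is averaged over tokens, and plain Python has no gradients, so the stop gradient is only noted in a comment. Function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def mutual_loss(y_l2r, y_r2l):
    """Cross KL between the two heads of one branch.

    y_l2r / y_r2l: lists of per-step distributions. Reversing one sequence
    aligns step i of both lists to the same character. In an autograd
    framework the reversed teacher would also be detached (stop gradient).
    """
    rs_r2l = y_r2l[::-1]   # RS(Y_R2L): teacher for the L2R head
    rs_l2r = y_l2r[::-1]   # RS(Y_L2R): teacher for the R2L head
    loss_l2r = sum(kl_divergence(t, s) for t, s in zip(rs_r2l, y_l2r))
    loss_r2l = sum(kl_divergence(t, s) for t, s in zip(rs_l2r, y_r2l))
    return (loss_l2r + loss_r2l) / len(y_l2r)

def total_loss(main, mut_vlad, mut_transd, lam=0.4):
    """Eq. 12 with the reported weight of 0.4 as the default."""
    return main + lam * mut_vlad + lam * mut_transd

y_l2r = [[0.9, 0.1], [0.2, 0.8]]
agree = mutual_loss(y_l2r, y_l2r[::-1])                   # heads agree on every token
disagree = mutual_loss(y_l2r, [[0.5, 0.5], [0.5, 0.5]])   # heads disagree
print(agree, disagree, total_loss(1.0, disagree, disagree))
```

When the R2L head predicts exactly the reversed L2R distributions, the mutual term vanishes; any disagreement on a shared token makes it positive.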

VLAMD’s inference process is shown in Fig. 0(b). Firstly, the two branches, VLAD and TransD, run a joint co-beam search twice, yielding a left-to-right N-best list and a right-to-left N-best list. Then, using a cross teacher forcing scheme, our system applies a mutual decoding method:

(13)

According to Eq. 13, a decoding path $y_{pred} = [y_{pred}^{1}, y_{pred}^{2}, \dots, y_{pred}^{T}]$ from the L2R joint co-beam search result is sent to the R2L joint co-beam search module, and vice versa. We found this effective for obtaining more robust results not only on OOV sets but also in IV+OOV settings.
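One plausible reading of this cross-scoring step is sketched below. It is not the authors' formula: the scoring rule (adding the beam log-probability from one direction to the teacher-forced log-probability from the other) is an assumption, and `score_l2r` and `score_r2l` are hypothetical callables standing in for the joint VLAD and TransD heads of each direction.

```python
def mutual_decode(l2r_nbest, r2l_nbest, score_l2r, score_r2l):
    """Pick the best string after cross-scoring both N-best lists.

    l2r_nbest / r2l_nbest: lists of (text, log_prob), text in reading order.
    score_l2r / score_r2l: hypothetical callables returning the teacher-forced
    log-probability of a reading-order string under that direction's heads.
    """
    candidates = {}
    for text, logp in l2r_nbest:
        total = logp + score_r2l(text)   # L2R path re-scored by the R2L heads
        candidates[text] = max(total, candidates.get(text, float("-inf")))
    for text, logp in r2l_nbest:
        total = logp + score_l2r(text)   # R2L path re-scored by the L2R heads
        candidates[text] = max(total, candidates.get(text, float("-inf")))
    return max(candidates, key=candidates.get)

# Toy log-probabilities standing in for real model scores.
l2r_table = {"cafe": -0.5, "cale": -0.4, "cofe": -3.0}
r2l_table = {"cafe": -0.6, "cale": -2.0, "cofe": -0.9}
l2r_nbest = [("cale", -0.4), ("cafe", -0.5)]
r2l_nbest = [("cafe", -0.6), ("cofe", -0.9)]
best = mutual_decode(
    l2r_nbest, r2l_nbest,
    score_l2r=lambda text: l2r_table.get(text, -10.0),
    score_r2l=lambda text: r2l_table.get(text, -10.0),
)
print(best)
```

In this toy case the L2R search alone would prefer "cale", but the R2L heads assign it a low score, so the cross-scored result is "cafe".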

Rank	Method	OOV CRW (%)	IV+OOV CRW (%)
1	VLAMD	59.61	70.31
2	SCATTER	59.45	69.58
3	dat	59.03	69.90
4	MaskOCR	58.65	69.63
5	Summer	58.06	68.77

Table 1: OOV-ST Challenge results [oovchallengeresults] on the test set. We list the top 5 methods according to the official OOV CRW metric. Submissions from the same affiliation are filtered out. Our VLAMD obtains the best performance on both the OOV and IV+OOV metrics.

4 Experiments

Datasets. We evaluate our method on OOV-ST Challenge [oovchallenge] mentioned in Sec. 1, which contains fine-grained validation and test sets with OOV or IV tags. For training, OOV-ST contains a total of 4.29M real cropped line images collected from several public datasets [icdar15, textocr, mlt19, hiertext, openimagev5text], and a corpus of 90K common words [syntext] for synthetising new data. Besides, there are 113K cropped lines for validation and 313K cropped lines for testing, respectively.

In our experiments, we do not synthesise any data ourselves. Only subsets of Synth90K [syntext] and SynthText [synth2016] are used for a short pretraining [preSync], where both are public synthetic datasets using the 90K words. Moreover, we filter out training samples that contain out-of-dictionary characters, leaving 3.98M real cropped lines for training.
Training details. For simplicity, all images are resized to 32x100 for both training and testing, and we do not use any data augmentation tricks. We first pretrain the backbone using a simple decoder on synthetic data for 4 epochs, and then finetune it on the 3.98M real image lines using the proposed VLAMD for 10 epochs. An Adam optimizer with multistep learning rate decay is adopted, with the base learning rate set to 1e-4, weight decay set to 1e-5, and batch size set to 128. $λ$ in Eq. 12 is set to 0.4 after a grid search.
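The character filtering step can be sketched as below. This is a minimal illustration: the function name, the sample format of (image path, label) pairs, and the example character set are hypothetical, since the paper does not list its dictionary.

```python
import string

def filter_in_charset(samples, charset):
    """Keep only samples whose label uses characters from the dictionary."""
    allowed = set(charset)
    return [(path, label) for path, label in samples
            if label and set(label) <= allowed]

# Hypothetical dictionary: ASCII letters and digits only.
charset = string.ascii_letters + string.digits
samples = [
    ("img_0.jpg", "Exit"),
    ("img_1.jpg", "café"),     # contains a character outside the dictionary
    ("img_2.jpg", "A38"),
    ("img_3.jpg", ""),         # empty labels are dropped as well
]
kept = filter_in_charset(samples, charset)
print(kept)
```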

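Method	BS	VLAD	TD	MT	ES	OOV CRW (%)	IV+OOV CRW (%)
RobustScanner [robustscan]	-	-	-	-	-	60.36	71.85
VLAMD	✓	-	-	-	-	60.42	72.04
VLAMD	✓	✓	-	-	-	60.82	72.35
VLAMD	✓	✓	✓	-	-	61.84	73.42
VLAMD	✓	✓	✓	✓	-	62.61	73.92
VLAMD	✓	✓	✓	✓	✓	64.85	75.83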

Table 2: Ablation study on the OOV-ST validation set. Both the OOV and IV+OOV metrics used in the challenge are evaluated here. BS means our baseline, and VLAD, TD, and MT denote our proposed VLAD, TransD, and bidirectional training with mutual decoding strategy respectively. ES denotes our 4-model ensemble submitted to OOV-ST, in which different seeds and heads are used.

Main results. Following most works, the Correctly Recognized Words (CRW) rate is used as the evaluation metric. We compare VLAMD with other participants in Tab. 1. Note that other methods seem to trade off OOV against IV performance, while our VLAMD obtains 1st place on both settings.
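For reference, the CRW rate can be computed as in the sketch below. It assumes case-sensitive exact matching and assumes that the IV+OOV score pools all words together; the official challenge protocol may differ on both points, and the function name is hypothetical.

```python
def crw(predictions, ground_truths, is_oov):
    """Return CRW (%) on the OOV subset and on all words pooled together."""
    if not (len(predictions) == len(ground_truths) == len(is_oov)):
        raise ValueError("inputs must have the same length")
    correct = [p == g for p, g in zip(predictions, ground_truths)]
    oov_hits = [c for c, oov in zip(correct, is_oov) if oov]

    def rate(hits):
        return 100.0 * sum(hits) / len(hits) if hits else 0.0

    return {"OOV": rate(oov_hits), "IV+OOV": rate(correct)}

predictions   = ["hello", "xk42z", "world", "qwzx", "text"]
ground_truths = ["hello", "xk42z", "word",  "qwzy", "text"]
is_oov        = [False,   True,    False,   True,   False]
scores = crw(predictions, ground_truths, is_oov)
print(scores)
```

Here one of the two OOV words and three of the five words overall are recognized exactly, giving 50.0 and 60.0 respectively.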

We show the effectiveness of each module in VLAMD on the validation set, see Tab. 2. Firstly, we reproduce RobustScanner [robustscan] as a strong baseline for comparison, using the public code in [mmocr2021]. VLAMD’s baseline in Tab. 2 is a simple decoder based on Eq. 3, using [LAS, COVERAGE]. Then, each module designed in Sec. 3 is added to our baseline cumulatively for the ablation study. As shown, even our baseline achieves performance comparable to the SOTA OOV method, and the proposed VLAD, TransD, and Mutual Decoding modules are all effective. Finally, our submitted VLAMD in Tab. 1 is an ensemble of 4 models, shown in the last line of Tab. 2.

5 Conclusion

This paper briefly summarizes the technical details of our VLAMD submitted to the OOV-ST Challenge. Experimental results have shown the effectiveness of our proposed method in both OOV and IV settings. We have an improved version of VLAMD with more experiments, which will be extended to a long paper soon.
