Lifelong Learning for Question Answering with Hierarchical Prompts

Yi Dai,¹ Hao Lang, ² Yinhe Zheng, ³ Fei Huang, ² Luo Si, ² Yongbin Li, ²

Abstract

QA models with lifelong learning (LL) abilities are important for practical QA applications, and architecture-based LL methods are reported to be an effective implementation for these models. However, it is non-trivial to extend previous approaches to QA tasks since they either require access to task identities in the testing phase or do not explicitly model samples from unseen tasks. In this paper, we propose Diana: a dynamic architecture-based lifelong QA model that tries to learn a sequence of QA tasks with a prompt enhanced language model. Four types of hierarchically organized prompts are used in Diana to capture QA knowledge from different granularities. Specifically, we dedicate task-level prompts to capture task-specific knowledge to retain high LL performances and maintain instance-level prompts to learn knowledge shared across different input samples to improve the model’s generalization performance. Moreover, we dedicate separate prompts to explicitly model unseen tasks and introduce a set of prompt key vectors to facilitate knowledge sharing between tasks. Extensive experiments demonstrate that Diana outperforms state-of-the-art lifelong QA models, especially in handling unseen tasks.

\affiliations

¹ Tsinghua University, Beijing, China
² Alibaba Group, Beijing, China
³ Lingxin AI, Beijing, China
dai-y21@mails.tsinghua.edu.cn, langhao@ymail.com, zhengyinhe1@163.com, f.huang@alibaba-inc.com, luo.si@alibaba-inc.com, shuide.lyb@alibaba-inc.com

1 Introduction

Question Answering (QA) is important for advanced AI applications Yang et al. (2019), and current QA models generally obtain high performance when learning fixed QA tasks Soares and Parreiras (2020). However, these models may not be practical in real applications, where models are usually required to support new QA tasks. Training a learned QA model on new tasks will lead to the catastrophic forgetting issue, i.e., models forget previously learned knowledge when learning new tasks French (1999). Without resorting to the expensive re-training process, Lifelong Learning (LL) enables the QA model to continuously learn new tasks while preserving previously learned knowledge Thrun and Mitchell (1995). Therefore, developing QA models with LL abilities in practical applications is important.

An effective implementation of LL is the architecture-based approach Chen et al. (2016); Rusu et al. (2016); Fernando et al. (2017); Wiwatcharakoses and Berrar (2020); Qin and Joty (2022), in which task-specific components are maintained to preserve learned knowledge Mancini et al. (2018). Recently, some studies proposed to convert QA tasks into a unified language modeling (LM) format Khashabi et al. (2020), and these studies have achieved promising QA performances with the help of prompts Brown et al. (2020); Jiang et al. (2020) and adapter Houlsby et al. (2019) modules. Some works have also tried to extend these approaches to build lifelong QA models by sharing the same underlying pre-trained language model (PLM) between different tasks Su et al. (2020). These models can learn a sequence of QA tasks with different QA formats Zhong et al. (2022).

Figure 1: An overview of Diana. A pre-trained language model is used to tackle QA tasks in different formats with hierarchically organized prompts.

However, applying the above approaches in building lifelong QA models has two major problems. First, most above architecture-based LL approaches need to access the task identity of the testing sample Wang et al. (2022, 2022). This setting limits the application of lifelong QA models because task identities for input questions are unavailable in most practical scenarios Geigle et al. (2021); Second, most previous LL models are tested only on seen tasks and do not consider testing samples from unseen tasks Boult et al. (2019); Mundt et al. (2020). This setting is violated in practical QA applications because users usually pose questions that do not belong to previously seen tasks.

There are two kinds of methods to tackle above problems:

(1) Some models organize their components at the task-level and use a classifier to determine the task identity for each testing sample Wortsman et al. (2020); Madotto et al. (2021), i.e., each task is assigned with a separate component that will be activated based on the predicted task identity (Figure 2a). This scheme generally yields high LL performances if the task identity is correctly predicted because each task is modeled by a dedicated component. However, these models usually exhibit poor generalization performance in QA applications because it is infeasible to determine task identities for questions that do not belong to any previously seen QA tasks. Moreover, the previously learned knowledge is isolated in these task-specific components and can not be shared to tackle questions from unseen tasks Rogers et al. (2021).

(2) Some models are organized at the instance-level and use a dynamic architecture for each input sample Wiwatcharakoses and Berrar (2020), i.e., a pool of fine-grained components are maintained and dynamically combined in each forward pass based on the input instance (Figure 2b). This scheme avoids the inconvenience of explicitly determining the task identity for each testing sample Wang et al. (2022) and helps LL models tackle unseen tasks since the learned knowledge is distributed into different model components Träuble et al. (2022). However, these models usually yield sub-optimal LL performance because there are no dedicated components for each task to capture task-specific knowledge Wang et al. (2022).

In this study, we combine the advantages of the above two categories and propose Diana: a dynamic architecture-based lifelong QA model. We follow previous approaches to convert QA tasks into a unified LM format and propose to learn these tasks using a prompt-enhanced PLM. Four types of hierarchically organized prompts are maintained in Diana (i.e., General Prompt, Format Prompt, Task Prompt, and Meta Prompt) to capture QA knowledge from different granularities. Specifically, the general prompt is used for all QA tasks, and the format prompts are shared between tasks in the same QA format. Moreover, a task prompt is assigned for each incoming QA task, and a pool of meta prompts are maintained and dynamically combined when handling each sample. In this way, Diana can better generalize to unseen tasks while achieving high LL performances since its components are organized at both task-level and instance-level. Further, we allocate separate prompts for unseen tasks and learn a key vector for each task prompt and meta prompt to better share knowledge between different tasks so that samples from unseen tasks can be explicitly modeled.

We perform extensive experiments on 11 benchmark QA tasks across three different formats and further test the generalization performance of our model on three unseen tasks. Results indicate that Diana outperforms state-of-the-art (SOTA) baselines on all benchmarks, especially when generalizing to unseen tasks. Our main contributions are:

We propose Diana: a novel architecture-based lifelong QA model that uses four types of hierarchically organized prompts to capture knowledge in different granularities. Both task-level and instance-level components are maintained in Diana so that it can better generalize to unseen tasks while achieving high LL performance.
We are the first to explicitly model unseen tasks in lifelong QA models. Specifically, we dedicate separate prompts for unseen tasks, and build prompt keys to facilitate knowledge sharing between different tasks.
Extensive experiments show that Diana outperformed SOTA baselines in building lifelong QA models.

2 Related Work

2.1 Lifelong Learning

LL aims at incrementally acquiring new knowledge without catastrophically forgetting previously learned ones. Generally, three categories of methods are proposed: 1. Rehearsal-based methods Rebuffi et al. (2017); Shin et al. (2017); Sun et al. (2019a); Chaudhry et al. (2019a); Buzzega et al. (2020) preserve past knowledge by replaying data from learned tasks; 2. Regularization-based methods Kirkpatrick et al. (2017); Zenke et al. (2017); Li and Hoiem (2017); Ritter et al. (2018); Farajtabar et al. (2020) consolidate model parameters that are important to previous tasks by introducing additional regularization terms; 3. Architecture-based methods Chen et al. (2016); Rusu et al. (2016); Fernando et al. (2017); Maltoni and Lomonaco (2019) add task-specific parameters to an existing base model for each task to prevent forgetting.

Experiment settings of LL methods can be generally classified into three categories based on whether the task identity is provided for each testing sample and whether it must be inferred van de Ven and Tolias (2019), i.e., task-incremental learning Mallya and Lazebnik (2018); Ebrahimi et al. (2020), domain-incremental learning Pu et al. (2021); Gao et al. (2022), and class-incremental learning Zhang et al. (2020). In this work, we focus on the domain-incremental learning setting, where task identity is not provided for each testing sample. One line of methods in this category attempt to detect the task identity for each input sample Madotto et al. (2021). However, these methods fail to generalize to unseen tasks Wang et al. (2022). Another line of methods try to build a dynamic architecture for each input sample, for example maintaining a pool of prompts that can be dynamically combined Wang et al. (2022). However, these methods yield sub-optimal performance since no task-specific parameters are used. Our model Diana is the first attempt to take advantage of the two aforementioned types of methods.

2.2 Domain Generalization

Domain Generalization (DG) aims to learn a model from several seen domains that will generalize well on unseen testing domains Muandet et al. (2013); Xu et al. (2014); Ghifary et al. (2015); Li et al. (2018); Zhao et al. (2020). An effective approach for DG maintains separate components to model domain-specific knowledge and open domain knowledge, respectively Khosla et al. (2012); Li et al. (2017, 2019). Our method allocates additional prompts for unseen tasks to improve LL generalization performance.

2.3 Unified Question Answering

Question Answering (QA) tasks have been posed in different formats Sun et al. (2019b), such as Extractive, Abstractive, and Multiple-Choice. To encourage knowledge sharing across these formats, existing approaches attempt to build a unified QA model by casting different QA tasks into a unified text-to-text format Khashabi et al. (2020); McCann et al. (2019). The most similar previous work compared to ours is ProQA Zhong et al. (2022), in which different QA tasks are unified and a set of structured prompts are used to learn these tasks. However, LL experiments in ProQA only consider two tasks and assume task identities for testing samples are available, while our model is designed to tackle more tasks without task identities for testing samples.

3 Method

3.1 Task Formulation

In this study, we aim to sequentially learn $N$ QA tasks $T_{1}, \dots, T_{N}$ that are presented in $L$ different formats $F_{1}, \dots, F_{L}$ , $(L \leq N)$ . Each task $T_{i}$ is presented in a specific QA format $F_{j}$ , and each training sample of $T_{i}$ is a tuple of a context $C$ , a question $Q$ , and an answer $A$ : ( $C, Q, A$ ). Note that the format of each task can be easily inferred from an input context and question pair ( $C, Q$ ). Our model $g_{θ}$ is built to predict $A$ based on $C$ and $Q$ . We also consider a more challenging open domain lifelong learning setting, i.e., the model needs to predict answers for unseen tasks. Therefore, we collect another $N^{'}$ unseen tasks $T_{N + 1}, \dots, T_{N + N^{'}}$ that are only used for testing. We assume that all task identities of inputs are not available in the testing phase.

Different prompt organization schemes.
(a) Each task is assigned with a separate prompt and the closest prompt to the query vector is activated.
(b) A pool of prompts are maintained and the top- — Figure 2: Different prompt organization schemes. (a) Each task is assigned with a separate prompt and the closest prompt to the query vector is activated. (b) A pool of prompts are maintained and the top- $M^{'}$ closest prompts to the query vector are activated and combined. (c) Four kinds of prompts are hierarchically organized and combined based on the distances between the query vector and prompt keys.

3.2 Framework of Hierarchical Prompts

We follow previous approaches to serialize the context $C$ , question $Q$ , and answer $A$ into text sequences Khashabi et al. (2020); Zhong et al. (2022) and use a prompt-enhanced encoder-decoder model $g_{θ}$ to learn each task $T_{i}$ in Diana. We use soft prompts Liu et al. (2021); Lester et al. (2021); Vu et al. (2022) in our study, i.e., each prompt is a sequence of trainable embeddings that are randomly initialized and optimized when learning each incoming task. For each training sample $(C, Q, A)$ from $T_{i}$ , we first construct a prompt $P (C, Q)$ based on $C$ and $Q$ . Then the encoder takes in the concatenation of $P (C, Q)$ , $C$ , and $Q$ and the decoder predicts $A$ , i.e., $A = g_{θ} ([P (C, Q); C; Q])$ , in which “ $[;]$ ” denotes the sequence concatenation operation.

A hierarchical prompt structure is maintained in Diana to construct $P (C, Q)$ (see Figure 2c), i.e., $P (C, Q)$ is a concatenation of four kinds of prompts $[P_{g}; P_{f} (F_{j}); P_{t} (T_{i}); P_{m} (C, Q)]$ , in which $F_{j}$ is the format of $T_{i}$ , $P_{g}$ is a general prompt, $P_{f} (F_{j})$ is a format prompt, $P_{t} (T_{i})$ is a task prompt and $P_{m} (C, Q)$ is a combined meta prompt. Specifically, To encode task-agnostic knowledge of QA, we use one general prompt $P_{g}$ for all incoming tasks; To capture knowledge that is shared by tasks in the same format, we assign a format-specific prompt $P_{f} (F_{j})$ for tasks that are in the same format $F_{j}$ , ( $j$ = $1, \dots, L$ ); To learn knowledge associated with each task, we allocate a separate task prompt $P_{t} (T_{i})$ for each task $T_{i}$ , ( $i$ = $1, \dots, N$ ); To capture fine-grained knowledge distributed in each input sample $(C, Q, A)$ , we maintain $M$ meta prompts ${P_{m}^{i}}_{i = 1}^{M}$ and dynamically combine these prompts based on $C$ and $Q$ to obtain $P_{m} (C, Q)$ . Moreover, to explicitly model samples from unseen tasks, we enlarge the set of task prompts with $L$ extra prompts $_{t} (F_{1})$ , $\dots$ , $_{t} (F_{L})$ , in which each prompt $_{t} (F_{j})$ models the unseen task for a particular format $F_{j}$ .

To tackle the problem of lacking task identities in the testing phase, we associate a key vector $k_{t} (T_{i})$ and $k_{m}^{j}$ to each task prompt $P_{t} (T_{i})$ and meta prompt $P_{m}^{j}$ , respectively. A fixed query function $h$ is built to map a context $C$ and question $Q$ to a query vector $q = h (C, Q)$ . Specifically, $h$ is initialized by a fixed PLM and not tuned in the training process so that the semantic encoded in the query vector $q$ does not drift when learning different tasks. For a given input $C$ and $Q$ , the corresponding task prompt $P_{t} (T_{i})$ in $P (C, Q)$ can be determined by retrieving the most similar task key vector $k_{t} (T_{i})$ using $q$ , and the combined meta prompt $P_{m} (C, Q)$ in $P (C, Q)$ can be constructed by based on the similarities between meta key vectors $k_{m}^{i}$ and $q$ . Note that we do not allocate key vectors for these $L$ extra task prompts $_{t} (F_{1})$ , $\dots$ , $_{t} (F_{L})$ that are designed to model unseen tasks. These prompts are used in Diana with an explicit unseen task determination process (see more details in Section 3.4).

A two-stage learning process is introduced in Diana when handling each training sample ( $C, Q, A$ ). The first stage focuses on learning a representation space for prompt keys so that we can determine proper prompts to construct $P (C, Q)$ by matching the query vector and each prompt key. The second stage optimizes the constructed prompt $P (C, Q)$ and the backbone language model. These two stages are detailed in the following sections.

3.3 Key Vector Space Learning

We first optimize key vectors assigned to each task prompt and meta prompt in Diana so that we can construct the concatenated prompt $P (C, Q)$ based on the input context $C$ and question $Q$ . Note that these key vectors are only used to determine the task prompt and meta prompt in $P (C, Q)$ because the general prompt $P_{g}$ is shared by all tasks in Diana and the format prompt $P_{f} (F_{j})$ can be determined based on the format of $C$ and $Q$ directly.

Task Prompt Keys

help to determine the task prompt corresponding to a given input context $C$ and question $Q$ . Specifically, we first calculate the query vector $q = h (C, Q)$ and determine the most similar task prompt key $k_{t} (T_{i})$ to $q$ . The task prompt $P_{t} (T_{i})$ associated with $k_{t} (T_{i})$ is used to construct $P (C, Q)$ . Ideally, the key vector $k_{t} (T_{i})$ for a task prompt $P_{t} (T_{i})$ should be located near the query vectors of samples from task $T_{i}$ and distant to the query vectors of samples from other tasks $T_{j}$ ( $j \neq i)$ . Therefore, when learning each task $T_{i}$ , we maintain a small memory buffer $M$ for samples from previously learned tasks $T_{j}$ , ( $j < i$ ), and design the following exponential angular triplet loss Ye et al. (2021) for each training sample $C$ and $Q$ from $T_{i}$ :

	$L_{t} =$	$e x p (\| \| h (C, Q), k_{t} (T_{i}) \| \| +$		(1)
		$m a x (1 - \| \| h (C_{n}, Q_{n}), k_{t} (T_{i}) \| \|, 0)),$		(1)

in which the operator $| | \cdot, \cdot | |$ determines the distance between two input vectors (here we use cosine distance), ( $C_{n}$ , $Q_{n}$ ) is a negative sample extracted from the memory buffer $M$ :

(C_{n}, Q_{n}) = a r g m i n (C^{'}, Q^{'}) \in M | | h (C^{'}, Q^{'}), k_{t} (T_{i}) | | .

(2)

Figure 3: Illustration of the diversity and locality property. (a) The diversity property distributes key vectors to the whole space. (b) The locality property cluster similar keys to facilitate knowledge sharing. (c) Diana aims to achieve a balance between the diversity and locality

Meta Prompt Keys

help to combine each meta prompt $P_{m}^{i}$ to produce $P_{m} (C, Q)$ . Specifically, for each input context $C$ and question $Q$ , we select $M^{'}$ meta prompt keys that are closest to the query vector $q = h (C, Q)$ , and concatenate these $M^{'}$ associated meta prompts to obtain $P_{m} (C, Q)$ . Intuitively, the knowledge associated with ( $C, Q, A$ ) is distributed in these $M^{'}$ meta prompts.

When learning meta prompt keys, we expect the distribution of these keys to balance two properties: diversity and locality (Figure 3). Specifically, the diversity property aims to distribute meta prompt keys to the whole vector space so that every meta prompt can be involved in the training process. The locality property aims to group similar prompt keys to clusters so that the knowledge of each sample can be better shared. For each input $C$ and $Q$ , we propose the following loss to enforce the above two properties:

	$L_{m} = \sum i \in S (C, Q) m a x (0, \| \| k_{m}^{i}, h (C, Q) \| \| - η) +$		(3)
	$\sum i, j \in S (C, Q) m a x (0, γ - \| \| k_{m}^{i}, k_{m}^{j} \| \|) / {M^{'}}^{2},$		(3)

where $S (C, Q)$ is the index set of these $M^{'}$ selected meta prompt keys that are closest to $h (C, Q)$ , $η$ and $γ$ are scalar hyper-parameters to control the distance margin. Specifically, the first term in the above equation enforces the locality property by pulling these selected $M^{'}$ meta prompt keys around the query vector. The second term enforces the diversity property by pushing these meta prompt keys away from each other to occupy the whole vector space.

Note that the above loss in Eq. 3 only involves individual query vectors. Thus meta prompt keys learned using this loss may not be diverse enough since samples from previously learned tasks are not considered. In this study, we extend Eq. 3 to better shape the distributions of meta prompt keys with the help of the memory buffer $M$ , in which samples from previously learned tasks are contained. Specifically, when learning the task $T_{i}$ , we first calculate query vectors for samples in $M$ and then group these query vectors into $B$ clusters (we set $B = 5 i$ in our experiments, where $i$ is the number of received tasks). Centroids of these $B$ clusters are calculated as $c_{1}, \dots, c_{B}$ . For each $C$ and $Q$ from $M$ , the subsequent loss is optimized:

{L^{'}}_{m} = \sum i \in S (C, Q) m a x (0, | | k_{m}^{i}, c_{k} | | - η),

(4)

where $c_{k}$ is the cluster centroid to which $C$ and $Q$ belong. The above loss enforces the global diversity by scattering these meta prompt keys to each centroid.

3.4 Model Training

Scheduled Sampling of Task Prompts

When training Diana, we are given the task identity of each training sample to directly obtain the corresponding task prompt $P_{t} (T_{i})$ for $P (C, Q)$ . However, naively using these golden truth task identities leads to an exposure bias issue, i.e., task prompts used in the testing phase may not be correct since we need to infer the task identity for each testing sample.

In this study, we introduce a scheduled sampling process to tackle the above exposure bias issue when learning each task $T_{i}$ . Specifically, for a given sample ( $C, Q, A$ ) in the $k$ -th training step, we toss a coin and use the golden truth task identity with probability $ϵ_{k}$ , or use the task identity inferred based on task prompt keys with probability $1 - ϵ_{k}$ Bengio et al. (2015). Note that when starting to learn each task, the corresponding prompt key is not well optimized, and thus the selected task identity is not accurate. Therefore, we set the value of $ϵ_{k}$ to favor the golden truth task identity at the beginning (i.e., when $k$ is small) and gradually switch to the inferred task identity as the training proceeds (i.e., when $k$ is large), i.e., a linear decrement of $ϵ_{k}$ is scheduled:

ϵ_{k} = m a x (0, α - k β),

(5)

in which $α$ and $β$ are scalar hyper-parameters.

Note that lifelong QA models may encounter another source of exposure bias since we may receive inputs from unseen tasks in the testing phase. In this study, we use these $L$ extra prompts ${^P}_{t} (F_{1}), \dots, {^P}_{t} (F_{L})$ to explicitly model unseen tasks. Specifically, for each training sample ( $C, Q, A$ ), we first determine its task format $F_{j}$ based on $C$ and $Q$ , and allocate a small probability to use ${^P}_{t} (F_{j})$ as its task prompt in $P (C, Q)$ . In this way, we can capture general knowledge about all tasks for a given format in ${^P}_{t} (F_{j})$ and expect this knowledge can better generalize to unseen tasks.

Train with QA Loss

For each training sample ( $C, Q, A$ ), we first construct the prompt $P (C, Q)$ for $C$ and $Q$ , and then optimize $P (C, Q)$ together with the encoder-decoder model $g_{θ}$ using the following loss:

L_{Q A} = - l o g g_{θ} (A | [P (C, Q); C; Q]) .

(6)

The overall loss that we optimize for Diana is:

L = L_{m} + {L^{'}}_{m} + L_{t} + L_{Q A} .

(7)

After learning each task $T_{i}$ , we select a small number of samples from $T_{i}$ based on the query vector of each sample to update the memory $M$ . This selection process aims to maintain diverse samples in $M$ . More details are in Appendix B.

3.5 Model Inference

In the testing phase, we first determine the prompt $P (C, Q)$ for each input context $C$ and question $Q$ , and then predict the answer sequence $A$ using the learned model $g_{θ}$ .

Adaptive Decision Boundary for Each Task

When selecting the proper task prompt in the testing phase, we propose constructing an adaptive decision boundary (ADB) for each task $T_{i}$ to determine whether the input belongs to unseen tasks. Specifically, for $T_{i}$ , a scalar boundary $δ_{i}$ is constructed following the approach proposed by Zhang et al. (2021). An input context $C$ and question $Q$ are regarded as a sample from unseen tasks if its query vector $h (C, Q)$ falls outside the boundary of every task, i.e.,

| | h (C, Q), k_{t} (T_{i}) | | > δ_{i}, \forall i \in [1, N] .

(8)

For samples from unseen tasks, we use the prompt ${^P}_{t} (F_{j})$ to construct $P (C, Q)$ , where $F_{j}$ is the format of $C$ and $Q$ .

Answer Prediction

The answer sequence $A$ is predicted from the prompt-enhanced encoder-decoder model with a greedy decoding process:

A = argmax A^{'} g_{θ} (A^{'} | [P (C, Q); C, Q]) .

(9)

4 Experiments

4.1 Datasets and Metrics

Datasets

We carry out experiments on 11 benchmark datasets covering 3 QA formats: (1) Extractive QA, including SQuAD Rajpurkar et al. (2016), NewsQA Trischler et al. (2017), and Quoref Dasigi et al. (2019); (2) Abstractive QA, including NarQA Kocisky et al. (2018), NQOpen Kwiatkowski et al. (2019), and Drop Dua et al. (2019); (3) Multiple-Choice QA, including RACE Lai et al. (2017), OBQA Mihaylov et al. (2018), MCTest Richardson et al. (2013), SIQA Sap et al. (2019), and Dream Sun et al. (2019b). We regard each dataset as an individual QA task and reserve $N^{'} = 3$ tasks as unseen tasks (Quoref, Drop, and Dream). Our model is trained on the rest of $N = 8$ seen tasks while tested on all 11 tasks. The task identity for each sample is not available when testing.

Evaluation Metrics

The evaluation of the above tasks follows Zhong et al. (2022). Specifically, we compute the accuracy of option selection for all Multi-Choice QA tasks and use Exact Match (EM) score for all Extractive QA tasks. Among Abstractive QA tasks, we use F1 score for Drop and NQOpen, and ROUGE-L Lin (2004) for NarQA.

When learning each task, we build a performance matrix $R \in R^{N \times (N + N^{'})}$ , where $R_{i, j}$ is the model performance on task $T_{j}$ after learning task $T_{i}$ . Based on $R$ , we compute the following metrics to evaluate the LL performance:

Average Performance $A_{N}$ , and $A_{N^{'}}$ is defined as the average performance of the final model on $N$ seen tasks and $N^{'}$ unseen tasks, respectively:

A_{N} = \frac{1}{N} N \sum j = 1 R_{N, j}, A_{N^{'}} = \frac{1}{N^{'}} N + N^{'} \sum j = N + 1 R_{N, j} .

(10)

Average Forget $F_{N}$ is defined as the average performance decrease of each task after it is learned:

F_{N} = \frac{1}{N - 1} N - 1 \sum j = 1 m a x i \in {1, \dots, N - 1} (R_{i, j} - R_{N, j}) .

(11)

In our experiments, we perform five runs with different random seeds and task orders. All reported scores of $A_{N}$ and $F_{N}$ are averages of these five runs. Moreover, we also report the averaged Final Performance $R_{N, j}$ of each task $T_{j}$ over these 5 runs. Ideally, we expect a strong lifelong QA model to yield high $A_{N}$ and $R_{N, j}$ , while obtaining low $F_{N}$ .

4.2 Implementation Details

We use T5-base Raffel et al. (2020) to initialize our encoder-decoder model, and set the lengths of soft prompts $P_{g}$ , $P_{f}$ , $P_{t}$ , $P_{m}$ to 20, 40, 40, 20, respectively. We use a fixed T5-base encoder with an average pooling layer to obtain the query vector. We maintain totally $M = 30$ meta prompts, and for each sample ( $C, Q$ ) we choose $M^{'} = 5$ meta prompts to construct $P_{m} (C, Q)$ . We use the AdamW Loshchilov and Hutter (2017) optimizer with a learning rate of 1e-4 and batch size of 64. Each task is trained for five epochs. We set $η = 0.15$ and $γ = 0.3$ in Eq.3 and $α = 0.9$ and $β = 3 e - 4$ in Eq.5. We maintain $50$ samples from each learned task in the memory $M$ . All experiments are performed on 4 V100 GPUs. See more details in Appendix A.

Task-ID	Methods	Buffer	$R_{N, j}$								$A_{N}$	$F_{N}$
in Test	Methods	Size	SQuAD	NewsQA	NarQA	NQOpen	RACE	OBQA	MCTest	SIQA	$A_{N}$	$F_{N}$
Available	ProQA	0	67.66	38.73	37.96	37.72	53.75	43.73	68.27	57.73	50.69	12.10
Available	ProQA+ER	50	71.20	40.17	41.94	39.00	57.09	47.00	77.94	57.67	54.00	7.27
Unavailable	Finetune	0	57.58	35.84	33.74	34.49	50.28	42.20	65.67	54.72	46.81	15.47
	EWC	0	59.84	36.44	34.88	35.14	50.54	43.43	66.52	55.68	47.81	14.55
	FLCB	0	58.73	36.97	34.27	34.90	51.63	41.53	66.60	55.39	47.50	14.98
	AdapterCL	0	59.64	37.31	37.42	36.70	49.57	41.80	66.67	55.54	48.08	13.29
	L2P	0	62.98	36.23	35.79	36.49	49.00	41.93	66.98	55.77	48.15	13.89
	DualPrompt	0	62.60	36.36	34.35	36.53	52.10	42.67	67.57	56.26	48.54	13.66
	ER	50	65.08	38.72	39.07	36.48	55.90	43.53	74.31	57.29	51.30	10.72
	DER++	50	67.08	39.03	39.91	36.93	56.42	44.13	74.77	57.77	52.01	10.05
	AFPER	50	68.14	40.79	40.16	38.89	55.08	46.60	75.33	56.52	52.69	9.28
	Diana w/o $M$	0	65.51	37.78	37.35	37.41	54.14	46.27	68.50	57.41	50.30	12.68
	Diana	50	74.44	42.91	43.16	40.05	59.08	48.47	78.44	60.92	55.93	6.75
	Multitask	-	80.22	44.74	47.30	41.72	64.05	51.00	83.44	61.41	59.23	-

Table 1: Model performance on seen tasks. Best results (except the upper bound Multitask) are bold. Our model Diana significantly outperforms other baselines on all metrics with

p

-value

<

0.05 (

t

-test).

4.3 Baselines

We use the following competitive baselines:

Regularization-based methods: EWC Kirkpatrick et al. (2017) adopts the elastic weight consolidation approach to add regularization on parameter changes; FLCB Gao et al. (2022) uses knowledge learned from previous tasks to guide future task learning; Rehearsal-based methods: ER Chaudhry et al. (2019b) replays memory samples from previous tasks to consolidate learned knowledge; DER++ Buzzega et al. (2020) augments ER with a $L_{2}$ loss on the soft labels; AFPER Mi et al. (2020) combines ER with an adaptive elastic weight consolidation mechanism; Architecture-based methods: AdapterCL Madotto et al. (2021) allocates separate adapter modules for different tasks; L2P Wang et al. (2022) attaches a group of prompts on a pre-trained model to share fine-grained knowledge; DualPrompt Wang et al. (2022) uses different prompts to encode task-invariant and task-specific knowledge; ProQA Zhong et al. (2022) uses a unified structural prompt to implement lifelong QA models. Note that ProQA requires obtaining task identities in the testing phase.

We combine ProQA and ER to implement a stronger baseline ProQA+ER, in which samples from previous tasks are replayed for the ProQA model, and we also implement a variant of Diana by removing the memory buffer Diana w/o $M$ . We further report the performance for sequentially fine-tuning the QA model on all tasks (Finetune) and multi-task learning (Multitask). Note that the performance of Multitask is generally regarded as the upper bound of LL models.

All above baselines are implemented following same settings of our model, including the backbone PLM, prompt size, and memory size used for replay. Note that for the ProQA baseline, we follow its original setting to provide task identities for testing samples when evaluating.

4.4 Experiment Results

Results on Seen Tasks

Table 1 shows the result on eight seen tasks. Diana outperforms all competitive baselines. When task identities are unavailable, Diana outperforms the best performing baseline AFPER with a large margin, i.e., 6.15% relative improvement on the $A_{N}$ score and 27.26% relative decrease on the $F_{N}$ score. We can also observe that: (1) Diana even outperforms the ProQA+ER baseline, which leaks task identities in testing. This proves the superiority of our model design. (2) When task identities are unavailable, Diana w/o $M$ outperforms all baselines that do not use the memory buffer. This demonstrates that Diana’s hierarchical prompts help to improve the LL performance even without the memory buffer.

Results on Unseen Tasks

Table 2 shows the result on three unseen tasks. Diana yields the best performances on all metrics. It also achieves a relative improvement of 9.49% on the $A_{N}$ score compared with the best baseline DER++. We can also observe that: (1) When $M$ is unavailable, models that share knowledge through fine-grained components (i.e., Diana and L2P) generally obtain high performance, and our model that allocates extra prompts for unseen tasks achieves the best performance. This validates the effectiveness of using hierarchical prompts to explicitly model unseen tasks. (2) It is interesting to see that our model even outperforms Multitask when the $M$ is available. This further proves that our model is effective in modeling unseen tasks.

Methods	Buffer	$R_{N, j}$			$A_{N^{'}}$
Methods	Size	Quoref	Drop	Dream	$A_{N^{'}}$
ProQA	0	33.40	18.29	55.85	35.85
ProQA+ER	50	35.87	19.78	58.35	38.00
Finetune	0	33.08.	18.10	55.36	35.51
EWC	0	33.43	18.14	56.65	36.07
FLCB	0	34.85	18.31	56.88	36.68
AdapterCL	0	35.47	17.83	57.21	36.84
L2P	0	36.22	19.18	57.40	37.60
DualPrompt	0	35.22	18.52	56.25	36.66
ER	50	35.14	18.56	59.71	37.80
DER++	50	36.15	19.08	60.17	38.47
AFPER	50	35.26	18.83	56.29	36.79
Diana w/o $M$	0	37.95	20.32	59.39	39.22
Diana	50	40.42	22.91	63.03	42.12
Multitask	-	36.27	22.99	62.60	40.62

Table 2: Model performance on unseen tasks. Best results (except Multitask) are bold. Diana significantly outperforms other baselines on all metrics with

p

-value

<

0.05 (

t

-test).

4.5 Ablation Studies

We conduct ablation studies on different components of Diana. Specifically, three types of variants are implemented:

1. Each prompt type is ablated: w/o general prompt, w/o format prompt, w/o task prompt, w/o meta prompt.

2. Schemes to enhance task prompts are ablated: w/o Sched. Sampling removes the scheduled sampling scheme and only uses the ground truth task identities in training; w/o G.T. Identity is similar to the above variant. Instead it only uses predicted task identities in training; w/o Neg. Samples only uses positive samples to train task prompt keys, i.e., the second term in Eq.1 is removed; w/o ADB uses fixed decision boundaries instead of ADBs to detect unseen tasks.

3. Schemes to enhance meta prompts are ablated: w/o Sample Dive. does not enforce the diversity property on each sample by removing the second term in Eq.3; w/o Memory Dive. does not exploit memory samples to enhance the diversity property by removing the loss $L_{m}^{'}$ (Eq.4); w/o Cluster does not cluster samples in $M$ , i.e., $c_{k}$ in Eq.4 is replaced with the query vector of each sample from $M$ .

Categories	Variants	$A_{N}$	$F_{N}$	$A_{N^{'}}$
Prompt Types	w/o General Prompt	55.47	6.93	40.74
	w/o Format Prompt	55.11	7.03	40.59
	w/o Task Prompt	53.87	8.50	39.66
	w/o Meta Prompt	53.46	8.56	40.04
Task prompt	w/o Sched. Sampling	55.15	7.43	42.00
	w/o G.T. Identity	54.16	7.61	41.27
	w/o Neg. Samples	54.97	7.66	41.78
	w/o ADB	55.48	6.98	41.01
Meta prompt	w/o Sample Dive.	55.24	6.91	41.23
	w/o Memory Dive.	55.02	7.41	41.48
	w/o Cluster	55.46	6.99	41.51
Diana		55.93	6.75	42.12

Table 3: Ablation studies of model components and training strategies. Each result is an average of 5 random runs.

Table 3 shows that Diana outperforms all above variants. We can also observe that: (1) “w/o Meta Prompt” achieves low performance. This indicates that these fine-grained meta prompts are more important in building lifelong QA models. (2) The scheduled sampling scheme helps to learn better task prompts and thus improves the LL performance. (3) ADB improves model performance on unseen tasks (i.e., $A_{N^{'}}$ ) by a large margin. (4) Enforcing the diversity property of meta prompt keys is important to obtain good key representations and facilitates the learning of each task.

4.6 More Analysis

Task Identity Detection Performance

Architecture-based LL models need to detect task identities of input samples when these identities are unavailable in the testing phase. To verify the performance of the task identity detector implemented in Diana, we compare our approach with other task identity detectors: (1) Perplexity-based detector implemented in baseline “AdapterCL” determines the task identities based on the perplexity of the PLM when different adapter modules are activated. (2) Distance-based detector implemented in our variant “w/o Neg. Samples” determines the task identity based on the distance between each key and query vectors. (3) Advanced distance-based detector implemented in our variant “w/o ADB” utilizes negative samples based on the above detector. Note that we do not apply ADB in the above two distance-based detectors. On our testing data, the above three approaches achieve a task detection accuracy of 59.84%, 52.72%, and 63.43%, respectively, while performance of Diana reaches 66.97%. This verifies the effectiveness of our task prompt keys in detecting task identities. More detailed comparisons of these task identity detectors can be found in Appendix C.

Criteria	Models	$Z$ =2	$Z$ =3	$Z$ =5	$Z$ =10
Locality	w/o Sample Dive.	0.73	0.72	0.70	0.48
	w/o Memory Dive.	0.74	0.72	0.69	0.63
	Diana	0.74	0.73	0.70	0.66
Diversity	w/o Sample Dive.	0.63	0.61	0.59	0.40
	w/o Memory Dive.	1.00	0.89	0.77	0.53
	Diana	1.00	0.96	0.89	0.63

Table 4: Quantitative analysis of the locality and diversity for meta prompt keys.

Distribution of Meta Prompt Keys

We also analyze the distribution of meta prompt keys $K = {k_{m}^{j}}_{j = 1}^{M}$ constructed in Diana, which are expected to balance the locality and diversity property. Specifically, we introduce two metrics to quantify these two properties. For the diversity property, we follow Mansoury et al. (2020) to measure whether these meta prompt keys cover the whole vector space:

D i v e r s i t y = | M \cup j = 1 N_{Z} (k_{m}^{j}, M) | / (Z \cdot M),

(12)

where $N_{Z} (k_{m}^{j}, M)$ represents the set of top- $Z$ nearest samples in $M$ around $k_{m}^{j}$ , and $| \cdot |$ returns the sample count of a set. High diversity scores are received if we can scatter meta prompt keys near every query vector. For the locality property, we follow Scellato et al. (2010) to measure whether there are keys clustered around each query vector $q$ in $M$ :

L o c a l i t y = \sum q \in M \sum k \in N_{Z} (q, K) (1 - | | q, k | |) / (Z \cdot | M |) .

(13)

High locality scores are received if meta prompt keys in $K$ are tightly clustered. As can be seen from table 4, the training strategies we introduced in Diana help to enforce the locality and diversity properties of meta prompt keys.

5 Conclusion

We propose Diana, a novel lifelong learning model for QA tasks. Diana converts different QA tasks into a unified sequence generation format and uses a prompt enhanced PLM to learn these tasks. We introduce four types of hierarchically organized prompts in Diana to capture knowledge in different granularities. These prompts are dynamically combined based on a set of key vectors, which are built with the help of several distance-based regularization terms. Dedicated components are also allocated in Diana to model samples from unseen tasks. Experiments and empirical analysis on benchmark QA tasks show that Diana outperforms SOTA lifelong QA models, especially in handling samples from unseen tasks.

References

S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. Advances in neural information processing systems 28. Cited by: §3.4.
T. E. Boult, S. Cruz, A. R. Dhamija, M. Gunther, J. Henrydoss, and W. J. Scheirer (2019) Learning and the unknown: surveying steps toward open world recognition. pp. 9801–9807. Cited by: §1.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. pp. 1877–1901. External Links: Link Cited by: §1.
P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. CALDERARA (2020) Dark experience for general continual learning: a strong, simple baseline. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 15920–15930. External Links: Link Cited by: §2.1, §4.3.
A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny (2019a) Efficient lifelong learning with a-gem. In Proceedings of ICLR, External Links: Link Cited by: §2.1.
A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. S. Torr, and M. Ranzato (2019b) On tiny episodic memories in continual learning. External Links: 1902.10486 Cited by: §4.3.
T. Chen, I. Goodfellow, and J. Shlens (2016) Net2Net: accelerating learning via knowledge transfer. In Proceedings of ICLR, External Links: Link Cited by: §1, §2.1.
P. Dasigi, N. F. Liu, A. Marasović, N. A. Smith, and M. Gardner (2019) Quoref: a reading comprehension dataset with questions requiring coreferential reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5925–5932. External Links: Link, Document Cited by: §4.1.
D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2368–2378. External Links: Link, Document Cited by: §4.1.
S. Ebrahimi, F. Meier, R. Calandra, T. Darrell, and M. Rohrbach (2020) Adversarial continual learning. pp. 386–402. Cited by: §2.1.
M. Farajtabar, N. Azizan, A. Mott, and A. Li (2020) Orthogonal gradient descent for continual learning. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, S. Chiappa and R. Calandra (Eds.), Proceedings of Machine Learning Research, Vol. 108, pp. 3762–3773. External Links: Link Cited by: §2.1.
C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra (2017) PathNet: evolution channels gradient descent in super neural networks. External Links: 1701.08734 Cited by: §1, §2.1.
R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3 (4), pp. 128–135. Cited by: §1.
J. Gao, J. Li, H. Shan, Y. Qu, J. Z. Wang, and J. Zhang (2022) Forget less, count better: a domain-incremental self-distillation learning benchmark for lifelong crowd counting. External Links: 2205.03307 Cited by: §2.1, §4.3.
G. Geigle, N. Reimers, A. Rücklé, and I. Gurevych (2021) TWEAC: transformer with extendable qa agent classifiers. External Links: 2104.07081 Cited by: §1.
M. Ghifary, W. B. Kleijn, M. Zhang, and D. Balduzzi (2015) Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2.
N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine LearningAdvances in Neural Information Processing SystemsProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language TechnologiesProceedings of the 2021 Conference on Empirical Methods in Natural Language ProcessingProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)Proceedings of the 58th Annual Meeting of the Association for Computational LinguisticsAdvances in Neural Information Processing SystemsProceedings of the 10th international conference on World Wide WebProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)Findings of the Association for Computational Linguistics: EMNLP 2020International Conference on Learning RepresentationsProceedings of the European Conference on Computer Vision (ECCV) WorkshopsProceedings of the AAAI conference on artificial intelligenceAAAIProceedings of the 39th International Conference on Machine LearningProceedings of the 28th International Conference on Computational LinguisticsProceedings of the IEEE/CVF Winter Conference on Applications of Computer VisionProceedings of the IEEE/CVF International Conference on Computer VisionText Summarization Branches OutProceedings of the European Conference on Computer Vision (ECCV)Proceedings of the IEEE conference on computer vision and pattern recognitionProceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization3rd Workshop on Online Social Networks (WOSN 2010)Proceedings of the IEEE conference on Computer Vision and Pattern RecognitionEuropean Conference on Computer Vision, K. Chaudhuri, R. Salakhutdinov, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato (Eds.), Proceedings of Machine Learning ResearchProceedings of Machine Learning ResearchUMAP ’20, Vol. 97333033162, pp. 2790–2799. External Links: Link Cited by: §1.
Z. Jiang, F. F. Xu, J. Araki, and G. Neubig (2020) How Can We Know What Language Models Know?. Transactions of the Association for Computational Linguistics 8, pp. 423–438. Cited by: §1.
D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi (2020) UNIFIEDQA: crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 1896–1907. External Links: Link, Document Cited by: Appendix G, §1, §2.3, §3.2.
A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba (2012) Undoing the damage of dataset bias. In Computer Vision – ECCV 2012, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid (Eds.), Berlin, Heidelberg, pp. 158–171. External Links: ISBN 978-3-642-33718-5 Cited by: §2.2.
J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of NAS, pp. 3521–3526. External Links: Link Cited by: §2.1, §4.3.
T. Kocisky, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette (2018) The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics 6 (0), pp. 317–328. External Links: ISSN 2307-387X, Link Cited by: §4.1.
T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7 (0), pp. 452–466. External Links: ISSN 2307-387X, Link Cited by: §4.1.
G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy (2017) RACE: large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 785–794. External Links: Link, Document Cited by: §4.1.
B. Lester, R. Al-Rfou, and N. Constant (2021) The power of scale for parameter-efficient prompt tuning. Online and Punta Cana, Dominican Republic, pp. 3045–3059. External Links: Link, Document Cited by: §3.2.
D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2017) Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2.
D. Li, J. Zhang, Y. Yang, C. Liu, Y. Song, and T. M. Hospedales (2019) Episodic training for domain generalization. pp. 1446–1455. Cited by: §2.2.
H. Li, S. J. Pan, S. Wang, and A. C. Kot (2018) Domain generalization with adversarial feature learning. pp. 5400–5409. Cited by: §2.2.
Z. Li and D. Hoiem (2017) Learning without forgetting. TPAMI 40 (12), pp. 2935–2947. External Links: Link Cited by: §2.1.
C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. Barcelona, Spain, pp. 74–81. External Links: Link Cited by: §4.1.
X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang (2021) GPT understands, too. External Links: 2103.10385 Cited by: §3.2.
I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. External Links: 1711.05101 Cited by: Appendix A, §4.2.
A. Madotto, Z. Lin, Z. Zhou, S. Moon, P. Crook, B. Liu, Z. Yu, E. Cho, P. Fung, and Z. Wang (2021) Continual learning in task-oriented dialogue systems. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 7452–7467. External Links: Link, Document Cited by: §1, §2.1, §4.3.
A. Mallya and S. Lazebnik (2018) Packnet: adding multiple tasks to a single network by iterative pruning. pp. 7765–7773. Cited by: §2.1.
D. Maltoni and V. Lomonaco (2019) Continuous learning in single-incremental-task scenarios. Neural Networks 116, pp. 56–73. Cited by: §2.1.
M. Mancini, E. Ricci, B. Caputo, and S. Rota Bulo (2018) Adding new tasks to a single network with weight transformations using binary masks. pp. 0–0. Cited by: §1.
M. Mansoury, H. Abdollahpouri, M. Pechenizkiy, B. Mobasher, and R. Burke (2020) FairMatch: a graph-based approach for improving aggregate diversity in recommender systems. New York, NY, USA, pp. 154–162. External Links: ISBN 9781450368612, Link, Document Cited by: §4.6.
B. McCann, N. S. Keskar, C. Xiong, and R. Socher (2019) The natural language decathlon: multitask learning as question answering. External Links: Link Cited by: §2.3.
F. Mi, L. Chen, M. Zhao, M. Huang, and B. Faltings (2020) Continual learning for natural language generation in task-oriented dialog systems. Online, pp. 3461–3474. External Links: Link, Document Cited by: §4.3.
T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018) Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2381–2391. External Links: Link, Document Cited by: §4.1.
K. Muandet, D. Balduzzi, and B. Schölkopf (2013) Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester (Eds.), Proceedings of Machine Learning Research, Vol. 28, Atlanta, Georgia, USA, pp. 10–18. External Links: Link Cited by: §2.2.
M. Mundt, Y. W. Hong, I. Pliushch, and V. Ramesh (2020) A wholistic view of continual learning with deep neural networks: forgotten lessons and the bridge to active and open world learning. arXiv. External Links: 2009.01797 Cited by: §1.
N. Pu, W. Chen, Y. Liu, E. M. Bakker, and M. S. Lew (2021) Lifelong person re-identification via adaptive knowledge accumulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7901–7910. Cited by: §2.1.
C. Qin and S. Joty (2022) LFPT5: a unified framework for lifelong few-shot language learning based on prompt tuning of t5. External Links: Link Cited by: §1.
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, pp. 1–67. Cited by: Appendix A, §4.2.
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392. External Links: Link, Document Cited by: §4.1.
S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: Incremental classifier and representation learning. In Proceedings of CVPR, pp. 2001–2010. External Links: Link Cited by: §2.1.
M. Richardson, C. J.C. Burges, and E. Renshaw (2013) MCTest: a challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 193–203. External Links: Link Cited by: §4.1.
H. Ritter, A. Botev, and D. Barber (2018) Online structured laplace approximations for overcoming catastrophic forgetting. In Proceedings of NIPS, pp. 3738–3748. External Links: Link Cited by: §2.1.
A. Rogers, M. Gardner, and I. Augenstein (2021) QA dataset explosion: a taxonomy of nlp resources for question answering and reading comprehension. External Links: 2107.12708 Cited by: §1.
A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. External Links: 1606.04671 Cited by: §1, §2.1.
M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi (2019) Social IQa: commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4463–4473. External Links: Link, Document Cited by: §4.1.
S. Scellato, C. Mascolo, M. Musolesi, and V. Latora (2010) Distance matters: geo-social metrics for online social networks. Cited by: §4.6.
H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Proceedings of NIPS, pp. 2990–2999. External Links: Link Cited by: §2.1.
M. A. C. Soares and F. S. Parreiras (2020) A literature review on question answering techniques, paradigms and systems. Journal of King Saud University-Computer and Information Sciences 32 (6), pp. 635–646. Cited by: §1.
L. Su, J. Guo, R. Zhang, Y. Fan, Y. Lan, and X. Cheng (2020) Continual domain adaptation for machine reading comprehension. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 1395–1404. Cited by: §1.
F. Sun, C. Ho, and H. Lee (2019a) LAMOL: language modeling for lifelong language learning. External Links: 1909.03329 Cited by: §2.1.
K. Sun, D. Yu, J. Chen, D. Yu, Y. Choi, and C. Cardie (2019b) Dream: a challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics 7, pp. 217–231. Cited by: §2.3, §4.1.
S. Thrun and T. M. Mitchell (1995) Lifelong robot learning. Robotics and autonomous systems 15 (1-2), pp. 25–46. Cited by: §1.
F. Träuble, A. Goyal, N. Rahaman, M. Mozer, K. Kawaguchi, Y. Bengio, and B. Schölkopf (2022) Discrete key-value bottleneck. External Links: 2207.11240 Cited by: §1.
A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman (2017) NewsQA: a machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, Vancouver, Canada, pp. 191–200. External Links: Link, Document Cited by: §4.1.
G. M. van de Ven and A. S. Tolias (2019) Three scenarios for continual learning. arXiv. External Links: Document, 1904.07734, Link Cited by: §2.1.
T. Vu, B. Lester, N. Constant, R. Al-Rfou’, and D. Cer (2022) SPoT: better frozen model adaptation through soft prompt transfer. Dublin, Ireland, pp. 5039–5059. External Links: Link, Document Cited by: §3.2.
Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C. Lee, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister (2022) DualPrompt: complementary prompting for rehearsal-free continual learning. External Links: 2204.04799 Cited by: §1, §1, §2.1, §4.3.
Z. Wang, Z. Zhang, C. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister (2022) Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 139–149. Cited by: §1, §1, §2.1, §4.3.
C. Wiwatcharakoses and D. Berrar (2020) SOINN+, a self-organizing incremental neural network for unsupervised learning from noisy data streams. Expert Systems with Applications 143, pp. 113069. External Links: ISSN 0957-4174, Document, Link Cited by: §1, §1.
M. Wortsman, V. Ramanujan, R. Liu, A. Kembhavi, M. Rastegari, J. Yosinski, and A. Farhadi (2020) Supermasks in superposition. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 15173–15184. External Links: Link Cited by: §1.
Z. Xu, W. Li, L. Niu, and D. Xu (2014) Exploiting low-rank structure from latent domains for domain generalization. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 628–643. External Links: ISBN 978-3-319-10578-9 Cited by: §2.2.
W. Yang, Y. Xie, A. Lin, X. Li, L. Tan, K. Xiong, M. Li, and J. Lin (2019) End-to-end open-domain question answering with BERTserini. Minneapolis, Minnesota, pp. 72–77. External Links: Link, Document Cited by: §1.
H. Ye, H. Liu, F. Meng, and X. Li (2021) Bi-directional exponential angular triplet loss for rgb-infrared person re-identification. IEEE Transactions on Image Processing 30 (), pp. 1583–1595. External Links: Document Cited by: §3.3.
F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of ICML, pp. 3987–3995. External Links: Link Cited by: §2.1.
H. Zhang, H. Xu, and T. Lin (2021) Deep open intent classification with adaptive decision boundary.. pp. 14374–14382. Cited by: Appendix A, §3.5.
J. Zhang, J. Zhang, S. Ghosh, D. Li, S. Tasci, L. Heck, H. Zhang, and C. J. Kuo (2020) Class-incremental learning via deep model consolidation. pp. 1131–1140. Cited by: §2.1.
S. Zhao, M. Gong, T. Liu, H. Fu, and D. Tao (2020) Domain generalization via entropy regularization. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 16096–16107. External Links: Link Cited by: §2.2.
W. Zhong, Y. Gao, N. Ding, Y. Qin, Z. Liu, M. Zhou, J. Wang, J. Yin, and N. Duan (2022) ProQA: structural prompt-based pre-training for unified question answering. External Links: 2205.04040 Cited by: §1, §2.3, §3.2, §4.1, §4.3.

Appendix

Appendix A Implementation Details

We use T5-base Raffel et al. (2020) to initialize our encoder-decoder model (12 layers, 768 dimensional hidden size, and 12 attention heads), and set the lengths of soft prompts $P_{g}$ , $P_{f}$ , $P_{t}$ , $P_{m}$ to 20, 40, 40, 20, respectively. We use a fixed T5-base encoder with an average pooling layer to obtain the query vector. We maintain a pool of $M = 30$ meta prompts, and for each sample ( $C, Q$ ) we choose $M^{'} = 5$ meta prompts to construct $P_{m} (C, Q)$ . We use the AdamW Loshchilov and Hutter (2017) optimizer for training. All hyper-parameters are tuned according to the average score on validation datasets of SQuAD, NarQA, RACE, OBQA, SIQA and Dream. We tried epoch number of ${2, 3, 4, 5, 6, 7, 8}$ and learning rate of ${1 e - 5, 5 e - 5, 1 e - 4, 5 e - 4, 1 e - 3}$ . We finally set the learning rate to 1e-4 and the number of training epochs to 5. We set $η = 0.15$ and $γ = 0.3$ in Eq.3 and $α = 0.9$ and $β = 3 e - 4$ in Eq.5. For $η$ and $γ$ , we have a grid search between 0 and 0.5 with an interval of 0.05. For $α$ and $β$ , $α$ is searched among ${0.9, 0.7, 0.5}$ , while $β$ is searched among ${1 e - 5, 3 e - 5, 1 e - 4, 3 e - 4, 1 e - 3}$ . All experiments are performed on 4 V100 GPUs (32GB). The batch size is set to 64. In our experiments, We perform 5 runs with different task orders by setting the random seed to ${42, 43, 44, 45, 46}$ respectively. In this way, we report the average score of each method. Note that we only use the random seed $42$ for tuning hyper-parameters.

To train extra task prompts ${^Pt(F1),⋯,^Pt(FL)}$ for unseen tasks, we allocate a small probability $ω = 5 %$ for each training sample ( $C, Q, A$ ) to use ${^P}_{t} (F_{j})$ as its task prompt in $P (C, Q)$ , where $F_{j}$ is the task format of ( $C, Q, A$ ). To implement variant “w/o ADB” for ablation study, we use a fixed decision boundary instead of ADB. If for any task $T_{i}$ , the distance $| | h (C, Q), k_{t} (T_{i}) | | > 0.35$ , we regard the sample is from unseen tasks.

The adaptive decision boundary for each task is determined following the approach proposed by Zhang et al. (2021). We use AdamW optimizer with a learning rate of 0.02 to learn each decision boundary.

Appendix B Memory Update

After learning task $T_{i}$ , we select $E$ diverse samples (we set $E = 50$ in our experiments) from $T_{i}$ to update the memory $M$ based on the query vector of each sample. Specifically, our selection criteria are built based on the distance of these prompt keys and query vectors. For each meta prompt key $k_{m}^{j}$ ( $j = 1, \dots, M$ ), we select top- $⌈ \frac{E}{M} ⌉$ samples ( $⌈ \cdot ⌉$ is the ceiling function), whose query vectors are closest to $k_{m}^{j}$ . After accumulating $M ⌈ \frac{E}{M} ⌉$ memory samples selected by $M$ meta prompt keys, we rank these samples based on their distance to the corresponding meta prompt keys, and choose top- $E$ samples with the smallest distance to be fed into $M$ . In this way, the memory $M$ we constructed can expand to the whole space of prompt keys.

Note that, the memory buffer $M$ is optional in Diana. Without $M$ , the loss in Eq. 4 is not optimized, and the second term in Eq. 1 is removed.

Appendix C More Analysis of Task Identity Detection Performance

The above approaches are evaluated under two scenarios: (1) In Closed-world: detectors are only required to detect samples from seen tasks. Note that in this setting, the Advanced distance-based detector used in “w/o ADB” is the same as the task identity detector implemented in Diana. (2) In Open-world: detectors are required to handle unseen task samples as well. When tested in the open-world scenario, these two distance-based detectors adopt a fixed decision boundary of 0.35 (see Appendix A). The perplexity-based detector adopts a perplexity threshold of 4, i.e., samples with a perplexity score above 4 are regarded as unseen task samples. This perplexity threshold is selected based on the model performance on the validation set.

We report the task identity detection accuracy and Marco F1 scores for seen samples and unseen samples separately in Table 5. we can observe that: (1) The task identity detector used in Diana achieves the best performance in both scenarios. This proves the effectiveness of our task prompt keys in detecting task identities. (2) Negative samples used in Advanced distance-based detector significantly improve the task identity detection performance on seen tasks. (3) ADB is effective in improving the task identity detection performance on unseen tasks.

Scenario	Methods	Scores on Seen Tasks		Scores on Unseen Tasks		Overall Scores
Scenario	Methods	F1	Accuracy	F1	Accuracy	F1	Accuracy
Closed-world	Perplexity-based	44.92	52.20	-	-	44.92	52.20
	Distance-based	43.18	63.34	-	-	43.18	63.34
	Advanced distance-based	54.37	75.35	-	-	54.37	75.15
Open-world	Perplexity-based	33.15	58.64	26.14	62.98	32.37	59.84
	Distance-based	38.51	50.53	21.98	58.48	36.67	52.72
	Advanced distance-based	44.12	64.86	24.17	59.67	41.90	63.43
	Diana	47.06	68.81	35.70	62.16	45.80	66.97

Table 5: Task identity detection performance in different models.

Figure 4: The task identity detection accuracy for samples from the last task $T_{N}$ when learning $T_{N}$ .

Appendix D More Analysis of Scheduled Sampling

We perform a more detailed analysis of the scheduled sampling scheme introduced in Diana. Specifically, in the ablation variant “w/o G.T. Identity”, the model only uses predicted task identities in training. This scheme helps to alleviate the discrepancy between training and testing with the cost of the model coverage speed. In the ablation variant “w/o Sched. Sampling”, the model only uses golden truth task identities in the training process. This scheme leads to the discrepancy between training and testing. The above two schemes under-perform our model Diana.

In this section, we analyze the task identity detection accuracy yield by the above schemes in Figure 4 when learning the last task $T_{N}$ in the input task sequence. We can observe that the task identity detection accuracy achieved by “w/o G.T. Identity” is extremely low in earlier iterations, which hinders task prompts from sharing task-specific knowledge in the early training stage. The scheduled sampling process introduced in Diana effectively compromises between detecting correct task identities and alleviating the train-test discrepancy, and thus it results in the best LL performance among these variants. Note that the task identity detection accuracy in “w/o Sched. Sampling” is almost zero in the first 1,000 iterations when learning task $T_{N}$ . This is because the task prompt keys for previous $N - 1$ tasks are already well learned. The randomly initialized prompt key for task $T_{N}$ needs to be pulled to the query vector space before starting to be functional.

Appendix E More Analysis of Computational Cost

We also analyze the computational cost of Diana, including the number of tunable parameters, time used for training and testing, and size of required memories retained from previous tasks. As can be seen from Table 6, Diana does not introduce too much computation overhead.

Methods	Tunable Parameters	Memory Size	Train Time Per Batch	Test Time All Tasks
Lower Bound	222.90M	0	0.55	523
EWC	222.90M	0	0.93	596
FLCB	222.90M	0	0.59	591
AdapterCL	262.25M	0	0.73	5852
L2P	223.39M	0	1.01	1013
DualPrompt	223.17M	0	0.93	1147
ER	222.90M	50	0.58	541
DER++	222.90M	50	0.68	604
AFPER	222.90M	50	0.95	630
ProQA	223.43M	0	0.86	863
Diana	223.84M	50	1.05	1108
Diana w/o $M$	223.84M	0	0.97	1123

Table 6: Computational cost of Diana and baselines. “Train Time” is the average time cost for each batch. “Test Time” is the total time cost to evaluate all 11 tasks. Both train and test times are in seconds.

Appendix F Training Process

Details about the training process of Diana are shown in Algorithm 1.

Input: QA model $g_{θ}$ , datasets ${(C_{j}, Q_{j}, A_{j})}_{j = 1}^{n_{i}}$ for each task $T_{i}$ ( $i$ = $1, \dots, N$ ), memory buffer $M$ , general prompt $P_{g}$ , format prompts ${P_{f} (F_{j})}_{j = 1}^{F}$ , task prompts ${Pt(Ti)}Ni=1∪{^Pt(Fj)}Fj=1$ , meta prompts ${P_{m}^{i}}_{i = 1}^{M}$ , task prompt keys ${k_{t} (T_{i})}_{i = 1}^{N}$ , meta prompt keys ${k_{m}^{i}}_{i = 1}^{M}$

1: Initialize:

M \leftarrow \emptyset

2: for Each task

T_{i}

i

1, \dots, N

3: if

M \neq \emptyset

then

4: Calculate cluster centroids

c_{1}, \dots, c_{B}

M

5: end if

6: for number of training epochs do

7: for Each mini-batch

I \in

{(C_{j}, Q_{j}, A_{j})}_{j = 1}^{n_{i}} \cup M

8: Obtain

ϵ_{k}

by Eq.5

9: for (

C, Q, A

)

\in

I

10: Obtain format

F_{j}

of (

C, Q, A

)

11: Sample

ϵ, ζ

from

U (0, 1)

12: if

ζ < ω

then

13:

P_{t} (C, Q) \leftarrow {^P}_{t} (F_{j})

{Use task prompt

{^P}_{t} (F_{j})

for unseen tasks}

14: else if

ϵ < ϵ_{k}

then

15:

P_{t} (C, Q) \leftarrow P_{t} (T_{i})

{Use the golden truth task identity to select task prompt}

16: else

17:

P_{t} (C, Q) \leftarrow P_{t} (a r g m i n T_{τ} \in {T_{1}, \dots, T_{i}} (| | q, k_{t} (T_{τ}) | |))

{Use the inferred task identity to select task prompt}

18: end if

19:

S (C, Q) \leftarrow

indexes of

M^{'}

meta prompt keys that are closest to

q

20:

P_{m} (C, Q) \leftarrow {P_{m}^{j}} j \in S (C, Q)

21:

P (C, Q)

\leftarrow

[P_{g}; P_{f} (F_{j}); P_{t} (C, Q); P_{m} (C, Q)]

22: Calculate per sample loss

L_{Q A}

g_{θ}

and

P (C, Q)

by Eq.6

23: Obtain negative sample

(C_{n}, Q_{n})

from

M

by Eq.2

24: Calculate per sample loss

L_{t}

k_{t} (T_{i})

by Eq.1

25: Calculate per sample loss

L_{m}

{k_{m}^{s_{j}}} (s_{j} \in S (C, Q))

by Eq.3

26: if (

C, Q, A

)

\in M

then

27: Calculate per sample loss

{L^{'}}_{m}

{k_{m}^{s_{j}}} (s_{j} \in S (C, Q))

by Eq.4

28: end if

29: end for

30: Update

g_{θ}

and prompts with accumulated

L_{Q A}

31: Update task prompt keys

{k_{t} (T_{i})}_{i = 1}^{N}

with accumulated

L_{t}

32: Update meta prompt keys

{k_{m}^{i}}_{i = 1}^{M}

with accumulated

L_{m}

and

{L^{'}}_{m}

33: end for

34: end for

35: Update

M

with

{(C_{j}, Q_{j}, A_{j})}_{j = 1}^{n_{i}}

according to details in Appendix B

36: end for

Algorithm 1 Training process of Diana

Appendix G Dataset Statistics

Format	Dataset	Train set size	Val set size	Test set size
Extractive	SQuAD	80k	7k	10k
	NewsQA	76k	-	4.3k
	Quoref	22k	-	2.7k
Abstractive	NarQA	65k	6.9k	21k
	NQOpen	9.6k	-	10k
	Drop	77k	-	9.5k
Multiple-Choice	RACE	87k	4.8k	4.9k
	OBQA	4.9k	500	500
	MCTest	1.4k	-	320
	SIQA	33k	1.9k	2.2k
	Dream	6.1k	2.0k	2.0k

Table 7: Dataset Statistics.

We perform experiments on 11 benchmark QA tasks across three different formats. The statistics of these 11 datasets are summarized in Table 7. Note that we follow the pre-process scheme released by Khashabi et al. (2020) to tackle these 11 datasets. Some of these datasets do not contain a validation set. We only use the validation sets of SQuAD, NarQA, RACE, OBQA, SIQA and Dream to search hyper-parameters.

Appendix H Cases

We list some samples for each task we modeled in Table 8.

Format	Dataset	Case
Extractive	SQuAD	Context: (Private_school) Private schooling in the United States has been…
		Question: In what year did Massachusetts first require children to be educated in schools?
		Answer: 1852
	NewsQA	Context:ABECHE, Chad (CNN) – Most of the 103 children that a French charity…
		Question:WHO ARE UNDER ARREST IN CHAD?
		Answer:Three French journalists, a seven-member Spanish flight crew and one Belgian
	Quoref	Context:(Blast of Silence) Frankie Bono, a mentally disturbed hitman from Cleveland…
		Question:What is the first name of the person who follows their target to select the best possible location?
		Answer:Frankie
Abstractive	NarQA	Context:The play begins with three pages disputing over the black cloak usually worn by the actor…
		Question:WHO NORMALLY DELIVERS THE OPENING PROLOGUE IN THE PLAY?
		Answer:THE ACTOR WEARING THE BLACK CLOAK
	NQOpen	Context:- cartilage - cartilage cartilage is a resilient and smooth elastic tissue , a rubber…
		Question:where is each type of cartilage located in the body?
		Answer:many other body components
	Drop	Context:Hoping to rebound from their loss to the Patriots, the Raiders stayed at home for a Week 16 duel…
		Question:How many field goals did both teams kick in the first half?
		Answer:2
Multiple-Choice	RACE	Context:It’s cool, and it’s hot, and everyone is doing it. People talk about it often, and friends…
		Question:A blogger is a person _ .
		(A) who teaches kids bad words (B) who posts songs from the latest bands
		(C) who got drunk last weekend (D) who writes diaries online
		Answer: who writes diaries online
	OBQA	Context:Null
		Question:Frilled sharks and angler fish live far beneath the surface of the ocean, which is why they are known as
		(A) Deep sea animals (B) fish (C) Long Sea Fish (D) Far Sea Animals Deep sea animals
		Answer:Deep sea animals
	MCTest	Context:It was Jessie Bear’s birthday. She was having a party…
		Question:Who was having a birthday?
		(A) Jessie Bear (b) no one (C) Lion (D) Tiger
		Answer:Jessie Bear
	SIQA	Context:Tracy didn’t go home that evening and resisted Riley’s attacks
		Question:What does Tracy need to do before this?
		(A) make a new plan (B) Go home and see Riley (C) Find somewhere to go
		Answer:Find somewhere to go
	Dream	Context:M: How long have you been teaching in this middle school? W: For ten years…
		Question:What’s the woman probably going to do?
		(A) To teach a different textbook. (B) To change her job. (C) To learn a different textbook.
		Answer:To change her job.

Table 8: Samples extracted from different tasks. Each task contains a context, a question and an answer.