Dynamics-Adaptive Continual Reinforcement Learning via Progressive Contextualization

Tiantian Zhang, Zichuan Lin, Yuxing Wang, Deheng Ye, Qiang Fu,
Wei Yang, Xueqian Wang, Bin Liang, Bo Yuan, and Xiu Li

This work was partly supported by the Science and Technology Innovation 2030-Key Project under Grant 2021ZD0201404 and Tencent Rhino-Bird Research Elite Program. (Corresponding authors: Bo Yuan and Xiu Li.)

Tiantian Zhang is with the Department of Automation, Tsinghua University, Beijing 100084, China, and also with the Tencent AI Lab, Shenzhen 518000, China (e-mail: ztt19@mails.tsinghua.edu.cn).

Zichuan Lin, Deheng Ye, Qiang Fu, and Wei Yang are with the Tencent AI Lab, Shenzhen 518000, China (e-mail: zichuanlin@tencent.com; dericye@tencent.com; leonfu@tencent.com; willyang@tencent.com).

Yuxing Wang, Xueqian Wang and Xiu Li are with the Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China (e-mail: wyx20@mails.tsinghua.edu.cn; wang.xq@sz.tsinghua.edu.cn; li.xiu@sz.tsinghua.edu.cn).

Bin Liang is with the Department of Automation, Tsinghua University, Beijing 100084, China (e-mail: liangbin@mail.tsinghua.edu.cn).

Bo Yuan is with Qianyuan Institute of Sciences, Hangzhou 310000, China (e-mail: boyuan@ieee.org).

Abstract

A key challenge of continual reinforcement learning (CRL) in dynamic environments is to promptly adapt the RL agent’s behavior as the environment changes over its lifetime, while minimizing the catastrophic forgetting of the learned information. To address this challenge, in this article, we propose DaCoRL, i.e., dynamics-adaptive continual RL. DaCoRL learns a context-conditioned policy using progressive contextualization, which incrementally clusters a stream of stationary tasks in the dynamic environment into a series of contexts and opts for an expandable multihead neural network to approximate the policy. Specifically, we define a set of tasks with similar dynamics as an environmental context and formalize context inference as a procedure of online Bayesian infinite Gaussian mixture clustering on environment features, resorting to online Bayesian inference to infer the posterior distribution over contexts. Under the assumption of a Chinese restaurant process prior, this technique can accurately classify the current task as a previously seen context or instantiate a new context as needed without relying on any external indicator to signal environmental changes in advance. Furthermore, we employ an expandable multihead neural network whose output layer is synchronously expanded with the newly instantiated context, and a knowledge distillation regularization term for retaining the performance on learned tasks. As a general framework that can be coupled with various deep RL algorithms, DaCoRL features consistent superiority over existing methods in terms of the stability, overall performance and generalization ability, as verified by extensive experiments on several robot navigation and MuJoCo locomotion tasks.

Dynamic environment, continual reinforcement learning (CRL), incremental context detection, adaptive network expansion.

I Introduction

Reinforcement learning (RL) [33] is a major learning paradigm in machine learning for sequential decision making tasks. It aims to train a competent policy for an agent that properly maps states to actions to maximize the cumulative reward by interacting with an environment in a trial-and-error manner. Traditional RL algorithms, such as Q-learning[39] and SARSA[28], have been widely studied as tabular methods and successfully applied to Markov decision processes (MDPs) with finite discrete state-action spaces. The emergence of advanced function approximation techniques based on deep neural networks (DNNs) enables RL to have a higher level of understanding of the physical world [21], and solve high-dimensional tasks ranging from playing video games directly from pixels [24, 34] to making real-time decisions on continuous robot control tasks[29, 9, 5, 11, 2].

Progress in RL has predominantly focused on learning a single task under the assumption of a stationary¹ and fully explorable environment for sampling observations. Nevertheless, in the real world, environments are often non-stationary and characterized by ever-changing dynamics such as shifts in the terrain or weather conditions, changes of the target position in robot navigation[17], different traffic inflow rates and demand patterns at different times of a day (e.g., peak and off-peak hours) in vehicular traffic signal control[19], and the variation in coexisting agents in multiagent systems[44]. These scenarios demand competent RL agents that can continually adapt to new environmental dynamics while retaining performance on all previously encountered environmental conditions.

¹A stationary environment is an environment whose dynamics, normally represented by the reward and state transition functions of the MDP, do not change over time.

Unfortunately, the above requirements are difficult for existing RL methods to fulfill. On the one hand, storing all past experiences may result in a constant growth in memory consumption and computational power. On the other hand, if the size of the replay buffer is limited, the agent may inevitably suffer from the phenomenon known as catastrophic forgetting[23, 12, 15], incapable of retaining the knowledge and skills learned in previously encountered situations.

Recently, Continual RL (CRL)[19, 20, 10, 18, 25, 42] has been investigated as an effective solution for adaptation to dynamic environments and mitigation of catastrophic forgetting. In this setting, the dynamic environment can be considered as a stream of stationary tasks on a certain timescale where each task corresponds to the specific environmental dynamics during the associated time period. As shown in Fig. 1, the previously learned policy (e.g., $π_{θ_{t - 1}^{*}}$ over $M_{1} \sim M_{t - 1}$ ) is used for the initialization of the new policy (e.g., $π_{θ_{t}}$ on $M_{t}$ ), and it is subsequently updated to fit in the current task during the learning period in a continual fashion, retaining previously learned abilities. In other words, CRL is capable of developing proper behaviors for new tasks while keeping the overall performance across all learned tasks. Such features of continual learning are highly desirable for intelligent systems in real-world applications where the environments are subject to consistent changes.

Fig. 1: Continual reinforcement learning (CRL) in dynamic environments. $M_{t} \in M, t = 1, 2, \dots$ denotes the specific MDP/task in time period $t$ ; $D$ denotes the dynamic environment over the MDPs space $M$ ; $θ$ are the learning parameters of policy; $π_{θ_{t}^{*}}$ represents the approximate optimal policy over all learned tasks $M_{1} \sim M_{t}$ .

Existing CRL methods [20, 10, 18, 25] assume a decomposition of the original problem into disjoint sub-domains of similar dynamics (also called “context”) and their boundaries are known in advance. Consequently, previous studies mainly focus on how to construct effective mechanisms to mitigate the catastrophic forgetting among contexts, largely ignoring the challenge of automatic context inference during the learning process. For the alleviation of catastrophic forgetting/interference caused by data distribution drift in the single-task RL, Zhang et al.[42] employ sequential K-means clustering to achieve automatic context inference, given the number of contexts ( $k$ ) in advance. This method works well because it is feasible to acquire an approximate estimate of the state distribution with sufficient exploration to determine the value of $k$ in a single task. However, significant challenges are expected in dynamic environments, where it is impractical to accurately determine the number of environmental contexts in advance as the changes of environmental dynamics are usually infinite and highly uncertain. As a result, it is more rational for the agent to infer and instantiate environmental contexts in a fully online and incremental manner during the CRL process.

In this article, we investigate CRL in dynamic environments to achieve continual inference of environmental contexts and necessary adaptation. The ultimate objective is to ensure the overall performance of the agent in the whole environment. This work is a significant and essential extension to our previous framework for single-task RL in [42], which relies on the prior knowledge about the number of contexts.

To this end, we propose a novel dynamics-adaptive continual reinforcement learning scheme (DaCoRL) with progressive contextualization, which incrementally clusters a stream of stationary tasks in the dynamic environment into a series of contexts and opts for an expandable multihead neural network to learn a context-conditioned policy. Progressive contextualization contains two core modules. The first is the incremental context detection procedure, which automatically detects changes in environmental dynamics and clusters tasks with similar dynamics into an environmental context. The second is the joint optimization procedure, which trains the policy online for each unique context using an expandable multihead neural network and a knowledge distillation regularization term.

To detect the changes of environmental dynamics over time, we introduce the online Bayesian infinite Gaussian mixture model to cluster environment features in a latent context space, where each cluster corresponds to a separate context. We employ online Bayesian inference to update the model of contexts in a fully incremental manner, assuming that the prior distribution over the contexts is a Chinese restaurant process (CRP)[1]. With this online incremental clustering technique, DaCoRL can incrementally instantiate new contexts as needed according to the concentration parameter² of the CRP, without requiring any external information about environmental changes such as the predetermined number of contexts in [42]. During the joint optimization procedure, we introduce an expandable multihead neural network whose output heads can be adaptively expanded according to the number of instantiated contexts. Compared with fixed-structure neural networks, this approach can eliminate unnecessary redundancy in the network structure and improve learning efficiency.

²It is a hyperparameter in the CRP, denoted by $α$ in this article, that controls the likelihood of new contexts.

The contributions of this article are summarized as follows.

1) An incremental context detection strategy is introduced for CRL in dynamic environments. It formalizes context inference in dynamic environments as an online Bayesian infinite Gaussian mixture clustering procedure on environmental features, enabling the agent to properly identify changes in environmental dynamics in an online manner without any prior knowledge of contexts.
2) A novel dynamics-adaptive CRL training scheme called DaCoRL is proposed for dynamic environments in continuous spaces. By employing an expandable multihead neural network in which an output head is added synchronously with the newly instantiated context, and a knowledge distillation regularization term, DaCoRL can effectively alleviate catastrophic forgetting and ensure competitive capacity of RL agents for continual learning in dynamic environments.
3) Extensive experiments on a suite of continuous control tasks ranging from robot navigation to MuJoCo locomotion are conducted to validate the overall superiority of our method over baselines in terms of stability, overall performance, and generalization ability.

In the rest of this article, Section II reviews the related work and Section III introduces the problem statement of CRL and the relevant concepts and notations. In Section IV, the framework of DaCoRL is presented, with details on its mechanism and implementation. Experimental results and analyses on several robot navigation and MuJoCo locomotion tasks are presented in Section V to provide comprehensive evidence of the superiority of DaCoRL over existing techniques. This article is concluded in Section VI with some discussions and directions for future work.

II Related Work

Continual learning is conceptually related to incremental learning [26, 27, 16, 37, 38, 36] and online learning [3, 32] as they all assume that tasks or samples are presented in a sequential manner. On the one hand, although incremental learning and continual learning are frequently used interchangeably in the literature, they are not always the same. In some studies[26, 27, 16], incremental learning is used to describe a learning process where a sequence of incremental tasks is learned in a continual manner; in this case, continual learning can be referred to as incremental learning. Nevertheless, several other studies on incremental learning [37, 38, 36] concentrate on how to incrementally adjust the previously learned policy to facilitate fast adaptation to a new task, while ignoring how the agent performs on old tasks. In this sense, continual learning is different from incremental learning.

On the other side, online learning aims to fit a single model to a single task over a sequence of data instances without any adaptation to new tasks or concerns for the mitigation of catastrophic forgetting[3, 32]. By contrast, continual learning considers how to learn a sequence of tasks while maintaining the performance on previously learned tasks. Another field related to continual learning is multitask learning[4, 6, 43], which tries to train a single model on multiple tasks simultaneously. The key difference is that, in multitask learning, the training data for all tasks are simultaneously available while continual learning is usually based on the assumption that the training data of a previous task cannot be readily used for training on the current task.

For CRL in dynamic environments, the greatest challenge comes from detecting the changes of environment autonomously. RL-CD[7] is a model-based approach that estimates a set of partial models (containing the transition probability and reward function for each underlying MDP) for predicting environmental dynamics. It detects context changes by continuously evaluating the prediction quality of each partial model on the given experience transitions in the current environment. The model with the highest prediction quality is designated as the current active model, and when there is no model with quality higher than a predefined threshold, a new one is created. Although no prior knowledge is required, RL-CD is computationally and memory intensive since it needs to build an MDP for each context.

Recently, some model-free methods[25, 22, 42] perform context division directly based on experienced transitions to achieve continual reinforcement learning. Context QL[25], a tabular CRL method, applies an online parametric Dirichlet change point (ODCP) algorithm to state-reward sequences collected during learning to detect changes in the dynamic environments. According to the results of detection, Context QL may learn a new Q table for the newly detected context, or improve the policy learned if the current environmental context has been previously experienced. However, there is a key assumption that the pattern of context changes is known and the number of such changes is finite, since ODCP can only determine whether the environmental context has changed, instead of the specific context. Meanwhile, it is only applicable to RL problems with a small and discrete state-action space due to the Q-tables in use.

For dynamic environments in continuous state spaces, CRL-Unsup[22] is an end-to-end model-free strategy that detects distributional shifts by tracking the ability of the agent to perform the task and then consolidates the memory when a change is detected. In practice, when the difference between the short-term and long-term moving averages of rewards falls below a certain threshold, the elastic weight consolidation (EWC[20]) procedure is triggered to prevent catastrophic forgetting. This method requires storing all policy weights learned before each EWC process is triggered and can be problematic in the case of a positive forward transfer followed by a possible negative backward transfer, as the non-decreasing training cumulative reward curve is used to detect distributional changes. By contrast, our method only needs to store one extra policy model, learned up to the beginning of the current task, as the teacher network for knowledge distillation, and the incremental context detection module is more effective and timely in the detection of environmental changes.

Recently, IQ[42] shows that performing context division by online clustering and training a multihead neural network with a knowledge distillation regularization term can alleviate the catastrophic interference caused by data distribution drift in the single-task RL. However, it requires the prior knowledge of the number of contexts, which limits its applicability in dynamic environments. In this article, we lift this restriction by introducing an incremental context detection module, which can instantiate contexts incrementally as needed without requiring any prior knowledge of contexts.

In addition to the aforementioned CRL efforts, LLIRL[36] is a recently proposed lifelong adaptation approach for dynamic environments, which employs the EM algorithm, together with a CRP prior on the context distribution, to learn an infinite mixture model that clusters the tasks in dynamic environments incrementally over time. In this way, LLIRL can build upon previous experiences and selectively retrieve the necessary experience to facilitate the adaptation to the current task. It should be noted that, for each instantiated context, LLIRL needs to optimize two separate sets of network parameters, one to train the behavior policy and one to parameterize the environment, which greatly increases the training and storage costs. Furthermore, it does not consider the issue of catastrophic interference among in-context tasks. By contrast, our proposed method avoids the burden of training multiple neural networks by performing policy learning with an expandable multihead neural network. Furthermore, the knowledge distillation technique can effectively reduce both between-context and in-context interference, making it possible to conduct effective learning in dynamic environments with a single policy network.

In this article, we aim to design an efficient online CRL method that can automatically detect and identify environmental changes without any prior knowledge of environmental contexts and achieve continual and stable learning in dynamic environments with continuous state-action spaces, avoiding the adverse effect of catastrophic forgetting.

III Preliminaries and Notations

The formulation of the CRL problem in the domain of dynamic environments and the related key concepts are introduced in this section. The notations used in this article are summarized in Table I.

III-A Problem Formulation

1) Reinforcement Learning in Continuous Spaces: RL is commonly studied following the MDP framework[33], which is defined as a tuple $M = ⟨ S, A, P, R, γ ⟩$ , where $S$ is the set of states; $A$ is the set of actions; $P : S \times A \times S \to [0, 1]$ is the environment transition probability function; $R : S \times A \times S \to \mathbb{R}$ is the reward function; and $γ \in [0, 1]$ is the discount factor. At each time step $t \in \mathbb{N}$ , the agent moves from $s_{t}$ to $s_{t + 1}$ with probability $p (s_{t + 1} | s_{t}, a_{t})$ after it takes action $a_{t}$ , and receives an instant reward $r_{t}$ .

Most RL algorithms rely on the mechanism of policy gradient to handle tasks with continuous state and action spaces[8]. In such cases, the policy is defined as a function $π : S \times A \to [0, 1]$ , mapping each state to a probability distribution over actions, with $\sum_{a \in A} π (a | s) = 1$ , $\forall s \in S$ . If only the state space is continuous, the policy can use the Boltzmann distribution to select discrete actions, while when the action space is also continuous, the Gaussian distribution is commonly used. The goal of policy-based RL is to find an optimal policy $π^{*}$ with internal parameter $θ \in Θ$ that maximizes the expected long-term discounted return

J(θ) = \mathbb{E}_{τ \sim π_{θ}(τ)} [R(τ)] = \mathbb{E}_{τ \sim π_{θ}(τ)} \left[ \sum_{t=0}^{\infty} γ^{t} r_{t} \right]

(1)

where the expectation is over the complete trajectory $τ = (s_{0}, a_{0}, r_{0}, s_{1}, a_{1}, r_{1}, \dots)$ generated following $π_{θ}$ until the end of the agent’s lifetime.
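As a small illustration of the objective in (1), the following Python sketch computes the discounted return of a finite reward sequence and a Monte Carlo estimate of $J (θ)$ from a batch of sampled trajectories. The function names and the truncation of the infinite sum to finite trajectories are assumptions made for this example and are not part of the original formulation.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * r_t for one finite (truncated) trajectory."""
    total = 0.0
    # Accumulate backwards so that each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(reward_sequences: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of J(theta): mean discounted return over trajectories."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)
```

For instance, with rewards $(1, 1, 1)$ and $γ = 0.5$ the discounted return is $1 + 0.5 + 0.25 = 1.75$.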

In basic policy gradient methods (e.g., REINFORCE[40]), the action-selection distribution is usually parameterized by a deep neural network trained by gradient ascent, using the partial derivative of the objective (i.e., maximizing the expected return) with respect to the policy parameters. The policy gradient can be approximately expressed as

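\nabla_{θ} J(θ) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{\infty} \nabla_{θ} \log π_{θ}(a_{t}^{i} | s_{t}^{i}) R(τ^{i})

(2)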

where $(τ^{1}, τ^{2}, \dots, τ^{N})$ is a batch of learning trajectories sampled from policy $π_{θ}$ to estimate the return expectation.
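The batch-based estimate can be sketched in Python for a toy linear-softmax policy over discrete actions. The policy class, the function names, and the use of the full-trajectory return as the weight of every step are assumptions made for illustration only; the article itself targets neural-network policies on continuous control tasks.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def grad_log_policy(theta: np.ndarray, state: np.ndarray, action: int) -> np.ndarray:
    """Gradient of log pi_theta(action | state) when logits = theta @ state."""
    probs = softmax(theta @ state)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return np.outer(one_hot - probs, state)


def reinforce_gradient(theta, trajectories, gamma):
    """Average over trajectories of sum_t grad log pi(a_t | s_t) * R(tau).

    Each trajectory is a tuple (states, actions, rewards) of equal length.
    """
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        ret = sum(gamma ** t * r for t, r in enumerate(rewards))  # R(tau)
        for s, a in zip(states, actions):
            grad += grad_log_policy(theta, np.asarray(s, dtype=float), a) * ret
    return grad / len(trajectories)
```

A gradient-ascent step would then be `theta = theta + beta * reinforce_gradient(theta, batch, gamma)`, where `beta` plays the role of the learning rate $β$ in Table I.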

Notation	Description
$M$	space of MDPs
$D$	dynamic environment over time in $M$
$M_{t}$	stationary task in the $t^{\text{th}}$ time period
$x_{t}$	feature vector of $M_{t}$
$θ_{t}$	weights of policy network in time period $t$
$π_{θ_{t}^{*}}$	approximate optimal policy over $M_{1} \sim M_{t}$
$π_{r}$	uniform random policy
$z_{t}, z_{t}^{*}$	latent and assigned context label of task $M_{t}$
$z_{1 : t}^{*}$	assigned context labels $(z_{1}^{*}, z_{2}^{*}, \dots, z_{t}^{*})$ for $[M_{1}, M_{2}, \dots, M_{t}]$
$α$	concentration parameter in CRP
$m_{k}^{(t)}$	number of assignments to context $k$ up to the $t^{\text{th}}$ time period
$K_{t}$	number of instantiated contexts up to the $t^{\text{th}}$ time period
$φ_{k}^{(t)}$	estimation of parameters of context $k$ in time period $t$
$μ_{k}^{(t)}$	estimation of centroid vector of context $k$ in time period $t$
$θ_{S}$	weights of the shared representation layers in policy network
$θ_{H, k}$	weights of the $k^{\text{th}}$ output head in policy network
$β$	learning rate for policy update
$λ$	coefficient of distillation regularization

TABLE I: Notations and their descriptions

2) CRL in Dynamic Environments: Following the convention in [36], in this article, we consider the dynamic environment as an infinite sequence of stationary tasks where each task corresponds to the specific environmental dynamics within its time period and the same dynamics may recur more than once across different time periods. The time period is assumed to be long enough for the agent to get sufficient experience samples to finish policy learning for the associated task. Suppose that there is a space of MDPs denoted as $M$ , and a dynamic environment $D$ changing over time in $M$ . The CRL agent interacts with $D = [M_{1}, M_{2}, \dots, M_{t - 1}, M_{t}, \dots]$ , where each $M_{t} \in M$ is a specific MDP/task that is stationary in the $t^{\text{th}}$ time period, and the identity of each task $M_{t}$ , $t \in \{1, 2, \dots\}$ , is unknown to the agent.

To represent the policies for multiple tasks with a single model, we assume that the observations of the dynamic environment contain the discriminative information of the tasks. Under this assumption, the goal of CRL in dynamic environments in time period $t$ is to extend the acquired knowledge, accumulated from previously learned tasks $M_{1} \sim M_{t - 1}$ , to the current task $M_{t}$ , to learn a single policy to achieve the maximum return on all learned tasks $M_{1} \sim M_{t}$

θ_{t}^{*} = \arg\max_{θ} \sum_{i=1}^{t} J_{M_{i}}(θ)

(3)

where $J_{M_{i}}$ is the expected return on task $M_{i}$ . Note that the agent is expected to learn a sequence of tasks one by one strategically so that it can retain the previously acquired knowledge when learning new tasks. In other words, the policy $π$ with parameter $θ_{t}^{*}$ in (3) is an approximate optimal policy over all learned tasks $M_{1} \sim M_{t}$ . In this article, the overall performance of the learned policy across all tasks is the primary metric for judging the performance of CRL agents.

Fig. 2: An overview of DaCoRL in dynamic environments. (a) The general framework of DaCoRL. The dynamic environment is represented by a sequence of stationary tasks $[M_{1}, M_{2}, \dots, M_{t - 1}, M_{t}, \dots]$ with which the CRL agent interacts sequentially. The incremental context detection module either associates an existing context with the current task (e.g., $M_{t - 1}$ belongs to context $1$ ) or instantiates a new context as needed (e.g., $M_{t}$ belongs to a new context $K_{t}$ ) online. When training sequentially on different tasks, a policy network with shared feature extractor (blue) and a set of expandable output heads corresponding to different contexts is maintained. (b) The Bayesian infinite Gaussian mixture model with the CRP prior for context inference. (c) Multihead neural network expansion. For task $M_{t}$ assigned to the new context $K_{t}$ , we add a new output head and initialize it to the nearest existing output head, and then train the whole network on $M_{t}$ while keeping the performance on learned tasks unchanged.

III-B Chinese Restaurant Process

The CRP[1] is a discrete-time stochastic process that defines a prior distribution over cluster structures. The term CRP arises from an analogy of seating a sequence of customers (equivalent to a stream of observations) in a Chinese restaurant with an infinite number of tables (equivalent to clusters), each of which has infinite capacity. Each customer sits randomly at an occupied table with probability proportional to the number of current customers at that table, or at an unoccupied table with probability proportional to a hyperparameter $α$ ( $α > 0$ ). The conditional probability for the $t^{\text{th}}$ customer sitting at the $k^{\text{th}}$ table is

p(z_{t} = k | z_{1 : t - 1}^{*}, α) = \begin{cases} \frac{m_{k}^{(t - 1)}}{t - 1 + α}, & k \leq K_{t - 1} \\ \frac{α}{t - 1 + α}, & k = K_{t - 1} + 1 \end{cases}

(4)

where $z_{t}$ is the latent cluster label of the $t^{\text{th}}$ customer; $k$ is the cluster label and $z_{1 : t - 1}^{*} = \{z_{1}^{*}, z_{2}^{*}, \dots, z_{t - 1}^{*}\}$ represents the cluster labels assigned to the first $t - 1$ customers; $m_{k}^{(t - 1)}$ counts the number of assignments to cluster $k$ up to time $t - 1$ and $K_{t - 1}$ denotes the number of occupied clusters after the first $t - 1$ customers are seated; $α$ serves as the concentration parameter that controls the likelihood of new clusters.
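The CRP prior in (4) can be evaluated with a few lines of Python. In this minimal sketch, the function name `crp_prior` and the representation of the seating history as a list of per-table counts are assumptions made for the example; the last entry of the returned vector is the probability of opening a new table.

```python
import numpy as np


def crp_prior(counts, alpha):
    """CRP prior over the K existing clusters plus one new cluster.

    counts[k] is the number of earlier customers seated at table k, so
    sum(counts) equals t - 1 for the t-th customer.
    """
    counts = np.asarray(counts, dtype=float)
    denom = counts.sum() + alpha  # t - 1 + alpha
    return np.append(counts / denom, alpha / denom)
```

With two occupied tables holding 2 and 1 customers and $α = 1$ , the fourth customer joins them with probabilities $2/4$ and $1/4$ and opens a new table with probability $1/4$ . As $α$ approaches zero the new-table probability vanishes, and as $α$ grows it approaches one.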

IV Proposed Method

In this section, we give a detailed description of DaCoRL, whose overview is shown in Fig. 2. We first introduce the key components, including incremental context detection and adaptive network expansion. Then, we present a joint optimization scheme combining the proposed techniques with policy-based RL to achieve competent CRL in dynamic environments. As DaCoRL can be incorporated into any canonical policy-based RL method, for the sake of clarity, we present an instantiation of DaCoRL using the vanilla policy gradient RL algorithm REINFORCE [40].

IV-A Incremental Context Detection

In dynamic environments, it is important to automatically detect and identify the changes of the environment, which can enable specialized policy learning for different tasks, alleviating the catastrophic forgetting. In this article, to characterize the environmental dynamics, we inherit and extend the concept of “context” in [42] where a set of tasks with similar dynamics is regarded as the same context. Under this setting, the detection of environmental changes can be transformed into a context division procedure in the latent space.

However, compared with the single stationary environment investigated in [42], context division in dynamic environments brings significantly more challenges. In particular, the changes are usually infinite and highly uncertain and, without prior knowledge, it is hard to determine in advance the number of contexts that need to be instantiated.

To address this issue, we propose an incremental context detection module in DaCoRL, which can instantiate contexts when necessary in an incremental manner without any prior knowledge. Specifically, we formalize the task of context detection as an online Bayesian infinite Gaussian mixture clustering procedure. To do so, we define a set of environment features to represent the environmental dynamics, and then perform clustering on the features by assuming a CRP prior distribution over the clustering structure. In the CRP, the instantiation of new contexts is controlled by the concentration parameter $α$ : the larger the value of $α$ , the more contexts are instantiated in general. At one extreme, as $α \to 0$ , only a single context is instantiated during the learning process. At the other extreme, as $α \to \infty$ , a new context is instantiated for each task encountered. The complete procedure of incremental context detection with CRP is described in Algorithm 1, which consists of the following two steps.

1) Environment Feature Construction (lines 1-2): With the assumption that the task information is contained in the observation space, we can implicitly detect the changes of environmental dynamics by tracking the variation of the observation distribution. In practical implementations, we construct the features for a specific task from the collected observations. Specifically, before the start of RL training in each time period, the agent is required to explore the current environment using a random policy

π_{r}(a | s) = \mathrm{Uniform}(A(s))

(5)

where $A (s)$ is the set of available actions in state $s$ and $\mathrm{Uniform}(\cdot)$ is the uniform distribution function. Subsequently, we construct the feature vector $x_{t}$ for $M_{t}$ from the observations $T_{ε} = \{s_{0}^{i}, s_{1}^{i}, s_{2}^{i}, \dots\}_{i = 1}^{m}$ , where $m$ is the number of trajectories and $i$ denotes the trajectory index, so that $x_{t}$ represents the dynamics of the current environment.

Here, we approximate the set of collected observations as a Gaussian distribution and use its mean vector as the feature of the current environment. Naturally, we can obtain $x_{t}$ directly from the original observation space

x_{t} = \frac{1}{N} \sum_{i=1}^{N} s_{i}

(6)

for the vector inputs, or from the embedding space

x_{t} = \frac{1}{N} \sum_{i=1}^{N} ϕ(s_{i})

(7)

for the visual inputs, where $N$ is the number of states contained in $T_{ε}$ and $ϕ (s)$ can be represented by a random encoder[31] or a standard GAN model[13].
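The feature construction in (5)-(7) can be sketched in Python as follows. The function names, the box-shaped action set used for uniform sampling, and the optional `encoder` callable standing in for $ϕ$ are assumptions made for this example; collecting the trajectories themselves would require an environment interface that is not specified here.

```python
import numpy as np


def uniform_random_action(low, high, rng):
    """Sample an action uniformly from a box-shaped action set [low, high]."""
    return rng.uniform(low, high)


def environment_feature(trajectories, encoder=None):
    """Mean of all (optionally encoded) states across exploration trajectories.

    trajectories is a list of state sequences; N is the total number of states.
    """
    states = [np.asarray(s, dtype=float) for traj in trajectories for s in traj]
    if encoder is not None:
        # Embed each state first, as in (7) for visual inputs.
        states = [np.asarray(encoder(s), dtype=float) for s in states]
    return np.mean(np.stack(states), axis=0)
```

For two trajectories containing the states $(0, 0), (2, 2)$ and $(4, 4)$ , the resulting feature is the mean over all three states, $(2, 2)$ .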

Input: The learning task $M_{t}$ in the $t^{\text{th}}$ time period; the parameter set $\{φ_{k}^{(t - 1)}\}_{k = 1}^{K_{t - 1}}$ of contexts already instantiated before the $t^{\text{th}}$ time period.
Parameter: Concentration parameter $α$ .
Output: Task-to-context assignment $z_{t}^{*}$ for $M_{t}$ ; the parameters $\{φ_{k}^{(t)}\}_{k = 1}^{K_{t}}$ of instantiated contexts.

1: Sample $m$ trajectories from $M_{t}$ using a uniform policy $π_{r}$ : $T_{ε} = \{τ_{i}\}_{i = 1}^{m}, τ_{i} \sim π_{r}$
2: Construct the feature vector $x_{t}$ of $M_{t}$ from $T_{ε}$
3: Initialize $φ_{K_{t - 1} + 1}^{(t - 1)}$ of a new potential context as $x_{t}$
4: Compute CRP prior $p (z_{t} = k | z_{1 : t - 1}^{*}, α)$ for $x_{t}$
5: Compute posterior probabilities of task-to-context assignment $p (z_{t} = k | z_{1 : t - 1}^{*}, x_{t})$
6: if $p (z_{t} = K_{t - 1} + 1 | z_{1 : t - 1}^{*}, x_{t}) > p (z_{t} = k | z_{1 : t - 1}^{*}, x_{t})$ , $\forall k \leq K_{t - 1}$ then
7:     $K_{t} = K_{t - 1} + 1$ , add $φ_{K_{t - 1} + 1}^{(t - 1)}$ to $\{φ_{k}^{(t - 1)}\}_{k = 1}^{K_{t - 1}}$
8: else
9:     $K_{t} = K_{t - 1}$
10: end if
11: Update parameters $\{φ_{k}^{(t - 1)}\}_{k = 1}^{K_{t}}$
12: Calculate $z_{t}^{*} = \arg\max_{k} p (x_{t} | z_{t} = k, φ_{k}^{(t)})$ , $\forall k \leq K_{t}$ , to obtain the final assignment of $M_{t}$
13: return $z_{t}^{*}$ , $\{φ_{k}^{(t)}\}_{k = 1}^{K_{t}}$

Algorithm 1 Incremental Context Detection with CRP

2) Context Detection via Online Incremental Clustering (lines 3-12): To enable incremental context detection in dynamic environments, based on the constructed environment features, we employ online Bayesian inference to update the context models in a fully online fashion, avoiding the necessity for storing previously seen samples. The key step is to estimate the context parameters ${φ_{k}}_{k = 1}^{K_{t}}$ , which can build up a mapping between specific tasks and context variables in the latent space. In our implementations, the context model (predictive likelihood function) represents environment features using diagonal Gaussian distributions

p (x_{t} | φ_{k}) = N (x_{t}; μ_{k}, σ^{2})

(8)

where $μ_{k}$ is the mean of the Gaussian used to denote context $k$ and $σ^{2}$ is a constant indicating the variance. Under this setting, the context centroids are just the mean vectors of the Gaussian distributions to be estimated: $φ_{k} = {μ_{k}}$ .

For a sequence of tasks $[M_{1}, M_{2}, \dots, M_{t - 1}, M_{t}]$ , the first task is assigned to the first context by default. For $M_{t}$ in the $t^{th}$ time period, we instantiate a new potential context $K_{t - 1} + 1$ , and initialize the new context model parameter $φ_{K_{t - 1} + 1}^{(t - 1)}$ as the feature vector $x_{t}$ , regardless of whether $M_{t}$ has been encountered before (line 3). After that, the posterior probabilities of the task-to-context assignment over all $K_{t - 1} + 1$ contexts are estimated (lines 4-5) by

p (z_{t} = k | z_{1 : t - 1}^{*}, x_{t}, φ_{k}^{(t - 1)}, α) \propto p (x_{t} | φ_{k}^{(t - 1)}) \, p (z_{t} = k | z_{1 : t - 1}^{*}, α) .

(9)

By combining the definition of the predictive likelihood in (8) and the CRP prior distribution in (4), the posterior distribution can be rewritten as

(10)

which is abbreviated to $p (z_{t} = k | z_{1 : t - 1}^{*}, x_{t})$ in subsequent derivations for simplicity.

With the estimated $p (z_{t} = k | z_{1 : t - 1}^{*}, x_{t})$ for each context, we can determine whether to keep the new context for $M_{t}$ (lines 6-10). If the posterior probability of the potential context is greater than those of the $K_{t - 1}$ existing contexts, this new context is instantiated for $M_{t}$ .
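The posterior computation and the new-context test (lines 4-6 of Algorithm 1) can be sketched as follows. This is an illustration under stated assumptions: the CRP prior is taken in its standard form (weight proportional to the number of tasks already assigned to an existing context, and proportional to $α$ for the potential new context), the Gaussian in (8) is treated as isotropic with a fixed variance, and the function names and numerical values are hypothetical.

```python
import math
from typing import List, Sequence


def log_gaussian(x: Sequence[float], mu: Sequence[float], sigma2: float) -> float:
    """Log-density of an isotropic Gaussian N(x; mu, sigma2 * I), cf. (8)."""
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return -0.5 * (len(x) * math.log(2.0 * math.pi * sigma2) + sq / sigma2)


def context_posterior(x, centroids, counts, alpha, sigma2) -> List[float]:
    """Posterior over the K existing contexts plus one potential new context.

    The last entry belongs to the potential context, whose centroid is
    initialised at x (line 3 of Algorithm 1).
    """
    mus = [list(c) for c in centroids] + [list(x)]
    # Assumed standard CRP prior, up to a shared normaliser: m_k or alpha.
    weights = list(counts) + [alpha]
    logs = [math.log(w) + log_gaussian(x, mu, sigma2) for w, mu in zip(weights, mus)]
    top = max(logs)                       # subtract the max for numerical stability
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def instantiate_new(posterior: Sequence[float]) -> bool:
    """True if the potential context beats every existing one (line 6)."""
    return all(posterior[-1] > p for p in posterior[:-1])


centroids = [[0.0, 0.0], [1.0, 1.0]]      # hypothetical context centroids
counts = [3, 2]                           # tasks assigned to each context so far
near = context_posterior([0.05, 0.0], centroids, counts, alpha=0.5, sigma2=0.05)
far = context_posterior([0.5, -0.8], centroids, counts, alpha=0.5, sigma2=0.05)
```

In this toy setting the feature close to the first centroid is absorbed by that context, while the distant feature makes the potential context the most probable one, so a new context would be instantiated.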

Next, based on the inferred posterior probabilities, we update context parameters by optimizing the expected log-likelihood

L (x_{t} | φ_{k}^{(t - 1)}) = \mathbb{E}_{M_{t}} [\log p (x_{t} | z_{t} = k, φ_{k}^{(t - 1)})]

(11)

where $M_{t} \sim p (z_{t} = k | z_{1 : t - 1}^{*}, x_{t})$ . Suppose that all context models with the prior parameters $φ_{k}^{(t - 1)}$ have been optimized up to the $(t - 1)^{t h}$ time period. The estimation of ${φ_{k}^{(t)}}_{k = 1}^{K_{t}}$ can be updated based on the gradient (line 11)

φ_{k}^{(t)} \leftarrow φ_{k}^{(t - 1)} + η_{k, t} \nabla_{φ_{k}^{(t - 1)}} L (x_{t} | φ_{k}^{(t - 1)}), \forall k \leq K_{t}

(12)

where $η_{k, t} = \frac{1}{m_{k, t - 1} + p (z_{t} = k | z_{1 : t - 1}^{*}, x_{t})}$ is the learning rate for the $k^{t h}$ context in time period $t$ . The gradient term can be derived as $p (z_{t} = k | z_{1 : t - 1}^{*}, x_{t}) \nabla_{φ_{k}} log p (x_{t} | z_{t} = k, φ_{k}^{(t - 1)})$ .

Based on the updated context parameters, the identity $z_{t}^{*}$ of $M_{t}$ can be finally obtained by computing an MAP estimate on the predictive likelihood (line 12), selecting the context model that best fits the current environment.
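For the Gaussian context model in (8), the gradient of the log-likelihood with respect to $μ_{k}$ is $(x_{t} - μ_{k}) / σ^{2}$, so the update in (12) moves each centroid toward $x_{t}$ in proportion to its posterior probability. The sketch below illustrates lines 11-12 of Algorithm 1 under two assumptions that are ours: $m_{k, t - 1}$ is read as the number of tasks assigned to context $k$ so far, and the constant factor $1 / σ^{2}$ is folded into the step size. Function names and values are hypothetical.

```python
from typing import List, Sequence


def update_centroids(x, centroids, counts, posterior) -> List[List[float]]:
    """One online step of (12) for every context (line 11 of Algorithm 1)."""
    updated = []
    for mu, m, p in zip(centroids, counts, posterior):
        eta = 1.0 / (m + p)               # learning rate eta_{k,t}
        # Posterior-weighted gradient step; 1/sigma^2 absorbed (assumption).
        updated.append([mi + eta * p * (xi - mi) for mi, xi in zip(mu, x)])
    return updated


def map_assignment(x: Sequence[float], centroids) -> int:
    """Index of the context with the highest predictive likelihood (line 12).

    With a shared fixed variance, maximising the Gaussian likelihood is the
    same as minimising the squared distance to the centroid.
    """
    dists = [sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) for mu in centroids]
    return min(range(len(dists)), key=dists.__getitem__)


x_t = [0.2, 0.4]
centroids = [[0.0, 0.0], [1.0, 1.0]]
counts = [1, 1]
posterior = [1.0, 0.0]                    # toy posterior: all mass on context 0
new_centroids = update_centroids(x_t, centroids, counts, posterior)
z_star = map_assignment(x_t, new_centroids)
```

With all posterior mass on the first context and one previously assigned task, the first centroid moves halfway toward $x_{t}$, which matches a running mean, and the second centroid is left unchanged.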

IV-B Adaptive Network Expansion

To minimize the interference among contexts in CRL, we opt for an expandable neural network as the policy network in DaCoRL, in which an output head is added synchronously with each newly instantiated context. Each output head specializes in a specific context, and the representation layers are shared among different contexts. This adaptive and expandable network structure can parameterize a dedicated policy for each context without any unnecessary redundancy. In Fig. 2(a), the set of weights of the expandable multihead neural network is denoted by $θ = \{θ_{S}, θ_{H, 1}, θ_{H, 2}, \dots, θ_{H, K_{t}}, \dots\}$ , where $θ_{S}$ is the set of weights of the shared representation layers, and $θ_{H, k}$ , $k \in \{1, 2, \dots\}$ are the parameters of the $k^{th}$ output head.

The policy network is initialized as a canonical single-head neural network for the forthcoming learning in the first context. When a new context is instantiated, the neural network is adaptively expanded by adding an output head whose structure is consistent with that of the existing output heads. For the newly added output head $K_{t}$ , we propose the following three practical implementations for parameter initialization.

Random Initialization: The weights of the newly added output head are initialized to random values. In this case, the policy of the newly instantiated context inherits the representation module of the learned policies, while its output layer is trained from scratch.
Random Trained Head Initialization: It randomly selects one of the trained output heads and then initializes the weights of the newly added output head to those of the selected one

$θ_{H, K_{t}} = θ_{H, k}, k = \mathrm{Uniform} (\{1, 2, \dots, K_{t - 1}\}) .$ (13)

This method enables the new policy to inherit knowledge from the learned context. However, it is likely that the cloned policy may hinder the CRL agent’s ability to properly explore the task space corresponding to the current context, especially when there are significant differences between the two contexts.
Nearest Trained Head Initialization: It initializes the newly added output head to the trained head whose associated context is nearest to the newly instantiated context in the latent context space

$θ_{H, K_{t}} = θ_{H, k^{*}}, k^{*} = \arg\min_{k} \mathrm{dist} (φ_{k}, φ_{K_{t}})$ (14)

where $k \in \{1, 2, \dots, K_{t - 1}\}$ and $\mathrm{dist} (φ_{k}, φ_{K_{t}})$ denotes the distance between contexts $k$ and $K_{t}$ .

Intuitively, the third implementation is most likely to encourage forward transfer. The experimental results and analysis in Section V-F provide further elaboration on this point.
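A minimal sketch of the nearest trained head initialization in (14) is given below. The Euclidean distance between centroids is an assumption (the text only requires some distance in the latent context space), and the dictionary-of-lists representation of head weights as well as the function names are purely illustrative.

```python
import copy
import math
from typing import Dict, List, Sequence

Head = Dict[str, List[float]]             # hypothetical container for head weights


def nearest_trained_head(centroids: Sequence[Sequence[float]],
                         new_centroid: Sequence[float]) -> int:
    """k* = argmin_k dist(phi_k, phi_{K_t}) over the trained contexts."""
    dists = [math.dist(c, new_centroid) for c in centroids]
    return min(range(len(dists)), key=dists.__getitem__)


def expand_heads(heads: List[Head],
                 centroids: Sequence[Sequence[float]],
                 new_centroid: Sequence[float]) -> int:
    """Append a new head initialised as a copy of the nearest trained head."""
    k_star = nearest_trained_head(centroids, new_centroid)
    # Deep copy so that later training of the new head leaves head k* intact.
    heads.append(copy.deepcopy(heads[k_star]))
    return k_star


heads = [{"w": [0.1, 0.2]}, {"w": [0.9, 0.8]}]
centroids = [[0.0, 0.0], [1.0, 1.0]]
k_star = expand_heads(heads, centroids, new_centroid=[0.9, 0.7])
```

Copying rather than sharing the weights keeps the source head unchanged while the new head is fine-tuned on its own context.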

IV-C Joint Optimization Scheme

In the policy optimization stage, we integrate the above components with policy-based RL algorithms to achieve efficient CRL in dynamic environments. Furthermore, we employ the knowledge distillation technique [42] to mitigate the interference among the policies learned in different contexts (corresponding to different output heads in the policy network), which is caused by the shared low-level representation. This technique can effectively lessen the catastrophic interference caused by distribution drifts among contexts as well as the minor interference due to the differences among tasks within the same context. Taking REINFORCE as an example of the underlying policy-based RL algorithm, the optimization objective of our proposed DaCoRL is derived as follows.

First of all, we rewrite the original loss function of REINFORCE in (1) with the context label variable $z_{t}^{*}$ as

L_{ori} (θ_{z_{t}^{*}}) = \mathbb{E}_{τ \sim π_{θ_{z_{t}^{*}}} (τ)} [R (τ)]

(15)

where $θ_{z_{t}^{*}} = {θ_{S}, θ_{H, z_{t}^{*}}}$ and $π_{θ_{z_{t}^{*}}}$ is the policy corresponding to the context associated with the current task $M_{t}$ .

As a representative of model compression, knowledge distillation can work well for encouraging the outputs of one network to approximate the outputs of another [14, 35]. In DaCoRL, we use it as a regularization term in the probability distribution estimation of actions to preserve the previously learned policies. To construct the distillation regularization term, we regard the policy network from the last time period as the teacher network, expressed as $π_{θ^{-}}$ , and the current policy network to be trained as the student network, expressed as $π_{θ}$ . We use the Kullback-Leibler (KL) divergence to constrain the difference between the policies corresponding to the two networks. Thus, the distillation loss of the output head for context $k$ is defined as

L_{D_{k}} (θ_{k}) = \mathbb{E}_{τ \sim π_{θ_{z_{t}^{*}}} (τ)} [\mathrm{KL} [π_{θ_{k}} (\cdot | s), π_{θ_{k}^{-}} (\cdot | s)]]

(16)

where $θ_{k} = {θ_{S}, θ_{H, k}}$ . In the $t^{t h}$ time period, considering all $K_{t}$ output heads in the policy network, the distillation loss term sums up as

L_{D} (θ) = \sum_{k = 1}^{K_{t}} L_{D_{k}} (θ_{k})

(17)

where $θ = {θ_{S}, {θ_{H, k}}_{k = 1}^{K_{t}}}$ denotes the set of all weights of the policy network.

Finally, to optimize a policy network that can guide the agent to make proper decisions in dynamic environments without being adversely affected by catastrophic forgetting, we combine (15) and (17) to form a joint optimization scheme. Namely, we solve the CRL problem in dynamic environments by the following optimization objective

\begin{aligned} \max_{θ_{S}, θ_{H}} \quad & L_{ori} (θ_{z_{t}^{*}}) - λ L_{D} (θ) \\ & θ_{z_{t}^{*}} = \{θ_{S}, θ_{H, z_{t}^{*}}\} \\ & θ = \{θ_{S}, \{θ_{H, k}\}_{k = 1}^{K_{t}}\} \end{aligned}

(18)

where $λ \in [0, 1]$ is a coefficient to control the tradeoff between learning the new policy and preserving the learned policies.
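To make the structure of (16)-(18) concrete, the following sketch evaluates the joint objective for a toy case. It assumes discrete action distributions and a precomputed estimate of the expected return, both of which are simplifications for illustration (the experiments use continuous control); the function names and numbers are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete action distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(student_probs, teacher_probs) -> float:
    """L_{D_k} in (16): mean KL(student || teacher) over the visited states."""
    kls = [kl_divergence(s, t) for s, t in zip(student_probs, teacher_probs)]
    return sum(kls) / len(kls)


def joint_objective(expected_return: float, student_heads, teacher_heads, lam: float) -> float:
    """Objective in (18): L_ori minus lambda times the summed distillation loss (17)."""
    l_d = sum(distillation_loss(s, t) for s, t in zip(student_heads, teacher_heads))
    return expected_return - lam * l_d


# Two heads, one visited state each; every inner list is pi(. | s).
teacher_heads = [[[0.7, 0.3]], [[0.9, 0.1]]]
student_heads = [[[0.7, 0.3]], [[0.5, 0.5]]]   # head 2 has drifted from its teacher
j_drift = joint_objective(-12.0, student_heads, teacher_heads, lam=0.1)
j_same = joint_objective(-12.0, teacher_heads, teacher_heads, lam=0.1)
```

The penalty vanishes when every head still matches its teacher and grows as any head drifts, which is how the regularizer discourages forgetting while the current head is optimized for return.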

Input: Dynamic environment $D = [M_{1}, \dots, M_{t}, \dots, M_{T}]$ ;
         Single-head policy network $π_{θ}$ with random weights $θ = \{θ_{S}^{(0)}, θ_{H, 1}^{(0)}\}$ .
Parameter: Concentration parameter $α$ ; Learning rate $β$ ; Distillation regularization coefficient $λ$ .
Output: Parameter set $\{φ_{k}\}_{k = 1}^{K_{T}}$ of instantiated contexts;
            Approximate optimal policy parameters $θ^{*}$ for $D$ .

1: for each time period $t \in \{1, 2, \dots, T\}$ do
2:   if $t = 1$ then
3:     Sample randomly and construct the feature vector $x_{1}$ for the task $M_{1}$
4:     Instantiate $M_{1}$ as the first context: $φ_{1}^{(1)} = x_{1}$ , $z_{1}^{*} = 1$
5:     Set $K_{1} = 1$ , $m_{1}^{(1)} = 1$ in the CRP model
6:     Update the policy network from scratch using the canonical policy gradient method to obtain $θ^{*}$ : $θ \leftarrow θ + β \nabla_{θ} L_{ori}$
7:   else
8:     Infer the task-to-context assignment $z_{t}^{*}$ for $M_{t}$ using incremental context detection with CRP: $z_{t}^{*}, \{φ_{k}^{(t)}\}_{k = 1}^{K_{t}} \leftarrow$ Algorithm 1 $(M_{t}, \{φ_{k}^{(t - 1)}\}_{k = 1}^{K_{t - 1}}, α)$
9:     Update the CRP model according to $z_{t}^{*}$ : $m_{k}^{(t)} = m_{k}^{(t - 1)}, \forall k \leq K_{t}$ ; $m_{z_{t}^{*}}^{(t)} = m_{z_{t}^{*}}^{(t - 1)} + 1$
10:    if $z_{t}^{*} = K_{t - 1} + 1$ then
11:      Expand the policy network $π_{θ}$ by adding an output head with the nearest trained head initialization, and add $θ_{H, K_{t}}$ to $θ$
12:    end if
13:    Set $π_{θ^{-}} = π_{θ}$ for distillation
14:    Update the policy network using the policy gradient method in (18) to obtain $θ^{*}$ : $θ \leftarrow θ + β \nabla_{θ} (L_{ori} - λ L_{D})$
15:  end if
16: end for
17: return $\{φ_{k}\}_{k = 1}^{K_{T}}$ , $θ^{*}$

Algorithm 2 DaCoRL

The complete procedure of DaCoRL is summarized in Algorithm 2, where the agent interacts with a dynamic environment $D = [M_{1}, \dots, M_{t}, \dots, M_{T}]$ . In the first time period $t = 1$ , the task $M_{1}$ is instantiated as the first context with parameter $φ_{1}^{(1)}$ (lines 3-4). We initialize the CRP model with $K_{1} = 1$ , $m_{1}^{(1)} = 1$ (line 5) and train the single-head policy network from scratch using the canonical policy gradient method (line 6). In the $t^{th}$ ( $t \geq 2$ ) time period, we first apply the incremental context detection module to identify the context to which the current task belongs (line 8), and update the CRP model based on the identity of $M_{t}$ for future context detection (line 9). Then, we employ an expandable multihead neural network for policy optimization. Specifically, when a new context is instantiated, the policy network is synchronously expanded by adding an output head initialized with the nearest trained head (lines 10-12). Next, the knowledge distillation regularization term is integrated into the loss of the original RL algorithm to reduce the catastrophic forgetting of learned tasks, and the parameters of the policy network are updated until convergence (lines 13-14). Finally, $K_{T}$ contexts with parameters $\{φ_{k}\}_{k = 1}^{K_{T}}$ and the optimal policy $π_{θ^{*}}$ are obtained for all learned tasks.

V Experiments and Evaluations

In this section, we conduct comprehensive experiments on several continuous control tasks from robot navigation to MuJoCo locomotion to demonstrate the effectiveness of our method. We design a variety of sequential learning tasks with diverse changes in the underlying dynamics. These problem settings are expected to be representative of the dynamic environments that RL agents may encounter in real-world scenarios. The following are the overarching questions that we aim to answer from our experiments and analysis.

Q1: Does DaCoRL successfully achieve better continual reinforcement learning in various dynamic environments compared with existing methods?
Q2: How does the initialization strategy of the newly added output head affect the performance of DaCoRL?
Q3: How does the number of instantiated contexts in the latent space affect the performance of DaCoRL?
Q4: Can DaCoRL achieve a positive forward transfer during the learning process?
Q5: How well does DaCoRL generalize to previously unseen tasks?

V-A Datasets

1) Robot Navigation[38, 36]: It contains three types of dynamic environments with parametric variation across tasks. In Fig. 3, Types I, II, and III indicate that the dynamic environments are created in terms of parametric variation in the goal position (changes in the reward function), the puddle positions (changes in the state transition function), and both the goal and puddle positions (changes in both the reward and state transition functions), respectively. In each navigation task, a robot agent needs to move to a goal position within a unit square. The state consists of the agent’s current 2-D position, and the action corresponds to the 2-D velocity commands in the range of $[- 0.1, 0.1]$ . The reward is equal to the negative squared distance to the goal position minus a small control cost that is proportional to the action’s scale. Each learning episode always starts from a given position and terminates when the agent is within $0.01$ of the goal or when the episode length exceeds $100$ . We choose these commonly used domains because they are well understood and suitable for highlighting the mechanism and verifying the effectiveness of our proposed method in a straightforward manner.
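The reward and termination rule of the navigation tasks can be sketched as below. The control cost coefficient (`control_cost_coef`) and the use of the squared action norm as the "action's scale" are assumptions made for illustration, since the text does not specify them; puddles and the transition dynamics are omitted.

```python
import math
from typing import Sequence


def navigation_reward(position: Sequence[float],
                      action: Sequence[float],
                      goal: Sequence[float],
                      control_cost_coef: float = 0.01) -> float:
    """Negative squared distance to the goal minus a small control cost."""
    sq_dist = sum((p - g) ** 2 for p, g in zip(position, goal))
    control_cost = control_cost_coef * sum(a * a for a in action)  # assumed form
    return -sq_dist - control_cost


def episode_done(position: Sequence[float],
                 goal: Sequence[float],
                 step: int,
                 goal_radius: float = 0.01,
                 max_steps: int = 100) -> bool:
    """Terminate within 0.01 of the goal or once the episode exceeds 100 steps."""
    return math.dist(position, goal) < goal_radius or step > max_steps


r = navigation_reward(position=[0.0, 0.0], action=[0.1, -0.1], goal=[0.6, 0.8])
```

Under this sketch, Type I tasks differ only in `goal`, which is why the reward function changes across tasks while the dynamics stay fixed.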

2) MuJoCo Locomotion[36]: It contains three locomotion tasks with parametric variation and growing dimensions of state-action spaces, as shown in Fig. 4. These continuous control tasks require a one-legged hopper, a planar cheetah or a 3-D quadruped ant robot to run at a particular velocity along the positive $x$ -direction. The reward is an alive bonus plus a regular part that is negatively correlated with the absolute difference between the current velocity of the agent and a preset target velocity. The dynamic environment is designed to apply parametric variations in the target velocity within a range: $[0.0, 1.0]$ for Hopper, $[0.0, 2.0]$ for HalfCheetah, and $[0.0, 0.5]$ for Ant. Each learning episode always starts from a given physical status of the agent and terminates when the agent falls down or when the episode length exceeds $100$ . We choose these domains to further evaluate the efficiency of our method in more sophisticated settings.

Fig. 3: Examples of three types of dynamic environments in the robot navigation tasks[38, 36]. $S$ is the start point and $G$ is the goal point. Puddles are shown in white. (a) Type I: the goal changes. (b) Type II: the puddles change. (c) Type III: both the goal and the puddles change.

Fig. 4: Representative MuJoCo locomotion tasks[36] with growing dimensions of state-action spaces including (a) Hopper, $| S | = 11$ , $| A | = 3$ , and $r = 1 - 4 \cdot | v_{x} - v_{g} |$ . (b) HalfCheetah, $| S | = 20$ , $| A | = 6$ , and $r = - | v_{x} - v_{g} |$ . (c) Ant, $| S | = 111$ , $| A | = 8$ , and $r = 1 - 3 \cdot | v_{x} - v_{g} |$ . $v_{x}$ is the agent’s velocity in the positive $x$ -direction and $v_{g}$ is the target velocity.
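The three velocity-tracking rewards listed in the Fig. 4 caption share one form, an alive bonus minus a scaled absolute velocity error, and can be written as a small lookup. The table name `REWARD_SPEC` and the function name are illustrative; the coefficients are taken from the caption, with the second argument of the absolute difference read as the target velocity.

```python
from typing import Dict, Tuple

# (alive bonus, velocity-error scale) per robot, read off the Fig. 4 caption.
REWARD_SPEC: Dict[str, Tuple[float, float]] = {
    "Hopper": (1.0, 4.0),
    "HalfCheetah": (0.0, 1.0),
    "Ant": (1.0, 3.0),
}


def velocity_reward(robot: str, v_x: float, v_target: float) -> float:
    """Alive bonus minus the scaled absolute gap between current and target velocity."""
    alive_bonus, scale = REWARD_SPEC[robot]
    return alive_bonus - scale * abs(v_x - v_target)


r_hopper = velocity_reward("Hopper", v_x=0.5, v_target=0.5)
r_cheetah = velocity_reward("HalfCheetah", v_x=1.0, v_target=2.0)
r_ant = velocity_reward("Ant", v_x=0.0, v_target=0.5)
```

Only `v_target` varies across tasks in the dynamic environment, so each task corresponds to one point in the target-velocity range of its robot.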

In our experiments, to constitute the continual learning process, we uniformly generate four Gaussian clusters (representing the expected contexts in this article) for each type of the dynamic environment in its parametric variation space. Each of the first three clusters consists of 12 samples and the fourth cluster contains 14 samples, with each sample containing the values of the variable parameters of a specific task. We sequentially arrange these $T = 50$ tasks in a random order, resulting in a dynamic environment $D = [M_{1}, M_{2}, \dots, M_{T}]$ . For DaCoRL (Oracle), the supervised version of DaCoRL mentioned in the next section, we make available the task-to-context assignments based on the generation process of dynamic environments. By contrast, the correspondence between tasks and contexts is unknown and needs to be identified by DaCoRL itself. In addition, by reference to the benchmarks in the scenario of learning multiple tasks with a single model in environments with parametric variation [41], we construct observation spaces with extra dimensions of variable parameters for the above dynamic environments to meet the assumption in this article that the observations contain the discriminative information of the tasks.
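The construction of a dynamic environment described above can be sketched as follows: draw task parameters from four Gaussian clusters with 12, 12, 12, and 14 samples, then shuffle the 50 tasks. The cluster centers, the standard deviation, and the seed are hypothetical values chosen for illustration; the cluster label kept with each task corresponds to the assignment made available to DaCoRL (Oracle).

```python
import random
from typing import List, Sequence, Tuple

Task = Tuple[List[float], int]            # (variable parameters, cluster label)


def make_dynamic_environment(centers: Sequence[Sequence[float]],
                             sizes: Sequence[int] = (12, 12, 12, 14),
                             std: float = 0.05,
                             seed: int = 0) -> List[Task]:
    """Sample task parameters around each cluster center, then randomize the order."""
    rng = random.Random(seed)
    tasks: List[Task] = []
    for label, (center, n) in enumerate(zip(centers, sizes)):
        for _ in range(n):
            params = [rng.gauss(c, std) for c in center]
            tasks.append((params, label))
    rng.shuffle(tasks)                    # random task sequence M_1, ..., M_T
    return tasks


# Hypothetical goal-position clusters inside the unit square (Type I setting).
centers = [[0.2, 0.2], [0.2, 0.8], [0.8, 0.2], [0.8, 0.8]]
env = make_dynamic_environment(centers)
cluster_sizes = [sum(1 for _, lab in env if lab == c) for c in range(4)]
```

DaCoRL itself would receive only the task sequence, without the labels, and has to recover the grouping through incremental context detection.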

V-B Baselines

We evaluate our method in comparison to the following four state-of-the-art baseline methods and one supervised version of our proposed DaCoRL in dynamic environments.

Naive: It refers to canonical RL methods (e.g., REINFORCE[40], Proximal Policy Optimization (PPO)[30]) that simply train a policy model during the learning process, without paying attention to any possible environmental changes or forgetting.
CRLUnsup[22]: It detects environmental changes by observing the ability of the agent to perform the task (i.e., the difference between the actual reward and the expected one). When this difference goes below a certain threshold, it triggers the memory consolidation procedure employing Elastic Weight Consolidation (EWC) [20].
CDKD: It is adapted from the IQ [42] framework that can alleviate catastrophic interference for value-based RL in stationary environments. In this article, we extend IQ to policy-based RL and perform context division and knowledge distillation (renamed as CDKD), and set the true number of contexts in advance so that it can conduct continual learning in dynamic environments.
LLIRL[36]: It develops and maintains a library that contains an infinite mixture of parameterized environment models for lifelong adaptation in dynamic environments, which focuses on building upon the prior knowledge accumulated during previous learning to optimize the learning parameters to achieve the maximum return in the current environment, without considering the forgetting of previously learned policies.
DaCoRL (Oracle): This approach can be regarded as a supervised version of our proposed DaCoRL, and the key difference is that the agent is informed in each time period of the specific task-to-context mapping (i.e., all context identifications are made available).

In the experiments, we use policy search with nonlinear function approximation to handle continuous control tasks in dynamic environments. For the robot navigation tasks, we perform gradient updates using the vanilla policy gradient RL algorithm REINFORCE. In addition, we employ PPO as the base RL algorithm to learn in the more challenging MuJoCo locomotion tasks.

[Figure: Testing curves of the average returns over all tasks in the dynamic environments.]

V-C Implementation

We adopt a similar network architecture for all tasks of robot navigation and MuJoCo locomotion. The policy of DaCoRL is approximated by a feed-forward neural network that contains a fully connected hidden layer (with $200$ units) used as the feature extractor and an expandable output head module used as the action distribution predictor, where each output head consists of a $200$ -unit fully connected hidden layer and a fully connected output layer. The hidden layers are connected by ReLU nonlinearity, following the network configuration for these tasks in [36]. For a fair comparison, the network architecture of CDKD is set to the same as that of DaCoRL and the number of contexts predetermined for CDKD is also set to be consistent with that automatically detected by DaCoRL. For Naive, CRLUnsup, and LLIRL, each policy network consists of two $200$ -unit hidden layers connected by ReLU nonlinearity and a fully connected output layer.

During the learning process, we train the model for $1000$ policy iterations on each task, and evaluate the policy performance by testing the current policy on all tasks in the dynamic environment every $100$ iterations. All results reported are the average performance over five independent runs.

V-D Evaluation Metrics

Following the convention in previous studies[38, 36], we define two performance metrics to systematically evaluate our proposed method. The first one is the average return over a batch of test episodes on all tasks in the dynamic environment, which is used to evaluate the overall performance of the model on all tasks after a fixed number of policy iterations during training in real time. It is defined as

R_{ave} = \frac{1}{T m} \sum_{i = 1}^{T} \sum_{j = 1}^{m} R (τ_{i j})

(19)

where $T$ is the number of tasks in the dynamic environment; $m$ is the number of episodes tested for each task; $R (τ_{i j})$ is the cumulative reward obtained in the $j^{t h}$ test episode of task $i$ . The other is the average return over all test episodes, which evaluates the average performance of the model over the entire training process. It is defined as

\bar{R}_{ave} = \frac{1}{J} \sum_{i = 1}^{J} R_{ave}^{(i)}

(20)

where $J$ is the number of $R_{a v e}$ evaluations in the learning process, and $R_{a v e}^{(i)}$ is $R_{a v e}$ in the $i^{t h}$ evaluation.
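Both metrics reduce to nested averages, as the following sketch shows. The function names and the toy return values are illustrative only.

```python
from typing import Sequence


def r_ave(returns: Sequence[Sequence[float]]) -> float:
    """(19): mean return over T tasks with m test episodes each; returns[i][j] = R(tau_ij)."""
    num_tasks = len(returns)
    m = len(returns[0])
    if any(len(task_returns) != m for task_returns in returns):
        raise ValueError("every task needs the same number of test episodes")
    return sum(sum(task_returns) for task_returns in returns) / (num_tasks * m)


def r_bar_ave(evaluations: Sequence[float]) -> float:
    """(20): mean of the J values of R_ave recorded during learning."""
    return sum(evaluations) / len(evaluations)


# Two evaluations of T = 2 tasks with m = 2 test episodes each (toy numbers).
eval_1 = r_ave([[-10.0, -12.0], [-20.0, -18.0]])
eval_2 = r_ave([[-8.0, -8.0], [-12.0, -12.0]])
overall = r_bar_ave([eval_1, eval_2])
```

Because every evaluation covers all tasks in the environment, a drop in `r_ave` after switching tasks directly reflects forgetting on earlier tasks.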

Remark 1: For task $M$ , the test procedure of DaCoRL is conducted in two steps: 1) Policy selection. It first constructs the feature vector $x$ of $M$ following the procedure in lines 1-2 of Algorithm 1, and the identity $z^{*}$ of $M$ is then obtained by the MAP estimate of the predictive likelihood of the $K_{T}$ contexts with parameters $\{φ_{k}\}_{k = 1}^{K_{T}}$ obtained by training. The policy corresponding to the output head $z^{*}$ of the trained neural network is the final policy selected for $M$ . 2) Policy execution. The agent applies the selected policy on $M$ to evaluate its performance in terms of the cumulative reward in each episode.

V-E Results

To investigate the effectiveness of our proposed method (Q1), we present the results of DaCoRL and all baselines in three types of dynamic environments of the robot navigation tasks. Fig. 5 shows the average episodic returns over all tasks during training according to (19), and Table II reports the numerical results in terms of the average returns over $(1000 / 100) \times 50 = 500$ tests throughout the whole training process according to (20). For DaCoRL, the numbers of instantiated contexts are $K_{T} = 5$ , $K_{T} = 6$ , and $K_{T} = 4$ for the three types of navigation tasks, respectively. In Fig. 5, it is clear that DaCoRL is significantly superior to all baselines in terms of the average test return and the stability. Naive shows the worst forgetting and performance fluctuations since it does not employ any mechanism to overcome context drifts in dynamic environments. CRLUnsup is slightly better than Naive, possibly due to the EWC technique. However, in Fig. 5(b), it still suffers from wild fluctuations in performance, as the changes of environment cannot be detected just by looking at the cumulative reward curve, and learning the new task can overwrite the learned policy without triggering the memory consolidation procedure. CDKD and LLIRL perform clearly better than the above two baselines. Nevertheless, the fixed policy network structure may lead to an over-constrained optimization objective at the early stage of CDKD training, and although LLIRL employs separate neural networks to learn tasks from different context clusters, it does not consider the interference among tasks within the same cluster.

Task	Type I	Type II	Type III
Naive	$- 20.44 \pm 1.07$	$- 43.55 \pm 2.25$	$- 47.28 \pm 3.98$
CRLUnsup	$- 22.43 \pm 0.97$	$- 37.73 \pm 1.82$	$- 40.32 \pm 1.02$
CDKD	$- 14.34 \pm 0.30$	$- 19.86 \pm 0.69$	$- 19.13 \pm 0.95$
LLIRL	$- 14.24 \pm 0.24$	$- 21.08 \pm 1.92$	$- 25.92 \pm 0.41$
DaCoRL	$- 11.93 \pm 0.43$	$- 16.15 \pm 0.30$	$- 19.58 \pm 0.08$
DaCoRL (Oracle)	$- 12.79 \pm 0.26$	$- 17.61 \pm 0.03$	$- 20.46 \pm 1.19$

TABLE II: Numerical results of $\bar{R}_{ave}$ of all methods in the robot navigation tasks (based on the results in Fig. 5). Here and in related tables, the confidence intervals are standard deviations. The best performance is marked in boldface.

Task	Hopper	HalfCheetah	Ant
Naive	$16.18 \pm 0.32$	$- 69.98 \pm 6.44$	$36.39 \pm 5.87$
CRLUnsup	$- 35.87 \pm 9.38$	$- 61.08 \pm 0.55$	$23.37 \pm 4.40$
CDKD	$26.47 \pm 3.73$	$- 46.36 \pm 6.63$	$41.24 \pm 1.55$
LLIRL	$35.36 \pm 0.75$	$- 34.84 \pm 3.10$	$42.76 \pm 4.25$
DaCoRL	$44.07 \pm 0.46$	$- 34.11 \pm 0.62$	$52.18 \pm 1.34$
DaCoRL (Oracle)	$44.57 \pm 0.45$	$- 34.31 \pm 0.58$	$54.09 \pm 1.87$

TABLE III: Numerical results of $\bar{R}_{ave}$ of all methods in the MuJoCo locomotion tasks (based on the results in Fig. 6).

By contrast, DaCoRL achieves superior performance compared with all baselines. Due to the effective context division of environments in the latent space, along with an expandable multihead policy network and the knowledge distillation technique, DaCoRL can achieve competent continual learning of sequential tasks in dynamic environments. In particular, DaCoRL is on a par with [see Fig. 5(c)] or even outperforms [see Figs. 5(a) and 5(b)] DaCoRL (Oracle), where the task-to-context assignments are provided in advance. These results suggest that the incremental context detection module may produce a more rational and valuable context division for continual learning than the supervised version with known task-to-context assignments. Furthermore, the average results in Table II indicate that DaCoRL receives significantly larger or comparable average returns over all test episodes during the entire learning process than all baselines. Meanwhile, the statistical results also indicate that DaCoRL generally features smaller standard deviations in performance than the baselines.

To further demonstrate the scalability and flexibility of our method, we evaluate DaCoRL and all baselines on MuJoCo locomotion tasks. The average episodic returns over all tasks during training according to (19) are shown in Fig. 6 and the corresponding average returns over all tests throughout the whole learning process according to (20) are summarized in Table III. The results show that DaCoRL consistently exhibits better and more stable performance than all baseline methods, even in these complex dynamic environments. More specifically, DaCoRL instantiates four contexts in total ( $K_{T} = 4$ ) for each dynamic environment, which is consistent with the given contexts in DaCoRL (Oracle). Since the learning curves of these two methods largely coincide in all tasks in Fig. 6, this confirms that DaCoRL can accurately identify environmental context changes in a fully self-adaptive manner and can achieve performance comparable to the case where the task-to-context assignments are known in advance.

V-F Analysis

Fig. 7: Average training episodic return over all tasks per iteration of DaCoRL with different output head initialization implementations in the robot navigation tasks. (a) Type I. (b) Type II. (c) Type III.

Fig. 8: The clustering patterns of contexts in DaCoRL with different numbers of instantiated contexts in the Type I navigation environment. (a) One context. (b) Two contexts. (c) Three contexts. (d) Four contexts. (e) Five contexts. (f) Six contexts. (g) Eight contexts.

[Figure: The average returns over all tasks in the dynamic environments.]

1) Influence of the Initialization of Output Heads: To address Q2, we evaluate DaCoRL in the robot navigation tasks with the different initialization strategies proposed in Section IV-B. The average returns over all tasks during the first $100$ policy iterations are shown in Fig. 7. For readability, we abbreviate the three initialization implementations as “Random”, “Random Head”, and “Nearest Head”, respectively.

From Fig. 7, it is clear that the initialization by Nearest Head can enable DaCoRL to attain a better initial policy and more positive forward transfer to the new task, which is consistent with our analysis in Section IV-B. Meanwhile, Random Head exhibits obvious positive forward transfer on Type I and Type II environments and negative forward transfer on the Type III environment, compared with Random initialization. Referring to [37, 38], a possible explanation is that, in the Type III environment, Random Head often chooses an initial policy that hinders the exploration of the new task such that the agent needs to spend more time in the early training stage to counter the old policy, resulting in slow performance improvement.

2) Influence of the Number of Instantiated Contexts ( $K_{T}$ ): Intuitively, instantiating a separate context for each task contained in the dynamic environment (i.e., $K_{T} = T$ , each output head of the policy network corresponds to the policy for an individual task) is likely to result in an optimal policy for the agent. However, it is hard to operate in practice, especially when the dynamic environment comprises a large number of tasks due to the complicated network structure and challenging model training process. Thus, it is necessary to investigate the influence of the number of instantiated contexts on the performance of DaCoRL (Q3).

$K_{T}$	Type I	Type II	Type III
1	$- 22.86 \pm 1.10$	$- 27.34 \pm 3.42$	$- 47.16 \pm 3.69$
2	$- 22.32 \pm 0.34$	$- 26.80 \pm 0.43$	$- 42.08 \pm 1.35$
3	$- 17.09 \pm 0.88$	$- 22.59 \pm 2.10$	$- 28.74 \pm 0.89$
4	$- 12.89 \pm 0.30$	$- 18.00 \pm 0.96$	$- 19.58 \pm 0.08$
5	$- 11.93 \pm 0.43$	$-$	$- 19.80 \pm 0.42$
6	$- 11.89 \pm 0.43$	$- 16.15 \pm 0.30$	$- 19.86 \pm 0.56$
7	$-$	$- 15.79 \pm 0.21$	$-$
8	$- 11.48 \pm 0.19$	$-$	$-$

TABLE IV: Numerical results of $\bar{R}_{ave}$ of DaCoRL with different numbers of instantiated contexts in the robot navigation tasks (based on the results in Fig. 9).

Task	Type I	Type II	Type III	Hopper	HalfCheetah	Ant
Naive	$- 22.17 \pm 1.83$	$- 43.63 \pm 2.42$	$- 50.86 \pm 5.18$	$8.67 \pm 0.47$	$- 75.11 \pm 4.57$	$30.64 \pm 4.39$
CRLUnsup	$- 23.07 \pm 1.23$	$- 39.09 \pm 0.58$	$- 44.78 \pm 0.90$	$- 39.40 \pm 9.42$	$- 63.55 \pm 0.47$	$20.91 \pm 4.56$
CDKD	$- 20.11 \pm 0.60$	$- 31.34 \pm 1.77$	$- 25.96 \pm 0.86$	$24.42 \pm 3.93$	$- 49.66 \pm 7.64$	$34.57 \pm 2.43$
LLIRL	$- 19.08 \pm 0.51$	$- 23.80 \pm 1.71$	$- 33.47 \pm 1.36$	$28.16 \pm 0.77$	$- 37.63 \pm 3.55$	$40.53 \pm 3.70$
DaCoRL	$- 14.92 \pm 0.74$	$- 20.11 \pm 0.78$	$- 25.14 \pm 0.67$	$37.28 \pm 0.46$	$- 36.02 \pm 0.13$	$49.07 \pm 1.38$

TABLE V: Numerical results in terms of the average initial performance over all sequential tasks during training in the robot navigation and MuJoCo locomotion tasks

Task	Type I	Type II	Type III	Hopper	HalfCheetah	Ant
Naive	$- 33.78 \pm 19.51$	$- 39.82 \pm 9.78$	$- 50.19 \pm 8.02$	$39.36 \pm 0.90$	$- 69.84 \pm 12.17$	$55.26 \pm 6.01$
CRLUnsup	$- 11.29 \pm 0.59$	$- 42.52 \pm 9.09$	$- 38.24 \pm 16.41$	$- 43.41 \pm 12.73$	$- 62.92 \pm 1.57$	$34.13 \pm 4.34$
CDKD	$- 11.86 \pm 0.24$	$- 22.46 \pm 2.06$	$- 18.77 \pm 3.93$	$34.25 \pm 9.38$	$- 36.14 \pm 4.05$	$63.26 \pm 1.86$
LLIRL	$- 14.52 \pm 1.28$	$- 39.84 \pm 1.26$	$- 35.94 \pm 0.15$	$48.76 \pm 3.39$	$- 27.67 \pm 3.85$	$59.39 \pm 2.25$
DaCoRL	$- 11.28 \pm 0.22$	$- 17.58 \pm 0.89$	$- 23.38 \pm 4.46$	$53.45 \pm 0.90$	$- 26.36 \pm 1.20$	$67.65 \pm 4.24$

TABLE VI: Numerical results in terms of the average test return over 50 different tasks that have not been seen during training in the robot Navigation and MuJoCo Locomotion tasks

We vary the concentration parameter $\alpha$ of the CRP prior distribution to change the number of contexts instantiated by DaCoRL in the robot navigation tasks. The environmental features are highly correlated with the parameters that vary across tasks. Taking the Type I environment as an example, tasks with adjacent goal positions are intuitively more similar to each other and tend to belong to the same context. Hence, we plot the goal positions in 2-D coordinates to reveal the clustering patterns of contexts in the Type I navigation environment for different numbers of instantiated contexts. The results are shown in Fig. 8, where the stars represent the centroids of the instantiated contexts obtained from the incremental context detection procedure, and the other markers represent the tasks ( $M_{i}$ , $i \in \{1, 2, \dots, T\}$ ) in the dynamic environment. The incremental context detection module captures appropriate context patterns under different parameter settings in a fully online manner, regardless of how many contexts are ultimately instantiated.
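
To illustrate why varying the concentration parameter changes the number of instantiated contexts, the following minimal sketch simulates only the CRP prior, under which a new task joins an existing context $k$ with probability $n_{k}/(n+\alpha)$ and opens a new context with probability $\alpha/(n+\alpha)$, where $n_{k}$ is the number of tasks already assigned to context $k$ and $n$ is the total number of tasks seen so far. It deliberately ignores the Gaussian likelihood on environment features that DaCoRL combines with this prior, so the counts it produces are not those reported in this section. All function names are illustrative assumptions.

```python
import random

def crp_prior(counts, alpha):
    """CRP prior over the existing contexts plus one potential new context."""
    n = sum(counts)
    probs = [c / (n + alpha) for c in counts]
    probs.append(alpha / (n + alpha))  # last entry: probability of a new context
    return probs

def simulate_num_contexts(num_tasks, alpha, seed=0):
    """Assign tasks by sampling from the CRP prior alone; return the context count."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_tasks):
        probs = crp_prior(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):
            counts.append(1)   # instantiate a new context
        else:
            counts[k] += 1     # reuse an existing context
    return len(counts)

def mean_num_contexts(num_tasks, alpha, runs=50):
    """Average context count over several seeds (prior only, no likelihood)."""
    total = sum(simulate_num_contexts(num_tasks, alpha, seed=s) for s in range(runs))
    return total / runs

if __name__ == "__main__":
    for alpha in (0.2, 1.0, 5.0):
        print(alpha, mean_num_contexts(num_tasks=50, alpha=alpha))
```

Under the prior alone, larger values of $\alpha$ make new contexts more likely. In the full method the likelihood term also matters, so the mapping from $\alpha$ to $K_{T}$ need not be this smooth.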

Additionally, we show the performance of DaCoRL with different numbers of contexts in the three types of navigation environments in Fig. 9 and Table IV, computed according to (19) and (20), respectively. Overall, the results are consistent with the intuition that DaCoRL tends to achieve better performance with more contexts instantiated. Conversely, maintaining fewer contexts means that tasks with large differences may be clustered into the same context, leading to serious interference during policy network training. Nevertheless, Fig. 9 and Table IV indicate that the performance gains become relatively minor once the number of contexts exceeds a certain value. Consequently, to limit model complexity and training cost, we recommend choosing a moderate number of instantiated contexts by adjusting the value of $\alpha$ . For instance, in our experiments, $K_{T} = 5$ for Type I, $K_{T} = 6$ for Type II, and $K_{T} = 4$ for Type III are sufficient to obtain reasonable performance in these robot navigation tasks. We also found empirically that the number of instantiated contexts is insensitive to the concentration parameter $\alpha$ : the same context detection results are obtained for all values of $\alpha$ within a continuous interval. For instance, in our experiments, all values of $\alpha$ in $[0.64, 0.93]$ for Type I, in $[0.85, 0.89]$ for Type II, and in $[0.19, 0.81]$ for Type III yield the numbers of contexts given above.
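
As a worked illustration of the diminishing returns visible in Table IV, the sketch below takes the mean values of the Type I and Type II columns and reports the smallest tested $K_{T}$ after which the next tested setting improves the average return by less than a tolerance. The tolerance of 0.5 and the helper name are assumptions made for this example. This is not the selection rule used in the experiments, although on these two columns it happens to return the values recommended above.

```python
# Mean values copied from Table IV; settings that were not tested are omitted.
TABLE_IV = {
    "Type I": {1: -22.86, 2: -22.32, 3: -17.09, 4: -12.89,
               5: -11.93, 6: -11.89, 8: -11.48},
    "Type II": {1: -27.34, 2: -26.80, 3: -22.59, 4: -18.00,
                6: -16.15, 7: -15.79},
}

def moderate_num_contexts(results, tol=0.5):
    """Smallest tested K whose next tested setting gains less than `tol`.

    `tol` is a hypothetical threshold chosen for illustration only.
    """
    ks = sorted(results)
    for k, k_next in zip(ks, ks[1:]):
        gain = results[k_next] - results[k]
        if gain < tol:
            return k
    return ks[-1]  # every tested step still gave a gain of at least `tol`

if __name__ == "__main__":
    for env, results in TABLE_IV.items():
        print(env, moderate_num_contexts(results))
```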

3) Forward Transfer: In DaCoRL, tasks with similar dynamics are grouped into the same context and share the same policy throughout the CRL process. In each time period, DaCoRL retrieves the policy corresponding to the most similar context as the initial policy for the current learning process, which enables the agent to achieve better startup performance on each new task. To validate this feature (Q4), we calculate the average training episodic return of each algorithm at the first policy iteration on each task; the results are shown in Table V. Compared with all baselines, DaCoRL achieves the best initial performance in every experimental dynamic environment, demonstrating significant positive forward transfer.
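
The metric reported in Table V can be sketched as follows, assuming that each task's training log is stored as a list of policy iterations, each holding the episodic returns collected in that iteration. This data layout, the function name, and the toy numbers are assumptions for illustration rather than the code or data used in the experiments.

```python
from statistics import mean

def average_initial_performance(training_logs):
    """Average, over tasks, of the mean episodic return at the first policy iteration.

    training_logs[t][i] is the list of episodic returns collected on task t
    during policy iteration i (hypothetical layout).
    """
    first_iteration_means = [mean(task_log[0]) for task_log in training_logs]
    return mean(first_iteration_means)

# Toy logs for two tasks with two policy iterations each.
logs = [
    [[-30.0, -20.0], [-10.0, -8.0]],  # task 1 starts at -25 on average
    [[-16.0, -14.0], [-9.0, -7.0]],   # task 2 starts at -15 on average
]
print(average_initial_performance(logs))  # -20.0
```

Only the first iteration of each task enters the average, so the metric isolates how good the retrieved initial policy is before any further training on the new task.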

4) Generalization Results: To investigate the performance of DaCoRL on unseen tasks (Q5), we randomly generate $50$ different tasks that have not been seen during training in each dynamic environment, and test the policies learned by DaCoRL and all baselines on these tasks, with each policy tested $100$ times on each task. The average test returns over all tasks are shown in Table VI. DaCoRL generalizes better than all baselines in every environment except the Type III robot navigation tasks, where its performance is slightly inferior to that of CDKD but still significantly better than that of the other baselines. This is explainable, since CDKD is the algorithm most similar to DaCoRL and is likely to perform comparably, especially when the number of contexts is known in advance.
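
The evaluation protocol described here (50 unseen tasks, 100 test runs per task, averaged over all tasks) can be sketched as below. The `run_episode` callable, which should roll out a fixed policy on one task and return its episodic return, is a hypothetical stand-in. The toy version and the goal range used here are invented and only exercise the averaging logic.

```python
import random
from statistics import mean

def average_test_return(run_episode, tasks, runs_per_task=100):
    """Mean over tasks of the mean return over repeated test runs on each task."""
    per_task = [
        mean([run_episode(task) for _ in range(runs_per_task)])
        for task in tasks
    ]
    return mean(per_task)

# Toy setup: 50 randomly generated 2-D goal positions standing in for unseen tasks.
rng = random.Random(0)
unseen_tasks = [(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)) for _ in range(50)]

def toy_run_episode(goal):
    # Toy stand-in for a rollout: negative distance from the origin to the goal.
    return -((goal[0] ** 2 + goal[1] ** 2) ** 0.5)

score = average_test_return(toy_run_episode, unseen_tasks)
print(score)
```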

VI Conclusion and Future Work

In this article, we present a continual reinforcement learning framework named DaCoRL as a viable solution to the challenge of dynamics adaptability in continuous spaces. The goal of DaCoRL is to continuously adapt the RL agent’s behavior to the changing environment, which can be regarded as a sequence of tasks, while minimizing the catastrophic forgetting of previously learned tasks. To this end, DaCoRL employs an incremental context detection module to categorize the stream of tasks into a set of distinct contexts using online Bayesian infinite Gaussian mixture clustering. It simultaneously optimizes a context-conditioned policy with an expandable multihead neural network, in conjunction with a knowledge distillation regularization term, to avoid interference both between and within contexts. A key advantage of our method is that it achieves incremental context instantiation in a fully online manner without requiring any external information to explicitly signal environmental changes. Moreover, it relies on only a single policy network to accomplish effective continual learning in dynamic environments. Experiments on several continuous control tasks confirm that DaCoRL can significantly outperform state-of-the-art algorithms in terms of stability, overall performance, and generalization ability.

While REINFORCE and PPO are used as the underlying RL algorithms in the current experimental studies, our proposed framework is versatile and can be easily coupled with any other policy-based RL algorithm. A promising direction for future work is to develop an efficient strategy that facilitates positive backward transfer during the CRL process, so that current learning in turn further improves the performance on previously learned tasks. Another direction is to address more challenging and realistic cases, such as learning in intensively changing environments where environment changes may happen between consecutive episodes [38].

References

[1] D. J. Aldous (1985) Exchangeability and related topics. In École d’Été de Probabilités de Saint-Flour XIII—1983, pp. 1–198.
[2] M. G. Bellemare et al. (2020) Autonomous navigation of stratospheric balloons using reinforcement learning. Nature 588 (7836), pp. 77–82.
[3] L. Bottou et al. (1998) Online learning and stochastic approximations. Online Learning in Neural Networks 17 (9), pp. 142.
[4] R. Caruana (1997) Multitask learning. Machine Learning 28 (1), pp. 41–75.
[5] H. L. Chiang, A. Faust, M. Fiser, and A. Francis (2019) Learning navigation behaviors end-to-end with autorl. IEEE Robotics and Automation Letters 4 (2), pp. 2007–2014.
[6] M. Crawshaw (2020) Multi-task learning with deep neural networks: a survey. arXiv preprint arXiv:2009.09796.
[7] B. C. Da Silva, E. W. Basso, A. L. Bazzan, and P. M. Engel (2006) Dealing with non-stationary environments using context detection. In Proceedings of the International Conference on Machine Learning, pp. 217–224.
[8] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016) Benchmarking deep reinforcement learning for continuous control. In Proceedings of the International Conference on Machine Learning, pp. 1329–1338.
[9] A. Faust et al. (2018) PRM-RL: long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In Proceedings of the International Conference on Robotics and Automation, pp. 5113–5120.
[10] C. Fernando et al. (2017) Pathnet: evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734.
[11] A. Francis et al. (2020) Long-range indoor navigation with PRM-RL. IEEE Transactions on Robotics 36 (4), pp. 1115–1134.
[12] R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3 (4), pp. 128–135.
[13] I. Goodfellow et al. (2014) Generative adversarial nets. Proceedings of the Conference on Neural Information Processing Systems 27.
[14] J. Gou, B. Yu, S. J. Maybank, and D. Tao (2021) Knowledge distillation: a survey. International Journal of Computer Vision 129 (6), pp. 1789–1819.
[15] R. Hadsell, D. Rao, A. A. Rusu, and R. Pascanu (2020) Embracing change: continual learning in deep neural networks. Trends in Cognitive Sciences.
[16] M. Hersche, G. Karunaratne, G. Cherubini, L. Benini, A. Sebastian, and A. Rahimi (2022) Constrained few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9057–9067.
[17] M. A. K. Jaradat, M. Al-Rousan, and L. Quadan (2011) Reinforcement based mobile robot navigation in dynamic environment. Robotics and Computer-Integrated Manufacturing 27 (1), pp. 135–149.
[18] S. Kessler, J. Parker-Holder, P. Ball, S. Zohren, and S. J. Roberts (2020) UNCLEAR: a straightforward method for continual reinforcement learning. In Proceedings of the International Conference on Machine Learning.
[19] K. Khetarpal, M. Riemer, I. Rish, and D. Precup (2020) Towards continual reinforcement learning: a review and perspectives. arXiv preprint arXiv:2012.13490.
[20] J. Kirkpatrick et al. (2017) Overcoming catastrophic forgetting in neural networks. In Proceedings of the National Academy of Sciences, Vol. 114, pp. 3521–3526.
[21] H. Li, Q. Zhang, and D. Zhao (2019) Deep reinforcement learning-based automatic exploration for navigation in unknown environment. IEEE Transactions on Neural Networks and Learning Systems 31 (6), pp. 2064–2076.
[22] V. Lomonaco, K. Desai, E. Culurciello, and D. Maltoni (2020) Continual reinforcement learning in 3d non-stationary environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 248–249.
[23] M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165.
[24] V. Mnih et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
[25] S. Padakandla, K. Prabuchandran, and S. Bhatnagar (2020) Reinforcement learning algorithm for non-stationary environments. Applied Intelligence 50 (11), pp. 3590–3606.
[26] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010.
[27] A. Rosenfeld and J. K. Tsotsos (2018) Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (3), pp. 651–663.
[28] G. A. Rummery and M. Niranjan (1994) On-line q-learning using connectionist systems. Technical Report.
[29] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, pp. 1889–1897.
[30] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[31] Y. Seo, L. Chen, J. Shin, H. Lee, P. Abbeel, and K. Lee (2021) State entropy maximization with random encoders for efficient exploration. In Proceedings of the International Conference on Machine Learning, pp. 9443–9454.
[32] S. Shalev-Shwartz et al. (2012) Online learning and online convex optimization. Foundations and Trends® in Machine Learning 4 (2), pp. 107–194.
[33] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT Press.
[34] O. Vinyals et al. (2019) Grandmaster level in starcraft II using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354.
[35] L. Wang and K. Yoon (2021) Knowledge distillation and student-teacher learning for visual intelligence: a review and new outlooks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[36] Z. Wang, C. Chen, and D. Dong (2021) Lifelong incremental reinforcement learning with online bayesian inference. IEEE Transactions on Neural Networks and Learning Systems.
[37] Z. Wang, C. Chen, H. Li, D. Dong, and T. Tarn (2019) Incremental reinforcement learning with prioritized sweeping for dynamic environments. IEEE/ASME Transactions on Mechatronics 24 (2), pp. 621–632.
[38] Z. Wang, H. Li, and C. Chen (2019) Incremental reinforcement learning in continuous spaces via policy relaxation and importance weighting. IEEE Transactions on Neural Networks and Learning Systems 31 (6), pp. 1870–1883.
[39] C. J. C. H. Watkins and P. Dayan (1992) Q-learning. Machine Learning 8 (3-4), pp. 279–292.
[40] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3), pp. 229–256.
[41] T. Yu et al. (2020) Meta-World: a benchmark and evaluation for multi-task and meta reinforcement learning. In Proceedings of the Conference on Robot Learning, pp. 1094–1100.
[42] T. Zhang, X. Wang, B. Liang, and B. Yuan (2022) Catastrophic interference in reinforcement learning: a solution based on context division and knowledge distillation. IEEE Transactions on Neural Networks and Learning Systems.
[43] Y. Zhang and Q. Yang (2021) A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering.
[44] Y. Zheng, Z. Meng, J. Hao, Z. Zhang, T. Yang, and C. Fan (2018) A deep bayesian policy reuse approach against non-stationary agents. In Proceedings of the Conference on Neural Information Processing Systems, Vol. 31.