Semi-Centralised Multi-Agent Reinforcement Learning with
Policy-Embedded Training

Taher Jafferjee¹

^{, *}

, Juliusz Ziomek¹

^{, *}

, Tianpei Yang², Zipeng Dai¹,
Jianhong Wang³, Matthew Taylor², Kun Shao¹, Jun Wang⁴, David Mguni

†\textsuperscript1

Abstract

Centralised training^†^†* Equal Contribution. ^†^† $^{†}$ Corresponding author: davidmguni@hotmail.com. (CT) is the basis for many popular multi-agent reinforcement learning (MARL) methods because it allows agents to quickly learn high-performing policies. However, CT relies on agents learning from one-off observations of other agents’ actions at a given state. Because MARL agents explore and update their policies during training, these observations often provide poor predictions about other agents’ behaviour and the expected return for a given action. CT methods therefore suffer from high variance and error-prone estimates, harming learning. CT methods also suffer from explosive growth in complexity due to the reliance on global observations, unless strong factorisation restrictions are imposed (e.g., monotonic reward functions for QMIX). We address these challenges with a new semi-centralised MARL framework that performs policy-embedded training and decentralised execution. Our method, policy embedded reinforcement learning algorithm (PERLA), is an enhancement tool for Actor-Critic MARL algorithms that leverages a novel parameter sharing protocol and policy embedding method to maintain estimates that account for other agents’ behaviour. Our theory proves PERLA dramatically reduces the variance in value estimates. Unlike various CT methods, PERLA, which seamlessly adopts MARL algorithms, scales easily with the number of agents without the need for restrictive factorisation assumptions. We demonstrate PERLA’s superior empirical performance and efficient scaling in benchmark environments including StarCraft Micromanagement II and Multi-agent Mujoco.

\doparttoc\faketableofcontents\affiliations

¹Huawei Noah’s Ark Lab ²University of Alberta
³Imperial College, London ⁴University College, London

1 Introduction

Multi-agent reinforcement learning (MARL) has emerged as a powerful tool to enable autonomous agents to tackle difficult tasks such as ride-sharing zhou2020smarts and swarm robotics mguni2018decentralised. Recently, various methodologies have produced significant performance boosts for MARL algorithms mguni2021ligs; kuba2021settling. Nevertheless, an important impediment for MARL is the high variance of the critic and policy gradient estimators. Reducing this variance is a critical challenge since high variance estimates can significantly hinder training, leading to low sample efficiency and poor overall performance gu2016q. This variance has multiple causes. First, MARL methods are applied to unknown environments whose reward signals are often noisy, especially as the sizes of the state and action spaces increases kuba2021settling. Second, unlike in single agent reinforcement learning (RL), MARL agents are faced with the challenge of distinguishing the aleatoric uncertainty due to environmental stochasticity from randomness due to agents’ exploratory actions. These challenges can deeply undermine the performance of MARL methods, especially in centralised learning (CL) methods where agents rely on observations of others’ actions while training.

CL serves as the foundation for many popular MARL methods such as MAPPO yu2021surprising, Q-DPP yang2020multi QMIX rashid2018qmix, SPOT-AC mguni2021learning, and COMA foerster2018counterfactual. In CL, each agent has a (possibly shared) critic that makes use of all available information generated by the system, including the global state and the joint action peng2017multiagent. Accounting for actions of other agents is crucial to achieve good coordination. Despite the obvious advantages from using knowledge of the joint actions in the critic, this methodology has several weaknesses that exacerbate the MARL variance problem. The Q-functions are based on one-off observations of actions sampled from other agents’ policies at a given state. This can produce VF updates based on improbable events and hence, inaccurate estimates of expected returns, leading to higher variance value function (VF) estimates. Also, since each agent’s critic has an explicit dependence on the actions of others, the CL critic suffers from an explosive growth in complexity with the number of agents yang2020overview. This results in CL methods needing large numbers of samples to complete training. Current methods for alleviating this rely on VF factorisations that require restrictive representational constraints. These constraints can cause poor exploration and performance failures when violated mahajan2019maven (e.g., QMIX rashid2018qmix requires a monotonicity constraint that can produce suboptimal value approximation).

Because of these issues affecting CL, (decentralised) independent learning (IL) represents an attractive alternative de2020independent. IL decomposes a MARL problem with $N$ agents into $N$ decentralised single-agent problems while the agents train solely with local observations. This avoids the explosive growth in complexity suffered by CL methods. Nonetheless, since MARL environments consist of multiple learners, IL suffer from convergence issues yang2020overview. Additionally, since IL methods cannot explicitly factor in the behaviour of other agents, achieving coordination among IL agents can be a difficult to achieve, resulting in suboptimal performance in some tasks.

To tackle these challenges, in this paper we propose a new MARL training formalism which we call semi-centralised training. In this framework, the agents use knowledge of other agents’ policies to compute accurate VFs estimates while using a critic (and policy) that has only the agent’s own action and state as an input. Specifically, we introduce policy embedding, a technique that enables the agents’ VF estimates to capture the present and future behaviour of other agents. As with parameter sharing, a technique used in many MARL methods (e.g., MAPPO yu2021surprising, QMIX rashid2018qmix), this procedure requires the agents to communicate their individual policies during training.

Our framework, policy embedding reinforcement learning algorithm (PERLA), embeds the agents’ policies within the critic estimation. This is done by sampling the actions of other agents and using them to marginalise the influence of other agents from the critic. The method is scalable and the resulting critic retains a functional form that requires only the agent’s own action (and state) as input to the action-value function. Consequently, the resulting critic exhibits the efficient scaling benefits of a fully decentralised critic while preserving the convergence guarantees and coordination abilities of a centralised critic.

Many leading MARL methods such as MAPPO yu2021surprising, IPPO de2020independent, MADDPG lowe2017multi use the actor-critic formalism. For concreteness, we instantiate our methodology within actor-critic architectures which involve both critic and policy updates which is a natural candidate to instantiate our framework. We propose Algorithm 1, which is a PERLA version of MAPPO/IPPO and show it improves the performance of the underlying algorithm. Applications of PERLA could naturally be extended to other degenerate architectures (e.g. value-based methods).

We summarise the advantages of PERLA below.
1) PERLA enables (networked) MARL agents to scale efficiently by factoring the joint policy into a new functional representation of the critic. Crucially, each agent’s critic does not use a joint action input. This enables efficient scaling with the number of agents without restrictive VF constraints (validated empirically in Sec. 5.2).
2) Elimination of non-stationarity of IL by incorporating updated policies into the agents’ critics through the policy embedding procedure. This avoids the non-stationarity problem in MARL when critics use localised observations.
3) Plug & play enhancement because PERLA can seamlessly incorporate different Actor-Critic algorithms, PERLA significantly boosts performance over the base MARL learners PPO schulman2017proximal, and MAPPO yu2021surprising (validated empirically Sec. 5.1).
4) PERLA is theoretically sound and we prove that
i) PERLA induces a vast reduction of variance of VF estimates (Theorem 1), ii) PERLA preserves policy gradient estimators (Theorem 3), and iii) Actor-Critic style algorithm based on PERLA converges almost surely to a locally optimal policy profile (Theorem 6).

2 Related Work

Centralised Learning CL is assured to generate policies that are consistent with the desired system goal whenever the IGM principle son2019qtran is satisfied.¹¹1The IGM principle imposes an equivalence between the joint greedy action and the collection of individual greedy actions. In order to realise the IGM principle in the CL framework, QMIX and VDN propose two sufficient conditions of IGM to factorise the joint action-value function. Such decompositions are limited by the joint action-value function class they can represent and can perform badly is systems that do not adhere to these conditions wang2020qplex. QPLEX wang2020qplex uses a dueling network architecture to factor the joint action-value function. It however has been shown to fail in simple tasks with non-monotonic VFs rashid2020weighted. QTRAN son2019qtran formulates the MARL problem as a constrained optimisation problem but has been shown to scale poorly in complex MARL tasks such as the StarCraft Multi-Agent Challenge (SMAC) peng2020facmac. WQMIX rashid2020weighted considers a weighted projection towards better performing joint actions but does not guarantee IGM consistency. Actor-critic methods such as COMA foerster2018counterfactual and MADDPG lowe2017multi are popular methods within MARL. Although these methods use CL without restrictive assumptions, they are significantly outperformed by value-based methods such as QMIX and MAPPO yu2021surprising on standard MARL benchmarks like SMAC peng2020facmac. Consequently, achieving full expressiveness of the IGM function class with scalability remains an open challenge for MARL.

Parameter Sharing (PS) gupta2017cooperative aims to speed up learning by sharing the policy parameters of all the agents during training. In this setup, all agents share the same policy network forcing the agents to share the same policy which is only appropriate in systems with rich symmetries between agents. Leading MARL methods such as MAPPO use PS with proven performance benefits in benchmark domains. Our goal differs from PS since the benefit of PERLA is to enable the agents to embed knowledge of the other agents’ behaviours into the agent’s own critic estimation and policy. As in PS, PERLA requires communication between agents during training, but in PERLA each agent can maintain its own set of parameters, which is more suitable for environments where agents significantly differ from each other.

Opponent Modelling allows each agent to infer the other agents’ policies from their observed behavior. This is widely used in MARL with imperfect information. LOLA FoersterCAWAM18 and its recent variant of COLA willi2022cola seeks to model the impact of an agent’s policy on the opponent’s future policy updates. Through behavioral cloning, the agent infers the opponent’s parameters from the corresponding state-action trajectories. Based on unsupervised representation learning, several methods predict the other agents’ actions by using each agents’ own policy architecture RaileanuDSF18 or auxiliary policy features GroverAGBE18; HongSSCL18. MBOM yu2021model uses the (provided) environment model to simulate multiple best response policies as opponents and then mixes them according to the similarity with the real behaviors of opponents. However, MBOM were examined only in two-player games. In contrast with this set of methods that introduce prediction models to (over)fit an opponent’s trajectories, in PERLA agents communicate their policy parameters (and local observations) to each other during training. With this and using our novel policy embedding procedure allows the agents to immediately form best responses to the joint policies of other agents.

3 PERLA framework

We formulate the MARL problem as a Markov game (MG) deng2021complexity, which is represented by a tuple $G = ⟨ N, S, {(A_{i})}_{i \in N}, P, R_{i}, γ ⟩$ where $S$ is the finite set of states, $A_{i}$ is an action set for agent $i \in N$ , and $R_{i} : S \times A \to P (D)$ is the reward function that agent $i$ seeks to maximise (where $D$ is a compact subset of $R$ ), and $P : S \times A \times S \to [0, 1]$ is the probability function describing the system dynamics where $A := \times_{i = 1}^{N} A_{i}$ . We consider a partially observable setting, in which given the system state $s^{t} \in S$ , each agent $i \in N$ makes local observations $τ_{i}^{t} = O (s^{t}, i)$ where $O : S \times N \to Z_{i}$ is the observation function and $Z_{i}$ is the set of local observations for agent $i$ . To decide its actions, each agent $i \in N$ samples its actions from a Markov policy $π_{i, θ_{i}} : Z_{i} \times A_{i} \to [0, 1]$ , which is parameterised by the vector $θ_{i} \in R^{d}$ . Throughout the paper, $π_{i, θ_{i}}$ is abbreviated as $π_{i}$ . At each time $t \in 0, 1, \dots,$ the system is in state $s^{t} \in S$ and each agent $i \in N$ takes an action $a_{i}^{t} \in A_{i}$ , which together with the actions of other agents $a_{- i}^{t} := (a_{1}^{t}, \dots, a_{i - 1}^{t}, a_{i + 1}^{t}, \dots, a_{N}^{t})$ , produces an immediate reward $r_{i} \sim R (s^{t}, a_{i}^{t})$ for agent $i \in N$ . The system then transitions to a next state $s^{t + 1} \in S$ with probability $P (s^{t + 1} | s^{t}, a^{t})$ where $a^{t} = (a_{1}^{t}, \dots, a_{N}^{t}) \in A$ is the joint action which is sampled from the joint policy $π := \prod_{i = 1}^{N} π_{i}$ . The goal of each agent $i$ is to maximise its expected returns measured by its VF $v_{i} (s) = E [\sum_{t = 0}^{\infty} γ^{t} R_{i} (s^{t}, a^{t}) | s_{0} = s]$ and the action-value function for each agent $i \in N$ is given by $Q_{i} (s, a) = E [\sum_{t = 0}^{\infty} R_{i} (s^{t}, a^{t}) | a^{0} = a]$ , where $- i$ denotes the tuple of agents excluding agent $i$ . Likewise, we denote $\prod_{j = 1, j \neq i}^{N} π_{j}$ as $π_{- i}$ . In the fully cooperative case all agents share the same goal: $R_{1} = \dots R_{N} := R$ .

In the CL paradigm, given a state $s \in S$ and the joint action $a \in A$ , each agent $i \in N$ computes its action-value function $Q_{i} (s, a)$ . The action-value function provides an estimate of the agent’s expected return using its policy given the behaviour of all other agents $N / {i}$ for a given action $a_{i} \in A_{i}$ . Therefore, $Q_{i} (s, a)$ seeks to provide an estimate of the agent’s own action, accounting for the actions of others. MARL agents use stochastic policies to explore — each joint action $a \sim π$ may contain exploratory actions that included in agent policy updates.

The core component of the PERLA framework is a Critic Policy Embedding procedure. As the name suggests, the procedure embeds the agents’ policies within the critic, enabling us to construct a critic whose input takes only the agent’s own action and the state while factoring in the joint behaviour of other agents. This is done by jointly marginalising the action-value function $~ Q$ , as defined below:

{~ Q}_{i} (s, a_{i}) := E_{π_{- i}} [Q_{i} (s, a)]; a \equiv (a_{i}, a_{- i}) \in A,

(1)

where $s \in S, a_{i} \sim π_{i} (\cdot | τ_{i}), a_{- i} \sim π_{- i} (\cdot | τ_{- i})$ . This object requires some explanation: as with $Q_{i}$ , the function ${~ Q}_{i}$ seeks to estimate the expected return following agent $i$ taking action $a_{i}$ . However, unlike $Q_{i}$ , ${~ Q}_{i}$ builds in the distribution of the actions played by other agents under their current policy. This gives less weight to low probability actions and conversely for high probability actions. A key aspect is that $^Q$ depends only on the agent’s own action (and state) which acquires the scalability benefit of IL while factoring in the behaviour of other agents. As in practice, it may be impossible to analytically calculate 1, to approximate ${~ Q}_{i} (s, a_{i})$ , for any $s \in S$ and any $a_{i} \in A_{i}$ we construct the object $^Q$ :

{^Q}_{i} (s, a_{i}) = \frac{1}{k} k \sum j = 1 Q_{i} (s, a_{i}, a_{i}^{(j)}); a_{- i}^{(j)} \sim π (a_{- i} | τ_{- i}) .

(2)

We now describe the implementation details of Actor-Critic algorithm using the critic policy embedding procedure, along with pseudocode in Algorithm 1:
1. Augment the critic $Q$ to accommodate joint action, i.e., $Q (s^{t}, a_{i}^{t}, a_{- i}^{t (j)})$ .
2. Each agent $i$ communicates its policy $π_{i}$ and its local observation $τ_{i}^{t}$ to other agents.
3. Upon arrival to state $s^{t}$ each agent $i$ samples other agents’ actions given their local observations, $k$ times using other agents’ policies $a_{j} \sim π_{j} (a_{j} | τ_{j}^{t})$ . This produces joint samples of the actions ${a_{- i}^{t (j)}}_{j = 1}^{k}$ for each time step $t$ .
4. Each agent uses this to approximate the function
${~ Q}_{i} (s^{t}, a_{i}^{t}) = \frac{1}{k} \sum_{j = 1}^{k} Q (s^{t}, a_{i}^{t}, a_{- i}^{t (j)}) ≊ {^Q}_{i} (s_{t}, a_{i}^{t})$ , embedding the behaviour of other agents.
5. Agent $i$ uses ${~ Q}_{i}$ as the critic to update its policy using the chosen policy gradient algorithm.
6. Return to 2 and repeat until termination.

In our experiments, we use a value-function style critic, where $Q (s^{t}, a_{i}^{t}, a_{- i}^{t (j)}) = r_{t} + γ V (s^{t}, a_{- i}^{t (j)})$ and $V (s^{t}, a_{- i}^{t (j)})$ is approximated via a deep neural network. In this case, it is enough to marginalise the next step values function. To perform policy updates, we use PPO updates schulman2017proximal. We apply PERLA to two cases, one which involves a base learner whose critic input captures global information (MAPPO yu2021surprising) and another base learner whose critic input is only local observations (IPPO) (in both cases, PERLA makes use of shared global state information during training). This gives rise to PERLA MAPPO and PERLA IPPO algorithms respectively.

Input: Joint-policy

π

, critic parameters

ϕ

, policy parameters

θ

, environment

E

, number of marginalisation samples

K

Output: Optimised joint-policy

π

Instantiate critic

Q_{ϕ}

that takes as input state

s

, action

a_{i}

and joint-action

a_{- i}

// Equation 2

1 Rollout

π

E

to obtain data

D = (s^{0}, a^{1}, r^{1}, \dots, s^{T - 1}, a^{T}, r^{T})

for $t \leftarrow 0$ to $T$ do

2 for each agent $i$ do

Generate

K

samples of joint-actions

{a_{- i}^{t (j)} \sim π_{- i} (τ^{t})}_{j = 1}^{K}

Compute TD-error

δ_{i} = \frac{1}{K} \sum_{j = 1}^{K} Q_{ρ} (s^{t - 1}, a_{- i}^{t - 1 (j)}, a_{i}) -

- (r + γ \frac{1}{K} \sum_{j = 1}^{K} Q_{ρ} (s^{t}, a_{- i}^{t (j)}, a_{i}^{t})

) over sampled joint-actions for each agent // Equation 2

3 Update critic parameters

ρ

with

δ_{i}

Update

i

th agent’s policy parameters

θ_{i}

with advantages given by

δ_{i}

, using PPO update

Algorithm 1 PERLA MAPPO/IPPO

4 Theoretical Analysis

In this section, we perform a detailed theoretical analysis of PERLA. Here, we derive theoretical results that establish the key benefits of PERLA, namely that it vastly reduces variance of the key estimates used in training. We begin with a result that quantifies the reduction of variance when using ${^Q}_{i}$ instead of $Q_{i}$ . We defer all proofs to the Appendix.

Theorem 1.

The variance of marginalised Q-function ${~ Q}_{i}$ is smaller than that of the non-marginalised Q-function $Q_{i}$ for any $i \in N$ , that is to say: $Var (Q_{i} (s, a)) \geq Var ({~ Q}_{i} (s, a_{i}))$ . Moreover, for the approximation to the marginalised Q-function (c.f. Equation 2) the following relationship holds:

Var ({^Q}_{i} (s, a_{i})) = \frac{1}{k} % Var (Q_{i} (s, a_{- i}, a_{i})) + \frac{k - 1}{k} Var ({~ Q}_{i} (s, a_{i})) .

Therefore for $k = 1$ we get that the approximation has the same variance as the non-marginalised Q-function. However, for any $k > 1$ the approximation to marginalised Q-function has smaller variance that the non-marginalised Q-function. Therefore marginalisation procedure of PERLA can essentially be used as a variance reduction technique. Let us now analyse how this framework can be applied to enhance multi-agent policy gradient algorithm, which is known to suffer from high variance in its original version.

In the policy gradient algorithms, we assume a fully cooperative game which avoids the need to add the agent indices to the state-action and state-value functions since the agents have identical rewards. The goal of each agent is therefore to maximise the expected return from the initial state defined as $J (θ) = E_{s_{0} \sim p (s_{0})} [v (s_{0})]$ , where $p (s_{0})$ is the distribution of initial states and $θ = (θ_{1}^{T}, \dots, θ_{N}^{T})^{T}$ is the concatenated vector consisting of policy parameters for all agents. The following well-known theorem establishes the gradient of $J (θ)$ with respect to the policy parameters.

Theorem 2 (MARL Policy Gradient zhang2018fully).

\nabla_{θ_{i}} J (θ) = E [\infty \sum t = 0 γ^{t} Q (s^{t}, a_{- i}^{t}, a_{i}^{t}) \nabla_{θ_{i}} log π_{i} (a_{i}^{t} ∣ ∣ τ_{i}^{t})] .

Therefore, to calculate the gradient with respect to policy parameters, it is necessary to make use of the state-action-values. In practice, it can estimated by a function approximator, which gives rise to Actor-Critic methods konda1999actor. In MARL, one can either maintain a centralised critic that provides state-action-value for the $i^{t h}$ agent using the knowledge of the other agents’ actions or introduce a decentralised critic that does not take actions of others (in its inputs). This gives rise to the centralised training decentralised execution (CT-DE) $g_{i}^{C}$ and decentralised $g_{i}^{D}$ gradient estimators respectively, as defined below.

	$g_{i}^{C}$	$:= \infty \sum t = 0 γ^{t} Q (s^{t}, a_{- i}^{t}, a_{i}^{t}) \nabla_{θ_{i}} log π_{i} (a_{i}^{t} ∣ ∣ τ_{i}^{t}),$		(3)
	$g_{i}^{D}$	$:= \infty \sum t = 0 γ^{t} ~ Q (s^{t}, a_{i}^{t}) \nabla_{θ_{i}} log π_{i} (a_{i}^{t} ∣ ∣ τ_{i}^{t}),$		(4)

where $Q (s_{t}, a_{- i}^{t}, a_{i}^{t})$ and $~ Q (s^{t}, a_{i}^{t})$ are the CT-DE and decentralised critics respectively. In PERLA Actor-Critic algorithms (such as in Algorithm 1) we utilise a third estimator - the PERLA estimator given below:

g_{i}^{P} := \infty \sum t = 0 γ^{t}^Q (s^{t}, a_{i}^{t}) \nabla_{θ_{i}} log π_{i} (a_{i}^{t} ∣ ∣ τ_{i}^{t}),

where $^Q (s^{t}, a_{i}^{t})$ is the Monte-Carlo approximation of $~ Q (s^{t}, a_{i}^{t})$ . Note that in this approach we maintain a centralised critic $Q (s^{t}, a_{- i}^{t}, a_{i}^{t})$ , therefore the approximation to the marginalised Q-function is obtained using $^Q (s^{t}, a_{i}^{t}) = \frac{1}{k} \sum_{j = 1}^{k} Q (s^{t}, a_{i}^{t}, a_{- i}^{t (j)})$ , where $a_{- i}^{t (j)} \sim π_{- i} (a_{- i}^{t} | τ_{- i}^{t})$ . The PERLA estimator is equal in expectation to the CT-DE estimator as stated by the following Theorem.

Theorem 3.

Given the same (possibly imperfect) critic, the estimators $g_{i}^{C}$ and $g_{i}^{P}$ have the same expectation, that is:

E [g_{i}^{C}] = E [g_{i}^{P}] .

Therefore, whenever the critic provides the true Q-value, PERLA estimator is an unbiased estimate of the policy gradient (which follows from Theorem 2). However, although the CT-DE and PERLA estimators have the same expectations, the PERLA estimator enjoys significantly lower variance. Similar to the analysis in kuba2021settling, let us study the excess variance those two estimators have over the decentralised estimator. Let $B_{i}$ be the upper bound on the gradient norm of $i$ th agent, i.e. $B_{i} = {sup}_{s, a} | | \nabla_{θ^{i}} log π_{i} (a_{i} | s) | |$ and $C$ be the upper bound on the Q-function, i.e. $C = {sup}_{s, a} Q (s, a)$ .

We can now present two theorems showing the effectiveness of PERLA estimator for policy gradients.

Theorem 4.

Given true Q-values, the difference in variances between the decentralised and PERLA estimators admits the following bound:

Var (g_{i}^{P}) - Var (g_{i}^{D}) \leq \frac{1}{k} \frac{B_{i}^{2} C^{2}}{1 - γ^{2}} .

Theorem 5.

Given true Q-values, the difference in variances between the decentralised and CT-DE estimators admits the following bound:

Var (g_{i}^{C}) - Var (g_{i}^{D}) \leq \frac{B_{i}^{2} C^{2}}{1 - γ^{2}} .

Therefore, we can see that with $k = 1$ the bound on excess variance of the PERLA estimator is the same as for the CT-DE estimator, but as $k \to \infty$ , the variance of PERLA estimator matches the one of the fully decentralised estimator. However, this is done while still maintaining a centralised critic, unlike in the fully decentralised case. It turns out that the presence of centralised critic is critical, as it allows us to guarantee the converge of an algorithm to a local optimum with probability 1. The result is stated by the next Theorem.

Theorem 6.

Under the standard assumptions of stochastic approximation theory konda1999actor, an Actor-Critic algorithm using $g_{i}^{P}$ or $g_{i}^{C}$ as a policy gradient estimator, converges to a local optimum with probability 1, i.e.

P (lim k \to \infty ∥ \nabla_{θ_{i}} J (θ^{k}) ∥ = 0) = 1,

where $θ^{k}$ is the value of vector $θ$ obtained after the $k$ th update following the policy gradient.

{proofsketch}

We present a proof sketch here and defer the full proof to Appendix Proof of Theorem 6.
The proof consists of showing that a multi-agent Actor-Critic algorithm using policy gradient estimate $g_{i}^{P}$ or $g_{i}^{C}$ is essentially a special case of single-agent Actor-Critic. Note that because the decentralised critic does not allow us to query the state-action-value for joint action of all agents, a decentralised actor-critic using $g_{i}^{D}$ as policy gradient estimate is not equivalent to the single-agent version and we cannot establish convergence for it. Therefore, our PERLA estimator enjoys both the low variance property of the decentralised estimator and the convergence guarantee of the CT-DE one. Additionally, having a centralised critic yields better performance in practice in environments with strong interactions between agents kok2004sparse.

5 Experiments

We ran a series of experiments in Large-scale Matrix Games (son2019qtran), Level-based Foraging (LBF) (christianos2020shared), Multi-agent Mujoco de2020deep and the StarCraft II Multi-agent Challenge (SMAC) samvelyan2019starcraft²²2The specific maps/variants used of each of these environments in given in Section B of the Appendix) to test if PERLA: 1. Improves overall performance of MARL base learners. 2. Enables efficient scaling in the number of agents. 3. Reduces variance of value function estimates. In all tasks, we compared the performance of PERLA against MAPPO (yu2021surprising) and IPPO de2020independent. Here, we report average training results across multiple maps for LBF and SMAC. Further performance comparisons across these maps are deferred to the Appendix. Lastly, we ran a suite of ablation studies which we deferred to the Appendix.

We used the codebase accompanying the MARL benchmark study in papoudakis2021benchmarking to implement PERLA on top IPPO schulman2017proximal and MAPPO yu2021surprising. IPPO is a decentralised learning algorithm, while MAPPO follows the CT-DE paradigm. Implementing PERLA on both widely used training paradigms enables us to examine the impact of PERLA in each case. Hyperparameters were tuned using simple grid-search, and the values over which we tuned them are presented in Table 3 in the Appendix. All results are statistics of the metric being measured over $3$ random seeds unless otherwise stated. In our plots dark lines represent the mean across the seeds, while shaded areas represent 95% confidence intervals.

5.1 Performance Analysis

We conducted experiments to ascertain if PERLA yields performance improvements to base MARL algorithms to which it is added as claimed.

Large-Scale Matrix Games. To demonstrate PERLA’s ability to handle various reward structure and scale efficiently we first tested its performance in a set of variants of the hard matrix game proposed in (son2019qtran) given in Fig. 1. The game contains multiple stable points and a strongly attracting equilibrium $(a_{1}, b_{1})$ .

To ensure sufficient data collection in the joint action space, we adopted uniform data distribution in batch learning techniques. With this fixed dataset, we can compare the optimality of PERLA and baselines from an optimisation perspective. We use $500$ training iterations and average the results of $[0 - 10]$ random seeds in each method. As shown in Table 1, policy-based methods (MAPPO and IPPO) and leading value-based methods (MAIC, QPLEX wang2020qplex and WQMIX rashid2020weighted) achieve optimal performance in the initial settings ( $2$ agents with $3$ available actions), while other algorithms achieve suboptimal outcomes. When scaling with more agents and larger action space, only PERLA can maintain the optimal performance nearly across all action sizes. This is because PERLA enables agents to maintain estimates that account for other agents’ actions. Other policy-based methods (MAPPO and IPPO) only consider states as inputs in critic network, without considering agents’ actions. Most of value-based baselines (except for QTRAN) are strictly restricted by IGM consistency, which may not be suitable for complex domains.

	$b_{1}$	$b_{2}$	$b_{3}$
$a_{1}$	8	-12	-12
$a_{2}$	-12	0	0
$a_{3}$	-12	0	0

Table 1: Payoff matrix for

2

agents and

3

actions each.

Figure 1: Results of scaling with more agents (bottom left) and larger action space (bottom right) in cooperative matrix game extensions of the matrix game (Table 1).

Figure 2: Average performance over all tested LBF maps of IPPO, PERLA_IPPO (left); MAPPO, PERLA_MAPPO (right). PERLA improves sample efficiency (better performance faster) and performance of the final policy.

Level-based Foraging. Figure 2 shows learning curves averaged across all LBF maps that we ran (the full list of maps is given in the Appendix E). As shown in the plots, PERLA enhances the base algorithm. Compared to standard IPPO, PERLA_IPPO learns significantly faster: the PERLA_IPPO achieves evaluation of return of $0.6$ by about $300 k$ training steps while standard IPPO only achieves the same performance after $1, 000 k$ steps. Cruciually, PERLA_IPPO learns superior policies at the end of training (achieving returns of approximately $0.8$ ) compared to standard IPPO (which achieves around $0.65$ in returns after training) which is around a 20% increase. Similarly, PERLA_MAPPO enhances standard MAPPO but does so to a lesser extent achieving a return of $0.6$ after approximately $300 k$ steps while standard MAPPO requires about $600, 000$ steps to achieve the same performance.

StarCraft II Multi-agent Challenge.

Figure 3: Average performance over *all* SMAC maps for IPPO, PERLA_IPPO (top); MAPPO, PERLA_MAPPO (bottom). PERLA enhances the base method both in terms of sample efficiency (better performance, faster) and final performance.

Figure 3 shows averaged performance of base IPPO and MAPPO and their respective PERLA variants over a wide range of SMAC maps from all difficulty levels. At regular intervals during training, we ran several evaluation episodes and tracked the median win-rate. The learning curves are then generated by computing the mean of the median win rate. SMAC maps are richly diverse and vary in the number of agents to control, density of environment reward, degree of coordination required, and (partial)-observability. Therefore, aggregated and averaged performance over all maps gives us a fairly robust understanding of the effects of PERLA. As with LBF, PERLA enhances the performance of IPPO to a much greater degree than MAPPO. However, there are clear gains with both algorithms. In both cases, the PERLA variants of the algorithms are more sample efficient and converge to better overall policies (within the training budget).

Multi-agent Mujoco.

To study PERLA’s capabilities in complex settings that require both scalability and coordination, we compared its performance with the aforementioned baselines on three tasks in multi-agent Mujoco: Walker 2 $\times$ 3, Hopper 3 $\times$ 1, and Swimmer 2 $\times$ 1. In Fig. 13 (see Appendix), we report our results showing the performance of PERLA against leading MARL baselines. We ran $6$ runs of each algorithm. Fig. 13 we can see PERLA outperforms IPPO and MAPPO on all three tasks. By enabling agents to maintain estimates that account for other agents’ actions, PERLA achieves more accurate value estimation with the variance reduction, therefore establishing more efficient learning.

PERLA eliminates IPPO failure modes.
Despite independent learners (IL) being scalable and quick to train, IL can fail to achieve coordination between agents and occasionally fail to solve the task altogether. In Section 1, we claimed PERLA induces coordination and can minimise failure modes exhibited by IL. To test this claim, we counted the number of maps where an algorithms return at evaluation or test win-rate was below 0.1 at the end of training. Figure 4 shows the frequency of training failures over a range of SMAC and LBF maps (see Appendix B) for IL (IPPO), CL (MAPPO), and PERLA_IPPO. As in shown in Figure 4, Perlaising IPPO eliminates the occurrence of failure modes for IPPO as claimed. Crucially, as we later show, PERLA has highly efficient scalability benefits while eliminating the failure modes of IL.

Figure 4: Learning failure frequencies (across all SMAC and LBF maps). PERLA_IPPO vastly reduces occurrences of learning failure and outperforms the CL method, MAPPO.

PERLA boosts performance in fully observable settings. Perlaisation of IL methods introduces global state information during training (while training with a Q function whose input space is local observations). In some cases, accessing global state (GS) information can improve performance during training. To verify that PERLA delivers vast benefits beyond providing access to global information, we ran PERLA_IPPO against MAPPO and IPPO on a variant of LBF which is fully observable (i.e. each agent’s local observation provides full information about the full system state). Figure 5 shows performance of of both IPPO (GS) and PERLA_IPPO (GS) when the algorithms now have access to global state information. PERLA_IPPO outperforms IPPO indicating the performance benefits of Perlaising IPPO extend beyond providing GS information.

Figure 5: Learning performances of IPPO and PERLA_IPPO with access to the global state (GS). PERLA_IPPO outperforms IPPO by over 15%.

5.2 Scaling Analysis

Figure 6: Performance improvement (%) after training of PERLA_IPPO over IPPO (left) and MAPPO (right) as the number of agents ( $N$ ) increases. For $N = 8$ , PERLA_IPPO yields over $+ 1000 %$ performance gains over both baselines.

Scaling efficiently to large systems (i.e. systems with many agents and large action spaces) is a major challenge for MARL. In Sec. 1, we claimed the PERLA framework enables MARL to scale efficiently. To test this claim, we investigated PERLA’s scaling ability in our large scale matrix games (described above) and LBF. In matrix games, we demonstrated PERLA’s ability to efficiently scale across both dimensions namely games varied by i) the number of agents $N = 2, 3, 4, 5, 6, 10, 15, 20$ and ii) the cardinality of the agents’ action sets $| A_{i} | = 3, 6, 9, 12, 15$ . In each case, we retained the setup that reward agents with a score of $8$ only when all agents choose the first action.

We then tested this claim in LBF where we varied the number of agents in the system. Fig. 6 shows PERLA_IPPO’s percentage performance gains over IPPO and MAPPO. As shown, the performance gains monotonically increase with the number of agents yielding over $1000 %$ improvement in systems with $8$ agents.

5.3 Variance Analysis

In Sec. 4, we proved that PERLA reduces the variance of VF estimators. To show this empirically, we reported the cumulative TD-errors (which serve as proxies for the value estimates tamar2016learning) over the course of training in LBF $20 \times 20$ 10p3f over 5 seeds.

Figure 7: Comparison of cumulative TD errors on the LBF task. Shown with 0.95 confidence intervals over 5 seeds.

6 Conclusion

MARL estimators suffer from high variance due to a number of sources of randomness. This hinders learning and reduces the sample efficiency of MARL methods. Secondly, many leading CT-DE algorithms avoid exponential scaling complexity at the expense of employing representational constraints that impose strong restrictions. Despite the popularity of the CT-DE paradigm, its efficacy is heavily constrained by limits imposed by these challenges. On the other hand, IL often depend on random instances of coordination to solve tasks which can lead to poor sample efficiency and learning failures. Consequently, scalability and efficient learning are key challenges in MARL research.

We introduced PERLA, a plug & play enhancement tool that enables MARL to scale efficiently while inducing sample efficient, coordinated learning among MARL agents. Our theory and empirical analyses show that PERLA reduces the variance of VF estimators which is critical for efficient learning. By introducing a policy embedding protocol which factors in the behaviour of other agents in the critic functional representation, PERLA avoids joint action input spaces to train MARL critics. In this way, PERLA enables MARL to exhibit the scalability benefits of IL while achieving coordinated learning, often required to achieve optimal performance. This unlocks the capacity for coordinated behaviours amog MARL agents whose critic structure resembles that of IL agents. In doing so, PERLA powerfully bridges the gap between MARL and single agent RL.

References

Appendix

Appendix A Ablation Study - Number of Samples Used in Marginalization

We ran an ablation study over the number of samples required to marginalise $Q_{i}$ to obtain $^Q$ . We selected three tasks from Level-based Foraging at random: Foraging-5x5-2p-1f-coop, Foraging-10x10-5p-3f, and Foraging-10x10-3p-5f. We ran PERLA_IPPO on these maps with a range of values for $K$ for three seeds each. Figure 8 shows the area-under-the-curve averaged across seeds and across the three maps versus $K$ . As expected, the general trend is that more samples yields better performance. PERLA_IPPO with $K = 100$ is significantly better than PERLA_IPPO with $K = 5$ . However this trend discontinues beyond $K = 250$ — the algorithm performs slightly worse than $K = 100$ indicating that increasing the number of samples does not improve performance.

Figure 8: Ablation on $K$ , the number of samples used to marginalise $Q_{i}$ . As expected, the overall trend in the plots suggests that more samples are better.

Appendix B Learning Curves on Level-Based Foraging

Figure 9: Learning curves across all tested Level-Based Foraging maps for PERLA_IPPO and baselines. PERLA_IPPO outperforms all baselines.

Figure 10: Learning curves across all tested Level-Based Foraging maps for PERLA_MAPPO. PERLA_MAPPO outperforms all baselines.

Appendix C Learning Curves on Starcraft Multi-agent Challenge

Figure 11: Learning curves across all tested StarCraft Challenge maps for MAPPO and PERLA_MAPPO. In each case, PERLA_MAPPO outperforms MAPPO.

Figure 12: Learning curves across all tested StarCraft Challenge maps for MAPPO and MAPPO with PERLA. In each case, PERLA_MAPPO outperforms MAPPO.

Figure 13: Comparisons of PERLA variants versus base learners on Multi-agent Mujoco tasks.

Appendix D Computing Resources and Hyperparmeters

d.1 Computing Resources

All experiments (including hyperparameter search) were run on machines equipped as desribed in Table 2. We used multiple such machines in parallel to complete all runs.

Component	Description
CPU	Intel Core i9-9900X CPU @ 3.50GHz
GPU	Nvidia RTX 2080
Memory	64 GB DDR4

Table 2: Specifications of machines used to run empirical component of this paper.

d.2 Hyperparameters

We used the implementation of papoudakis2021benchmarking for the baseline algorithms as well as the foundation upon which we implemented the PERLAISED version of IPPO and MAPPO. For the baseline algorithms, we re-used the optimal hyperparameters reported by papoudakis2021benchmarking. To optimise PERLA_IPPO and PERLA_MAPPO, for each environment, we sampled a small number of tasks and used simple grid-search to optimise selected hyperparameters. The combination of hyperparameters that achieved maximum Return or Test Win-Rate were then used in the runs of all remaining tasks. Table 3 shows the hyperparameters optimised and the range of values evaluated.

Hyperparameter	Range
Hidden Dimension	[64, 128, 256]
Learning Rate	[3e-4, 5e-4]
Network Type	[Fully-connected, Gated Recurrent Network]
Reward Standardisation	[False, True]
Entropy Coefficient	[1e-3 1e-2 1e-1]
Target network update	[Hard update every 200 steps, Soft update of 0.01 per step]
Number of steps for calculating Return	[5, 10]
Number of PERLA samples (K)	[10, 100]

Table 3: Hyperparameters and range of values evaluated to optimise PERLAISED variants of IPPO and MAPPO.

Appendix E LBF experiments

e.1 Maps

We used the following maps in Level-based Foraging: Foraging-5x5-2p-1f-v2, Foraging-5x5-2p-1f-coop-v2, Foraging-5x5-2p-2f-v2, Foraging-8x8-2p-2f-coop-v2, Foraging-8x8-3p-3f-v2, Foraging-10x10-5p-1f-coop-v2, Foraging-10x10-3p-5f-v2, Foraging-10x10-5p-3f-v2, Foraging-10x10-8p-1f-coop-v2, Foraging-15x15-8p-1f-coop-v2

e.2 Selected Hyperparameters

We used the hyperparameters given in Table 4 for all runs of LBF.

Hyperparameter	Value
Hidden Dimension	[256]
Learning Rate	[5e-4]
Network Type	[Fully-connected]
Reward Standardisation	[False]
Entropy Coefficient	[1e-3]
Target network update	[Soft update of 0.01 per step]
Number of steps for calculating Return	[10]
Number of PERLA samples (K)	[100]

Table 4: Hyperparameters and optimised values PERLAISED variants of IPPO and MAPPO on LBF.

Appendix F SMAC Experiments

f.1 Maps

We used the following maps in StarCraft Multiagent Challenge.
Easy - 2m_vs_1z, 2s_vs_1sc, so_many_baneling, bane_vs_bane, 1c3s5z, 2s3z, 3m, 8m, MMM
Hard - 2c_vs_64zg, 3s5z, 25m
Very Hard - 3s5z_vs_3s6z, MMM2, corridor

f.2 Selected Hyperparameters

Hyperparameters used for Multi-agent PPO in the SMAC domain.

{adjustbox}

width=0.4center Hyperparameters value actor lr 1e-3 critic lr 5e-4 gamma 0.99 batch size 3200 num mini batch 1 PPO epoch 10 PPO clip param 0.2 entropy coef 0.01 optimiser ADAM opti eps 1e-5 max grad norm 10 actor network mlp hidden layper 1 hidden layer dim 64 activation ReLU gain 0.01 eval episodes 32 use huber loss True rollout threads 32 episode length 100 Number of PERLA samples (K) 100

Appendix G Matrix Game Experiments

g.1 Selected Hyperparameters

Hyperparameters used for Multi-agent PPO in the matrix game domain.

{adjustbox}

width=0.3center Hyperparameters value actor lr 1e-4 gamma 0.99 batch size 64 num mini batch 1 PPO epoch 5 PPO clip param 0.2 entropy coef 0.01 optimiser ADAM opti eps 1e-5 max grad norm 10 actor network mlp hidden layper 1 hidden layer dim 64 activation ReLU gain 0.01 eval episodes 1 use huber loss False rollout threads 4 episode length 1

Appendix H Multi-agent Mujoco Experiments

Hyperparameters used for Multi-agent PPO in the Multi-Agent MuJoCo domain.

{adjustbox}

width=0.6center Hyperparameters Hopper(3x1) Swimmer(2x1) Walker(2x3) actor lr 5e-4 5e-4 5e-4 critic lr 5e-3 5e-3 5e-3 lr decay 1 1 1 $γ$ 0.99 0.99 0.99 batch size 4000 4000 4000 num mini batch 40 40 40 PPO epoch 5 5 5 PPO clip param 0.2 0.2 0.2 entropy coef 0.001 0.001 0.001 optimiser RMSProp RMSProp RMSProp momentum 0.9 0.9 0.9 optim eps 1e-5 1e-5 1e-5 max grad norm 0.5 0.5 0.5 actor network mlp mlp mlp hidden layer 2 2 2 hidden layer dim 128 128 128 activation ReLU ReLU ReLU eval episodes 10 10 10 rollout threads 4 4 4 episode length 1000 1000 1000

Appendix I Proofs of Theoretical Results

Proof of Theorem 1

Let us first present a Lemma, which we use to prove the Theorem 1.

Lemma 1.

Given $N$ random variables $(x_{i})_{i = 1}^{N}$ , where $x_{i} : Ω \to X_{i}$ and a measurable function $f : \times_{i = 1}^{N} X_{i} \to R$ , if we define $~ f (x_{1}, \dots, x_{M}) := E [f (x_{1}, \dots, x_{N}) | x_{1}, \dots, x_{M}]$ for some $M \leq N$ , we have that:

Var (f (x_{1}, \dots, x_{N})) \geq Var (~ f (x_{1}, \dots, x_{M})) .

(5)

Moreover, for a $k$ -sample Monte-Carlo estimator of $~ f$ given by:

^f (x_{1}, \dots, x_{M}) = \frac{1}{k} k \sum i = 1 f (x_{1}, \dots, x_{M}, x_{M + 1}^{(i)}, \dots, x_{N}^{(i)}),

where $x_{j}^{(i)}$ is the $i^{t h}$ sample of $x_{j}$ we have:

Var (^f (x_{1}, \dots, x_{M})) = \frac{1}{k} Var (f (x_{1}, \dots, x_{N})) + \frac{k - 1}{k} Var (~ f (x_{1}, \dots, x_{M})) .

(6)

Proof.

Using the law of total variance we can decompose the variance of $^f$ as follows:

Var (f (x_{1}, \dots, x_{N})) = E [Var (f (x_{1}, \dots, x_{N}) | x_{1}, \dots, x_{M})] + Var (E [f (x_{1}, \dots, x_{N}) | x_{1}, \dots, x_{M}])

We can now observe that the conditional expectation inside the variance in second term is just $~ f$ . We therefore get:

Var (f (x_{1}, \dots, x_{N})) = E [Var (f (x_{1}, \dots, x_{N}) | x_{1}, \dots, x_{M})] + Var (~ f (x_{1}, \dots, x_{M}))

(7)

The first term is an expectation of variance and therefore cannot be negative, this proves the statement in Equation 5. Applying the law of total variance, again, this time for $^f (x_{1}, \dots, x_{M})$ we get:

Var (^f (x_{1}, \dots, x_{M})) = E [Var (^f (x_{1}, \dots, x_{M}) | x_{1}, \dots, x_{M})] + Var (E [^f (x_{1}, \dots, x_{M}) | x_{1}, \dots, x_{M}])

Note that $^f (x_{1}, \dots, x_{M})$ is a random variable dependent on $(x_{M + 1}, \dots, x_{N})$ . Since we estimate $^f$ by Monte-Carlo method with $k$ samples, we get that the expectation must be equal to the true value $~ f$ and variance is equal to the variance of samples divided by $k$ . We therefore get:

Var (^f (x_{1}, \dots, x_{M})) = \frac{1}{k} E [Var (f (x_{1}, \dots, x_{M}) | x_{1}, \dots, x_{M})] + Var (~ f (x_{1}, \dots, x_{N}))

Substituting an expression for $E [Var (f (x_{1}, \dots, x_{M}) | x_{1}, \dots, x_{M})]$ from Equation 7, we get:

Var (^f (x_{1}, \dots, x_{M})) = \frac{1}{k} Var (f (x_{1}, \dots, x_{N})) + \frac{k - 1}{k} Var (~ f (x_{1}, \dots, x_{N}))

which proves the statement in Equation 6. ∎

See 1

Proof.

Taking the Q-function as $f$ , the action $a_{i}$ of the $i$ th agent and the state $s$ as random variables $x_{1}, \dots, x_{M}$ , and actions of other agents $a_{- i}$ as remaining variables $x_{M + 1}, \dots, x_{N}$ we immediately obtain the statement of the Theorem from Lemma 1. ∎

Proof of Theorem 3

See 3

Proof.

Using the tower property of expectation we have:

	$E [g_{i}^{C}]$	$= E [E [\infty \sum t = 0 γ^{t} Q_{i} (s^{t}, a_{i}^{t}) \nabla_{θ_{i}} log π_{i} (a_{i}^{t} ∣ ∣ s^{t}) \| a_{i}^{t}, s^{t}]]$
		$= E [\infty \sum t = 0 γ^{t} E [Q_{i} (s^{t}, a_{i}^{t}) \| a_{i}^{t}, s^{t}] \nabla_{θ_{i}} log π_{i} (a_{i}^{t} ∣ ∣ s^{t})]$
		$= E [\infty \sum t = 0 γ^{t} E [{^Q}_{i} (s^{t}, a_{i}^{t}) \| a_{i}^{t}, s^{t}] \nabla_{θ_{i}} log π_{i} (a_{i}^{t} ∣ ∣ s^{t})]$
		$= E [\infty \sum t = 0 γ^{t} {^Q}_{i} (s^{t}, a_{i}^{t}) \nabla_{θ_{i}} log π_{i} (a_{i}^{t} ∣ ∣ s^{t})] = E [g_{i}^{P}],$

where the third equality is due to Monte-Carlo estimates being unbiased. ∎

Proof of Theorem 4

See 4

Proof.

Using the law of total variance on the variance of $j$ -th component of PERLA estimator we get:

Var (g_{i, j}^{S}) = E [Var (g_{i, j}^{S} | a_{i}^{t}, s^{t})] + Var (E [g_{i, j}^{S} | a_{i}^{t}, s^{t}])

Analogously for the $j$ -th component of the decentralised estimator we get:

Var (g_{i, j}^{D}) = E [Var (g_{i, j}^{D} | a_{i}^{t}, s^{t})] + Var (E [g_{i, j}^{D} | a_{i}^{t}, s^{t}])

Using the fact that Monte-Carlo estimates are unbiased and that critics are perfect, we get:

Var (E [g_{i, j}^{S} | a_{i}^{t}, s^{t}]) = Var (\infty \sum t = 0 γ^{t} \nabla_{θ_{i, j}} log π_{i} (a_{t}^{i} ∣ ∣ s_{t}) E [^Q (s_{t}, a_{i}^{t}) | a_{i}^{t}, s^{t}]) = Var (E [g_{i, j}^{D} | a_{i}^{t}, s^{t}])

We therefore obtain:

	$Var (g_{i, j}^{S}) - Var (g_{i, j}^{D})$
	$= E [Var (g_{i, j}^{S} \| a_{i}^{t}, s^{t})] - E [Var (g_{i, j}^{D} \| a_{i}^{t}, s^{t})]$
	$= E [\infty \sum t = 0 γ^{2 t} (\nabla_{θ_{i, j}} log π_{i} (a_{t}^{i} ∣ ∣ s^{t}))^{2} (Var (^Q (s^{t}, a_{i}^{t}) \| a_{i}^{t}, s^{t}) - Var (~ Q (s^{t}, a_{i}^{t}) \| a_{i}^{t}, s^{t}))] .$

Since ${~ Q}_{i} (s^{t}, a_{i}^{t})$ is a deterministic function given $a_{i}^{t}$ and $s_{t}$ , the expression above simplifies to:

Var (g_{i, j}^{S}) - Var (g_{i, j}^{D}) = E [\infty \sum t = 0 γ^{2 t} (\nabla_{θ_{i, j}} log π_{i} (a_{t}^{i} ∣ ∣ s_{t}))^{2} Var (^Q (s^{t}, a_{i}^{t}) | a_{i}^{t}, s^{t})]

Since ${^Q}_{i}$ is a Monte-Carlo approximation of ${~ Q}_{i}$ with $k$ samples of $Q_{i}$ we get:

Var (g_{i, j}^{S}) - Var (g_{i, j}^{D}) = E [\infty \sum t = 0 γ^{2 t} (\nabla_{θ_{i, j}} log π_{i} (a_{t}^{i} ∣ ∣ s_{t}))^{2} \frac{1}{k} Var (Q (s^{t}, a^{t}) | a_{i}^{t}, s^{t})]

We can now sum over all components of the gradient vector to obtain:

	$Var (g_{i}^{C}) - Var (g_{i}^{D}) = d \sum j = 1 E [\infty \sum t = 0 γ^{2 t} (\nabla_{θ_{i, j}} log π_{i} (a_{t}^{i} ∣ ∣ s_{t}))^{2} \frac{1}{k} Var (Q (s^{t}, a^{t}) \| a_{i}^{t}, s^{t})]$		(8)
	$= E [\infty \sum t = 0 γ^{2 t} ∥ \nabla_{θ_{i, j}} log π_{i} (a_{t}^{i} ∣ ∣ s_{t}) ∥^{2} \frac{1}{k} Var (Q (s^{t}, a^{t}) \| a_{i}^{t}, s^{t})]$		(9)
	$\leq \frac{B_{i}^{2}}{k} \frac{1}{1 - γ^{2}} E [Var (Q (s^{t}, a^{t}) \| a_{i}^{t}, s^{t})] .$		(10)

We can now upper-bound the variance of the Q-function as follows:

\leq E [Q {(s^{t}, a^{t})}^{2} | a_{i}^{t}, s^{t}] \leq C_{i}

(11)

Combining Equations 9 and 11 completes the proof.

∎

Proof of Theorem 5

See 5

Proof.

Similar to the proof of Theorem 4, because we assume perfect critics, we have:

Var (E [g_{i, j}^{C} | a_{i}^{t}, s^{t}]) = Var (\infty \sum t = 0 γ^{t} \nabla_{θ_{i, j}} log π_{i} (a_{t}^{i} ∣ ∣ s_{t}) E [Q (s_{t}, a_{i}^{t}) | a_{i}^{t}, s^{t}]) = Var (E [g_{i, j}^{D} | a_{i}^{t}, s^{t}])

We can therefore follow the proof of Theorem 4, using the law of total variance to obtain that:

Var (g_{i}^{C}) - Var (g_{i}^{D}) \leq E [\infty \sum t = 0 γ^{2 t} ∥ \nabla_{θ_{i, j}} log π_{i} (a_{t}^{i} ∣ ∣ s_{t}) ∥^{2} Var (Q (s^{t}, a^{t}) | a_{i}^{t}, s^{t})]

\leq B_{i}^{2} \frac{1}{1 - γ^{2}} E [Var (Q (s^{t}, a^{t}) | a_{i}^{t}, s^{t})]

We can now bound the variance of Q-function using the inequality in Equation 11, which completes the proof. ∎

Proof of Theorem 6

See 6

Proof.

We prove the convergence by showing that a multi-agent Actor-Critic algorithm based on $g_{i}^{P}$ or $g_{i}^{C}$ can be expressed as a special case of a single agent Actor-Critic algorithm. The convergence of single agent Actor-Critic algorithm has been established by konda1999actor. First, let us define a joint gradient as a vector consisting of concatenated gradient vectors for each agent. We call them $g^{C} = ((g_{1}^{C})^{T}, \dots, (g_{N}^{C})^{T})^{T}$ and $g^{S} = ((g_{1}^{S})^{T}, \dots, (g_{N}^{S})^{T})^{T}$ for the centralised and PERLA estimator respectively (we denote the number of agents by $N$ ). Let us also define the joint policy for all agents as $π (a | s) = \prod_{i = 1}^{N} π_{i} (a_{i} | s)$ . Since only the policy of the $i$ th agent depends on set of parameters $θ_{i}$ , we have that:

\nabla_{θ_{i}} log π (a | s) = \nabla_{θ_{i}} log N \prod i = 1 π_{i} (a_{i} | s) = \nabla_{θ_{i}} N \sum i = 1 log π_{i} (a_{i} | s) = \nabla_{θ_{i}} log π_{i} (a_{i} | s)

We can therefore write:

\nabla_{θ} log π (a | s) = ((\nabla_{θ_{1}} log π_{1} (a_{1} | s))^{T}, \dots, (\nabla_{θ_{N}} log π_{N} (a_{N} | s))^{T})^{T}

This enables us to express the centralised estimator as:

g^{C} = \infty \sum t = 0 γ^{t} Q (s^{t}, a^{t}) \nabla_{θ} log π (a^{t} ∣ ∣ s^{t})

Let us now define a single agent using policy $π (a^{t} | s^{t})$ and taking a multidimensional action $a^{t}$ at each step. In such a case, $g^{C}$ is an unbiased policy gradient estimate for that agent, which proves the convergence for an Actor-Critic algorithm using $g^{C}$ . We have already established in Theorem 3 that $E [g_{i}^{C}] = E [g_{i}^{S}]$ , which means $E [g^{C}] = E [g^{S}]$ . Therefore $g^{S}$ is also an unbiased estimate of this special case of single-agent Actor-Critic, which proves its convergence by the same argument. ∎

	$E [g_{i}^{C}]$	$= E [E [\infty \sum t = 0 γ^{t} Q_{i} (s^{t}, a_{i}^{t}) \nabla_{θ_{i}} log π_{i} (a_{i}^{t} ∣ ∣ s^{t}) \| a_{i}^{t}, s^{t}]]$
		$= E [\infty \sum t = 0 γ^{t} E [Q_{i} (s^{t}, a_{i}^{t}) \| a_{i}^{t}, s^{t}] \nabla_{θ_{i}} log π_{i} (a_{i}^{t} ∣ ∣ s^{t})]$
		$= E [\infty \sum t = 0 γ^{t} E [{^Q}_{i} (s^{t}, a_{i}^{t}) \| a_{i}^{t}, s^{t}] \nabla_{θ_{i}} log π_{i} (a_{i}^{t} ∣ ∣ s^{t})]$
		$= E [\infty \sum t = 0 γ^{t} {^Q}_{i} (s^{t}, a_{i}^{t}) \nabla_{θ_{i}} log π_{i} (a_{i}^{t} ∣ ∣ s^{t})] = E [g_{i}^{P}],$

Semi-Centralised Multi-Agent Reinforcement Learning with Policy-Embedded Training

Abstract

1 Introduction

2 Related Work

3 PERLA framework

4 Theoretical Analysis

Theorem 1.

Theorem 2 (MARL Policy Gradient zhang2018fully).

Theorem 3.

Theorem 4.

Theorem 5.

Theorem 6.

5 Experiments

5.1 Performance Analysis

5.2 Scaling Analysis

5.3 Variance Analysis

6 Conclusion

References

Appendix A Ablation Study - Number of Samples Used in Marginalization

Appendix B Learning Curves on Level-Based Foraging

Appendix C Learning Curves on Starcraft Multi-agent Challenge

Appendix D Computing Resources and Hyperparmeters

d.1 Computing Resources

d.2 Hyperparameters

Appendix E LBF experiments

e.1 Maps

e.2 Selected Hyperparameters

Appendix F SMAC Experiments

f.1 Maps

f.2 Selected Hyperparameters

Appendix G Matrix Game Experiments

g.1 Selected Hyperparameters

Appendix H Multi-agent Mujoco Experiments

Appendix I Proofs of Theoretical Results

Proof of Theorem 1

Lemma 1.

Proof.

Proof.

Proof of Theorem 3

Proof.

Proof of Theorem 4

Proof.

Proof of Theorem 5

Proof.

Proof of Theorem 6

Proof.

Semi-Centralised Multi-Agent Reinforcement Learning with
Policy-Embedded Training