Socially Fair Reinforcement Learning

Debmalya Mandal
Max Planck Institute for Software Systems
dmandal@mpi-sws.org
Jiarui Gan
University of Oxford
jiarui.gan@cs.ox.ac.uk

Abstract

We consider the problem of episodic reinforcement learning where there are multiple stakeholders with different reward functions. Our goal is to output a policy that is socially fair with respect to the different reward functions. Prior works have proposed different objectives that a fair policy must optimize, including minimum welfare and generalized Gini welfare. We first take an axiomatic view of the problem, and propose four axioms that any such fair objective must satisfy. We show that the Nash social welfare is the unique objective that satisfies all four axioms, whereas prior objectives fail to satisfy all four axioms. We then consider the learning version of the problem where the underlying model, i.e., the Markov decision process, is unknown. We consider the problem of minimizing regret with respect to the fair policies maximizing three different fair objectives: minimum welfare, generalized Gini welfare, and Nash social welfare. Based on optimistic planning, we propose a generic learning algorithm and derive its regret bound with respect to the three different policies. For the objective of Nash social welfare, we also derive a lower bound on the regret that grows exponentially with $n$, the number of agents. Finally, we show that for the objective of minimum welfare, one can improve regret by a factor of $O (H)$ for a weaker notion of regret.


1 Introduction

In recent years, reinforcement learning has been immensely successful in various domains, including game playing [Silver+17] and robotics [Andry+20]. These breakthroughs have opened up new avenues for applying reinforcement learning in real-world decision-making systems like healthcare [GJKF+19] and finance [LLXW+21]. However, such human-facing systems often include multiple stakeholders with different preferences, and the classical goal of maximizing rewards is no longer sufficient. A major design challenge is the selection of the right objective, one that provides fairness guarantees across the different agents.

Our first example comes from the use of reinforcement learning for vaccine distribution [MMMC21]. In this context, the vulnerable population is in immediate need, and their utilities should be prioritized. Similarly, multi-armed bandit algorithms have been deployed for improving access to maternal and child healthcare [MMTM+22]. In this domain, researchers have found that maximizing the minimum reward provides better fairness guarantees.

These examples suggest that in the context of multi-agent sequential decision making, we should be careful about selecting the right measure. But what should be the right measure? Besides maximizing the minimum welfare discussed above, researchers have proposed selecting a policy on the Pareto frontier [DCR18], or a policy maximizing generalized Gini social welfare [SWZ20]. In this paper, we focus on the problem of selecting a measure that ensures fairness in multi-agent sequential decision making. For the static setting, this is a well-studied problem in economics [Moulin04], and also in computational social choice [BCEL+16]. Motivated by the success of the axiomatic framework in these fields, we aim to lay out a reasonable set of axioms for selecting a fairness measure in dynamic multi-agent sequential decision-making systems.

Even though a set of axioms might uniquely characterize a fairness measure, the optimal policy according to this fairness measure is unknown. This is because the underlying MDP is unknown, and we must interact with the environment to learn such an optimal policy. Moreover, given the use of different fairness measures in different contexts, can we come up with a generic reinforcement learning algorithm that learns different fair optimal policies? This is the second question we study in this paper, and in particular, we consider the problem of minimizing regret with respect to different fair optimal policies. Note that, this question is challenging since most fairness measures are non-linear in value functions and dynamic programming [Bertsekas12] based learning algorithms cannot be applied.

1.1 Contributions

At a high level, we propose a set of axioms for selecting a fairness measure in multi-agent sequential decision-making systems. Three natural fairness measures are then analyzed against the proposed axioms. When the underlying MDP is unknown, we propose a generic learning algorithm for minimizing regret with respect to different fairness measures. In more detail, we make the following contributions.

We consider episodic reinforcement learning with multiple agents, and propose a set of four axioms that any fairness measure should satisfy. We show that these axioms uniquely characterize Nash social welfare ($NW$), whereas prior measures like minimum welfare ($MW$) or generalized Gini welfare ($GGW$) violate some of them.
When the underlying model is unknown, we propose a generic learning algorithm for minimizing regret with respect to the optimal fair policy according to a fairness measure. For a setting with $n$ agents, $S$ states, $A$ actions, and horizon length $H$, we instantiate this algorithm for different measures, and obtain regret upper bounds of $\tilde{O} (n H^{n + 1} S \sqrt{A T})$ for the objective of Nash social welfare, and $\tilde{O} (H^{2} S \sqrt{A T})$ for both the minimum welfare and the generalized Gini welfare.
For the objective of Nash social welfare, we derive a lower bound on the regret that scales exponentially with $n$, whereas for the other two measures a lower bound follows immediately from the regret lower bound for the single-agent setting. Finally, for the objective of minimum welfare, we show that one can improve regret by a factor of $O (H)$ for a weaker notion of regret.

1.2 Related Work

Our work is related to several lines of research.

Fairness in Multi-Agent Systems. Our work falls under the framework of fairness in multi-agent sequential decision-making [ZS14, JL19]. In a multi-agent MDP model, [ZS14] propose to solve for a regularized maximin policy, where the regularizer offers a trade-off between the utilitarian and max-min fair solutions. [JL19], on the other hand, select a policy on the Pareto frontier. Recently, several papers have proposed to use generalized Gini welfare ($GGW$) [SWZ20, ZGSW21] as a metric for selecting policies in multi-agent MDPs.

Nash Welfare in Bandits. Beyond learning in Markov decision processes, our paper is closely related to [HMS21], who consider the objective of Nash social welfare in multi-armed bandits and achieve $\tilde{O} (\sqrt{T})$ regret. Note that our regret upper bound also becomes $\tilde{O} (\sqrt{T})$ if the horizon length is $H = 1$ and the number of states is $S = 1$. Additionally, there is no need to generalize the classical axioms of [KN79] in a setting with a single state. [BKMS22] also consider regret with respect to the Nash social welfare objective. However, they consider a setting where an agent arrives each round, and the goal is to maximize the Nash welfare (i.e., the geometric mean of the rewards) over the $T$ rounds. Beyond Nash welfare, several papers [BBLB20, PGNN21] have also studied other notions of fairness in multi-armed bandits.

Fairness in Reinforcement Learning. Our work is related to fairness in online learning [JKMR16, JKMN+16, LRDMP17] and reinforcement learning [JJKMR17]. In contrast to our setting, these papers mainly define fairness with respect to the arms and stipulate that arms of lower quality should not be selected over arms of higher quality. Subsequent papers have considered group fairness in online learning [HLW+20, SLMD19, WBT21], with metrics like demographic parity. These notions of fairness are different from ours, and we refer the reader to the recent survey [GSTF+22] for more details.

Episodic Reinforcement Learning. The classic paper [AJO08] introduced the UCRL algorithm and studied regret minimization in the average-reward MDP setting. Our main learning algorithm is based on the optimistic planning approach of the UCRL algorithm. In the context of episodic reinforcement learning, [AOM17] obtained the minimax optimal regret bound. Subsequent papers have considered other versions of episodic RL, including changing reward functions [JJLS+20, RM19] and changing transition functions [JZBJ18, JL20].

Measure	PO	ANON	IIAN	CON	Regret Upper Bound	Regret Lower Bound
Nash Social Welfare	Y	Y	Y	Y	$\tilde{O} (n H^{n + 1} S \sqrt{A T})$	$Ω (n {(\frac{H}{2})}^{n} \sqrt{S A T})$
Minimum Welfare	N	Y	N	Y	$\tilde{O} (H^{2} S \sqrt{A T})$	$Ω (\sqrt{H S A T})$
Generalized Gini Welfare	Y	Y	N	Y	$\tilde{O} (H^{2} S \sqrt{A T})$	$Ω (\sqrt{H S A T})$

Figure 1: Comparison between different fairness measures for reinforcement learning. The measure $NW$ is the unique fairness measure (up to a monotone transformation) satisfying all four axioms. The regret for $NW$ scales exponentially with $n$, whereas for the other measures it is independent of $n$.

2 Preliminaries

We consider the episodic reinforcement learning problem with multiple reward functions. Before describing the general setting, we begin with single-agent episodic reinforcement learning. We are given an MDP $M = (\mathcal{S}, \mathcal{A}, P, r, ρ)$ with $S$ states and $A$ actions, i.e., $| \mathcal{S} | = S$ and $| \mathcal{A} | = A$. The initial state $s_{1}$ is drawn from a distribution $ρ$. We will write $s_{h}$ (resp. $a_{h}$) to denote the state visited (resp. action taken) at time-step $h$. For $h = 1, \dots, H - 1$, the next state is drawn as $s_{h + 1} \sim P (\cdot | s_{h}, a_{h})$. The goal is to maximize the expected sum of rewards over the $H$ steps, i.e., $\mathbb{E} [\sum_{h = 1}^{H} r (s_{h}, a_{h}) \mid s_{1} \sim ρ]$.

In this paper, we consider a setting with $n$ different reward functions $r = \{r_{i}\}_{i \in [n]}$ corresponding to $n$ different agents. We will assume $r_{i} (s, a) \in [0, 1]$ for all $i$ and every state-action pair $(s, a)$. Given a policy $π$, we can define the value function corresponding to the $i$-th reward function as follows.

$V^{π} (ρ; r_{i}) = \mathbb{E}_{π} \left[ \sum_{h = 1}^{H} r_{i} (s_{h}, a_{h}) \mid s_{1} \sim ρ \right]$ (1)

Often the starting state distribution $ρ$ will be clear from the context and we will simply write $V^{π} (r_{i})$ instead of $V^{π} (ρ; r_{i})$ . Note that we are considering an episodic reinforcement learning setup, so the policy $π$ need not be a stationary policy. We will write $π = (π_{1}, \dots, π_{H})$ where $π_{h}$ is the (non-stationary) policy used at time-step $h$ . We will write $Π$ to denote the set of such non-stationary policies.
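As an illustration of definition (1), the following is a minimal sketch of how $V^{π} (ρ; r_{i})$ could be evaluated by backward induction when the transition function is known. The function name `policy_value`, the array layout, and the toy instance are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def policy_value(P, r, pi, rho):
    """Expected H-step return of a non-stationary policy (sketch).

    P:   (S, A, S) array, P[s, a, s2] = Pr(next state s2 | s, a)
    r:   (S, A) array of rewards in [0, 1] for one agent
    pi:  (H, S, A) array, pi[h, s, a] = Pr(a | s) at step h
    rho: (S,) initial state distribution
    """
    H = pi.shape[0]
    V = np.zeros(P.shape[0])            # value after the last step is zero
    for h in reversed(range(H)):
        Q = r + P @ V                   # Q[s, a] = r(s, a) + E[V(s') | s, a]
        V = (pi[h] * Q).sum(axis=1)     # average over the policy's actions
    return float(rho @ V)

# Hypothetical toy instance: 2 states, 2 actions, horizon 3.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0], [0.0, 1.0]])
pi_uniform = np.full((3, 2, 2), 0.5)
rho = np.array([1.0, 0.0])
v = policy_value(P, r, pi_uniform, rho)
```

With $n$ agents, the same routine would simply be called once per reward function $r_{i}$.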

Any policy $π$ affects different reward functions differently (e.g., through value functions), and we want to build a measure to evaluate how fair the policy $π$ is with respect to the $n$ different reward functions. In particular, a fairness measure $W$ maps a policy $π$ and a set of $n$ reward functions to a real number, i.e., $W : Π \times \mathbb{R}^{S \times A \times n} \to \mathbb{R}$. Therefore, $W (π; \{r_{i}\}_{i \in [n]})$ provides a way to evaluate the fairness of a policy $π$, and we want to maximize the measure to find the optimal fair policy. The most basic measure is the utilitarian measure, which is the sum of values across the $n$ agents. However, this measure violates basic axioms, and we will consider the following three measures of fairness.

Minimum Welfare.

We maximize the minimum value function across the $n$ agents.

$π_{MW}^{⋆} \in \arg\max_{π} \min_{i \in [n]} V^{π} (r_{i}) .$

Generalized Gini Social Welfare (GGW).

This notion of fairness generalizes max-min fairness and has been considered previously in the literature on multi-objective Markov decision processes [SWZ20, ZGSW21]. We are given a vector of weights $w \in R^{n}$ so that $w_{i} \geq 0$ for each $i$ , $\sum_{i} w_{i} = 1$ , and $w_{1} \geq w_{2} \geq \dots \geq w_{n}$ . Given a policy $π$ , let $i_{1}, \dots, i_{n}$ be an ordering of the agents so that $V^{π} (r_{i_{1}}) \leq V^{π} (r_{i_{2}}) \leq \dots \leq V^{π} (r_{i_{n}})$ . We then maximize the following objective:

$π_{GGW}^{⋆} \in \arg\max_{π} \sum_{j = 1}^{n} w_{j} V^{π} (r_{i_{j}}) .$

When $w_{1} = 1$ , the above objective coincides with minimum welfare.

Nash Social Welfare.

We maximize the product of the value functions of the $n$ agents, which is known as the Nash social welfare.

$π_{NW}^{⋆} \in \arg\max_{π} \prod_{i = 1}^{n} V^{π} (r_{i}) .$

When the full MDP is known, a policy that maximizes each of the measures can be computed by linear programming. The details can be found in the appendix.

Prior work on (multi-objective) fair reinforcement learning has mostly focused on the minimum welfare and generalized Gini welfare measures. In this work, we want to emphasize the Nash social welfare as an alternative measure of fairness because of its attractive axiomatic properties.
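To make the three measures concrete, here is a small sketch that evaluates each of them on a vector of per-agent values $V^{π} (r_{1}), \dots, V^{π} (r_{n})$. The function names and the numbers are hypothetical and only serve as an illustration.

```python
import numpy as np

def minimum_welfare(values):
    return float(np.min(values))

def generalized_gini_welfare(values, w):
    # w is non-increasing, so the largest weight multiplies the smallest value.
    return float(np.dot(w, np.sort(values)))

def nash_welfare(values):
    return float(np.prod(values))

values = np.array([3.0, 1.0, 2.0])   # hypothetical values for n = 3 agents
w = np.array([0.5, 0.3, 0.2])        # hypothetical GGW weights

mw = minimum_welfare(values)
ggw = generalized_gini_welfare(values, w)
nw = nash_welfare(values)
# Putting all weight on the first coordinate recovers minimum welfare.
ggw_as_mw = generalized_gini_welfare(values, np.array([1.0, 0.0, 0.0]))
```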

2.1 Axioms

We view the problem of solving a fair reinforcement learning problem as the maximization of a fairness measure $W$. Therefore, we propose four axioms that any such fairness measure $W$ should satisfy. These axioms are inspired by classical axioms in economics, particularly social choice theory [Sen18]. For our setting, we build on the axiomatic framework developed in [KN79]. To define the axioms we will need some additional notation.

Given a policy $π$ we will write $q^{π}$ to denote its state-action occupancy measure, i.e.,

$q_{h}^{π} (s, a) = {Pr}_{π} (s_{h} = s, a_{h} = a) .$ (2)
Given an occupancy measure $q$ we will write $π^{q}$ to denote the corresponding policy, i.e.,

$π_{h}^{q} (s, a) = \begin{cases} \frac{q_{h} (s, a)}{\sum_{b} q_{h} (s, b)} & \text{if } \sum_{b} q_{h} (s, b) > 0 \\ \frac{1}{A} & \text{otherwise} \end{cases}$

It is known that the occupancy measure of the policy $π^{q}$ is $q$ .
Given two occupancy measures $q_{1}$ and $q_{2}$ , we will write $π^{α q_{1} + (1 - α) q_{2}}$ to denote the policy corresponding to the convex combination of the occupancy measures $q_{1}$ and $q_{2}$ . In particular, first we define the occupancy measure $q = α q_{1} + (1 - α) q_{2}$ and then take the policy $π^{q}$ .
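The correspondence between policies and occupancy measures can be sketched as follows. This is an illustrative implementation under assumed array layouts (time steps are indexed from 0, and `occupancy` and `policy_from_occupancy` are hypothetical helper names); the random instance is made up for the example.

```python
import numpy as np

def occupancy(P, pi, rho):
    """q[h, s, a] = Pr_pi(s_h = s, a_h = a) for h = 0, ..., H - 1."""
    H, S, A = pi.shape
    q = np.zeros((H, S, A))
    d = rho.copy()                              # state distribution at step h
    for h in range(H):
        q[h] = d[:, None] * pi[h]
        d = np.einsum("sa,sat->t", q[h], P)     # push forward through P
    return q

def policy_from_occupancy(q):
    """pi^q: normalise q_h(s, .), uniform on states with zero mass."""
    A = q.shape[2]
    mass = q.sum(axis=2, keepdims=True)
    safe = np.where(mass > 0, mass, 1.0)        # avoid division by zero
    return np.where(mass > 0, q / safe, 1.0 / A)

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))      # (S, A, S), rows sum to 1
pi = rng.dirichlet(np.ones(2), size=(4, 3))     # (H, S, A)
rho = np.array([1.0, 0.0, 0.0])
q = occupancy(P, pi, rho)
pi_back = policy_from_occupancy(q)
```

In this sketch, `pi_back` may differ from `pi` on states that are never visited, but both induce the same occupancy measure, which is the property used in the axioms below.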

Axiom 1 (Pareto Optimality).

If $V^{π} (r_{i}) \geq V^{\tilde{π}} (r_{i})$ for all $i \in [n]$ and $V^{π} (r_{j}) > V^{\tilde{π}} (r_{j})$ for some $j$, then it must be that

$W (π; r) > W (\tilde{π}; r) .$

Namely, if policy $π$ Pareto-improves policy $\tilde{π}$, then $W$ must assign a higher value to $π$.

Axiom 2 (Independence of Irrelevant Alternatives with Neutrality).

Suppose that for all $i \in [n]$ ,

$\frac{V^{π_{1}} (r_{i})}{V^{π_{2}} (r_{i})} = \frac{V^{\tilde{π}_{1}} (\tilde{r}_{i})}{V^{\tilde{π}_{2}} (\tilde{r}_{i})} .$

Then $W (π_{1}; r) \geq W (π_{2}; r)$ if and only if $W (\tilde{π}_{1}; \tilde{r}) \geq W (\tilde{π}_{2}; \tilde{r})$.

Namely, if every agent's value ratio between $π_{1}$ and $π_{2}$ under $r$ equals her value ratio between $\tilde{π}_{1}$ and $\tilde{π}_{2}$ under $\tilde{r}$, then $W$ must rank $π_{1}$ against $π_{2}$ in the same way as it ranks $\tilde{π}_{1}$ against $\tilde{π}_{2}$.

Axiom 3 (Anonymity).

For any permutation $σ : [n] \to [n]$ of the agents and policy $π$ , it must be that

$W (π; r) = W (π; \{r_{σ (i)}\}_{i \in [n]}) .$

Namely, $W$ is independent of the indices of the agents.

Axiom 4 (Continuity).

Suppose $W (π_{1}; r) \geq W (π_{2}; r) \geq W (π_{3}; r)$ and let $q_{j} = q^{π_{j}}$ for $j \in \{1, 2, 3\}$. Then there exists $α \in [0, 1]$ such that

W (π_{2}; r) = W (π^{α q_{1} + (1 - α) q_{3}}; r) .

Namely, there is an intermediate policy between $π_{1}$ and $π_{3}$ that attains the same value as $π_{2}$ under $W$. Note that we do not take a direct convex combination of the policies; rather, we take a convex combination of their occupancy measures. This is because the value function in an MDP is a non-linear function of the policy, and there might not exist a policy of the form $α π_{1} + (1 - α) π_{3}$ that achieves the same value as $π_{2}$.

3 Axiomatic Analysis

We investigate which of the above axioms are satisfied by different fairness measures. First, we show that even though the axioms appear basic, not all of them are satisfied by the fairness measures considered previously in the literature. In fact, we can show that minimum welfare violates PO and IIAN, whereas the measure generalized Gini welfare ( $G G W$ ) violates IIAN under any choice of weight vector $w$ .

Minimum Welfare Violates Axiom 1 (PO).

Consider an MDP with a single state $s$ and two actions $a$ and $b$. Agent $1$ has the same reward for both actions, say $1$. On the other hand, agent $2$ has reward $1$ for action $a$ and reward $2$ for action $b$. Consider a policy $π_{a}$ that always pulls action $a$ in state $s$. Then $V^{π_{a}} (r_{1}) = V^{π_{a}} (r_{2}) = H$. Now consider another policy $π_{b}$ that always pulls action $b$ in state $s$. Then $V^{π_{b}} (r_{1}) = H$ and $V^{π_{b}} (r_{2}) = 2 H$. Since the minimum value function does not change, we have $MW (π_{a}) = MW (π_{b}) = H$, even though $π_{b}$ Pareto-improves $π_{a}$. Hence $MW$ violates Pareto optimality.

Minimum Welfare Violates Axiom 2 (IIAN).

We again consider an MDP with a single state $s$ and two actions $a$ and $b$. We consider two reward functions $r$ and $\tilde{r}$ defined as follows. For the first agent

$r_{1} (s, a) = \tilde{r}_{1} (s, a) = 1, \quad \text{and} \quad r_{1} (s, b) = \tilde{r}_{1} (s, b) = 0.$ (3)

On the other hand, for the second agent, we have

$r_{2} (s, a) = \frac{1}{4}, \quad r_{2} (s, b) = \frac{3}{4}, \quad \tilde{r}_{2} (s, a) = 1, \quad \text{and} \quad \tilde{r}_{2} (s, b) = 3.$ (4)

Consider two policies: $π_{1}$ pulls $a$ or $b$ with equal probability, whereas $π_{2}$ pulls $a$ with probability $3 / 4$ and $b$ with probability $1 / 4$ . For the first agent we then have,

$V^{π_{1}} (r_{1}) = V^{π_{1}} (\tilde{r}_{1}) = \frac{H}{2}, \quad \text{and} \quad V^{π_{2}} (r_{1}) = V^{π_{2}} (\tilde{r}_{1}) = \frac{3 H}{4} .$

For the second agent we have,

$V^{π_{1}} (r_{2}) = \frac{H}{2}, \quad V^{π_{2}} (r_{2}) = \frac{3 H}{8}, \quad V^{π_{1}} (\tilde{r}_{2}) = 2 H, \quad V^{π_{2}} (\tilde{r}_{2}) = \frac{3 H}{2} .$

Therefore, we observe that the rewards and the policies satisfy the following condition.

$\frac{V^{π_{1}} (r_{1})}{V^{π_{2}} (r_{1})} = \frac{V^{π_{1}} (\tilde{r}_{1})}{V^{π_{2}} (\tilde{r}_{1})} = \frac{2}{3} \quad \text{and} \quad \frac{V^{π_{1}} (r_{2})}{V^{π_{2}} (r_{2})} = \frac{V^{π_{1}} (\tilde{r}_{2})}{V^{π_{2}} (\tilde{r}_{2})} = \frac{4}{3}$

For the reward vector $r$ we have $MW (π_{1}; r) = \frac{H}{2}$ and $MW (π_{2}; r) = \frac{3 H}{8}$. On the other hand, for the reward vector $\tilde{r}$ we have $MW (π_{1}; \tilde{r}) = \frac{H}{2}$ and $MW (π_{2}; \tilde{r}) = \frac{3 H}{4}$. Therefore, $MW (π_{1}; r) > MW (π_{2}; r)$ but $MW (π_{1}; \tilde{r}) < MW (π_{2}; \tilde{r})$. This implies that the objective $MW$ does not satisfy the independence of irrelevant alternatives.
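The arithmetic of this counterexample can be checked mechanically. The sketch below uses exact fractions and sets $H = 1$, which is enough because every value scales linearly with $H$ in a single-state MDP; the dictionary layout and helper names are assumptions of this example.

```python
from fractions import Fraction as Fr

H = 1  # values scale linearly in H, so H = 1 suffices for the comparison
# Per-step rewards (agent 1, agent 2) for actions a and b, from (3) and (4).
r       = {"a": (Fr(1), Fr(1, 4)), "b": (Fr(0), Fr(3, 4))}
r_tilde = {"a": (Fr(1), Fr(1)),    "b": (Fr(0), Fr(3))}
pi_1 = {"a": Fr(1, 2), "b": Fr(1, 2)}
pi_2 = {"a": Fr(3, 4), "b": Fr(1, 4)}

def values(pi, rew):
    """Value of each agent under a stationary policy in the one-state MDP."""
    return tuple(H * sum(pi[x] * rew[x][i] for x in pi) for i in range(2))

def mw(pi, rew):
    return min(values(pi, rew))

# Value ratios between pi_1 and pi_2, per agent, under r and r_tilde.
ratios_r = [u / v for u, v in zip(values(pi_1, r), values(pi_2, r))]
ratios_rt = [u / v for u, v in zip(values(pi_1, r_tilde), values(pi_2, r_tilde))]
```

The ratios agree under both reward vectors, so the premise of Axiom 2 holds, while the ranking induced by `mw` flips.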

GGW Violates Axiom 2 (IIAN).

We consider the same example as discussed for the setting of max-min fairness. Consider a weight vector $w = (w_{1}, w_{2})$ where $w_{1} > w_{2} > 0$ and $w_{1} + w_{2} = 1$. For the reward vector $r$, we have

$GGW (π_{1}; r) = \frac{H}{2}, \quad \text{and} \quad GGW (π_{2}; r) = \frac{3 H}{8} w_{1} + \frac{3 H}{4} w_{2} = \frac{3 H}{8} (1 + w_{2}) .$

For the reward vector $\tilde{r}$, we have

$GGW (π_{1}; \tilde{r}) = \frac{H}{2} w_{1} + 2 H w_{2}, \quad \text{and} \quad GGW (π_{2}; \tilde{r}) = \frac{3 H}{4} w_{1} + \frac{3 H}{2} w_{2} .$

We now consider two cases. First, if $w_{2} < \frac{1}{3}$ we have $GGW (π_{1}; r) > GGW (π_{2}; r)$ but $GGW (π_{1}; \tilde{r}) < GGW (π_{2}; \tilde{r})$. On the other hand, if $w_{2} > \frac{1}{3}$ we have $GGW (π_{1}; r) < GGW (π_{2}; r)$ but $GGW (π_{1}; \tilde{r}) > GGW (π_{2}; \tilde{r})$.
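The two cases can be verified numerically from the per-agent values computed earlier in this example. The following sketch works in units of $H$ with exact fractions; the particular grid of weights $w_{2}$ is an arbitrary choice made for illustration.

```python
from fractions import Fraction as Fr

# Per-agent values (in units of H) of pi_1 and pi_2 under r and r_tilde ("rt").
V = {("pi1", "r"): (Fr(1, 2), Fr(1, 2)), ("pi2", "r"): (Fr(3, 4), Fr(3, 8)),
     ("pi1", "rt"): (Fr(1, 2), Fr(2)),   ("pi2", "rt"): (Fr(3, 4), Fr(3, 2))}

def ggw(vals, w2):
    lo, hi = sorted(vals)
    return (1 - w2) * lo + w2 * hi    # w1 = 1 - w2 weights the smaller value

def prefers_pi1(rew, w2):
    return ggw(V[("pi1", rew)], w2) > ggw(V[("pi2", rew)], w2)

# For each tested weight, does the ranking of pi_1 vs pi_2 flip between r and rt?
flips = {w2: prefers_pi1("r", w2) != prefers_pi1("rt", w2)
         for w2 in (Fr(1, 10), Fr(1, 4), Fr(2, 5), Fr(9, 20))}
```

At $w_{2} = \frac{1}{3}$ the two policies tie under both reward vectors in this instance, which is why that case needs separate treatment.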

Now we consider the remaining case of $w_{2} = \frac{1}{3}$. We use a different choice of reward functions than the one defined in equations (3) and (4). Reward functions $r_{1}$ and $\tilde{r}_{1}$ remain as they are, but we change the definition of reward functions $r_{2}$ and $\tilde{r}_{2}$ as follows.

$r_{2} (s, a) = \frac{1}{2}, \quad r_{2} (s, b) = \frac{2}{3}, \quad \text{and} \quad \tilde{r}_{2} (s, a) = 1, \quad \tilde{r}_{2} (s, b) = \frac{4}{3}$

It can again be checked that the required assumptions of Axiom 2 are satisfied. However, for the weight vector $(2 / 3, 1 / 3)$ we have $GGW (π_{1}; r) > GGW (π_{2}; r)$ but $GGW (π_{1}; \tilde{r}) < GGW (π_{2}; \tilde{r})$. Therefore, we have shown that for any weight vector with $w_{1} > w_{2} > 0$, one can construct an instance where the fairness measure $GGW$ violates axiom IIAN. Note that the case of $w_{1} = 1, w_{2} = 0$ corresponds to minimum welfare, and the construction can be generalized to more than two agents by just duplicating the reward functions.

Apart from the above violations, the remaining axioms are satisfied by the fairness measures minimum welfare, and the generalized Gini welfare. The proofs are straightforward and are provided in the appendix.

3.1 Nash Social Welfare

We now turn to the fairness measure Nash social welfare. In the appendix, we show that $NW$ satisfies all four axioms. The main difference from $GGW$ and $MW$ is the satisfaction of axiom IIAN. Next we state our first main result and show that, up to a transformation by a monotonically increasing function, $NW$ is the unique fairness measure that satisfies all four axioms.

Theorem 5.

Suppose there are at least four actions available at each state, and $W$ is a fairness measure satisfying Axioms 1, 2, 3, and 4. Then it holds that $W (π; r) = ϕ (N W (π; r))$ for any policy $π$ , and reward function $r$ , where $ϕ : R \to R$ is some monotonically increasing function.

The proof of Theorem 5 is provided in the appendix, and adapts the main ideas of [KN79] to the setting of episodic reinforcement learning. It proceeds in three steps.

We first define an order $≽$ on $\mathbb{R}_{+}^{n}$ as $x ≽ y$ if and only if $W (π^{a}; r) \geq W (π^{b}; r)$. Here $π^{a}$ (resp. $π^{b}$) always plays action $a$ (resp. $b$) and $r$ is a reward function constructed using $x$ and $y$.
Based on the function $W$, we then define a function $F : \mathbb{R}_{+}^{n} \to \mathbb{R}$ that respects the order $≽$. The construction of this function crucially depends on Axiom 4.
We then show that the function $F$ satisfies some requirements so that we can invoke a result due to [osborne1976irrelevant] and obtain that there exists a monotone increasing function $G$ such that $F (x) = G (\prod_{i = 1}^{n} x_{i})$. This enables us to show that $W$ is a monotone increasing transformation of $NW$.

In addition to the above properties, the Nash social welfare also provides a utility guarantee in comparison to the optimal max-min fair policy: $π_{N W}^{⋆}$ provides any agent at least $1 / n$ fraction of her value under $π_{M W}^{⋆}$ .

Theorem 6.

Suppose that the maximum $MW$ social welfare is not zero. (When the maximum $MW$ social welfare is zero, for each policy $π$ there exists an agent $i$ such that $V^{π} (r_{i}) = 0$, i.e., $NW (π) = 0$. This implies that $π_{NW}^{⋆}$ is also a max-min fair policy.) Then for any agent $i \in [n]$, it holds that

V^{π_{N W}^{⋆}} (r_{i}) \geq \frac{1}{n} V^{π_{M W}^{⋆}} (r_{i}) .

Moreover, this bound is tight, i.e., for any $δ > 0$ there exists an instance s.t. $V^{π_{N W}^{⋆}} (r_{i}) \leq \frac{1}{n} V^{π_{M W}^{⋆}} (r_{i}) + δ$ for some $i$ .

The main idea to prove Theorem 6 is to show that the optimal policy $π_{NW}^{⋆}$ can be computed by optimizing a log-concave function over a convex set, and then use first-order optimality conditions at the optimal solution. The details can be found in the appendix. (A similar relationship is known in the fair division literature [CKMP+19], where a tight bound of $1 / \sqrt{n}$ is shown, but that setting and its proof techniques are very different from ours.)
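As a sanity check of the guarantee in Theorem 6, one can look at a tiny instance where both optimal policies are found by grid search. The sketch below assumes a hypothetical single-state MDP with $H = 1$, two actions, and $n = 2$ agents, so a policy is just the probability $p$ of playing the first action; the reward numbers are made up.

```python
import numpy as np

# Hypothetical single-state instance with H = 1 and n = 2 agents.
# A policy is the probability p of playing action a (1 - p for action b).
r1 = np.array([1.0, 0.0])   # agent 1's rewards for (a, b)
r2 = np.array([0.2, 1.0])   # agent 2's rewards for (a, b)
n = 2

p = np.linspace(0.0, 1.0, 1001)
V1 = p * r1[0] + (1 - p) * r1[1]
V2 = p * r2[0] + (1 - p) * r2[1]

i_nw = int(np.argmax(V1 * V2))              # grid maximiser of Nash welfare
i_mw = int(np.argmax(np.minimum(V1, V2)))   # grid maximiser of minimum welfare

# Each agent's value under the Nash-optimal policy vs. 1/n of her max-min value.
guarantee_holds = (V1[i_nw] >= V1[i_mw] / n) and (V2[i_nw] >= V2[i_mw] / n)
```

In this instance the Nash-optimal policy is far from the $1 / n$ boundary; the tightness claim of the theorem concerns other, specially constructed instances.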

4 Learning

We now consider the setting where the probability transition function $P$ is unknown and the learner needs to learn $P$ in order to compute a fair policy. The learner interacts with the environment over $T$ episodes. We will write $π_{t}$ to denote the policy (possibly non-stationary) adopted by the learner during episode $t$ . We will also use $s_{t, h}$ to denote the state visited at time-step $h$ of episode $t$ , and $a_{t, h}$ to denote the action taken at time-step $h$ of episode $t$ . As is standard in the literature on online reinforcement learning, we will assume that the reward functions are known, but we show in the appendix how our algorithm and analysis can be easily generalized to handle unknown reward functions. We next define the regret of a learner that interacts with the world over $T$ episodes.

ALGORITHM 1: Upper Confidence Reinforcement Learning for Fair Objective $F$ (UCRL-$F$)

for $t = 1, 2, \dots$ do
    /* Compute the optimistic MDP $\tilde{P}_{t}$ and the corresponding policy $\tilde{π}_{t}$ */
    $\tilde{P}_{t} \leftarrow \arg\max_{P \in C_{t} (\hat{P})} \max_{π} F (π; P), \quad \tilde{π}_{t} \in \arg\max_{π} F (π; \tilde{P}_{t})$ (5)
    for $h = 1, \dots, H$ do
        Observe state $s_{t, h}$
        Play $a_{t, h} \sim \tilde{π}_{t, h} (\cdot | s_{t, h})$
    /* Update estimates of $P$; $N_{t} (s, a, s')$ counts the visits to $(s, a)$ that were followed by a transition to $s'$ */
    $N_{t} (s, a) = \sum_{t' \leq t} \sum_{h = 1}^{H} 1 \{s_{t', h} = s, a_{t', h} = a\}$
    for $s, a, s'$ do
        $\hat{P} (s, a, s') = \frac{N_{t} (s, a, s')}{N_{t} (s, a)}$
    Set $C_{t} (\hat{P}) = \left\{ P : \forall (s, a), \ {\| P (s, a, \cdot) - \hat{P} (s, a, \cdot) \|}_{1} \leq \sqrt{\frac{4 S \log (S A t / δ)}{\max \{1, N_{t} (s, a)\}}} \right\}$

Regret.

We define regret with respect to a generic fair objective $F \in \{NW, GGW, MW\}$. For a policy $π$ we will write $F (π)$ to denote its value according to the objective $F$, e.g., $NW (π) = \prod_{i = 1}^{n} V^{π} (ρ; r_{i})$. Let $π_{F}^{⋆}$ be the policy that maximizes the objective $F$, i.e., $π_{F}^{⋆} \in \arg\max_{π} F (π)$. Then we measure the regret of a learning algorithm $(π_{1}, \dots, π_{T})$ with respect to this optimal fair policy.

${Reg}_{F} (T) = \sum_{t = 1}^{T} \left( F (π_{F}^{⋆}) - F (π_{t}) \right)$ (6)

Algorithm.

Algorithm 1 presents our learning algorithm for a generic objective $F$. The algorithm is based on the principle of optimistic planning, introduced in [AJO08]. The learner interacts with the environment over $T$ episodes, each of length $H$. At the start of episode $t$, the learner first computes the optimistic model $\tilde{P}_{t}$ and the optimistic policy $\tilde{π}_{t}$ (eq. 5). Then the learner plays policy $\tilde{π}_{t}$ for $H$ steps, and at the end of the episode, updates the empirical probability transition function $\hat{P}$.
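The end-of-episode bookkeeping can be sketched as follows. This is an illustration rather than the authors' implementation: the helper names are hypothetical, visit counts are derived from observed transitions only, and the division is guarded with $\max \{1, N_{t} (s, a)\}$ for unvisited pairs, which is an assumption of this example. The optimistic planning step itself is not shown.

```python
import numpy as np

def update_model(counts, episode, delta, t):
    """Add one episode of (s, a, s_next) transitions and refresh the model.

    counts: (S, A, S) integer array of transition counts N_t(s, a, s')
    Returns the empirical transitions P_hat and the L1 confidence radius.
    """
    S, A, _ = counts.shape
    for s, a, s_next in episode:
        counts[s, a, s_next] += 1
    n_sa = np.maximum(1, counts.sum(axis=2))     # max{1, N_t(s, a)}
    # Assumption: unvisited (s, a) pairs get an all-zero row instead of 0/0.
    P_hat = counts / n_sa[:, :, None]
    radius = np.sqrt(4 * S * np.log(S * A * t / delta) / n_sa)
    return P_hat, radius

def in_confidence_set(P, P_hat, radius):
    """Check ||P(s, a, .) - P_hat(s, a, .)||_1 <= radius(s, a) for all (s, a)."""
    return bool(np.all(np.abs(P - P_hat).sum(axis=2) <= radius))

# Hypothetical usage: 2 states, 2 actions, one short episode.
counts = np.zeros((2, 2, 2), dtype=int)
episode = [(0, 0, 1), (1, 1, 1), (0, 0, 0)]
P_hat, radius = update_model(counts, episode, delta=0.1, t=1)
```

The radius shrinks as a state-action pair is visited more often, so the set of plausible models tightens around the empirical estimate.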

Optimistic Planning.

The most important step of Algorithm 1 is the computation of an optimistic model and a policy (eq. 5). Here $\tilde{P}_{t}$ is the probability transition function that is plausible at time $t$ (i.e., lies within the confidence set $C_{t} (\hat{P})$ centered around $\hat{P}$) and has the highest possible objective value according to the function $F$. For an arbitrary objective $F$, it is not even clear that this step can be performed efficiently. However, we show that for the objectives $\{NW, GGW, MW\}$, one can use the ellipsoid method to efficiently compute $\tilde{P}_{t}$ and $\tilde{π}_{t}$. The details for the three objectives are provided in the appendix. The next theorem bounds the regret of Algorithm 1 for the objective of Nash social welfare ($NW$).

Theorem 7.

For the objective of Nash social welfare, Algorithm 1 has regret

${Reg}_{NW} (T) = \tilde{O} (n H^{n + 1} S \sqrt{A T}) .$

Proof sketch.

The proof of this theorem is provided in the appendix. Here we discuss the main ideas behind the proof.

We can apply the Chernoff-Hoeffding inequality and the union bound to show that, with high probability, the true probability transition function $P^{⋆}$ is contained within the confidence set $C_{t} (\hat{P})$ for all episodes $t \in \{1, 2, \dots, T\}$, i.e., $\Pr (\exists t : P^{⋆} \notin C_{t} (\hat{P})) \leq δ$.
Conditioned on the event above, we show that the regret can be upper bounded as ${Reg}_{NW} (T) \leq \sum_{t = 1}^{T} \left( NW (\tilde{π}_{t}; \tilde{P}_{t}) - NW (\tilde{π}_{t}; P^{⋆}) \right)$. This step uses the fact that $\tilde{P}_{t}$ is the optimistic model and $\tilde{π}_{t}$ is the optimistic policy at time $t$.

We can then bound the difference $NW (\tilde{π}_{t}; \tilde{P}_{t}) - NW (\tilde{π}_{t}; P^{⋆})$ in Nash welfare in terms of the sum of differences in value functions. This gives us the following upper bound on regret (see Lemma 4 in the appendix).

$H^{n - 1} \sum_{i = 1}^{n} \underbrace{\sum_{t = 1}^{T} \left| V^{\tilde{π}_{t}} (r_{i}; \tilde{P}_{t}) - V^{\tilde{π}_{t}} (r_{i}; P^{⋆}) \right|}_{:= {Reg}_{i}}$

This result can be thought of as a linearization lemma, and it generalizes Lemma 2 of [HMS21] to the general setting of finite-horizon reinforcement learning.

Finally, we prove a bound of $\tilde{O} (H^{2} S \sqrt{A T})$ on the term ${Reg}_{i}$ to complete the proof. ∎
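The linearization step rests on the elementary fact that, for value vectors $x, y \in [0, H]^{n}$, the difference of products satisfies $| \prod_{i} x_{i} - \prod_{i} y_{i} | \leq H^{n - 1} \sum_{i} | x_{i} - y_{i} |$, which follows from a telescoping sum. The sketch below checks this numerically on random vectors; the values of $H$ and $n$ and the sampling scheme are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, n = 5, 4
worst_ratio = 0.0
for _ in range(10_000):
    x = rng.uniform(0, H, size=n)   # stand-in for values under the optimistic model
    y = rng.uniform(0, H, size=n)   # stand-in for values under the true model
    lhs = abs(np.prod(x) - np.prod(y))
    rhs = H ** (n - 1) * np.abs(x - y).sum()
    worst_ratio = max(worst_ratio, lhs / rhs)
# worst_ratio never exceeds 1 if the inequality holds on every sample.
```

The factor $H^{n - 1}$ in this inequality is the source of the exponential dependence on $n$ in the regret bound for Nash social welfare.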

We now consider the regret in learning the max-min fair policy. We instantiate Algorithm 1 with objective $F = MW$ and obtain a regret bound of $\tilde{O} (H^{2} S \sqrt{A T})$. Note that, unlike for the objective of Nash social welfare, the regret in this case does not grow with the number of agents $n$.

Theorem 8.

For the objective of minimum welfare, Algorithm 1 has regret

${Reg}_{MW} (T) = \tilde{O} (H^{2} S \sqrt{A T}) .$

Proof.

Proceeding similarly as the proof of Theorem 7 we can show that the regret can be upper bounded by the regret with respect to the optimistic policies.

${Reg}_{MW} (T) \leq \sum_{t = 1}^{T} \left( \min_{i} V^{\tilde{π}_{t}} (r_{i}; \tilde{P}_{t}) - \min_{i} V^{\tilde{π}_{t}} (r_{i}; P^{⋆}) \right)$

We can bound the term $\min_{i} V^{\tilde{π}_{t}} (r_{i}; P^{⋆})$ as follows.

$\min_{i} V^{\tilde{π}_{t}} (r_{i}; P^{⋆}) = \min_{i} \left\{ \left( V^{\tilde{π}_{t}} (r_{i}; P^{⋆}) - V^{\tilde{π}_{t}} (r_{i}; \tilde{P}_{t}) \right) + V^{\tilde{π}_{t}} (r_{i}; \tilde{P}_{t}) \right\}$
$\geq - \max_{i} \left| V^{\tilde{π}_{t}} (r_{i}; P^{⋆}) - V^{\tilde{π}_{t}} (r_{i}; \tilde{P}_{t}) \right| + \min_{i} V^{\tilde{π}_{t}} (r_{i}; \tilde{P}_{t})$

Substituting this bound gives us the following upper bound on ${Reg}_{M W} (T)$ .

${Reg}_{MW} (T) \leq \sum_{t = 1}^{T} \max_{i} \left| V^{\tilde{π}_{t}} (r_{i}; P^{⋆}) - V^{\tilde{π}_{t}} (r_{i}; \tilde{P}_{t}) \right| \leq \underbrace{\sum_{t = 1}^{T} H^{2} \sqrt{\sum_{s', b} {\| ϵ_{t} (s', b, \cdot) \|}_{t}^{'}}}_{:= \widetilde{Reg}}$ [By Lemma ??]

where $ϵ_{t} (s', a, s) = \left| \tilde{P}_{t} (s', a, s) - P^{⋆} (s', a, s) \right|$. The proof of Theorem 7 first upper bounds ${Reg}_{i}$ by $\widetilde{Reg}$ for any $i$ and then bounds $\widetilde{Reg}$ by $\tilde{O} (H^{2} S \sqrt{A T})$. ∎
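The key inequality in this proof, namely that the gap between two minima is at most the largest coordinate-wise gap, can be checked numerically. The sketch below draws random value vectors as hypothetical stand-ins for the values under the optimistic and true models.

```python
import numpy as np

rng = np.random.default_rng(1)
H, n = 10, 5
gaps = []
for _ in range(10_000):
    v_opt = rng.uniform(0, H, size=n)    # V(r_i; optimistic model), hypothetical
    v_true = rng.uniform(0, H, size=n)   # V(r_i; true model), hypothetical
    lhs = v_opt.min() - v_true.min()
    rhs = np.abs(v_opt - v_true).max()
    gaps.append(rhs - lhs)
min_gap = min(gaps)   # non-negative when min_i x_i - min_i y_i <= max_i |x_i - y_i|
```

Because the right-hand side is a maximum over agents rather than a sum, no factor of $n$ enters the bound.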

The next theorem upper bounds the regret for the objective of Generalized Gini Social welfare.

Theorem 9.

For the objective of generalized Gini social welfare, Algorithm 1 has regret

${Reg}_{GGW} (T) = \tilde{O} (H^{2} S \sqrt{A T}) .$

Proof.

We proceed very similarly to the proof of Theorem 7. Since the true probability transition function $P^{⋆}$ is contained within $C_{t} (ˆ P)$ with high probability and algorithm 1 uses optimistic planning, we can bound the regret as follows.

${Reg}_{GGW} (T) \leq \sum_{t = 1}^{T} \left( GGW (\tilde{π}_{t}; \tilde{P}_{t}) - GGW (\tilde{π}_{t}; P^{⋆}) \right)$

We now consider two orderings of the agents. Let $i_{1}, i_{2}, \dots, i_{n}$ be an ordering of the agents so that

$V^{\tilde{π}_{t}} (r_{i_{1}}; \tilde{P}_{t}) \geq V^{\tilde{π}_{t}} (r_{i_{2}}; \tilde{P}_{t}) \geq \dots \geq V^{\tilde{π}_{t}} (r_{i_{n}}; \tilde{P}_{t}) .$

Let $ℓ_{1}, ℓ_{2}, \dots, ℓ_{n}$ be an ordering of the agents so that

$V^{\tilde{π}_{t}} (r_{ℓ_{1}}; P^{⋆}) \geq V^{\tilde{π}_{t}} (r_{ℓ_{2}}; P^{⋆}) \geq \dots \geq V^{\tilde{π}_{t}} (r_{ℓ_{n}}; P^{⋆}) .$

Without loss of generality, we can assume that $i_{k} = k$ for all $k \in [n]$. Then we can rewrite the upper bound on regret as follows:

${Reg}_{GGW} (T) \leq \sum_{t = 1}^{T} \sum_{k = 1}^{n} w_{k} \left( V^{\tilde{π}_{t}} (r_{k}; \tilde{P}_{t}) - V^{\tilde{π}_{t}} (r_{ℓ_{k}}; P^{⋆}) \right) .$

We can show that for all $k \in [n]$ it holds that

$V^{\tilde{π}_{t}} (r_{k}; \tilde{P}_{t}) - V^{\tilde{π}_{t}} (r_{ℓ_{k}}; P^{⋆}) \leq \max_{i \in [n]} \left| V^{\tilde{π}_{t}} (r_{i}; \tilde{P}_{t}) - V^{\tilde{π}_{t}} (r_{i}; P^{⋆}) \right| .$ (7)

Indeed, if $V^{\tilde{π}_{t}} (r_{k}; \tilde{P}_{t}) \leq V^{\tilde{π}_{t}} (r_{ℓ_{k}}; P^{⋆})$, then we are done, so we assume that $V^{\tilde{π}_{t}} (r_{k}; \tilde{P}_{t}) > V^{\tilde{π}_{t}} (r_{ℓ_{k}}; P^{⋆})$.

Note that there must be $j \geq k$ such that $ℓ_{j} \leq k$. This is because the converse, i.e., $ℓ_{j} > k$ for all $j \geq k$, would place the $n - k + 1$ indices $ℓ_{k}, \dots, ℓ_{n}$ among only $n - k$ values and hence imply that $ℓ_{j} = ℓ_{j'}$ for some $j \neq j'$, which is a contradiction. Since $ℓ_{j} \leq k$ gives $V^{\tilde{π}_{t}} (r_{k}; \tilde{P}_{t}) \leq V^{\tilde{π}_{t}} (r_{ℓ_{j}}; \tilde{P}_{t})$ and $j \geq k$ gives $V^{\tilde{π}_{t}} (r_{ℓ_{k}}; P^{⋆}) \geq V^{\tilde{π}_{t}} (r_{ℓ_{j}}; P^{⋆})$, it then follows that

$V^{\tilde{π}_{t}} (r_{k}; \tilde{P}_{t}) - V^{\tilde{π}_{t}} (r_{ℓ_{k}}; P^{⋆}) \leq V^{\tilde{π}_{t}} (r_{ℓ_{j}}; \tilde{P}_{t}) - V^{\tilde{π}_{t}} (r_{ℓ_{j}}; P^{⋆}) = \left| V^{\tilde{π}_{t}} (r_{ℓ_{j}}; \tilde{P}_{t}) - V^{\tilde{π}_{t}} (r_{ℓ_{j}}; P^{⋆}) \right| ,$

so (7) follows.

Hence, since the weights satisfy $\sum_{k} w_{k} = 1$ , we have

{Reg}_{GGW} (T) \leq \sum_{t = 1}^{T} \max_{i} \left| V^{\tilde{π}_{t}} (r_{i}; P^{⋆}) - V^{\tilde{π}_{t}} (r_{i}; \tilde{P}_{t}) \right|

and the remainder of the proof is the same as that of Theorem 8. ∎
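The key step of this proof, that the gap between two generalized Gini welfare values is at most the largest per-agent gap, can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper names `ggw`, `max_gap` and `check` are hypothetical, and the ordering is taken to be descending, as in the two orderings used in the proof.

```python
import random


def ggw(values, weights):
    """Weighted sum of order statistics: w_k multiplies the k-th largest value
    (the descending ordering used in the proof above; an assumption of this sketch)."""
    return sum(w * v for w, v in zip(weights, sorted(values, reverse=True)))


def max_gap(u, v):
    """Largest per-agent gap max_i |u_i - v_i|."""
    return max(abs(a - b) for a, b in zip(u, v))


def check(n=5, trials=2000, seed=0):
    """Randomly test GGW(u) - GGW(v) <= max_i |u_i - v_i| for normalized weights."""
    rng = random.Random(seed)
    for _ in range(trials):
        raw = sorted((rng.uniform(0.01, 1.0) for _ in range(n)), reverse=True)
        total = sum(raw)
        weights = [x / total for x in raw]  # non-negative, non-increasing, sums to 1
        u = [rng.uniform(0, 10) for _ in range(n)]  # stand-in for values under the optimistic model
        v = [rng.uniform(0, 10) for _ in range(n)]  # stand-in for values under the true model
        if ggw(u, weights) - ggw(v, weights) > max_gap(u, v) + 1e-9:
            return False
    return True
```

The check passes because each order statistic is 1-Lipschitz in the sup norm and the weights sum to one, which is exactly what inequality (7) exploits.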

4.1 Lower Bound on Regret

If all the $n$ agents have identical reward functions then regret in learning minimum welfare ( ${Reg}_{M W} (T)$ ) just corresponds to regret in learning a single reward function. Therefore, ${Reg}_{M W} (T)$ must be at least the regret in learning for the single-agent setting which is $Ω (\sqrt{H S A T})$ [AOM17]. Now, for the generalized Gini welfare recall that the weights are normalized i.e. $\sum_{k} w_{k} = 1$ . So, the regret in learning ( ${Reg}_{G G W} (T)$ ) again coincides with the regret in learning for the single-agent setting and we have ${Reg}_{G G W} (T) \geq Ω (\sqrt{H S A T})$ .

We now consider the regret in learning for the Nash welfare objective. Note that the upper bound in regret scales exponentially with $n$ , the number of agents. Intuitively this is expected as in an episodic reinforcement learning setting, the Nash social welfare of $n$ agents scales as $H^{n}$ . We now provide a lower bound in regret ${Reg}_{N W} (T)$ that also scales exponentially with $n$ .

Theorem 10.

Suppose $T \geq Ω (log H)$ and $H \geq Ω (log S)$ . Then for any policy $π$ , there exists an MDP with $S$ states and $A$ actions so that $E [{Reg}_{N W} (π; T)] \geq C n {(\frac{H}{2})}^{n} \sqrt{S A T}$ for some universal constant $C > 0$ .

Here we briefly highlight the main ideas of the proof. We use the same instance as in the lower bound construction for average-reward MDPs [LS20, AJO08], which consists of a collection of $O (S A)$ MDPs. Each MDP has $S - 2$ states arranged in the form of an $A$ -ary tree, and two other states: a good state $s_{g}$ (reward $1$ ) and a bad state $s_{b}$ (reward $0$ ). Any action taken at a leaf of this tree transitions uniformly at random to one of the two remaining states, good ( $s_{g}$ ) or bad ( $s_{b}$ ). However, on model $M_{ℓ, a}$ we define

P (s_{g} | ℓ, a) = 1 / 2 + Δ, P (s_{b} | ℓ, a) = 1 / 2 - Δ .

Now observe that on model $M_{ℓ, a}$ the optimal policy can navigate to leaf $ℓ$ and take action $a$ . Moreover, the transition probabilities are designed in such a way that once either the good or the bad state is reached, it takes at least $Θ (H)$ steps to leave that state. This means the optimal policy achieves Nash welfare of at least $(1 / 2 + Δ)^{n} H^{n}$ per episode. On the other hand, if the learning policy does not take action $a$ at leaf node $ℓ$ , it achieves a Nash welfare of at most $(1 / 2)^{n} H^{n}$ in that episode. Therefore, the regret in each such episode is at least $((1 / 2 + Δ)^{n} - (1 / 2)^{n}) H^{n} \geq 2 n Δ (H / 2)^{n}$ , where the inequality follows from $(1 + 2 Δ)^{n} \geq 1 + 2 n Δ$ . The rest of the proof shows that one can choose $Δ$ small enough so that there exists a model under which the number of such misses by the learning algorithm is $Ω (T)$ .
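The per-episode gap $((1 / 2 + Δ)^{n} - (1 / 2)^{n}) H^{n}$ can be compared numerically against the Bernoulli-type lower bound $2 n Δ (H / 2)^{n}$ . The snippet below is only an illustrative sketch; the function names are hypothetical and the parameter values are arbitrary.

```python
def nash_gap(n, H, delta):
    """Per-episode Nash welfare gap between taking and missing action a at leaf l."""
    return ((0.5 + delta) ** n - 0.5 ** n) * H ** n


def bernoulli_lower_bound(n, H, delta):
    """2 n delta (H/2)^n, obtained from (1 + 2 delta)^n >= 1 + 2 n delta."""
    return 2 * n * delta * (H / 2) ** n


def gap_dominates(n, H, delta, rel_tol=1e-9):
    """True when the exact gap is at least the lower bound (up to rounding)."""
    return nash_gap(n, H, delta) >= bernoulli_lower_bound(n, H, delta) * (1 - rel_tol)
```

For $n = 1$ the two quantities coincide, and for larger $n$ the exact gap is strictly larger, which shows how the factor $n (H / 2)^{n}$ in Theorem 10 arises once $Δ$ is chosen of order $\sqrt{S A / T}$ .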

4.2 Improved Bound for Minimum Welfare

We now consider a weaker notion of regret for the fairness measure $M W$ , minimum social welfare.

{Reg}_{MW}^{W} (T) = \sum_{t = 1}^{T} MW (π_{MW}^{⋆}; r) - \min_{i} \sum_{t = 1}^{T} V^{π_{t}} (r_{i})

(8)

The difference from the regret definition (6) is that, in the second term, the order of the minimum and the summation is interchanged. Since value functions are non-negative, the sum of the minimum welfare across $T$ episodes is upper bounded by the minimum of the total welfare across $T$ episodes. This implies that ${Reg}_{M W}^{W} (T) \leq {Reg}_{M W} (T)$ , and we show that it is possible to improve the bound for ${Reg}_{M W}^{W} (T)$ .

Before deriving a regret bound with respect to the new definition, we first argue that the new regret eq. 8 is perhaps more appropriate for our setting. Since each of the $n$ agents interacts with the learner for $T$ episodes, the total utility received by agent $i$ is $\sum_{t = 1}^{T} V^{π_{t}} (r_{i})$ . Therefore, the minimum total welfare is ${min}_{i} \sum_{t = 1}^{T} V^{π_{t}} (r_{i})$ . On the other hand, the optimal minimum welfare achievable over $T$ episodes is

{max}_{π \in Π} {min}_{i} \sum_{t = 1}^{T} V^{π} (r_{i}) = T {max}_{π \in Π} {min}_{i} V^{π} (r_{i}) .

The difference between these two quantities is precisely the regret in eq. 8.
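To make the difference between the two regret notions concrete, the sketch below compares them on a toy sequence of per-episode values. It assumes that definition (6) has the form $\sum_{t} (M W (π_{M W}^{⋆}; r) - {min}_{i} V^{π_{t}} (r_{i}))$ , as suggested by the discussion above; the function names and the numbers are illustrative only.

```python
def regret_strong(v_opt, values):
    """Assumed form of (6): sum over episodes of v_opt minus the per-episode minimum value."""
    return sum(v_opt - min(row) for row in values)


def regret_weak(v_opt, values):
    """Regret (8): T * v_opt minus the minimum over agents of the total value."""
    n_agents = len(values[0])
    totals = [sum(row[i] for row in values) for i in range(n_agents)]
    return len(values) * v_opt - min(totals)


# values[t][i] = value of agent i under the policy played in episode t.
# Alternating between two policies that each favour one agent:
alternating = [[1.0, 0.0], [0.0, 1.0]]
```

With `v_opt = 0.5`, `regret_strong(0.5, alternating)` is `1.0` while `regret_weak(0.5, alternating)` is `0.0`: alternating between policies can be fair over the whole horizon even though every single episode is unfair to one agent.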

Algorithm 2 describes our new learning algorithm, which is inspired by the Lagrangian dual of the following linear program for computing a max-min fair policy.

\begin{matrix} \max_{v \in R, q \in Q (ρ, P)} & v \\ s.t. & \sum_{h = 1}^{H} \sum_{s, a} q_{h} (s, a) r_{i} (s, a) \geq v \quad \forall i \in [n] \end{matrix}

(9)

Here $q$ is the state-action occupancy measure (as defined in eq. 2) and the constraint $q \in Q (ρ, P)$ ensures that $q$ is valid with respect to $P$ . The Lagrangian of the optimization problem (9) is given as

L (q, v; λ) = v + \sum_{i = 1}^{n} λ_{i} \left( \sum_{h = 1}^{H} \sum_{s, a} q_{h} (s, a) r_{i} (s, a) - v \right)

Suppose we know $v^{⋆}$ , the optimal solution to (9), i.e., the maxmin value. Then substituting $v = v^{⋆}$ we get the following expression for the Lagrangian.

L (q; λ) = v^{⋆} + \sum_{i = 1}^{n} λ_{i} \left( \sum_{h = 1}^{H} \sum_{s, a} q_{h} (s, a) r_{i} (s, a) - v^{⋆} \right)

Here we assume that $v^{⋆}$ is known and it is an input to Algorithm 2. In our full proof we show that it suffices to set $v^{⋆}$ to an upper bound of the maxmin value in Algorithm 2 (e.g., we can use an upper bound of the rewards).

The Lagrangian can be interpreted as a two-player zero-sum game where the learner (max-player) plays $q$ and the adversary (min-player) plays $λ$ . Algorithm 2 tries to find an equilibrium of this game by using a reinforcement learning algorithm for the $q$ -player and a bandit algorithm for the $λ$ -player. In order to see which RL algorithm is suitable let us first rewrite the term $L (q, λ)$ as

\sum_{h = 1}^{H} \sum_{s, a} q_{h} (s, a) \left( \sum_{i} λ_{i} r_{i} (s, a) + \frac{v^{⋆}}{H} \left( 1 - \sum_{i} λ_{i} \right) \right)

Therefore, we can define a new environment with rewards $˜ r (s, a) = \sum_{i} λ_{i} r_{i} (s, a) + \frac{v^{⋆}}{H} (1 - \sum_{i} λ_{i})$ and in the new environment the RL agent receives the expected reward $L (q, λ)$ . We will fix the choice of $λ$ at the start of an episode (say $λ^{t}$ at the start of episode $t$ ), and we also need to fix the policy at the start of the episode. For this reason, we will be using the UOB-REPS algorithm [JJLS+20] which works with unknown transition, and adversarial rewards, and has regret $O (H S \sqrt{A T})$ .
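The rewriting of the Lagrangian as the expected return under the modified reward $˜ r$ relies on the fact that the occupancy measure sums to one at every step $h$ . The following sketch verifies the identity numerically on a small hand-made occupancy measure; the data layout `q[h][s][a]` and `r[i][s][a]` and all function names are assumptions made for illustration.

```python
def lagrangian(q, r, lam, v_star):
    """L(q; lam) = v* + sum_i lam_i * (sum_{h,s,a} q_h(s,a) r_i(s,a) - v*)."""
    total = v_star
    for lam_i, r_i in zip(lam, r):
        value_i = sum(
            q_h[s][a] * r_i[s][a]
            for q_h in q
            for s in range(len(q_h))
            for a in range(len(q_h[s]))
        )
        total += lam_i * (value_i - v_star)
    return total


def modified_reward_value(q, r, lam, v_star):
    """Expected return of q under r~(s,a) = sum_i lam_i r_i(s,a) + (v*/H)(1 - sum_i lam_i)."""
    horizon = len(q)
    bonus = v_star / horizon * (1 - sum(lam))
    total = 0.0
    for q_h in q:
        for s in range(len(q_h)):
            for a in range(len(q_h[s])):
                r_tilde = sum(l * r_i[s][a] for l, r_i in zip(lam, r)) + bonus
                total += q_h[s][a] * r_tilde
    return total


# Toy instance: H = 2 steps, 2 states, 2 actions, 2 agents (values are arbitrary).
q = [[[0.5, 0.5], [0.0, 0.0]], [[0.25, 0.25], [0.1, 0.4]]]
r = [[[1.0, 0.0], [0.2, 0.7]], [[0.0, 1.0], [0.9, 0.1]]]
lam = [0.3, 1.2]
v_star = 1.5
```

On this instance `lagrangian(q, r, lam, v_star)` and `modified_reward_value(q, r, lam, v_star)` agree up to floating-point error, because each `q[h]` sums to one.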

The $λ$ -player, on the other hand, chooses a point in the convex set $C = \{λ \in R_{+}^{n} : \sum_{i} λ_{i} \leq B\}$ at the start of each episode and receives an expected cost of $L (q, λ)$ . As the cost function is linear in $λ$ , we can use any algorithm for linear bandits with adversarial rewards. We will use the OSMD algorithm proposed in [BCK12], which has regret $O (\sqrt{n T log A})$ .

Input: Maxmin value $v^{⋆}$ , number of episodes $T$ , length of each episode $H$
Instantiate UOB-REPS with action set $A$ and state space $S$
Instantiate OSMD with action set $C = \{λ \in R_{+}^{n} : \sum_{i} λ_{i} \leq B\}$
for $t = 1, \dots, T$ do
    $π^{t} \leftarrow$ policy chosen according to UOB-REPS.
    $λ^{t} \leftarrow$ action chosen according to OSMD.
    for $h = 1, \dots, H$ do
        Observe $x_{t, h}$
        Action $a_{t, h} = π^{t} (x_{t, h})$
        Reward feedback $\tilde{r}_{t, h} = \sum_{i} λ_{i}^{t} r_{i} (x_{t, h}, a_{t, h}) + \frac{v^{⋆}}{H} (1 - \sum_{i} λ_{i}^{t})$
    Loss feedback to OSMD: $v^{⋆} + \sum_{i = 1}^{n} λ_{i}^{t} (\sum_{h = 1}^{H} r_{i} (x_{t, h}, a_{t, h}) - v^{⋆})$

ALGORITHM 2 Lagrange-Maximin

Theorem 11.

For $n \leq \frac{S^{2} A}{log A}$ , Algorithm 2 has regret

{Reg}_{M W}^{W} (T) = ˜ O (H S \sqrt{A T})

The main idea is to show that the average occupancy measure $\bar{q} = \frac{1}{T} \sum_{t} q^{t}$ and the average multiplier $\bar{λ} = \frac{1}{T} \sum_{t} λ^{t}$ form an $ϵ$ -approximate equilibrium of the game $L (q, λ)$ for $ϵ = \tilde{O} (B (H S \sqrt{A} + \sqrt{n log A}) / \sqrt{T})$ . This lets us bound the constraint violation of $\bar{q}$ with respect to the LP (9) and also bound the regret ${Reg}_{M W}^{W} (T)$ . The proof also shows that any choice of $v^{⋆}$ larger than $v_{M W}^{⋆} = {max}_{π} {min}_{i} V^{π} (r_{i})$ works, so one can call Algorithm 2 with $v^{⋆} = H$ . Additionally, the parameter $B$ needs to be at least the maximum possible value of any reward function.

5 Conclusion

In this paper, we proposed a set of axioms for selecting a fairness measure in multi-agent sequential decision-making systems. We analyzed three different fairness measures ( $N W, M W$ , and $G G W$ ) against the axioms. When the underlying MDP is unknown, we proposed a generic learning algorithm for minimizing regret with respect to different fair optimal policies. There are many interesting directions for future work. First, it would be interesting to extend our framework to RL with general function approximation. Our axioms should generalize easily, but the computation of fair optimal policies might be challenging. Second, it would be valuable to close the gap between the lower and upper bounds on regret (Table 1). For single-agent RL, value-iteration-based methods achieve minimax regret bounds. However, fairness measures are non-linear, and we might need to devise different learning algorithms. Finally, it would also be interesting to consider a setting where agents have partial information about the underlying MDP [DCR18].


Appendix A Computing Optimal Policies

We first recall the linear programming formulation of reinforcement learning. Consider an MDP $M = (S, A, P, r, ρ)$ , where our goal is to maximize the value function with respect to the reward function $r$ .

\max_{π} V^{π} (ρ; r)

The primal linear program for this problem is the following.

	$\min_{\{V_{h}\}_{h \in [H]}}$	$\sum_{s} ρ (s) V_{1} (s)$
	s.t.	$V_{h} (s) \geq r (s, a) + \sum_{s^{'}} P (s, a, s^{'}) V_{h + 1} (s^{'}) \quad \forall h \in [H - 1]$

We will use the dual formulation of the above linear program.

\begin{matrix} \max_{q} & \sum_{h = 1}^{H} \sum_{s, a} q_{h} (s, a) r (s, a) \\ s.t. & \sum_{a} q_{1} (s, a) = ρ (s) \quad \forall s \\ & \sum_{a} q_{h + 1} (s, a) = \sum_{s^{'}, a} q_{h} (s^{'}, a) P (s^{'}, a, s) \quad \forall s, \forall h \in [H - 1] \\ & q \geq 0 \end{matrix}

(10)

Here the variable $q_{h} (s, a)$ should be interpreted as the probability that the policy visits state $s$ at time-step $h$ and takes action $a$ . We will refer to the variables $\{q_{h} (s, a)\}_{a \in A, s \in S, h \in [H]}$ as a state-action occupancy measure. Once we solve the LP (10), we can obtain the optimal policy as

π_{h} (a | s) = \begin{cases} \frac{q_{h} (s, a)}{\sum_{b} q_{h} (s, b)} & \text{if } \sum_{b} q_{h} (s, b) > 0 \\ \frac{1}{A} & \text{otherwise} \end{cases}

(11)
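The policy extraction rule in (11) can be sketched as follows. This is a minimal illustration assuming the occupancy measure is stored as a nested list `q[h][s][a]`; the function name `policy_from_occupancy` is hypothetical.

```python
def policy_from_occupancy(q):
    """Turn an occupancy measure q[h][s][a] into a policy pi[h][s][a] via eq. (11).

    States with zero visitation mass at step h receive the uniform distribution.
    """
    policy = []
    for q_h in q:
        pi_h = []
        for row in q_h:
            mass = sum(row)
            if mass > 0:
                pi_h.append([x / mass for x in row])
            else:
                pi_h.append([1.0 / len(row)] * len(row))
        policy.append(pi_h)
    return policy


# One step, two states, two actions; the second state is never visited.
example_q = [[[0.2, 0.6], [0.0, 0.0]]]
example_pi = policy_from_occupancy(example_q)
```

Here `example_pi[0][0]` is approximately `[0.25, 0.75]`, while the unvisited state gets the uniform distribution `[0.5, 0.5]`.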

Occupancy Measure Polytope: Given an initial state distribution $ρ$ and a probability transition function $P$ we will write $Q (ρ, P)$ to denote the set of state-action occupancy measures satisfying the Bellman flow conditions with respect to $ρ$ and $P$ , i.e.

Q (ρ, P) = \left\{ q \geq 0 : \sum_{a} q_{1} (s, a) = ρ (s) \ \forall s \ \text{ and } \ \sum_{a} q_{h + 1} (s, a) = \sum_{s^{'}, a} q_{h} (s^{'}, a) P (s^{'}, a, s) \ \forall s, \forall h \in [H - 1] \right\}

Utilitarian.

We maximize the sum of value functions across the $n$ agents.

π_{UL}^{⋆} \in \arg\max_{π} \sum_{i} V^{π} (ρ; r_{i})

We can solve optimization problem (10) with reward function $r = \sum_{i} r_{i}$ and obtain $q_{U L}^{⋆}$ .

\begin{matrix} q_{UL}^{⋆} \in & \arg\max_{q} \sum_{h = 1}^{H} \sum_{s, a} q_{h} (s, a) \sum_{i} r_{i} (s, a) \\ s.t. & q \in Q (ρ, P) \end{matrix}

(12)

Once we obtain $q_{U L}^{⋆}$ we can obtain the optimal utilitarian policy $π_{U L}^{⋆}$ through eq. 11.

Max-Min Fair.

We maximize the worst value function across the $n$ agents.

π_{MW}^{⋆} \in \arg\max_{π} \min_{i \in [n]} V^{π} (ρ; r_{i})

The above optimization problem can also be solved through a linear program.

\begin{matrix} \max_{q, t} & t \\ s.t. & \sum_{h = 1}^{H} \sum_{s, a} q_{h} (s, a) r_{i} (s, a) \geq t \quad \forall i \in [n] \\ & q \in Q (ρ, P) \end{matrix}

(13)

Nash Social Welfare.

We maximize the product of the value functions of the $n$ agents.

π_{NW}^{⋆} \in \arg\max_{π} \prod_{i = 1}^{n} V^{π} (ρ; r_{i})

The above optimization problem can be solved through a concave optimization problem.

\begin{matrix} \max_{q} & \sum_{i = 1}^{n} \log \left( \sum_{h = 1}^{H} \sum_{s, a} q_{h} (s, a) r_{i} (s, a) \right) \\ s.t. & q \in Q (ρ, P) \end{matrix}

(14)

Generalized Gini Social Welfare (GGW).

This notion of fairness generalizes max-min fairness and has been considered previously in the literature on multi-objective Markov decision processes [SWZ20, ZGSW21]. We are given a vector of weights $w \in R^{n}$ so that $w_{i} \geq 0$ for each $i$ , $\sum_{i} w_{i} = 1$ , and $w_{1} \geq w_{2} \geq \dots \geq w_{n}$ . Given a policy $π$ , let $i_{1}, \dots, i_{n}$ be an ordering of the $n$ agents so that $V^{π} (r_{i_{1}}) \leq V^{π} (r_{i_{2}}) \leq \dots \leq V^{π} (r_{i_{n}})$ . Then we maximize the following objective

π_{GGW}^{⋆} \in \arg\max_{π} \sum_{j = 1}^{n} w_{j} V^{π} (r_{i_{j}})

The optimal GGW policy can also be computed through a linear program.

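\begin{matrix} \max_{q, t} & t \\ s.t. & \sum_{i} w_{i} \sum_{h = 1}^{H} \sum_{s, a} q_{h} (s, a) r_{σ (i)} (s, a) \geq t \quad \forall σ \in S_{n} \\ & q \in Q (ρ, P) \end{matrix}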

(15)

Even though the linear program above has exponentially many constraints, one can solve it efficiently using the ellipsoid method. In order to apply the ellipsoid method we just need to show that there exists an efficient separation oracle. Given a candidate solution $(q, t)$ , the Bellman flow constraints, i.e. $q \in Q (ρ, P)$ , can be checked efficiently. Then for each agent $i$ we compute $V_{i} (q) = \sum_{h = 1}^{H} \sum_{s, a} q_{h} (s, a) r_{i} (s, a)$ . Let $i_{1}, i_{2}, \dots, i_{n}$ be an ordering of the agents so that $V_{i_{1}} (q) \leq V_{i_{2}} (q) \leq \dots \leq V_{i_{n}} (q)$ . Then we check whether $\sum_{j} w_{j} V_{i_{j}} (q) \geq t$ . If this constraint is violated then we have a violating permutation $σ$ where $σ (j) = i_{j}$ . On the other hand, if this constraint is satisfied then all the other constraints are satisfied, since for any other permutation $σ$ we have $\sum_{j} w_{j} V_{σ (j)} (q) \geq \sum_{j} w_{j} V_{i_{j}} (q)$ . This is because the weights are arranged in non-increasing order while the values $V_{i_{j}} (q)$ are arranged in non-decreasing order (the rearrangement inequality).
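The permutation part of the separation oracle can be sketched in a few lines. The snippet assumes the agent values $V_{i} (q)$ have already been computed and that the weights are non-increasing; by the rearrangement inequality, the most restrictive permutation pairs the largest weight with the smallest value. The names `ggw_separation_oracle` and `brute_force_min` are hypothetical, and the brute-force routine is included only to cross-check the oracle on small inputs.

```python
from itertools import permutations


def ggw_separation_oracle(values, weights, t):
    """Check the constraints sum_j w_j V_{sigma(j)} >= t over all permutations sigma.

    values[i] is V_i(q); weights must be non-increasing.
    Returns None if every constraint holds, otherwise a violating permutation
    given as a list of agent indices (sigma(j) = result[j]).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])  # ascending values
    worst = sum(w * values[i] for w, i in zip(weights, order))
    return None if worst >= t else order


def brute_force_min(values, weights):
    """Minimum of sum_j w_j V_{sigma(j)} over all permutations (exponential time)."""
    return min(
        sum(w * values[i] for w, i in zip(weights, sigma))
        for sigma in permutations(range(len(values)))
    )
```

For example, with `values = [3.0, 1.0, 2.0]` and `weights = [0.5, 0.3, 0.2]` the tightest constraint has left-hand side about 1.7, so the oracle accepts `t = 1.6` and returns the violating ordering `[1, 2, 0]` for `t = 1.8`.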

Appendix B Proof of Theorem 6

The optimal policy according to Nash social welfare can be computed through the following optimization problem.

	$\max_{q}$	$\prod_{i = 1}^{n} \left( \sum_{h = 1}^{H} \sum_{s, a} q_{h} (s, a) r_{i} (s, a) \right)$
	s.t.	$q \in Q (ρ, P)$

Let $q^{⋆}$ be the optimal solution of the above optimization problem. We now consider two cases.

Case 1. $N W (q^{⋆}) > 0$ : Since $q^{⋆}$ is an optimal solution and the set $Q (ρ, P)$ is convex, the first-order optimality condition gives us

\nabla N W (q^{⋆})^{⊤} (q - q^{⋆}) \leq 0 \forall q \in Q (ρ, P)

(16)

The derivative of the Nash welfare objective with respect to $q_{h} (s, a)$ is the following.

\frac{\partial NW (q)}{\partial q_{h} (s, a)} = \sum_{i} r_{i} (s, a) \prod_{j \neq i} \left( \sum_{h^{'} = 1}^{H} \sum_{s^{'}, a^{'}} q_{h^{'}} (s^{'}, a^{'}) r_{j} (s^{'}, a^{'}) \right) = NW (q) \sum_{i} \frac{r_{i} (s, a)}{V_{i} (q)}

Then equation (16) gives us the following result.

		$NW (q^{⋆}) \sum_{i} \sum_{h, s, a} \frac{r_{i} (s, a)}{V_{i} (q^{⋆})} (q_{h} (s, a) - q_{h}^{⋆} (s, a)) \leq 0$
	$\Rightarrow$	$NW (q^{⋆}) \sum_{i} \frac{V_{i} (q) - V_{i} (q^{⋆})}{V_{i} (q^{⋆})} \leq 0$
	$\Rightarrow$	$\sum_{i} \frac{V_{i} (q)}{V_{i} (q^{⋆})} \leq n$

This immediately implies that ${max}_{i} \frac{V_{i} (q)}{V_{i} (q^{⋆})} \leq n$ .

Case 2. $N W (q^{⋆}) = 0$ : This implies that for every policy $π$ , $V^{π} (r_{i}) = 0$ for some $i$ . Hence, the minimum value of every policy $π$ is zero and the optimal minimum welfare $M W$ is zero, contradicting the assumption of the theorem.

Upper Bound

We now construct an instance for the upper bound on the value functions under $π_{N W}^{⋆}$ . Given any $δ > 0$ , consider a single-state MDP with two actions $a$ and $b$ available. The rewards are

	$r_{1} (s, a) = δ^{n},$	$r_{i} (s, a) = 1 \text{ for all } i \neq 1,$
	and	$r_{i} (s, b) = δ \text{ for all } i .$

Clearly, $π^{b}$ , which selects action $b$ with probability $1$ , maximizes the minimum value, whereby $V^{π^{b}} (r_{i}) = δ$ for all $i \in [n]$ . To find $π_{N W}^{⋆}$ , consider an arbitrary policy which plays action $a$ with probability $x$ , and optimize $x$ with respect to the Nash welfare. Viewing $N W$ as a function of $x$ , we have, up to an additive constant independent of $x$ ,

\log NW (x) = (n - 1) \log (x + (1 - x) δ) + \log (x δ^{n} + (1 - x) δ),

whose derivative with respect to $x$ vanishes when

\frac{(1 - δ) (n - 1)}{x + (1 - x) δ} = \frac{1 - δ^{n - 1}}{x δ^{n - 1} + (1 - x)} .

Rearranging the terms we get $x = (n - 1) / n + o (1)$ , where $o (1)$ is with respect to $δ$ . Hence,

\frac{V^{π_{NW}^{⋆}} (r_{1})}{V^{π_{MW}^{⋆}} (r_{1})} = \frac{\frac{n - 1}{n} δ^{n} + \frac{1}{n} δ + o (δ)}{δ} = \frac{1}{n} + o (1) .
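The limiting ratio $1 / n$ can be reproduced numerically for the single-state instance above. The sketch below maximizes the log Nash welfare over the probability $x$ of playing action $a$ by ternary search, which is valid because the objective is a sum of logarithms of affine functions of $x$ and hence concave. Function names and the chosen values of $n$ and $δ$ are illustrative assumptions.

```python
import math


def log_nw(x, n, delta):
    """Log Nash welfare (per step, up to a constant) when action a is played w.p. x."""
    return (n - 1) * math.log(x + (1 - x) * delta) + math.log(
        x * delta ** n + (1 - x) * delta
    )


def argmax_x(n, delta, iters=200):
    """Ternary search for the maximizer of the concave function log_nw on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_nw(m1, n, delta) < log_nw(m2, n, delta):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


def value_ratio(n, delta):
    """Ratio of agent 1's value under the NW-optimal policy to its value under pi^b."""
    x = argmax_x(n, delta)
    v_nw = x * delta ** n + (1 - x) * delta  # agent 1, per step, under the NW policy
    return v_nw / delta  # agent 1 receives delta per step under pi^b
```

For instance, with `n = 4` and `delta = 1e-3` the maximizer is close to $(n - 1) / n = 0.75$ and the ratio is close to $1 / n = 0.25$ .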

Appendix C Properties of Fairness Measures

Proposition 12.

$M W$ satisfies axioms 3 and 4.

Proof.

Since minimum is an anonymous function, the objective $M W$ satisfies axiom 3. We now check axiom 4. Given a policy $π$ , let $q^{π}$ be the occupancy measure of policy $π$ . Then we can express minimum value function as $M W (π; r) = {min}_{i} V^{π} (ρ; r_{i}) = {min}_{i} \sum_{h, s, a} q_{h}^{π} (s, a) r_{i} (s, a)$ . We can write down $M W$ as a function of the occupancy measure of a policy i.e.

{MW}_{o} (q; r) = \min_{i} \sum_{h, s, a} q_{h} (s, a) r_{i} (s, a)

Since ${M W}_{o} (\cdot; r)$ is a minimum of linear functions ${M W}_{o}$ is continuous. Therefore, there exists a point on the line segment joining $q_{1}$ and $q_{3}$ that achieves the value ${M W}_{o} (q_{2}; r)$ . In particular, there exists $α \in [0, 1]$ so that

{MW}_{o} (α q_{1} + (1 - α) q_{3}; r) = {MW}_{o} (q_{2}; r) = MW (π_{2}; r) . ∎

Proposition 13.

There exists a weight vector $w \in R_{+}^{n}$ so that $G G W$ satisfies axioms 1, 3 and 4.

Proof.

Consider any weight vector $w$ so that $w_{1} > w_{2} > \dots > w_{n} > 0$ and $\sum_{j} w_{j} = 1$ . Given a policy $π$ , the ordering of the agents in terms of their value functions remains unchanged even if we permute the agents. This implies that the objective $G G W$ is anonymous for any weight vector and it satisfies axiom 3. We now check axiom 4. Given a policy $π$ , let $q^{π}$ be the occupancy measure of policy $π$ . Then we can express generalized Gini welfare as a function of the occupancy measure of a policy, i.e.

{GGW}_{o} (q; \{r_{i}\}_{i \in [n]}) = \sum_{j} w_{j} \sum_{h, s, a} q_{h} (s, a) r_{i_{j}} (s, a)

Note that ${G G W}_{o} (\cdot; {r_{i}}_{i \in [n]})$ is a weighted sum of order statistics. Since each order statistic is a continuous function of the occupancy measure $q$ , ${G G W}_{o}$ is a continuous function. Therefore, there exists a point on the line segment joining $q_{1}$ and $q_{3}$ that achieves the value ${G G W}_{o} (q_{2}; {r_{i}}_{i \in [n]})$ . In particular, there exists $α \in [0, 1]$ so that

{G G W}_{o} (α q_{1} + (1 - α) q_{3}; {r_{i}}_{i \in [n]}) = {G G W}_{o} (q_{2}; {r_{i}}_{i \in [n]})

Now any policy with state-action occupancy measure $α q_{1} + (1 - α) q_{3}$ achieves the intermediate value ${G G W}_{o} (q_{2}; {r_{i}}_{i \in [n]})$ .

In order to check axiom 1 we assume $V^{π} (r_{i}) \geq V^{\tilde{π}} (r_{i})$ for all $i$ and there exists $j$ such that $V^{π} (r_{j}) > V^{\tilde{π}} (r_{j})$ . Without loss of generality we can also assume that the agents are ordered so that $V^{\tilde{π}} (r_{1}) \geq V^{\tilde{π}} (r_{2}) \geq \dots \geq V^{\tilde{π}} (r_{n})$ . Then we have $V^{π} (r_{i_{k}}) \geq V^{π} (r_{k}) \geq V^{\tilde{π}} (r_{k})$ and $V^{π} (r_{i_{j}}) \geq V^{π} (r_{j}) > V^{\tilde{π}} (r_{j})$ . Since $w_{k} > 0$ for all $k$ , we have $GGW (π; \{r_{i}\}_{i \in [n]}) > GGW (\tilde{π}; \{r_{i}\}_{i \in [n]})$ . ∎

Proposition 14.

$N W$ satisfies Axioms 1, 2, 3, and 4.

Proof.

Since $N W$ is a monotonically increasing function of $V^{π} (r_{i})$ , axiom 1 is immediately satisfied. Moreover, the function $N W$ is a product of the values across the agents and hence anonymous. In order to check axiom 2 suppose there exist policies $π_{1}, π_{2}, {˜ π}_{1}, {˜ π}_{2}$ and reward vectors $r, ˜ r$ so that for all $i$ we have

\frac{V^{π_{1}} (ρ; r_{i})}{V^{π_{2}} (ρ; r_{i})} = \frac{V^{{˜ π}_{1}} (ρ; {˜ r}_{i})}{V^{{˜ π}_{2}} (ρ; {˜ r}_{i})}

Now suppose $N W (π_{1}; r) \geq N W (π_{2}; r)$ . This implies $\prod_{i} V^{π_{1}} (ρ; r_{i}) \geq \prod_{i} V^{π_{2}} (ρ; r_{i})$ . This gives us the following inequality.

\frac{\prod_{i} V^{{˜ π}_{1}} (ρ; {˜ r}_{i})}{\prod_{i} V^{{˜ π}_{2}} (ρ; {˜ r}_{i})} = \frac{\prod_{i} V^{π_{1}} (ρ; r_{i})}{\prod_{i} V^{π_{2}} (ρ; r_{i})} \geq 1

Therefore, we have $N W ({˜ π}_{1}; {{˜ r}_{i}}_{i \in [n]}) \geq N W ({˜ π}_{2}; {{˜ r}_{i}}_{i \in [n]})$ . The other direction of the implication can be proved analogously.

Now we check the final axiom 4. Given a policy $π$ , let $q^{π}$ be the occupancy measure of policy $π$ . Then we can express the Nash social welfare functional as $N W (π; {r_{i}}_{i \in [n]}) = \prod_{i} V^{π} (ρ; r_{i}) = \prod_{i} (\sum_{h, s, a} q_{h}^{π} (s, a) r_{i} (s, a))$ . Therefore, we can write down the Nash social welfare as a function of the occupancy measure of a policy, i.e.

{NW}_{o} (q; r) = \prod_{i} \left( \sum_{h, s, a} q_{h} (s, a) r_{i} (s, a) \right)

${N W}_{o}$ is a continuous function of the occupancy measure. We are also given that $N W_{o} (q_{1}; r) \geq {N W}_{o} (q_{2}; r) \geq {N W}_{o} (q_{3}; r)$ . Therefore, there exists a point on the line segment joining $q_{1}$ and $q_{3}$ that achieves the value ${N W}_{o} (q_{2}; r)$ . In particular, there exists $α \in [0, 1]$ so that

{NW}_{o} (α q_{1} + (1 - α) q_{3}; r) = {NW}_{o} (q_{2}; r) = NW (π_{2}; r)

Now any policy with state-action occupancy measure $α q_{1} + (1 - α) q_{3}$ achieves the intermediate value $N W (π_{2}; r)$ . ∎

Appendix D Proof of Theorem 5

We prove that Axioms 1, 2, 3, and 4 uniquely characterize the Nash Social Welfare function. Throughout we will fix the transition probability function of the MDP and let different agents pick different reward functions. We assume that there are at least four actions available at each state.

Defining Partial Order $≽$ : We first define a partial order $≽$ over the elements in $R_{+}^{n} = \{x \in R^{n} : x_{i} > 0 \ \forall i\}$ .

x ≽ y ⟺ W (π^{a}; r) \geq W (π^{b}; r)

(17)

for arbitrary $a, b \in A$ , where $π^{a}$ (resp. $π^{b}$ ) denotes a policy such that $π^{a} (s) = a$ (resp. $π^{b} (s) = b$ ), and $r$ is such that

r_{i} (s, a) = x_{i} and r_{i} (s, b) = y_{i}

(18)

for all $i \in [n]$ , and all the other entries of $r$ are arbitrary. Note that because of Axiom 2 this order $≽$ is well-defined despite the arbitrary choice of $a$ and $b$ and reward vector $r$ as we have $V^{π^{a}} (r_{i}) = x_{i} \cdot H$ and $V^{π^{b}} (r_{i}) = y_{i} \cdot H$ .

We now verify that the relation $≽$ satisfies reflexivity, symmetry, and transitivity. Reflexivity and symmetry follow immediately from the definition of the reward function in (18). For transitivity, suppose we are given $x ≽ y$ and $y ≽ z$ . We construct a reward function $r$ so that

r_{i} (s, a) = x_{i}, r_{i} (s, b) = y_{i}, and r_{i} (s, c) = z_{i}

Then we have $W (π^{a}; r) \geq W (π^{b}; r)$ and $W (π^{b}; r) \geq W (π^{c}; r)$ . This implies $W (π^{a}; r) \geq W (π^{c}; r)$ and $x ≽ z$ . Therefore, any two elements of $R_{+}^{n}$ are comparable under the relation $≽$ .

Lemma 1.

Suppose $x, y \in R_{+}^{n}$ with $x_{i} \geq y_{i}$ for all $i$ and $x_{i} > y_{i}$ for some $i$ . Then $x ≻ y$ .

Proof.

Let $r$ be such that $r_{i} (s, a) = x_{i}$ and $r_{i} (s, b) = y_{i}$ for all $i \in [n]$ . Note that we have $V^{π^{a}} (r_{i}) \geq V^{π^{b}} (r_{i})$ for all $i \in [n]$ and $V^{π^{a}} (r_{i}) > V^{π^{b}} (r_{i})$ for some individual $i$ . Therefore, by Axiom 1 we must have $W (π^{a}; r) > W (π^{b}; r)$ . By definition (17), this implies that $x ≻ y$ . ∎

Defining Function $F$ .

Let $e \in R_{+}^{n}$ be the vector with all components equal to $1$ . For any $x \in R_{+}^{n}$ , there exist $λ_{1} > λ_{0} > 0$ such that $λ_{0} e < x < λ_{1} e$ (where the comparison is component-wise). We choose a reward function $r$ such that

r_{i} (s, a) = λ_{1}, r_{i} (s, b) = λ_{0}, r_{i} (s, c) = x_{i},

and $r_{i} (s, d)$ is arbitrary. Then we have $W (π^{a}; r) > W (π^{c}; r) > W (π^{b}; r)$ . By Axiom 4 there exists $α \in [0, 1]$ such that

W (π^{c}; r) = W (π^{α q_{a} + (1 - α) q_{b}}; r)

(19)

where $q_{a} = q^{π^{a}}$ and $q_{b} = q^{π^{b}}$ . Under policy $π^{c}$ the value function of agent $i$ is $V^{π^{c}} (r_{i}) = x_{i} \cdot H$ . And the value function under the policy $π_{α} = π^{α q_{a} + (1 - α) q_{b}}$ is

	$V^{π_{α}} (r_{i})$	$= ⟨ α q_{a} + (1 - α) q_{b}, r_{i} ⟩$

		$= α λ_{1} \cdot H + (1 - α) λ_{0} \cdot H .$

Therefore, according to the definition of the order $≽$ we have $x \sim [α λ_{1} + (1 - α) λ_{0}] e$ and we define

F (x) = α λ_{1} + (1 - α) λ_{0} .

We first verify that $F$ is a well-defined function despite the arbitrary choice of $λ_{0}, λ_{1}$ , and $α$ . Indeed, suppose that there exist $\tilde{λ}_{1} > \tilde{λ}_{0} > 0$ and $\tilde{α}$ such that $\tilde{λ}_{0} e < x < \tilde{λ}_{1} e$ and $\tilde{α}$ satisfies (19) with respect to $\tilde{λ}_{0}$ and $\tilde{λ}_{1}$ . We have $[α λ_{1} + (1 - α) λ_{0}] e \sim [\tilde{α} \tilde{λ}_{1} + (1 - \tilde{α}) \tilde{λ}_{0}] e$ . Then by Lemma 1 it must be that $α λ_{1} + (1 - α) λ_{0} = \tilde{α} \tilde{λ}_{1} + (1 - \tilde{α}) \tilde{λ}_{0}$ . Hence, the value of $F (x)$ is independent of the choice of $λ_{0}, λ_{1}$ , and $α$ .

We now establish an important property of the function $F$ . The proof of this claim is similar to the proof in [KN79] and uses the following lemma from [osborne1976irrelevant] (also see Lemma 3.5 stated in [KN79]). We present our version of the proof for completeness.

Lemma 2 ([osborne1976irrelevant]).

Suppose $≽$ is reflexive, symmetric, and transitive, and $F : R_{+}^{n} \to R$ satisfies the following properties for any $x, y \in R_{+}^{n}$ :

(i) if $x ≽ y$ , then $F (x) \geq F (y)$ ; and
(ii) $F (x) \geq F (y)$ if and only if $F (δ_{1} x_{1}, \dots, δ_{n} x_{n}) \geq F (δ_{1} y_{1}, \dots, δ_{n} y_{n})$ for all $δ \in R_{+}^{n}$ .

Then there exist non-negative real constants $c_{1}, \dots, c_{n}$ and a monotone increasing function $g : R \to R$ , such that $F (x) \equiv g (\prod_{i = 1}^{n} x_{i}^{c_{i}})$ .

Lemma 3.

There exists a monotone increasing function $g$ such that $F (x) = g (\prod_{i = 1}^{n} x_{i})$ .

Proof.

We show that the function $F$ satisfies the two requirements of Lemma 2. This will imply that there exist non-negative real constants $c_{1}, \dots, c_{n}$ and a monotone increasing function $g : R \to R$ , such that $F (x) \equiv g (\prod_{i = 1}^{n} x_{i}^{c_{i}})$ . By Axiom 3, the constants $c_{1}, \dots, c_{n}$ must be identical, so the statement of the lemma follows.

Requirement (i).

First, given $x ≽ y$ , we need to show that $F (x) \geq F (y)$ . Since $x, y \in R_{+}^{n}$ there exist $λ_{0}, λ_{1} > 0$ so that $λ_{0} e < x < λ_{1} e$ and $λ_{0} e < y < λ_{1} e$ . We choose a reward function $r$ such that

r_{i} (s, a) = λ_{1}, r_{i} (s, b) = λ_{0}, r_{i} (s, c) = x_{i}, r_{i} (s, d) = y_{i}

and $r_{i} = 0$ otherwise. There exists $α$ such that

F (x) = α λ_{1} + (1 - α) λ_{0} and W (π^{c}; r) = W (π^{α q_{a} + (1 - α) q_{b}}; r) .

There also exists $β$ such that

F (y) = β λ_{1} + (1 - β) λ_{0} and W (π^{c}; r^{'}) = W (π^{β q_{a} + (1 - β) q_{b}}; r^{'}),

where $r_{i}^{'} (s, a) = λ_{1}, r_{i}^{'} (s, b) = λ_{0}, r_{i}^{'} (s, c) = y_{i}$ and other entries of $r^{'}$ are arbitrary. We have

V^{π^{d}} (r) / V^{π^{β q_{a} + (1 - β) q_{b}}} (r) = V^{π^{c}} (r^{'}) / V^{π^{β q_{a} + (1 - β) q_{b}}} (r^{'}) .

By Axiom 2, this means that $W (π^{d}; r) \geq W (π^{β q_{a} + (1 - β) q_{b}}; r)$ if and only if $W (π^{c}; r^{'}) \geq W (π^{β q_{a} + (1 - β) q_{b}}; r^{'})$ . Hence, we also get that

W (π^{d}; r) = W (π^{β q_{a} + (1 - β) q_{b}}; r) .

(20)

Since $x ≽ y$ from definition (17) we have $W (π^{c}; r) \geq W (π^{d}; r)$ . This implies that

W (π^{α q_{a} + (1 - α) q_{b}}; r) \geq W (π^{β q_{a} + (1 - β) q_{b}}; r) .

(21)

But the vector of value functions under policy $π^{α q_{a} + (1 - α) q_{b}}$ is $(α λ_{1} + (1 - α) λ_{0}) \cdot H e$ and similarly the vector of value functions under policy $π^{β q_{a} + (1 - β) q_{b}}$ is $(β λ_{1} + (1 - β) λ_{0}) \cdot H e$ . Therefore, it must be that

(α λ_{1} + (1 - α) λ_{0}) \cdot H e \geq (β λ_{1} + (1 - β) λ_{0}) \cdot H e

as otherwise we would have $(α λ_{1} + (1 - α) λ_{0}) \cdot H e < (β λ_{1} + (1 - β) λ_{0}) \cdot H e$ , which contradicts (21) given Axiom 1. It follows that $F (x) = α λ_{1} + (1 - α) λ_{0} \geq β λ_{1} + (1 - β) λ_{0} = F (y)$ .

Requirement (ii).

Next, we show that $F (x) \geq F (y)$ if and only if $F (δ_{1} x_{1}, \dots, δ_{n} x_{n}) \geq F (δ_{1} y_{1}, \dots, δ_{n} y_{n})$ for all $x, y$ and positive real numbers $δ_{i}$ for $i = 1, \dots, n$ . First, suppose $F (x) \geq F (y)$ . Define $F (x), F (y)$ and the parameters $α, β$ as we defined above. Since $F (x) \geq F (y)$ , we have

(α λ_{1} + (1 - α) λ_{0}) e \geq (β λ_{1} + (1 - β) λ_{0}) e .

Hence, $W (π^{c}; r) \geq W (π^{d}; r)$ by Axiom 1.

We now construct a new reward vector $\tilde{r}$ . Let $\underline{δ} = \min \{δ_{1}, \dots, δ_{n}\}$ and $\bar{δ} = \max \{δ_{1}, \dots, δ_{n}\}$ . Then we define the following reward vector $\tilde{r}$

	$\tilde{r}_{i} (s, a) = \bar{δ} λ_{1},$	$\tilde{r}_{i} (s, b) = \underline{δ} λ_{0},$
	$\tilde{r}_{i} (s, c) = δ_{i} x_{i},$	$\tilde{r}_{i} (s, d) = δ_{i} y_{i} .$

We now apply axiom 2 with respect to policy $π^{c}, π^{d}$ and reward vectors $r$ and $˜ r$ . Note that

\frac{V^{π^{c}} (r_{i})}{V^{π^{d}} (r_{i})} = \frac{V^{π^{c}} ({˜ r}_{i})}{V^{π^{d}} ({˜ r}_{i})} .

Since the function $W$ satisfies Axiom 2, we have $W (π^{c}; \tilde{r}) \geq W (π^{d}; \tilde{r})$ . From Definition (19) we know that there exist $\tilde{α}, \tilde{β}$ such that

W (π^{c}; ˜ r) = W (π^{˜ α q_{a} + (1 - ˜ α) q_{b}}; ˜ r)

and, similarly to (20),

W (π^{d}; ˜ r) = W (π^{˜ β q_{a} + (1 - ˜ β) q_{b}}; ˜ r) .

By Axiom 1, this implies that

(\tilde{α} \bar{δ} λ_{1} + (1 - \tilde{α}) \underline{δ} λ_{0}) e \geq (\tilde{β} \bar{δ} λ_{1} + (1 - \tilde{β}) \underline{δ} λ_{0}) e

as the two sides are the value vectors of $π^{˜ α q_{a} + (1 - ˜ α) q_{b}}$ and $π^{˜ β q_{a} + (1 - ˜ β) q_{b}}$ . It follows that

F (δ_{1} x_{1}, \dots, δ_{n} x_{n}) = \tilde{α} \bar{δ} λ_{1} + (1 - \tilde{α}) \underline{δ} λ_{0} \geq \tilde{β} \bar{δ} λ_{1} + (1 - \tilde{β}) \underline{δ} λ_{0} = F (δ_{1} y_{1}, \dots, δ_{n} y_{n}) .

This completes the proof. ∎

From Function $F$ to $N W$ .

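Suppose there are two policies $π_{1}$ and $π_{2}$ such that $W (π_{1}; r) \geq W (π_{2}; r)$ . Let us denote $q_{1} = q^{π_{1}}$ and $q_{2} = q^{π_{2}}$ . Consider two vectors $u, v \in R_{+}^{n}$ such that

u_{i} = ⟨ r_{i}, q_{1} ⟩ / H and v_{i} = ⟨ r_{i}, q_{2} ⟩ / H for all i \in [n] .

In order to determine whether $u ≽ v$ or $v ≽ u$ we define the following reward functions following (18):

\tilde{r}_{i} (s, a) = u_{i} and \tilde{r}_{i} (s, b) = v_{i} for all i \in [n] .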

Then we have $V^{π^{a}} (\tilde{r}_{i}) = ⟨ r_{i}, q_{1} ⟩ = V^{π_{1}} (r_{i})$ and $V^{π^{b}} (\tilde{r}_{i}) = ⟨ r_{i}, q_{2} ⟩ = V^{π_{2}} (r_{i})$ . By Axiom 2 it must be that $W (π^{a}; \tilde{r}) \geq W (π^{b}; \tilde{r})$ ; hence, we have $u ≽ v$ by definition. This implies that $F (u) \geq F (v)$ as $F$ satisfies the first requirement of Lemma 2. By Lemma 3 this is equivalent to:

		$\prod_{i \in [n]} u_{i} \geq \prod_{i \in [n]} v_{i}$
	$⟺$	$\sum_{i \in [n]} \log (⟨ r_{i}, q_{1} ⟩) \geq \sum_{i \in [n]} \log (⟨ r_{i}, q_{2} ⟩)$
	$⟺$	$\sum_{i \in [n]} \log (V^{π_{1}} (r_{i})) \geq \sum_{i \in [n]} \log (V^{π_{2}} (r_{i}))$
	$⟺$	$NW (π_{1}; r) \geq NW (π_{2}; r) .$

It can be also shown that $F (u) \geq F (v) ⟹ u ≽ v ⟹ W (π_{1}; r) \geq W (π_{2}; r)$ . Hence, we get that $W (π_{1}; r) \geq W (π_{2}; r) ⟺ N W (π_{1}; r) \geq N W (π_{2}; r)$ .

Appendix E Optimistic Planning

We aim to solve the following optimization problem.

\tilde{P}_{t} \in \arg\max_{P \in C_{t} (\hat{P})} \max_{π} F (π; P)

(22)

where $C_{t} (ˆ P)$ is the set of plausible transition functions at time $t$ i.e.

C_{t} (ˆ P) = {P : \forall s {∥ ∥ P (s, a, \cdot) - ˆ P (s, a, \cdot) ∥ ∥}_{1} \leq ε_{t} (s, a)}

for $ε_{t} (s, a) = \sqrt{\frac{4 S log (S A t / δ)}{max {1, N_{t} (s, a)}}}$ . We will show that the objective $˜ F (P) = {max}_{π} F (π; P)$ is concave in $P$ for various choices of $F$ , so the problem is a convex optimization. Since the feasibility over $C_{t} (ˆ P)$ can be determined efficiently, one can use any standard optimization method (e.g. the ellipsoid method) to solve the optimistic planning problem.
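The confidence set above can be made concrete with a minimal sketch. It only evaluates the radius $\varepsilon_{t}(s, a)$ and tests membership in $C_{t}(\hat P)$; the dictionary layout and the function names `confidence_radius` and `in_confidence_set` are assumptions made for illustration and are not part of the paper's algorithm.

```python
import math

def confidence_radius(S, A, t, delta, n_visits):
    """L1 radius eps_t(s, a) for a pair visited n_visits times (natural log)."""
    return math.sqrt(4 * S * math.log(S * A * t / delta) / max(1, n_visits))

def in_confidence_set(P, P_hat, counts, t, delta):
    """Check whether P lies in C_t(P_hat).

    P, P_hat: dict mapping (s, a) -> list of next-state probabilities.
    counts:   dict mapping (s, a) -> visit count N_t(s, a).
    """
    S = len(next(iter(P_hat.values())))       # number of states
    A = len({a for (_, a) in P_hat})          # number of actions
    for sa, p_hat in P_hat.items():
        l1 = sum(abs(x - y) for x, y in zip(P[sa], p_hat))
        if l1 > confidence_radius(S, A, t, delta, counts.get(sa, 0)):
            return False
    return True

# Hypothetical two-state, one-action example.
P_hat = {(0, 0): [0.5, 0.5], (1, 0): [0.9, 0.1]}
P_near = {(0, 0): [0.6, 0.4], (1, 0): [0.9, 0.1]}
P_far = {(0, 0): [1.0, 0.0], (1, 0): [0.0, 1.0]}
counts = {(0, 0): 400, (1, 0): 400}

near_ok = in_confidence_set(P_near, P_hat, counts, t=10, delta=0.1)
far_ok = in_confidence_set(P_far, P_hat, counts, t=10, delta=0.1)
```

With few visits the radius exceeds 2 and the constraint is vacuous, which is why membership only becomes informative once $N_{t}(s, a)$ grows.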

In order to show concavity of $\tilde F$, let us consider two probability transition functions $P_{1}$ and $P_{2}$. For the probability transition function $P_{i}$, let $\pi_{i}^{\star}$ be the policy attaining $\tilde F(P_{i})$, i.e. maximizing $F(\pi; P_{i})$, and let $q_{i}^{\star}$ be the corresponding state-action occupancy measure. Now consider the probability transition function $\lambda P_{1} + (1 - \lambda) P_{2}$. Note that $\lambda q_{1}^{\star} + (1 - \lambda) q_{2}^{\star}$ satisfies the Bellman-flow constraints with respect to the probability transition function $\lambda P_{1} + (1 - \lambda) P_{2}$, i.e. $\lambda q_{1}^{\star} + (1 - \lambda) q_{2}^{\star} \in Q(\rho, \lambda P_{1} + (1 - \lambda) P_{2})$. With these definitions we are now ready to show that $\tilde F(\lambda P_{1} + (1 - \lambda) P_{2}) \geq \lambda \tilde F(P_{1}) + (1 - \lambda) \tilde F(P_{2})$.

Nash Social Welfare: In this case, the optimization problem eq. 22 is equivalent to the following optimization problem.

$$\tilde P_{t} \in \arg\max_{P \in C_{t}(\hat P)} \max_{\pi} \sum_{i=1}^{n} \log V^{\pi}(r_{i}; P)$$

We now show that the objective $\widetilde{NW}(P) = \max_{\pi} \sum_{i} \log V^{\pi}(r_{i}; P)$ is concave in $P$. Let $\tilde\pi$ be the policy with state-action occupancy measure $\lambda q_{1}^{\star} + (1 - \lambda) q_{2}^{\star}$, and write $P = \lambda P_{1} + (1 - \lambda) P_{2}$. Then we have

$$\begin{aligned}
\widetilde{NW}(P) &\geq \sum_{i=1}^{n} \log V^{\tilde\pi}(r_{i}; P) \\
&= \sum_{i=1}^{n} \log\Big( \sum_{h} \sum_{s, a} \big(\lambda q_{1, h}^{\star}(s, a) + (1 - \lambda) q_{2, h}^{\star}(s, a)\big) r_{i}(s, a) \Big) \\
&\geq \lambda \sum_{i=1}^{n} \log\Big( \sum_{h} \sum_{s, a} q_{1, h}^{\star}(s, a) r_{i}(s, a) \Big) + (1 - \lambda) \sum_{i=1}^{n} \log\Big( \sum_{h} \sum_{s, a} q_{2, h}^{\star}(s, a) r_{i}(s, a) \Big) \\
&= \lambda \sum_{i=1}^{n} \log V^{\pi_{1}^{\star}}(r_{i}; P_{1}) + (1 - \lambda) \sum_{i=1}^{n} \log V^{\pi_{2}^{\star}}(r_{i}; P_{2}) \\
&= \lambda \widetilde{NW}(P_{1}) + (1 - \lambda) \widetilde{NW}(P_{2}),
\end{aligned}$$

where the second inequality follows from the concavity of the logarithm. Therefore, $\widetilde{NW}(P)$ is a concave function of the probability transition function $P$.
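The key step in the argument above is Jensen's inequality applied agent by agent to the mixed value vector. The following sketch checks that step numerically on random per-agent value vectors; the vectors `v1` and `v2` are hypothetical stand-ins for $V^{\pi_{1}^{\star}}(r_{i}; P_{1})$ and $V^{\pi_{2}^{\star}}(r_{i}; P_{2})$ and are not derived from any MDP.

```python
import math
import random

def log_welfare(values):
    """Sum of log-values, i.e. the logarithm of the Nash social welfare."""
    return sum(math.log(v) for v in values)

def mixture_gap(v1, v2, lam):
    """log-welfare of the mixture minus the mixture of log-welfares (>= 0 by concavity)."""
    mixed = [lam * a + (1 - lam) * b for a, b in zip(v1, v2)]
    return log_welfare(mixed) - (lam * log_welfare(v1) + (1 - lam) * log_welfare(v2))

random.seed(0)
gaps = []
for _ in range(1000):
    n = 4  # number of agents in this illustration
    v1 = [random.uniform(0.1, 5.0) for _ in range(n)]
    v2 = [random.uniform(0.1, 5.0) for _ in range(n)]
    lam = random.random()
    gaps.append(mixture_gap(v1, v2, lam))

min_gap = min(gaps)  # non-negative up to floating-point error
```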

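Minimum Welfare: In this case, the optimization problem eq. 22 is equivalent to the following optimization problem.

$$\tilde P_{t} \in \arg\max_{P \in C_{t}(\hat P)} \max_{\pi} \min_{i} V^{\pi}(r_{i}; P)$$

We again show that the objective $\widetilde{MW}(P) = \max_{\pi} \min_{i} V^{\pi}(r_{i}; P)$ is concave in $P$. Let $\tilde\pi$ again be the policy with state-action occupancy measure $\lambda q_{1}^{\star} + (1 - \lambda) q_{2}^{\star}$, and let $P = \lambda P_{1} + (1 - \lambda) P_{2}$. Then

$$\begin{aligned}
\widetilde{MW}(P) &\geq \min_{i} V^{\tilde\pi}(r_{i}) \\
&= \min_{i} \sum_{h} \sum_{s, a} \big(\lambda q_{1, h}^{\star}(s, a) + (1 - \lambda) q_{2, h}^{\star}(s, a)\big) r_{i}(s, a) \\
&= \min_{i} \big( \lambda V^{\pi_{1}^{\star}}(r_{i}) + (1 - \lambda) V^{\pi_{2}^{\star}}(r_{i}) \big) \\
&\geq \lambda \min_{i} V^{\pi_{1}^{\star}}(r_{i}) + (1 - \lambda) \min_{i} V^{\pi_{2}^{\star}}(r_{i}) \\
&= \lambda \widetilde{MW}(P_{1}) + (1 - \lambda) \widetilde{MW}(P_{2}).
\end{aligned}$$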

Generalized Gini Welfare: In this case, we are given a weight vector $w \in \mathbb{R}^{n}$ such that $w_{i} \geq 0$ for each $i$, $\sum_{i} w_{i} = 1$, and $w_{1} \geq w_{2} \geq \dots \geq w_{n}$. Let $i_{1}, i_{2}, \dots, i_{n}$ be an ordering of the $n$ agents so that

$$V^{\tilde\pi}(r_{i_{1}}) \leq V^{\tilde\pi}(r_{i_{2}}) \leq \dots \leq V^{\tilde\pi}(r_{i_{n}}).$$

Here $\tilde\pi$ is the policy with state-action occupancy measure $\lambda q_{1}^{\star} + (1 - \lambda) q_{2}^{\star}$, and $P = \lambda P_{1} + (1 - \lambda) P_{2}$. Our objective is

$$\widetilde{GGW}(P) = \max_{\pi} GGW(\pi; P).$$

We now show that the objective $\widetilde{GGW}(P)$ is concave in $P$.

$$\begin{aligned}
\widetilde{GGW}(P) &\geq \sum_{j} w_{j} V^{\tilde\pi}(r_{i_{j}}) \\
&= \sum_{j} w_{j} \sum_{h} \sum_{s, a} \big(\lambda q_{1, h}^{\star}(s, a) + (1 - \lambda) q_{2, h}^{\star}(s, a)\big) r_{i_{j}}(s, a) \\
&= \sum_{j} w_{j} \big( \lambda V^{\pi_{1}^{\star}}(r_{i_{j}}) + (1 - \lambda) V^{\pi_{2}^{\star}}(r_{i_{j}}) \big) \\
&= \lambda \sum_{j} w_{j} V^{\pi_{1}^{\star}}(r_{i_{j}}) + (1 - \lambda) \sum_{j} w_{j} V^{\pi_{2}^{\star}}(r_{i_{j}})
\end{aligned}$$

To lower bound the last line, we use the following observation. Let $\ell_{1}, \ell_{2}, \dots, \ell_{n}$ be an ordering of the agents so that

$$V^{\pi_{1}^{\star}}(r_{\ell_{1}}) \leq V^{\pi_{1}^{\star}}(r_{\ell_{2}}) \leq \dots \leq V^{\pi_{1}^{\star}}(r_{\ell_{n}}).$$

Since the weight vector $w$ is non-increasing in its index, the ordering $\ell_{1}, \ell_{2}, \dots, \ell_{n}$ achieves the smallest possible value of the weighted sum of the value functions, i.e.

$$\sum_{j} w_{j} V^{\pi_{1}^{\star}}(r_{i_{j}}) \geq \sum_{j} w_{j} V^{\pi_{1}^{\star}}(r_{\ell_{j}}) = GGW(\pi_{1}^{\star}; P_{1}).$$

The same argument holds for policy $\pi_{2}^{\star}$. Since policy $\pi_{1}^{\star}$ (resp. $\pi_{2}^{\star}$) maximizes generalized Gini welfare with respect to the probability transition function $P_{1}$ (resp. $P_{2}$), we have the following inequality.

$$\widetilde{GGW}(P) \geq \lambda \widetilde{GGW}(P_{1}) + (1 - \lambda) \widetilde{GGW}(P_{2}).$$
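The rearrangement step used in the generalized Gini argument can be illustrated with a small brute-force check: with non-increasing weights, pairing the weights with values sorted in ascending order gives the smallest weighted sum over all orderings of the agents. The numbers below are hypothetical and the helper names `ggw` and `weighted_sum` are introduced only for this sketch.

```python
from itertools import permutations

def ggw(values, weights):
    """Generalized Gini welfare: non-increasing weights applied to values sorted ascending."""
    return sum(w * v for w, v in zip(weights, sorted(values)))

def weighted_sum(values, weights, order):
    """Weighted sum when the j-th weight is assigned to agent order[j]."""
    return sum(w * values[i] for w, i in zip(weights, order))

values = [3.0, 1.0, 2.0]    # hypothetical per-agent values
weights = [0.5, 0.3, 0.2]   # non-negative, non-increasing, summing to one

all_sums = [weighted_sum(values, weights, order)
            for order in permutations(range(len(values)))]
smallest = min(all_sums)
gini = ggw(values, weights)
```

Here `smallest` coincides with `gini` up to floating-point error, which is the inequality $\sum_{j} w_{j} V(r_{i_{j}}) \geq \sum_{j} w_{j} V(r_{\ell_{j}})$ in this toy instance.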

Appendix F Proof of Theorem 7

Proof.

Let $P^{\star}$ be the true probability transition function. Let us define $\varepsilon_{t}(s, a) = \sqrt{\frac{4 S \log(S A t / \delta)}{N_{t}(s, a)}}$. By the Chernoff-Hoeffding type concentration inequality for categorical random variables, the following bound holds for any state $s$, action $a$, and episode $t$.

$$\Pr\Big( \| \hat P(s, a, \cdot) - P^{\star}(s, a, \cdot) \|_{1} \geq \varepsilon_{t}(s, a) \Big) \leq \frac{\delta}{2 t^{2} S A}$$

Therefore, by a union bound over the $T$ episodes and all state-action pairs, we see that with high probability the true probability transition function $P^{\star}$ is contained in the set $C_{t}(\hat P)$ for all $t$, i.e.

$$\Pr\big( \exists t : P^{\star} \notin C_{t}(\hat P) \big) \leq \delta \sum_{t} \frac{1}{2 t^{2}} \leq \delta.$$

So we assume that this event holds. At the start of episode $t$, we compute the best policy $\tilde\pi_{t}$ for the optimistic model $\tilde P_{t}$. This implies the following bound on the Nash welfare with respect to the true model $P^{\star}$.

$$\begin{aligned}
NW(\tilde\pi_{t}; \tilde P_{t}) &= \max_{\pi} NW(\pi; \tilde P_{t}) \\
&\geq \max_{P \in C_{t}(\hat P)} \max_{\pi} NW(\pi; P) \geq \max_{\pi} NW(\pi; P^{\star})
\end{aligned}$$

This allows us to upper bound the regret through the optimistic policy.

$$\begin{aligned}
\mathrm{Reg}_{NW}(T) &= \sum_{t=1}^{T} \big( NW(\pi_{NW}^{\star}) - NW(\pi_{t}) \big) \\
&\leq \sum_{t=1}^{T} \big( NW(\tilde\pi_{t}; \tilde P_{t}) - NW(\tilde\pi_{t}; P^{\star}) \big) \\
&\leq H^{n - 1} \sum_{t=1}^{T} \sum_{i \in [n]} \big| V^{\tilde\pi_{t}}(r_{i}; \tilde P_{t}) - V^{\tilde\pi_{t}}(r_{i}; P^{\star}) \big| \\
&= H^{n - 1} \sum_{i=1}^{n} \underbrace{\sum_{t=1}^{T} \big| V^{\tilde\pi_{t}}(r_{i}; \tilde P_{t}) - V^{\tilde\pi_{t}}(r_{i}; P^{\star}) \big|}_{=: \mathrm{Reg}_{i}}
\end{aligned}$$

The second inequality uses lemma 4. We now use lemma 5 to establish a bound of $\tilde O(H^{2} S \sqrt{A T})$ on the term $\mathrm{Reg}_{i}$ for any $i$. This proves the desired upper bound on the regret.

Consider any agent $i$. We can apply lemma 5 with $\epsilon_{t}(s', a, s) = | \tilde P_{t}(s', a, s) - P^{\star}(s', a, s) |$. Notice that as $\tilde P_{t}, P^{\star} \in C_{t}(\hat P)$, the triangle inequality gives the following bound.

$$\begin{aligned}
\sum_{s} \epsilon_{t}(s', a, s) &= \sum_{s} \big| \tilde P_{t}(s', a, s) - P^{\star}(s', a, s) \big| \\
&\leq \sum_{s} \big| \hat P_{t}(s', a, s) - P^{\star}(s', a, s) \big| + \sum_{s} \big| \tilde P_{t}(s', a, s) - \hat P_{t}(s', a, s) \big| \\
&\leq 2 \sqrt{\frac{4 S \log(S A t / \delta)}{N_{t}(s', a)}}
\end{aligned}$$

Therefore,

$$\begin{aligned}
\mathrm{Reg}_{i} &= \sum_{t=1}^{T} \big| V^{\tilde\pi_{t}}(r_{i}; \tilde P_{t}) - V^{\tilde\pi_{t}}(r_{i}; P^{\star}) \big| \leq \sum_{t} H^{2} \sqrt{\sum_{s', b} \| \epsilon_{t}(s', b, \cdot) \|_{1}^{2}} \\
&\leq 2 H^{2} \sum_{t} \sqrt{\sum_{s', b} \frac{4 S \log(S A t / \delta)}{N_{t}(s', b)}} \\
&\leq 4 H^{2} \sqrt{S \log(S A T / \delta)} \sum_{t} \sqrt{\sum_{s', b} \frac{1}{N_{t}(s', b)}} \\
&\leq 4 H^{2} \sqrt{S \log(S A T / \delta)} \sqrt{T} \sqrt{\sum_{t} \sum_{s', b} \frac{1}{N_{t}(s', b)}} \\
&= 4 H^{2} \sqrt{S T \log(S A T / \delta)} \sqrt{\sum_{s', b} \sum_{t=1}^{N_{T}(s', b)} \frac{1}{t}} \\
&\leq 4 H^{2} \sqrt{S T \log(S A T / \delta)} \sqrt{S A \log T}.
\end{aligned}$$

∎
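The last two steps of the bound on $\mathrm{Reg}_{i}$ rest on a harmonic-sum estimate: since the visit counts $N_{T}(s', b)$ sum to at most the number of samples and $\sum_{k=1}^{N} 1/k \leq 1 + \log N$, the double sum is $O(S A \log T)$. The sketch below checks this on simulated counts; the uniform visiting scheme and the explicit constant $1 + \log T$ are assumptions of the illustration, not statements from the proof.

```python
import math
import random

def harmonic(n):
    """H_n = sum_{k=1}^{n} 1/k, with H_0 = 0."""
    return sum(1.0 / k for k in range(1, n + 1))

def pigeonhole_sum(counts):
    """sum over state-action pairs of sum_{k=1}^{N(s,a)} 1/k."""
    return sum(harmonic(c) for c in counts)

random.seed(1)
S, A, T = 5, 3, 2000
counts = [0] * (S * A)
for _ in range(T):
    # Hypothetical visiting pattern: one uniformly random pair per step.
    counts[random.randrange(S * A)] += 1

lhs = pigeonhole_sum(counts)
rhs = S * A * (1 + math.log(T))
```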

Lemma 4.

For any policy $\pi$ and probability transition functions $P_{1}$ and $P_{2}$, we have

$$NW(\pi; P_{1}) - NW(\pi; P_{2}) \leq H^{n - 1} \sum_{i=1}^{n} \big| V^{\pi}(r_{i}; P_{1}) - V^{\pi}(r_{i}; P_{2}) \big|.$$

Proof.

We use induction on the number of agents $n$. When $n = 1$ this is trivially true. So we assume the claim holds for $n = m$ and consider $n = m + 1$.

$$\begin{aligned}
& NW(\pi; P_{1}) - NW(\pi; P_{2}) \\
&= \prod_{i=1}^{m+1} V^{\pi}(r_{i}; P_{1}) - \prod_{i=1}^{m+1} V^{\pi}(r_{i}; P_{2}) \\
&= \prod_{i=1}^{m} V^{\pi}(r_{i}; P_{1}) \cdot V^{\pi}(r_{m+1}; P_{1}) - \prod_{i=1}^{m} V^{\pi}(r_{i}; P_{1}) \cdot V^{\pi}(r_{m+1}; P_{2}) \\
&\quad + \prod_{i=1}^{m} V^{\pi}(r_{i}; P_{1}) \cdot V^{\pi}(r_{m+1}; P_{2}) - \prod_{i=1}^{m} V^{\pi}(r_{i}; P_{2}) \cdot V^{\pi}(r_{m+1}; P_{2}) \\
&\leq \prod_{i=1}^{m} V^{\pi}(r_{i}; P_{1}) \, \big| V^{\pi}(r_{m+1}; P_{1}) - V^{\pi}(r_{m+1}; P_{2}) \big| + V^{\pi}(r_{m+1}; P_{2}) \Big( \prod_{i=1}^{m} V^{\pi}(r_{i}; P_{1}) - \prod_{i=1}^{m} V^{\pi}(r_{i}; P_{2}) \Big) \\
&\leq H^{m} \big| V^{\pi}(r_{m+1}; P_{1}) - V^{\pi}(r_{m+1}; P_{2}) \big| + H \cdot H^{m-1} \sum_{i=1}^{m} \big| V^{\pi}(r_{i}; P_{1}) - V^{\pi}(r_{i}; P_{2}) \big| \\
&\qquad \text{[by the induction hypothesis and the fact that } V^{\pi}(r_{i}; P) \leq H \text{ for any } i \text{ and } P\text{]} \\
&\leq H^{m} \sum_{i=1}^{m+1} \big| V^{\pi}(r_{i}; P_{1}) - V^{\pi}(r_{i}; P_{2}) \big|.
\end{aligned}$$

∎
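Lemma 4 is a statement about products of numbers in $[0, H]$, so it can be sanity-checked without any MDP. The sketch below draws random value vectors, treats them as hypothetical per-agent values under $P_{1}$ and $P_{2}$, and records the smallest slack between the two sides of the lemma.

```python
import math
import random

def nash_product(values):
    """Nash welfare in product form, as used in Lemma 4."""
    return math.prod(values)

def lemma4_rhs(v1, v2, H):
    """H^(n-1) * sum_i |v1_i - v2_i|."""
    n = len(v1)
    return H ** (n - 1) * sum(abs(a - b) for a, b in zip(v1, v2))

random.seed(2)
H = 5
worst_slack = float("inf")
for _ in range(2000):
    n = random.randint(1, 5)
    v1 = [random.uniform(0, H) for _ in range(n)]  # stand-in for V(r_i; P_1)
    v2 = [random.uniform(0, H) for _ in range(n)]  # stand-in for V(r_i; P_2)
    slack = lemma4_rhs(v1, v2, H) - (nash_product(v1) - nash_product(v2))
    worst_slack = min(worst_slack, slack)
```

A non-negative `worst_slack` (up to rounding) is consistent with the lemma; the factor $H^{n-1}$ is the source of the exponential dependence on $n$ in the regret bound.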

Lemma 5.

Let $\epsilon(s', a, s) = | P(s', a, s) - \tilde P(s', a, s) |$ for all tuples $(s', a, s)$. Then for any policy $\tilde\pi$ and any reward function $r$ bounded between $0$ and $1$ we have

$$\big| V^{\tilde\pi}(r; \tilde P) - V^{\tilde\pi}(r; P) \big| \leq H^{2} \sqrt{\sum_{s', b} \| \epsilon(s', b, \cdot) \|_{1}^{2}}.$$

Proof.

We will use $\tilde q$ to denote the state-action occupancy measure under policy $\tilde\pi$ and probability transition function $\tilde P$, i.e. $\tilde q_{h}(s, a) = \Pr_{\tilde\pi, \tilde P}(s_{h} = s, a_{h} = a)$. Similarly, we will use $q$ to denote the state-action occupancy measure under policy $\tilde\pi$ and probability transition function $P$. Recall that $\epsilon(s', b, s) = | \tilde P(s', b, s) - P(s', b, s) |$.

Since rewards are bounded between $0$ and $1$ we have the following inequality.

\begin{align}
\big| V^{\tilde\pi}(r; \tilde P) - V^{\tilde\pi}(r; P) \big| &= \Big| \sum_{h=1}^{H} \sum_{s, a} \tilde q_{h}(s, a) r(s, a) - \sum_{h=1}^{H} \sum_{s, a} q_{h}(s, a) r(s, a) \Big| \tag{23} \\
&\leq \sum_{h=1}^{H} \underbrace{\sum_{s, a} \big| \tilde q_{h}(s, a) - q_{h}(s, a) \big|}_{=: \Delta_{h}} \tag{24}
\end{align}

We now establish a recurrence relation for the term $\Delta_{h} = \sum_{s, a} | \tilde q_{h}(s, a) - q_{h}(s, a) |$. First note that $\Delta_{1} = 0$ since

$$\Delta_{1} = \sum_{s, a} \big| \tilde q_{1}(s, a) - q_{1}(s, a) \big| = \sum_{s, a} \big| \rho(s) \tilde\pi(a \mid s) - \rho(s) \tilde\pi(a \mid s) \big| = 0.$$

Here we use the equality constraint $\sum_{a} q_{1}(s, a) = \rho(s) = \sum_{a} \tilde q_{1}(s, a)$.

For $h \geq 2$ we have

$$\begin{aligned}
\Delta_{h} &= \sum_{s, a} \big| \tilde q_{h}(s, a) - q_{h}(s, a) \big| \\
&= \sum_{s, a} \Big| \sum_{b} \tilde q_{h}(s, b) \tilde\pi(a \mid s) - \sum_{b} q_{h}(s, b) \tilde\pi(a \mid s) \Big| \\
&\leq \sum_{s, a} \tilde\pi(a \mid s) \Big| \sum_{b} \tilde q_{h}(s, b) - \sum_{b} q_{h}(s, b) \Big| \\
&\leq \sum_{s} \Big| \sum_{s', b} \tilde q_{h-1}(s', b) \tilde P(s', b, s) - \sum_{s', b} q_{h-1}(s', b) P(s', b, s) \Big| \\
&\leq \sum_{s} \sum_{s', b} \big| \tilde q_{h-1}(s', b) - q_{h-1}(s', b) \big| \tilde P(s', b, s) + \sum_{s} \sum_{s', b} q_{h-1}(s', b) \epsilon(s', b, s) \\
&\leq \sum_{s', b} \big| \tilde q_{h-1}(s', b) - q_{h-1}(s', b) \big| + \sum_{s', b} q_{h-1}(s', b) \sum_{s} \epsilon(s', b, s) \\
&\leq \Delta_{h-1} + \sqrt{\sum_{s', b} q_{h-1}(s', b)^{2}} \sqrt{\sum_{s', b} \Big( \sum_{s} \epsilon(s', b, s) \Big)^{2}} \\
&\leq \Delta_{h-1} + \sum_{s', b} q_{h-1}(s', b) \sqrt{\sum_{s', b} \| \epsilon(s', b, \cdot) \|_{1}^{2}} \\
&\leq \Delta_{h-1} + \underbrace{\sqrt{\sum_{s', b} \| \epsilon(s', b, \cdot) \|_{1}^{2}}}_{=: \varepsilon}
\end{aligned}$$

The above recurrence relation and $\Delta_{1} = 0$ give us $\Delta_{h} \leq (h - 1) \varepsilon$. Substituting this result in equation 24 we get the following bound.

$$\big| V^{\tilde\pi}(r; \tilde P) - V^{\tilde\pi}(r; P) \big| \leq \sum_{h=1}^{H} (h - 1) \varepsilon \leq H^{2} \varepsilon.$$

∎
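The recurrence behind Lemma 5 can be checked on a small random instance. The sketch below computes the occupancy measures of one fixed policy under a kernel $P$ and a perturbed kernel $\tilde P$, then verifies $\Delta_{h} \leq (h - 1) \varepsilon$ and the resulting value gap. The instance sizes, the uniform policy, and the way $\tilde P$ is generated are all assumptions of the illustration.

```python
import math
import random

def random_kernel(S, A, rng):
    """Random transition kernel: P[s][a] is a distribution over next states."""
    kernel = []
    for _ in range(S):
        rows = []
        for _ in range(A):
            w = [rng.random() + 1e-3 for _ in range(S)]
            total = sum(w)
            rows.append([x / total for x in w])
        kernel.append(rows)
    return kernel

def occupancy(P, policy, rho, H):
    """Return [q_1, ..., q_H], where q_h[s][a] = Pr(s_h = s, a_h = a)."""
    S, A = len(P), len(P[0])
    state_dist = list(rho)
    measures = []
    for _ in range(H):
        q = [[state_dist[s] * policy[s][a] for a in range(A)] for s in range(S)]
        measures.append(q)
        state_dist = [sum(q[s][a] * P[s][a][s2] for s in range(S) for a in range(A))
                      for s2 in range(S)]
    return measures

def l1_gap(q1, q2):
    return sum(abs(x - y) for row1, row2 in zip(q1, q2) for x, y in zip(row1, row2))

rng = random.Random(3)
S, A, H = 4, 2, 6
P = random_kernel(S, A, rng)
noise = random_kernel(S, A, rng)
# P_tilde is a small perturbation of P and is still a valid kernel.
P_tilde = [[[0.9 * P[s][a][k] + 0.1 * noise[s][a][k] for k in range(S)]
            for a in range(A)] for s in range(S)]
policy = [[1.0 / A] * A for _ in range(S)]
rho = [1.0 / S] * S
reward = [[rng.random() for _ in range(A)] for _ in range(S)]  # values in [0, 1]

# eps = sqrt( sum_{s',b} || P_tilde(s',b,.) - P(s',b,.) ||_1^2 )
eps = math.sqrt(sum(sum(abs(P_tilde[s][a][k] - P[s][a][k]) for k in range(S)) ** 2
                    for s in range(S) for a in range(A)))

q = occupancy(P, policy, rho, H)
q_tilde = occupancy(P_tilde, policy, rho, H)
deltas = [l1_gap(qt, qh) for qt, qh in zip(q_tilde, q)]   # deltas[h] is Delta_{h+1}

def value(measures):
    return sum(m[s][a] * reward[s][a] for m in measures
               for s in range(S) for a in range(A))

value_gap = abs(value(q_tilde) - value(q))
recurrence_ok = all(deltas[h] <= h * eps + 1e-12 for h in range(H))
```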

Appendix G Unknown Reward Functions

When the reward function is unknown, we update the optimistic planning step as follows:

$$\begin{aligned}
(\tilde P_{t}, \tilde r_{t}) &\leftarrow \arg\max_{P \in C_{t}(\hat P),\, r \in D_{t}(\hat r)} \max_{\pi} F(\pi; P, r) \\
\tilde\pi_{t} &\leftarrow \arg\max_{\pi} F(\pi; \tilde P_{t}, \tilde r_{t})
\end{aligned}$$

Here $D_{t}(\hat r)$ is a confidence set around the empirical reward function $\hat r$, and it can be constructed by using the Chernoff-Hoeffding inequality and the union bound. Here we consider the case $F = NW$. The proofs for $MW$ and $GGW$ are similar. As in the proof of theorem 7, we can upper bound regret as

$$\begin{aligned}
\mathrm{Reg}_{NW}(T) &\leq H^{n - 1} \sum_{t=1}^{T} \sum_{i \in [n]} \big| V^{\tilde\pi_{t}}(\tilde r_{t, i}; \tilde P_{t}) - V^{\tilde\pi_{t}}(r_{i}; P^{\star}) \big| \\
&\leq H^{n - 1} \sum_{i \in [n]} \underbrace{\sum_{t=1}^{T} \big| V^{\tilde\pi_{t}}(\tilde r_{t, i}; \tilde P_{t}) - V^{\tilde\pi_{t}}(r_{i}; \tilde P_{t}) \big|}_{=: \mathrm{Reg}_{i}^{r}} + H^{n - 1} \sum_{i \in [n]} \underbrace{\sum_{t=1}^{T} \big| V^{\tilde\pi_{t}}(r_{i}; \tilde P_{t}) - V^{\tilde\pi_{t}}(r_{i}; P^{\star}) \big|}_{=: \mathrm{Reg}_{i}^{P}}
\end{aligned}$$

The second term $\mathrm{Reg}_{i}^{P}$ exactly equals the term $\mathrm{Reg}_{i}$ introduced in the proof of theorem 7 and is bounded by $O(H^{2} S \sqrt{A T})$. The first term $\mathrm{Reg}_{i}^{r}$ equals the difference in value functions between the optimistic reward function $\tilde r_{t, i}$ and the true reward function $r_{i}$, but with respect to the fixed probability transition function $\tilde P_{t}$. Let $\varepsilon_{r}(s, a) = | \tilde r_{t, i}(s, a) - r_{i}(s, a) |$ and let $\tilde q_{t}$ be the state-action occupancy measure under policy $\tilde\pi_{t}$ and probability transition function $\tilde P_{t}$. Then we have,

{Reg}_{i}^{r}

Now, by an analysis very similar to bounding the error term $\varepsilon = \sqrt{\sum_{s', b} \| \epsilon(s', b, \cdot) \|_{1}^{2}}$, one can show that the term $\| \varepsilon_{r} \|_{1}$ is bounded by $O(S \sqrt{A T})$. This implies that the term $\mathrm{Reg}_{i}^{P}$ dominates the term $\mathrm{Reg}_{i}^{r}$, and the regret is still bounded by $O(n H^{n + 1} S \sqrt{A T})$.

Appendix H Lower Bound on $\mathrm{Reg}_{NW}(T)$

Figure 2: Lower Bound Instance (following [LS20])

We consider a collection of MDPs, where the state space of each MDP consists of a good state ($s_{g}$) and a bad state ($s_{b}$). The remaining $S - 2$ states are arranged in an $A$-ary tree. The transitions within the $A$-ary tree are deterministic. Let $L$ be the set of leaves in the $A$-ary tree. Then for the $\ell$-th leaf node and the $a$-th action we consider the following transition probabilities.

$$P(s_{g} \mid \ell, a) = \frac{1}{2} + \Delta, \qquad P(s_{b} \mid \ell, a) = \frac{1}{2} - \Delta$$

In the $(\ell, a)$-th MDP, all the other actions from the leaf nodes transition to the good or the bad state with equal probability. From the good or the bad state, taking any action maintains the current state with probability $1 - \delta$ and transitions to $s_{0}$ with probability $\delta$. The rewards are $1$ for all $n$ agents at state $s_{g}$. All other rewards are zero. We choose $\delta = \frac{1}{2 H}$.

H.1 Proof of Theorem 10

Proof.

We will write $M_{0}$ to denote the MDP where $P(s_{g} \mid \ell, a) = P(s_{b} \mid \ell, a) = \frac{1}{2}$ for all $\ell$ and $a$. Let us define the following stopping time $\tau$.

$$\tau = T \land \min\Big\{ t : \sum_{u=1}^{t} \sum_{h=1}^{H} \mathbb{1}\{ s_{u, h} = s_{g} \} \geq T - 1 \Big\} \tag{25}$$

That is, $\tau$ is the first episode by which the number of visits to state $s_{g}$ is at least $T - 1$. If this episode number is larger than $T$, we set $\tau = T$.

Let $T_{\ell, a}$ be the total number of times the policy visits state $\ell$ and takes action $a$ until the stopping time $\tau$. We first show that $E_{0}[\sum_{\ell, a} T_{\ell, a}] = \Theta(T)$, where $E_{0}[\cdot]$ denotes expectation with respect to the MDP $M_{0}$. Note that a visit to one of the leaf nodes is followed by either a visit to the good node $s_{g}$ or a visit to the bad node $s_{b}$. Since the number of visits to $s_{g}$ is $T - 1$, we just need to bound the number of visits to the state $s_{b}$.

Let $T_{b}$ be the total number of visits to the bad state starting from the node $s_{0}$. We write $T_{b} = T_{b}^{1} + T_{b}^{2}$, where $T_{b}^{1}$ is the number of visits to state $s_{b}$ that were followed by an episode reset before visiting the starting state $s_{0}$, and $T_{b}^{2}$ is the number of visits to state $s_{b}$ that were not interrupted by an episode reset before visiting state $s_{0}$. Since there are $T$ episodes, we have $T_{b}^{1} \leq T$.

In order to bound $T_{b}^{2}$ we consider the total amount of time the policy stays at state $s_{b}$ before visiting $s_{0}$. This is a geometric random variable with parameter $1 / (2 H)$. Therefore, if we write $Y_{b}$ to denote the total amount of time the policy stays at $s_{b}$ because of the visits counted in $T_{b}^{2}$, then $Y_{b}$ is a sum of i.i.d. geometric random variables, and by a standard concentration inequality [DP09, brown2011] we get

$$\Pr\Big( Y_{b} < \frac{1}{2} \cdot T_{b}^{2} \cdot \frac{H}{2} \Big) \leq e^{- T_{b}^{2} / 6}.$$

Therefore, as long as $T_{b}^{2} \geq 6 \log(1 / \delta)$, the total time spent at state $s_{b}$ is at least $T_{b}^{2} H / 4$ with probability at least $1 - \delta$. Since the total number of time steps is exactly $T H$, it must be that $T_{b}^{2} \leq 4 T$ in this case. Therefore, either $T_{b}^{2}$ is less than $6 \log(1 / \delta)$, or it is less than $4 T$ with probability at least $1 - \delta$. Combining these two cases we get the following upper bound on the expected value of $T_{b}^{2}$.

$$E[T_{b}^{2}] \leq 6 \log(1 / \delta) + (1 - \delta) 4 T + \delta T H$$

This bound holds for any choice of $\delta$. In particular, for $\delta = 1 / (4 H)$ and $T = \Omega(\log H)$ we get that $E[T_{b}^{2}] \leq 5 T$.
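To see how the choice $\delta = 1 / (4 H)$ yields the constant $5$, one can evaluate the right-hand side directly: it becomes $6 \log(4 H) + 4 T - T / H + T / 4$, which is at most $5 T$ once $6 \log(4 H) \leq 3 T / 4$. The sketch below checks this for a few horizons; the explicit threshold $T \geq 8 \log(4 H)$ and the use of the natural logarithm are assumptions of this illustration, one sufficient instantiation of the $\Omega(\log H)$ condition.

```python
import math

def expected_bad_visits_bound(T, H):
    """6 log(1/d) + (1 - d) 4T + d T H evaluated at d = 1/(4H)."""
    d = 1.0 / (4 * H)
    return 6 * math.log(1 / d) + (1 - d) * 4 * T + d * T * H

def threshold(H):
    """Smallest integer T with T >= 8 log(4H) (a sufficient condition, not a tight one)."""
    return math.ceil(8 * math.log(4 * H))

checks = []
for H in [1, 2, 10, 100, 10_000]:
    T0 = threshold(H)
    for T in [T0, 2 * T0, 100 * T0]:
        checks.append(expected_bad_visits_bound(T, H) <= 5 * T)

all_hold = all(checks)
```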

Therefore, we have $E[T_{b}] = E[T_{b}^{1}] + E[T_{b}^{2}] \leq 6 T$. As $E_{0}[\sum_{\ell, a} T_{\ell, a}]$ is bounded by the total number of visits to node $s_{g}$ and node $s_{b}$, we have the following bound.

$$T - 1 \leq E_{0}\Big[ \sum_{\ell, a} T_{\ell, a} \Big] \leq 7 T - 1$$

By a very similar argument we can also establish a similar bound for MDP $(\ell, a)$.

Lower Bound on Regret for Model $P_{\ell, a}$: In this model the optimal policy is to navigate to the $\ell$-th leaf node and then take action $a$. Let $W$ be a geometric random variable with parameter $p = \frac{1}{2 H}$. Then the value function of an agent $i$ is at least

$$\Big( \frac{1}{2} + \Delta \Big) E[\min\{W, H - \log S\}].$$

We now lower bound the term $E[\min\{W, H - \log S\}]$.

$$\begin{aligned}
E[\min\{W, H - \log S\}] &= \sum_{x=1}^{H - \log S} x (1 - p)^{x - 1} p + \sum_{x > H - \log S} (H - \log S) (1 - p)^{x - 1} p \\
&= \frac{1}{p} - \frac{1}{p} (1 - p)^{H - \log S} \\
&= 2 H - 2 H \Big( 1 - \frac{1}{2 H} \Big)^{H - \log S} \\
&\geq 2 H - 2 H \cdot e^{- \frac{1}{2} + \frac{\log S}{2 H}} = \Theta(H)
\end{aligned}$$

as long as $H = \Omega(\log S)$. Therefore, the total expected Nash welfare over the $T$ episodes is at least

$$\Theta\Big( T \Big( \frac{1}{2} + \Delta \Big)^{n} H^{n} \Big) = \Theta\Big( \Big( \frac{1}{2} + \Delta \Big)^{n} H^{n} E_{\ell, a}\Big[ \sum_{\ell', a'} T_{\ell', a'} \Big] \Big).$$
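The $\Theta(H)$ scaling of $E[\min\{W, H - \log S\}]$ used above can be checked numerically. For a geometric variable with success probability $p$ and an integer cap $m$, the standard identity $E[\min\{W, m\}] = (1 - (1 - p)^{m}) / p$ holds; the sketch compares it with the direct sum and looks at the ratio to $H$. Treating $H - \log S$ as an integer $m = H - 5$ is an assumption made for the illustration.

```python
def truncated_mean_direct(p, m):
    """E[min(W, m)] for W ~ Geometric(p) on {1, 2, ...}, summed term by term."""
    head = sum(x * (1 - p) ** (x - 1) * p for x in range(1, m + 1))
    tail = m * (1 - p) ** m          # Pr(W > m) = (1 - p)^m
    return head + tail

def truncated_mean_closed_form(p, m):
    """Closed form (1 - (1 - p)^m) / p of the same expectation."""
    return (1 - (1 - p) ** m) / p

ratios = []
max_error = 0.0
for H in [20, 50, 200, 1000]:
    p = 1.0 / (2 * H)
    m = H - 5                        # hypothetical integer value of H - log S
    direct = truncated_mean_direct(p, m)
    closed = truncated_mean_closed_form(p, m)
    max_error = max(max_error, abs(direct - closed))
    ratios.append(closed / H)        # stays bounded away from 0: Theta(H)
```

As $H$ grows with $\log S$ fixed, the ratio approaches $2 (1 - e^{-1/2})$, roughly $0.79$.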

Let $X_{\ell', a'}^{t}$ be the indicator variable that denotes whether the policy visits leaf node $\ell'$ and takes action $a'$ at episode $t$. If $X_{\ell, a}^{t} = 1$ then, by an argument very similar to the one above, we can show that the expected Nash welfare at episode $t$ is at most $O((1 / 2 + \Delta)^{n} H^{n})$. On the other hand, if $X_{\ell, a}^{t} = 0$ then the expected Nash welfare at episode $t$ is $O((1 / 2)^{n} H^{n})$. Therefore, the sum of expected Nash welfare over the $T$ episodes is

$$\begin{aligned}
& \sum_{t} \sum_{(\ell', a') \neq (\ell, a)} E_{\ell, a}[X_{\ell', a'}^{t}] \Big( \frac{H}{2} \Big)^{n} + \sum_{t} E_{\ell, a}[X_{\ell, a}^{t}] \Big( \frac{1}{2} + \Delta \Big)^{n} H^{n} \\
&= \Big( \frac{H}{2} \Big)^{n} \sum_{\ell', a'} E_{\ell, a}[T_{\ell', a'}] + E_{\ell, a}[T_{\ell, a}] \Big( \Big( \frac{1}{2} + \Delta \Big)^{n} - \Big( \frac{1}{2} \Big)^{n} \Big) H^{n}
\end{aligned}$$

Let $T_{\sigma} = \sum_{\ell', a'} T_{\ell', a'}$. Then regret on model $P_{\ell, a}$ is at least

$$R_{\ell, a} \geq \Big( \Big( \frac{1}{2} + \Delta \Big)^{n} - \Big( \frac{1}{2} \Big)^{n} \Big) H^{n} E_{\ell, a}[T_{\sigma} - T_{\ell, a}] \geq n \Delta \Big( \frac{H}{2} \Big)^{n} E_{\ell, a}[T_{\sigma} - T_{\ell, a}].$$
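The second inequality in the regret bound uses $(1/2 + \Delta)^{n} - (1/2)^{n} \geq n \Delta (1/2)^{n}$, which follows from the binomial expansion, since the first-order term alone is $n \Delta (1/2)^{n - 1}$. A short grid check of this elementary inequality is sketched below; the grid of $n$ and $\Delta$ values is arbitrary.

```python
def welfare_gap(n, delta):
    """(1/2 + delta)^n - (1/2)^n: per-episode welfare gap in units of H^n."""
    return (0.5 + delta) ** n - 0.5 ** n

def linear_lower_bound(n, delta):
    """n * delta * (1/2)^n, the lower bound used in the regret argument."""
    return n * delta * 0.5 ** n

violations = [
    (n, d)
    for n in range(1, 21)
    for d in [k / 100 for k in range(1, 51)]   # delta in (0, 0.5]
    if welfare_gap(n, d) < linear_lower_bound(n, d)
]
```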

Now we can proceed very similarly to the proof of theorem 38.7 from [LS20] and establish that there exists some $\ell', a'$ so that $E_{\ell', a'}[T_{\sigma} - T_{\ell', a'}] = \Omega(T)$ for $\Delta = \Omega(\sqrt{S A / T})$. This choice of $\Delta$ implies the following lower bound.

$$R_{\ell', a'} \geq \Omega\Big( n \Big( \frac{H}{2} \Big)^{n} \sqrt{S A T} \Big).$$

∎

Appendix I Proof of Theorem 11

Proof.

We first assume that $v^{\star}$ equals the maximin value $v_{MW}^{\star} = \max_{\pi} \min_{i} V^{\pi}(r_{i})$. We will assume $B \geq 1$. Notice that UOB-REPS is run with reward function

$$\tilde r_{t}(s, a) = \sum_{i} \lambda_{i}^{t} r_{i}(s, a) + \frac{v^{\star}}{H} \Big( 1 - \sum_{i} \lambda_{i}^{t} \Big)$$

at time $t$. Since each entry of the reward function $r_{i}$ is bounded by $1$, we have $| \tilde r_{t}(s, a) | = O(B)$. Therefore, from the regret guarantee of UOB-REPS we have

\begin{align}
& \max_{q} \sum_{t=1}^{T} L(q, \lambda^{t}) - \sum_{t=1}^{T} L(q^{t}, \lambda^{t}) \leq \tilde O(B H S \sqrt{A T}) \notag \\
\Rightarrow\ & \max_{q} T \cdot L(q, \bar\lambda) - \sum_{t=1}^{T} L(q^{t}, \lambda^{t}) \leq \tilde O(B H S \sqrt{A T}) \tag{26}
\end{align}

where $\bar\lambda = \frac{1}{T} \sum_{t=1}^{T} \lambda^{t}$. On the other hand, from the regret guarantee of OSMD we have

\begin{align}
& \sum_{t=1}^{T} L(q^{t}, \lambda^{t}) - \min_{\lambda \in C} \sum_{t=1}^{T} L(q^{t}, \lambda) \leq O(B H \sqrt{n T \log A}) \notag \\
\Rightarrow\ & \sum_{t=1}^{T} L(q^{t}, \lambda^{t}) - \min_{\lambda \in C} T \cdot L(\bar q, \lambda) \leq O(B H \sqrt{n T \log A}) \tag{27}
\end{align}

where $\bar q$ is defined as $\bar q_{h}(s, a) = \frac{1}{T} \sum_{t=1}^{T} q_{h}^{t}(s, a)$. Let $i^{\star} \in \arg\min_{i \in [n]} \sum_{t=1}^{T} V^{\pi_{t}}(r_{i})$. Since $E[V^{\pi_{t}}(r_{i})] = \sum_{h} \sum_{s, a} q_{h}^{t}(s, a) r_{i}(s, a)$, an application of the Chernoff bound implies that the following inequality holds with probability at least $1 - \delta$.

$$\Big| \sum_{t=1}^{T} \sum_{h} \sum_{s, a} q_{h}^{t}(s, a) r_{i^{\star}}(s, a) - \sum_{t=1}^{T} V^{\pi_{t}}(r_{i^{\star}}) \Big| \leq O\big( H \sqrt{T \log(n / \delta)} \big)$$

Let $e_{i^{\star}}$ be the unit vector with $1$ at index $i^{\star}$ and $0$ elsewhere. Then from eq. 27 we get the following bound.

$$\begin{aligned}
T \cdot L(\bar q, B e_{i^{\star}}) &\geq \min_{\lambda \in C} T \cdot L(\bar q, \lambda) \\
&\geq \sum_{t=1}^{T} L(q^{t}, \lambda^{t}) - O(B H \sqrt{n T \log A}) \\
&\geq \max_{q} T \cdot L(q, \bar\lambda) - \tilde O(B H S \sqrt{A T}) - O(B H \sqrt{n T \log A}) \qquad \text{[by (26)]} \\
&= T \cdot v^{\star} \Big( 1 - \sum_{i} \bar\lambda_{i} \Big) + T \cdot \max_{q} \sum_{i} \bar\lambda_{i} \sum_{h} \sum_{s, a} q_{h}(s, a) r_{i}(s, a) - \tilde O(B H S \sqrt{A T}) - O(B H \sqrt{n T \log A}) \\
&\geq T \cdot v^{\star} - \tilde O(B H S \sqrt{A T}) - O(B H \sqrt{n T \log A})
\end{aligned}$$

The last step holds because the occupancy measure of a maximin policy guarantees every agent a value of at least $v_{MW}^{\star} = v^{\star}$.

We can rewrite $T \cdot L(\bar q, B e_{i^{\star}})$ as

\begin{align}
& T \cdot \Big( v^{\star} + B \Big( \sum_{h=1}^{H} \sum_{s, a} \bar q_{h}(s, a) r_{i^{\star}}(s, a) - v^{\star} \Big) \Big) \notag \\
&= T v^{\star} (1 - B) + B \sum_{t=1}^{T} \sum_{h=1}^{H} \sum_{s, a} q_{h}^{t}(s, a) r_{i^{\star}}(s, a) \tag{28}
\end{align}

Combining this equality with the lower bound above, the concentration inequality, and the choice of $i^{\star}$ gives us the following bound.

$$- T v^{\star} B + B \min_{i} \sum_{t=1}^{T} V^{\pi_{t}}(r_{i}) \geq - \tilde O(B H S \sqrt{A T}) - O(B H \sqrt{n T \log A})$$

After rearranging and dividing throughout by $B$, we get the following bound on regret.

$$T v^{\star} - \min_{i} \sum_{t=1}^{T} V^{\pi_{t}}(r_{i}) \leq \tilde O(H S \sqrt{A T}) + O(H \sqrt{n T \log A})$$

Now consider the case when $v^{\star} \neq v_{MW}^{\star}$. Pick any $v^{\star} \geq v_{MW}^{\star}$ and run algorithm 2. By a similar argument as above, we can establish the following bound.

$$T \cdot L(\bar q, B e_{i^{\star}}) \geq T \cdot v^{\star} \Big( 1 - \sum_{i} \bar\lambda_{i} \Big) + T \cdot v_{MW}^{\star} \sum_{i} \bar\lambda_{i} - \tilde O(B H S \sqrt{A T}) - O(B H \sqrt{n T \log A})$$

Using the expression established in (28) we get the following bound.

$$\begin{aligned}
T v^{\star} (1 - B) + B \min_{i} \sum_{t=1}^{T} V^{\pi_{t}}(r_{i}) &\geq T \cdot v^{\star} \Big( 1 - \sum_{i} \bar\lambda_{i} \Big) \\
&\quad + T \cdot v_{MW}^{\star} \sum_{i} \bar\lambda_{i} - \tilde O(B H S \sqrt{A T}) - O(B H \sqrt{n T \log A})
\end{aligned}$$

After rearranging and dividing throughout by $B$, we get the following inequality.

$$T v^{\star} - \min_{i} \sum_{t=1}^{T} V^{\pi_{t}}(r_{i}) \leq \frac{T}{B} \sum_{i} \bar\lambda_{i} (v^{\star} - v_{MW}^{\star}) + \tilde O(H S \sqrt{A T}) + O(H \sqrt{n T \log A})$$

This gives us the following bound on regret.

$$\begin{aligned}
& T v_{MW}^{\star} - \min_{i} \sum_{t=1}^{T} V^{\pi_{t}}(r_{i}) \\
&\leq T v^{\star} - \min_{i} \sum_{t=1}^{T} V^{\pi_{t}}(r_{i}) + T (v_{MW}^{\star} - v^{\star}) \\
&\leq T (v_{MW}^{\star} - v^{\star}) \Big( 1 - \frac{\sum_{i} \bar\lambda_{i}}{B} \Big) + \tilde O(H S \sqrt{A T}) + O(H \sqrt{n T \log A}) \\
&\leq \tilde O(H S \sqrt{A T}) + O(H \sqrt{n T \log A})
\end{aligned}$$

The last inequality follows because $v_{MW}^{\star} \leq v^{\star}$ and $\sum_{i} \bar\lambda_{i} \leq B$. ∎
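As a sanity check on the scalarization used in this proof, the sketch below verifies that the reward $\tilde r_{t}$ fed to UOB-REPS reproduces the Lagrangian: for any occupancy measure $q$ with $\sum_{h} \sum_{s, a} q_{h}(s, a) = H$, one has $\langle q, \tilde r_{t} \rangle = L(q, \lambda^{t})$. The form $L(q, \lambda) = v^{\star} + \sum_{i} \lambda_{i} (\langle q, r_{i} \rangle - v^{\star})$ is inferred here from equation (28) and is an assumption of the illustration, as are the flattened indexing of state-action pairs and all numerical values.

```python
import random

def inner(q, r):
    """<q, r> = sum_h sum_k q_h[k] * r[k], with state-action pairs flattened to index k."""
    return sum(q_h[k] * r[k] for q_h in q for k in range(len(r)))

def lagrangian(q, rewards, lam, v_star):
    """Assumed form L(q, lam) = v* + sum_i lam_i * (<q, r_i> - v*)."""
    return v_star + sum(l * (inner(q, r) - v_star) for l, r in zip(lam, rewards))

def scalarized_reward(rewards, lam, v_star, H):
    """r_tilde(k) = sum_i lam_i r_i(k) + (v*/H) * (1 - sum_i lam_i)."""
    offset = (v_star / H) * (1 - sum(lam))
    K = len(rewards[0])
    return [sum(l * r[k] for l, r in zip(lam, rewards)) + offset for k in range(K)]

def random_distribution(K, rng):
    w = [rng.random() for _ in range(K)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(4)
H, K, n, B = 5, 6, 3, 2.0                                   # hypothetical sizes
rewards = [[rng.random() for _ in range(K)] for _ in range(n)]
q = [random_distribution(K, rng) for _ in range(H)]         # each q_h sums to one
lam = [rng.uniform(0, B / n) for _ in range(n)]             # sum(lam) <= B
v_star = 1.7                                                # arbitrary target value

r_tilde = scalarized_reward(rewards, lam, v_star, H)
gap = abs(lagrangian(q, rewards, lam, v_star) - inner(q, r_tilde))
```

Because the offset $\frac{v^{\star}}{H}(1 - \sum_{i} \lambda_{i})$ is collected once per step of the horizon, it contributes exactly $v^{\star}(1 - \sum_{i} \lambda_{i})$ in total, which is why a single scalar reward suffices for the primal player.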

