Socially Fair Reinforcement Learning

Debmalya Mandal
Max Planck Institute for Software Systems
dmandal@mpi-sws.org
   Jiarui Gan
University of Oxford
jiarui.gan@cs.ox.ac.uk
Abstract

We consider the problem of episodic reinforcement learning where there are multiple stakeholders with different reward functions. Our goal is to output a policy that is socially fair with respect to different reward functions. Prior works have proposed different objectives that a fair policy must optimize including minimum welfare, and generalized Gini welfare. We first take an axiomatic view of the problem, and propose four axioms that any such fair objective must satisfy. We show that the Nash social welfare is the unique objective that uniquely satisfies all four objectives, whereas prior objectives fail to satisfy all four axioms. We then consider the learning version of the problem where the underlying model i.e. Markov decision process is unknown. We consider the problem of minimizing regret with respect to the fair policies maximizing three different fair objectives – minimum welfare, generalized Gini welfare, and Nash social welfare. Based on optimistic planning, we propose a generic learning algorithm and derive its regret bound with respect to the three different policies. For the objective of Nash social welfare, we also derive a lower bound in regret that grows exponentially with , the number of agents. Finally, we show that for the objective of minimum welfare, one can improve regret by a factor of for a weaker notion of regret.

\addbibresource

refs.bib

1 Introduction

In recent years, reinforcement learning has been immensely successful in various domains including game playing [Silver+17], and robotics [Andry+20]. These breakthroughs have opened up new avenues for applying reinforcement learning in real-world decision making systems like healthcare [GJKF+19], and finance [LLXW+21]. However, such human-facing systems often include multiple stakeholders with different preferences, and the classical goal of maximization of rewards is no longer sufficient. A major design challenge is the selection of right objective that provides certain fairness guarantees across different agents.

Our first example comes from the use of reinforcement learning for vaccine distribution [MMMC21]. In this context, the vulnerable population is in immediate need, and their utilities should be prioritized. Similarly, multi-armed bandit algorithms have been deployed for improving access to maternal and child healthcare [MMTM+22]. In this domain, researchers have found that maximizing the minimum reward provides better fairness guarantees.

These examples suggest that in the context of multi-agent sequential decision making, we should be careful about selecting the right measure. But what should be the right measure? Besides maximizing minimum welfare discussed above, researchers have proposed selecting a policy at the pareto frontier [DCR18], or a policy maximizing generalized Gini social welfare [SWZ20]. In this paper, we focus on the problem of selecting a measure that ensures fairness in multi-agent sequential decision making. For the static setting, this is a well-studied problem in economics [Moulin04], and also in computational social choice [BCEL+16]. Motivated by the success of axiomatic framework in these fields, we aim to layout a reasonable set of axioms for selecting a fairness measure in dynamic multi-agent sequential decision making systems.

Even though a set of axioms might uniquely characterize a fairness measure, the optimal policy according to this fairness measure is unknown. This is because the underlying MDP is unknown, and we must interact with the environment to learn such an optimal policy. Moreover, given the use of different fairness measures in different contexts, can we come up with a generic reinforcement learning algorithm that learns different fair optimal policies? This is the second question we study in this paper, and in particular, we consider the problem of minimizing regret with respect to different fair optimal policies. Note that, this question is challenging since most fairness measures are non-linear in value functions and dynamic programming  [Bertsekas12] based learning algorithms cannot be applied.

1.1 Contributions

At a high level, we propose a set of axioms for selecting fairness measure in multi-agent sequential decision making systems. Three natural fairness measures are then analyzed against the proposed axioms. When the underlying MDP is unknown, we propose a generic learning algorithm for minimizing regret with respect to different fairness measures. In more detail, we make the following contributions.

  • We consider the episodic reinforcement learning with multiple agents, and propose a set of four axioms that any fairness measure should satisfy. We show that these axioms uniquely characterize Nash social welfare (), whereas prior measures like minimum welfare () or generalized Gini welfare () violates some of them.

  • When the underlying model is unknown, we propose a generic learning algorithm for minimizing regret with respect to the optimal fair according to a fairness measure. For a setting with agents, states, actions, and horizon length , we instantiate this algorithm for different measures, and obtain regret upper bounds of for the objective of Nash social welfare, and for both the minimum welfare, and generalized Gini welfare.

  • For the objective of Nash social welfare, we derive a lower bound in regret that scales exponentially with , whereas for the other two measures lower bound immediately follows from lower bound in regret for single-agent setting. Finally, for the objective of minimum welfare, we show that one can improve regret by a factor of for a weaker notion of regret.

1.2 Related Work

Our work is related to several lines of research.

Fairness in Multi-Agent Systems. Our work falls under the framework of fairness in multi-agent sequential decision-making [ZS14, JL19]. In a multi-agent MDP model, \citeauthor*ZS14 [ZS14] propose to solve for a regularized maximin policy, where the regularizer offers a trade-off between utilitarian and max-min fair solution. \citeauthor*JL19 [JL19], on the other hand, selects a policy in the pareto frontier. Recently, several papers have proposed to use generalized Gini welfare ([SWZ20, ZGSW21] as a metric for selecting policies in multi-agent MDP.

Nash Welfare in Bandits. Beyond learning in Markov decision processes, our paper is closely related to \citeauthor*HMS21 [HMS21] who consider the objective of Nash social welfare in multi-armed bandit and achieve regret. Note that, our regret upper bound also becomes if horizon length , and number of states . Additionally, there is no need to generalize the classical axioms of [KN79] in a setting with a single state. \citeauthor*BKMS22 [BKMS22] also consider regret with respect to the Nash social welfare objective. However, they consider a setting where an agent arrives each round, and the goal is to maximize the Nash welfare (i.e. geometric mean of the rewards) over the rounds. Beyond Nash welfare, several papers [BBLB20, PGNN21] have also studied other notions of fairness in multi-armed bandits.

Fairness in Reinforcement Learning. Our work is related to fairness in online learning  [JKMR16, JKMN+16, LRDMP17] and reinforcement learning [JJKMR17]. In contrast to our setting, these papers mainly define fairness with respect to the arms and stipulates that arms of lower quality should not be selected over arms of higher quality. Subsequent papers have considered group fairness in online learning [HLW+20, SLMD19, WBT21]. with metrics like demographic parity. These notions of fairness are different than ours and we refer the reader to the recent survey [GSTF+22] for more details.

Episodic Reinforcement Learning. The classic paper of \citeauthor*AJO08 [AJO08] introduced the UCRL algorithm and studied regret minimization in average reward MDP setting. Our main learning algorithm is based on the optimistic planning approach of UCRL algorithm. In the context of episodic reinforcement learning \citeauthor*AOM17 [AOM17] obtained the minimax optimal regret bound. Subsequent papers have considered other versions of episodic RL including changing reward function [JJLS+20, RM19], and changing transition function [JZBJ18, JL20].

Axioms Regret
PO ANON IIAN CON Upper Bound Lower Bound
Nash Social Welfare Y Y Y Y
Minimum Welfare N Y N Y
Generalized Gini Welfare Y Y N Y
Figure 1: Comparison between different fairness measures for reinforcement learning. Measure is the unique fairness measure (up to a monotone transformation) satisfying all four axioms. The regret for scales exponentially with , whereas for the other measure it is independent of .

2 Preliminaries

We consider the episodic reinforcement learning problem with multiple reward functions. Before describing the general setting, we begin with the single-agent episodie reinforcement learning. We are given an MDP with states and actions i.e. and . The initial state is drawn from a distribution . We will write (resp. ) to denote the state visited (resp. action taken) at time-step . For the next state . The goal is to maximize the expected sum of rewards over the steps i.e. .

In this paper, we consider a setting with different reward functions corresponding to different agents. We will assume for all and state,action pair . Given a policy , we can define the value function corresponding to the -th reward function as follows.

(1)

Often the starting state distribution will be clear from the context and we will simply write instead of . Note that we are considering an episodic reinforcement learning setup, so the policy need not be a stationary policy. We will write where is the (non-stationary) policy used at time-step . We will write to denote the set of such non-stationary policies.

Any policy affects different reward functions differently (e.g. through value functions), and we want to build a measure to evaluate how fair the policy is with respect to different reward functions. In particular, a fairness measure maps a policy and a set of reward functions to a real number i.e. . Therefore, provides a way to evaluate the fairness of a policy and we want to maximize the measure to find the optimal fair policy. The most basic measure is the utilitarian measure which is the sum of values across the agents. However, this measure violates basic axioms and we will consider the following three measures of fairness.

Minimum Welfare.

We maximize the minimum value function across the agents.

Generalized Gini Social Welfare (GGW).

This notion of fairness generalizes max-min fairness and has been considered previously in the literature on multi-objective Markov decision processes [SWZ20, ZGSW21]. We are given a vector of weights so that for each , , and . Given a policy , let be an ordering of the agents so that . We then maximize the following objective:

When , the above objective coincides with minimum welfare.

Nash Social Welfare.

We maximize the product of the value functions of the agents, which is known as the Nash social welfare.

When the full MDP is known, a policy that maximizes each of the measures can be computed by linear programming. The details can be found in the appendix.

Prior work on (multi-objective) fair reinforcement learning has mostly focused on measures minimum welfare and generalized Gini welfare. In this work, we want to emphasize the Nash social welfare as an alternative measure of fairness because of its attractive axiomatic properties.

2.1 Axioms

We view the problem of solving a fair reinforcement learning problem as maximization of a fairness measure . Therefore, we propose four axioms that any such fairness measure should satisfy. These axioms are inspired by classical axioms in economics, particularly social choice theory [Sen18]. For our setting, we build on the axiomatic framework developed by \citeauthor*KN79 [KN79]. To define the axioms we will need some additional notations.

  • Given a policy we will write to denote its state-action occupancy measure, i.e.,

    (2)
  • Given an occupancy measure we will write to denote the corresponding policy, i.e.,

    It is known that the occupancy measure of the policy is .

  • Given two occupancy measures and , we will write to denote the policy corresponding to the convex combination of the occupancy measures and . In particular, first we define the occupancy measure and then take the policy .

Axiom 1 (Pareto Optimality).

If for all and for some , then it must be that

Namely, if policy pareto-improves policy , then must assign higher value to a policy .

Axiom 2 (Independence of Irrelevant Alternatives with Neutrality).

Suppose that for all ,

Then if and only if .

Namely, if to is the same as to to each agent in terms of value function, then to is the same as to .

Axiom 3 (Anonymity).

For any permutation of the agents and policy , it must be that

Namely, is independent of the indices of the agents.

Axiom 4 (Continuity).

Suppose and let for . Then there exists such that

Namely, there is an intermediate policy between and that attains the same value as under . Note that, we don’t take direct convex combination of policies, rather we take convex combination in the occupancy measure. This is because the value function in an MDP is a non-linear function of policy, and there might not exist a policy of the form that achieves the same value as .

3 Axiomatic Analysis

We investigate which of the above axioms are satisfied by different fairness measures. First, we show that even though the axioms appear basic, not all of them are satisfied by the fairness measures considered previously in the literature. In fact, we can show that minimum welfare violates PO and IIAN, whereas the measure generalized Gini welfare () violates IIAN under any choice of weight vector .

Minimum Welfare Violates  1 (Po).

Consider a MDP with a single state and two actions and . Agent has the same reward for both the actions, say . On the other hand, agent has reward for action and reward for action . Consider a policy that always pulls action in state . Then . Now consider another policy that always pulls action in state . Then and . Since the minimum value function does not change we have .

Minimum Welare Violates  2 (Iian).

We again consider a MDP with a single state and two actions and . We consider two reward functions and defined as follows. For the first agent

(3)

On the other hand, for the second agent, we have

(4)

Consider two policies: pulls or with equal probability, whereas pulls with probability and with probability . For the first agent we then have,

For the second agent we have,

Therefore, we observe that the rewards and the policy satisfy the following condition.

For the reward vector we have and . On the other hand, for the reward vector we have and . Therefore, but . This implies that the objective does not satisfy the independence of irrelevant alternatives.

GGW Violates  2 (Iia).

We consider the same example as discussed for the setting of max-min fairness. Consider a weight vector where and . For the reward vector , we have

For the reward vector , we have

We now consider two cases. First, if we have but . On the other hand, if we have but .

Now we consider the remaining case of . We now consider a different choice of reward functions than defined in equations (3) and (4). Reward functions and remain as they are, but we change the definition of reward functions and as follows.

Now it can be again checked that the required assumptions for axiom 2 is satisfied. However, for the weight vector we have but . Therefore, we have shown that for any weight vector with , one can construct an instance where the fairness measure violates axiom IIA. Note that, the case of corresponds to minimum welfare, and the construction can be generalized to more than two agents by just duplicating the reward functions.

Apart from the above violations, the remaining axioms are satisfied by the fairness measures minimum welfare, and the generalized Gini welfare. The proofs are straightforward and are provided in the appendix.

3.1 Nash Social Welfare

We now turn to the fairness measure Nash social welfare. In the appendix, we show that satisfies all four axioms. The main difference with , and is the satisfaction of axiom IIA. Next we state our first main result and show that up to a transformation by a monotonically increasing function, is the unique fairness measure that satisfies all four axioms.

Theorem 5.

Suppose there are at least four actions available at each state, and is a fairness measure satisfying Axioms 1, 2, 3, and 4. Then it holds that for any policy , and reward function , where is some monotonically increasing function.

The proof of Theorem 5 is provided in the appendix, and adapts main ideas from \citeauthor*KN79 [KN79] to the setting of episodic reinforcement learning.

  1. We first define an order on as if and only if . Here (resp. ) always plays action (resp. ) and is a reward function constructed using and .

  2. Based on the function , we then define a function that respects the order . The construction of this function crucially depends on axiom 4.

  3. We then show the function satisfy some requirements so that we can invoke a result due to [osborne1976irrelevant] and obtain that there exists a monotone increasing function such that . This enables us to show that is a monotone increasing transformation of .

In addition to the above properties, the Nash social welfare also provides a utility guarantee in comparison to the optimal max-min fair policy: provides any agent at least fraction of her value under .

Theorem 6.

Suppose that the maximum social welfare is not zero.111When the maximum social welfare is zero, for each policy there exists an agent such that , i.e., . This implies that is also a max-min fair policy.Then for any agent , it holds that

Moreover, this bound is tight, i.e., for any there exists an instance s.t. for some .

The main idea to prove Theorem 6 is to show that the optimal policy can be computed by optimizing a log-concave function over a convex set, and then use first order optimality conditions at the optimal solution.222A similar relationship is known in the fair division literature [CKMP+19], who showed a tight bound of , but their setting and proof techniques are very different than ours. The details can be found in the appendix.

4 Learning

We now consider the setting where the probability transition function is unknown and the learner needs to learn in order to compute a fair policy. The learner interacts with the environment over episodes. We will write to denote the policy (possibly non-stationary) adopted by the learner during episode . We will also use to denote the state visited at time-step of episode , and to denote the action taken at time-step of episode . As is standard in the literature on online reinforcement learning, we will assume that the reward functions are known, but we show in the appendix how our algorithm and analysis can be easily generalized to handle unknown reward functions. We next define the regret of a learner that interacts with the world over episodes.

for  do
       /* Compute optimistic MDP and the corresponding policy */
      
(5)
for  do
             Observer state .
             Play .
            
      /* Update estimates of */
      
      
       for  do
             .
      Set
ALGORITHM 1 Upper Confidence Reinforcement Learning for Fair Objective (UCRL-)

Regret.

We define regret with respect to a generic fair objective . For a policy we will write to denote its value according to the objective e.g. . Let be the policy that maximizes the objective i.e. Then we measure regret of a learning algorithm with respect to this optimal fair policy.

(6)

Algorithm.

Algorithm 1 presents our learning algorithm for a generic objective . The algorithm is based on the principle of optimistic planning, introduced by \citeauthor*AJO08 [AJO08]. The learner interacts with the environment over episodes each of length . At the start of episode , the learner first computes the the optimistic model and the optimistic policy (eq. 5). Then the learner plays policy for steps, and at the end of the episode, updates the empirical probability transition function .

Optimistic Planning.

The most important step of algorithm 1 is the computation of an optimistic model and a policy (eq. 5). Here is the probability transition function that is plausible at time (i.e., lies within the confidence ellipsoid centered around ) and has the highest possible objective value according to the function . For an arbitrary objective , it is not even clear that this step can be performed efficiently. However, we show that for objectives , one can use the ellipsoid method to efficiently compute and . The details for the three objectives are provided in the appendix. The next theorem bounds the regret of algorithm for the objective of Nash social welfare ().

Theorem 7.

For the objective of Nash social welfare, Algorithm 1 has regret

Proof sketch.

The proof of this theorem is provided in the appendix. Here we discuss the main ideas behind the proof.

  1. We can apply Chernoff-Hoeffding inequality and the union bound to show that, with high probability, the true probability transition function is contained within the confidence ellipsoid for all time steps i.e.

  2. Conditioned on the event above, we show that the regret can be upper bounded as . This step uses the fact that is the optimistic model and is the optimistic policy at time .

  3. We can then bound the difference in Nash welfare in terms of sum of difference in value functions. This gives us the following upper bound on regret (see lemma 4 in the appendix).

    This result can be thought of as a linearization lemma, and it generalizes Lemma 2 by \citeauthor*HMS21 [HMS21] to the general setting of finite-horizon reinforcement learning.

  4. Finally, we prove a bound of on the term to complete the proof. ∎

We now consider the regret in learning the max-min fair policy. We instantiate algorithm 1 with objective and obtain a regret bound of . Note that, unlike the objective of Nash social welfare, the regret in this case doesn’t grow with the number of agents .

Theorem 8.

For the objective of minimum welfare, Algorithm 1 has regret

Proof.

Proceeding similarly as the proof of Theorem 7 we can show that the regret can be upper bounded by the regret with respect to the optimistic policies.

We can bound the term as follows.

Substituting this bound gives us the following upper bound on .

where . The proof of Theorem 7 first upper bounds by for any and then bounds by . ∎

The next theorem upper bounds the regret for the objective of Generalized Gini Social welfare.

Theorem 9.

For the objective of generalized Gini social welfare, Algorithm 1 has regret

Proof.

We proceed very similarly to the proof of Theorem 7. Since the true probability transition function is contained within with high probability and algorithm 1 uses optimistic planning, we can bound the regret as follows.

We now consider two ordering of the agents. Let be an ordering of the agents so that

Let be an ordering of the agents so that

Without loss of generality, we can assume that for all . Then we can rewrite the upper bound on regret as follows:

We can show that for all it holds that

(7)

Indeed, if , then we are done, so we assume that .

Note that there must be such that . This is because the converse, i.e., for all , would imply that for some , which is a contradiction. It then follows that

so (7) follows.

Hence, we have

and the remainder of the proof is the same as that of Theorem 8. ∎

4.1 Lower Bound on Regret

If all the agents have identical reward functions then regret in learning minimum welfare () just corresponds to regret in learning a single reward function. Therefore, must be at least the regret in learning for the single-agent setting which is  [AOM17]. Now, for the generalized Gini welfare recall that the weights are normalized i.e. . So, the regret in learning () again coincides with the regret in learning for the single-agent setting and we have .

We now consider the regret in learning for the Nash welfare objective. Note that the upper bound in regret scales exponentially with , the number of agents. Intuitively this is expected as in an episodic reinforcement learning setting, the Nash social welfare of agents scales as . We now provide a lower bound in regret that also scales exponentially with .

Theorem 10.

Suppose and . Then for any policy , there exists a MDP with states, and actions so that for some universal constant .

Here we briefly highlight the main ideas of the proof. We use the same instance used in the lower bound construction of average reward MDP [LS20, AJO08], where there is a collection of MDPs. Each MDP has states arranged in the form of an -ary tree, and two other states – good state (reward ) and bad state (reward ). Any action from the leaves of this tree transition uniformly at random to the remaining two states are good () and bad (). However, on model we define

Now observe that on Model the optimal policy can navigate to leaf and take action . Moreover, the transition probabilities are designed in such a way that once either the good or the bad state is reached, it takes at least steps to leave that state. This means the optimal policy achieves Nash welfare of at least per episode. On the other hand, if the learning policy does not take action at leaf node , it achieves a Nash welfare of at most that episode. Therefore, the regret in these episodes is at least . Then the rest of the proof shows that one can choose small enough so that there exists a model under which the number of such misses by the learning algorithm is at least .

4.2 Improved Bound for Minimum Welfare

We now consider a weaker notion of regret for the fairness measure , minimum social welfare.

(8)

The difference with the regret definition (6) is that for the second quantity the order of minimum and summation is interchanged. Since value functions are non-negative, sum of minimum welfare across episodes is upper bounded by the minimum of the total welfare across episodes. This implies that , and we show that it is possible to improve the bound for .

Before deriving a regret bound with respect to the new definition, we first argue that the new regret eq. 8 is perhaps more appropriate for our setting. Since each of the agents interact with the learner for episodes, the total utility received by agent is . Therefore, the minimum total welfare is . On the other hand, the optimal minimum welfare achievable over episodes is

The difference between these two quantities is precisely the regret in eq. 8.

Algorithm 2 describes our new learning algorithm and is inspired by the dual formulation of the problem of solving max-min fair policy.

(9)

Here is the state-action occupancy measure (as defined in eq. 2) and the constraint ensures that is valid with respect to . The Lagrangian of the optimization problem (9) is given as

Suppose we know , the optimal solution to (9), i.e., the maxmin value. Then substituting we get the following expression for the Lagrangian.

Here we assume that is known and it is an input to Algorithm 2. In our full proof we show that it suffices to set to an upper bound of the maxmin value in Algorithm 2 (e.g., we can use an upper bound of the rewards).

The Lagrangian can be interpreted as a two-player zero-sum game where the learner (max-player) plays and the adversary (min-player) plays . Algorithm 2 tries to find an equilibrium of this game by using a reinforcement learning algorithm for the -player and a bandit algorithm for the -player. In order to see which RL algorithm is suitable let us first rewrite the term as

Therefore, we can define a new environment with rewards and in the new environment the RL agent receives the expected reward . We will fix the choice of at the start of an episode (say at the start of episode ), and we also need to fix the policy at the start of the episode. For this reason, we will be using the UOB-REPS algorithm [JJLS+20] which works with unknown transition, and adversarial rewards, and has regret .

The -player, on the other hand, chooses a point in the convex set at the start of each episode and receives an expected cost of . As the cost function is linear in we can use any algorithm for linear bandit with adversarial reward. We will use the OSMD algorithm proposed by  \citeauthor*BCK12 [BCK12] which has regret .

Input: Maxmin value , number of episodes , length of each episode .
Instantiate UOB-REPS with action set and state space .
Instantiate OSMD with action set .
for  do
       policy chosen according to UOB-REPS.
       action chosen according to OSMD.
       for  do
             Observe .
             Action .
             Reward feedback .
            
      Loss feedback to OSMD: .
      
ALGORITHM 2 Lagrange-Maximin 
Theorem 11.

For , algorithm 2 has regret

The main idea is to show that the average occupancy measure and is an -approximate fixed point of the game for . This lets us bound the constraint violation of with respect to the LP (9) and also bound regret . The proof also shows that any choice of larger than works, so one can call algorithm 2 with . Additionally, the parameter needs to be at least the maximum possible value of any reward function.

5 Conclusion

In this paper, we proposed a set of axioms for selecting fairness measure in multi-agent sequential decision making systems. We analyzed three different fairness measures (, and ) against the axioms. When the underlying MDP is unknown, we proposed a generic learning algorithm for minimizing regret with respect to different fair optimal policies. There are many interesting directions for future work. First, it would be interesting to extend our framework to consider RL with general function approximation. Our axioms should generalize easily, but the computation of fair optimal policies might be challenging. Second, it would be great to bridge the gap between the lower and upper bounds in regret (table 1). For single-agent RL, value iteration based methods achieve minimax regret bounds. However, fairness measures are non-linear, and we might need to devise different learning algorithms. Finally, it would also be interesting to consider a setting where agents have partial information about the underlying MDP [DCR18].

\printbibliography

Appendix A Computing Optimal Policies

We first recall the linear programming formulation of reinforcement learning. Consider an MDP and our goal is to maximize value function with respect to the reward function .

The primal linear program for this problem is the following.

s.t.

We will use the dual formulation of the above linear program.

(10)

Here the variable should be interpreted as the probability that the policy visits state at time-step and takes action . We will refer to the variables as a state-action occupancy measure. Once we solve the LP 12, we can obtain the optimal policy as

(11)

Occupancy Measure Polytope: Given an initial state distribution and a probability transition function we will write to denote the set of state-action occupancy measures satisfying the Bellman flow conditions with respect to and , i.e.

Utilitarian.

We maximize the sum of value functions across the agents.

We can solve optimization problem (10) with reward function and obtain .

(12)

Once we obtain we can obtain the optimal utilitarian policy through eq. 11.

Max-Min Fair.

We maximize the worst value function across the agents.

The above optimization problem can also be solved through a linear program.

(13)

Nash Social Welfare.

We maximize the product of the value functions of the agents.

The above optimization problem can be solved through a concave optimization problem.

(14)

Generalized Gini Social Welfare

(GGW): This notion of fairness generalizes max-min fairness and has been considered previously in the literature on multi-objective Markov decision processes [SWZ20, ZGSW21]. We are given a vector of weights so that for each , , and . Given a policy , let be an ordering of the agents so that . Then we maximize the following objective

The optimal GGW policy can also be computed through a linear program.

(15)

Even though the linear program above has exponentially many constraints, one can easily solve the LP using ellipsoid method. In order to apply the ellipsoid method we just need to show that there exists an efficient separation oracle. Given a candidate solution the bellman flow constraints i.e. the last three constraints can be checked efficiently. Then for each agent we compute . Let be an ordering of the agents so that . Then we check if or not. If this constraint is violated then we have a violating permutation where . On the other hand, if this constraint is satisfied then all the other constraints are satisfied since for any other permutation we have . This is because the weights are arranged in a non-increasing order.

Appendix B Proof of Theorem 6

The optimal policy according to Nash social welfare can be computed through the following optimization problem.

s.t.

Let be the optimal solution of the above optimization problem. We now consider two cases.

Case 1. : Since is a stationary point and the set is convex, the first order stationarity conditions gives us

(16)

The derivative of the Nash welfare objective with respect to is the following.

Then equation (16) gives us the following result.

This immediately implies that .

Case 2. : This implies that for every policy , for some . Hence, the minimum value of every policy is zero and the MM social welfare is zero, contradicting the assumption of the theorem.

Upper Bound

We now construct an instance for the upper bound on the value functions under . Given any we construct the following instance. Consider the following example with a single-state MDP and two actions and available. The rewards are

Clearly, , which selects action with probability maximizes the minimum value, whereby for all . To find out , consider an arbitrary policy which plays action with probability , and we optimize with respect to the Nash welfare. Consider as a function of , we have

of which the zero is given by

Rearranging the terms we get , where is with respect to . Hence,

Appendix C Properties of Fairness Measures

Proposition 12.

satisfies axioms 3 and 4.

Proof.

Since minimum is an anonymous function, the objective satisfies axiom 3. We now check axiom 4. Given a policy , let be the occupancy measure of policy . Then we can express minimum value function as . We can write down as a function of the occupancy measure of a policy i.e.

Since is a minimum of linear functions is continuous. Therefore, there exists a point on the line segment joining and that achieves the value . In particular, there exists so that

Proposition 13.

There exists a weight vector so that satisfies axioms 1, 3 and 4.

Proof.

Consider any weight vector so that and . Given a policy , the ordering of the agents in terms of their value functions remains unchanged even if we permute the agents. This implies hat the objective is anonymous for any weight vector and it satisfies axiom 3. We now check axiom  4. Given a policy , let be the occupancy measure of policy . Then we can express generalized Gini welfare as a function of the occupancy measure of a policy i.e.

Note that is a weighted sum of order statistics. Since each order statistic is a continuous function of the occupancy measure , is a continuous function. Therefore, there exists a point on the line segment joining and that achieves the value . In particular, there exists so that

Now any policy achieves the generalized Gini welfare under the occupancy measure .

In order to check axiom 1 we assume for all and there exists such that . Without loss of generality we can also assume that the agents are ordered so that . Then we have and . Since we have . ∎

Proposition 14.

satisfies Axioms 1, 2, 3, and 4.

Proof.

Since is a monotonically increasing function of , axiom 1 is immediately satisfied. Moreover, the function is a product of the values across the agents and hence anonymous. In order to check axiom 2 suppose there exist policies and reward vectors so that for all we have

Now suppose . This implies . This gives us the following inequality.

Therefore, we have . The other direction of the implication can be proved analogously.

Now we check the final axiom 4. Given a policy , let be the occupancy measure of policy . Then we can express the Nash social welfare functional as . Therefore, we can write down log-Nash social welfare as a function of the occupancy measure of a policy i.e.

is a continuous function of the occupancy measure. We are also given that . Therefore, there exists a point on the line segment joining and that achieves the value . In particular, there exists so that

Now any policy with state-action occupancy measure achieves the intermediate value . ∎

Appendix D Proof of Theorem 5

We prove that Axioms 1, 2, 3, and 4 uniquely characterize the Nash Social Welfare function. Throughout we will fix the transition probability function of the MDP and let different agents pick different reward functions. We assume that there are at least four actions available at each state.

Defining Partial Order : We first define a partial order over the elements in .

(17)

for arbitrary , where (resp. ) denotes a policy such that (resp. ), and is such that

(18)

for all , and all the other entries of are arbitrary. Note that because of Axiom 2 this order is well-defined despite the arbitrary choice of and and reward vector as we have and .

We now verify that the relation satisfies reflexivity, symmetry, and transitivity. Reflexivity and symmetry immediately follows from the definition of the reward function in (18). For transitivity, suppose we are given and . We construct a reward function so that

Then we have and . This implies and . Therefore, any two elements of are comparable under the relation .

Lemma 1.

Suppose with for all and for some . Then .

Proof.

Let be such that and for all . Note that we have for all and for some individual . Therefore, by Axiom 1 we must have . By definition (17), this implies that . ∎

Defining Function .

Let be a vector with all component being . For any , there exists such that (where the comparison is component-wise). We choose a reward function such that

and is arbitrary. Then we have . By Axiom 4 there exists such that

(19)

where and . Under policy the value function of agent is . And the value function under the policy is

Therefore, according to the definition of the order we have and we define

We first verify that is a well-defined function despite the arbitrary choice of , and . Indeed, suppose that there exists and such that and satisfies (19) with respect to and . We have . Then by Lemma 1 it must be that . Hence, the value of is independent of the choice of , and .

We now establish an important property of the function . The proof of this claim is similar to the proof in [KN79] and uses the following lemma by \citeauthor*osborne1976irrelevant [osborne1976irrelevant] (also see Lemma 3.5 stated by \citeauthor*KN79 [KN79]). We present our version of the proof for completeness.

Lemma 2 (\citeauthor*osborne1976irrelevant [osborne1976irrelevant]).

Suppose is reflexive, symmetric, and transitive, and satisfies the following properties for any :

  • if , then ; and

  • if and only if for all .

Then there exist non-negative real constants and a monotone increasing function , such that .

Lemma 3.

There exists a monotone increasing function such that .

Proof.

We show that the function satisfies two requirements of Lemma 2. This will imply that there exist non-negative real constants and a monotone increasing function , such that . By Axiom 3, the constants must be identical, so the statement of the lemma follows.

Requirement (i).

First, given , we need to show that . Since there exists so that and . We choose a reward function such that

and otherwise. There exists such that

There also exists such that

where and other entries of are arbitrary. We have

By Axiom 2, this means that if and only if . Hence, we also get that

(20)

Since from definition (17) we have . This implies that

(21)

But the vector of value functions under policy is and similarly the vector of value functions under policy is . Therefore, it must be that

as otherwise we would have , which contradicts (21) given Axiom 1. It follows that .

Requirement (ii).

Next, we show that if and only if for all and positive real numbers for . First, suppose . Define and the parameters as we defined above. Since , we have

Hence, by Axiom 1.

We now construct a new reward vector . Let and . Then we define the following reward vector

We now apply axiom 2 with respect to policy and reward vectors and . Note that

Since the function satisfies Axiom 2, we have . From Definition (19) we know that there exists such that

and, similarly to (20),

By Axiom 1, this implies that

as the two sides are the value vectors of and . It follows that

This completes the proof. ∎

From Function to .

Suppose there are two policies and such that . Let us denote and . Consider two vectors such that

In order to determine whether or we define the following reward functions following (18):

Then we have and . By Axiom 2 it must be that ; hence, we have by definition. This implies that as satisfies the first requirement of Lemma 2. By Lemma 3 this is equivalent to:

It can be also shown that . Hence, we get that .

Appendix E Optimistic Planning

We aim to solve the following optimization problem.

(22)

where is the set of plausible transition functions at time i.e.

for . We will show that the objective is concave in for various choices of , so the problem is a convex optimization. Since the feasibility over can be determined efficiently, one can use any standard optimization method (e.g. the ellipsoid method) to solve the optimistic planning problem.

In order to show concavity of , let us consider two probability transition functions and . For the probability transition function , let be the policy maximizing the objective and let be the corresponding state-action occupancy measure. Now consider the probability transition function . Note that, satisfies the Bellman-flow constraints with respect to the probability transition function i.e. . With these definitions we are now ready to show that .

Nash Social Welfare: In this case, the optimization problem eq. 22 is equivalent to the following optimization problem.

We now show that the objective is concave in . Suppose, policy is the policy with state-action occupancy measure we have

Therefore, is a concave function of the probability transition function .

Minimum Welfare: In this case, optimization problem eq. 22 is equivalent to the following optimization problem.

We again show that the objective is concave in .

Generalized Gini Welfare: In this case, we are given a weight vector so that for each , and . Let be an ordering of the agents so that

Here is the policy with state-action occupancy measure . Our objective is

We now show that the objective is concave in .

The last line follows from the following observation. Suppose be an ordering of the agents so that

Since the weight vector is non-increasing in , the ordering achieves the smallest possible value of the weighted sum of the value functions i.e.

The same argument holds for policy . Since policy (resp. ) maximizes generalized Gini welfare with respect to the probability transition function (resp. ) we have the following inequality.

Appendix F Proof of Theorem 7

Proof.

Let be the true probability transition function. Let us define . By the Chernoff-Hoeffding type concentration inequality for categorical random variables, the following bound holds for any state , action , and episode .

Therefore, by a union bound over the episodes and all state-action pairs we see that with high probability the true probability transition function is contained in the set for all i.e.

So we assume that this event holds. At the start of episode , we compute the best policy for the optimistic model . This implies the following bound on the Nash welfare with respect to the true model .

This allows us to upper bound the regret through the optimistic policy.

The second inequality uses lemma 4. We now use lemma 5 to establish a bound of on the term for any . This proves the desired upper bound on the regret.

Consider any agent . We can apply lemma 5 with . Notice that as we have the following inequality.

Lemma 4.

For any policy , and probability transition functions and we have

Proof.

We use induction on the number of agents . When this is trivially true. So we assume the claim holds for .

[By the induction hypothesis and the fact that for any and ]
Lemma 5.

Let for all tuples . Then for any policy we have

Proof.

We will use to write the state-action occupancy measure under policy and probability transition function i.e. . Similarly, we will use to denote the state-action occupancy measure under policy and probability transition function . We will also write .

Since rewards are bounded between and we have the following inequality.

(23)
(24)

We now establish a recurrence relation for the term . First note that, since

Here we use the equality constraint .

The above recurrence relation and gives us . Substituting this result in equation 24 we get the following bound.

Appendix G Unknown Reward Functions

When the reward function is unknown we update the optimistic planning step as follows

Here is a confidence set around the empirical reward function and it can be constructed by using Chernoff-Hoeffding inequality and the union bound. Here we consider the case . The proof for , and are similar. As in the proof of theorem 7 we can upper bound regret as

The second term exactly equals the term introduced in theorem 7 and is bounded by . The first term equals the difference in value functions between optimistic reward function and true reward function but with respect to the fixed probability transition function . Let and be the state-action occupancy measure under policy and probability transition function . Then we have,

Now, by an analysis very similar to bounding the error term , one can show that the term is bounded by . This implies that the term dominates the term and the regret is still bounded by .

Appendix H Lower Bound on

Figure 2: Lower Bound Instance (following  [LS20])

We consider a collection of MDPs, where the state-space of each of the MDPs consists of a good state () and a bad state (). The remaining states are arranged in a -ary tree. The transitions within the -ary tree is deterministic. Let be the set of leaves in the -ary tree. Then for the -th leaf node, and -th action we consider the following transition probabilities.

In the -th MDP, all the other actions from the leaf nodes have uniform probability of transitioning to either the good or the bad state. From the good or the bad state, taking any action maintains the current state with probability , and transitions to with probability . The rewards are for all the agents at state . All the other rewards are zero. We choose

h.1 Proof of theorem 10

Proof.

We will write to denote the MDP where for all and . Let us define the following stopping time .

(25)

i.e. is the minimum episode number when the number of visits to state is at least . If this episode number is larger than then we just set as .

Let be the total number of times the policy visits state and takes action until the stopping time . We first show that where denotes expectation with respect to the MDP . Note that a visit to one of the leaf nodes is followed by either a visit to the good node or a visit to the bad node . Since the number of visits to is we just need to bound the number of visits to the state .

Let the total number of visits to the bad state starting from the node . We write where is the number of visits to state which were followed by episode reset before visiting the starting state . On the other hand, is the number of visits to state that were not stopped by episode reset before visiting state . Since there are episodes, we have .

In order to bound we consider the total amount of time the policy stays at state before visiting . This is a geometric random variable with parameter . Therefore, if we write to denote the total amount of time the policy stays at because of visitations in the set , is a sum of i.i.d. geometric random variables, and by standard concentration inequality [DP09, brown2011] we get

Therefore, as long as the total time spent at state is at least with probability at least . Since the total number of time steps is exactly , it must be that in this case. Therefore, either is less than or it is less than with probability at least . Combining these two cases we get the following upper bound on expected value of

This bound holds for any choice of . In particular, for and we get that .

Therefore, we have . As is bounded by the total number of visits to node and node we have the following bound.

By a very similar argument we can also establish a similar bound for MDP .

Lower Bound on Regret for Model : In this model the optimal policy is to navigate to the -th leaf node and then take action . Let be a geometric random variable with parameter . Then the value function of an agent is at least

We now lower bound the term .

as long as . Therefore, the total expected Nash welfare over the episodes is at least

Let be the indicator variable that denotes whether the policy visits leaf node , and takes action at episode . If then by a very similar argument as above, we can show that expected Nash welfare at episode is at most . On the other hand, if then expected Nash welfare at episode is . Therefore, the sum of expected Nash welfare over the episodes is

Let . Then regret on model is at least

Now we can proceed very similarly to the proof of theorem 38.7 from [LS20] and establish that there exists some so that for . This choice of implies the following lower bound.

Appendix I Proof of Theorem 11

Proof.

We first assume that equals the maximin value . We will assume . Notice that UOB-REPS is run with reward function

at time . Since each entry of the reward function is bounded by we have . Therefore, from the regret guarantee of UOB-REPS we have

(26)

where . On the other hand, from the regret guarantee of OSMD we have,

(27)

where is defined as . Let . Since , an application of Chernoff’s implies the following inequality holds with probability at least .

Let be the unit vector with exactly at index , and otherwise. Then from eq. 27 we get the following bound.

We can upper bound as

(28)

This equality gives us the following bound.

After rearranging, and dividing throughout by we get the following bound on regret.

Now consider the case when . Pick any and run algorithm 2. By a similar argument as above, we can establish the following bound.

Using the upper bound established in (28) we get the following bound.

After rearranging and dividing throughout by we get the following inequality.

This gives us the following bound on regret.

The last inequality follows because and . ∎