Counterfactual Multi-Agent

Paper link: COMA

Quick facts:

COMA is an on-policy actor-critic algorithm.
COMA uses a centralized critic with a decentralized actors.

Background

A straightforward adaptation of single-agent actor-critic algorithms to multi-agent RL would be to have, for each agent, an independent actor \(\pi_i(a_i| o_i; \theta)\) and an independent critic \(Q_i(o_i,a_i;\phi)\), and use the following losses:

\[\mathcal{L}^{actor}_i(\theta) = - A(o_i,a_i;\phi) log(\pi_i(a_i| o_i; \theta))\]

\[\mathcal{L}^{critic}_i(\phi) = (y - Q_i(o_i,a_i;\phi))^2\]

The two main ideas of COMA are:

To use only one centralized critic \(Q(\mathbf{s}, \mathbf{o},\mathbf{a})\)
Replace the standard advantage \(A(o_i,a_i)\) with an individual counterfactual advantage that compares the action-value of an agent’s action \(a_i\) to the expected action-value if the agent had selected other actions, while other agents’ actions remain fixed.

CConcretely, the counterfactual advantage for each agent \(i \in \mathcal{I}\) is computed as:

\[A_i(\mathbf{s}, \mathbf{o},\mathbf{a}) = Q(\mathbf{s}, \mathbf{o},\mathbf{a}) - \sum_{a'_i} \pi_i(a'_i|o_i) Q(\mathbf{s}, \mathbf{o},(\mathbf{a}_{-i},a'_i))\]

Implementation-wise, the centralized Q-network takes as input the global state \(\mathcal{s}\), the agent’s observation \(o_i\), and the other \(n-1\) agents’ actions \(\mathbf{a}_{-i}\). The output of the network has the same size as the action space, giving a value for each possible action \(a_i\). This architecture allows easy computation of the counterfactual advantage.

Pseudocode

Implementations

We implemented five variants of COMA:

coma.py: COMA with a single environment and MLP neural networks.
coma_multienvs.py: COMA with parallel environments and MLP neural networks.
coma_lstm.py: COMA with single environment and recurrent neural networks.
coma_lstm_multienvs.py: COMA with parallel environments and recurrent neural networks.
coma_lbf.py: COMA with a single environment and MLP neural networks with additional implementation tricks. This script was added to see if COMA can learn Level-Based Foraging environment. We add two things: (1) we use individual rewards and, (2) we correct TD target when the environment is truncated (i.e. time-out) rather than completed.

Additional details:

Rollout buffer: we store episodes rather than transitions {"obs": [],"actions":[],"reward":[],"states":[],"done":[],"avail_actions":[]}. Storing avail_actions is importing to compute the correct critic and actor losses
Parallel environment: we run batch_size environments in parallel
Parallel environments with RNNs: When using multiple environments in parallel, some episodes may complete before others. We track alive environments at each timestep. This is critical for RNN policies, as the hidden state is initially sized (num_envs x num_agents, hidden_dim) but only updated for (num_alive_envs x num_agents, hidden_dim) when some episodes finish.
RNN training : We use truncated backpropagation through time (TBPTT) to train the RNN network. You can set the length of the sequence using tbptt.
TD(λ) return: we use the recursive formula from Reconciling λ-Returns with Experience Replay (Equation 3) . We start by \(R^{\lambda}_T = 0\)

\[\begin{split}\begin{align} R^{\lambda}_t &= R^{(1)}_t + \gamma \lambda \Big[ R^{\lambda}_{t+1} - \max_{a' \in \mathcal{A}} Q(\hat{s}_{t+1}, a') \Big] \\ &= r_t + \gamma \Big[ \lambda R^{\lambda}_{t+1} + (1-\lambda) \max_{a' \in \mathcal{A}} Q(\hat{s}_{t+1}, a') \Big] \end{align}\end{split}\]

Exploration: We use the exploration strategy suggested in COMA paper. \(ε\) is linearly annealed across a number of training steps.

\[\pi(a_i) = (1 - \varepsilon) \, \text{softmax}(z_i) + \frac{\varepsilon}{|\mathcal{A}_i|}.\]

Individual rewards: As COMA support individual rewards, we implement a COMA variants that support this configuration for LBF environments. You can allow individual rewards by setting reward_aggr=None for LBF environments. By default, LBF returns individual rewards

obs, reward, terminated, truncated, info = self.env.step(actions)
...
if self.reward_aggr == "sum":
    reward = np.sum(reward)
elif self.reward_aggr == "mean":
    reward = np.mean(reward)
## When reward_aggr == None, we keep it as a list of per-agent rewards.

Logging

We record the following metrics:

rollout/ep_reward : Mean episode reward during environment rollout.
rollout/ep_length : Mean episode length during rollout.
rollout/epsilon : Current exploration epsilon.
rollout/num_episodes : Total number of completed episodes until the current step.
rollout/battle_won (SMAClite only): Fraction of battle won by SMAC agents.
train/critic_loss : The critic loss at the current optimization step.
train/actor_loss : The actor loss at the current optimization step.
train/entropy : The average entropy per-agent at the current optimization step
train/actor_gradients : Magnitude of gradients of actor network.
train/critic_gradients : Magnitude of gradients of critic network.
train/num_updates : Total number of network updates until the current step.
eval/ep_reward : Mean episode reward during evaluation.
eval/std_ep_reward : Standard deviation of episode rewards during evaluation.
eval/ep_length : Mean episode length during evaluation.
eval/battle_won ( SMAClite only): Fraction of battles won during evaluation episodes.

Documentation

class cleanmarl.coma.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0005, learning_rate_critic=0.0005, total_timesteps=1000000, gamma=0.99, td_lambda=0.8, normalize_reward=False, normalize_advantage=True, normalize_return=False, target_network_update_freq=1, polyak=0.005, entropy_coef=0.001, use_tdlamda=True, nsteps=1, start_e=0.5, end_e=0.002, exploration_fraction=750, clip_gradients=-1, log_every=10, eval_steps=50, num_eval_ep=5, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)

Parameters:

env_type (str) – Type of the environment: smaclite, pz for PettingZoo, lbf for Level-based Foraging.
env_name (str) – Name of the environment (3m, simple_spread_v3, Foraging-2s-10x10-4p-2f-v3 …).
env_family (str) – Env family when using a PettingZoo environment (sisl, mpe …).
agent_ids (bool) – Include agent IDs (one-hot vector) in observations.
batch_size (int) – Number of episodes to collect in each rollout.
actor_hidden_dim (int) – Hidden dimension of the actor network.
actor_num_layers (int) – Number of hidden layers of the actor network.
critic_hidden_dim (int) – Hidden dimension of the critic network.
critic_num_layers (int) – Number of hidden layers of the critic network.
optimizer (str) – The optimizer.
learning_rate_actor (float) – Learning rate for the actor.
learning_rate_critic (float) – Learning rate for the critic.
total_timesteps (int) – Total steps in the environment during training.
gamma (float) – Discount factor.
use_tdlamda (bool) – Whether to use TD(λ) as a target for the critic. If False, n-step returns (n=nsteps) are used instead.
td_lambda (float) – TD(λ) discount factor.
nsteps (int) – Number of steps when using n-step returns as a target for the critic.
normalize_reward (bool) – Normalize the rewards if True.
normalize_advantage (bool) – Normalize the advantage if True.
normalize_return (bool) – Normalize the returns if True.
target_network_update_freq (int) – Update the target network each target_network_update_freq training step.
polyak (float) – Polyak coefficient when using polyak averaging for target network update.
entropy_coef (float) – Entropy coefficient used to encourage exploration.
start_e (float) – The starting value of epsilon. See Architecture & Training in COMA’s paper (Sec. 5).
end_e (float) – The end value of epsilon. See Architecture & Training in COMA’s paper (Sec. 5).
exploration_fraction (int) – Number of training steps required to linearly decay epsilon from start_e to end_e.
clip_gradients (float) – 0< for no gradient clipping and 0> if clipping gradients at clip_gradients.
log_every (int) – Log rollout stats every log_every episode.
eval_steps (int) – Evaluate the policy every eval_steps training steps.
num_eval_ep (int) – Number of evaluation episodes.
use_wnb (bool) – Logging to Weights & Biases if True.
wnb_project (str) – Weights & Biases project name.
wnb_entity (str) – Weights & Biases entity name.
device (str) – Device (cpu, gpu, mps). We only support CPU training for now.
seed (int) – Random seed.

class cleanmarl.coma_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0005, learning_rate_critic=0.0005, total_timesteps=1000000, gamma=0.99, td_lambda=0.8, normalize_reward=False, normalize_advantage=True, normalize_return=False, target_network_update_freq=1, polyak=0.005, entropy_coef=0.001, use_tdlamda=True, nsteps=1, start_e=0.5, end_e=0.002, exploration_fraction=750, clip_gradients=-1, log_every=10, eval_steps=10, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)

class cleanmarl.coma_lstm.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=5, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0005, learning_rate_critic=0.0005, total_timesteps=1000000, gamma=0.99, td_lambda=0.8, normalize_reward=False, normalize_advantage=True, normalize_return=False, target_network_update_freq=1, polyak=0.005, eval_steps=10, use_tdlamda=True, nsteps=1, start_e=0.5, end_e=0.002, exploration_fraction=750, clip_gradients=-1, tbptt=10, log_every=10, num_eval_ep=10, entropy_coef=0.001, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)

Parameters:: tbptt (int) – Chunk size for Truncated Backpropagation Through Time (TBPTT).

class cleanmarl.coma_lstm_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0005, learning_rate_critic=0.0005, total_timesteps=1000000, gamma=0.99, td_lambda=0.8, normalize_reward=False, normalize_advantage=True, normalize_return=False, target_network_update_freq=1, polyak=0.005, entropy_coef=0.001, use_tdlamda=True, nsteps=1, start_e=0.5, end_e=0.002, exploration_fraction=750, clip_gradients=-1, tbptt=10, log_every=10, eval_steps=50, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)

Parameters:: tbptt (int) – Chunk size for Truncated Backpropagation Through Time (TBPTT).