Multi-Agent Proximal Policy Optimization

Quick facts:
  • MAPPO trains PPO with a centralized critic and decentralized actors.

Background

The only difference between MAPPO and IPPO is that MAPPO uses a centralized critic \(V(s;\phi)\) instead of a decentralized critic \(V_i(o_i;\phi)\).

Architecture diagram

Pseudocode


Implementations

We implemented four variants of MAPPO:

  • mappo.py: MAPPO with a single environment and MLP neural networks.

  • mappo_multienvs.py: MAPPO with parallel environments and MLP neural networks.

  • mappo_lstm.py: MAPPO with a single environment and recurrent neural networks.

  • mappo_lstm_multienvs.py: MAPPO with parallel environments and recurrent neural networks.

Additional details:

  • Rollout buffer: we store episodes as {"obs": [],"actions":[],"reward":[],"states":[],"done":[],"avail_actions":[]}. Storing avail_actions is important for computing the correct critic and actor losses.

  • Parallel environment: we run batch_size environments in parallel.

  • Parallel environment with RNN networks: when running multiple environments in parallel, some episodes may finish before others, so we keep track of the alive environments at each time step. This is especially important with RNN policies: the hidden state is allocated at the beginning of the rollout with shape (num_envs x num_agents, hidden_dim), but once some episodes finish we should only keep updating the (num_alive_envs x num_agents, hidden_dim) rows that belong to alive environments.

  • TD(λ) return: we use the recursive formula from Reconciling λ-Returns with Experience Replay (Equation 3). We start from \(R^{\lambda}_T = 0\):

\[\begin{aligned} R^{\lambda}_t &= R^{(1)}_t + \gamma \lambda \Big[ R^{\lambda}_{t+1} - \max_{a' \in \mathcal{A}} Q(\hat{s}_{t+1}, a') \Big] \\ &= r_t + \gamma \Big[ \lambda R^{\lambda}_{t+1} + (1-\lambda) \max_{a' \in \mathcal{A}} Q(\hat{s}_{t+1}, a') \Big] \end{aligned}\]
  • Advantages: we do not compute the advantages directly from GAE estimates. Instead, we use the TD(λ) return and the following formula, which can be found on page 47 of David Silver’s lecture 4:

\[A(s_t,a_t) = R^{\lambda}_t -V(s_t)\]
  • RNN training: we use truncated backpropagation through time (TBPTT) to train the RNN network. The sequence length is set with tbptt.
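
The TD(λ) recursion and the advantage formula from the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in mappo.py: the function names are hypothetical, and the critic value \(V(s_{t+1})\) is assumed to stand in for the \(\max_{a'} Q\) term, since MAPPO learns a state-value critic. Here next_values[t] is the critic estimate for the state reached after step t, taken to be 0 at a terminal state.

```python
def td_lambda_returns(rewards, next_values, gamma=0.99, td_lambda=0.95):
    """Backward recursion R_t = r_t + gamma * (lambda * R_{t+1} + (1 - lambda) * V(s_{t+1})).

    Hypothetical helper: V(s_{t+1}) replaces the max-Q term (assumption).
    """
    returns = [0.0] * len(rewards)
    next_return = 0.0  # R^lambda_T = 0, as stated in the text
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * (
            td_lambda * next_return + (1.0 - td_lambda) * next_values[t]
        )
        next_return = returns[t]
    return returns


def advantages_from_returns(returns, values):
    """A(s_t, a_t) = R^lambda_t - V(s_t), computed element-wise."""
    return [ret - val for ret, val in zip(returns, values)]


if __name__ == "__main__":
    rets = td_lambda_returns([1.0, 1.0, 1.0], [0.5, 0.5, 0.0])
    print(advantages_from_returns(rets, [0.4, 0.5, 0.6]))
```

With td_lambda=1 the recursion reduces to the discounted Monte Carlo return, and with td_lambda=0 it reduces to the one-step target \(r_t + \gamma V(s_{t+1})\).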

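The alive-environment bookkeeping for recurrent policies can be illustrated as follows. This is a sketch only: the helper name is hypothetical, hidden states are plain nested lists rather than tensors, and rows are assumed to be laid out environment by environment (all agents of environment 0, then environment 1, and so on), which may not match the layout used in mappo_lstm_multienvs.py.

```python
def update_alive_hidden(hidden, new_hidden_alive, alive, num_agents):
    """Write freshly computed hidden rows back into the full hidden state.

    hidden:           list of num_envs * num_agents rows (each a list of floats)
    new_hidden_alive: list of num_alive_envs * num_agents rows from the RNN
    alive:            list of booleans, one per environment
    Assumes an environment-major row layout (hypothetical).
    """
    alive_rows = [
        env * num_agents + agent
        for env, is_alive in enumerate(alive)
        if is_alive
        for agent in range(num_agents)
    ]
    if len(alive_rows) != len(new_hidden_alive):
        raise ValueError("new_hidden_alive must have one row per alive agent")
    updated = [list(row) for row in hidden]  # rows of finished envs stay unchanged
    for row_idx, new_row in zip(alive_rows, new_hidden_alive):
        updated[row_idx] = list(new_row)
    return updated


if __name__ == "__main__":
    full = [[0.0] for _ in range(6)]  # 3 envs x 2 agents, hidden_dim = 1
    fresh = [[1.0], [2.0], [3.0], [4.0]]  # only envs 0 and 2 are alive
    print(update_alive_hidden(full, fresh, [True, False, True], num_agents=2))
```

The full hidden state keeps its fixed size throughout the rollout; only the rows of environments that are still running are overwritten.
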
Logging

We record the following metrics:

  • rollout/ep_reward : Mean episode reward during environment rollout.

  • rollout/ep_length : Mean episode length during rollout.

  • rollout/num_episodes : Total number of completed episodes until the current step.

  • rollout/battle_won (SMAClite only): Fraction of battles won by SMAC agents.

  • train/critic_loss : The critic loss at the current optimization step.

  • train/actor_loss : The actor loss at the current optimization step.

  • train/entropy : The average entropy per-agent at the current optimization step.

  • train/kl_divergence : The average KL divergence per agent at the current optimization step.

  • train/clipped_ratios : The fraction of policy ratios that were clipped at the current optimization step.

  • train/actor_gradients : Magnitude of gradients of actor network.

  • train/critic_gradients : Magnitude of gradients of critic network.

  • train/num_updates : Total number of network updates until the current step.

  • eval/ep_reward : Mean episode reward during evaluation.

  • eval/std_ep_reward : Standard deviation of episode rewards during evaluation.

  • eval/ep_length : Mean episode length during evaluation.

  • eval/battle_won (SMAClite only): Fraction of battles won during evaluation episodes.
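
As a rough guide to what train/actor_loss, train/kl_divergence and train/clipped_ratios measure, the sketch below computes the standard PPO clipped surrogate together with two common diagnostics. The function name is hypothetical, and the KL estimator (mean of old minus new log-probability) and the definition of the clipped fraction are assumptions; the formulas used in the implementation may differ.

```python
import math


def ppo_clip_stats(new_log_probs, old_log_probs, advantages, ppo_clip=0.2):
    """Return (actor_loss, clipped_fraction, approx_kl) for one batch.

    Hypothetical helper using plain Python lists instead of tensors.
    """
    n = len(advantages)
    ratios = [math.exp(new - old) for new, old in zip(new_log_probs, old_log_probs)]
    surrogates = []
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - ppo_clip), 1.0 + ppo_clip)
        # PPO keeps the more pessimistic of the clipped and unclipped terms.
        surrogates.append(min(ratio * adv, clipped_ratio * adv))
    actor_loss = -sum(surrogates) / n
    # Share of samples whose ratio left the [1 - clip, 1 + clip] interval.
    clipped_fraction = sum(abs(r - 1.0) > ppo_clip for r in ratios) / n
    # Simple sample-based KL estimate (assumption): mean(old_log_prob - new_log_prob).
    approx_kl = sum(old - new for new, old in zip(new_log_probs, old_log_probs)) / n
    return actor_loss, clipped_fraction, approx_kl


if __name__ == "__main__":
    print(ppo_clip_stats([-0.9, -1.4], [-1.0, -1.0], [0.5, -0.3]))
```

A clipped fraction that stays close to 1 suggests the policy is moving far from the rollout policy within a single update, in which case a smaller learning rate or fewer epochs may help.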

Documentation

class cleanmarl.mappo.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=64, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, gamma=0.99, td_lambda=0.95, normalize_reward=False, normalize_advantage=False, normalize_return=False, epochs=3, ppo_clip=0.2, entropy_coef=0.001, log_every=10, clip_gradients=-1, eval_steps=10, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
Parameters:
  • env_type (str) – Type of the environment: smaclite, pz for PettingZoo, lbf for Level-based Foraging.

  • env_name (str) – Name of the environment (3m, simple_spread_v3, Foraging-2s-10x10-4p-2f-v3 …)

  • env_family (str) – Env family when using a PettingZoo environment (sisl, mpe …)

  • agent_ids (bool) – Include agent IDs (one-hot vectors) in the observations.

  • batch_size (int) – Number of episodes to collect in each rollout.

  • actor_hidden_dim (int) – Hidden dimension of the actor network.

  • actor_num_layers (int) – Number of hidden layers in the actor network.

  • critic_hidden_dim (int) – Hidden dimension of the critic network.

  • critic_num_layers (int) – Number of hidden layers in the critic network.

  • optimizer (str) – Optimizer to use.

  • learning_rate_actor (float) – Learning rate for the actor network.

  • learning_rate_critic (float) – Learning rate for the critic network.

  • total_timesteps (int) – Total number of environment steps during training.

  • gamma (float) – Discount factor.

  • td_lambda (float) – The λ parameter of the TD(λ) return.

  • normalize_reward (bool) – Whether to normalize rewards.

  • normalize_advantage (bool) – Whether to normalize advantages.

  • normalize_return (bool) – Whether to normalize returns.

  • epochs (int) – Number of training epochs per update.

  • ppo_clip (float) – PPO clipping factor for policy updates.

  • entropy_coef (float) – Entropy coefficient to encourage exploration.

  • clip_gradients (float) – Gradient clipping value (negative to disable clipping).

  • log_every (int) – Log rollout statistics every log_every episodes.

  • eval_steps (int) – Evaluate the policy every eval_steps training steps.

  • num_eval_ep (int) – Number of episodes used during evaluation.

  • use_wnb (bool) – Whether to log to Weights & Biases.

  • wnb_project (str) – Weights & Biases project name.

  • wnb_entity (str) – Weights & Biases entity name.

  • device (str) – Device to run training on (cpu, gpu, or mps).

  • seed (int) – Random seed for reproducibility.

class cleanmarl.mappo_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=64, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, gamma=0.99, td_lambda=0.95, normalize_reward=False, normalize_advantage=False, normalize_return=False, epochs=3, ppo_clip=0.2, entropy_coef=0.001, clip_gradients=-1, log_every=10, eval_steps=50, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', seed=1, device='cpu')
class cleanmarl.mappo_lstm.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=64, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, gamma=0.99, td_lambda=0.95, normalize_reward=False, normalize_advantage=False, normalize_return=False, epochs=3, ppo_clip=0.2, entropy_coef=0.001, tbptt=10, clip_gradients=-1, log_every=10, eval_steps=50, num_eval_ep=5, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
Parameters:
  • tbptt (int) – Chunk size for Truncated Backpropagation Through Time (TBPTT).
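
To illustrate what the tbptt chunk size controls, the sketch below splits one episode into consecutive chunks of at most tbptt steps. The helper name is hypothetical and the snippet only shows the chunking; in a typical TBPTT setup the hidden state is carried from one chunk to the next while gradients are stopped at chunk boundaries, which is assumed here rather than taken from the implementation.

```python
def split_into_chunks(sequence, tbptt):
    """Split a per-step sequence into consecutive chunks of at most tbptt steps."""
    if tbptt <= 0:
        raise ValueError("tbptt must be a positive integer")
    return [sequence[start:start + tbptt] for start in range(0, len(sequence), tbptt)]


if __name__ == "__main__":
    episode = list(range(25))  # stand-in for 25 stored time steps
    for chunk in split_into_chunks(episode, tbptt=10):
        # Backpropagation would be limited to the steps inside each chunk.
        print(len(chunk), chunk[0], chunk[-1])
```

Smaller values of tbptt shorten the gradient path and reduce memory use, at the cost of ignoring dependencies that span more than tbptt steps.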

class cleanmarl.mappo_lstm_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=64, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, gamma=0.99, td_lambda=0.95, normalize_reward=False, normalize_advantage=False, normalize_return=False, epochs=3, ppo_clip=0.2, entropy_coef=0.001, clip_gradients=-1, tbptt=10, log_every=10, eval_steps=50, num_eval_ep=5, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)