Independent Proximal Policy Optimization

Quick facts:
  • IPPO is each agent implementing PPO using its local observations and actions.

Background

Independent PPO is a straightforward extension of PPO to multi-agent RL. In IPPO, each agent has its own actor and critic that condition on local observations \(o_i\). Agents share the weights.

For each agent \(i\) we use the following loss to train its actor:

\[L_i(\theta) = \min\!\left( \frac{\pi(a_i^t \mid o_i^t;\theta)}{\pi(a_i^t \mid o_i^t;\theta_{\text{old}})}\, A_i^t,\; \operatorname{clip}\!\left(\frac{\pi(a_i^t \mid o_i^t;\theta)}{\pi(a_i^t \mid o_i^t;\theta_{\text{old}})},\,1-\varepsilon,\,1+\varepsilon\right) A_i^t \right)\]

with \(A_i^t\) estimated using \(V_i(;\phi)\)

And for the critics, we use:

\[y_i - V_i(o_i;\phi)\]

where \(y_i\) can be any TD target.

Architecture diagram

Pseudocode

Architecture diagram

Implementations

We implemented four variants of IPPO:

  • ippo.py: IPPO with a single environment and MLP neural networks.

  • ippo_multienvs.py: IPPO with parallel environments and MLP neural networks.

  • ippo_lstm.py: IPPO with single environment and recurrent neural networks.

  • ippo_lstm_multienvs.py: IPPO with parallel environments and recurrent neural networks.

Additional details:

  • Rollout buffer: we store episodes {"obs": [],"actions":[],"reward":[],"states":[],"done":[],"avail_actions":[]}. Storing avail_actions is importing to compute the correct critic and actor losses.

  • Parallel environment: we run batch_size environments in parallel

  • Parallel environments with RNNs: When using multiple environments in parallel, some episodes may complete before others. We track alive environments at each timestep. This is critical for RNN policies, as the hidden state is initially sized (num_envs x num_agents, hidden_dim) but only updated for (num_alive_envs x num_agents, hidden_dim) when some episodes finish.

  • TD(λ) return: we use the recursive formula from Reconciling λ-Returns with Experience Replay (Equation 3) . We start by \(R^{\lambda}_T = 0\)

\[\begin{split}\begin{align} R^{\lambda}_t &= R^{(1)}_t + \gamma \lambda \Big[ R^{\lambda}_{t+1} - \max_{a' \in \mathcal{A}} Q(\hat{s}_{t+1}, a') \Big] \\ &= r_t + \gamma \Big[ \lambda R^{\lambda}_{t+1} + (1-\lambda) \max_{a' \in \mathcal{A}} Q(\hat{s}_{t+1}, a') \Big] \end{align}\end{split}\]
  • Advantages: We don’t directly estimate the advantages using GAE estimates, we instead use the TD(λ) return by exploiting the following formula that can be found in page 47 in David Silver’s lecture n 4

\[A(s_t,a_t) = R^{\lambda}_t -V(s_t)\]
  • RNN training : We use Truncated Back-Propagation Through Time (TBPTT) to train the RNN network. You can set the length of the sequence using tbptt.

Logging

We record the following metrics:

  • rollout/ep_reward : Mean episode reward during environment rollout.

  • rollout/ep_length : Mean episode length during rollout.

  • rollout/num_episodes : Total number of completed episodes until the current step.

  • rollout/battle_won (SMAClite only): Fraction of battle won by SMAC agents.

  • train/critic_loss : The critic loss at the current optimization step.

  • train/actor_loss : The actor loss at the current optimization step.

  • train/entropy : The average entropy per-agent at the current optimization step.

  • train/kl_divergence : The average kl-divergence per-agent at the current optimization step.

  • train/clipped_ratios : The ratio of clipped policies at the current optimization step.

  • train/actor_gradients : Magnitude of gradients of actor network.

  • train/critic_gradients : Magnitude of gradients of critic network.

  • train/num_updates : Total number of network updates until the current step.

  • eval/ep_reward : Mean episode reward during evaluation.

  • eval/std_ep_reward : Standard deviation of episode rewards during evaluation.

  • eval/ep_length : Mean episode length during evaluation.

  • eval/battle_won ( SMAClite only): Fraction of battles won during evaluation episodes.

Documentation

class cleanmarl.ippo.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=32, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, gamma=0.99, td_lambda=0.95, normalize_reward=False, normalize_advantage=False, normalize_return=False, ppo_clip=0.2, entropy_coef=0.001, epochs=3, clip_gradients=-1, log_every=10, eval_steps=50, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
Parameters:
  • env_type (str) – Type of the environment: smaclite, pz for PettingZoo, lbf for Level-based Foraging.

  • env_name (str) – Name of the environment (3m, simple_spread_v3 Foraging-2s-10x10-4p-2f-v3 …)

  • env_family (str) – Env family when using a PettingZoo environment (sisl, mpe …)

  • agent_ids (bool) – Include agent IDs (one-hot vector) in observations

  • batch_size (int) – Number of episodes to collect in each rollout

  • actor_hidden_dim (int) – Hidden dimension of actor network

  • actor_num_layers (int) – Number of hidden layers of actor network

  • critic_hidden_dim (int) – Hidden dimension of critic network

  • critic_num_layers (int) – Number of hidden layers of critic network

  • optimizer (str) – The optimizer

  • learning_rate_actor (float) – Learning rate for the actor

  • learning_rate_critic (float) – Learning rate for the critic

  • total_timesteps (int) – Total steps in the environment during training

  • gamma (float) – Discount factor

  • td_lambda (float) – TD(λ) discount factor

  • normalize_reward (bool) – Normalize the rewards if True

  • normalize_advantage (bool) – Normalize the advantage if True

  • normalize_return (bool) – Normalize the returns if True

  • ppo_clip (float) – PPO clipping factor

  • entropy_coef (float) – Entropy coefficient

  • epochs (int) – Number of training epochs

  • clip_gradients (float) – 0 < for no clipping and 0 > if clipping at clip_gradients

  • log_every (int) – Log rollout statistics every log_every episode

  • eval_steps (int) – Evaluate the policy each eval_steps training steps

  • num_eval_ep (int) – Number of evaluation episodes

  • use_wnb (bool) – Logging to Weights & Biases if True

  • wnb_project (str) – Weights & Biases project name

  • wnb_entity (str) – Weights & Biases entity name

  • device (str) – Device (cpu, gpu, mps)

  • seed (int) – Random seed

class cleanmarl.ippo_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=32, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, gamma=0.99, td_lambda=0.95, normalize_reward=False, normalize_advantage=False, normalize_return=False, ppo_clip=0.2, entropy_coef=0.001, epochs=3, clip_gradients=-1, log_every=10, eval_steps=50, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
class cleanmarl.ippo_lstm.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=32, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, gamma=0.99, td_lambda=0.95, normalize_reward=False, normalize_advantage=False, normalize_return=False, log_every=10, ppo_clip=0.2, entropy_coef=0.001, epochs=3, clip_gradients=-1, tbptt=5, eval_steps=50, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
Parameters:

tbptt (int) – Chunk size for Truncated Backpropagation Through Time (TBPTT).

class cleanmarl.ippo_lstm_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=32, critic_num_layers=1, optimizer='AdamW', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, gamma=0.99, td_lambda=0.95, normalize_reward=False, normalize_advantage=False, normalize_return=False, log_every=10, ppo_clip=0.2, entropy_coef=0.001, epochs=3, clip_gradients=-1, tbptt=5, eval_steps=50, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)