Independent Proximal Policy Optimization

Paper link: IPPO

Quick facts:

IPPO has each agent run PPO on its own local observations and actions.

Background

Independent PPO is a straightforward extension of PPO to multi-agent RL. In IPPO, each agent has its own actor and critic, both conditioned only on its local observations \(o_i\). The network weights are shared across agents.

For each agent \(i\) we use the following loss to train its actor:

\[L_i(\theta) = \min\!\left( \frac{\pi(a_i^t \mid o_i^t;\theta)}{\pi(a_i^t \mid o_i^t;\theta_{\text{old}})}\, A_i^t,\; \operatorname{clip}\!\left(\frac{\pi(a_i^t \mid o_i^t;\theta)}{\pi(a_i^t \mid o_i^t;\theta_{\text{old}})},\,1-\varepsilon,\,1+\varepsilon\right) A_i^t \right)\]

with \(A_i^t\) estimated using the critic \(V_i(o_i;\phi)\).
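The clipped objective above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from ippo.py: the function name `clipped_surrogate` and its list-based inputs are assumptions made for the example, and the ratio is computed from log-probabilities. The expression is an objective to maximize, so a training loop would typically minimize its negative.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean PPO clipped objective over a batch of (agent, timestep) samples.

    logp_new / logp_old: log pi(a | o) under the current and old parameters.
    advantages: advantage estimate for each sample.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Taking the minimum removes the incentive to move the ratio
        # outside [1 - eps, 1 + eps] in the direction the advantage favors.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Hypothetical batch: the ratio doubles for one sample with positive advantage.
objective = clipped_surrogate([math.log(2.0)], [0.0], [1.0])
loss = -objective  # what an optimizer would minimize
```

With a ratio of 2 and a positive advantage the objective is capped at 1 + ε times the advantage, while with a negative advantage the unclipped term is kept because it is the smaller of the two.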

And for the critics, we use:

\[y_i - V_i(o_i;\phi)\]

where \(y_i\) can be any TD target.

Pseudocode

Implementations

We implemented four variants of IPPO:

ippo.py: IPPO with a single environment and MLP neural networks.
ippo_multienvs.py: IPPO with parallel environments and MLP neural networks.
ippo_lstm.py: IPPO with single environment and recurrent neural networks.
ippo_lstm_multienvs.py: IPPO with parallel environments and recurrent neural networks.

Additional details:

Rollout buffer: we store episodes as {"obs": [],"actions":[],"reward":[],"states":[],"done":[],"avail_actions":[]}. Storing avail_actions is important for computing the correct critic and actor losses.
Parallel environments: we run batch_size environments in parallel.
Parallel environments with RNNs: When using multiple environments in parallel, some episodes may complete before others. We track alive environments at each timestep. This is critical for RNN policies, as the hidden state is initially sized (num_envs x num_agents, hidden_dim) but only updated for (num_alive_envs x num_agents, hidden_dim) when some episodes finish.
TD(λ) return: we use the recursive formula from Reconciling λ-Returns with Experience Replay (Equation 3). We initialize the recursion with \(R^{\lambda}_T = 0\):

\[\begin{aligned} R^{\lambda}_t &= R^{(1)}_t + \gamma \lambda \Big[ R^{\lambda}_{t+1} - \max_{a' \in \mathcal{A}} Q(\hat{s}_{t+1}, a') \Big] \\ &= r_t + \gamma \Big[ \lambda R^{\lambda}_{t+1} + (1-\lambda) \max_{a' \in \mathcal{A}} Q(\hat{s}_{t+1}, a') \Big] \end{aligned}\]
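The second line of the recursion can be sketched as a backward pass over one episode. This is an illustration under stated assumptions rather than the repository code: the function name `td_lambda_returns` is hypothetical, and `next_values[t]` stands in for the bootstrap term \(\max_{a'} Q(\hat{s}_{t+1}, a')\) (in an actor-critic setting this would presumably be a critic estimate for the next state, with 0 at a terminal state).

```python
def td_lambda_returns(rewards, next_values, gamma=0.99, td_lambda=0.95):
    """Backward recursion for the lambda-return of a single episode.

    rewards[t]     : r_t
    next_values[t] : bootstrap estimate for the state reached at t + 1
                     (assumed to be 0.0 when that state is terminal)
    """
    returns = [0.0] * len(rewards)
    next_return = 0.0  # R^lambda_T = 0
    for t in reversed(range(len(rewards))):
        bootstrap = td_lambda * next_return + (1.0 - td_lambda) * next_values[t]
        returns[t] = rewards[t] + gamma * bootstrap
        next_return = returns[t]
    return returns


print(td_lambda_returns([1.0, 1.0, 1.0], [2.0, 2.0, 0.0], gamma=1.0, td_lambda=0.5))
# [3.25, 2.5, 1.0]
```

Setting td_lambda to 1 recovers the Monte Carlo return, and setting it to 0 recovers the one-step TD target.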

Advantages: We do not estimate the advantages with GAE directly. Instead, we derive them from the TD(λ) return using the following formula, found on page 47 of David Silver’s lecture 4:

\[A(s_t,a_t) = R^{\lambda}_t -V(s_t)\]
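To see how this relates to GAE, the sketch below computes the advantages both ways and checks that they agree. It assumes the λ-return bootstraps from the same value estimates \(V\) and that the value of the final state is 0 (a terminated episode); both function names and the sample numbers are made up for the illustration. Subtracting \(V(s_t)\) from the recursion gives \(A_t = \delta_t + \gamma\lambda A_{t+1}\) with \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\), which is the GAE recursion.

```python
import math


def lambda_return_advantages(rewards, values, gamma, lam):
    """A_t = R^lambda_t - V(s_t). `values` has len(rewards) + 1 entries."""
    advantages = [0.0] * len(rewards)
    next_return = 0.0  # R^lambda_T = 0
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * (
            lam * next_return + (1.0 - lam) * values[t + 1]
        )
        advantages[t] = next_return - values[t]
    return advantages


def gae_advantages(rewards, values, gamma, lam):
    """Standard GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}."""
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, -0.3, 0.0]  # last entry: terminal state, value 0
from_returns = lambda_return_advantages(rewards, values, gamma=0.9, lam=0.8)
from_gae = gae_advantages(rewards, values, gamma=0.9, lam=0.8)
agree = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(from_returns, from_gae))
```

If the final value were nonzero, the two estimates would differ, because the λ-return recursion starts from \(R^{\lambda}_T = 0\) rather than from \(V(s_T)\).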

RNN training: We use Truncated Back-Propagation Through Time (TBPTT) to train the recurrent network. The sequence length is set with tbptt.
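As a rough picture of what the tbptt setting controls, the sketch below splits an episode into consecutive chunks of at most tbptt steps. The helper `tbptt_chunks` is hypothetical and only computes index ranges. In a typical TBPTT setup the hidden state is carried from one chunk to the next while gradients stop at chunk boundaries; how the recurrent variants here handle the hidden state is not specified in this section.

```python
def tbptt_chunks(episode_length, tbptt):
    """Return (start, end) index pairs covering an episode in chunks
    of at most `tbptt` timesteps; `end` is exclusive."""
    if tbptt <= 0:
        raise ValueError("tbptt must be positive")
    return [
        (start, min(start + tbptt, episode_length))
        for start in range(0, episode_length, tbptt)
    ]


print(tbptt_chunks(12, 5))
# [(0, 5), (5, 10), (10, 12)]
```

The last chunk is shorter whenever the episode length is not a multiple of tbptt.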

Logging

We record the following metrics:

rollout/ep_reward : Mean episode reward during environment rollout.
rollout/ep_length : Mean episode length during rollout.
rollout/num_episodes : Total number of completed episodes until the current step.
rollout/battle_won (SMAClite only): Fraction of battles won by SMAC agents.
train/critic_loss : The critic loss at the current optimization step.
train/actor_loss : The actor loss at the current optimization step.
train/entropy : The average entropy per agent at the current optimization step.
train/kl_divergence : The average KL divergence per agent at the current optimization step.
train/clipped_ratios : The fraction of probability ratios that were clipped at the current optimization step.
train/actor_gradients : Magnitude of gradients of actor network.
train/critic_gradients : Magnitude of gradients of critic network.
train/num_updates : Total number of network updates until the current step.
eval/ep_reward : Mean episode reward during evaluation.
eval/std_ep_reward : Standard deviation of episode rewards during evaluation.
eval/ep_length : Mean episode length during evaluation.
eval/battle_won (SMAClite only): Fraction of battles won during evaluation episodes.
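For orientation, two of the training diagnostics can be estimated from log-probabilities alone. The sketch below shows one common way to do so and is not necessarily how train/clipped_ratios and train/kl_divergence are computed in this codebase: the function name `ppo_diagnostics` is hypothetical, the clip fraction counts samples whose ratio falls outside \([1-\varepsilon, 1+\varepsilon]\), and the KL term is the simple sample estimate \(\mathbb{E}[\log \pi_{\text{old}} - \log \pi_{\theta}]\) over actions drawn from the old policy.

```python
import math


def ppo_diagnostics(logp_new, logp_old, eps=0.2):
    """Return (clip_fraction, approx_kl) for a batch of sampled actions."""
    n = len(logp_new)
    ratios = [math.exp(new - old) for new, old in zip(logp_new, logp_old)]
    # Share of samples whose probability ratio lies outside the clip range.
    clip_fraction = sum(abs(r - 1.0) > eps for r in ratios) / n
    # Sample estimate of KL(pi_old || pi_new).
    approx_kl = sum(old - new for new, old in zip(logp_new, logp_old)) / n
    return clip_fraction, approx_kl


clip_fraction, approx_kl = ppo_diagnostics(
    [math.log(2.0), 0.0, math.log(0.5), 0.0], [0.0, 0.0, 0.0, 0.0]
)
```

A clip fraction that stays near 1 or a rapidly growing KL estimate usually suggests the policy is moving too far from the data-collecting policy within one update.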

Documentation

class cleanmarl.ippo.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=32, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, gamma=0.99, td_lambda=0.95, normalize_reward=False, normalize_advantage=False, normalize_return=False, ppo_clip=0.2, entropy_coef=0.001, epochs=3, clip_gradients=-1, log_every=10, eval_steps=50, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)

Parameters:

env_type (str) – Type of the environment: smaclite, pz for PettingZoo, lbf for Level-based Foraging.
env_name (str) – Name of the environment (3m, simple_spread_v3, Foraging-2s-10x10-4p-2f-v3 …)
env_family (str) – Env family when using a PettingZoo environment (sisl, mpe …)
agent_ids (bool) – Include agent IDs (one-hot vector) in observations
batch_size (int) – Number of episodes to collect in each rollout
actor_hidden_dim (int) – Hidden dimension of actor network
actor_num_layers (int) – Number of hidden layers of actor network
critic_hidden_dim (int) – Hidden dimension of critic network
critic_num_layers (int) – Number of hidden layers of critic network
optimizer (str) – The optimizer
learning_rate_actor (float) – Learning rate for the actor
learning_rate_critic (float) – Learning rate for the critic
total_timesteps (int) – Total steps in the environment during training
gamma (float) – Discount factor
td_lambda (float) – TD(λ) discount factor
normalize_reward (bool) – Normalize the rewards if True
normalize_advantage (bool) – Normalize the advantage if True
normalize_return (bool) – Normalize the returns if True
ppo_clip (float) – PPO clipping factor
entropy_coef (float) – Entropy coefficient
epochs (int) – Number of training epochs
clip_gradients (float) – Negative value for no clipping; positive value to clip gradients at clip_gradients
log_every (int) – Log rollout statistics every log_every episodes
eval_steps (int) – Evaluate the policy every eval_steps training steps
num_eval_ep (int) – Number of evaluation episodes
use_wnb (bool) – Logging to Weights & Biases if True
wnb_project (str) – Weights & Biases project name
wnb_entity (str) – Weights & Biases entity name
device (str) – Device (cpu, gpu, mps)
seed (int) – Random seed

class cleanmarl.ippo_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=32, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, gamma=0.99, td_lambda=0.95, normalize_reward=False, normalize_advantage=False, normalize_return=False, ppo_clip=0.2, entropy_coef=0.001, epochs=3, clip_gradients=-1, log_every=10, eval_steps=50, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)

class cleanmarl.ippo_lstm.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=32, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, gamma=0.99, td_lambda=0.95, normalize_reward=False, normalize_advantage=False, normalize_return=False, log_every=10, ppo_clip=0.2, entropy_coef=0.001, epochs=3, clip_gradients=-1, tbptt=5, eval_steps=50, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)

Parameters: tbptt (int) – Chunk size for Truncated Backpropagation Through Time (TBPTT).

class cleanmarl.ippo_lstm_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=32, critic_num_layers=1, optimizer='AdamW', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, gamma=0.99, td_lambda=0.95, normalize_reward=False, normalize_advantage=False, normalize_return=False, log_every=10, ppo_clip=0.2, entropy_coef=0.001, epochs=3, clip_gradients=-1, tbptt=5, eval_steps=50, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)