Multi-Agent Deep Deterministic Policy Gradient

Quick facts:
  • MADDPG is an off-policy actor-critic algorithm.

  • MADDPG uses a centralized critic with decentralized actors.

  • MADDPG supports continuous and discrete actions.

  • MADDPG supports individual rewards, so it can be used in cooperative, competitive, and mixed settings.

Background

MADDPG is an extension of DDPG to multi-agent settings, using a centralized critic. The critic network takes as input the state information and all agents’ actions \(Q(\mathbf{s},\mathbf{a};\phi)\) and outputs a single value. When using individual rewards, separate centralized critics can be used: \(Q_i(\mathbf{s},\mathbf{a};\phi_i)\). Each agent has a deterministic policy \(\mu(o_i;\theta)\). For discrete actions, we use Gumbel-Softmax to compute gradients with respect to actions.
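As a rough illustration of why Gumbel-Softmax helps with discrete actions, the sketch below samples actions with torch.nn.functional.gumbel_softmax and backpropagates through a stand-in critic. The tensor shapes, the `action_values` stand-in, and the temperature are assumptions made for the example, not values taken from the implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical actor output: logits over 5 discrete actions for 4 agents.
logits = torch.randn(4, 5, requires_grad=True)

# hard=True: one-hot samples in the forward pass; the backward pass uses the
# gradient of the relaxed (soft) sample, i.e. a straight-through estimator.
one_hot_actions = F.gumbel_softmax(logits, tau=1.0, hard=True)

# hard=False: the relaxed sample itself, a probability vector per row.
soft_actions = F.gumbel_softmax(logits, tau=1.0, hard=False)

# Stand-in for a critic: any differentiable function of the actions works here.
action_values = torch.arange(5.0)
q = (one_hot_actions * action_values).sum()
q.backward()

print(one_hot_actions.argmax(dim=-1))  # sampled action index per agent
print(logits.grad.shape)               # gradients reach the actor logits
```

Although the sampled actions are discrete, `logits.grad` is populated, which is what lets the actor be trained through the critic.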

MADDPG is off-policy; therefore, we store transitions in a replay buffer while interacting with the environment and sample batches from it to train the actor and critic. The superscript \(b\) indicates values sampled from the replay buffer.

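To train the critic, we minimize the squared TD error, where \(\theta^-\) and \(\phi^-\) denote the parameters of the target actor and target critic:

\[\left(r^b + \gamma Q(\mathbf{s}'^b,\mu(o'^b_1;\theta^-), \dots , \mu(o'^b_n;\theta^-); \phi^-) - Q(\mathbf{s}^b,a^b_1, \dots , a^b_n; \phi)\right)^2\]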
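A minimal sketch of this critic update in PyTorch is given below. The network sizes, the linear layers, and the random batch are placeholders, and the next actions are taken greedily from the target actor for brevity; none of these names come from the implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, n_agents, obs_dim, state_dim, n_actions, gamma = 8, 3, 6, 10, 4, 0.99

# Hypothetical tiny networks: a shared actor and one centralized critic.
actor = nn.Linear(obs_dim, n_actions)                    # mu(o_i; theta), as logits
critic = nn.Linear(state_dim + n_agents * n_actions, 1)  # Q(s, a; phi)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)

# Random stand-ins for a batch sampled from the replay buffer.
state = torch.randn(batch, state_dim)
next_state = torch.randn(batch, state_dim)
next_obs = torch.randn(batch, n_agents, obs_dim)
actions = F.one_hot(torch.randint(n_actions, (batch, n_agents)), n_actions).float()
reward = torch.randn(batch, 1)

with torch.no_grad():
    # Next joint action from the target actor (greedy one-hot, for brevity).
    next_logits = target_actor(next_obs)
    next_actions = F.one_hot(next_logits.argmax(dim=-1), n_actions).float()
    next_q = target_critic(torch.cat([next_state, next_actions.flatten(1)], dim=-1))
    td_target = reward + gamma * next_q

# Current estimate uses the joint action stored in the buffer.
q = critic(torch.cat([state, actions.flatten(1)], dim=-1))
critic_loss = F.mse_loss(q, td_target)  # mean squared TD error over the batch
critic_loss.backward()
```

The target is computed under `torch.no_grad()`, so only the online critic receives gradients.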

To understand how the actor of agent \(i\) is updated, consider the gradient used:

\[\nabla_{\theta} \mu_i(o_i) \, \nabla_{a_i} Q(s, a^b_1, \dots,a_i =\mu_i(o_i), \dots , a^b_n)\]

When computing the gradient of agent \(i\), all the actions are from the replay buffer, except the \(i\)-th action, which is replaced by the current policy.
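The action replacement can be sketched as follows, assuming one-hot discrete actions and a relaxed Gumbel-Softmax sample for agent \(i\). The networks, shapes, and variable names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, n_agents, obs_dim, state_dim, n_actions = 8, 3, 6, 10, 4
i = 1  # index of the agent whose actor is being updated

# Hypothetical networks for the example.
actor_i = nn.Linear(obs_dim, n_actions)
critic = nn.Linear(state_dim + n_agents * n_actions, 1)

# Random stand-ins for a batch sampled from the replay buffer.
state = torch.randn(batch, state_dim)
obs_i = torch.randn(batch, obs_dim)
buffer_actions = F.one_hot(torch.randint(n_actions, (batch, n_agents)), n_actions).float()

# Differentiable action for agent i, produced by its current policy.
action_i = F.gumbel_softmax(actor_i(obs_i), tau=1.0, hard=False)

# Keep the other agents' buffer actions and swap in agent i's current action.
joint = torch.cat(
    [buffer_actions[:, :i], action_i.unsqueeze(1), buffer_actions[:, i + 1:]], dim=1
)

# Ascending Q is implemented as minimizing -Q; in a full training loop only
# the actor's optimizer would take a step after this backward pass.
actor_loss = -critic(torch.cat([state, joint.flatten(1)], dim=-1)).mean()
actor_loss.backward()
```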

Architecture diagram

Pseudocode

Implementations

We implemented four variants of MADDPG:

  • maddpg.py: MADDPG with a single environment and MLP neural networks.

  • maddpg_multienvs.py: MADDPG with parallel environments and MLP neural networks.

  • maddpg_lstm.py: MADDPG with a single environment and recurrent neural networks.

  • maddpg_lstm_multienvs.py: MADDPG with parallel environments and recurrent neural networks.

Additional details:

  • Replay buffer: The replay buffer stores episodes instead of transitions, so we sample a batch of episodes rather than a batch of transitions. Each episode is stored as {"obs": [], "actions": [], "reward": [], "states": [], "done": [], "next_avail_actions": []}. We store next_avail_actions so that the TD target is computed with a next action that is actually available.

  • Discrete actions: The implementations currently support only discrete actions.

  • Gumbel-Softmax: We use PyTorch’s torch.nn.functional.gumbel_softmax. During episode collection and critic training, set hard=True; during actor training, set hard=False for better results.

  • Parallel environments: Parallel environments are less useful for off-policy algorithms than for on-policy ones, because training data is sampled from a replay buffer. To keep the number of network updates comparable, we train for multiple epochs in each training step, controlled by the epochs argument. We log the number of network updates under the name train/num_updates.

  • Parallel environments with RNNs: When using multiple environments in parallel, some episodes may complete before others. We track alive environments at each timestep. This is critical for RNN policies, as the hidden state is initially sized (num_envs x num_agents, hidden_dim) but only updated for (num_alive_envs x num_agents, hidden_dim) when some episodes finish.

  • RNN training: We use truncated backpropagation through time (TBPTT) to train the recurrent networks. You can set the sequence length with tbptt.
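To make the episode-level replay buffer described above more concrete, here is a small standard-library sketch. The class name, its methods, and the eviction of the oldest episode are assumptions for illustration; only the dictionary keys follow the format listed above.

```python
import random
from collections import deque


class EpisodeReplayBuffer:
    """Hypothetical episode-level buffer; keys follow the format described above."""

    KEYS = ("obs", "actions", "reward", "states", "done", "next_avail_actions")

    def __init__(self, buffer_size):
        # A bounded deque drops the oldest episode once buffer_size is reached.
        self.episodes = deque(maxlen=buffer_size)

    def new_episode(self):
        # One empty list per key; each list grows by one entry per timestep.
        return {key: [] for key in self.KEYS}

    def store(self, episode):
        self.episodes.append(episode)

    def sample(self, batch_size):
        # Sample whole episodes (not individual transitions) without replacement.
        return random.sample(list(self.episodes), batch_size)


buffer = EpisodeReplayBuffer(buffer_size=3)
for ep_id in range(5):
    episode = buffer.new_episode()
    for t in range(ep_id + 1):  # episodes may have different lengths
        episode["reward"].append(float(t))
        episode["done"].append(t == ep_id)
    buffer.store(episode)

batch = buffer.sample(batch_size=2)
print(len(buffer.episodes), [len(ep["reward"]) for ep in batch])
```

Because episodes can differ in length, a sampled batch usually has to be padded or masked before it is stacked into tensors.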

Logging

We record the following metrics:

  • rollout/ep_reward : Mean episode reward during environment rollouts.

  • rollout/ep_length : Mean episode length during rollouts.

  • rollout/num_episodes : Total number of completed episodes until the current step.

  • rollout/battle_won (SMAClite only): Fraction of battles won by the SMAC agents during rollouts.

  • train/critic_loss : The critic loss at the current optimization step.

  • train/actor_loss : The actor loss at the current optimization step.

  • train/actor_gradients : Magnitude of gradients of actor network.

  • train/critic_gradients : Magnitude of gradients of critic network.

  • train/num_updates : Total number of network updates until the current step.

  • eval/ep_reward : Mean episode reward during evaluation.

  • eval/std_ep_reward : Standard deviation of episode rewards during evaluation.

  • eval/ep_length : Mean episode length during evaluation.

  • eval/battle_won (SMAClite only): Fraction of battles won during evaluation episodes.

Documentation

class cleanmarl.maddpg.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, gamma=0.99, buffer_size=5000, batch_size=10, normalize_reward=False, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, train_freq=1, optimizer='Adam', learning_rate_actor=0.0003, learning_rate_critic=0.0003, total_timesteps=1000000, target_network_update_freq=1, polyak=0.005, clip_gradients=-1, log_every=10, eval_steps=50, num_eval_ep=5, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
Parameters:
  • env_type (str) – Type of the environment: smaclite, pz for PettingZoo, etc.

  • env_name (str) – Name of the environment (3m, simple_spread_v3, etc.)

  • env_family (str) – Environment family when using PettingZoo (sisl, mpe …).

  • agent_ids (bool) – Include agent IDs (one-hot vector) in observations.

  • gamma (float) – Discount factor for returns.

  • buffer_size (int) – Number of episodes in the replay buffer.

  • batch_size (int) – Batch size for training.

  • normalize_reward (bool) – Normalize the rewards if True.

  • actor_hidden_dim (int) – Hidden dimension of the actor network.

  • actor_num_layers (int) – Number of hidden layers in the actor network.

  • critic_hidden_dim (int) – Hidden dimension of the critic network.

  • critic_num_layers (int) – Number of hidden layers in the critic network.

  • train_freq (int) – Train the networks every train_freq episodes.

  • optimizer (str) – Optimizer for both actor and critic.

  • learning_rate_actor (float) – Learning rate for the actor network.

  • learning_rate_critic (float) – Learning rate for the critic network.

  • total_timesteps (int) – Total number of environment steps during training.

  • target_network_update_freq (int) – Update the target networks every target_network_update_freq episodes.

  • polyak (float) – Polyak coefficient for target network updates.

  • clip_gradients (float) – A value < 0 disables clipping; a value > 0 clips gradients at this value.

  • log_every (int) – Log rollout statistics every log_every episodes.

  • eval_steps (int) – Evaluate the policy every eval_steps episodes.

  • num_eval_ep (int) – Number of evaluation episodes.

  • use_wnb (bool) – Enable logging to Weights & Biases if True.

  • wnb_project (str) – Weights & Biases project name.

  • wnb_entity (str) – Weights & Biases entity name.

  • device (str) – Device to use (cpu, gpu, mps).

  • seed (int) – Random seed for reproducibility.
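The role of the polyak coefficient can be illustrated with a soft target update. The sketch assumes the common convention in which polyak weights the online parameters, so a small value such as 0.005 moves the target slowly; the actual code may define it differently.

```python
def polyak_update(target_params, online_params, polyak):
    """Return soft-updated target parameters (assumed convention, see above)."""
    return [
        (1.0 - polyak) * target + polyak * online
        for target, online in zip(target_params, online_params)
    ]


# Toy parameters: the first target weight starts away from the online weight.
target = [0.0, 1.0]
online = [1.0, 1.0]
for _ in range(3):
    target = polyak_update(target, online, polyak=0.005)

print(target)  # the first entry has moved only slightly toward 1.0
```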

class cleanmarl.maddpg_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, num_envs=4, gamma=0.99, buffer_size=5000, batch_size=10, normalize_reward=False, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, epochs=4, optimizer='Adam', learning_rate_actor=0.0003, learning_rate_critic=0.0003, total_timesteps=1000000, target_network_update_freq=1, polyak=0.01, clip_gradients=-1, log_every=10, eval_steps=50, num_eval_ep=5, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
Parameters:
  • num_envs (int) – Number of parallel environments.

  • epochs (int) – Number of batches sampled in one update.

class cleanmarl.maddpg_lstm.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, gamma=0.99, buffer_size=5000, batch_size=10, normalize_reward=False, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, train_freq=1, optimizer='Adam', learning_rate_actor=0.0006, learning_rate_critic=0.0006, total_timesteps=1000000, target_network_update_freq=1, polyak=0.005, clip_gradients=-1, tbptt=10, log_every=10, eval_steps=50, num_eval_ep=5, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
Parameters:
  • tbptt (int) – Chunk size for Truncated Backpropagation Through Time (TBPTT).
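One way to picture the tbptt chunk size is as a splitting of each episode into consecutive windows. The helper below is a hypothetical illustration, not a function from the implementation. In standard TBPTT the hidden state is carried across windows but detached at each boundary, so gradients flow through at most tbptt steps.

```python
def tbptt_chunks(episode_length, tbptt):
    """Yield (start, end) pairs splitting an episode into chunks of at most tbptt steps."""
    for start in range(0, episode_length, tbptt):
        yield start, min(start + tbptt, episode_length)


chunks = list(tbptt_chunks(episode_length=25, tbptt=10))
print(chunks)  # [(0, 10), (10, 20), (20, 25)]
```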

class cleanmarl.maddpg_lstm_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, num_envs=4, gamma=0.99, buffer_size=5000, batch_size=10, normalize_reward=False, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, optimizer='Adam', learning_rate_actor=0.0003, learning_rate_critic=0.0003, total_timesteps=1000000, target_network_update_freq=1, polyak=0.01, epochs=4, clip_gradients=-1, tbptt=10, log_every=10, eval_steps=50, num_eval_ep=5, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
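For the recurrent multi-environment variant, the alive-environment bookkeeping mentioned under Additional details can be sketched with plain lists. The row layout (environment-major), the `rnn_step` stand-in, and all variable names are assumptions made for this example.

```python
num_envs, num_agents, hidden_dim = 3, 2, 4

# One hidden row per (environment, agent) pair, flattened environment-major.
hidden = [[0.0] * hidden_dim for _ in range(num_envs * num_agents)]
alive = [True, False, True]  # environment 1 has already finished its episode


def rnn_step(rows):
    """Stand-in for the recurrent actor: adds 1 to every hidden unit."""
    return [[h + 1.0 for h in row] for row in rows]


# Rows belonging to environments that are still running.
alive_rows = [
    env * num_agents + agent
    for env in range(num_envs) if alive[env]
    for agent in range(num_agents)
]

# Run the RNN on the alive rows only, then write the results back in place.
new_rows = rnn_step([hidden[r] for r in alive_rows])
for r, new_row in zip(alive_rows, new_rows):
    hidden[r] = new_row

print(alive_rows)  # [0, 1, 4, 5]
```

Rows of finished environments keep their last hidden state, which avoids a shape mismatch between the full hidden tensor and the smaller batch of alive environments.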