Factored Multi-Agent Centralised Policy Gradients

Paper link: FACMAC

Quick facts:

FACMAC is an off-policy actor-critic algorithm.
FACMAC uses a centralized critic with decentralized actors.
FACMAC supports continuous and discrete actions (this implementation currently handles discrete actions only).

Background

FACMAC combines ideas from both QMIX and MADDPG.

Centralized critic (like QMIX): FACMAC uses an individual critic for each agent; their outputs are combined by a mixing network whose weights are generated by a hypernetwork. Unlike QMIX, the joint action-value function is not required to be monotonic in the individual critics:

\[Q^{\text{tot}}(\mathbf{s}, \mathbf{o},\mathbf{a};\phi,\psi) = g(\mathbf{s}, Q_1(o_1, a_1;\phi), \dots,Q_n(o_n, a_n;\phi); \psi)\]
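The mixing function g can be sketched in PyTorch as follows. This is a minimal illustration, not the code in facmac.py: the class name FactoredMixer, the single hidden mixing layer, the linear hypernetworks, and the ELU activation are all assumptions. The point it shows is that the state-conditioned weights are used as generated, without the absolute value QMIX applies, so monotonicity is not enforced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactoredMixer(nn.Module):
    """Hypothetical state-conditioned mixer: Q_tot = g(s, Q_1, ..., Q_n; psi)."""

    def __init__(self, n_agents: int, state_dim: int, hyper_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.hyper_dim = hyper_dim
        # Hypernetworks: map the global state to the mixer's weights and biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * hyper_dim)
        self.hyper_b1 = nn.Linear(state_dim, hyper_dim)
        self.hyper_w2 = nn.Linear(state_dim, hyper_dim)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        batch = agent_qs.shape[0]
        # No torch.abs on the generated weights: mixing may be non-monotonic.
        w1 = self.hyper_w1(state).view(batch, self.n_agents, self.hyper_dim)
        b1 = self.hyper_b1(state).view(batch, 1, self.hyper_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = self.hyper_w2(state).view(batch, self.hyper_dim, 1)
        b2 = self.hyper_b2(state).view(batch, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(batch, 1)


torch.manual_seed(0)
mixer = FactoredMixer(n_agents=3, state_dim=8)
q_tot = mixer(torch.randn(5, 3), torch.randn(5, 8))  # one joint value per sample
```

Wrapping the generated weights in torch.abs would recover a QMIX-style monotonic mixer.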

Actor-critic structure (like MADDPG): FACMAC uses deterministic policies for each agent. However, the actor update differs from MADDPG. In MADDPG, the actor loss for agent \(i\) uses the other agents’ actions sampled from the replay buffer (see MADDPG). In FACMAC, the other agents’ actions are computed from their current policies, not taken from the buffer:

\[\nabla_{\theta} \mu_i(o_i) \, \nabla_{a_i} Q(s, a_1 = \mu_1(o_1), \dots, a_i = \mu_i(o_i), \dots, a_n = \mu_n(o_n))\]
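A hedged sketch of one actor update in this style is given below. The tiny shared networks, the tensor shapes, and the plain sum used in place of the mixing network are assumptions made to keep the example short. Because this implementation handles discrete actions, the sketch relaxes them with torch.nn.functional.gumbel_softmax and hard=False so that gradients reach the actor. What matters is that every agent's action is recomputed from the current policy, and only the observations come from the buffer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_agents, obs_dim, n_actions, batch = 3, 6, 4, 8

# Hypothetical shared actor and per-agent critic (parameter sharing assumed).
actor = nn.Linear(obs_dim, n_actions)
critic = nn.Linear(obs_dim + n_actions, 1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=8e-4)

# Stand-in for observations sampled from the replay buffer.
obs = torch.randn(batch, n_agents, obs_dim)

# All agents' actions come from the CURRENT policy, not from the buffer.
# hard=False keeps the relaxed one-hot sample differentiable.
logits = actor(obs)
actions = F.gumbel_softmax(logits, tau=1.0, hard=False, dim=-1)

# Per-agent values, then a joint value (sum is a stand-in for the mixer g).
agent_qs = critic(torch.cat([obs, actions], dim=-1)).squeeze(-1)
q_tot = agent_qs.sum(dim=-1)

# Maximise the joint value; only the actor's parameters are stepped.
actor_loss = -q_tot.mean()
actor_opt.zero_grad()
actor_loss.backward()
grad_norm = actor.weight.grad.norm().item()
actor_opt.step()
```

A MADDPG-style update would instead keep the buffer's stored actions for every agent except the one being updated.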

Pseudocode

Implementations

We implemented two variants of FACMAC:

facmac.py: FACMAC with a single environment and MLP neural networks.
facmac_multienvs.py: FACMAC with parallel environments and MLP neural networks.

Additional details:

Replay buffer: The replay buffer stores episodes instead of transitions, so we sample a batch of episodes rather than a batch of transitions. Each episode is stored as {"obs": [], "actions": [], "reward": [], "states": [], "done": [], "next_avail_actions": []}. We store next_avail_actions so that TD targets are computed using only the actions available in the next step.
Discrete actions: Only discrete actions are supported for now.
Gumbel-Softmax: We use PyTorch’s torch.nn.functional.gumbel_softmax. During episode collection and critic training, set hard=True; during actor training, set hard=False for better results.
Parallel environments: Because we sample from a replay buffer, parallel environments are less useful for off-policy algorithms than for on-policy ones. To keep the number of network updates comparable, we train for multiple epochs in each training step, controlled by the epochs argument. We log the number of network updates under the name train/num_updates.
Exploration: We use the exploration strategy suggested in the COMA paper, where \(\varepsilon\) is linearly annealed over a number of training steps.

\[\pi(a_i) = (1 - \varepsilon) \, \text{softmax}(z_i) + \frac{\varepsilon}{|\mathcal{A}_i|}.\]
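The exploration rule above can be sketched in plain Python as follows. The helper names linear_epsilon and exploration_probs are hypothetical, and the schedule values are taken from the default start_e, end_e and exploration_fraction arguments. Masking of unavailable actions is left out for brevity.

```python
import math
import random


def linear_epsilon(start_e: float, end_e: float, duration: int, step: int) -> float:
    """Linearly anneal epsilon from start_e to end_e over `duration` steps."""
    fraction = min(step / duration, 1.0)
    return start_e + fraction * (end_e - start_e)


def exploration_probs(logits: list, epsilon: float) -> list:
    """pi(a) = (1 - eps) * softmax(z)(a) + eps / |A|."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    n_actions = len(logits)
    return [(1.0 - epsilon) * e / total + epsilon / n_actions for e in exps]


# Halfway through the annealing period.
eps = linear_epsilon(start_e=0.5, end_e=0.002, duration=750, step=375)
probs = exploration_probs([2.0, 0.5, -1.0], eps)

# Sample an action index from the mixed distribution.
action = random.choices(range(len(probs)), weights=probs, k=1)[0]
```

Every action keeps a probability of at least eps / |A|, so no action is ever completely ruled out while eps is positive.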
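The episode-based replay buffer described in the details above could look roughly like the sketch below. The class EpisodeReplayBuffer and the helper make_dummy_episode are illustrative names, not the classes used in facmac.py, and the stored values are placeholders. Padding episodes of different lengths into a single batch tensor is not shown.

```python
import random
from collections import deque

EPISODE_KEYS = ("obs", "actions", "reward", "states", "done", "next_avail_actions")


class EpisodeReplayBuffer:
    """Hypothetical buffer that stores whole episodes rather than transitions."""

    def __init__(self, buffer_size: int):
        # The deque drops the oldest episode once buffer_size is exceeded.
        self.episodes = deque(maxlen=buffer_size)

    def store(self, episode: dict) -> None:
        missing = [key for key in EPISODE_KEYS if key not in episode]
        if missing:
            raise KeyError(f"episode is missing keys: {missing}")
        self.episodes.append(episode)

    def sample(self, batch_size: int) -> list:
        # Draw distinct episodes uniformly at random.
        k = min(batch_size, len(self.episodes))
        return random.sample(list(self.episodes), k)

    def __len__(self) -> int:
        return len(self.episodes)


def make_dummy_episode(length: int, n_agents: int = 2, n_actions: int = 3) -> dict:
    """Build a placeholder episode with `length` steps."""
    episode = {key: [] for key in EPISODE_KEYS}
    for t in range(length):
        episode["obs"].append([[0.0] for _ in range(n_agents)])
        episode["actions"].append([0] * n_agents)
        episode["reward"].append(0.0)
        episode["states"].append([0.0])
        episode["done"].append(t == length - 1)
        episode["next_avail_actions"].append([[1] * n_actions for _ in range(n_agents)])
    return episode


buffer = EpisodeReplayBuffer(buffer_size=3)
for length in (4, 6, 5, 7, 3):
    buffer.store(make_dummy_episode(length))
batch = buffer.sample(batch_size=2)  # a list of two full episodes
```

Storing complete episodes keeps each step aligned with its next_avail_actions entry, which is what the TD target needs.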

Logging

We record the following metrics:

rollout/ep_reward : Mean episode reward during environment rollouts.
rollout/ep_length : Mean episode length during rollouts.
rollout/epsilon : Current exploration epsilon.
rollout/num_episodes : Total number of completed episodes until the current step.
rollout/battle_won (SMAClite only): Fraction of battles won during rollouts.
train/critic_loss : The critic loss at the current optimization step.
train/actor_loss : The actor loss at the current optimization step.
train/actor_gradients : Magnitude of gradients of actor network.
train/critic_gradients : Magnitude of gradients of critic network.
train/num_updates : Total number of network updates until the current step.
eval/ep_reward : Mean episode reward during evaluation.
eval/std_ep_reward : Standard deviation of episode rewards during evaluation.
eval/ep_length : Mean episode length during evaluation.
eval/battle_won (SMAClite only): Fraction of battles won during evaluation episodes.

Documentation

class cleanmarl.facmac.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, gamma=0.99, buffer_size=5000, batch_size=10, normalize_reward=False, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=64, critic_num_layers=1, hyper_dim=32, train_freq=1, optimizer='Adam', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, target_network_update_freq=1, polyak=0.005, log_every=10, clip_gradients=-1, start_e=0.5, end_e=0.002, exploration_fraction=750, eval_steps=50, num_eval_ep=5, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)

Parameters:

env_type (str) – Type of the environment: smaclite, pz for PettingZoo, etc.
env_name (str) – Name of the environment (3m, simple_spread_v3, etc.)
env_family (str) – Environment family when using PettingZoo (sisl, mpe …).
agent_ids (bool) – Include agent IDs (one-hot vector) in observations.
gamma (float) – Discount factor for returns.
buffer_size (int) – Number of episodes in the replay buffer.
batch_size (int) – Batch size for training.
normalize_reward (bool) – Normalize the rewards if True.
actor_hidden_dim (int) – Hidden dimension of the actor network.
actor_num_layers (int) – Number of hidden layers in the actor network.
critic_hidden_dim (int) – Hidden dimension of the critic network.
critic_num_layers (int) – Number of hidden layers in the critic network.
hyper_dim (int) – Hidden dimension of the hyper-network.
train_freq (int) – Train the networks every train_freq episodes.
optimizer (str) – Optimizer for both actor and critic.
learning_rate_actor (float) – Learning rate for the actor network.
learning_rate_critic (float) – Learning rate for the critic network.
total_timesteps (int) – Total number of environment steps during training.
target_network_update_freq (int) – Update the target networks every target_network_update_freq episodes.
polyak (float) – Polyak coefficient for target network updates.
clip_gradients (float) – A negative value disables clipping; a positive value clips gradients at clip_gradients.
start_e (float) – The starting value of epsilon.
end_e (float) – The end value of epsilon.
exploration_fraction (float) – Number of training steps to go from start_e to end_e.
log_every (int) – Log rollout stats every log_every episodes.
eval_steps (int) – Evaluate the policy every eval_steps episodes.
num_eval_ep (int) – Number of evaluation episodes.
use_wnb (bool) – Logging to Weights & Biases if True.
wnb_project (str) – Weights & Biases project name.
wnb_entity (str) – Weights & Biases entity name.
device (str) – Device (cpu, gpu, mps).
seed (int) – Random seed.
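As an illustration of how the polyak argument might be used, the sketch below performs a soft target-network update in PyTorch. The function name soft_update is hypothetical, and the convention target <- (1 - polyak) * target + polyak * online is an assumption suggested by the small default value of 0.005; the actual implementation may differ.

```python
import copy

import torch
import torch.nn as nn


@torch.no_grad()
def soft_update(target: nn.Module, online: nn.Module, polyak: float) -> None:
    """target <- (1 - polyak) * target + polyak * online (assumed convention)."""
    for t_param, o_param in zip(target.parameters(), online.parameters()):
        t_param.mul_(1.0 - polyak).add_(o_param, alpha=polyak)


online = nn.Linear(4, 2)
target = copy.deepcopy(online)

# Pretend the online network was trained: shift its weights by 1.
with torch.no_grad():
    online.weight.add_(1.0)

soft_update(target, online, polyak=0.005)

# The target moved 0.5% of the way toward the online network.
gap = (online.weight - target.weight).abs().max().item()
```

With a small polyak value, the target networks trail the online networks slowly, which tends to stabilise the TD targets.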

class cleanmarl.facmac_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', num_envs=4, agent_ids=True, gamma=0.99, buffer_size=5000, batch_size=10, normalize_reward=False, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, hyper_dim=32, train_freq=1, optimizer='AdamW', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, target_network_update_freq=1, polyak=0.005, log_every=10, eval_steps=50, num_eval_ep=5, epochs=4, clip_gradients=-1, start_e=0.5, end_e=0.002, exploration_fraction=750, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)

Parameters:

num_envs (int) – Number of parallel environments.
epochs (int) – Number of batches sampled in one update.