Factored Multi-Agent Centralised Policy Gradients
Paper link: MADDPG
- Quick facts:
FACMAC is an off-policy actor-critic algorithm.
FACMAC uses a centralized critic with decentralized actors.
FACMAC support continuous and discrete actions.
Background
FACMAC combines ideas from both QMIX and MADDPG.
Centralized critic (like QMIX): FACMAC uses individual critics for each agent, which are combined using a mixing network whose weights are generated by a hypernetwork. The joint-action-value function is not required to respect the monotonicity with respect to individual critics. :
Actor-critic structure (like MADDPG): FACMAC uses deterministic policies for each agent. However, the actor update differs from MADDPG: In MADDPG, the actor loss for agent \(i\) uses the other agents’ actions sampled from the replay buffer (see MADDPG). In contrast, in FACMAC, the other agents’ actions are sampled from their current policies, not from the buffer.
Pseudocode
Implementations
We implemented two variants of FACMAC:
facmac.py: FACMAC with a single environment and MLP neural networks.facmac_multienvs.py: FACMAC with parallel environments and MLP neural networks.
Additional details:
Replay buffer: The replay buffer stores episodes instead of transitions, therefore, we sample batch of episodes rather than a batch of transitions. Each episode is stored as
{"obs": [],"actions":[],"reward":[],"states":[],"done":[],"next_avail_actions":[]}. We storenext_avail_actionsto accurately compute TD targets for the best available next action.Discrete actions: we only support discrete actions for now
Gumbel-Softmax: We use PyTorch’s
torch.nn.functional.gumbel_softmax. During episode collection and critic training, sethard=True; during actor training, sethard=Falsefor better results.Parallel environments: Parallel environments are not as useful for off-policy algorithms as for on-policy settings as we sample from a replay buffer. In order to keep comparable number of network updates, we train for multiple epochs in each training step by adding a
n_epochsargument. We log the number of network updates under the nametrain/num_updates.Exploration: We use the exploration strategy suggested in COMA paper. \(ε\) is linearly annealed across a number of training steps.
Logging
We record the following metrics:
rollout/ep_reward : Mean episode reward during environment rollouts.
rollout/ep_length : Mean episode length during rollouts.
rollout/epsilon : Current exploration epsilon.
rollout/num_episodes : Total number of completed episodes until the current step.
rollout/battle_won (SMAClite only): Fraction of battle won by SMAC agents
train/critic_loss : The critic loss at the current optimization step.
train/actor_loss : The actor loss at the current optimization step.
train/actor_gradients : Magnitude of gradients of actor network.
train/critic_gradients : Magnitude of gradients of critic network.
train/num_updates : Total number of network updates until the current step.
eval/ep_reward : Mean episode reward during evaluation.
eval/std_ep_reward : Standard deviation of episode rewards during evaluation.
eval/ep_length : Mean episode length during evaluation.
eval/battle_won ( SMAClite only): Fraction of battles won during evaluation episodes.
Documentation
- class cleanmarl.facmac.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, gamma=0.99, buffer_size=5000, batch_size=10, normalize_reward=False, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=64, critic_num_layers=1, hyper_dim=32, train_freq=1, optimizer='Adam', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, target_network_update_freq=1, polyak=0.005, log_every=10, clip_gradients=-1, start_e=0.5, end_e=0.002, exploration_fraction=750, eval_steps=50, num_eval_ep=5, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
- Parameters:
env_type (str) – Type of the environment:
smaclite,pzfor PettingZoo, etc.env_name (str) – Name of the environment (
3m,simple_spread_v3, etc.)env_family (str) – Environment family when using PettingZoo (
sisl,mpe…).agent_ids (bool) – Include agent IDs (one-hot vector) in observations.
gamma (float) – Discount factor for returns.
buffer_size (int) – Number of episodes in the replay buffer.
batch_size (int) – Batch size for training.
normalize_reward (bool) – Normalize the rewards if True.
actor_hidden_dim (int) – Hidden dimension of the actor network.
actor_num_layers (int) – Number of hidden layers in the actor network.
critic_hidden_dim (int) – Hidden dimension of the critic network.
critic_num_layers (int) – Number of hidden layers in the critic network.
hyper_dim (int) – Hidden dimension of the hyper-network.
train_freq (int) – Train the network each
train_freqepisodes.optimizer (str) – Optimizer for both actor and critic.
learning_rate_actor (float) – Learning rate for the actor network.
learning_rate_critic (float) – Learning rate for the critic network.
total_timesteps (int) – Total number of environment steps during training.
target_network_update_freq (int) – Update the target network each
target_network_update_freqepisodepolyak (float) – Polyak coefficient for target network updates.
clip_gradients (float) –
0<for no clipping and0>to clip gradients atclip_gradients.start_e (float) – The starting value of epsilon.
end_e (float) – The end value of epsilon.
exploration_fraction (float) – Number of training steps to go from
start_etoend_e.log_every (int) – Log rollout stats every
log_everyepisode.eval_steps (int) – Evaluate the policy each
eval_stepsepisode.num_eval_ep (int) – Number of evaluation episodes.
use_wnb (bool) – Logging to Weights & Biases if True.
wnb_project (str) – Weights & Biases project name.
wnb_entity (str) – Weights & Biases entity name.
device (str) – Device (
cpu,gpu,mps).seed (int) – Random seed.
- class cleanmarl.facmac_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', num_envs=4, agent_ids=True, gamma=0.99, buffer_size=5000, batch_size=10, normalize_reward=False, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, hyper_dim=32, train_freq=1, optimizer='AdamW', learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, target_network_update_freq=1, polyak=0.005, log_every=10, eval_steps=50, num_eval_ep=5, epochs=4, clip_gradients=-1, start_e=0.5, end_e=0.002, exploration_fraction=750, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
- Parameters:
num_envs (int) – Number of parallel environments
epochs – Number of batches sampled in one update