Factored Multi-Agent Centralised Policy Gradients
=================================================

- Paper link: `MADDPG `_

Quick facts:

- FACMAC is an off-policy actor-critic algorithm.
- FACMAC uses a centralized critic with decentralized actors.
- FACMAC supports continuous and discrete actions.

Background
----------

FACMAC combines ideas from both QMIX and MADDPG.

- **Centralized critic (like QMIX)**: FACMAC uses an individual critic for each agent. These critics are combined by a mixing network whose weights are generated by a hypernetwork. Unlike in QMIX, the joint action-value function is not required to be monotonic in the individual critics:

  .. math::

     Q^{\text{tot}}(\mathbf{s}, \mathbf{o},\mathbf{a};\phi,\psi) = g(\mathbf{s}, Q_1(o_1, a_1;\phi), \dots,Q_n(o_n, a_n;\phi); \psi)

- **Actor-critic structure (like MADDPG)**: FACMAC uses deterministic policies for each agent. However, the actor update differs from MADDPG. In MADDPG, the actor loss for agent :math:`i` uses the other agents’ actions sampled from the replay buffer (see :doc:`MADDPG `). In FACMAC, the other agents’ actions are taken from their **current policies**, not from the buffer. The actor gradient for agent :math:`i` is therefore evaluated at the joint action produced by the current policies:

  .. math::

     \nabla_{\theta} \mu_i(o_i) \, \nabla_{a_i} Q^{\text{tot}}(\mathbf{s}, a_1 = \mu_1(o_1), \dots, a_i = \mu_i(o_i), \dots, a_n = \mu_n(o_n))

.. image:: ../_static/facmac_network.png
   :alt: Architecture diagram
   :width: 500px
   :align: center

Pseudocode
----------

.. image:: ../_static/facmac_algorithm.svg
   :alt: FACMAC pseudocode
   :width: 100%
   :align: center

Implementations
---------------

We implemented two variants of FACMAC:

- ``facmac.py``: FACMAC with a single environment and MLP neural networks.
- ``facmac_multienvs.py``: FACMAC with parallel environments and MLP neural networks.

Additional details:

- **Replay buffer**: The replay buffer stores episodes instead of transitions, so we sample a batch of episodes rather than a batch of transitions.
  Each episode is stored as ``{"obs": [],"actions":[],"reward":[],"states":[],"done":[],"next_avail_actions":[]}``. We store ``next_avail_actions`` so that the TD target is computed with the best *available* next action.
- **Discrete actions**: This implementation currently supports discrete actions only.
- **Gumbel-Softmax**: We use PyTorch's ``torch.nn.functional.gumbel_softmax``. During episode collection and critic training, set ``hard=True``; during actor training, set ``hard=False`` for better results.
- **Parallel environments**: Parallel environments are less useful for off-policy algorithms than for on-policy ones, because training samples come from a replay buffer. To keep the number of network updates comparable, we train for multiple epochs in each training step, controlled by the ``epochs`` argument. We log the number of network updates under the name ``train/num_updates``.
- **Exploration**: We use the exploration strategy suggested in the COMA paper, where :math:`\varepsilon` is linearly annealed over a number of training steps:

  .. math::

     \pi(a_i) = (1 - \varepsilon) \, \text{softmax}(z_i) + \frac{\varepsilon}{|\mathcal{A}_i|}.

Logging
-------

We record the following metrics:

- **rollout/ep_reward** : Mean episode reward during environment rollouts.
- **rollout/ep_length** : Mean episode length during rollouts.
- **rollout/epsilon** : Current exploration epsilon.
- **rollout/num_episodes** : Total number of completed episodes until the current step.
- **rollout/battle_won** (SMAClite only): Fraction of battles won by SMAC agents.
- **train/critic_loss** : The critic loss at the current optimization step.
- **train/actor_loss** : The actor loss at the current optimization step.
- **train/actor_gradients** : Magnitude of the gradients of the actor network.
- **train/critic_gradients** : Magnitude of the gradients of the critic network.
- **train/num_updates** : Total number of network updates until the current step.
- **eval/ep_reward** : Mean episode reward during evaluation.
- **eval/std_ep_reward** : Standard deviation of episode rewards during evaluation.
- **eval/ep_length** : Mean episode length during evaluation.
- **eval/battle_won** (SMAClite only): Fraction of battles won during evaluation episodes.

Documentation
-------------

.. py:class:: cleanmarl.facmac.Args(env_type="smaclite", env_name="3m", env_family="mpe", agent_ids=True, gamma=0.99, buffer_size=5000, batch_size=10, normalize_reward=False, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=64, critic_num_layers=1, hyper_dim=32, train_freq=1, optimizer="Adam", learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, target_network_update_freq=1, polyak=0.005, log_every=10, clip_gradients=-1, start_e=0.5, end_e=0.002, exploration_fraction=750, eval_steps=50, num_eval_ep=5, use_wnb=False, wnb_project="", wnb_entity="", device="cpu", seed=1)

   :param env_type: Type of the environment: ``smaclite``, ``pz`` for PettingZoo, etc.
   :type env_type: str
   :param env_name: Name of the environment (``3m``, ``simple_spread_v3``, etc.)
   :type env_name: str
   :param env_family: Environment family when using PettingZoo (``sisl``, ``mpe`` ...).
   :type env_family: str
   :param agent_ids: Include agent IDs (one-hot vector) in observations.
   :type agent_ids: bool
   :param gamma: Discount factor for returns.
   :type gamma: float
   :param buffer_size: Number of episodes in the replay buffer.
   :type buffer_size: int
   :param batch_size: Batch size for training.
   :type batch_size: int
   :param normalize_reward: Normalize the rewards if True.
   :type normalize_reward: bool
   :param actor_hidden_dim: Hidden dimension of the actor network.
   :type actor_hidden_dim: int
   :param actor_num_layers: Number of hidden layers in the actor network.
   :type actor_num_layers: int
   :param critic_hidden_dim: Hidden dimension of the critic network.
   :type critic_hidden_dim: int
   :param critic_num_layers: Number of hidden layers in the critic network.
   :type critic_num_layers: int
   :param hyper_dim: Hidden dimension of the hyper-network.
   :type hyper_dim: int
   :param train_freq: Train the network every ``train_freq`` episodes.
   :type train_freq: int
   :param optimizer: Optimizer for both actor and critic.
   :type optimizer: str
   :param learning_rate_actor: Learning rate for the actor network.
   :type learning_rate_actor: float
   :param learning_rate_critic: Learning rate for the critic network.
   :type learning_rate_critic: float
   :param total_timesteps: Total number of environment steps during training.
   :type total_timesteps: int
   :param target_network_update_freq: Update the target network every ``target_network_update_freq`` episodes.
   :type target_network_update_freq: int
   :param polyak: Polyak coefficient for target network updates.
   :type polyak: float
   :param clip_gradients: ``<0`` for no clipping and ``>0`` to clip gradients at ``clip_gradients``.
   :type clip_gradients: float
   :param start_e: The starting value of epsilon.
   :type start_e: float
   :param end_e: The end value of epsilon.
   :type end_e: float
   :param exploration_fraction: Number of training steps to go from ``start_e`` to ``end_e``.
   :type exploration_fraction: float
   :param log_every: Log rollout stats every ``log_every`` episodes.
   :type log_every: int
   :param eval_steps: Evaluate the policy every ``eval_steps`` episodes.
   :type eval_steps: int
   :param num_eval_ep: Number of evaluation episodes.
   :type num_eval_ep: int
   :param use_wnb: Log to Weights & Biases if True.
   :type use_wnb: bool
   :param wnb_project: Weights & Biases project name.
   :type wnb_project: str
   :param wnb_entity: Weights & Biases entity name.
   :type wnb_entity: str
   :param device: Device (``cpu``, ``gpu``, ``mps``).
   :type device: str
   :param seed: Random seed.
   :type seed: int

.. py:class:: cleanmarl.facmac_multienvs.Args(env_type="smaclite", env_name="3m", env_family="mpe", num_envs=4, agent_ids=True, gamma=0.99, buffer_size=5000, batch_size=10, normalize_reward=False, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, hyper_dim=32, train_freq=1, optimizer="AdamW", learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, target_network_update_freq=1, polyak=0.005, log_every=10, eval_steps=50, num_eval_ep=5, epochs=4, clip_gradients=-1, start_e=0.5, end_e=0.002, exploration_fraction=750, use_wnb=False, wnb_project="", wnb_entity="", device="cpu", seed=1)

   :param num_envs: Number of parallel environments.
   :type num_envs: int
   :param epochs: Number of batches sampled in one update.
   :type epochs: int
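
The ``start_e``, ``end_e`` and ``exploration_fraction`` arguments documented above control the exploration rule given under Implementations. The following is a minimal sketch of how such a schedule and the epsilon-smoothed softmax policy could fit together. It is not the code from ``facmac.py``: the function names ``linear_epsilon`` and ``exploration_probs`` and the ``anneal_steps`` parameter (standing in for ``exploration_fraction``) are hypothetical, the step counter is assumed to count training steps, and action masking and Gumbel-Softmax sampling are left out for brevity.

```python
import math


def linear_epsilon(step, start_e=0.5, end_e=0.002, anneal_steps=750):
    """Anneal epsilon linearly from start_e to end_e, then hold it at end_e.

    Defaults mirror the documented Args defaults; anneal_steps is a
    hypothetical stand-in for exploration_fraction.
    """
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return start_e + frac * (end_e - start_e)


def exploration_probs(logits, epsilon):
    """Return (1 - eps) * softmax(logits) + eps / |A| for one agent."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    uniform = epsilon / len(logits)  # the eps / |A_i| term
    return [(1.0 - epsilon) * e / total + uniform for e in exps]


if __name__ == "__main__":
    # Illustrative logits for a single agent with three actions.
    for step in (0, 375, 750, 2000):
        eps = linear_epsilon(step)
        probs = exploration_probs([2.0, 0.5, -1.0], eps)
        print(step, round(eps, 4), [round(p, 3) for p in probs])
```

With a large epsilon the distribution moves toward uniform, and as epsilon approaches ``end_e`` it approaches the plain softmax over the actor outputs.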