Factored Multi-Agent Centralised Policy Gradients
=================================================

    - Paper link: `FACMAC <https://arxiv.org/abs/2003.06709>`_

Quick facts:
    - FACMAC is an off-policy actor-critic algorithm.
    - FACMAC uses a centralized critic with decentralized actors.
    - FACMAC supports continuous and discrete actions.

Background
----------

FACMAC combines ideas from both QMIX and MADDPG. 

- **Centralized critic (like QMIX)**: FACMAC uses an individual critic for each agent; these are combined by a mixing network whose weights are generated by a hypernetwork. Unlike QMIX, the joint action-value function is not required to be monotonic in the individual critics:

.. math::
    Q^{\text{tot}}(\mathbf{s}, \mathbf{o},\mathbf{a};\phi,\psi) = g(\mathbf{s}, Q_1(o_1, a_1;\phi), \dots,Q_n(o_n, a_n;\phi); \psi)
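
As a rough illustration of the factorisation above, the sketch below mixes per-agent critic values with state-dependent weights. It is a one-layer toy in plain Python, not the repository's mixing network; the names ``hyper_linear`` and ``mix`` and all numbers are hypothetical. The ``monotonic`` flag applies ``abs()`` to the weights as QMIX does, while the default leaves them unconstrained as in FACMAC.

.. code-block:: python

    def hyper_linear(state, weight_rows, bias):
        """Linear hypernetwork: maps the global state to one mixing weight per agent."""
        return [sum(w * s for w, s in zip(row, state)) + b
                for row, b in zip(weight_rows, bias)]


    def mix(state, agent_qs, weight_rows, bias, monotonic=False):
        """One-layer mixer: Q_tot = sum_i w_i(s) * Q_i.

        monotonic=True applies abs() to the weights (QMIX-style constraint);
        monotonic=False leaves them unconstrained (FACMAC-style).
        """
        weights = hyper_linear(state, weight_rows, bias)
        if monotonic:
            weights = [abs(w) for w in weights]
        return sum(w * q for w, q in zip(weights, agent_qs))


    # Toy setup: 2 agents, 2-dimensional state (all values are illustrative).
    state = [1.0, 0.5]
    weight_rows = [[1.0, 0.0], [-2.0, 0.0]]  # second agent gets a negative weight
    bias = [0.0, 0.0]

    low = mix(state, [1.0, 1.0], weight_rows, bias)
    high = mix(state, [1.0, 2.0], weight_rows, bias)  # raise Q_2 only

With unconstrained weights, raising :math:`Q_2` lowers the mixed value here (``high < low``), which a monotonic mixer cannot represent.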


- **Actor-critic structure (like MADDPG)**: FACMAC uses deterministic policies for each agent. However, the actor update differs from MADDPG: In MADDPG, the actor loss for agent :math:`i` uses the other agents’ actions sampled from the replay buffer (see :doc:`MADDPG <maddpg>`). In contrast, in FACMAC, the other agents’ actions are sampled from their **current policies**, not from the buffer.

.. math::

   \nabla_{\theta} \mu_i(o_i) \, \nabla_{a_i} Q(s, a_1 = \mu_1(o_1), \dots, a_i = \mu_i(o_i), \dots, a_n = \mu_n(o_n))
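
The difference between the two actor updates comes down to how the joint action passed to the critic is assembled. The following is a minimal sketch of that step only, with plain Python callables standing in for the actor networks; the helper names ``joint_action_maddpg`` and ``joint_action_facmac`` are hypothetical and do not come from the implementation.

.. code-block:: python

    def joint_action_maddpg(i, policies, obs, buffer_actions):
        """MADDPG-style: only agent i's action comes from its current policy;
        the other agents' actions are taken from the replay buffer."""
        actions = list(buffer_actions)
        actions[i] = policies[i](obs[i])
        return actions


    def joint_action_facmac(policies, obs):
        """FACMAC-style: every agent's action comes from its current policy."""
        return [policy(o) for policy, o in zip(policies, obs)]


    # Toy deterministic policies mu_i(o_i) (illustrative only).
    policies = [lambda o: 2.0 * o, lambda o: o + 1.0, lambda o: -o]
    obs = [0.5, 1.0, 2.0]
    buffer_actions = [9.0, 9.0, 9.0]  # stale actions stored at collection time

    maddpg_actions = joint_action_maddpg(0, policies, obs, buffer_actions)
    facmac_actions = joint_action_facmac(policies, obs)

In the MADDPG-style call, the entries for agents 1 and 2 remain the buffered values, whereas the FACMAC-style call recomputes all three actions from the current policies.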


.. image:: ../_static/facmac_network.png
   :alt: Architecture diagram
   :width: 500px
   :align: center


Pseudocode
----------
.. image:: ../_static/facmac_algorithm.svg
   :alt: Algorithm pseudocode
   :width: 100%
   :align: center

Implementations
---------------

We implemented two variants of FACMAC:

- ``facmac.py``: FACMAC with a single environment and MLP neural networks.
- ``facmac_multienvs.py``: FACMAC with parallel environments and MLP neural networks.

Additional details:

- **Replay buffer**: The replay buffer stores episodes instead of transitions, so we sample a batch of episodes rather than a batch of transitions. Each episode is stored as ``{"obs": [],"actions":[],"reward":[],"states":[],"done":[],"next_avail_actions":[]}``. We store ``next_avail_actions`` so that TD targets are computed only over actions that are available at the next step.
- **Discrete actions**: Only discrete actions are supported for now.
- **Gumbel-Softmax**: We use PyTorch's ``torch.nn.functional.gumbel_softmax``. During episode collection and critic training, set ``hard=True``; during actor training, set ``hard=False`` for better results. 
- **Parallel environments**: Parallel environments are less useful for off-policy algorithms than for on-policy ones, since training samples come from a replay buffer. To keep the number of network updates comparable, we train for multiple epochs in each training step, controlled by the ``epochs`` argument. The number of network updates is logged as ``train/num_updates``.
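- **Exploration**: We use the exploration strategy suggested in the COMA paper, which mixes the softmax over the actor outputs :math:`z_i` with a uniform distribution over the action set :math:`\mathcal{A}_i`. :math:`\varepsilon` is linearly annealed over a fixed number of training steps.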

.. math::

    \pi(a_i) = (1 - \varepsilon) \, \text{softmax}(z_i) + \frac{\varepsilon}{|\mathcal{A}_i|}.
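
The exploration rule above can be sketched in plain Python as follows. This is an illustrative sketch rather than the code in ``facmac.py``: the function names are hypothetical, masking of unavailable actions is omitted, and the schedule values are taken from the default ``start_e``, ``end_e`` and ``exploration_fraction`` arguments.

.. code-block:: python

    import math


    def linear_epsilon(step, start_e, end_e, anneal_steps):
        """Linearly anneal epsilon from start_e to end_e over anneal_steps steps."""
        fraction = min(step / anneal_steps, 1.0)
        return start_e + fraction * (end_e - start_e)


    def exploration_policy(logits, epsilon):
        """Mix softmax(logits) with a uniform distribution over the actions."""
        max_logit = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - max_logit) for z in logits]
        total = sum(exps)
        n = len(logits)
        return [(1.0 - epsilon) * e / total + epsilon / n for e in exps]


    # Halfway through the annealing schedule (assumed default values).
    eps = linear_epsilon(step=375, start_e=0.5, end_e=0.002, anneal_steps=750)
    probs = exploration_policy([2.0, 0.0, -1.0], eps)

Every action keeps a probability of at least :math:`\varepsilon / |\mathcal{A}_i|`, so no action is ever ruled out while :math:`\varepsilon > 0`.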

Logging
-------

We record the following metrics:

- **rollout/ep_reward** : Mean episode reward during environment rollouts.
- **rollout/ep_length** : Mean episode length during rollouts.
- **rollout/epsilon** : Current exploration epsilon.
- **rollout/num_episodes** : Total number of completed episodes until the current step.
- **rollout/battle_won** (SMAClite only): Fraction of battles won by the agents during rollouts.
- **train/critic_loss** : The critic loss at the current optimization step.
- **train/actor_loss** : The actor loss at the current optimization step.
- **train/actor_gradients** : Magnitude of the gradients of the actor network.
- **train/critic_gradients** : Magnitude of the gradients of the critic network.
- **train/num_updates** : Total number of network updates until the current step.
- **eval/ep_reward** : Mean episode reward during evaluation.
- **eval/std_ep_reward** : Standard deviation of episode rewards during evaluation.
- **eval/ep_length** : Mean episode length during evaluation.
- **eval/battle_won** (SMAClite only): Fraction of battles won during evaluation episodes.

Documentation
-------------

.. py:class:: cleanmarl.facmac.Args(env_type="smaclite", env_name="3m", env_family="mpe", agent_ids=True, gamma=0.99, buffer_size=5000, batch_size=10, normalize_reward=False, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=64, critic_num_layers=1, hyper_dim=32, train_freq=1, optimizer="Adam", learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, target_network_update_freq=1, polyak=0.005, log_every=10, clip_gradients=-1, start_e=0.5, end_e=0.002, exploration_fraction=750, eval_steps=50, num_eval_ep=5, use_wnb=False, wnb_project="", wnb_entity="", device="cpu", seed=1)

    :param env_type: Type of the environment: ``smaclite``, ``pz`` for PettingZoo, etc.
    :type env_type: str

    :param env_name: Name of the environment (``3m``, ``simple_spread_v3``, etc.)
    :type env_name: str

    :param env_family: Environment family when using PettingZoo (``sisl``, ``mpe`` ...).
    :type env_family: str

    :param agent_ids: Include agent IDs (one-hot vector) in observations.
    :type agent_ids: bool

    :param gamma: Discount factor for returns.
    :type gamma: float

    :param buffer_size: Number of episodes in the replay buffer.
    :type buffer_size: int

    :param batch_size: Batch size for training.
    :type batch_size: int

    :param normalize_reward: Normalize the rewards if True.
    :type normalize_reward: bool

    :param actor_hidden_dim: Hidden dimension of the actor network.
    :type actor_hidden_dim: int

    :param actor_num_layers: Number of hidden layers in the actor network.
    :type actor_num_layers: int

    :param critic_hidden_dim: Hidden dimension of the critic network.
    :type critic_hidden_dim: int

    :param critic_num_layers: Number of hidden layers in the critic network.
    :type critic_num_layers: int

    :param hyper_dim: Hidden dimension of the hyper-network.
    :type hyper_dim: int

    :param train_freq: Train the networks every ``train_freq`` episodes.
    :type train_freq: int

    :param optimizer: Optimizer for both actor and critic.
    :type optimizer: str

    :param learning_rate_actor: Learning rate for the actor network.
    :type learning_rate_actor: float

    :param learning_rate_critic: Learning rate for the critic network.
    :type learning_rate_critic: float

    :param total_timesteps: Total number of environment steps during training.
    :type total_timesteps: int

    :param target_network_update_freq: Update the target networks every ``target_network_update_freq`` episodes.
    :type target_network_update_freq: int

    :param polyak: Polyak coefficient for target network updates.
    :type polyak: float

    :param clip_gradients: A negative value disables gradient clipping; a positive value clips gradients at ``clip_gradients``.
    :type clip_gradients: float

    :param start_e: The starting value of epsilon.
    :type start_e: float

    :param end_e: The end value of epsilon.
    :type end_e: float

    :param exploration_fraction: Number of training steps to go from ``start_e`` to ``end_e``.
    :type exploration_fraction: float

    :param log_every: Log rollout stats every ``log_every`` episodes.
    :type log_every: int

    :param eval_steps: Evaluate the policy every ``eval_steps`` episodes.
    :type eval_steps: int

    :param num_eval_ep: Number of evaluation episodes.
    :type num_eval_ep: int

    :param use_wnb: Logging to Weights & Biases if True.
    :type use_wnb: bool

    :param wnb_project: Weights & Biases project name.
    :type wnb_project: str

    :param wnb_entity: Weights & Biases entity name.
    :type wnb_entity: str

    :param device: Device (``cpu``, ``gpu``, ``mps``).
    :type device: str

    :param seed: Random seed.
    :type seed: int
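
The ``polyak`` argument controls how quickly the target networks track the online networks. The sketch below shows a soft update on plain lists of numbers, assuming ``polyak`` is the weight given to the online parameters (consistent with the small default of 0.005); the function name ``soft_update`` is hypothetical and the actual implementation operates on network parameters.

.. code-block:: python

    def soft_update(target_params, online_params, polyak):
        """Return (1 - polyak) * target + polyak * online, element-wise."""
        return [(1.0 - polyak) * t + polyak * p
                for t, p in zip(target_params, online_params)]


    target = [0.0, 1.0]
    online = [1.0, 1.0]
    for _ in range(3):  # three consecutive target updates
        target = soft_update(target, online, polyak=0.005)

After three updates the first target parameter has closed only about 1.5% of the gap to the online value, which shows how slowly the targets move at this setting.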

.. py:class:: cleanmarl.facmac_multienvs.Args(env_type="smaclite", env_name="3m", env_family="mpe", num_envs=4, agent_ids=True, gamma=0.99, buffer_size=5000, batch_size=10, normalize_reward=False, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, hyper_dim=32, train_freq=1, optimizer="AdamW", learning_rate_actor=0.0008, learning_rate_critic=0.0008, total_timesteps=1000000, target_network_update_freq=1, polyak=0.005, log_every=10, eval_steps=50, num_eval_ep=5, epochs=4, clip_gradients=-1, start_e=0.5, end_e=0.002, exploration_fraction=750, use_wnb=False, wnb_project="", wnb_entity="", device="cpu", seed=1)

    :param num_envs: Number of parallel environments.
    :type num_envs: int

    :param epochs: Number of batches sampled in one update.
    :type epochs: int