Value-Decomposition Networks For Cooperative Multi-Agent Learning
=================================================================

    - Paper link:  `VDN <https://arxiv.org/abs/1706.05296>`_ 

Quick facts:
    - VDN is an off-policy and value-based algorithm.
    - VDN works only for discrete actions.
    - Requires a common reward.
    - Uses additive factorization.

Key ideas:
    - VDN learns centralized action-value function :math:`Q^{tot}` decomposed into the same of individual :math:`Q_i` networks

    .. math::

        Q(\mathbf{o}, \mathbf{a}) = \sum_{i \in I} Q_i(o_i, a_i).


    - This factorization allows us to have decentralized policies.

    - :math:`Q_i` networks are refereed to as *utility networks* rather than action-value function, as they don't satisfy the Bellman equation. Instead,  :math:`Q^{tot}` is a true action-value function.


Background
----------


VDN is based on Q-learning and works in settings with a common reward :math:`r`.  
Let’s first forget about VDN for a while and focus on how Q-learning can solve the cooperative MARL problem. There are two main approaches, each with its pros and cons.


The first approach is to use a single-agent RL algorithm. This treats the system as **one central agent** that receives the joint observation :math:`\mathbf{o}_t`, and whose action is the joint action :math:`\mathbf{a}_t`.  
We then estimate :math:`Q(\mathbf{o}_t, \mathbf{a}_t; \theta)` as in DQN. The loss function is:

.. math::

   L(\theta) =  (y_t - Q(\mathbf{o}_t, \mathbf{a}_t; \theta))^2 \tag{1}

where:

.. math::
    
    y_t = r_t + \gamma(1-done) \max_{\mathbf{a'}_t} Q(\mathbf{o}_{t+1}, \mathbf{a'}_t; \theta^{-})

- **Pros:** The loss function backpropagates using the team reward, which is strongly related to the input of the Q-network: the joint action :math:`a`.
- **Cons:** The Q-network takes as input the joint observation :math:`o` and outputs the joint action :math:`a`. This leads to extremely large inputs, and the output size grows exponentially with the number of agents.

The second approach is **independent Q-learning (IQL)**.  
Each agent trains its own Q-learning algorithm using only its local observation :math:`o_i` and local actions :math:`a_i`.  
We therefore have :math:`n` independent loss functions to optimize:


.. math::
    
    L_i(\theta) =  (y_i^t - Q_i(o_i^t, a_i^t; \theta))^2 \tag{3}

.. math::
    
     y_i^t =r_t + \gamma(1-done)\max_{a'_i} Q(o_i^{t+1}, a'_i; \theta^{-}) 


- **Pros:** Each Q-network is trained using local observations :math:`o_i` and actions :math:`a_i`, making training more efficient and easier to deploy.
- **Cons:** Each Q-network backpropagates a reward signal that depends on the actions of other agents rather than its own.


The idea of VDN is to **combine these two approaches**.  
We want to train decentralized networks using local observations :math:`o_i` and actions :math:`a_i` while still backpropagating the common reward through a loss function that aggregates all agents.

To achieve this, we assume a centralized Q-network :math:`Q(\mathbf{o}^t, \mathbf{a}^t; \theta)` that can be written as:


.. math::
    
     Q(\mathbf{o}^t, \mathbf{a}^t; \theta) = \sum_{i \in I} Q_i(o_i^t, a_i^t; \theta) \tag{5}

and we use the following loss function: 

.. math::
    
     L(\theta) = \left( r^t + \gamma \max_{\mathbf{a'} \in A} Q(\mathbf{o}^{t+1}, \mathbf{a'}; \theta^{-}) - Q(\mathbf{o}^{t}, \mathbf{a}^t; \theta) \right)^2 \tag{6}


with

.. math::
    
     Q(\mathbf{o}^{t}, \mathbf{a}^t; \theta) = \sum_{i \in I} Q_i(o_i^t, a_i^t; \theta) \tag{7}

and

.. math::
    
     \max_{\mathbf{a'} \in A} Q(\mathbf{o}^{t+1}, \mathbf{a'}; \theta^{-}) = \sum_{i \in I} \max_{a'_i \in A_i} Q_i(o^{t+1}_i, a'_i;\theta) \tag{8}


We don’t update each individual Q-network separately; instead, we backpropagate through the **sum** of the individual Q-networks.


It’s important to note that :math:`Q(.; \theta)` is not an actual neural network.  
Only the individual networks :math:`Q_i(.; \theta)` are instantiated.  
In practice, we can give each agent its own parameters :math:`\theta_i`. Sharing weights among agents is a common practice in MARL.

Pseudocode
----------

.. image:: ../_static/vdn_algorithm.svg
   :alt: Architecture diagram
   :width: 100%
   :align: center

Implementations
---------------

We implemented three variants of VDN:

- ``vdn.py``: VDN with single environment and MLP neural networks.
- ``vdn_multienvs.py``: VDN with parallel environments and MLP neural networks.
- ``vdn_lstm.py``: VDN with single environment and recurrent neural networks.

Additional details:

- **Replay buffer**: For MLP-based implementations, we store transitions ``(obs, actions,reward,done,next_obs,next_avail_action)``. We need to store the ``next_avail_action`` in order to accurately compute the TD targets as we need the action-value of the best available next action. For the RNN-based implementation, we store sequences of transitions ``(seq_obs, seq_actions,seq_reward,seq_done,seq_next_obs,seq_next_avail_action)`` , and during the training we use the first ``burn_in`` transitions to compute the hidden state ``h``, and use the remaining of the sequence to update the network.

Logging
-------

We record the following metrics:

- **rollout/ep_reward** : Mean episode reward during environment rollouts.
- **rollout/ep_length** : Mean episode length during rollouts.
- **rollout/epsilon** : Current exploration epsilon.
- **rollout/battle_won** (SMAClite only): Fraction of battle won by SMAC agents
- **train/loss** : Training loss at the current optimization step.
- **train/grads** : Magnitude of gradients of the VDN networks.
- **eval/ep_reward** : Mean episode reward during evaluation.
- **eval/std_ep_reward** : Standard deviation of episode rewards during evaluation.
- **eval/ep_length** : Mean episode length during evaluation.
- **eval/battle_won** ( SMAClite only): Fraction of battles won during evaluation episodes.


Documentation
-------------

.. py:class::  cleanmarl.vdn.Args(env_type="smaclite", env_name="3m", env_family="sisl", agent_ids=True,           buffer_size=10000, total_timesteps=1000000, gamma=0.99, learning_starts=5000, train_freq=5, optimizer="Adam", learning_rate=0.0005, batch_size=32, start_e=1, end_e=0.05, exploration_fraction=0.05, hidden_dim=64, num_layers=1, target_network_update_freq=5, polyak=0.005, normalize_reward=False, clip_gradients=5, log_every=10, eval_steps=5000, num_eval_ep=10, use_wnb=False, wnb_project="", wnb_entity="", device="cpu", seed=1)

    :param env_type: Type of the environment: ``smaclite``, ``pz`` for PettingZoo, ``lbf`` for Level-based Foraging.
    :type env_type: str

    :param env_name: Name of the environment (``3m``, ``simple_spread_v3`` ``Foraging-2s-10x10-4p-2f-v3`` ...)
    :type env_name: str

    :param env_family: Env family when using a PettingZoo environment (``sisl``, ``mpe`` ...)
    :type env_family: str

    :param agent_ids: Include agent IDs (one-hot vector) in observations
    :type agent_ids: bool

    :param buffer_size: The size of the replay buffer
    :type buffer_size: int

    :param total_timesteps: Total steps of the environment during the training
    :type total_timesteps: int

    :param gamma: Discount factor
    :type gamma: float

    :param learning_starts: Number of environment steps to initialize the replay buffer
    :type learning_starts: int

    :param train_freq: Train the network each ``train_freq`` step in the environment
    :type train_freq: int

    :param optimizer: The optimizer
    :type optimizer: str

    :param learning_rate: Learning rate
    :type learning_rate: float

    :param batch_size: Batch size
    :type batch_size: int

    :param start_e: The starting value of epsilon, for exploration
    :type start_e: float

    :param end_e: The end value of epsilon, for exploration
    :type end_e: float

    :param exploration_fraction: The fraction of ``total-timesteps`` it takes from to go from ``start_e`` to ``end_e``.
    :type exploration_fraction: float

    :param hidden_dim: Hidden dimension
    :type hidden_dim: int

    :param num_layers: Number of layers
    :type num_layers: int

    :param target_network_update_freq: Update the target network each ``target_network_update_freq`` step in the environment
    :type target_network_update_freq: int

    :param polyak: Polyak coefficient to update the target network
    :type polyak: float

    :param normalize_reward: Normalize the rewards if True
    :type normalize_reward: bool

    :param clip_gradients: ``0<`` for no gradients clipping and ``0>`` if clipping gradients at ``clip_gradients``
    :type clip_gradients: float

    :param log_every: Log rollout stats every ``log_every`` episode
    :type log_every: int

    :param eval_steps: Evaluate the policy each ``eval_steps`` step
    :type eval_steps: int

    :param num_eval_ep: Number of evaluation episodes
    :type num_eval_ep: int

    :param use_wnb: Logging to Weights & Biases if True
    :type use_wnb: bool

    :param wnb_project: Weights & Biases project name
    :type wnb_project: str

    :param wnb_entity: Weights & Biases entity name
    :type wnb_entity: str

    :param device: Device (``cpu``, ``gpu``, ``mps``) *We only support CPU training for now*
    :type device: str

    :param seed: Random seed
    :type seed: int


.. py:class:: cleanmarl.vdn_lstm.Args(env_type="smaclite", env_name="3m", env_family="mpe", agent_ids=True, buffer_size=10000, seq_length=10, burn_in=7, total_timesteps=1000000, gamma=0.99, learning_starts=5000, train_freq=5, optimizer="Adam", learning_rate=0.0007, batch_size=32, start_e=1, end_e=0.05, exploration_fraction=0.01, hidden_dim=64, num_layers=1, normalize_reward=False, target_network_update_freq=1, polyak=0.005, log_every=10, clip_gradients=1, eval_steps=10000, num_eval_ep=10, use_wnb=False, wnb_project="", wnb_entity="", device="cpu", seed=1)

    :param seq_length: Length of the sequence to store in the buffer
    :type seq_length: int

    :param burn_in: Sequences to burn during batch updates
    :type burn_in: int


.. py:class:: cleanmarl.vdn_multienvs.Args(env_type="smaclite", env_name="3m", env_family="mpe", agent_ids=True, num_envs=4, buffer_size=10000, total_timesteps=1000000, gamma=0.99, learning_starts=5000, train_freq=2, optimizer="Adam", learning_rate=0.0005, batch_size=16, clip_gradients=5, start_e=1, end_e=0.05, exploration_fraction=0.05, hidden_dim=64, num_layers=1, target_network_update_freq=1, polyak=0.005, log_every=10, normalize_reward=False, eval_steps=5000, num_eval_ep=5, use_wnb=False, wnb_project="", wnb_entity="", device="mps", seed=1)

    :param num_envs: Number of parallel environments
    :type num_envs: int