Value-Decomposition Networks For Cooperative Multi-Agent Learning

  • Paper link: VDN

Quick facts:
  • VDN is an off-policy and value-based algorithm.

  • VDN works only for discrete actions.

  • Requires a common reward.

  • Uses additive factorization.

Key ideas:
  • VDN learns a centralized action-value function \(Q^{tot}\) decomposed into the sum of individual \(Q_i\) networks:

\[Q(\mathbf{o}, \mathbf{a}) = \sum_{i \in I} Q_i(o_i, a_i).\]
  • This factorization allows us to have decentralized policies.

  • \(Q_i\) networks are referred to as utility networks rather than action-value functions, as they don’t satisfy the Bellman equation. Only \(Q^{tot}\) is a true action-value function.

Background

VDN is based on Q-learning and works in settings with a common reward \(r\). Let’s first forget about VDN for a while and focus on how Q-learning can solve the cooperative MARL problem. There are two main approaches, each with its pros and cons.

The first approach is to use a single-agent RL algorithm. This treats the system as one central agent that receives the joint observation \(\mathbf{o}_t\), and whose action is the joint action \(\mathbf{a}_t\). We then estimate \(Q(\mathbf{o}_t, \mathbf{a}_t; \theta)\) as in DQN. The loss function is:

\[L(\theta) = (y_t - Q(\mathbf{o}_t, \mathbf{a}_t; \theta))^2 \tag{1}\]

where:

\[y_t = r_t + \gamma(1-done) \max_{\mathbf{a'}_t} Q(\mathbf{o}_{t+1}, \mathbf{a'}_t; \theta^{-})\]
  • Pros: The loss function backpropagates using the team reward, which is strongly related to the input of the Q-network: the joint action \(a\).

  • Cons: The Q-network takes the joint observation \(o\) as input and outputs one Q-value per joint action \(a\). The input becomes extremely large, and the output size grows exponentially with the number of agents.
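To make the trade-off concrete, here is a minimal sketch in plain Python. The function names are hypothetical and not part of the library; `td_target` mirrors the target \(y_t\) defined above, and `joint_action_count` assumes every agent has the same number of discrete actions.

```python
def joint_action_count(n_agents: int, n_actions: int) -> int:
    """Number of joint actions when each of n_agents picks one of n_actions.

    This is also the number of outputs a centralized Q-network would need.
    """
    return n_actions ** n_agents


def td_target(reward: float, done: float, next_q_values, gamma: float = 0.99) -> float:
    """DQN-style target: r + gamma * (1 - done) * max over next joint actions.

    next_q_values is assumed to hold the target network's value for every
    joint action at time t+1.
    """
    return reward + gamma * (1.0 - done) * max(next_q_values)


if __name__ == "__main__":
    # The output layer size explodes as agents are added.
    for n_agents in (2, 4, 8):
        print(n_agents, joint_action_count(n_agents, n_actions=5))
```

With 5 actions per agent, going from 2 to 8 agents takes the output layer from 25 to 390625 entries, which is why the centralized approach does not scale.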

The second approach is independent Q-learning (IQL). Each agent trains its own Q-learning algorithm using only its local observation \(o_i\) and local actions \(a_i\). We therefore have \(n\) independent loss functions to optimize:

\[L_i(\theta) = (y_i^t - Q_i(o_i^t, a_i^t; \theta))^2 \tag{3}\]
\[y_i^t = r_t + \gamma(1-done)\max_{a'_i} Q_i(o_i^{t+1}, a'_i; \theta^{-})\]
  • Pros: Each Q-network is trained using local observations \(o_i\) and actions \(a_i\), making training more efficient and easier to deploy.

  • Cons: Each Q-network backpropagates a reward signal that depends on the actions of all agents, not only its own, so from a single agent’s point of view the learning signal is noisy.
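The independent losses can be sketched as follows. This is an illustrative plain-Python version with a hypothetical function name and nested lists standing in for network outputs; it is not the library's implementation.

```python
def iql_losses(q_values, next_q_values, actions, reward, done, gamma=0.99):
    """Per-agent squared TD errors for independent Q-learning.

    q_values[i][a]      : agent i's online Q-value for action a at time t
    next_q_values[i][a] : agent i's target-network Q-value at time t+1
    actions[i]          : action taken by agent i
    reward, done        : shared team reward and termination flag
    """
    losses = []
    for q_i, next_q_i, a_i in zip(q_values, next_q_values, actions):
        # Every agent uses the same team reward but only its own next values.
        y_i = reward + gamma * (1.0 - done) * max(next_q_i)
        losses.append((y_i - q_i[a_i]) ** 2)
    return losses


if __name__ == "__main__":
    print(iql_losses(
        q_values=[[0.0, 1.0], [2.0, 0.0]],
        next_q_values=[[1.0, 0.0], [0.0, 3.0]],
        actions=[1, 0],
        reward=1.0,
        done=0.0,
        gamma=0.5,
    ))
```

Each loss is minimized on its own, so nothing ties the agents' estimates together even though they all regress toward the same reward.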

The idea of VDN is to combine these two approaches. We want to train decentralized networks using local observations \(o_i\) and actions \(a_i\) while still backpropagating the common reward through a loss function that aggregates all agents.

To achieve this, we assume a centralized Q-network \(Q(\mathbf{o}^t, \mathbf{a}^t; \theta)\) that can be written as:

\[Q(\mathbf{o}^t, \mathbf{a}^t; \theta) = \sum_{i \in I} Q_i(o_i^t, a_i^t; \theta) \tag{5}\]

and we use the following loss function:

\[L(\theta) = \left( r^t + \gamma \max_{\mathbf{a'} \in A} Q(\mathbf{o}^{t+1}, \mathbf{a'}; \theta^{-}) - Q(\mathbf{o}^{t}, \mathbf{a}^t; \theta) \right)^2 \tag{6}\]

with

\[Q(\mathbf{o}^{t}, \mathbf{a}^t; \theta) = \sum_{i \in I} Q_i(o_i^t, a_i^t; \theta) \tag{7}\]

and

\[\max_{\mathbf{a'} \in A} Q(\mathbf{o}^{t+1}, \mathbf{a'}; \theta^{-}) = \sum_{i \in I} \max_{a'_i \in A_i} Q_i(o^{t+1}_i, a'_i; \theta^{-}) \tag{8}\]
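The last identity is what keeps the target cheap to compute: because the joint value is a sum of per-agent terms, the maximum over joint actions equals the sum of per-agent maxima. The sketch below checks this by brute force on made-up utility values (the numbers and function names are purely illustrative).

```python
from itertools import product


def joint_max(utilities):
    """Max of sum_i Q_i(a_i) over all joint actions, by explicit enumeration."""
    return max(sum(choice) for choice in product(*utilities))


def decentralized_max(utilities):
    """Sum of each agent's own maximum, computed without enumeration."""
    return sum(max(q_i) for q_i in utilities)


if __name__ == "__main__":
    # utilities[i][a] plays the role of Q_i(o_i, a); agents may have
    # different numbers of actions.
    utilities = [[0.2, -1.0, 0.7], [1.5, 0.1], [-0.3, -0.2, -0.9, 0.0]]
    print(joint_max(utilities), decentralized_max(utilities))
```

The enumeration visits 3 × 2 × 4 = 24 joint actions, while the decentralized version only inspects 3 + 2 + 4 = 9 values.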

We don’t update each individual Q-network separately; instead, we backpropagate through the sum of the individual Q-networks.

It’s important to note that \(Q(.; \theta)\) is not an actual neural network. Only the individual networks \(Q_i(.; \theta)\) are instantiated. Each agent can be given its own parameters \(\theta_i\), although sharing weights among agents is a common practice in MARL.
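As a rough illustration of backpropagating through the sum, the sketch below uses lookup tables in place of neural networks, so the gradient can be written by hand. Everything here is an assumption made for the example: the table layout `q[i][obs][action]`, the function names, and the inclusion of the `done` flag as in the earlier targets.

```python
def vdn_td_error(q, q_target, obs, actions, reward, next_obs, done, gamma=0.99):
    """TD error of the summed value; q[i][o][a] is a tabular stand-in for Q_i."""
    # Joint value at time t: sum of each agent's value for the action it took.
    q_tot = sum(q[i][o][a] for i, (o, a) in enumerate(zip(obs, actions)))
    # Joint max at time t+1: sum of per-agent maxima under the target tables.
    next_max = sum(max(q_target[i][o]) for i, o in enumerate(next_obs))
    return reward + gamma * (1.0 - done) * next_max - q_tot


def vdn_update(q, q_target, obs, actions, reward, next_obs, done, gamma=0.99, lr=0.1):
    """One gradient step on the squared TD error; returns the loss before the step."""
    delta = vdn_td_error(q, q_target, obs, actions, reward, next_obs, done, gamma)
    # d(delta**2)/dQ_i(o_i, a_i) = -2 * delta for every agent, so all selected
    # entries receive the same correction: the error flows through the sum.
    for i, (o, a) in enumerate(zip(obs, actions)):
        q[i][o][a] += 2.0 * lr * delta
    return delta ** 2


if __name__ == "__main__":
    n_agents, n_obs, n_actions = 2, 2, 2
    q = [[[0.0] * n_actions for _ in range(n_obs)] for _ in range(n_agents)]
    q_target = [[[0.0] * n_actions for _ in range(n_obs)] for _ in range(n_agents)]
    loss = vdn_update(q, q_target, obs=[0, 0], actions=[1, 0],
                      reward=1.0, next_obs=[1, 1], done=0.0)
    print(loss, q)
```

Note that there is a single loss and a single TD error; no agent is trained against its own separate target.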

Pseudocode

Architecture diagram

Implementations

We implemented three variants of VDN:

  • vdn.py: VDN with single environment and MLP neural networks.

  • vdn_multienvs.py: VDN with parallel environments and MLP neural networks.

  • vdn_lstm.py: VDN with single environment and recurrent neural networks.

Additional details:

  • Replay buffer: For MLP-based implementations, we store transitions (obs, actions, reward, done, next_obs, next_avail_action). We store next_avail_action so that the TD target uses the action-value of the best available next action. For the RNN-based implementation, we store sequences of transitions (seq_obs, seq_actions, seq_reward, seq_done, seq_next_obs, seq_next_avail_action). During training, we use the first burn_in transitions to compute the hidden state h and the remainder of the sequence to update the network.
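The role of next_avail_action can be sketched as a masked maximum. The example assumes availability is stored as a 0/1 mask per agent and that every agent has at least one available action; the function names are hypothetical and the actual buffer layout may differ.

```python
def masked_max(q_values, avail_actions):
    """Largest Q-value among actions whose availability flag is truthy."""
    return max(q for q, ok in zip(q_values, avail_actions) if ok)


def vdn_target(reward, done, next_q_values, next_avail_actions, gamma=0.99):
    """TD target that ignores unavailable next actions.

    next_q_values[i][a]      : target-network value of agent i for action a
    next_avail_actions[i][a] : 1 if agent i may take action a at t+1, else 0
    """
    next_max = sum(
        masked_max(q_i, mask_i)
        for q_i, mask_i in zip(next_q_values, next_avail_actions)
    )
    return reward + gamma * (1.0 - done) * next_max


if __name__ == "__main__":
    # Agent 0's best-valued action (index 0) is unavailable and must be skipped.
    print(vdn_target(1.0, 0.0, [[5.0, 1.0], [0.5, 3.0]], [[0, 1], [1, 1]], gamma=0.5))
```

Without the mask, the target would be inflated by the value of an action the agent could never take.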

Logging

We record the following metrics:

  • rollout/ep_reward : Mean episode reward during environment rollouts.

  • rollout/ep_length : Mean episode length during rollouts.

  • rollout/epsilon : Current exploration epsilon.

  • rollout/battle_won (SMAClite only): Fraction of battles won by the SMAC agents during rollouts.

  • train/loss : Training loss at the current optimization step.

  • train/grads : Magnitude of gradients of the VDN networks.

  • eval/ep_reward : Mean episode reward during evaluation.

  • eval/std_ep_reward : Standard deviation of episode rewards during evaluation.

  • eval/ep_length : Mean episode length during evaluation.

  • eval/battle_won (SMAClite only): Fraction of battles won during evaluation episodes.

Documentation

class cleanmarl.vdn.Args(env_type='smaclite', env_name='3m', env_family='sisl', agent_ids=True, buffer_size=10000, total_timesteps=1000000, gamma=0.99, learning_starts=5000, train_freq=5, optimizer='Adam', learning_rate=0.0005, batch_size=32, start_e=1, end_e=0.05, exploration_fraction=0.05, hidden_dim=64, num_layers=1, target_network_update_freq=5, polyak=0.005, normalize_reward=False, clip_gradients=5, log_every=10, eval_steps=5000, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
Parameters:
  • env_type (str) – Type of the environment: smaclite, pz for PettingZoo, lbf for Level-based Foraging.

  • env_name (str) – Name of the environment (3m, simple_spread_v3, Foraging-2s-10x10-4p-2f-v3 …)

  • env_family (str) – Env family when using a PettingZoo environment (sisl, mpe …)

  • agent_ids (bool) – Include agent IDs (one-hot vector) in observations

  • buffer_size (int) – The size of the replay buffer

  • total_timesteps (int) – Total steps of the environment during the training

  • gamma (float) – Discount factor

  • learning_starts (int) – Number of environment steps collected to fill the replay buffer before training starts

  • train_freq (int) – Train the network every train_freq environment steps

  • optimizer (str) – The optimizer

  • learning_rate (float) – Learning rate

  • batch_size (int) – Batch size

  • start_e (float) – The starting value of epsilon, for exploration

  • end_e (float) – The end value of epsilon, for exploration

  • exploration_fraction (float) – The fraction of total_timesteps it takes to go from start_e to end_e

  • hidden_dim (int) – Hidden dimension

  • num_layers (int) – Number of layers

  • target_network_update_freq (int) – Update the target network every target_network_update_freq environment steps

  • polyak (float) – Polyak coefficient to update the target network

  • normalize_reward (bool) – Normalize the rewards if True

  • clip_gradients (float) – Gradient clipping threshold: < 0 for no gradient clipping, > 0 to clip gradients at clip_gradients

  • log_every (int) – Log rollout stats every log_every episodes

  • eval_steps (int) – Evaluate the policy every eval_steps steps

  • num_eval_ep (int) – Number of evaluation episodes

  • use_wnb (bool) – Logging to Weights & Biases if True

  • wnb_project (str) – Weights & Biases project name

  • wnb_entity (str) – Weights & Biases entity name

  • device (str) – Device (cpu, gpu, mps). Only CPU training is supported for now.

  • seed (int) – Random seed
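To illustrate how start_e, end_e, exploration_fraction and polyak typically interact with training, here is a hedged sketch. It assumes a linear epsilon decay and a soft target update of the form target ← polyak · online + (1 − polyak) · target; both are common conventions, not confirmed details of this implementation, and the function names are hypothetical.

```python
def linear_epsilon(step, start_e=1.0, end_e=0.05,
                   exploration_fraction=0.05, total_timesteps=1_000_000):
    """Linearly anneal epsilon from start_e to end_e, then hold it at end_e."""
    duration = exploration_fraction * total_timesteps
    slope = (end_e - start_e) / duration
    return max(end_e, start_e + slope * step)


def polyak_update(target_params, online_params, polyak=0.005):
    """Soft update: move each target parameter a small step toward the online one.

    Parameters are plain floats here; in a real network they would be tensors.
    """
    return [
        polyak * w + (1.0 - polyak) * w_target
        for w_target, w in zip(target_params, online_params)
    ]


if __name__ == "__main__":
    for step in (0, 25_000, 50_000, 500_000):
        print(step, linear_epsilon(step))
    print(polyak_update([0.0, 1.0], [1.0, 1.0]))
```

Under these assumptions and the default values, epsilon reaches end_e after 5% of total_timesteps, i.e. 50000 environment steps.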

class cleanmarl.vdn_lstm.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, buffer_size=10000, seq_length=10, burn_in=7, total_timesteps=1000000, gamma=0.99, learning_starts=5000, train_freq=5, optimizer='Adam', learning_rate=0.0007, batch_size=32, start_e=1, end_e=0.05, exploration_fraction=0.01, hidden_dim=64, num_layers=1, normalize_reward=False, target_network_update_freq=1, polyak=0.005, log_every=10, clip_gradients=1, eval_steps=10000, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
Parameters:
  • seq_length (int) – Length of the sequence to store in the buffer

  • burn_in (int) – Number of initial transitions in each sequence used only to compute the hidden state during batch updates
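The interplay between seq_length and burn_in can be sketched as follows. This is a toy illustration with a hypothetical `step_fn(hidden, x) -> (hidden, output)` standing in for the recurrent network; a real implementation would typically run the burn-in part without tracking gradients.

```python
def split_burn_in(sequence, burn_in):
    """Split a stored sequence into a warm-up prefix and a training suffix."""
    if not 0 <= burn_in < len(sequence):
        raise ValueError("burn_in must be in [0, len(sequence))")
    return sequence[:burn_in], sequence[burn_in:]


def burn_in_unroll(step_fn, sequence, burn_in, hidden=0.0):
    """Warm up the hidden state on the prefix, collect outputs on the suffix."""
    warm_up, train_part = split_burn_in(sequence, burn_in)
    for x in warm_up:
        hidden, _ = step_fn(hidden, x)  # outputs discarded: no loss on these steps
    outputs = []
    for x in train_part:
        hidden, out = step_fn(hidden, x)
        outputs.append(out)  # only these outputs would enter the loss
    return outputs


if __name__ == "__main__":
    # With the defaults seq_length=10 and burn_in=7, 7 transitions warm up the
    # hidden state and the remaining 3 are used for the update.
    warm_up, train_part = split_burn_in(list(range(10)), burn_in=7)
    print(len(warm_up), len(train_part))
```

The warm-up prefix gives the training steps a hidden state that reflects recent history instead of starting from zeros.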

class cleanmarl.vdn_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, num_envs=4, buffer_size=10000, total_timesteps=1000000, gamma=0.99, learning_starts=5000, train_freq=2, optimizer='Adam', learning_rate=0.0005, batch_size=16, clip_gradients=5, start_e=1, end_e=0.05, exploration_fraction=0.05, hidden_dim=64, num_layers=1, target_network_update_freq=1, polyak=0.005, log_every=10, normalize_reward=False, eval_steps=5000, num_eval_ep=5, use_wnb=False, wnb_project='', wnb_entity='', device='mps', seed=1)
Parameters:
  • num_envs (int) – Number of parallel environments