Value-Decomposition Networks For Cooperative Multi-Agent Learning

Paper link: VDN

Quick facts:

VDN is an off-policy and value-based algorithm.
VDN works only for discrete actions.
Requires a common reward.
Uses additive factorization.

Key ideas:

VDN learns a centralized action-value function \(Q^{tot}\) decomposed into the sum of individual \(Q_i\) networks:

\[Q(\mathbf{o}, \mathbf{a}) = \sum_{i \in I} Q_i(o_i, a_i).\]

This factorization allows us to have decentralized policies.
\(Q_i\) networks are referred to as utility networks rather than action-value functions, as they don’t satisfy the Bellman equation. Instead, \(Q^{tot}\) is a true action-value function.
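As a quick sanity check of the claim that the additive form permits decentralized policies, the sketch below brute-forces the greedy joint action of a sum of per-agent utilities and compares it with each agent's local greedy action. The utility tables are made-up numbers chosen for illustration, not values from the paper or the library.

```python
from itertools import product

# Hypothetical utility tables Q_i(o_i, .) for 3 agents at fixed observations.
# Each inner list holds one value per local action.
utilities = [
    [0.2, 1.5, -0.3],        # agent 0, 3 actions
    [0.9, 0.1],              # agent 1, 2 actions
    [-1.0, 0.4, 0.7, 0.0],   # agent 2, 4 actions
]


def q_tot(joint_action):
    """Additive factorization: Q_tot is the sum of per-agent utilities."""
    return sum(q[a] for q, a in zip(utilities, joint_action))


# Centralized greedy action: search over every joint action.
joint_actions = product(*(range(len(q)) for q in utilities))
central = max(joint_actions, key=q_tot)

# Decentralized greedy action: each agent maximizes its own utility.
decentral = tuple(max(range(len(q)), key=q.__getitem__) for q in utilities)

print(central, decentral)
```

Both searches select the same joint action, but the decentralized one only needs each agent's own table, which is what makes execution without a central controller possible.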

Background

VDN is based on Q-learning and works in settings with a common reward \(r\). Let’s first forget about VDN for a while and focus on how Q-learning can solve the cooperative MARL problem. There are two main approaches, each with its pros and cons.

The first approach is to use a single-agent RL algorithm. This treats the system as one central agent that receives the joint observation \(\mathbf{o}_t\), and whose action is the joint action \(\mathbf{a}_t\). We then estimate \(Q(\mathbf{o}_t, \mathbf{a}_t; \theta)\) as in DQN. The loss function is:

\[L(\theta) = (y_t - Q(\mathbf{o}_t, \mathbf{a}_t; \theta))^2 \tag{1}\]

where:

\[y_t = r_t + \gamma(1-done) \max_{\mathbf{a'}_t} Q(\mathbf{o}_{t+1}, \mathbf{a'}_t; \theta^{-})\]

Pros: The loss function backpropagates the team reward, which depends directly on the joint action \(\mathbf{a}\) that the Q-network evaluates.
Cons: The Q-network takes the joint observation \(\mathbf{o}\) as input and outputs one value per joint action \(\mathbf{a}\). The input becomes extremely large, and the output size grows exponentially with the number of agents.
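To make the exponential growth concrete, here is a small calculation comparing the number of outputs of a centralized Q-network with the total number of outputs of per-agent networks. It assumes every agent has the same number of local actions; the figure of 9 actions per agent is an arbitrary choice for illustration.

```python
def num_joint_actions(n_agents: int, n_actions_per_agent: int) -> int:
    """Outputs needed by a centralized Q-network: one per joint action."""
    return n_actions_per_agent ** n_agents


def num_factored_outputs(n_agents: int, n_actions_per_agent: int) -> int:
    """Total outputs when each agent has its own head over local actions."""
    return n_agents * n_actions_per_agent


for n in (2, 3, 5, 8):
    print(n, num_joint_actions(n, 9), num_factored_outputs(n, 9))
```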

The second approach is independent Q-learning (IQL). Each agent runs its own Q-learning algorithm using only its local observation \(o_i\) and local action \(a_i\). We therefore have \(n\) independent loss functions to optimize:

\[L_i(\theta) = (y_i^t - Q_i(o_i^t, a_i^t; \theta))^2 \tag{3}\]

\[y_i^t = r_t + \gamma(1-done)\max_{a'_i} Q_i(o_i^{t+1}, a'_i; \theta^{-})\]

Pros: Each Q-network is trained using only local observations \(o_i\) and actions \(a_i\), which makes training more efficient and the resulting policies easier to deploy.
Cons: Each Q-network backpropagates the common reward, which depends on the actions of all agents and not only on its own.

The idea of VDN is to combine these two approaches. We want to train decentralized networks using local observations \(o_i\) and actions \(a_i\) while still backpropagating the common reward through a loss function that aggregates all agents.

To achieve this, we assume a centralized Q-network \(Q(\mathbf{o}^t, \mathbf{a}^t; \theta)\) that can be written as:

\[Q(\mathbf{o}^t, \mathbf{a}^t; \theta) = \sum_{i \in I} Q_i(o_i^t, a_i^t; \theta) \tag{5}\]

and we use the following loss function:

\[L(\theta) = \left( r^t + \gamma \max_{\mathbf{a'} \in A} Q(\mathbf{o}^{t+1}, \mathbf{a'}; \theta^{-}) - Q(\mathbf{o}^{t}, \mathbf{a}^t; \theta) \right)^2 \tag{6}\]

with

\[Q(\mathbf{o}^{t}, \mathbf{a}^t; \theta) = \sum_{i \in I} Q_i(o_i^t, a_i^t; \theta) \tag{7}\]

and

\[\max_{\mathbf{a'} \in A} Q(\mathbf{o}^{t+1}, \mathbf{a'}; \theta^{-}) = \sum_{i \in I} \max_{a'_i \in A_i} Q_i(o^{t+1}_i, a'_i;\theta^{-}) \tag{8}\]

We don’t update each individual Q-network separately; instead, we backpropagate through the sum of the individual Q-networks.
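The loss in Eqs. (6) to (8) can be sketched on a batch as follows. This is a minimal NumPy illustration, not the library's code: the function name `vdn_td_loss` and the array layout are assumptions, and the `(1 - done)` mask is carried over from the earlier single-agent target even though Eq. (6) omits it. In a real implementation the utilities would come from the online and target networks, and an autograd framework would differentiate the loss through the sum.

```python
import numpy as np


def vdn_td_loss(q, q_next_target, actions, reward, done, gamma=0.99):
    """Mean squared TD error for an additive (VDN-style) factorization.

    q, q_next_target: (batch, n_agents, n_actions) per-agent utilities from
        the online network at o^t and the target network at o^{t+1}.
    actions: (batch, n_agents) integer actions taken at time t.
    reward, done: (batch,) common reward and episode-termination flag.
    """
    # Q_i(o_i^t, a_i^t): pick the utility of the action each agent took.
    chosen = np.take_along_axis(q, actions[..., None], axis=-1).squeeze(-1)
    q_tot = chosen.sum(axis=1)  # Eq. (7)
    # Sum of per-agent maxima equals the max over joint actions, Eq. (8).
    next_q_tot = q_next_target.max(axis=-1).sum(axis=1)
    target = reward + gamma * (1.0 - done) * next_q_tot
    return np.mean((target - q_tot) ** 2)


# Tiny hand-checkable batch: 1 transition, 2 agents, 2 actions each.
q = np.array([[[1.0, 2.0], [3.0, 4.0]]])
q_next = np.array([[[0.0, 1.0], [2.0, 0.0]]])
actions = np.array([[1, 0]])
reward = np.array([1.0])
done = np.array([0.0])
loss = vdn_td_loss(q, q_next, actions, reward, done, gamma=0.5)
print(loss)
```

Because the error is computed on the summed value, a single scalar loss drives the gradients of every agent's utilities at once.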

It’s important to note that \(Q(.; \theta)\) is not an actual neural network. Only the individual networks \(Q_i(.; \theta)\) are instantiated. Each agent could be given its own parameters \(\theta_i\), but sharing weights among agents is a common practice in MARL.

Pseudocode

Implementations

We implemented three variants of VDN:

vdn.py: VDN with single environment and MLP neural networks.
vdn_multienvs.py: VDN with parallel environments and MLP neural networks.
vdn_lstm.py: VDN with single environment and recurrent neural networks.

Additional details:

Replay buffer: For MLP-based implementations, we store transitions (obs, actions, reward, done, next_obs, next_avail_action). We store next_avail_action so that the TD target uses the action-value of the best available next action. For the RNN-based implementation, we store sequences of transitions (seq_obs, seq_actions, seq_reward, seq_done, seq_next_obs, seq_next_avail_action). During training, we use the first burn_in transitions to compute the hidden state h and the remainder of the sequence to update the network.
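The role of next_avail_action in the TD target can be illustrated with a masked maximum. The sketch below is an assumption about how such a mask could be applied (the helper name `masked_next_q_tot` and the 0/1 mask layout are hypothetical), and it assumes every agent has at least one available action.

```python
import numpy as np


def masked_next_q_tot(q_next_target, next_avail_action):
    """Sum over agents of the best *available* next-action utility.

    q_next_target: (batch, n_agents, n_actions) target-network utilities.
    next_avail_action: same shape, 1 if the action is available, else 0.
    """
    # Unavailable actions get -inf so they can never win the max.
    masked = np.where(next_avail_action.astype(bool), q_next_target, -np.inf)
    return masked.max(axis=-1).sum(axis=1)


q_next = np.array([[[5.0, 1.0], [0.5, 2.0]]])
avail = np.array([[[0, 1], [1, 1]]])  # agent 0 cannot take action 0
print(masked_next_q_tot(q_next, avail))
```

Without the mask, agent 0's unavailable action (utility 5.0) would inflate the target.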

Logging

We record the following metrics:

rollout/ep_reward : Mean episode reward during environment rollouts.
rollout/ep_length : Mean episode length during rollouts.
rollout/epsilon : Current exploration epsilon.
rollout/battle_won (SMAClite only): Fraction of battles won by the SMAC agents during rollouts.
train/loss : Training loss at the current optimization step.
train/grads : Magnitude of gradients of the VDN networks.
eval/ep_reward : Mean episode reward during evaluation.
eval/std_ep_reward : Standard deviation of episode rewards during evaluation.
eval/ep_length : Mean episode length during evaluation.
eval/battle_won (SMAClite only): Fraction of battles won during evaluation episodes.

Documentation

class cleanmarl.vdn.Args(env_type='smaclite', env_name='3m', env_family='sisl', agent_ids=True, buffer_size=10000, total_timesteps=1000000, gamma=0.99, learning_starts=5000, train_freq=5, optimizer='Adam', learning_rate=0.0005, batch_size=32, start_e=1, end_e=0.05, exploration_fraction=0.05, hidden_dim=64, num_layers=1, target_network_update_freq=5, polyak=0.005, normalize_reward=False, clip_gradients=5, log_every=10, eval_steps=5000, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)

Parameters:

env_type (str) – Type of the environment: smaclite, pz for PettingZoo, lbf for Level-based Foraging.
env_name (str) – Name of the environment (3m, simple_spread_v3, Foraging-2s-10x10-4p-2f-v3 …)
env_family (str) – Env family when using a PettingZoo environment (sisl, mpe …)
agent_ids (bool) – Include agent IDs (one-hot vector) in observations
buffer_size (int) – The size of the replay buffer
total_timesteps (int) – Total steps of the environment during the training
gamma (float) – Discount factor
learning_starts (int) – Number of environment steps to initialize the replay buffer
train_freq (int) – Train the network every train_freq steps in the environment
optimizer (str) – The optimizer
learning_rate (float) – Learning rate
batch_size (int) – Batch size
start_e (float) – The starting value of epsilon, for exploration
end_e (float) – The end value of epsilon, for exploration
exploration_fraction (float) – The fraction of total_timesteps it takes to go from start_e to end_e.
hidden_dim (int) – Hidden dimension
num_layers (int) – Number of layers
target_network_update_freq (int) – Update the target network every target_network_update_freq steps in the environment
polyak (float) – Polyak coefficient to update the target network
normalize_reward (bool) – Normalize the rewards if True
clip_gradients (float) – Set to a value < 0 for no gradient clipping, or to a value > 0 to clip gradients at clip_gradients
log_every (int) – Log rollout stats every log_every episode
eval_steps (int) – Evaluate the policy each eval_steps step
num_eval_ep (int) – Number of evaluation episodes
use_wnb (bool) – Logging to Weights & Biases if True
wnb_project (str) – Weights & Biases project name
wnb_entity (str) – Weights & Biases entity name
device (str) – Device (cpu, gpu, mps). Only CPU training is supported for now
seed (int) – Random seed
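Two of the parameters above, the epsilon schedule (start_e, end_e, exploration_fraction) and the polyak coefficient, describe simple formulas. The sketch below shows one common reading of them. It is an assumption about their semantics rather than the library's code: the schedule is taken to be linear, and polyak is taken to weight the online parameters in a soft target update.

```python
def linear_epsilon(step, start_e=1.0, end_e=0.05,
                   exploration_fraction=0.05, total_timesteps=1_000_000):
    """Linearly anneal epsilon from start_e to end_e, then hold at end_e."""
    decay_steps = exploration_fraction * total_timesteps
    frac = min(step / decay_steps, 1.0)
    return start_e + frac * (end_e - start_e)


def polyak_update(target_params, online_params, polyak=0.005):
    """Soft update: target <- polyak * online + (1 - polyak) * target.

    Parameters are plain lists of floats here; a real implementation would
    apply the same formula to each weight tensor.
    """
    return [polyak * w + (1.0 - polyak) * t
            for t, w in zip(target_params, online_params)]


print(linear_epsilon(0), linear_epsilon(25_000), linear_epsilon(200_000))
print(polyak_update([0.0, 1.0], [1.0, 1.0]))
```

With the defaults shown in the signature, this reading would have epsilon reach end_e after 5% of total_timesteps and stay there for the rest of training.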

class cleanmarl.vdn_lstm.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, buffer_size=10000, seq_length=10, burn_in=7, total_timesteps=1000000, gamma=0.99, learning_starts=5000, train_freq=5, optimizer='Adam', learning_rate=0.0007, batch_size=32, start_e=1, end_e=0.05, exploration_fraction=0.01, hidden_dim=64, num_layers=1, normalize_reward=False, target_network_update_freq=1, polyak=0.005, log_every=10, clip_gradients=1, eval_steps=10000, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)

Parameters:

seq_length (int) – Length of the sequence to store in the buffer
burn_in (int) – Number of initial transitions in each sequence used to compute the hidden state before the update
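How seq_length and burn_in interact can be sketched as a simple split of each sampled sequence. The helper `split_burn_in` is hypothetical, and the integers stand in for stored transitions: the first part would only warm up the recurrent hidden state, and only the second part would contribute to the loss.

```python
def split_burn_in(sequence, burn_in):
    """Split a stored sequence into a warm-up part and a training part."""
    if not 0 <= burn_in < len(sequence):
        raise ValueError("burn_in must be in [0, len(sequence))")
    return sequence[:burn_in], sequence[burn_in:]


# Stand-in for one stored sequence with seq_length=10.
seq = list(range(10))
warmup, train = split_burn_in(seq, burn_in=7)
print(len(warmup), len(train))
```

With the defaults seq_length=10 and burn_in=7, this split would leave three transitions per sequence for the update.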

class cleanmarl.vdn_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, num_envs=4, buffer_size=10000, total_timesteps=1000000, gamma=0.99, learning_starts=5000, train_freq=2, optimizer='Adam', learning_rate=0.0005, batch_size=16, clip_gradients=5, start_e=1, end_e=0.05, exploration_fraction=0.05, hidden_dim=64, num_layers=1, target_network_update_freq=1, polyak=0.005, log_every=10, normalize_reward=False, eval_steps=5000, num_eval_ep=5, use_wnb=False, wnb_project='', wnb_entity='', device='mps', seed=1)

Parameters:

num_envs (int) – Number of parallel environments