Value-Decomposition Networks For Cooperative Multi-Agent Learning
Paper link: VDN
- Quick facts:
VDN is an off-policy and value-based algorithm.
VDN works only for discrete actions.
Requires a common reward.
Uses additive factorization.
- Key ideas:
VDN learns a centralized action-value function \(Q^{tot}\), decomposed into the sum of individual \(Q_i\) networks:
\[Q^{tot}(\mathbf{o}, \mathbf{a}) = \sum_{i \in I} Q_i(o_i, a_i).\]
This factorization allows us to have decentralized policies: each agent can act greedily with respect to its own \(Q_i\).
\(Q_i\) networks are referred to as utility networks rather than action-value functions, as they don’t satisfy the Bellman equation. Only \(Q^{tot}\) is a true action-value function.
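To illustrate why the additive factorization permits decentralized action selection, here is a minimal sketch in plain Python. The utility values and the helper names `greedy_decentralized` and `greedy_centralized` are hypothetical and chosen only for illustration; they are not part of the library.

```python
from itertools import product

# Hypothetical utilities Q_i(o_i, a_i) for 2 agents with 3 actions each,
# evaluated at one fixed joint observation.
utilities = [
    [0.2, 1.5, -0.3],  # agent 0
    [0.9, 0.1, 2.0],   # agent 1
]


def greedy_decentralized(utilities):
    """Each agent independently picks the argmax of its own utility."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in utilities)


def greedy_centralized(utilities):
    """Brute-force argmax of Q_tot = sum_i Q_i over all joint actions."""
    joint_actions = product(*(range(len(q)) for q in utilities))
    return max(
        joint_actions,
        key=lambda joint: sum(q[a] for q, a in zip(utilities, joint)),
    )


print(greedy_decentralized(utilities))
print(greedy_centralized(utilities))
```

Because the sum is maximized term by term, both functions select the same joint action, but the decentralized version never enumerates the joint action space.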
Background
VDN is based on Q-learning and works in settings with a common reward \(r\). Let’s first forget about VDN for a while and focus on how Q-learning can solve the cooperative MARL problem. There are two main approaches, each with its pros and cons.
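The first approach is to use a single-agent RL algorithm. This treats the system as one central agent that receives the joint observation \(\mathbf{o}^t\), and whose action is the joint action \(\mathbf{a}^t\). We then estimate \(Q(\mathbf{o}^t, \mathbf{a}^t; \theta)\) as in DQN. The loss function is:
\[\mathcal{L}(\theta) = \mathbb{E}\left[\left(y^t - Q(\mathbf{o}^t, \mathbf{a}^t; \theta)\right)^2\right],\]
where:
\[y^t = r^t + \gamma \max_{\mathbf{a}'} Q(\mathbf{o}^{t+1}, \mathbf{a}'; \theta^-)\]
and \(\theta^-\) denotes the parameters of the target network.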
Pros: The loss function backpropagates the team reward, which is strongly related to what the Q-network evaluates: the joint action \(\mathbf{a}\).
Cons: The Q-network takes the joint observation \(\mathbf{o}\) as input and outputs one value per joint action \(\mathbf{a}\). This leads to extremely large inputs, and the output size grows exponentially with the number of agents.
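The exponential growth can be made concrete with a short calculation. The sketch below assumes every agent has the same number of discrete actions; the function names and the numbers used are illustrative only.

```python
def joint_q_outputs(n_agents: int, n_actions: int) -> int:
    """Output size of one centralized Q-network: one value per joint action."""
    return n_actions ** n_agents


def decentralized_q_outputs(n_agents: int, n_actions: int) -> int:
    """Total output size of n per-agent Q-networks: one value per local action."""
    return n_agents * n_actions


# Compare both output sizes for a hypothetical setting with 5 actions per agent.
for n in (2, 4, 8):
    print(n, joint_q_outputs(n, 5), decentralized_q_outputs(n, 5))
```

With 8 agents and 5 actions each, the centralized network already needs 390625 outputs, while the per-agent networks need 40 in total.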
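The second approach is independent Q-learning (IQL). Each agent runs its own Q-learning algorithm using only its local observation \(o_i\) and local action \(a_i\). We therefore have \(n\) independent loss functions to optimize:
\[\mathcal{L}_i(\theta_i) = \mathbb{E}\left[\left(r^t + \gamma \max_{a_i'} Q_i(o_i^{t+1}, a_i'; \theta_i^-) - Q_i(o_i^t, a_i^t; \theta_i)\right)^2\right], \quad i = 1, \dots, n,\]
where \(\theta_i^-\) denotes the parameters of agent \(i\)'s target network.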
Pros: Each Q-network is trained using local observations \(o_i\) and actions \(a_i\), which makes training more efficient and the resulting policies easier to deploy.
Cons: Each Q-network backpropagates a common reward that depends on the actions of all agents, not only on its own action, so the learning signal of each agent is affected by what the others do.
The idea of VDN is to combine these two approaches. We want to train decentralized networks using local observations \(o_i\) and actions \(a_i\) while still backpropagating the common reward through a loss function that aggregates all agents.
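To achieve this, we assume a centralized Q-network \(Q(\mathbf{o}^t, \mathbf{a}^t; \theta)\) that can be written as:
\[Q(\mathbf{o}^t, \mathbf{a}^t; \theta) = \sum_{i \in I} Q_i(o_i^t, a_i^t; \theta),\]
and we use the following loss function:
\[\mathcal{L}(\theta) = \mathbb{E}\left[\left(y^t - Q(\mathbf{o}^t, \mathbf{a}^t; \theta)\right)^2\right],\]
with
\[y^t = r^t + \gamma \max_{\mathbf{a}'} Q(\mathbf{o}^{t+1}, \mathbf{a}'; \theta^-)\]
and
\[\max_{\mathbf{a}'} Q(\mathbf{o}^{t+1}, \mathbf{a}'; \theta^-) = \sum_{i \in I} \max_{a_i'} Q_i(o_i^{t+1}, a_i'; \theta^-),\]
where \(\theta^-\) denotes the parameters of the target networks.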
We don’t update each individual Q-network separately; instead, we backpropagate through the sum of the individual Q-networks.
It’s important to note that \(Q(.; \theta)\) is not an actual neural network. Only the individual networks \(Q_i(.; \theta)\) are instantiated. The notation above shares a single set of weights \(\theta\) among agents, which is a common practice in MARL; alternatively, we can give each agent its own parameters \(\theta_i\).
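The following sketch computes the VDN TD error for a single transition in plain Python, using tabular utilities instead of neural networks. It is an illustration under stated assumptions, not the code of `vdn.py`: the function names `vdn_td_error` and `vdn_gradients`, the argument layout, and the numbers are all hypothetical.

```python
def vdn_td_error(q_values, actions, reward, next_q_target, done, gamma=0.99):
    """TD error of the summed value for one transition.

    q_values[i][a]      : online utility Q_i(o_i, a) at the current step
    next_q_target[i][a] : target-network utility of agent i at the next step
    """
    # Q_tot is the sum of the utilities of the actions actually taken.
    q_tot = sum(q[a] for q, a in zip(q_values, actions))
    # The max over joint actions of a sum is the sum of per-agent maxima.
    next_q_tot = sum(max(q) for q in next_q_target)
    target = reward + gamma * (1.0 - done) * next_q_tot
    return target - q_tot


def vdn_gradients(td_error, n_agents):
    """Gradient of td_error**2 w.r.t. each chosen Q_i(o_i, a_i), target held fixed."""
    # Every agent receives the same gradient because Q_tot is a plain sum.
    return [-2.0 * td_error for _ in range(n_agents)]


q_values = [[1.0, 2.0], [0.5, 0.0]]        # 2 agents, 2 actions each
next_q_target = [[0.0, 1.0], [2.0, 1.0]]
delta = vdn_td_error(q_values, (1, 0), reward=2.0,
                     next_q_target=next_q_target, done=0.0, gamma=0.5)
print(delta, vdn_gradients(delta, n_agents=2))
```

The single scalar TD error is shared by all agents, which is how the common reward reaches each individual network through the sum.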
Pseudocode
Implementations
We implemented three variants of VDN:
vdn.py: VDN with single environment and MLP neural networks.
vdn_multienvs.py: VDN with parallel environments and MLP neural networks.
vdn_lstm.py: VDN with single environment and recurrent neural networks.
Additional details:
Replay buffer: For MLP-based implementations, we store transitions (obs, actions, reward, done, next_obs, next_avail_action). We need to store next_avail_action to compute the TD targets accurately, since we need the action-value of the best available next action. For the RNN-based implementation, we store sequences of transitions (seq_obs, seq_actions, seq_reward, seq_done, seq_next_obs, seq_next_avail_action). During training, we use the first burn_in transitions to compute the hidden state h, and use the remainder of the sequence to update the network.
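As a rough illustration of these two details, the sketch below masks unavailable actions when taking the per-agent maximum and splits a stored sequence into a burn-in part and a training part. The helper names `masked_max`, `next_q_tot` and `split_burn_in` are hypothetical, the values are made up, and the sketch assumes every agent has at least one available action.

```python
def masked_max(q_row, avail_row):
    """Max utility over available actions only (assumes at least one is available)."""
    return max(q for q, ok in zip(q_row, avail_row) if ok)


def next_q_tot(next_q, next_avail_action):
    """Sum over agents of the best *available* next-action utility."""
    return sum(masked_max(q, m) for q, m in zip(next_q, next_avail_action))


def split_burn_in(sequence, burn_in):
    """First `burn_in` steps only warm up the hidden state; the rest are trained on."""
    return sequence[:burn_in], sequence[burn_in:]


next_q = [[5.0, 1.0, 0.5], [0.25, 0.5, 4.0]]   # 2 agents, 3 actions each
next_avail_action = [[0, 1, 1], [1, 1, 0]]     # 1 = available, 0 = unavailable
print(next_q_tot(next_q, next_avail_action))

warmup, train = split_burn_in(list(range(10)), burn_in=7)
print(warmup, train)
```

Without the mask, the target would use the utilities 5.0 and 4.0 of actions the agents cannot take, which would overestimate the TD target.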
Logging
We record the following metrics:
rollout/ep_reward : Mean episode reward during environment rollouts.
rollout/ep_length : Mean episode length during rollouts.
rollout/epsilon : Current exploration epsilon.
rollout/battle_won (SMAClite only): Fraction of battles won by SMAC agents during rollouts.
train/loss : Training loss at the current optimization step.
train/grads : Magnitude of gradients of the VDN networks.
eval/ep_reward : Mean episode reward during evaluation.
eval/std_ep_reward : Standard deviation of episode rewards during evaluation.
eval/ep_length : Mean episode length during evaluation.
eval/battle_won (SMAClite only): Fraction of battles won during evaluation episodes.
Documentation
- class cleanmarl.vdn.Args(env_type='smaclite', env_name='3m', env_family='sisl', agent_ids=True, buffer_size=10000, total_timesteps=1000000, gamma=0.99, learning_starts=5000, train_freq=5, optimizer='Adam', learning_rate=0.0005, batch_size=32, start_e=1, end_e=0.05, exploration_fraction=0.05, hidden_dim=64, num_layers=1, target_network_update_freq=5, polyak=0.005, normalize_reward=False, clip_gradients=5, log_every=10, eval_steps=5000, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
- Parameters:
env_type (str) – Type of the environment: smaclite, pz for PettingZoo, lbf for Level-based Foraging.
env_name (str) – Name of the environment (3m, simple_spread_v3, Foraging-2s-10x10-4p-2f-v3 …)
env_family (str) – Env family when using a PettingZoo environment (sisl, mpe …)
agent_ids (bool) – Include agent IDs (one-hot vector) in observations
buffer_size (int) – The size of the replay buffer
total_timesteps (int) – Total steps of the environment during the training
gamma (float) – Discount factor
learning_starts (int) – Number of environment steps to initialize the replay buffer
train_freq (int) – Train the network every train_freq steps in the environment
optimizer (str) – The optimizer
learning_rate (float) – Learning rate
batch_size (int) – Batch size
start_e (float) – The starting value of epsilon, for exploration
end_e (float) – The end value of epsilon, for exploration
exploration_fraction (float) – The fraction of total_timesteps it takes to go from start_e to end_e.
hidden_dim (int) – Hidden dimension
num_layers (int) – Number of layers
target_network_update_freq (int) – Update the target network every target_network_update_freq steps in the environment
polyak (float) – Polyak coefficient to update the target network
normalize_reward (bool) – Normalize the rewards if True
clip_gradients (float) – <0 for no gradient clipping, >0 to clip gradients at clip_gradients
log_every (int) – Log rollout stats every log_every episodes
eval_steps (int) – Evaluate the policy every eval_steps steps
num_eval_ep (int) – Number of evaluation episodes
use_wnb (bool) – Logging to Weights & Biases if True
wnb_project (str) – Weights & Biases project name
wnb_entity (str) – Weights & Biases entity name
device (str) – Device (cpu, gpu, mps). We only support CPU training for now
seed (int) – Random seed
- class cleanmarl.vdn_lstm.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, buffer_size=10000, seq_length=10, burn_in=7, total_timesteps=1000000, gamma=0.99, learning_starts=5000, train_freq=5, optimizer='Adam', learning_rate=0.0007, batch_size=32, start_e=1, end_e=0.05, exploration_fraction=0.01, hidden_dim=64, num_layers=1, normalize_reward=False, target_network_update_freq=1, polyak=0.005, log_every=10, clip_gradients=1, eval_steps=10000, num_eval_ep=10, use_wnb=False, wnb_project='', wnb_entity='', device='cpu', seed=1)
- class cleanmarl.vdn_multienvs.Args(env_type='smaclite', env_name='3m', env_family='mpe', agent_ids=True, num_envs=4, buffer_size=10000, total_timesteps=1000000, gamma=0.99, learning_starts=5000, train_freq=2, optimizer='Adam', learning_rate=0.0005, batch_size=16, clip_gradients=5, start_e=1, end_e=0.05, exploration_fraction=0.05, hidden_dim=64, num_layers=1, target_network_update_freq=1, polyak=0.005, log_every=10, normalize_reward=False, eval_steps=5000, num_eval_ep=5, use_wnb=False, wnb_project='', wnb_entity='', device='mps', seed=1)
- Parameters:
num_envs (int) – Number of parallel environments
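To show how the exploration and target-update parameters documented above could interact, here is a minimal sketch. It assumes a linear epsilon decay over the first exploration_fraction * total_timesteps steps and a soft target update of the form target = (1 - polyak) * target + polyak * online. Both are common conventions and are assumptions here; the actual scripts may differ. The function names `linear_epsilon` and `polyak_update` are hypothetical.

```python
def linear_epsilon(step, start_e=1.0, end_e=0.05,
                   exploration_fraction=0.05, total_timesteps=1_000_000):
    """Linearly decay epsilon from start_e to end_e, then hold it at end_e."""
    decay_steps = exploration_fraction * total_timesteps
    slope = (end_e - start_e) / decay_steps
    return max(end_e, start_e + slope * step)


def polyak_update(target_params, online_params, polyak=0.005):
    """Soft update (assumed convention): move each target weight toward the online one."""
    return [(1.0 - polyak) * t + polyak * o
            for t, o in zip(target_params, online_params)]


# Epsilon at the start, halfway through the decay, and long after it ended.
print(linear_epsilon(0), linear_epsilon(25_000), linear_epsilon(1_000_000))
# One soft update of two scalar "weights" with an exaggerated coefficient.
print(polyak_update([0.0, 2.0], [1.0, 4.0], polyak=0.5))
```

With the defaults of cleanmarl.vdn.Args shown above, this schedule would reach end_e after 50000 environment steps.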