Counterfactual Multi-Agent ========================== - Paper link: `COMA `_ Quick facts: - COMA is an on-policy actor-critic algorithm. - COMA uses a centralized critic with a decentralized actors. Background ---------- A straightforward adaptation of single-agent actor-critic algorithms to multi-agent RL would be to have, for each agent, an independent actor :math:`\pi_i(a_i| o_i; \theta)` and an independent critic :math:`Q_i(o_i,a_i;\phi)`, and use the following losses: .. math:: \mathcal{L}^{actor}_i(\theta) = - A(o_i,a_i;\phi) log(\pi_i(a_i| o_i; \theta)) .. math:: \mathcal{L}^{critic}_i(\phi) = (y - Q_i(o_i,a_i;\phi))^2 The two main ideas of COMA are: 1. To use only one centralized critic :math:`Q(\mathbf{s}, \mathbf{o},\mathbf{a})` 2. Replace the standard advantage :math:`A(o_i,a_i)` with an individual *counterfactual advantage* that compares the action-value of an agent's action :math:`a_i` to the expected action-value if the agent had selected other actions, while other agents' actions remain fixed. CConcretely, the counterfactual advantage for each agent :math:`i \in \mathcal{I}` is computed as: .. math:: A_i(\mathbf{s}, \mathbf{o},\mathbf{a}) = Q(\mathbf{s}, \mathbf{o},\mathbf{a}) - \sum_{a'_i} \pi_i(a'_i|o_i) Q(\mathbf{s}, \mathbf{o},(\mathbf{a}_{-i},a'_i)) Implementation-wise, the centralized Q-network takes as input the global state :math:`\mathcal{s}`, the agent's observation :math:`o_i`, and the other :math:`n-1` agents' actions :math:`\mathbf{a}_{-i}`. The output of the network has the same size as the action space, giving a value for each possible action :math:`a_i`. This architecture allows easy computation of the counterfactual advantage. .. image:: ../_static/coma_network.png :alt: Architecture diagram :width: 700px :align: center Pseudocode ---------- .. image:: ../_static/coma_algorithm.svg :alt: Architecture diagram :width: 100% :align: center Implementations --------------- We implemented five variants of COMA: - ``coma.py``: COMA with a single environment and MLP neural networks. - ``coma_multienvs.py``: COMA with parallel environments and MLP neural networks. - ``coma_lstm.py``: COMA with single environment and recurrent neural networks. - ``coma_lstm_multienvs.py``: COMA with parallel environments and recurrent neural networks. - ``coma_lbf.py``: COMA with a single environment and MLP neural networks with additional implementation tricks. This script was added to see if COMA can learn Level-Based Foraging environment. We add two things: (1) we use individual rewards and, (2) we correct TD target when the environment is truncated (i.e. time-out) rather than completed. Additional details: - **Rollout buffer**: we store episodes rather than transitions ``{"obs": [],"actions":[],"reward":[],"states":[],"done":[],"avail_actions":[]}``. Storing ``avail_actions`` is importing to compute the correct critic and actor losses - **Parallel environment**: we run ``batch_size`` environments in parallel - **Parallel environments with RNNs**: When using multiple environments in parallel, some episodes may complete before others. We track *alive environments* at each timestep. This is critical for RNN policies, as the hidden state is initially sized ``(num_envs x num_agents, hidden_dim)`` but only updated for ``(num_alive_envs x num_agents, hidden_dim)`` when some episodes finish. - **RNN training** : We use truncated backpropagation through time (TBPTT) to train the RNN network. You can set the length of the sequence using ``tbptt``. - **TD(λ) return**: we use the recursive formula from `Reconciling λ-Returns with Experience Replay (Equation 3) `_ . We start by :math:`R^{\lambda}_T = 0` .. math:: \begin{align} R^{\lambda}_t &= R^{(1)}_t + \gamma \lambda \Big[ R^{\lambda}_{t+1} - \max_{a' \in \mathcal{A}} Q(\hat{s}_{t+1}, a') \Big] \\ &= r_t + \gamma \Big[ \lambda R^{\lambda}_{t+1} + (1-\lambda) \max_{a' \in \mathcal{A}} Q(\hat{s}_{t+1}, a') \Big] \end{align} - **Exploration**: We use the exploration strategy suggested in COMA paper. :math:`ε` is linearly annealed across a number of training steps. .. math:: \pi(a_i) = (1 - \varepsilon) \, \text{softmax}(z_i) + \frac{\varepsilon}{|\mathcal{A}_i|}. - **Individual rewards**: As COMA support individual rewards, we implement a COMA variants that support this configuration for LBF environments. You can allow individual rewards by setting ``reward_aggr=None`` for LBF environments. By default, LBF returns individual rewards .. code-block:: python obs, reward, terminated, truncated, info = self.env.step(actions) ... if self.reward_aggr == "sum": reward = np.sum(reward) elif self.reward_aggr == "mean": reward = np.mean(reward) ## When reward_aggr == None, we keep it as a list of per-agent rewards. Logging ------- We record the following metrics: - **rollout/ep_reward** : Mean episode reward during environment rollout. - **rollout/ep_length** : Mean episode length during rollout. - **rollout/epsilon** : Current exploration epsilon. - **rollout/num_episodes** : Total number of completed episodes until the current step. - **rollout/battle_won** (SMAClite only): Fraction of battle won by SMAC agents. - **train/critic_loss** : The critic loss at the current optimization step. - **train/actor_loss** : The actor loss at the current optimization step. - **train/entropy** : The average entropy per-agent at the current optimization step - **train/actor_gradients** : Magnitude of gradients of actor network. - **train/critic_gradients** : Magnitude of gradients of critic network. - **train/num_updates** : Total number of network updates until the current step. - **eval/ep_reward** : Mean episode reward during evaluation. - **eval/std_ep_reward** : Standard deviation of episode rewards during evaluation. - **eval/ep_length** : Mean episode length during evaluation. - **eval/battle_won** ( SMAClite only): Fraction of battles won during evaluation episodes. Documentation ------------- .. py:class:: cleanmarl.coma.Args(env_type="smaclite", env_name="3m", env_family="mpe", agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, optimizer="Adam", learning_rate_actor=0.0005, learning_rate_critic=0.0005, total_timesteps=1000000, gamma=0.99, td_lambda=0.8, normalize_reward=False, normalize_advantage=True, normalize_return=False, target_network_update_freq=1, polyak=0.005, entropy_coef=0.001, use_tdlamda=True, nsteps=1, start_e=0.5, end_e=0.002, exploration_fraction=750, clip_gradients=-1, log_every=10, eval_steps=50, num_eval_ep=5, use_wnb=False, wnb_project="", wnb_entity="", device="cpu", seed=1) :param env_type: Type of the environment: ``smaclite``, ``pz`` for PettingZoo, ``lbf`` for Level-based Foraging. :type env_type: str :param env_name: Name of the environment (``3m``, ``simple_spread_v3``, ``Foraging-2s-10x10-4p-2f-v3`` ...). :type env_name: str :param env_family: Env family when using a PettingZoo environment (``sisl``, ``mpe`` ...). :type env_family: str :param agent_ids: Include agent IDs (one-hot vector) in observations. :type agent_ids: bool :param batch_size: Number of episodes to collect in each rollout. :type batch_size: int :param actor_hidden_dim: Hidden dimension of the actor network. :type actor_hidden_dim: int :param actor_num_layers: Number of hidden layers of the actor network. :type actor_num_layers: int :param critic_hidden_dim: Hidden dimension of the critic network. :type critic_hidden_dim: int :param critic_num_layers: Number of hidden layers of the critic network. :type critic_num_layers: int :param optimizer: The optimizer. :type optimizer: str :param learning_rate_actor: Learning rate for the actor. :type learning_rate_actor: float :param learning_rate_critic: Learning rate for the critic. :type learning_rate_critic: float :param total_timesteps: Total steps in the environment during training. :type total_timesteps: int :param gamma: Discount factor. :type gamma: float :param use_tdlamda: Whether to use TD(λ) as a target for the critic. If False, n-step returns (``n=nsteps``) are used instead. :type use_tdlamda: bool :param td_lambda: TD(λ) discount factor. :type td_lambda: float :param nsteps: Number of steps when using n-step returns as a target for the critic. :type nsteps: int :param normalize_reward: Normalize the rewards if True. :type normalize_reward: bool :param normalize_advantage: Normalize the advantage if True. :type normalize_advantage: bool :param normalize_return: Normalize the returns if True. :type normalize_return: bool :param target_network_update_freq: Update the target network each ``target_network_update_freq`` training step. :type target_network_update_freq: int :param polyak: Polyak coefficient when using polyak averaging for target network update. :type polyak: float :param entropy_coef: Entropy coefficient used to encourage exploration. :type entropy_coef: float :param start_e: The starting value of epsilon. See *Architecture & Training* in COMA’s paper (Sec. 5). :type start_e: float :param end_e: The end value of epsilon. See *Architecture & Training* in COMA’s paper (Sec. 5). :type end_e: float :param exploration_fraction: Number of training steps required to linearly decay epsilon from ``start_e`` to ``end_e``. :type exploration_fraction: int :param clip_gradients: ``0<`` for no gradient clipping and ``0>`` if clipping gradients at ``clip_gradients``. :type clip_gradients: float :param log_every: Log rollout stats every ``log_every`` episode. :type log_every: int :param eval_steps: Evaluate the policy every ``eval_steps`` training steps. :type eval_steps: int :param num_eval_ep: Number of evaluation episodes. :type num_eval_ep: int :param use_wnb: Logging to Weights & Biases if True. :type use_wnb: bool :param wnb_project: Weights & Biases project name. :type wnb_project: str :param wnb_entity: Weights & Biases entity name. :type wnb_entity: str :param device: Device (``cpu``, ``gpu``, ``mps``). *We only support CPU training for now.* :type device: str :param seed: Random seed. :type seed: int .. py:class:: cleanmarl.coma_multienvs.Args(env_type="smaclite", env_name="3m", env_family="mpe", agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, optimizer="Adam", learning_rate_actor=0.0005, learning_rate_critic=0.0005, total_timesteps=1000000, gamma=0.99, td_lambda=0.8, normalize_reward=False, normalize_advantage=True, normalize_return=False, target_network_update_freq=1, polyak=0.005, entropy_coef=0.001, use_tdlamda=True, nsteps=1, start_e=0.5, end_e=0.002, exploration_fraction=750, clip_gradients=-1, log_every=10, eval_steps=10, num_eval_ep=10, use_wnb=False, wnb_project="", wnb_entity="", device="cpu", seed=1) .. py:class:: cleanmarl.coma_lstm.Args(env_type="smaclite", env_name="3m", env_family="mpe", agent_ids=True, batch_size=5, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, optimizer="Adam", learning_rate_actor=0.0005, learning_rate_critic=0.0005, total_timesteps=1000000, gamma=0.99, td_lambda=0.8, normalize_reward=False, normalize_advantage=True, normalize_return=False, target_network_update_freq=1, polyak=0.005, eval_steps=10, use_tdlamda=True, nsteps=1, start_e=0.5, end_e=0.002, exploration_fraction=750, clip_gradients=-1, tbptt=10, log_every=10, num_eval_ep=10, entropy_coef=0.001, use_wnb=False, wnb_project="", wnb_entity="", device="cpu", seed=1) :param tbptt: Chunk size for Truncated Backpropagation Through Time (TBPTT). :type tbptt: int .. py:class:: cleanmarl.coma_lstm_multienvs.Args(env_type="smaclite", env_name="3m", env_family="mpe", agent_ids=True, batch_size=3, actor_hidden_dim=32, actor_num_layers=1, critic_hidden_dim=128, critic_num_layers=1, optimizer="Adam", learning_rate_actor=0.0005, learning_rate_critic=0.0005, total_timesteps=1000000, gamma=0.99, td_lambda=0.8, normalize_reward=False, normalize_advantage=True, normalize_return=False, target_network_update_freq=1, polyak=0.005, entropy_coef=0.001, use_tdlamda=True, nsteps=1, start_e=0.5, end_e=0.002, exploration_fraction=750, clip_gradients=-1, tbptt=10, log_every=10, eval_steps=50, num_eval_ep=10, use_wnb=False, wnb_project="", wnb_entity="", device="cpu", seed=1) :param tbptt: Chunk size for Truncated Backpropagation Through Time (TBPTT). :type tbptt: int