Graduate Thesis Or Dissertation


Multiagent Learning via Dynamic Skill Selection



  • Multiagent coordination has many real-world applications, such as self-driving cars, inventory management, search and rescue, package delivery, traffic management, warehouse management, and transportation. These tasks are generally characterized by a global team objective that is often temporally sparse, realized only upon completing an episode. The sparsity of the shared team objective often makes it an inadequate learning signal for effective strategies. Moreover, this reward signal does not capture the marginal contribution of each agent towards the global objective, leading to the problem of structural credit assignment in multiagent systems. Furthermore, because desired task behaviors are rarely understood precisely, it is often challenging to manually design agent-specific rewards that improve coordination. While learning these undefined local objectives is critical for successful coordination, it is extremely challenging due to two core difficulties. First, because agents interact in a shared environment, the complexity of the problem can grow exponentially with the number of agents and their behavioral sophistication. Since all agents learn concurrently, each agent perceives the environment as non-stationary, which makes the coordination objective appear extremely noisy. Second, the goal information required to learn coordination behavior is distributed among agents, making it difficult for agents to learn the undefined desired behaviors that optimize a team objective. The key contribution of this work is to address the credit assignment problem in multiagent coordination using several semantically meaningful local rewards. We argue that real-world multiagent coordination tasks can be decomposed into several meaningful skills.
Further, we introduce MADyS, a framework that optimizes a global reward by learning to dynamically select the most suitable skill from a set of semantically meaningful skills, each characterized by its local reward, without requiring any form of reward shaping. Each local reward describes a basic skill and is designed based on domain knowledge. MADyS combines gradient-based optimization, which maximizes the dense local rewards, with gradient-free optimization, which maximizes the sparse team-based reward. Each local reward trains a local policy learner via policy gradient (PG), while an evolutionary algorithm (EA) searches a population of policies to maximize the global objective by picking the optimal local reward at each time step of an episode. While these two processes run concurrently, the experiences collected by the EA population are stored in a replay buffer and reused by the PG-based local-reward optimizers for better sample efficiency. Our experimental results show that MADyS outperforms several baselines. We also visualize the complex coordination behaviors by studying the temporal distribution shifts of the selected local rewards. By visualizing these shifts throughout an episode, we gain insight into how agents learn to (i) decompose a complex task into various sub-tasks, (ii) dynamically configure sub-teams, and (iii) assign the selected sub-tasks to the sub-teams to optimize the global objective as a team.
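The interplay described above, an EA choosing among local rewards to maximize the sparse global objective while feeding its rollouts to PG learners through a shared replay buffer, can be sketched in miniature. This is a toy illustration, not the thesis's implementation: the environment, the reward function, and all names (`global_reward`, `madys_sketch`, `mutate`) are hypothetical, and the PG local-reward learners are reduced to a replay-buffer stub.

```python
import random

random.seed(0)

NUM_SKILLS = 3  # number of semantically meaningful local rewards (hypothetical)
HORIZON = 10    # episode length
POP_SIZE = 8    # EA population size

def global_reward(skill_sequence):
    # Toy sparse team objective: scored only at episode end,
    # rewarding agreement with a fixed target skill ordering.
    target = [t % NUM_SKILLS for t in range(HORIZON)]
    return sum(int(a == b) for a, b in zip(skill_sequence, target))

def mutate(policy, rate=0.2):
    # Point-mutate the per-timestep skill choice.
    return [random.randrange(NUM_SKILLS) if random.random() < rate else s
            for s in policy]

def madys_sketch(generations=50):
    # Each EA individual selects one skill (i.e., one local reward)
    # per timestep of the episode.
    population = [[random.randrange(NUM_SKILLS) for _ in range(HORIZON)]
                  for _ in range(POP_SIZE)]
    replay_buffer = []  # rollouts shared with the (omitted) PG learners

    for _ in range(generations):
        scored = sorted(population, key=global_reward, reverse=True)
        # Store elite rollouts so the PG local-reward optimizers
        # can reuse them for sample efficiency.
        replay_buffer.extend(scored[:2])
        # Elitism plus mutation-based offspring.
        survivors = scored[:POP_SIZE // 2]
        offspring = [mutate(random.choice(survivors))
                     for _ in range(POP_SIZE - len(survivors))]
        population = survivors + offspring

    best = max(population, key=global_reward)
    return best, replay_buffer
```

In the actual framework the PG side would train one policy per local reward on the buffered experiences; here that stage is elided so the gradient-free skill-selection loop stands out.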



