- Multiagent coordination has many real-world applications such as self-driving cars, inventory management, search and rescue, package delivery, traffic management, warehouse management, and transportation. These tasks are generally characterized by a global team objective that is often temporally sparse, realized only upon completing an episode. The sparsity of the shared team objective often makes it an inadequate learning signal for effective strategies. Moreover, this reward signal does not capture the marginal contribution of each agent towards the global objective, which leads to the problem of structural credit assignment in multiagent systems. Furthermore, because the desired task behaviors are rarely understood precisely, it is often challenging to manually design agent-specific rewards to improve coordination.
While learning these undefined local objectives is critical for successful coordination, it is difficult for two core reasons. First, because agents interact within a shared environment, the complexity of the problem may grow exponentially with the number of agents and their behavioral sophistication. Since all agents learn concurrently, each agent perceives the environment as non-stationary and the coordination objective as extremely noisy. Second, the goal information required to learn coordination behavior is distributed among the agents, making it difficult for them to learn undefined desired behaviors that optimize a team objective.
The key contribution of this work is to address the credit assignment problem in multiagent coordination using several semantically meaningful local rewards. We argue that real-world multiagent coordination tasks can be decomposed into several meaningful skills. Further, we introduce MADyS, a framework that optimizes a global reward by learning to dynamically select the optimal skill from a set of semantically meaningful skills, characterized by their local rewards, without requiring any form of reward shaping. Here, each local reward describes a basic skill and is designed based on domain knowledge. MADyS combines gradient-based optimization to maximize the dense local rewards with gradient-free optimization to maximize the sparse team-based reward. Each local reward is used to train a local policy learner via policy gradient (PG), while an evolutionary algorithm (EA) searches over a population of policies to maximize the global objective by selecting the optimal local reward at each time step of an episode. While these two processes run concurrently, the experiences collected by the EA population are stored in a replay buffer and reused by the PG-based local-reward optimizers for better sample efficiency.
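To make the interplay between the two optimizers concrete, the following is a minimal toy sketch of the loop described above: an EA individual is a per-step sequence of skill (local-reward) selections evaluated on a sparse global reward, and experiences from EA rollouts accumulate in a replay buffer for reuse by the PG learners. All names, dynamics, and reward functions here are hypothetical illustrations, not the paper's implementation, and the PG update itself is omitted for brevity.

```python
import random

random.seed(0)

NUM_SKILLS = 3    # number of semantically meaningful local rewards (toy value)
EPISODE_LEN = 5
POP_SIZE = 8

def local_reward(skill, state):
    # Hypothetical dense local reward: skill k prefers states near k.
    return -abs(state - skill)

def global_reward(trajectory):
    # Sparse team reward, realized only at episode end.
    return 1.0 if trajectory[-1][0] == 2 else 0.0

# Experiences from EA rollouts are stored here; in MADyS they would be
# consumed by the PG-based local-reward optimizers for sample efficiency.
replay_buffer = []

def rollout(individual):
    # An EA individual is a sequence of skill indices, one per time step.
    state, traj = 0, []
    for skill in individual:
        # Toy dynamics: following skill k pulls the state toward k.
        if skill > state:
            state += 1
        elif skill < state:
            state -= 1
        r = local_reward(skill, state)
        traj.append((state, skill, r))
        replay_buffer.append((state, skill, r))
    return traj

def mutate(ind):
    child = ind[:]
    child[random.randrange(len(child))] = random.randrange(NUM_SKILLS)
    return child

def evolve(generations=20):
    # Gradient-free search over per-step skill selections,
    # scored only by the sparse global reward.
    pop = [[random.randrange(NUM_SKILLS) for _ in range(EPISODE_LEN)]
           for _ in range(POP_SIZE)]
    for _ in range(generations):
        pop.sort(key=lambda ind: -global_reward(rollout(ind)))
        elites = pop[:POP_SIZE // 2]
        pop = elites + [mutate(random.choice(elites))
                        for _ in range(POP_SIZE - len(elites))]
    return pop[0]

best = evolve()
print("best skill sequence:", best)
```

Note how the EA only ever observes the episode-end global reward, while every intermediate transition (with its dense local reward) lands in `replay_buffer`, which is exactly the data a PG learner per skill would train on.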
Our experimental results show that MADyS outperforms several baselines. We also visualize the complex coordination behaviors by studying the temporal distribution shifts of the selected local rewards. By visualizing these shifts throughout an episode, we gain insight into how agents learn to (i) decompose a complex task into various sub-tasks, (ii) dynamically configure sub-teams, and (iii) assign the selected sub-tasks to the sub-teams to optimize the global objective as a team.