There is growing commercial interest in the use of multiagent systems in real-world applications. Examples include inventory management in warehouses, smart homes, planetary exploration, search and rescue, air-traffic management, and autonomous transportation systems. However, multiagent coordination is an extremely challenging problem. First, information relevant for coordination is often distributed across the team members and fragmented across each agent's observation history (past states). Second, the coordination objective is often sparse and noisy from the perspective of an individual agent. Designing general mechanisms for generating agent-specific reward functions that incentivize an agent to collaborate towards the shared global objective is extremely difficult. From a learning perspective, both challenges can be traced to the difficulty of credit assignment: the process of accurately associating rewards with actions.
The primary contribution of this dissertation is to tackle credit assignment in multiagent systems in order to enable better multiagent coordination. First, we leverage memory as a tool for better credit assignment by facilitating associations between rewards and actions that are separated in time. We achieve this by introducing Modular Memory Units (MMU), a memory-augmented neural architecture that can reliably retain and propagate information over an extended period of time. We then use MMU to augment individual agents' policies in solving dynamic tasks that require adaptive behavior from a distributed multiagent team. We also introduce Distributed MMU (DMMU), which uses memory as a shared knowledge base across a team of distributed agents to enable distributed one-shot decision making.
Switching our attention from the agent to the learning algorithm, we then introduce Evolutionary Reinforcement Learning (ERL), a multilevel optimization framework that blends the strengths of policy gradients and evolutionary algorithms to improve learning. We further extend the ERL framework to introduce Collaborative ERL (CERL), which employs a collection (portfolio) of policy gradient learners, each optimizing over a different resolution of the same underlying task. This leads to a diverse set of policies that are able to reach diverse regions of the solution space. Results on a range of continuous control benchmarks demonstrate that ERL and CERL significantly outperform their constituent learners while remaining more sample-efficient overall.
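As a rough illustration of the two-level idea behind ERL, the following toy sketch runs an evolutionary population alongside a single gradient-based learner and periodically copies the learner into the population. It is not the dissertation's implementation: the quadratic `fitness` stands in for an episode return, `gradient_step` stands in for a policy-gradient update, and every name and parameter (`erl_sketch`, `sync_every`, `n_elites`, `sigma`, `lr`) is an assumption chosen for illustration.

```python
import random


def fitness(theta, target):
    """Toy stand-in for an episode return: higher is better, maximum is 0."""
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def gradient_step(theta, target, lr):
    """One gradient-ascent step on the toy fitness.

    Stand-in for a policy-gradient update; the gradient of -(t - g)^2
    with respect to t is -2 * (t - g).
    """
    return [t - lr * 2.0 * (t - g) for t, g in zip(theta, target)]


def mutate(theta, sigma, rng):
    """Gaussian perturbation of every parameter."""
    return [t + rng.gauss(0.0, sigma) for t in theta]


def erl_sketch(target, pop_size=10, n_elites=3, generations=60,
               sync_every=5, sigma=0.1, lr=0.05, seed=0):
    """Hypothetical two-level loop: evolution plus a gradient learner."""
    rng = random.Random(seed)
    dim = len(target)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    learner = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    for gen in range(1, generations + 1):
        # Evolutionary level: rank by fitness, keep the elites unchanged,
        # and refill the remaining slots with mutated copies of elites.
        population.sort(key=lambda th: fitness(th, target), reverse=True)
        elites = population[:n_elites]
        children = [mutate(rng.choice(elites), sigma, rng)
                    for _ in range(pop_size - n_elites)]
        population = elites + children

        # Gradient level: the learner takes its own gradient step.
        learner = gradient_step(learner, target, lr)

        # Periodic injection: copy the learner into the population,
        # overwriting the last (non-elite) slot.
        if gen % sync_every == 0:
            population[-1] = list(learner)

    return max(population, key=lambda th: fitness(th, target))
```

Calling `erl_sketch([0.5, -0.3, 0.8])` returns the fittest parameter vector in the final population. Because injected learners compete with evolved individuals under the same selection rule, whichever level is making faster progress ends up driving the population, which is the blending effect described above.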
Finally, we introduce Multiagent ERL (MERL), a hybrid algorithm that leverages the multilevel optimization framework of ERL to enable improved multiagent coordination without requiring explicit alignment between local and global reward functions. MERL uses fast, policy-gradient-based learning for each agent by utilizing its dense local rewards. Concurrently, evolution is used to recruit agents into a team by directly optimizing the sparser global objective. Experiments on multiagent coordination benchmarks demonstrate that MERL's integrated approach significantly outperforms state-of-the-art multiagent policy-gradient algorithms.