In this thesis, we introduce alignment-based algorithms that improve the performance of reinforcement learning in problems where the reward signal cannot be collapsed into a single number. Many real-world problems require an agent to balance performance, longevity, and safety, and to do so across different time horizons. The key to intelligent behavior in such complex domains is enabling agents to determine ``what matters when.''
We introduce the concept of local alignment: a metric that determines whether a sub-reward supports a global reward at a particular state and time. This work presents three approaches to using reward alignment for decision making: alignment-based policy selection, alignment-based action selection, and alignment-based learning. By selecting which policy to follow based on this metric, we show that agents using simple sub-rewards can combine their sub-reward-specific policies into a cohesive policy for complex coordination problems that agents trained on a single reward signal cannot learn.
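As a rough illustration of how such a metric could drive policy selection, the sketch below uses a hypothetical one-step definition of local alignment (the product of the resulting changes in a sub-reward and in the global reward). The function names and the lookahead formulation are illustrative assumptions, not the thesis's exact construction:

```python
def local_alignment(delta_sub, delta_global):
    """Positive when the sub-reward moves with the global reward
    (assumed one-step definition, for illustration only)."""
    return delta_sub * delta_global

def select_policy(state, policies, sub_reward_fns, global_reward_fn):
    """Follow the sub-reward-specific policy whose sub-reward is most
    aligned with the global reward at this state (one-step lookahead)."""
    best_policy, best_alignment = None, float("-inf")
    for policy, sub_reward in zip(policies, sub_reward_fns):
        next_state = policy(state)  # candidate one-step transition
        d_sub = sub_reward(next_state) - sub_reward(state)
        d_glob = global_reward_fn(next_state) - global_reward_fn(state)
        score = local_alignment(d_sub, d_glob)
        if score > best_alignment:
            best_policy, best_alignment = policy, score
    return best_policy
```

Under this toy formulation, a sub-reward-specific policy is followed only while its sub-reward and the global reward move together.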
In addition, we show that taking the action for which a sub-reward is maximally aligned with the global reward outperforms learned policies and other baseline methods. Instead of following sub-reward-specific policies or a global policy, agents select the direction in which a sub-reward most supports the global reward. We show that considering only aligned directions achieves rewards between three and ten times higher than those of reinforcement learning policies trained directly on the global reward.
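One way to read ``considering only aligned directions'' is a greedy scheme in which each sub-reward nominates its own best one-step action and the agent keeps the nomination that most supports the global reward. The sketch below is a hypothetical rendering under that assumption; the names and the one-step transition model are illustrative, not the thesis's definition:

```python
def aligned_action(state, actions, transition, sub_reward_fns, global_reward_fn):
    """Let each sub-reward nominate its greedy one-step action, then pick
    the nomination whose direction best supports the global reward."""
    best_action, best_gain = None, float("-inf")
    for sub in sub_reward_fns:
        # The action this sub-reward would take on its own.
        proposal = max(actions, key=lambda a: sub(transition(state, a)))
        # How much that proposal helps (or hurts) the global reward.
        gain = global_reward_fn(transition(state, proposal)) - global_reward_fn(state)
        if gain > best_gain:
            best_action, best_gain = proposal, gain
    return best_action
```

In this toy version, no sub-reward-specific policy is learned at all; the sub-rewards serve only to propose candidate directions, which the global reward then filters.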