Throughput-oriented processors, such as graphics processing units (GPUs), have been increasingly used to accelerate general-purpose computing, including machine learning models that are now employed across numerous disciplines. The thousands of concurrently running threads in a GPU demand a highly efficient memory subsystem for data supply. In this dissertation, we study the memory architecture of traditional GPUs and show that this architecture, originally designed for graphics processing, is less efficient at handling general-purpose computing tasks. We propose several memory architecture optimizations with two primary objectives: (1) to optimize the current memory architecture for more efficient handling of general-purpose computing tasks, and (2) to improve the overall performance of GPUs.
This dissertation has four major parts.

(1) The first part addresses L2 cache inefficiency. A key factor affecting the memory subsystem is the order of memory accesses. Although reordering memory accesses at the L2 cache has large potential benefits for both the cache and DRAM, little work has been conducted to exploit this. In this work, we investigate the largely unexplored opportunity of L2 cache access reordering. We propose the Cache Access Reordering Tree (CART), a novel architecture that improves memory subsystem efficiency by actively reordering memory accesses at the L2 cache to be both cache-friendly and DRAM-friendly.

(2) The second part addresses the miss handling architecture (MHA) in GPUs. A conventional MHA is static in the sense that it provides a fixed number of MSHR entries to track primary misses and a fixed number of slots within each entry to track secondary misses. This leads to severe entry or slot under-utilization and a poor match to practical workloads, since the number of memory requests to different cache lines can vary significantly. We propose the Dynamically Linked MSHR (DL-MSHR), a novel approach that dynamically forms MSHR entries from a pool of available slots. This approach self-adapts to primary-miss-predominant applications by forming more entries with fewer slots each, and to secondary-miss-predominant applications by forming fewer entries with more slots each.

(3) The third part aims to improve the performance of Unified Virtual Memory (UVM), which has recently been introduced into GPUs. We propose CAPTURE (Capacity-Aware Prefetch with True Usage Reflected Eviction), a novel microarchitecture scheme that implements coordinated prefetch-eviction for GPU UVM management. CAPTURE uses GPU memory status and memory access history to dynamically adjust prefetching and to ``capture'' accurate remaining page-reuse opportunities for improved eviction.
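The pooled-slot idea behind DL-MSHR can be illustrated with a minimal functional sketch. This is not the dissertation's hardware implementation; the class and method names (`DLMSHR`, `on_miss`, `on_fill`) and the stall-on-empty-pool policy are illustrative assumptions:

```python
# Sketch of the DL-MSHR concept: instead of statically partitioning
# slots across a fixed number of MSHR entries, entries are formed on
# demand by linking slots taken from a single shared free pool.

class DLMSHR:
    def __init__(self, num_slots):
        self.free_slots = list(range(num_slots))   # shared slot pool
        self.entries = {}   # cache-line address -> chain of (slot, request)

    def on_miss(self, line_addr, request_id):
        if line_addr in self.entries:
            # Secondary miss: extend the existing entry's slot chain.
            if not self.free_slots:
                return "stall"          # pool exhausted (assumed policy)
            slot = self.free_slots.pop()
            self.entries[line_addr].append((slot, request_id))
            return "secondary"
        # Primary miss: form a new entry from one pooled slot.
        if not self.free_slots:
            return "stall"
        slot = self.free_slots.pop()
        self.entries[line_addr] = [(slot, request_id)]
        return "primary"

    def on_fill(self, line_addr):
        # Miss data returns from memory: free every slot in the chain.
        for slot, _ in self.entries.pop(line_addr, []):
            self.free_slots.append(slot)
```

Because all slots live in one pool, a primary-miss-heavy workload naturally forms many short chains, while a secondary-miss-heavy workload forms a few long chains, without either case wasting statically reserved storage.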
(4) In the fourth part, we propose a comprehensive UVM benchmark suite, named UVMBench, to facilitate future research on UVM.