The high degree of parallelism in throughput processors such as GPGPUs has significantly changed on-chip data traffic behavior, demanding new research to identify and address the limiting factors of networks-on-chip (NoCs) in the context of throughput processors. In this work, we first quantitatively analyze the performance of on-chip networks in GPGPUs and identify a serious NoC bottleneck: reply data from memory controllers experience heavy contention when being injected into the reply network. To remove this reply injection bottleneck, we propose a new scheme that exploits the largely neglected interposer resource in 3D chips. Together with a novel last-level cache bank placement, the proposed scheme supplies data traffic from memory controllers at a high rate to feed the reply injection points, and accelerates the consumption of injected packets by quickly transferring them out of the injection points, thereby increasing both the supply and the consumption of reply traffic injection. We also propose a number of optimization techniques to further improve performance and reduce cost. Evaluation results on a wide range of benchmarks show that, on average, the proposed scheme reduces data stall time in memory controllers by 94%, increases IPC by 18%, and reduces energy consumption by 22%.