Various natural language processing (NLP) tasks require deep models that are fast, efficient, and small, with the exact constraints dictated by the target application at the edge or elsewhere. While significant research has improved the efficiency and reduced the size of these models, reducing their downstream latency without significant trade-offs remains difficult....
Compactness in deep learning can be critical to a model’s viability in low-resource applications, and a common approach to extreme model compression is quantization. We consider Iterative Product Quantization (iPQ) with Quant-Noise [Fan et al., 2020] to be state-of-the-art in this area, but this quantization framework suffers from preventable inference...
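Product quantization, the building block behind iPQ, compresses a weight matrix by splitting it into small subvectors, learning a shared codebook over those subvectors, and storing only codebook indices. The following is a minimal pure-Python sketch of that idea, not the actual iPQ or Quant-Noise implementation; the function names, the naive k-means, and all parameter defaults are illustrative assumptions.

```python
import random

def product_quantize(weights, d_sub=2, k=4, iters=10):
    """Illustrative product quantization of a flat weight list:
    split into d_sub-dimensional subvectors, learn a k-entry
    codebook with naive k-means, and keep only codebook indices.
    (Sketch only; real iPQ quantizes iteratively with finetuning.)"""
    subs = [weights[i:i + d_sub] for i in range(0, len(weights), d_sub)]
    random.seed(0)
    centroids = random.sample(subs, k)  # initialize codebook from data
    assign = [0] * len(subs)
    for _ in range(iters):
        # assign each subvector to its nearest centroid (squared L2)
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(s, centroids[c])))
            for s in subs
        ]
        # recompute each centroid as the mean of its assigned subvectors
        for c in range(k):
            members = [s for s, a in zip(subs, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def reconstruct(centroids, assign):
    """Rebuild an approximate weight list from codebook + indices."""
    out = []
    for a in assign:
        out.extend(centroids[a])
    return out
```

The compression ratio comes from storing one small index per subvector instead of `d_sub` floats; Quant-Noise additionally exposes the network to quantization error during training so accuracy survives the compression.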
Simultaneous speech-to-text translation remains a difficult yet important problem for modern machine learning models, in which a text translation is generated incrementally as partial speech input is received. One state-of-the-art simultaneous speech-to-text model is the augmented memory transformer, whose encoder breaks a speech input into fixed-size overlapping segments composed of left, right,...
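The segmentation described above can be sketched generically: the input frame sequence is cut into fixed-size center blocks, each padded with a few frames of left context and right (look-ahead) context so adjacent segments overlap. This is a minimal sketch of that windowing scheme only; the parameter names and sizes are illustrative assumptions, not the augmented memory transformer's actual configuration.

```python
def segment_frames(frames, seg_len=4, left=2, right=2):
    """Split a frame sequence into fixed-size center segments,
    each paired with `left` frames of preceding context and
    `right` frames of look-ahead context (illustrative sketch)."""
    segments = []
    for start in range(0, len(frames), seg_len):
        l_ctx = frames[max(0, start - left):start]
        center = frames[start:start + seg_len]
        r_ctx = frames[start + seg_len:start + seg_len + right]
        segments.append((l_ctx, center, r_ctx))
    return segments
```

Because only the center blocks are non-overlapping, concatenating them recovers the original sequence, while the context windows let each segment's encoder states see a bounded amount of past and future input, which is what makes streaming operation possible.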