US 12,079,703 B2
Convolution-augmented transformer models
Anmol Gulati, New York, NY (US); Ruoming Pang, New York, NY (US); Niki Parmar, Mountain View, CA (US); Jiahui Yu, Jersey City, NJ (US); Wei Han, Mountain View, CA (US); Chung-Cheng Chiu, Mountain View, CA (US); Yu Zhang, Mountain View, CA (US); Yonghui Wu, Palo Alto, CA (US); Shibo Wang, Santa Clara, CA (US); Weikeng Qin, Sunnyvale, CA (US); and Zhengdong Zhang, Mountain View, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Dec. 31, 2020, as Appl. No. 17/139,525.
Prior Publication US 2022/0207321 A1, Jun. 30, 2022
Int. Cl. G06N 3/04 (2023.01); G06N 20/00 (2019.01); G10L 15/16 (2006.01)
CPC G06N 3/04 (2013.01) [G06N 20/00 (2019.01); G10L 15/16 (2013.01)] 18 Claims
[OG exemplary drawing omitted]
 
1. A computer-implemented method for efficiently processing data in a manner that accounts for both local and global dependencies, the method comprising:
accessing data descriptive of a machine-learned conformer model that comprises one or more conformer blocks, each of the one or more conformer blocks configured to process a block input to generate a block output, each of the one or more conformer blocks comprising:
a first feed-forward block configured to process the block input to generate a first feed-forward output;
a self-attention block configured to perform self-attention to process the first feed-forward output to generate an attention output;
a convolutional block configured to perform convolutions with a convolutional filter to process the attention output of the self-attention block to generate a convolutional output; and
a second feed-forward block configured to process the convolutional output of the convolutional block to generate a second feed-forward output;
obtaining input data, wherein the input data comprises audio data; and
processing the input data with the machine-learned conformer model to generate output data, wherein the output data comprises text data, wherein the convolutional block of the machine-learned conformer model processes the outputs of the self-attention block sequentially rather than in parallel with the self-attention block, and wherein the convolutional block and the self-attention block are positioned between the first feed-forward block and the second feed-forward block.
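
For illustration, the following is a minimal PyTorch sketch of a single conformer block as recited in claim 1: a first feed-forward block, followed sequentially by a self-attention block, a convolutional block that consumes the attention output, and a second feed-forward block. Details beyond the claim language (half-step residual weighting, Swish activations, the gated linear unit, the depthwise convolution, layer normalization placement, and all hyperparameters) are assumptions drawn from the published Conformer architecture, not from the claim itself.

import torch
import torch.nn as nn


class FeedForwardBlock(nn.Module):
    # Feed-forward block: LayerNorm -> expand -> Swish -> project back.
    def __init__(self, d_model: int, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, d_model * expansion),
            nn.SiLU(),  # Swish activation (assumption from the paper)
            nn.Dropout(dropout),
            nn.Linear(d_model * expansion, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ConvolutionBlock(nn.Module):
    # Convolutional block: pointwise conv + GLU, depthwise conv, pointwise conv.
    def __init__(self, d_model: int, kernel_size: int = 31, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)  # gated linear unit over the channel dim
        self.depthwise = nn.Conv1d(
            d_model, d_model, kernel_size,
            padding=kernel_size // 2, groups=d_model,  # depthwise convolution
        )
        self.bn = nn.BatchNorm1d(d_model)
        self.act = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); Conv1d expects (batch, channels, time).
        y = self.norm(x).transpose(1, 2)
        y = self.glu(self.pointwise1(y))
        y = self.act(self.bn(self.depthwise(y)))
        y = self.pointwise2(y).transpose(1, 2)
        return self.dropout(y)


class ConformerBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.ff1 = FeedForwardBlock(d_model)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvolutionBlock(d_model)
        self.ff2 = FeedForwardBlock(d_model)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Half-step residuals around each feed-forward block (macaron style).
        x = x + 0.5 * self.ff1(x)
        # Self-attention captures global dependencies across the sequence.
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        # The convolution runs on the attention output, sequentially rather
        # than in parallel, and captures local dependencies.
        x = x + self.conv(x)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)


if __name__ == "__main__":
    block = ConformerBlock()
    frames = torch.randn(2, 100, 256)  # (batch, time, features)
    print(block(frames).shape)         # torch.Size([2, 100, 256])

The sequential attention-then-convolution ordering in the sketch, rather than a parallel branch whose outputs are merged, mirrors the limitation recited in claim 1; in a full speech recognition model, stacks of such blocks would map audio frames to representations decoded into text.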