US 11,741,967 B2
Systems and methods for automatic speech recognition based on graphics processing units
Yongxiong Ren, San Jose, CA (US); Heng Liu, Tucson, AZ (US); Yang Liu, San Jose, CA (US); Lingzhi Liu, San Jose, CA (US); Jie Li, Beijing (CN); Yuanyuan Zhao, Beijing (CN); and Xiaorui Wang, Beijing (CN)
Assigned to KWAI INC., Palo Alto, CA (US)
Filed by KWAI INC., Palo Alto, CA (US)
Filed on Jan. 4, 2021, as Appl. No. 17/141,179.
Prior Publication US 2022/0215843 A1, Jul. 7, 2022
Int. Cl. G10L 15/00 (2013.01); G10L 15/28 (2013.01); G06T 1/20 (2006.01); G10L 19/00 (2013.01)

CPC G10L 15/28 (2013.01) [G06T 1/20 (2013.01); G10L 19/00 (2013.01)]

22 Claims

1. An automatic speech recognition system, comprising:

an encoder comprising a plurality of encoder layers sequentially executed by one or more graphic processing units (GPUs), wherein at least one encoder layer comprises a plurality of encoder sublayers fused into one or more encoder kernels, wherein the encoder receives one or more audio sequences and generates an encoder output;

a first pair of ping-pong buffers, wherein the one or more encoder kernels connect and respectively read from one of the first pair of ping-pong buffers and write into the other of the first pair of ping-pong buffers; and

a decoder that receives a decoder input based on the encoder output and generates a decoder output comprising an output sequence, wherein the decoder comprises a plurality of decoder layers sequentially executed by one or more GPUs, wherein at least one decoder layer comprises a plurality of decoder sublayers fused into one or more decoder kernels.
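
The ping-pong buffering recited in claim 1 can be illustrated with a minimal CPU-only sketch. It assumes each fused kernel can be modeled as a function that reads one buffer and writes the other, with the two buffers swapping roles after every kernel. The names `run_ping_pong`, `scale_bias_relu`, and `add_one` are hypothetical, and the arithmetic inside the kernels is a placeholder. None of this is taken from the patent, which executes its kernels on GPUs.

```python
from typing import Callable, List, Sequence

# A "kernel" here is any function that reads from src and writes into dst.
Kernel = Callable[[Sequence[float], List[float]], None]


def run_ping_pong(kernels: Sequence[Kernel], x: Sequence[float]) -> List[float]:
    """Run kernels in order, alternating between two preallocated buffers."""
    buffers = [list(x), [0.0] * len(x)]  # the pair of ping-pong buffers
    src = 0  # index of the buffer holding the current input
    for kernel in kernels:
        dst = 1 - src  # the other buffer receives the output
        kernel(buffers[src], buffers[dst])
        src = dst  # swap roles: this output is the next kernel's input
    return buffers[src]


def scale_bias_relu(src: Sequence[float], dst: List[float]) -> None:
    # Placeholder for a fused kernel: scale, bias, and ReLU in a single pass,
    # so no intermediate result is written out between the three steps.
    for i, v in enumerate(src):
        dst[i] = max(0.0, 2.0 * v - 1.0)


def add_one(src: Sequence[float], dst: List[float]) -> None:
    # Placeholder for a second kernel in the same layer.
    for i, v in enumerate(src):
        dst[i] = v + 1.0


result = run_ping_pong([scale_bias_relu, add_one], [0.0, 1.0, 2.0])
print(result)  # [1.0, 2.0, 4.0]
```

Because only two buffers are ever allocated, the memory needed for intermediate results stays constant no matter how many kernels are chained. That is the usual motivation for this pattern.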