CPC G10L 15/28 (2013.01) [G06T 1/20 (2013.01); G10L 19/00 (2013.01)] | 22 Claims |
1. An automatic speech recognition system, comprising:
an encoder comprising a plurality of encoder layers sequentially executed by one or more graphic processing units (GPUs), wherein at least one encoder layer comprises a plurality of encoder sublayers fused into one or more encoder kernels, wherein the encoder receives one or more audio sequences and generates an encoder output;
a first pair of ping-pong buffers, wherein the one or more encoder kernels connect and respectively read from one of the first pair of ping-pong buffers and write into the other of the first pair of ping-pong buffers; and
a decoder that receives a decoder input based on the encoder output and generates a decoder output comprising an output sequence, wherein the decoder comprises a plurality of decoder layers sequentially executed by one or more GPUs, wherein at least one decoder layer comprises a plurality of decoder sublayers fused into one or more decoder kernels.
|
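For illustration only, the ping-pong buffer arrangement recited in claim 1 can be sketched on the CPU as follows. This is a minimal sketch under stated assumptions, not the claimed GPU implementation: the names `fuse` and `run_layers` are hypothetical, the "kernels" are plain Python functions rather than GPU kernels, and the fused sublayers are simple elementwise operations standing in for real encoder or decoder sublayers.

```python
from typing import Callable, List, Sequence

# Hypothetical type: a "kernel" reads from a source buffer and writes to a destination buffer.
Kernel = Callable[[Sequence[float], List[float]], None]


def fuse(*sublayers: Callable[[float], float]) -> Kernel:
    """Combine several elementwise sublayers into one kernel.

    The fused kernel reads each input element once, applies every sublayer
    in order, and writes the result once, so no intermediate buffer is
    needed between the sublayers.
    """
    def kernel(src: Sequence[float], dst: List[float]) -> None:
        for i, x in enumerate(src):
            for sublayer in sublayers:
                x = sublayer(x)
            dst[i] = x
    return kernel


def run_layers(kernels: Sequence[Kernel], inputs: Sequence[float]) -> List[float]:
    """Run kernels sequentially over a single pair of ping-pong buffers."""
    # Two buffers allocated once; the first holds the initial input.
    buffers = [list(inputs), [0.0] * len(inputs)]
    read_idx = 0
    for kernel in kernels:
        # Read from one buffer of the pair, write into the other.
        kernel(buffers[read_idx], buffers[1 - read_idx])
        # Swap roles: the buffer just written becomes the next kernel's input.
        read_idx = 1 - read_idx
    return buffers[read_idx]


# Assumed example layers: each layer fuses two elementwise sublayers.
layer1 = fuse(lambda x: x * 2.0, lambda x: x + 1.0)
layer2 = fuse(lambda x: max(0.0, x), lambda x: x * 0.5)

result = run_layers([layer1, layer2], [-1.0, 0.5, 2.0])
print(result)
```

In this sketch the two buffers are allocated once and reused by every kernel, so the number of layers does not change the memory footprint; only the read and write roles alternate.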