US 12,462,794 B2
Methods and devices for structured pruning for automatic speech recognition
Yongxiong Ren, San Jose, CA (US); Bingbing Li, Stafford Spring, CT (US); Yang Liu, San Jose, CA (US); and Lingzhi Liu, San Jose, CA (US)
Assigned to BEIJING TRANSTREAMS TECHNOLOGY CO. LTD., Beijing (CN)
Filed by KWAI INC., Palo Alto, CA (US)
Filed on Mar. 25, 2021, as Appl. No. 17/212,891.
Prior Publication US 2022/0310068 A1, Sep. 29, 2022
Int. Cl. G10L 15/16 (2006.01); G06N 3/04 (2023.01); G06N 3/082 (2023.01); G10L 15/22 (2006.01)
CPC G10L 15/16 (2013.01) [G06N 3/04 (2013.01); G06N 3/082 (2013.01); G10L 15/22 (2013.01)] 20 Claims
OG exemplary drawing
 
12. An apparatus for automatic speech recognition, comprising:
one or more processors; and
a memory configured to store instructions executable by the one or more processors;
wherein the one or more processors, upon execution of the instructions, are configured to:
generate a weight matrix for a layer of a plurality of layers in a neural network, wherein the weight matrix comprises a set of weights associated with the layer, the plurality of layers comprises a first layer receiving a first input associated with one or more audio feature sequences, and the plurality of layers are executed on the one or more processors;
transform the weight matrix organized in a three-dimensional weight tensor to a two-dimensional weight matrix, wherein the three-dimensional tensor has a size that is based on a size of a kernel of the layer and channels of an input of the layer, wherein the size of the three-dimensional tensor is x×y×z, x indicates a square of a width of the kernel, y indicates a depth of the kernel or a number of channels of the input of the layer, z indicates a number of kernels included in the layer, and a size of the two-dimensional weight matrix is (x×y)×z;
divide the two-dimensional weight matrix into a plurality of blocks based on tensor core units of the one or more processors, each block comprising a plurality of weights, wherein the plurality of blocks are directly deployed on the tensor core units of the one or more processors;
select, by a pruning accelerator, a set of blocks from the plurality of blocks for block-wise pruning by minimizing a cost function subject to a pre-determined block-wise constraint, wherein the pre-determined block-wise constraint comprises constraints of hardware implementation, and the cost function comprises regularization terms obtained from a penalty parameter and penalty weights;
add, by the pruning accelerator and using a heuristic algorithm, the pre-determined block-wise constraint to a pruning structure based on the constraints of the hardware implementation; and
adjust a Graphics Processing Unit (GPU) pipeline according to the pruning structure with the pre-determined block-wise constraint added; wherein a block-wise pruned weight matrix is generated by setting one or more weights in the set of blocks to zero;
wherein the heuristic algorithm adds the pre-determined block-wise constraint on a number of non-zero elements, wherein the pruning accelerator prunes the neural network by selecting neurons to be pruned based on the pre-determined block-wise constraint; and
wherein the one or more audio feature sequences are generated from an external audio signal received from an audio component comprising a microphone.
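The weight-tensor transformation recited in the claim (an x×y×z tensor flattened to an (x×y)×z matrix, where x is the squared kernel width, y the input-channel depth, and z the kernel count) can be sketched in a few lines. This is an illustrative sketch only, not the patented implementation; all sizes below are hypothetical.

```python
import numpy as np

# Hypothetical layer dimensions: a 3x3 kernel gives x = 9 spatial taps,
# y input channels form the kernel depth, z kernels make up the layer.
kernel_width = 3
x = kernel_width ** 2   # square of the kernel width, per the claim
y = 4                   # number of input channels (kernel depth)
z = 8                   # number of kernels in the layer

# Three-dimensional weight tensor of size x * y * z.
weights_3d = np.random.randn(x, y, z)

# Flatten to the claimed two-dimensional (x*y) x z weight matrix;
# row-major reshape keeps each kernel's weights in one column.
weights_2d = weights_3d.reshape(x * y, z)
```

With these sizes the resulting matrix is 36×8; each of the z columns holds all x×y weights of one kernel, which is what makes the subsequent block partitioning possible.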
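The block-wise pruning steps can likewise be sketched: partition the two-dimensional weight matrix into fixed-size blocks matching a tensor-core tile, score each block, and zero whole blocks until a constraint on the number of retained blocks is met. The 16×16 tile, the L2-norm block score, and the `keep_ratio` parameter are assumptions standing in for the claimed tensor-core unit sizing and cost-function minimization with regularization terms, which the claim does not specify at this level of detail.

```python
import numpy as np

def block_prune(W, block=16, keep_ratio=0.5):
    """Zero out low-scoring (block x block) tiles of W.

    Hypothetical sketch: the L2 norm per block stands in for the
    claimed cost function, and keep_ratio stands in for the
    pre-determined block-wise constraint on non-zero elements.
    """
    rows, cols = W.shape
    assert rows % block == 0 and cols % block == 0
    nb_r, nb_c = rows // block, cols // block

    # View W as an (nb_r x nb_c) grid of (block x block) tiles.
    tiles = W.reshape(nb_r, block, nb_c, block)

    # Score every block by its L2 norm.
    scores = np.sqrt((tiles ** 2).sum(axis=(1, 3)))

    # Keep only the n_keep highest-scoring blocks (the constraint).
    n_keep = max(1, int(keep_ratio * nb_r * nb_c))
    threshold = np.sort(scores.ravel())[-n_keep]
    mask = (scores >= threshold).astype(W.dtype)  # 1 = keep block

    # Setting every weight in a pruned block to zero yields the
    # block-wise pruned weight matrix.
    pruned = tiles * mask[:, None, :, None]
    return pruned.reshape(rows, cols), mask

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
W_pruned, mask = block_prune(W, block=16, keep_ratio=0.5)
```

Because pruning removes aligned tiles rather than scattered weights, the surviving blocks can be dispatched directly to tensor-core units without an irregular-sparsity gather step, which is the hardware motivation the claim attributes to the block-wise constraint.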