| CPC G10L 15/16 (2013.01) [G06N 3/04 (2013.01); G06N 3/082 (2013.01); G10L 15/22 (2013.01)] | 20 Claims |

|
12. An apparatus for automatic speech recognition, comprising:
one or more processors; and
a memory configured to store instructions executable by the one or more processors;
wherein the one or more processors, upon execution of the instructions, are configured to:
generate a weight matrix for a layer of a plurality of layers in a neural network, wherein the weight matrix comprises a set of weights associated with the layer, the plurality of layers comprises a first layer receiving a first input associated with one or more audio feature sequences, and the plurality of layers are executed on the one or more processors;
transform the weight matrix organized in a three-dimensional weight tensor to a two-dimensional weight matrix, wherein the three-dimensional weight tensor has a size that is based on a size of a kernel of the layer and channels of an input of the layer, wherein the size of the three-dimensional weight tensor is x×y×z, x indicates a square of a width of the kernel, y indicates a depth of the kernel or a number of channels of the input of the layer, z indicates a number of kernels included in the layer, and a size of the two-dimensional weight matrix is (x×y)×z;
divide the two-dimensional weight matrix into a plurality of blocks based on tensor core units of the one or more processors, each block comprising a plurality of weights, wherein the plurality of blocks are directly deployed on the tensor core units of the one or more processors;
select, by a pruning accelerator, a set of blocks from the plurality of blocks for block-wise pruning by minimizing a cost function subject to a pre-determined block-wise constraint, wherein the pre-determined block-wise constraint comprises constraints of hardware implementation, and the cost function comprises regularization terms obtained from a penalty parameter and penalty weights;
add, by the pruning accelerator and using a heuristic algorithm, the pre-determined block-wise constraint to a pruning structure based on the constraints of the hardware implementation; and
adjust a Graphics Processing Unit (GPU) pipeline according to the pruning structure with the pre-determined block-wise constraint added; wherein a block-wise pruned weight matrix is generated by setting one or more weights in the set of blocks to zero,
wherein the heuristic algorithm adds the pre-determined block-wise constraint on a number of non-zero elements, wherein the pruning accelerator prunes the neural network by selecting neurons to be pruned based on the pre-determined block-wise constraint;
and
wherein the one or more audio feature sequences are generated from an external audio signal received from an audio component comprising a microphone.
|
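
The tensor reshaping, block division, and block-wise zeroing recited in claim 12 can be illustrated with a minimal NumPy sketch. It is not the claimed implementation. The function names (`to_2d`, `split_into_blocks`, `prune_blocks`), the 16×16 block size, and the example dimensions are all assumptions made for illustration. In particular, the claim selects blocks by minimizing a cost function with regularization terms under a block-wise constraint; the sketch substitutes a much simpler rule (keep the blocks with the largest Frobenius norm) and models the constraint as a fixed number of non-zero blocks.

```python
import numpy as np


def to_2d(weights_3d):
    """Flatten an x-by-y-by-z weight tensor into an (x*y)-by-z matrix."""
    x, y, z = weights_3d.shape
    return weights_3d.reshape(x * y, z)


def split_into_blocks(w2d, block_rows, block_cols):
    """View a 2-D matrix as a grid of equally sized blocks.

    Returns an array of shape (R, C, block_rows, block_cols), where
    R and C are the number of block rows and block columns.
    """
    rows, cols = w2d.shape
    if rows % block_rows or cols % block_cols:
        raise ValueError("matrix dimensions must be multiples of the block size")
    grid = w2d.reshape(rows // block_rows, block_rows, cols // block_cols, block_cols)
    return grid.swapaxes(1, 2)


def prune_blocks(w2d, block_rows, block_cols, keep_fraction):
    """Zero out whole blocks, keeping a fixed fraction of them.

    Assumption: blocks are ranked by Frobenius norm. This stands in for
    the cost-function minimization recited in the claim.
    """
    blocks = split_into_blocks(w2d, block_rows, block_cols)
    norms = np.sqrt((blocks ** 2).sum(axis=(2, 3)))   # one norm per block
    n_keep = int(round(keep_fraction * norms.size))   # block-count constraint
    order = np.argsort(norms, axis=None)              # ascending by norm
    mask = np.ones(norms.size, dtype=bool)
    mask[order[: norms.size - n_keep]] = False        # drop the weakest blocks
    mask = mask.reshape(norms.shape)
    pruned = blocks * mask[:, :, None, None]          # zero the dropped blocks
    return pruned.swapaxes(1, 2).reshape(w2d.shape), mask


# Hypothetical layer: 3x3 kernels (x = 9), 16 input channels, 32 kernels.
rng = np.random.default_rng(0)
w3d = rng.standard_normal((9, 16, 32))
w2d = to_2d(w3d)                                      # shape (144, 32)
pruned, mask = prune_blocks(w2d, 16, 16, keep_fraction=0.5)
```

Here `mask` records which blocks survive, so a downstream kernel could skip the zeroed blocks entirely. Choosing a block size that matches the hardware tile size is what allows the remaining blocks to map directly onto the compute units; the 16×16 value used above is only a placeholder.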