US 12,444,404 B2
Streaming end-to-end speech recognition method, apparatus and electronic device
Shiliang Zhang, Hangzhou (CN); and Zhifu Gao, Hangzhou (CN)
Assigned to Alibaba Group Holding Limited, Grand Cayman (KY)
Filed by Alibaba Group Holding Limited, Grand Cayman (KY)
Filed on Oct. 28, 2022, as Appl. No. 17/976,464.
Application 17/976,464 is a continuation of application No. PCT/CN2021/089556, filed on Apr. 25, 2021.
Claims priority of application No. 20201036690.7 (CN), filed on Apr. 30, 2020.
Prior Publication US 2023/0064756 A1, Mar. 2, 2023
Int. Cl. G10L 15/00 (2013.01); G10L 15/02 (2006.01); G10L 15/06 (2013.01)
CPC G10L 15/063 (2013.01) [G10L 15/02 (2013.01)]
18 Claims
 
1. A method implemented by a computing device, comprising:
extracting and encoding speech acoustic features of a received voice stream in units of frames;
performing block processing on encoded frames, and predicting a number of activation points included in a same block that need to be encoded and outputted; and
determining a position of at least one activation point that needs to be decoded and outputted according to a prediction result, to allow a decoder to perform decoding at the position of the at least one activation point and output a recognition result, wherein determining the position of the at least one activation point that needs to be decoded and outputted according to the prediction result, comprises:
comparing Attention coefficients of each frame in the same block and sorting the Attention coefficients in order of magnitude, the Attention coefficients being used to describe probabilities that respective frames need to be decoded and outputted; and
determining positions of frames associated with a corresponding number of first few highest Attention coefficients among encoding results of each frame included in the same block as the position of the at least one activation point according to the number of activation points included in the same block.
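
The selection step recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes the per-frame Attention coefficients and the predicted number of activation points per block are already available as plain Python sequences. The function name `select_activation_points` and the parameters `attention`, `block_size` and `counts` are hypothetical and do not appear in the patent.

```python
from typing import List, Sequence


def select_activation_points(
    attention: Sequence[float],
    block_size: int,
    counts: Sequence[int],
) -> List[int]:
    """Pick activation-point frame positions block by block.

    attention  : one Attention coefficient per encoded frame (assumed input).
    block_size : number of frames per block (assumed fixed here).
    counts     : predicted number of activation points for each block
                 (assumed to come from a separate predictor).
    Returns the global frame indices of the selected frames, in time order.
    """
    positions: List[int] = []
    for block_index, k in enumerate(counts):
        start = block_index * block_size
        block = attention[start:start + block_size]
        # Sort frame offsets within the block by coefficient, highest first.
        ranked = sorted(range(len(block)), key=lambda i: block[i], reverse=True)
        # Keep the k highest-scoring frames, then restore time order so a
        # decoder could consume them left to right.
        top_k = sorted(ranked[:k])
        positions.extend(start + offset for offset in top_k)
    return positions


if __name__ == "__main__":
    # Two blocks of four frames; the predictor is assumed to have output
    # one activation point for the first block and two for the second.
    coeffs = [0.1, 0.7, 0.2, 0.05, 0.6, 0.1, 0.9, 0.3]
    print(select_activation_points(coeffs, block_size=4, counts=[1, 2]))
    # prints [1, 4, 6]
```

In this sketch a block whose predicted count is zero contributes no positions, and ties between equal coefficients are resolved in favor of the earlier frame. Both behaviors are choices made for the example, not requirements stated in the claims.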
 
17. A system comprising:
one or more processors; and
memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
extracting and encoding speech acoustic features of a received voice stream in units of frames;
performing block processing on encoded frames, and predicting a number of activation points included in a same block that need to be encoded and outputted; and
determining a position of at least one activation point that needs to be decoded and outputted according to a prediction result, to allow a decoder to perform decoding at the position of the at least one activation point and output a recognition result, wherein determining the position of the at least one activation point that needs to be decoded and outputted according to the prediction result, comprises:
comparing Attention coefficients of each frame in the same block and sorting the Attention coefficients in order of magnitude, the Attention coefficients being used to describe probabilities that respective frames need to be decoded and outputted; and
determining positions of frames associated with a corresponding number of first few highest Attention coefficients among encoding results of each frame included in the same block as the position of the at least one activation point according to the number of activation points included in the same block.