US 12,260,852 B2
Method of training speech recognition model, electronic device and storage medium
Zengwei Yao, Beijing (CN); Liyong Guo, Beijing (CN); Daniel Povey, Beijing (CN); Long Lin, Beijing (CN); Fangjun Kuang, Beijing (CN); Wei Kang, Beijing (CN); Mingshuang Luo, Beijing (CN); Quandong Wang, Beijing (CN); and Yuxiang Kong, Beijing (CN)
Assigned to BEIJING XIAOMI MOBILE SOFTWARE CO., LTD., Beijing (CN)
Filed by BEIJING XIAOMI MOBILE SOFTWARE CO., LTD., Beijing (CN)
Filed on Dec. 9, 2022, as Appl. No. 18/078,460.
Claims priority of application No. 202210613726.7 (CN), filed on May 31, 2022.
Prior Publication US 2023/0386448 A1, Nov. 30, 2023
Int. Cl. G10L 15/06 (2013.01); G10L 15/16 (2006.01); G10L 15/22 (2006.01); G10L 19/00 (2013.01); G10L 19/032 (2013.01)
CPC G10L 15/063 (2013.01) [G10L 15/16 (2013.01); G10L 15/22 (2013.01); G10L 19/032 (2013.01); G10L 2019/0002 (2013.01)] 20 Claims
OG exemplary drawing
 
12. An electronic device, comprising:
a memory and one or more processors;
wherein the memory is configured to store a computer program executable by the one or more processors; and
wherein the one or more processors are configured to execute the computer program in the memory to implement acts comprising:
for each of a plurality of training samples,
inputting speech data of the training sample into a teacher model and a to-be-trained speech recognition model separately,
obtaining an embedding outputted by the teacher model and encoded data outputted by the to-be-trained speech recognition model, wherein the embedding comprises a floating-point vector holding D floating-point numbers,
obtaining quantized codebook data by performing a multi-codebook quantization on the embedding, wherein the quantized codebook data comprises N integers corresponding to the speech data and each integer indicates a codebook index, wherein N is a positive integer,
calculating a loss based on the encoded data, the quantized codebook data, and text data in the training sample, and
obtaining a trained speech recognition model by stopping training the to-be-trained speech recognition model in response to determining at least one of the following: the loss being less than or equal to a preset loss threshold, or a number of training iterations being greater than a preset number of training iterations.
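The multi-codebook quantization step recited above (turning the teacher's D-dimensional floating-point embedding into N integer codebook indices) can be sketched as follows. This is an illustrative reading only, assuming a product-quantization-style scheme in which the embedding is split into N sub-vectors and each sub-vector is mapped to the index of its nearest codeword; the function and variable names are hypothetical and do not appear in the patent:

```python
import numpy as np

def multi_codebook_quantize(embedding, codebooks):
    """Quantize a D-dim float embedding into N integer codebook indices.

    embedding: (D,) float vector, e.g. a teacher model's output for one frame.
    codebooks: list of N arrays, each of shape (K, D // N); one codebook
               of K codewords per sub-vector.
    Returns:   (N,) int array, the index of the nearest codeword in each
               codebook (Euclidean distance).
    """
    n = len(codebooks)
    sub_vectors = np.split(embedding, n)  # N sub-vectors of size D // N
    indices = []
    for sub, book in zip(sub_vectors, codebooks):
        # nearest codeword in this codebook by Euclidean distance
        dists = np.linalg.norm(book - sub, axis=1)
        indices.append(int(np.argmin(dists)))
    return np.array(indices)

# Toy example: D = 4, N = 2 codebooks with K = 2 codewords each.
emb = np.array([1.0, 0.0, 0.0, 1.0])
books = [
    np.array([[1.0, 0.0], [0.0, 0.0]]),  # codebook for sub-vector 1
    np.array([[0.0, 0.0], [0.0, 1.0]]),  # codebook for sub-vector 2
]
print(multi_codebook_quantize(emb, books))  # [0 1]
```

Compressing the embedding to N small integers, rather than D floating-point numbers, is what makes the quantized codebook data a compact distillation target for the loss computed in the claimed training loop.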