US 12,437,749 B2
Training data sequence for RNN-T based global English model
Takashi Fukuda, Tokyo (JP)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed by INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed on Nov. 3, 2021, as Appl. No. 17/518,027.
Prior Publication US 2023/0136842 A1, May 4, 2023
Int. Cl. G10L 15/06 (2013.01); G06N 3/08 (2023.01); G10L 15/02 (2006.01); G10L 15/10 (2006.01); G10L 15/16 (2006.01)

CPC G10L 15/063 (2013.01) [G06N 3/08 (2013.01); G10L 15/02 (2013.01); G10L 15/10 (2013.01); G10L 15/16 (2013.01)]

25 Claims

1. A computer-implemented method for preparing training data for a speech recognition model, the method comprising:

obtaining a plurality of audio data sets, each audio data set having a different acoustic feature; and

training a recurrent neural network transducer speech recognition model by sorting sentences from the plurality of audio data sets so that similar sentences from different audio data sets are positioned closely as a primary constraint by utilizing a similarity-score dependent penalty imposed for composed dissimilar data based on distances between sentences on a word vector and at least two hyperparameters, while imposing a secondary constraint on audio length by comparing audio distances between the sentences from utterances extracted from the plurality of audio data sets.
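
The ordering step recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and everything specific in it is an assumption made for illustration: the `Utterance` record, the precomputed sentence vectors, cosine distance as the sentence distance, the linear form of `penalty`, the two hyperparameters `alpha` (penalty weight) and `beta` (distance threshold), and the greedy nearest-neighbor ordering. The sketch ranks candidates first by text distance plus penalty (primary constraint) and uses the audio-length gap only to break ties (secondary constraint).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record for one sentence drawn from one audio data set."""
    dataset: str      # which audio data set (e.g. accent) it came from
    text: str
    vector: tuple     # precomputed sentence vector (assumed to exist)
    duration: float   # audio length in seconds


def cosine_distance(a, b):
    """Distance between two sentence vectors: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def penalty(distance, alpha, beta):
    """Assumed penalty form: zero up to threshold beta, then linear with weight alpha."""
    return alpha * max(0.0, distance - beta)


def order_utterances(utterances, alpha=10.0, beta=0.2):
    """Greedily chain utterances so that similar sentences end up adjacent.

    Primary key: text distance plus similarity-dependent penalty.
    Secondary key: absolute difference in audio length (tie-breaker only).
    """
    remaining = list(utterances)
    if not remaining:
        return []
    ordered = [remaining.pop(0)]
    while remaining:
        prev = ordered[-1]

        def cost(u):
            d = cosine_distance(prev.vector, u.vector)
            # Rounding lets near-equal text costs fall through to the length key.
            return (round(d + penalty(d, alpha, beta), 6),
                    abs(prev.duration - u.duration))

        nxt = min(remaining, key=cost)
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered


if __name__ == "__main__":
    pool = [
        Utterance("us", "turn on the light", (1.0, 0.0), 2.0),
        Utterance("uk", "what is the weather", (0.0, 1.0), 2.1),
        Utterance("in", "turn on the lights", (1.0, 0.0), 5.0),
        Utterance("uk", "turn on the light", (1.0, 0.0), 2.2),
    ]
    for u in order_utterances(pool):
        print(u.dataset, u.duration, u.text)
```

In this toy setup the sentences with matching vectors from different data sets are grouped together, and among equally similar candidates the one closest in audio length is taken first. A real pipeline would replace the toy vectors with learned word or sentence embeddings and might use a different penalty shape.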