US 12,014,725 B2
Large-scale language model data selection for rare-word speech recognition
Ronny Huang, Mountain View, CA (US); and Tara N. Sainath, Jersey City, NJ (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Dec. 13, 2021, as Appl. No. 17/643,861.
Claims priority of provisional application 63/261,946, filed on Sep. 30, 2021.
Prior Publication US 2023/0096821 A1, Mar. 30, 2023
Int. Cl. G10L 15/16 (2006.01); G06N 3/02 (2006.01); G10L 15/06 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2006.01)

CPC G10L 15/063 (2013.01) [G06N 3/02 (2013.01); G10L 15/16 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2013.01)]

24 Claims

1. A computer-implemented method for training an external language model to recognize rare words in speech, the computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising:

receiving a corpus of training text samples, each training text sample in the corpus of training text samples comprising a corresponding sentence;

determining a frequency distribution of the corpus of training text samples that identifies a corresponding frequency that each training text sample in the corpus of training text samples occurs relative to the corresponding frequencies of the other training text samples in the corpus of training text samples;

executing a resampling function on the corpus of training text samples that downsamples the frequency distribution of the corpus of training text samples by:

matching the frequency distribution of the corpus of training text samples up to a threshold frequency; and

applying logarithmic scaling on the frequency distribution of the corpus of training text samples after the threshold frequency to identify high frequency training text samples as the training text samples from the corpus of training text samples that have corresponding frequencies exceeding the threshold frequency;

obtaining a set of training text samples by removing the identified high frequency training text samples from the corpus of training text samples;

obtaining a set of training utterances used for training an automatic speech recognition (ASR) model, each training utterance in the set of training utterances comprising audio data corresponding to an utterance and a corresponding transcription of the utterance;

applying rare word filtering on the set of training text samples to identify a subset of rare-word training text samples that include words that do not appear in the transcriptions from the set of training utterances or appear in the transcriptions from the set of training utterances less than a threshold number of times; and

training the external language model on the transcriptions from the set of training utterances and the identified subset of rare-word training text samples.
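
The data-selection steps recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation. The claim does not give a formula for the logarithmic scaling, so the form `threshold * (1 + ln(count / threshold))` is assumed here because it matches the raw count at the threshold. Treating "removing" as dropping surplus copies of over-threshold sentences is likewise one possible reading. The function names, the whitespace tokenization, and the lowercasing are all hypothetical, and the sketch stops at assembling the training text rather than training a language model.

```python
import math
from collections import Counter


def resampled_count(count: int, threshold: int) -> int:
    """Keep counts up to the threshold; log-scale counts above it.

    The scaling formula is an assumption, chosen to be continuous at the threshold.
    """
    if count <= threshold:
        return count
    return int(round(threshold * (1.0 + math.log(count / threshold))))


def downsample_corpus(corpus: list[str], threshold: int) -> list[str]:
    """Drop surplus copies of sentences whose frequency exceeds the threshold."""
    frequency = Counter(corpus)  # sentence -> number of occurrences
    resampled: list[str] = []
    for sentence, count in frequency.items():
        resampled.extend([sentence] * resampled_count(count, threshold))
    return resampled


def rare_word_subset(
    samples: list[str], transcriptions: list[str], rare_threshold: int
) -> list[str]:
    """Keep samples with a word seen fewer than rare_threshold times in the transcriptions.

    Words that never appear in the transcriptions have count 0 and so qualify.
    """
    word_counts = Counter(
        word for text in transcriptions for word in text.lower().split()
    )
    return [
        sample
        for sample in samples
        if any(word_counts[word] < rare_threshold for word in sample.lower().split())
    ]


def build_lm_training_text(
    corpus: list[str],
    transcriptions: list[str],
    freq_threshold: int,
    rare_threshold: int,
) -> list[str]:
    """Combine the ASR transcriptions with the selected rare-word text samples."""
    downsampled = downsample_corpus(corpus, freq_threshold)
    rare = rare_word_subset(downsampled, transcriptions, rare_threshold)
    return list(transcriptions) + rare


if __name__ == "__main__":
    corpus = ["play music"] * 100 + ["call zygmunt"] * 2
    transcriptions = ["play music", "play music now", "call mom"]
    print(build_lm_training_text(corpus, transcriptions, freq_threshold=10, rare_threshold=1))
```

In this toy run, the frequent sentence "play music" is downsampled and then dropped by the rare-word filter because both of its words already appear in the transcriptions. The sentence "call zygmunt" is kept because "zygmunt" never appears in the transcriptions.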