US 12,437,752 B2
Large-scale language model data selection for rare-word speech recognition
Wenqian Ronny Huang, Mountain View, CA (US); and Tara N. Sainath, Jersey City, NJ (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on May 10, 2024, as Appl. No. 18/660,655.
Application 18/660,655 is a continuation of application No. 17/643,861, filed on Dec. 13, 2021, now Pat. No. 12,014,725.
Claims priority of provisional application 63/261,946, filed on Sep. 30, 2021.
Prior Publication US 2024/0290323 A1, Aug. 29, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/06 (2013.01); G06N 3/02 (2006.01); G10L 15/16 (2006.01); G10L 15/197 (2013.01); G10L 15/22 (2006.01)

CPC G10L 15/063 (2013.01) [G06N 3/02 (2013.01); G10L 15/16 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2013.01)]

20 Claims

1. A computer-implemented method executing on data processing hardware that causes the data processing hardware to perform operations comprising:

receiving a corpus of training text samples, each training text sample in the corpus of training text samples comprising a corresponding sentence;

executing a resampling function on the corpus of training text samples that downsamples a frequency distribution of the corpus of training text samples by:

matching the frequency distribution of the corpus of training text samples up to a threshold frequency; and

applying logarithmic scaling on the frequency distribution of the corpus of training text samples after the threshold frequency to identify high frequency training text samples as the training text samples from the corpus of training text samples that have corresponding frequencies exceeding the threshold frequency;

obtaining a set of training text samples by removing the identified high frequency training text samples from the corpus of training text samples;

applying rare word filtering on the training text samples to identify a subset of rare-word training text samples; and

training a language model on the identified subset of rare-word training text samples.
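
For illustration only, the data-selection steps recited in claim 1 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the piecewise form in `resampled_count` (identity up to the threshold, then `threshold * (1 + ln(freq / threshold))`) is just one function that matches the distribution up to a threshold and scales logarithmically beyond it; "removing" high-frequency samples is read here as dropping surplus copies of sentences whose count exceeds the threshold; and the rare-word criterion (`max_word_count`, computed over the downsampled corpus) is hypothetical. All function and parameter names are illustrative.

```python
import math
from collections import Counter


def resampled_count(freq: int, threshold: int) -> int:
    """Assumed resampling function: identity up to `threshold`, log growth after it."""
    if freq <= threshold:
        return freq
    return round(threshold * (1.0 + math.log(freq / threshold)))


def downsample(corpus: list[str], threshold: int) -> Counter:
    """Map each unique sentence to the number of copies kept after downsampling."""
    counts = Counter(corpus)
    return Counter({s: resampled_count(f, threshold) for s, f in counts.items()})


def rare_word_filter(sentences, word_counts, max_word_count: int) -> list[str]:
    """Keep sentences with at least one word seen at most `max_word_count` times."""
    return [
        s for s in sentences
        if any(word_counts[w] <= max_word_count for w in s.split())
    ]


# Toy corpus: one very frequent sentence, one moderate, one containing rare words.
corpus = ["play music"] * 1000 + ["call mom"] * 5 + ["navigate to ouagadougou"]

kept = downsample(corpus, threshold=10)

# Word counts over the downsampled corpus (hypothetical rarity reference).
word_counts = Counter()
for sentence, n in kept.items():
    for word in sentence.split():
        word_counts[word] += n

rare_subset = rare_word_filter(kept.keys(), word_counts, max_word_count=1)
```

In this toy run the 1000 copies of "play music" shrink to 56, the sentences at or below the threshold keep their original counts, and only "navigate to ouagadougou" survives the rare-word filter. The resulting subset is what would then be passed to language model training, which the sketch leaves out.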