| CPC G10L 15/063 (2013.01) [G06N 3/02 (2013.01); G10L 15/16 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2013.01)] | 20 Claims |

|
1. A computer-implemented method executing on data processing hardware that causes the data processing hardware to perform operations comprising:
receiving a corpus of training text samples, each training text sample in the corpus of training text samples comprising a corresponding sentence;
executing a resampling function on the corpus of training text samples that downsamples a frequency distribution of the corpus of training text samples by:
matching the frequency distribution of the corpus of training text samples up to a threshold frequency; and
applying logarithmic scaling on the frequency distribution of the corpus of training text samples after the threshold frequency to identify high frequency training text samples as the training text samples from the corpus of training text samples that have corresponding frequencies exceeding the threshold frequency;
obtaining a set of training text samples by removing the identified high frequency training text samples from the corpus of training text samples;
applying rare word filtering on the training text samples to identify a subset of rare-word training text samples; and
training a language model on the identified subset of rare-word training text samples.
|
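
As an illustration only, the data-selection operations recited in claim 1 can be sketched in Python as follows. The claim does not specify the form of the logarithmic scaling, the rare-word criterion, or any function names, so `resample_count`, `select_rare_word_samples`, the formula `threshold + log(count / threshold)`, and the `rare_word_max_count` parameter are all assumptions made for this sketch. Word counts are computed over the set remaining after high-frequency removal, which is also an assumption. The final language-model training step is omitted.

```python
import math
from collections import Counter


def resample_count(count, threshold):
    """Downsampled frequency for one sample (assumed form, not from the claim).

    Counts up to the threshold are matched unchanged; above the threshold,
    the excess is compressed logarithmically.
    """
    if count <= threshold:
        return count
    return threshold + math.log(count / threshold)


def select_rare_word_samples(corpus, threshold, rare_word_max_count):
    """Return the rare-word subset of a corpus of sentences.

    corpus: list of sentences (strings), with repeats encoding frequency.
    threshold: sentence frequency above which a sample is high frequency.
    rare_word_max_count: hypothetical cutoff; a word is treated as rare if
        it occurs at most this many times in the remaining samples.
    """
    sentence_counts = Counter(corpus)

    # Identify high-frequency samples: those whose frequency exceeds the threshold.
    high_frequency = {s for s, c in sentence_counts.items() if c > threshold}

    # Obtain the set of training samples by removing the high-frequency ones.
    remaining = [s for s in corpus if s not in high_frequency]

    # Rare-word filtering: keep sentences containing at least one rare word.
    word_counts = Counter(w for s in remaining for w in s.split())
    return [
        s for s in remaining
        if any(word_counts[w] <= rare_word_max_count for w in s.split())
    ]


corpus = ["play music"] * 5 + ["call mom"] * 2 + ["navigate to zzyzx road"]
rare_subset = select_rare_word_samples(corpus, threshold=3, rare_word_max_count=1)
```

With this toy corpus, "play music" exceeds the threshold and is removed, "call mom" survives removal but contains no rare word, and only the sentence with singleton words is kept for training.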