CPC G10L 15/16 (2013.01) [G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/22 (2013.01); G10L 2015/025 (2013.01); G10L 2015/088 (2013.01); G10L 2015/223 (2013.01)] | 20 Claims |
1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations for training an end-to-end keyword spotting model, the operations comprising:
receiving a training input audio sequence that contains a keyword;
generating a plurality of sequential encoder windows over an expected location of the keyword contained in the training input audio sequence;
generating a decoder window in a time interval that includes an endpoint of the keyword;
for each encoder window in the plurality of sequential encoder windows, determining a max pooling loss at the corresponding encoder window;
determining a max pooling loss for the decoder window; and
optimizing the end-to-end keyword spotting model based on the max pooling losses determined for the plurality of sequential encoder windows and the max pooling loss determined for the decoder window.
|
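The operations recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation, and it rests on several assumptions that the claim does not state: each encoder window is scored against its own per-frame posterior sequence, the max pooling loss is taken as the negative log of the highest posterior inside a window, the per-window losses are combined by an unweighted sum, and windows are half-open frame ranges. The names `max_pool_loss` and `combined_loss` are hypothetical.

```python
import math

EPS = 1e-12  # assumed floor to avoid log(0); not taken from the claim


def max_pool_loss(posteriors, start, end):
    """Negative log of the largest posterior in frames [start, end).

    Only the best-scoring frame in the window contributes to the loss,
    which is the usual reading of "max pooling" over time.
    """
    window = posteriors[start:end]
    if not window:
        raise ValueError("window selects no frames")
    return -math.log(max(max(window), EPS))


def combined_loss(encoder_posteriors, encoder_windows,
                  decoder_posteriors, decoder_window):
    """Sum one loss per sequential encoder window plus the decoder window loss.

    encoder_posteriors[k] is the per-frame posterior sequence scored
    against encoder window k (an assumption about the model's outputs).
    """
    if len(encoder_posteriors) != len(encoder_windows):
        raise ValueError("need one posterior sequence per encoder window")
    encoder_total = sum(
        max_pool_loss(p, start, end)
        for p, (start, end) in zip(encoder_posteriors, encoder_windows)
    )
    decoder_total = max_pool_loss(decoder_posteriors, *decoder_window)
    return encoder_total + decoder_total


# Toy example: two sequential encoder windows over the expected keyword
# location, and a decoder window that covers the keyword endpoint.
encoder_posteriors = [
    [0.1, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.25, 0.1],
]
encoder_windows = [(0, 2), (1, 4)]
decoder_posteriors = [0.1, 0.1, 0.1, 0.5]
decoder_window = (2, 4)

loss = combined_loss(encoder_posteriors, encoder_windows,
                     decoder_posteriors, decoder_window)
```

In a training loop, a value like `loss` would be the quantity minimized in the optimizing step; how the claimed method weights or combines the individual losses is not specified in the claim text.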