US 11,783,849 B2
Enhanced multi-channel acoustic models
Ehsan Variani, Mountain View, CA (US); Kevin William Wilson, Cambridge, MA (US); Ron J. Weiss, New York, NY (US); Tara N. Sainath, Jersey City, NJ (US); and Arun Narayanan, Santa Clara, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jun. 8, 2021, as Appl. No. 17/303,822.
Application 17/303,822 is a continuation of application No. 16/278,830, filed on Feb. 19, 2019, granted, now 11,062,725.
Application 16/278,830 is a continuation of application No. 15/350,293, filed on Nov. 14, 2016, granted, now 10,224,058, issued on Mar. 5, 2019.
Claims priority of provisional application 62/384,461, filed on Sep. 7, 2016.
Prior Publication US 2021/0295859 A1, Sep. 23, 2021
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/16 (2006.01); G10L 25/30 (2013.01); G10L 21/028 (2013.01); G10L 21/0388 (2013.01); G10L 19/008 (2013.01); G10L 15/20 (2006.01); G10L 21/0208 (2013.01); G10L 21/0216 (2013.01)
CPC G10L 25/30 (2013.01) [G10L 15/16 (2013.01); G10L 15/20 (2013.01); G10L 19/008 (2013.01); G10L 21/028 (2013.01); G10L 21/0388 (2013.01); G10L 2021/02087 (2013.01); G10L 2021/02166 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations comprising:
receiving multi-channel audio data representing an utterance captured by multiple microphones during a same period of time, the multi-channel audio data comprising multiple time-domain audio signals each obtained from a respective one of the multiple microphones, the multiple microphones located at different spatial positions with respect to a user that spoke the utterance;
for each of multiple spatial directions, generating a corresponding spatial filtered output by processing each time-domain audio signal among the multiple time-domain audio signals of the multi-channel audio data;
predicting sub-word units encoded in the time-domain audio signals for respective portions of the utterance by processing a frequency-domain representation of the corresponding spatial filtered output generated for each of the multiple spatial directions; and
generating a transcription for the utterance based on the predicted sub-word units encoded in the time-domain audio signals for the respective portions of the utterance.
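The front end recited in the claim can be sketched in code: per-direction spatial filtering of the time-domain channels (here modeled as a filter-and-sum beamformer, one assumed interpretation of "spatial filtered output"), followed by a frequency-domain representation of each filtered output that would feed an acoustic model predicting sub-word units. This is an illustrative sketch, not the patented implementation; all function names, the filter-and-sum formulation, and the STFT parameters (`frame`, `hop`) are hypothetical choices.

```python
import numpy as np

def spatial_filter(signals, direction_filters):
    """Filter-and-sum beamformer for one look direction (illustrative).

    signals: array of shape (channels, samples), one time-domain
             signal per microphone.
    direction_filters: per-channel FIR taps for this spatial direction,
             shape (channels, taps).
    Returns a single spatially filtered time-domain signal.
    """
    # Convolve each channel with its direction-specific filter and sum,
    # yielding the "corresponding spatial filtered output" for this direction.
    return sum(np.convolve(signals[c], direction_filters[c], mode="same")
               for c in range(signals.shape[0]))

def multi_look_features(signals, filter_bank, frame=512, hop=256):
    """Frequency-domain features for each of multiple spatial directions.

    filter_bank: shape (directions, channels, taps), one FIR filter set
                 per spatial look direction (hypothetical layout).
    Returns features of shape (frames, bins, directions).
    """
    feats = []
    for direction_filters in filter_bank:
        y = spatial_filter(signals, direction_filters)
        # Frequency-domain representation of the spatially filtered output:
        # windowed frames followed by a magnitude FFT.
        frames = [y[i:i + frame] * np.hanning(frame)
                  for i in range(0, len(y) - frame + 1, hop)]
        feats.append(np.abs(np.fft.rfft(np.stack(frames), axis=-1)))
    # Stacked per-direction spectra would be the input to an acoustic
    # model that predicts sub-word units for portions of the utterance.
    return np.stack(feats, axis=-1)
```

A downstream acoustic model (e.g. a neural network) would consume these stacked per-direction spectra frame by frame; the sub-word posteriors it emits would then be decoded into the transcription of the utterance.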