CPC G10L 15/22 (2013.01) [G10L 15/02 (2013.01); G10L 21/0208 (2013.01); G10L 21/0272 (2013.01); G10L 25/78 (2013.01); G10L 25/87 (2013.01)] | 20 Claims |
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:
receiving raw audio data corresponding to an utterance of audible contents spoken by a user and captured by an assistant-enabled device, the raw audio data capturing one or more additional sounds that are not spoken by the user;
receiving, from an image capture device in communication with the data processing hardware, image data capturing the user while speaking the utterance of the audible contents;
extracting, from the image data, a facial image for the user;
extracting, from the raw audio data, audio features synchronized with lips of the user moving in the extracted facial image; and
processing, using the extracted audio features, the raw audio data to generate enhanced audio data that isolates the utterance of the audible contents spoken by the user and excludes at least a portion of the one or more additional sounds that are not spoken by the user.
|
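The operations recited in claim 1 can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the claimed implementation. A per-video-frame lip-motion score is assumed to be already available from some face tracker (standing in for the facial-image extraction step). Mean energy per audio slice stands in for the "audio features synchronized with lips", and a broadband Wiener-style gain stands in for the enhancement step. All function names, parameters, and the threshold value are hypothetical.

```python
from statistics import mean


def frame_energies(raw_audio, samples_per_frame):
    """Audio features: mean energy of each audio slice, one slice per video frame."""
    n_frames = len(raw_audio) // samples_per_frame
    return [
        mean(s * s for s in raw_audio[i * samples_per_frame:(i + 1) * samples_per_frame])
        for i in range(n_frames)
    ]


def enhance(raw_audio, lip_motion, samples_per_frame, motion_threshold=0.05):
    """Attenuate audio that is not synchronized with lip movement.

    raw_audio:          list of float samples (speech plus additional sounds)
    lip_motion:         hypothetical per-video-frame lip-motion scores
    samples_per_frame:  audio samples per video frame (sample rate / frame rate)
    """
    energies = frame_energies(raw_audio, samples_per_frame)
    n = min(len(energies), len(lip_motion))

    # A frame counts as "speaking" when the lips are moving (assumed threshold).
    speaking = [lip_motion[i] > motion_threshold for i in range(n)]

    # Estimate the energy of the additional sounds from frames without lip motion.
    quiet = [energies[i] for i in range(n) if not speaking[i]]
    noise_energy = mean(quiet) if quiet else 0.0

    enhanced = []
    for i in range(n):
        if speaking[i] and energies[i] > 0.0:
            # Wiener-style gain: fraction of frame energy not explained by noise.
            gain = max(energies[i] - noise_energy, 0.0) / energies[i]
        else:
            # No lip motion: treat the whole frame as additional sound.
            gain = 0.0
        start = i * samples_per_frame
        enhanced.extend(gain * s for s in raw_audio[start:start + samples_per_frame])
    return enhanced
```

Frames without lip motion are muted entirely, and frames with lip motion are scaled down by the estimated noise share. A practical system would more likely work per frequency band or with a learned audio-visual model; the single broadband gain here only keeps the sketch short.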