| CPC H04N 5/067 (2013.01) [G10L 25/57 (2013.01); H04N 7/04 (2013.01)] | 20 Claims |

|
1. A processor-implemented method, the method comprising:
capturing input, including at least one visual input and at least one audio input, to a first device;
training a machine learning model to recognize a visual cue indicative of a user desire to speak and predict a volume at which a user will speak based on a visual input from the at least one visual input and an audio input, synchronized to the visual input, from the at least one audio input;
marking one or more timestamps which are determined, using the model, to correspond to speech in the at least one audio input; and
transmitting an audio input from within the at least one audio input corresponding to the one or more marked timestamps from the first device to a second device.
|
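The last two steps of claim 1 (marking timestamps the model attributes to speech, then transmitting only the audio at those timestamps) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `Frame` record, the 0.5 threshold, the `send` callback standing in for the link to the second device, and `toy_model`, which is a hand-written rule standing in for the trained machine learning model. Training and volume prediction are not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Frame:
    """One synchronized capture unit (hypothetical layout)."""
    timestamp: float          # seconds since capture started
    visual: Sequence[float]   # visual features, e.g. [mouth_openness]
    audio: Sequence[float]    # audio samples synchronized to the visual input


# A model maps (visual, audio) to a speech score in [0, 1].
Model = Callable[[Sequence[float], Sequence[float]], float]


def mark_speech_timestamps(frames: List[Frame], model: Model,
                           threshold: float = 0.5) -> List[float]:
    """Return the timestamps whose model score meets the threshold."""
    return [f.timestamp for f in frames
            if model(f.visual, f.audio) >= threshold]


def select_audio(frames: List[Frame],
                 marked: List[float]) -> List[Tuple[float, List[float]]]:
    """Keep only the audio that belongs to marked timestamps."""
    wanted = set(marked)
    return [(f.timestamp, list(f.audio))
            for f in frames if f.timestamp in wanted]


def transmit(payload: List[Tuple[float, List[float]]],
             send: Callable[[Tuple[float, List[float]]], None]) -> None:
    """Hand each selected segment to a caller-supplied send function."""
    for segment in payload:
        send(segment)


def toy_model(visual: Sequence[float], audio: Sequence[float]) -> float:
    """Stand-in for the trained model: open mouth plus audible energy."""
    if not audio:
        return 0.0
    energy = sum(s * s for s in audio) / len(audio)
    return 1.0 if visual[0] > 0.5 and energy > 0.01 else 0.0


frames = [
    Frame(0.0, [0.1], [0.0, 0.0]),    # mouth closed, silent
    Frame(0.5, [0.9], [0.3, -0.2]),   # mouth open, audible
    Frame(1.0, [0.8], [0.0, 0.0]),    # mouth open, silent
    Frame(1.5, [0.7], [0.5, 0.4]),    # mouth open, audible
]

marked = mark_speech_timestamps(frames, toy_model)
payload = select_audio(frames, marked)

received = []                         # stands in for the second device
transmit(payload, received.append)
```

In this sketch only the frames at 0.5 s and 1.5 s reach `received`; the silent frames never leave the first device, which is the filtering effect the claim describes.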