CPC G10L 13/027 (2013.01) [G06F 40/56 (2020.01); G10L 13/08 (2013.01); G10L 15/16 (2013.01); G10L 15/1815 (2013.01); G10L 25/69 (2013.01); H04N 7/157 (2013.01); G10L 15/1822 (2013.01); G10L 15/183 (2013.01); G10L 15/20 (2013.01); G10L 15/26 (2013.01); G10L 25/30 (2013.01); G10L 25/57 (2013.01); G10L 25/60 (2013.01)] — 20 Claims

1. A method, comprising:
receiving, by a device, video data that includes a text transcript, audio sequences, and image frames utilized in a virtual communication;
detecting, by the device, a network fluctuation based on the video data;
processing, by the device, the text transcript, based on the network fluctuation and with a language model, to generate a new phrase;
generating, by the device, a response phoneme based on the new phrase;
utilizing, by the device, a text embedding model to generate a text embedding based on the response phoneme;
processing, by the device, the audio sequences, based on the network fluctuation and with the language model, to generate a target voice sequence;
utilizing, by the device, an audio embedding model to generate an audio embedding based on the target voice sequence;
processing, by the device, the image frames, based on the network fluctuation and with an image model, to generate a target image sequence;
utilizing, by the device, an image embedding model to generate an image embedding based on the target image sequence;
combining, by the device, the text embedding, the audio embedding, and the image embedding to generate an embedding input vector;
processing, by the device, the embedding input vector, with an audio synthesis model, to generate a final voice response;
processing, by the device, the embedding input vector, with a frame synthesis model, to generate a final video; and
providing, by the device, the video data, the final voice response, and the final video via the virtual communication.
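The claim's combining step merges three per-modality embeddings into a single embedding input vector that feeds both synthesis models. A minimal sketch of that step, assuming simple vector concatenation and illustrative embedding sizes (the function name, dimensions, and concatenation strategy are assumptions, not recited in the claim):

```python
# Hypothetical sketch of the embedding-combination step: the text, audio,
# and image embeddings are joined end-to-end into one embedding input
# vector. Embeddings are modeled here as plain lists of floats; the
# dimensions (128/256/512) are illustrative only.

def combine_embeddings(text_emb, audio_emb, image_emb):
    """Concatenate per-modality embeddings into a single input vector."""
    return list(text_emb) + list(audio_emb) + list(image_emb)

# Placeholder outputs of the text, audio, and image embedding models.
text_emb = [0.0] * 128
audio_emb = [0.0] * 256
image_emb = [0.0] * 512

embedding_input_vector = combine_embeddings(text_emb, audio_emb, image_emb)
print(len(embedding_input_vector))  # 896
```

In this reading, the same concatenated vector is passed unchanged to both the audio synthesis model and the frame synthesis model; other fusion strategies (summation, cross-attention) would also satisfy the claim's "combining" language.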