US 12,394,405 B2
Systems and methods for reconstructing video data using contextually-aware multi-modal generation during signal loss
Subham Biswas, Thane (IN); and Saurabh Tahiliani, Noida (IN)
Assigned to Verizon Patent and Licensing Inc., Basking Ridge, NJ (US)
Filed by Verizon Patent and Licensing Inc., Basking Ridge, NJ (US)
Filed on Mar. 24, 2023, as Appl. No. 18/126,212.
Prior Publication US 2024/0321260 A1, Sep. 26, 2024
Int. Cl. G10L 13/08 (2013.01); G06F 40/56 (2020.01); G10L 13/027 (2013.01); G10L 15/16 (2006.01); G10L 15/18 (2013.01); G10L 15/26 (2006.01); G10L 25/69 (2013.01); H04N 7/15 (2006.01); G10L 15/183 (2013.01); G10L 15/20 (2006.01); G10L 25/30 (2013.01); G10L 25/57 (2013.01); G10L 25/60 (2013.01)
CPC G10L 13/027 (2013.01) [G06F 40/56 (2020.01); G10L 13/08 (2013.01); G10L 15/16 (2013.01); G10L 15/1815 (2013.01); G10L 25/69 (2013.01); H04N 7/157 (2013.01); G10L 15/1822 (2013.01); G10L 15/183 (2013.01); G10L 15/20 (2013.01); G10L 15/26 (2013.01); G10L 25/30 (2013.01); G10L 25/57 (2013.01); G10L 25/60 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method, comprising:
receiving, by a device, video data that includes a text transcript, audio sequences, and image frames utilized in a virtual communication;
detecting, by the device, a network fluctuation based on the video data;
processing, by the device, the text transcript, based on the network fluctuation and with a language model, to generate a new phrase;
generating, by the device, a response phoneme based on the new phrase;
utilizing, by the device, a text embedding model to generate a text embedding based on the response phoneme;
processing, by the device, the audio sequences, based on the network fluctuation and with the language model, to generate a target voice sequence;
utilizing, by the device, an audio embedding model to generate an audio embedding based on the target voice sequence;
processing, by the device, the image frames, based on the network fluctuation and with an image model, to generate a target image sequence;
utilizing, by the device, an image embedding model to generate an image embedding based on the target image sequence;
combining, by the device, the text embedding, the audio embedding, and the image embedding to generate an embedding input vector;
processing, by the device, the embedding input vector, with an audio synthesis model, to generate a final voice response;
processing, by the device, the embedding input vector, with a frame synthesis model, to generate a final video; and
providing, by the device, the video data, the final voice response, and the final video via the virtual communication.
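The steps recited in claim 1 can be sketched as a toy pipeline. Everything below is an illustrative assumption, not the patentee's implementation: `VideoData`, `detect_network_fluctuation`, `embed`, and the stand-in synthesis steps are hypothetical names, and the "models" are trivial placeholders showing only the data flow (three modality embeddings combined into one input vector that drives both the voice and frame outputs).

```python
# Hypothetical sketch of the claim-1 pipeline; all model calls are
# trivial stand-ins, not the claimed language/audio/image models.
from dataclasses import dataclass

@dataclass
class VideoData:
    text_transcript: list   # words received so far
    audio_sequences: list   # per-frame audio features (None = lost)
    image_frames: list      # per-frame image features (None = lost)

def detect_network_fluctuation(video: VideoData) -> bool:
    # Stand-in heuristic: treat missing (None) frames as signal loss.
    return any(f is None for f in video.image_frames)

def language_model_phrase(transcript):
    # Stand-in "language model": continue from the last received word.
    return transcript[-1] + "..." if transcript else "..."

def phonemize(phrase):
    # Stand-in "response phoneme" step: keep alphabetic symbols only.
    return [c for c in phrase if c.isalpha()]

def embed(seq, dim=4):
    # Shared toy embedding model: hash items into a fixed-size vector.
    vec = [0.0] * dim
    for i, item in enumerate(seq):
        vec[i % dim] += float(hash(str(item)) % 100) / 100.0
    return vec

def reconstruct(video: VideoData):
    if not detect_network_fluctuation(video):
        return None  # no fluctuation detected; nothing to reconstruct
    # Text branch: new phrase -> response phoneme -> text embedding.
    text_emb = embed(phonemize(language_model_phrase(video.text_transcript)))
    # Audio branch: target voice sequence -> audio embedding.
    audio_emb = embed([a for a in video.audio_sequences if a is not None])
    # Image branch: target image sequence -> image embedding.
    image_emb = embed([f for f in video.image_frames if f is not None])
    # Combine the three embeddings into one embedding input vector.
    joint = text_emb + audio_emb + image_emb
    final_voice = [v * 0.5 for v in joint]  # stand-in audio synthesis model
    final_video = [v * 2.0 for v in joint]  # stand-in frame synthesis model
    return video, final_voice, final_video
```

The key structural point the sketch preserves is that both synthesis models consume the *same* combined embedding input vector, so the reconstructed voice and video stay mutually consistent during the outage.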