US 12,334,048 B2
Systems and methods for reconstructing voice packets using natural language generation during signal loss
Saurabh Tahiliani, Noida (IN); and Subham Biswas, Thane (IN)
Assigned to Verizon Patent and Licensing Inc., Basking Ridge, NJ (US)
Filed by Verizon Patent and Licensing Inc., Basking Ridge, NJ (US)
Filed on Oct. 12, 2022, as Appl. No. 18/045,893.
Prior Publication US 2024/0127790 A1, Apr. 18, 2024
Int. Cl. G10L 13/047 (2013.01); G06F 40/30 (2020.01); G10L 13/07 (2013.01); G10L 13/08 (2013.01); G10L 15/16 (2006.01); G10L 15/187 (2013.01); G10L 15/22 (2006.01); G10L 19/005 (2013.01); G10L 25/18 (2013.01)
CPC G10L 13/08 (2013.01) [G06F 40/30 (2020.01); G10L 13/047 (2013.01); G10L 13/07 (2013.01); G10L 15/16 (2013.01); G10L 15/187 (2013.01); G10L 15/22 (2013.01); G10L 19/005 (2013.01); G10L 25/18 (2013.01)]
20 Claims
 
1. A method, comprising:
receiving, by a device, audio data that includes voice packets transmitted during a virtual meeting;
converting, by the device, the audio data to text data in real-time;
detecting, by the device and based on the audio data, a network fluctuation that causes missing voice packets in the audio data;
processing, by the device, partial text and context of the text data, based on the network fluctuation and with a language model, to generate a new phrase;
generating, by the device, a response phoneme based on processing the new phrase with a phoneme generation model;
utilizing, by the device, a text embedding model to generate a text embedding based on the response phoneme,
wherein the text embedding model is a text classification neural network model without a dense layer and an output layer;
processing, by the device, the audio data, based on the network fluctuation and with the language model, to generate a target voice sequence;
utilizing, by the device, an audio embedding model to generate an audio embedding based on the target voice sequence;
combining, by the device, the text embedding and the audio embedding to generate an embedding input vector;
processing, by the device, the embedding input vector, with an audio synthesis model, to generate a final voice response; and
providing, by the device, the audio data and the final voice response via the virtual meeting.
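
As an illustration only, two of the steps recited in claim 1 can be sketched in Python: detecting missing voice packets and combining the text embedding with the audio embedding. The claim does not say how the network fluctuation is detected or how the embeddings are combined, so this sketch assumes that packets carry a sequence number (similar to RTP) and that "combining" means concatenation. The names VoicePacket, find_missing_sequences, and combine_embeddings are hypothetical and do not come from the patent. Sequence-number wraparound is not handled.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VoicePacket:
    seq: int        # assumed per-packet sequence number (not specified in the claim)
    payload: bytes  # encoded audio for this packet


def find_missing_sequences(packets: Sequence[VoicePacket]) -> List[int]:
    """Return sequence numbers absent between the lowest and highest seen.

    Assumption: a gap in sequence numbers is treated as a sign of the
    network fluctuation that causes missing voice packets.
    """
    if not packets:
        return []
    seen = {p.seq for p in packets}
    first, last = min(seen), max(seen)
    return [s for s in range(first, last + 1) if s not in seen]


def combine_embeddings(
    text_embedding: Sequence[float], audio_embedding: Sequence[float]
) -> List[float]:
    """Build the embedding input vector by concatenation.

    Assumption: the claim says only "combining"; concatenation is one
    simple choice and may differ from what the patent describes.
    """
    return list(text_embedding) + list(audio_embedding)
```

In this sketch, a non-empty result from find_missing_sequences would trigger the generation steps of the claim (language model, phoneme generation model, embedding models), and the output of combine_embeddings would be passed to the audio synthesis model.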