CPC G10L 13/08 (2013.01) [G06F 40/30 (2020.01); G10L 13/047 (2013.01); G10L 13/07 (2013.01); G10L 15/16 (2013.01); G10L 15/187 (2013.01); G10L 15/22 (2013.01); G10L 19/005 (2013.01); G10L 25/18 (2013.01)]

20 Claims

1. A method, comprising:
receiving, by a device, audio data that includes voice packets transmitted during a virtual meeting;
converting, by the device, the audio data to text data in real-time;
detecting, by the device and based on the audio data, a network fluctuation that causes missing voice packets in the audio data;
processing, by the device, partial text and context of the text data, based on the network fluctuation and with a language model, to generate a new phrase;
generating, by the device, a response phoneme based on processing the new phrase with a phoneme generation model;
utilizing, by the device, a text embedding model to generate a text embedding based on the response phoneme,
wherein the text embedding model is a text classification neural network model without a dense layer and an output layer;
processing, by the device, the audio data, based on the network fluctuation and with the language model, to generate a target voice sequence;
utilizing, by the device, an audio embedding model to generate an audio embedding based on the target voice sequence;
combining, by the device, the text embedding and the audio embedding to generate an embedding input vector;
processing, by the device, the embedding input vector, with an audio synthesis model, to generate a final voice response; and
providing, by the device, the audio data and the final voice response via the virtual meeting.
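The steps of claim 1 describe an end-to-end repair pipeline: detect dropped voice packets, have a language model predict the missing phrase, convert it to phonemes, derive a text embedding and an audio embedding, combine them, and synthesize a replacement voice response. The following is a minimal Python sketch of that flow under stated assumptions only; the class names (LanguageModel, PhonemeModel, TextEmbedder, AudioEmbedder, Synthesizer), vector sizes, and stub behaviors are hypothetical illustrations and do not come from the patent itself.

```python
"""Hypothetical sketch of the claim 1 pipeline; all models are placeholder stubs."""

import numpy as np


class LanguageModel:
    """Stand-in for the language model that repairs gaps in the transcript."""

    def complete_phrase(self, partial_text: str, context: str) -> str:
        # Hypothetical: a real system would condition on the meeting context here.
        return partial_text + " [predicted words]"

    def predict_voice_sequence(self, audio_frames: list) -> np.ndarray:
        # Hypothetical target voice sequence (e.g., speaker/prosody features).
        return np.zeros(16)


class PhonemeModel:
    def to_phonemes(self, phrase: str) -> list:
        # Hypothetical grapheme-to-phoneme conversion.
        return phrase.split()


class TextEmbedder:
    """Text classification network with its dense and output layers removed,
    so the remaining activations serve as the text embedding (per the claim)."""

    def embed(self, phonemes: list) -> np.ndarray:
        return np.random.rand(32)


class AudioEmbedder:
    def embed(self, voice_sequence: np.ndarray) -> np.ndarray:
        return np.random.rand(32)


class Synthesizer:
    def synthesize(self, embedding: np.ndarray) -> bytes:
        # Hypothetical vocoder producing the final voice response.
        return embedding.tobytes()


def repair_missing_audio(packets, transcript_so_far):
    """Run the sketched pipeline when a network fluctuation drops packets."""
    if None not in packets:  # no missing voice packets detected
        return None

    lm = LanguageModel()
    received = [p for p in packets if p is not None]

    # Claimed steps: new phrase -> response phonemes -> text embedding.
    new_phrase = lm.complete_phrase(transcript_so_far, context="meeting")
    phonemes = PhonemeModel().to_phonemes(new_phrase)
    text_emb = TextEmbedder().embed(phonemes)

    # Target voice sequence from the received audio -> audio embedding.
    voice_seq = lm.predict_voice_sequence(received)
    audio_emb = AudioEmbedder().embed(voice_seq)

    # Combine the embeddings into one input vector and synthesize the response.
    combined = np.concatenate([text_emb, audio_emb])
    return Synthesizer().synthesize(combined)


if __name__ == "__main__":
    audio = repair_missing_audio([b"frame0", None, b"frame2"], "we should move the")
    print("synthesized bytes:", len(audio) if audio else 0)
```

Concatenation is used here only to illustrate the "combining ... to generate an embedding input vector" step; the claim does not specify how the text and audio embeddings are fused.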