US 12,380,897 B2
Real-time packet loss concealment using deep generative networks
Santiago Pascual, Barcelona (ES); Joan Serra, Barcelona (ES); and Jordi Pons Puig, Olot (ES)
Assigned to DOLBY INTERNATIONAL AB, Dublin (IE)
Appl. No. 18/248,359
Filed by DOLBY INTERNATIONAL AB, Dublin (IE)
PCT Filed Oct. 14, 2021, PCT No. PCT/EP2021/078443 § 371(c)(1), (2) Date Apr. 7, 2023, PCT Pub. No. WO2022/079164, PCT Pub. Date Apr. 21, 2022.
Claims priority of provisional application 63/126,123, filed on Dec. 16, 2020.
Claims priority of provisional application 63/195,831, filed on Jun. 2, 2021.
Claims priority of application No. ES202031040 (ES), filed on Oct. 15, 2020; and application No. ES202130258 (ES), filed on Mar. 24, 2021.
Prior Publication US 2023/0377584 A1, Nov. 23, 2023
Int. Cl. G10L 19/005 (2013.01); G06N 3/0455 (2023.01); G06N 3/0475 (2023.01); G06N 3/094 (2023.01); G10L 19/038 (2013.01); G10L 25/30 (2013.01)

CPC G10L 19/005 (2013.01) [G06N 3/0455 (2023.01); G06N 3/0475 (2023.01); G06N 3/094 (2023.01); G10L 19/038 (2013.01); G10L 25/30 (2013.01)]

20 Claims

1. A method for packet loss concealment of an incomplete audio signal, the incomplete audio signal comprising a substitute signal portion replacing an original signal portion of a complete audio signal, the method comprising:

obtaining a representation of the incomplete audio signal;

inputting the representation of the incomplete audio signal to an encoder neural network trained to predict a latent representation of a complete audio signal given a representation of an incomplete audio signal;

outputting, by the encoder neural network, a latent representation of a predicted complete audio signal;

quantizing the latent representation of the complete audio signal to obtain a quantized latent representation, wherein the quantized latent representation is formed by selecting a set of tokens out of a predetermined vocabulary set of tokens;

conditioning, with at least one token of the quantized latent representation, a generative neural network, wherein the generative neural network is trained to predict a token of the set of tokens provided at least one different token of the set of tokens;

outputting by the generative neural network a predicted token of the latent representation and a confidence metric associated with the predicted token;

based on the confidence metric of the predicted token, replacing a corresponding token of the quantized latent representation with the predicted token;

inputting the quantized latent representation of the predicted complete audio signal to a decoder neural network trained to predict a representation of a complete audio signal given a latent representation of a complete audio signal; and

outputting, by the decoder neural network, a representation of the predicted complete audio signal comprising a reconstruction of the original portion of the complete audio signal, wherein said encoder neural network and said decoder neural network have been trained with an adversarial neural network.
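
The quantizing and token-replacement steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the two-dimensional codebook, the function names (quantize, toy_generative_model, conceal), the repeat-the-previous-token stand-in model, the use of the maximum softmax probability as the confidence metric, and the 0.5 threshold are all assumptions made for illustration. The encoder, the decoder, and the adversarial training are omitted.

```python
import math

# Hypothetical toy codebook standing in for the "predetermined vocabulary
# set of tokens"; a real system would learn a much larger one.
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def quantize(latents, codebook=CODEBOOK):
    """Map each latent vector to the index (token) of its nearest codebook entry."""
    tokens = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_generative_model(tokens, position, vocab_size=len(CODEBOOK)):
    """Stand-in for the generative network (assumption, not the patent's model).

    Returns logits for the token at `position` given the other tokens;
    here it simply favors repeating the previous token.
    """
    logits = [0.0] * vocab_size
    if position > 0:
        logits[tokens[position - 1]] = 3.0
    return logits


def conceal(tokens, lost_positions, model=toy_generative_model, threshold=0.5):
    """Replace tokens at lost positions when the prediction is confident enough.

    Confidence is taken as the highest softmax probability (assumption).
    """
    out = list(tokens)
    for pos in lost_positions:
        probs = softmax(model(out, pos))
        confidence = max(probs)
        predicted = probs.index(confidence)
        if confidence >= threshold:
            out[pos] = predicted
    return out


# Latents as an encoder might output them (made-up values).
tokens = quantize([[0.9, 0.1], [0.1, 0.2], [0.2, 0.9]])
print(tokens)                # [1, 0, 2]
print(conceal(tokens, [1]))  # [1, 1, 2]
```

In this sketch the token at position 1 is replaced because the stand-in model's confidence (about 0.87) exceeds the assumed threshold. With a stricter threshold, for example 0.95, the quantized token would be left unchanged. The resulting token sequence would then be passed to the decoder.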