US 12,475,911 B2
Method for learning an audio quality metric combining labeled and unlabeled data
Joan Serra, Barcelona (ES); Jordi Pons Puig, Barcelona (ES); and Santiago Pascual, Barcelona (ES)
Assigned to Dolby International AB, Dublin (IE)
Appl. No. 18/012,256
Filed by Dolby International AB, Dublin (IE)
PCT Filed Jun. 21, 2021, PCT No. PCT/EP2021/066786
§ 371(c)(1), (2) Date Dec. 22, 2022,
PCT Pub. No. WO2021/259842, PCT Pub. Date Dec. 30, 2021.
Claims priority of provisional application 63/090,919, filed on Oct. 13, 2020.
Claims priority of provisional application 63/072,787, filed on Aug. 31, 2020.
Claims priority of application No. ES202030605 (ES), filed on Jun. 22, 2020; and application No. 20203277 (EP), filed on Oct. 22, 2020.
Prior Publication US 2023/0245674 A1, Aug. 3, 2023
Int. Cl. G10L 25/30 (2013.01); G10L 25/60 (2013.01)
CPC G10L 25/30 (2013.01) [G10L 25/60 (2013.01)]
17 Claims
 
1. A method of training a neural-network-based system for determining an indication of an audio quality of an audio input, the method comprising:
obtaining, as input, at least one training set comprising audio samples, wherein the audio samples comprise audio samples of a first type and audio samples of a second type, wherein each of the first type of audio samples is labelled with information indicative of a respective predetermined audio quality metric, and wherein each of the second type of audio samples is labelled with information indicative of a respective audio quality metric relative to that of a reference audio sample;
inputting the audio samples of the training set to the neural-network-based system including:
a neural-network-based encoder configured to map an audio sample to a respective latent vector in a latent space, and
a neural-network-based assessment stage including a first assessment head and one or more second assessment heads, the first assessment head being configured to generate a respective quality score corresponding to the audio sample based on the respective latent vector; and
iteratively training the neural-network-based system to predict the respective label information of the audio samples in the training set based on a plurality of loss functions,
wherein each of the loss functions is configured to reflect differences between the respective label information of the audio samples in the training set and the respective predictions thereof generated by a respective one of the assessment heads; and
wherein the one or more second assessment heads and a corresponding subset of the plurality of loss functions are configured to regularize the latent space based on a plurality of respective latent vectors to which the audio samples of the training set are mapped during the iterative training.
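
The arrangement recited in claim 1 can be illustrated with a minimal sketch in plain Python. It is not the patented implementation, and everything concrete in it is an assumption made for illustration: audio samples are stood in for by short feature vectors, the encoder is a single tanh layer, the first assessment head is a linear score, the one second assessment head (here called `preference_head`) predicts whether a sample is better than its reference, and the two loss functions are a squared error and a cross-entropy. Training uses finite-difference gradient descent with step halving only to keep the example free of dependencies. The point being shown is that both losses act on the same encoder, so the second head's loss shapes the latent space that the first head scores.

```python
import math
import random

def encode(x, W):
    # Hypothetical encoder: one tanh layer mapping features x to a latent vector.
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def score_head(z, v):
    # First assessment head: latent vector -> scalar quality score.
    return sum(vi * zi for vi, zi in zip(v, z))

def preference_head(z, z_ref, u):
    # Hypothetical second head: probability that the sample beats its reference.
    logit = sum(ui * (a - b) for ui, a, b in zip(u, z, z_ref))
    return 1.0 / (1.0 + math.exp(-logit))

def unpack(theta, n_in, n_lat):
    # Split the flat parameter list into encoder weights and the two head vectors.
    W = [theta[i * n_in:(i + 1) * n_in] for i in range(n_lat)]
    v = theta[n_lat * n_in:n_lat * n_in + n_lat]
    u = theta[n_lat * n_in + n_lat:]
    return W, v, u

def total_loss(theta, abs_set, rel_set, n_in, n_lat, weight=1.0):
    W, v, u = unpack(theta, n_in, n_lat)
    # Loss 1: squared error between the first head's score and the absolute label.
    l_abs = sum((score_head(encode(x, W), v) - y) ** 2
                for x, y in abs_set) / len(abs_set)
    # Loss 2: cross-entropy between the second head's output and the relative label.
    l_rel = 0.0
    for x, x_ref, better in rel_set:
        p = preference_head(encode(x, W), encode(x_ref, W), u)
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # guard the logarithms
        l_rel -= better * math.log(p) + (1 - better) * math.log(1.0 - p)
    return l_abs + weight * l_rel / len(rel_set)

def train(theta, abs_set, rel_set, n_in, n_lat, steps=50, lr=0.1, eps=1e-5):
    # Finite-difference gradient descent; a step is kept only if the loss drops.
    args = (abs_set, rel_set, n_in, n_lat)
    loss = total_loss(theta, *args)
    for _ in range(steps):
        grad = []
        for i in range(len(theta)):
            up = theta[:i] + [theta[i] + eps] + theta[i + 1:]
            down = theta[:i] + [theta[i] - eps] + theta[i + 1:]
            grad.append((total_loss(up, *args) - total_loss(down, *args)) / (2 * eps))
        trial = [t - lr * g for t, g in zip(theta, grad)]
        trial_loss = total_loss(trial, *args)
        if trial_loss < loss:
            theta, loss = trial, trial_loss
        else:
            lr *= 0.5
    return theta, loss

# Toy data (invented for illustration only).
rng = random.Random(0)
n_in, n_lat = 3, 2
theta0 = [rng.uniform(-0.5, 0.5) for _ in range(n_lat * n_in + 2 * n_lat)]
# First type: (features, absolute quality label).
abs_set = [([0.9, 0.1, 0.0], 4.5), ([0.1, 0.8, 0.3], 1.5)]
# Second type: (features, reference features, 1 if better than reference else 0).
rel_set = [([0.2, 0.7, 0.2], [0.9, 0.1, 0.0], 0),
           ([0.8, 0.2, 0.1], [0.1, 0.8, 0.3], 1)]

initial = total_loss(theta0, abs_set, rel_set, n_in, n_lat)
theta, final = train(theta0, abs_set, rel_set, n_in, n_lat)
print(f"loss before: {initial:.4f}, after: {final:.4f}")
```

In this sketch the relatively labelled pairs never pass through `score_head`; they influence the score only through the shared encoder weights `W`. A practical system would use an automatic-differentiation framework and a far larger encoder, and could add further second heads, each with its own loss term in the sum.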