CPC G10L 25/60 (2013.01) [G06F 18/2113 (2023.01); G06F 18/214 (2023.01); G06N 20/00 (2019.01); G10L 25/27 (2013.01); G06F 17/18 (2013.01)]    20 Claims

1. A computer-implemented method for training a multitask learning model to assess perceived audio quality, the method comprising:
    inputting a first training audio clip to a plurality of audio quality models that are separate from the multitask learning model to compute a first plurality of pseudo labels, wherein the first training audio clip is derived from a separate training reference audio clip, wherein each pseudo label included in the first plurality of pseudo labels specifies a metric value that is relevant to audio quality as measured by an audio quality model of the plurality of audio quality models;
    performing one or more first scaling operations to scale each pseudo label included in the first plurality of pseudo labels based on a theoretical range of an output of a corresponding audio quality model included in the plurality of audio quality models to generate a first plurality of scaled pseudo labels;
    computing a first set of feature values for a set of audio features based on the first training audio clip;
    performing one or more second scaling operations to scale the first set of feature values using a set of scaling parameters computed based on the first set of feature values to generate a first set of scaled feature values;
    training the multitask learning model based on the first set of scaled feature values and the first plurality of scaled pseudo labels to generate a trained multitask learning model that estimates the first plurality of scaled pseudo labels based on the first set of scaled feature values; and
    transmitting the trained multitask learning model along with the set of scaling parameters to at least one compute instance, wherein the at least one compute instance, in operation, extracts a set of target features from a second audio clip, scales the set of target features based on the set of scaling parameters received with the trained multitask learning model to generate a second set of scaled feature values, and inputs the second set of scaled feature values to the trained multitask learning model, wherein the trained multitask learning model, in operation, maps, for the second audio clip, the second set of scaled feature values for the set of audio features based on the second audio clip to a plurality of predicted labels, and wherein the plurality of predicted labels specifies estimated metric values for a plurality of metrics relevant to audio quality that would be computed according to the plurality of audio quality models for the second audio clip.
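The pipeline recited in claim 1 can be sketched roughly as follows. Everything here is illustrative: the two toy "quality models" and their theoretical ranges, the feature extractor, and the linear multitask head are stand-ins chosen for a minimal runnable example, not the models or architecture of the claimed invention. What the sketch does track from the claim is the data flow: pseudo labels from separate quality models, scaling of each label by that model's theoretical output range, feature scaling with parameters computed from the training features themselves, a single multitask model fit against all scaled pseudo labels at once, and deployment of the model together with the stored scaling parameters.

```python
import numpy as np

# Hypothetical stand-ins for the separate audio quality models; each maps a
# clip to a metric value with a known theoretical output range (lo, hi).
QUALITY_MODELS = [
    (lambda clip: float(np.mean(np.abs(clip))), (0.0, 1.0)),  # toy "level" metric
    (lambda clip: float(np.std(clip)),          (0.0, 2.0)),  # toy "dynamics" metric
]

def compute_scaled_pseudo_labels(clip):
    """Run each quality model, then scale its output to [0, 1] by its theoretical range."""
    return np.array([(model(clip) - lo) / (hi - lo)
                     for model, (lo, hi) in QUALITY_MODELS])

def extract_features(clip):
    """Toy feature set: frame-energy statistics (stand-in for real audio features)."""
    energy = (clip.reshape(-1, 16) ** 2).mean(axis=1)
    return np.array([energy.mean(), energy.std(), energy.max(), energy.min()])

def fit_feature_scaler(features):
    """Scaling parameters computed from the training feature values themselves."""
    return features.mean(axis=0), features.std(axis=0) + 1e-8

def scale_features(features, params):
    mu, sigma = params
    return (features - mu) / sigma

# --- training phase: fit one multitask head to estimate all scaled pseudo labels ---
rng = np.random.default_rng(0)
clips = [rng.uniform(-1.0, 1.0, 256) for _ in range(64)]   # training audio clips
X = np.stack([extract_features(c) for c in clips])
Y = np.stack([compute_scaled_pseudo_labels(c) for c in clips])

scaler = fit_feature_scaler(X)
Xs = scale_features(X, scaler)
Xb = np.hstack([Xs, np.ones((len(Xs), 1))])                # bias column
W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)                 # least-squares multitask fit

# --- deployment phase: the model W ships together with the scaler parameters ---
def predict(clip):
    """Extract target features, scale them with the shipped parameters, run the model."""
    f = scale_features(extract_features(clip), scaler)
    return np.concatenate([f, [1.0]]) @ W                  # one estimate per metric

pred = predict(rng.uniform(-1.0, 1.0, 256))
```

Transmitting the scaling parameters alongside the trained model (rather than recomputing them at inference) is what keeps the deployed feature distribution consistent with the one the model was trained on; a real multitask network would replace the least-squares head, but the label-range scaling and feature scaling sit in the same places.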