CPC G10L 15/26 (2013.01) [G10L 15/02 (2013.01); G10L 15/04 (2013.01); G10L 25/30 (2013.01); G10L 25/78 (2013.01); G10L 2025/783 (2013.01)] | 26 Claims |
1. A computer-program product embodied in a non-transitory machine-readable storage medium storing computer instructions that, when executed by one or more processors, perform operations comprising:
constructing a transcript adaptation training data corpus comprising a plurality of transcript normalization training data samples, wherein each of the plurality of transcript normalization training data samples includes:
a training sample pairing between (i) a predicted audio transcript that includes at least one numerical expression and (ii) an adapted audio transcript that includes an alphabetic representation of the at least one numerical expression;
a transcript normalization identifier that, when applied to a model input comprising a target audio transcript, defines a text-to-text transformation objective causing a numeric-to-alphabetic expression machine learning model to predict an alphabetic-equivalent audio transcript that represents each numerical expression included in the target audio transcript in one or more alphabetic tokens;
configuring the numeric-to-alphabetic expression machine learning model based on a training of a machine learning text-to-text transformer model using the transcript adaptation training data corpus; and
executing the numeric-to-alphabetic expression machine learning model within a speech-to-text post-processing sequence of a speech-to-text service based on the numeric-to-alphabetic expression machine learning model satisfying a minimum audio transcript adaptation efficacy value;
obtaining audio data comprising one or more utterances;
generating, via a speech-to-text machine learning model, a probable audio transcript based on an input of the audio data, wherein the probable audio transcript includes a plurality of numerical expressions;
generating, via the numeric-to-alphabetic expression machine learning model, an adjusted audio transcript of the probable audio transcript based on an input of a task-specific instruction to the numeric-to-alphabetic expression machine learning model, wherein the task-specific instruction includes:
an instructional prefix component comprising the transcript normalization identifier, wherein the numeric-to-alphabetic expression machine learning model identifies a task type of the instructional prefix component, wherein the task type of the instructional prefix component corresponds to the transcript normalization identifier; and
an input text string comprising the probable audio transcript; and
obtaining, from a memory, a set of weights and biases generated from the training of the machine learning text-to-text transformer model that corresponds to the transcript normalization identifier, wherein the executing the numeric-to-alphabetic expression machine learning model includes using the set of weights and biases to generate the adjusted audio transcript.
|
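As an illustration only, the data flow recited in claim 1 (training sample pairings, a transcript normalization identifier used as an instructional prefix, and a gate on a minimum efficacy value) can be sketched as follows. The claim does not fix any of these details, so everything named here is an assumption: the prefix string `normalize transcript: `, the helper names, the rule-based speller used to produce adapted transcripts, the 0 to 999 range, and exact-match accuracy as the efficacy measure. The `stand_in_model` function is a placeholder for the trained text-to-text transformer, not an implementation of it.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical identifier; the claim only requires that some such prefix exists.
NORMALIZATION_PREFIX = "normalize transcript: "

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty",
         "sixty", "seventy", "eighty", "ninety"]


def spell_number(n: int) -> str:
    """Spell an integer in the range 0..999 as English words."""
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + ("" if ones == 0 else " " + _ONES[ones])
    hundreds, rest = divmod(n, 100)
    return _ONES[hundreds] + " hundred" + ("" if rest == 0 else " " + spell_number(rest))


def to_alphabetic(text: str) -> str:
    """Replace standalone 1-3 digit integers with words (toy rule, assumption)."""
    return re.sub(r"\b\d{1,3}\b", lambda m: spell_number(int(m.group())), text)


def build_instruction(transcript: str) -> str:
    """Task-specific instruction: instructional prefix + input text string."""
    return NORMALIZATION_PREFIX + transcript


@dataclass(frozen=True)
class TrainingSample:
    model_input: str  # prefix + predicted audio transcript
    target: str       # adapted audio transcript (numbers spelled out)


def build_corpus(predicted_transcripts: List[str]) -> List[TrainingSample]:
    """Keep only transcripts that contain at least one digit and pair them."""
    return [
        TrainingSample(build_instruction(t), to_alphabetic(t))
        for t in predicted_transcripts
        if re.search(r"\d", t)
    ]


def meets_efficacy(model: Callable[[str], str],
                   held_out: List[TrainingSample],
                   minimum: float) -> bool:
    """Exact-match accuracy gate (the metric itself is an assumption)."""
    if not held_out:
        return False
    correct = sum(model(s.model_input) == s.target for s in held_out)
    return correct / len(held_out) >= minimum


def stand_in_model(model_input: str) -> str:
    """Placeholder for the trained transformer: reads the prefix, then converts."""
    if not model_input.startswith(NORMALIZATION_PREFIX):
        return model_input  # unknown task type: return text unchanged
    return to_alphabetic(model_input[len(NORMALIZATION_PREFIX):])


corpus = build_corpus(["i have 3 cats", "no numbers here"])
if meets_efficacy(stand_in_model, corpus, minimum=0.9):
    print(stand_in_model(build_instruction("i have 3 cats")))  # i have three cats
```

In a real system the stand-in would be replaced by a fine-tuned text-to-text model loaded from stored weights and biases, and the gate would decide whether that model is placed into the speech-to-text post-processing sequence.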