| CPC G10L 21/12 (2013.01) [G10L 25/30 (2013.01); G10L 25/60 (2013.01)] | 20 Claims |

|
1. A voice conversion system, comprising memory having instructions stored thereon and one or more processors coupled to the memory and configured to execute the instructions to:
convert a candidate utterance and a reference utterance in obtained audio data into first and second time series sequence representations, respectively, using acoustic features and linguistic features;
perform a cross-correlation of the first and second time series sequence representations to generate a result representing a first degree of similarity between the first and second time series sequence representations;
align the first and second time series sequence representations to generate an aligned version of the first and second time series sequence representations;
after the alignment, adjust the aligned version of the first and second time series sequence representations based on a phoneme-based alignment;
generate an alignment difference of path-based distances between the reference and candidate speech utterances based on the adjusted aligned version of the first and second time series sequence representations;
generate a quality metric based on the result of the cross-correlation of the first and second time series sequence representations and the generated alignment difference of path-based distances between the reference and candidate speech utterances; and
output the generated quality metric, wherein the generated quality metric is indicative of a second degree of similarity between the candidate and reference utterances.
|
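
The flow recited in claim 1 (cross-correlate, align, measure a path-based distance, combine into one metric) can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes each utterance has already been reduced to a one-dimensional numeric sequence, uses dynamic time warping as a stand-in for the unspecified alignment step, and omits both feature extraction and the phoneme-based adjustment. The function names `normalized_xcorr_peak`, `dtw`, and `quality_metric`, and the formula that combines the two terms, are hypothetical choices made for illustration only.

```python
import math


def normalized_xcorr_peak(a, b):
    """Peak of the mean-removed, normalized cross-correlation over all lags."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    a0 = [x - mean_a for x in a]
    b0 = [y - mean_b for y in b]
    norm = math.sqrt(sum(x * x for x in a0) * sum(y * y for y in b0))
    if norm == 0.0:
        return 0.0  # a constant sequence carries no correlation information
    best = -1.0
    for lag in range(-(len(b0) - 1), len(a0)):
        total = 0.0
        for j, y in enumerate(b0):
            i = j + lag
            if 0 <= i < len(a0):
                total += a0[i] * y
        best = max(best, total / norm)
    return best


def dtw(a, b):
    """Dynamic time warping: returns (total path cost, list of index pairs)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the alignment path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


def quality_metric(candidate, reference):
    """Hypothetical combination: correlation peak damped by mean path distance."""
    similarity = normalized_xcorr_peak(candidate, reference)
    total_cost, path = dtw(candidate, reference)
    path_distance = total_cost / len(path)  # average distance per aligned pair
    return max(similarity, 0.0) / (1.0 + path_distance)


reference = [0.0, 1.0, 2.0, 1.0, 0.0]
candidate = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]  # time-stretched copy
score = quality_metric(candidate, reference)
```

Under these assumptions, identical sequences score close to 1, and the score falls as either the correlation peak drops or the average distance along the warping path grows. A time-stretched copy such as `candidate` above has zero warping cost, so its score is driven only by the cross-correlation term.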