US 12,406,685 B2
Methods and systems for cross-correlating and aligning parallel speech utterances to improve quality assurance
Lukas Pfeifenberger, Salzburg (AT); and Shawn Zhang, Palo Alto, CA (US)
Assigned to SANAS.AI INC., Palo Alto, CA (US)
Filed by Sanas.ai Inc., Palo Alto, CA (US)
Filed on Mar. 22, 2024, as Appl. No. 18/613,833.
Claims priority of provisional application 63/462,002, filed on Apr. 26, 2023.
Prior Publication US 2024/0363135 A1, Oct. 31, 2024
Int. Cl. G10L 21/12 (2013.01); G10L 25/30 (2013.01); G10L 25/60 (2013.01)
CPC G10L 21/12 (2013.01) [G10L 25/30 (2013.01); G10L 25/60 (2013.01)]
20 Claims
 
1. A voice conversion system, comprising memory having instructions stored thereon and one or more processors coupled to the memory and configured to execute the instructions to:
convert a candidate utterance and a reference utterance in obtained audio data into first and second time series sequence representations, respectively, using acoustic features and linguistic features;
perform a cross-correlation of the first and second time series sequence representations to generate a result representing a first degree of similarity between the first and second time series sequence representations;
align the first and second time series sequence representations to generate an aligned version of the first and second time series sequence representations;
after the alignment, adjust the aligned version of the first and second time series sequence representations based on a phoneme-based alignment;
generate an alignment difference of path-based distances between the reference and candidate speech utterances based on the adjusted aligned version of the first and second time series sequence representations;
generate a quality metric based on the result of the cross-correlation of the first and second time series sequence representations and the generated alignment difference of path-based distances between the reference and candidate speech utterances; and
output the generated quality metric, wherein the generated quality metric is indicative of a second degree of similarity between the candidate and reference utterances.
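
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and every name and formula in it is an assumption made for illustration. Each utterance is assumed to be already reduced to a one-dimensional list of feature values, standing in for the acoustic and linguistic representations. Alignment is done with plain dynamic time warping, which the claim does not name. The phoneme-based adjustment is omitted. The "alignment difference" is approximated as the mean deviation of the warping path from the diagonal, and the final combination rule is a hypothetical choice.

```python
import math

def xcorr_peak(a, b):
    """Peak of the mean-removed, normalized cross-correlation over all lags."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    a0 = [x - ma for x in a]
    b0 = [y - mb for y in b]
    norm = math.sqrt(sum(x * x for x in a0) * sum(y * y for y in b0))
    if norm == 0.0:
        return 0.0  # a constant sequence carries no correlation information
    best = -1.0
    for lag in range(-(len(b0) - 1), len(a0)):
        lo, hi = max(0, lag), min(len(a0), len(b0) + lag)
        s = sum(a0[i] * b0[i - lag] for i in range(lo, hi))
        best = max(best, s / norm)
    return best

def dtw(a, b):
    """Dynamic time warping: returns (total cost, warping path of index pairs)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end; ties resolve toward the diagonal step.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path

def path_deviation(path, n, m):
    """Hypothetical 'alignment difference': mean distance of the path from the diagonal."""
    return sum(abs(i / n - j / m) for i, j in path) / len(path)

def quality_metric(candidate, reference):
    """Hypothetical combination of correlation and alignment difference, clipped to [0, 1]."""
    similarity = xcorr_peak(candidate, reference)
    _, path = dtw(candidate, reference)
    deviation = path_deviation(path, len(candidate), len(reference))
    return max(0.0, min(1.0, similarity * (1.0 - deviation)))

if __name__ == "__main__":
    reference = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]
    candidate = [0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0]  # same shape, one frame late
    print(quality_metric(reference, reference))
    print(quality_metric(candidate, reference))
```

In this sketch a candidate identical to the reference scores at or very near 1, and the score falls as the correlation peak drops or the warping path bends away from the diagonal. A real system would operate on multidimensional feature frames and would apply the phoneme-based adjustment before measuring path distances.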