US 12,456,180 B2
System and method for an audio-visual avatar evaluation
Ilya Baimetov, Redmond, WA (US); Denis Parkhomenko, Moscow (RU); Marcel de Korte, Stockholm (SE); Ivan Kirillov, Moscow (RU); Dmitriy Obukhov, Istanbul (TR); Alexey Rybak, Istanbul (TR); Laurent Dedenis, Singapore (SG); Serg Bell, Singapore (SG); and Stanislav Protasov, Singapore (SG)
Assigned to Constructor Technology AG, Schaffhausen (CH)
Filed by Constructor Technology AG, Schaffhausen (CH); and Constructor Education and Research Genossenschaft, Schaffhausen (CH)
Filed on Nov. 28, 2022, as Appl. No. 18/059,395.
Prior Publication US 2024/0177283 A1, May 30, 2024
Int. Cl. G10L 15/22 (2006.01); G06T 7/00 (2017.01); G10L 15/02 (2006.01); G10L 25/57 (2013.01); G10L 25/60 (2013.01); G10L 25/84 (2013.01); G10L 25/90 (2013.01)
CPC G06T 7/0002 (2013.01) [G10L 15/02 (2013.01); G10L 15/22 (2013.01); G10L 25/57 (2013.01); G10L 25/60 (2013.01); G10L 25/84 (2013.01); G10L 25/90 (2013.01); G06T 2207/10016 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/30196 (2013.01)]
16 Claims
 
1. A method for automated evaluation of an avatar generated by an avatar generator comprising the steps of:
obtaining, by an audio evaluator, a speech generated by a text-to-speech module;
obtaining, by a video evaluator, a video clip generated by a video generator;
obtaining, by the audio evaluator, the audio features of the target person;
obtaining, by the video evaluator, the video features of the target person;
comparing the speech with the audio features of the target person using a set of audio metrics, and generating an audio evaluation score for the speech;
wherein generating the audio evaluation score comprises evaluating speech intelligibility using automatic-speech-recognition (ASR) based evaluation metrics, evaluating audio noise level using voice-activity-detection (VAD) based evaluation metrics, evaluating naturalness of speech intonation using pitch-based metrics, evaluating voice similarities using equal-error-rate (EER) and cosine (COS) metrics, and evaluating speech pronunciation statistics;
comparing the video clip with the video features of the target person using a set of video metrics, and generating a video evaluation score for the video clip;
combining the audio evaluation score and the video evaluation score; and
generating a combined naturalness score for the avatar generator based on the combined score of the audio evaluation score and the video evaluation score.
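
The voice-similarity and score-combination steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim names the equal-error-rate (EER) and cosine (COS) metrics but does not say how they are computed or how the audio and video scores are combined. The function names, the threshold sweep used to estimate the EER, and the weighted average with its `audio_weight` parameter are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two voice embeddings (hypothetical inputs)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def equal_error_rate(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Estimate the EER by sweeping a threshold over the observed scores.

    genuine:  similarity scores for same-speaker pairs
    impostor: similarity scores for different-speaker pairs
    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest.
    """
    best_gap, eer = math.inf, 1.0
    for threshold in sorted(set(genuine) | set(impostor)):
        false_reject = sum(s < threshold for s in genuine) / len(genuine)
        false_accept = sum(s >= threshold for s in impostor) / len(impostor)
        gap = abs(false_accept - false_reject)
        if gap < best_gap:
            best_gap, eer = gap, (false_accept + false_reject) / 2
    return eer


def combine_scores(audio_score: float, video_score: float,
                   audio_weight: float = 0.5) -> float:
    """Assumed combination rule: a weighted average of the two scores."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_score + (1.0 - audio_weight) * video_score
```

Under these assumptions, `combine_scores(audio_score, video_score)` with the default weight gives the plain mean of the two evaluation scores. A real evaluator would also need the ASR-, VAD- and pitch-based metrics and the pronunciation statistics recited in the claim to produce the audio score, which this sketch omits.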