| CPC G06T 7/0002 (2013.01) [G10L 15/02 (2013.01); G10L 15/22 (2013.01); G10L 25/57 (2013.01); G10L 25/60 (2013.01); G10L 25/84 (2013.01); G10L 25/90 (2013.01); G06T 2207/10016 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/30196 (2013.01)] | 16 Claims |

|
1. A method for automated evaluation of an avatar generated by an avatar generator comprising the steps of:
obtaining, by an audio evaluator, a speech generated by a text-to-speech module;
obtaining, by a video evaluator, a video clip generated by a video generator;
obtaining, by the audio evaluator, the audio features of the target person;
obtaining, by the video evaluator, the video features of the target person;
comparing the speech with the audio features of the target person using a set of audio metrics, and generating an audio evaluation score for the speech;
wherein generating the audio evaluation score comprises evaluating speech intelligibility using automatic-speech-recognition (ASR) based evaluation metrics, evaluating audio noise level using voice-activity-detection (VAD) based evaluation metrics, evaluating naturalness of speech intonation using pitch-based metrics, evaluating voice similarities using equal-error-rate (EER) and cosine (COS) metrics, and evaluating speech pronunciation statistics;
comparing the video clip with the video features of the target person using a set of video metrics, and generating a video evaluation score for the video clip;
combining the audio evaluation score and the video evaluation score; and
generating a combined naturalness score for the avatar generator based on the combined score of the audio evaluation score and the video evaluation score.
|
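As an illustration only, the voice-similarity and score-combination steps recited in claim 1 might be sketched as follows. The claim names equal-error-rate (EER) and cosine (COS) metrics but does not say how they are computed or how the audio and video scores are combined. The function names, the threshold sweep used for EER, and the weighted average with its `audio_weight` parameter are therefore assumptions made for this sketch, not the claimed implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def equal_error_rate(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Approximate EER by sweeping thresholds over the observed scores.

    genuine: similarity scores for same-speaker trials.
    impostor: similarity scores for different-speaker trials.
    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest.
    """
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")
    best_gap = float("inf")
    eer = 1.0
    for threshold in sorted(set(genuine) | set(impostor)):
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        far = sum(s >= threshold for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return eer


def combine_scores(audio_score: float, video_score: float,
                   audio_weight: float = 0.5) -> float:
    """Weighted average of the two scores (an assumed combination rule)."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_score + (1.0 - audio_weight) * video_score
```

In this sketch, the vectors passed to `cosine_similarity` would stand in for a speaker embedding of the generated speech and one of the target person; how such embeddings are extracted is outside the sketch. A weighted average is only one possible reading of "combining"; the claim would equally admit other rules.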