| CPC G10L 15/25 (2013.01) [G06V 40/176 (2022.01); G10L 15/02 (2013.01); G10L 15/063 (2013.01)] | 13 Claims |

|
1. A voice device comprising one or more processors configured to perform operations comprising:
receiving a voice signal;
extracting linguistic information corresponding to utterance content from the voice signal;
receiving a captured image of a person;
extracting appearance features expressing features related to the look of the person's face from the captured image;
determining a target timbre conforming to the appearance features of the person in the captured image; and
generating a converted voice based on the linguistic information and the appearance features, wherein the converted voice is in the target timbre that conforms to the appearance features of the person in the captured image.
|
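The operations recited in claim 1 can be read as a four-stage data flow: content extraction from the voice signal, appearance-feature extraction from the captured image, selection of a target timbre from those features, and generation of the converted voice. The following is a minimal sketch of that flow only, not the claimed implementation. Every function name, the `ConvertedVoice` type, and the toy feature computations (frame sign bits, row means, unit-sum normalization) are hypothetical stand-ins for whatever trained models an actual device would use.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ConvertedVoice:
    """Hypothetical container: content taken from the voice, timbre from the face."""
    linguistic_info: List[int]
    timbre: List[float]


def extract_linguistic_info(voice_signal: Sequence[float], frame: int = 4) -> List[int]:
    # Assumption: a toy "content" code, one bit per frame (1 if the frame sum is >= 0).
    frames = [voice_signal[i:i + frame] for i in range(0, len(voice_signal), frame)]
    return [1 if sum(f) >= 0 else 0 for f in frames]


def extract_appearance_features(image: Sequence[Sequence[float]]) -> List[float]:
    # Assumption: a toy "appearance" vector, the mean intensity of each image row.
    return [sum(row) / len(row) for row in image]


def determine_target_timbre(appearance: Sequence[float]) -> List[float]:
    # Assumption: the timbre embedding is the appearance vector scaled to sum to 1.
    total = sum(appearance) or 1.0
    return [a / total for a in appearance]


def convert_voice(voice_signal: Sequence[float],
                  image: Sequence[Sequence[float]]) -> ConvertedVoice:
    # Steps in the order recited in the claim.
    linguistic = extract_linguistic_info(voice_signal)   # utterance content
    appearance = extract_appearance_features(image)      # look of the face
    timbre = determine_target_timbre(appearance)         # timbre matching the face
    # "Generation" here just pairs content with timbre; a real system would synthesize audio.
    return ConvertedVoice(linguistic_info=linguistic, timbre=timbre)


if __name__ == "__main__":
    voice = [0.1, 0.2, -0.1, 0.3, -0.5, -0.2, 0.1, -0.1]
    face_a = [[1.0, 3.0], [2.0, 6.0]]
    face_b = [[5.0, 5.0], [1.0, 1.0]]
    out_a = convert_voice(voice, face_a)
    out_b = convert_voice(voice, face_b)
    print(out_a.linguistic_info == out_b.linguistic_info)  # content is unchanged by the image
    print(out_a.timbre != out_b.timbre)                    # timbre follows the image
```

The point of the sketch is the separation of concerns in the claim: the linguistic information depends only on the voice signal, while the target timbre depends only on the captured image, so the same utterance paired with a different face yields the same content in a different timbre.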