US 12,217,755 B2
Voice conversion apparatus, voice conversion learning apparatus, image generation apparatus, image generation learning apparatus, voice conversion method, voice conversion learning method, image generation method, image generation learning method, and computer program
Hirokazu Kameoka, Musashino (JP); Ko Tanaka, Musashino (JP); Yasunori Oishi, Musashino (JP); Takuhiro Kaneko, Musashino (JP); and Aaron Valero Puche, Musashino (JP)
Assigned to Nippon Telegraph and Telephone Corporation, Tokyo (JP)
Appl. No. 17/640,221
Filed by Nippon Telegraph and Telephone Corporation, Tokyo (JP)
PCT Filed Sep. 4, 2020, PCT No. PCT/JP2020/033607 § 371(c)(1), (2) Date Mar. 3, 2022, PCT Pub. No. WO2021/045194, PCT Pub. Date Mar. 11, 2021.
Claims priority of application No. 2019-163418 (JP), filed on Sep. 6, 2019.
Prior Publication US 2022/0335944 A1, Oct. 20, 2022
Int. Cl. G10L 15/25 (2013.01); G06V 40/16 (2022.01); G10L 15/02 (2006.01); G10L 15/06 (2013.01)

CPC G10L 15/25 (2013.01) [G06V 40/176 (2022.01); G10L 15/02 (2013.01); G10L 15/063 (2013.01)]

13 Claims

1. A voice device comprising one or more processors configured to perform operations comprising:

receiving a voice signal;

extracting linguistic information corresponding to utterance content from the voice signal;

receiving a captured image of a person;

extracting appearance features expressing features related to the look of the person's face from the captured image;

determining a target timbre conforming to the appearance features of the person in the captured image; and

generating a converted voice based on the linguistic information and the appearance features, wherein the converted voice is in the target timbre that conforms to the appearance features of the person in the captured image.
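
For illustration only, the order of operations recited in claim 1 can be sketched as a small data-flow example. This is a minimal sketch under stated assumptions, not the patented implementation: every function name, the `ConvertedVoice` type, the frame length, and the arithmetic inside each step are hypothetical placeholders standing in for the encoders and generator that the claim leaves unspecified.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ConvertedVoice:
    """Hypothetical container for the output of the sketch."""
    timbre: List[float]        # target timbre derived from the face image
    frames: List[List[float]]  # one generated acoustic frame per content frame


def extract_linguistic_info(voice: Sequence[float], frame_len: int = 4) -> List[float]:
    """Placeholder content encoder: mean absolute amplitude per frame.

    Assumption: a real system would use a learned encoder here.
    """
    info = []
    for start in range(0, len(voice), frame_len):
        frame = voice[start:start + frame_len]
        info.append(sum(abs(s) for s in frame) / len(frame))
    return info


def extract_appearance_features(image: Sequence[Sequence[float]]) -> List[float]:
    """Placeholder face encoder: mean and range of pixel intensities."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels), max(pixels) - min(pixels)]


def determine_target_timbre(appearance: Sequence[float]) -> List[float]:
    """Placeholder mapping from appearance features to a timbre vector.

    Assumption: a fixed scaling stands in for whatever mapping is learned.
    """
    return [2.0 * a for a in appearance]


def generate_converted_voice(linguistic: Sequence[float],
                             appearance: Sequence[float]) -> ConvertedVoice:
    """Combine content with the timbre implied by the appearance features."""
    timbre = determine_target_timbre(appearance)
    # Placeholder decoder: offset each content value by the timbre vector.
    frames = [[c + t for t in timbre] for c in linguistic]
    return ConvertedVoice(timbre=timbre, frames=frames)


def convert(voice: Sequence[float],
            image: Sequence[Sequence[float]]) -> ConvertedVoice:
    """Run the steps in the order the claim recites them."""
    linguistic = extract_linguistic_info(voice)      # from the voice signal
    appearance = extract_appearance_features(image)  # from the captured image
    return generate_converted_voice(linguistic, appearance)
```

In this sketch the utterance content comes only from the voice signal and the timbre comes only from the captured image, which mirrors the separation the claim draws between linguistic information and appearance features.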