US 11,869,486 B2
Voice conversion learning device, voice conversion device, method, and program
Hirokazu Kameoka, Tokyo (JP); and Takuhiro Kaneko, Tokyo (JP)
Assigned to NIPPON TELEGRAPH AND TELEPHONE CORPORATION, Tokyo (JP)
Appl. No. 17/268,053
Filed by NIPPON TELEGRAPH AND TELEPHONE CORPORATION, Tokyo (JP)
PCT Filed Aug. 13, 2019, PCT No. PCT/JP2019/031844 § 371(c)(1), (2) Date Feb. 11, 2021, PCT Pub. No. WO2020/036178, PCT Pub. Date Feb. 20, 2020.
Claims priority of application No. 2018-152394 (JP), filed on Aug. 13, 2018.
Prior Publication US 2022/0122591 A1, Apr. 21, 2022
Int. Cl. G10L 15/16 (2006.01); G10L 15/06 (2013.01); G10L 15/10 (2006.01); G10L 15/22 (2006.01)

CPC G10L 15/16 (2013.01) [G10L 15/063 (2013.01); G10L 15/10 (2013.01); G10L 15/22 (2013.01); G10L 2015/0631 (2013.01); G10L 2015/227 (2013.01)]

20 Claims

1. A voice conversion learning device comprising:

a learner configured to learn, on the basis of a sound feature value series for each of conversion-source voice signals with different attributions, and attribution codes indicating each attribution of the conversion-source voice signals, a converter configured to convert, for input of a sound feature value series and an attribution code, to a sound feature value series of a voice signal of an attribution indicated by the attribution code,

the learner learning the converter to minimize a value of a learning criterion represented using:

real voice similarity of a sound feature value series converted by the converter for input of any attribution code, the real voice similarity being associated with the any attribution code, the real voice similarity being identified by a voice identifier for identifying, for input of an attribution code, whether a voice is a real voice with an attribution indicated by the attribution code or a synthetic voice,

attribution code similarity of a sound feature value series converted by the converter for input of any attribution code, the attribution code similarity being similarity to the any attribution code identified by an attribution identifier,

an error between a sound feature value series reconverted from the sound feature value series converted by the converter for input of an attribution code different from the attribution code of the conversion-source voice signal, the reconversion being done by the converter for input of the attribution code of the conversion-source voice signal, and the sound feature value series of the conversion-source voice signal, and

a distance between the sound feature value series converted by the converter for input of the attribution code of the conversion-source voice signal and the sound feature value series of the conversion-source voice signal,

the learner learning the voice identifier to minimize a value of a learning criterion represented using:

real voice similarity of a sound feature value series converted by the converter for input of any attribution code, the real voice similarity being associated with the any attribution code, the real voice similarity being identified by the voice identifier for identifying, for input of an attribution code, whether a voice is a real voice with an attribution indicated by the attribution code or a synthetic voice, and

real voice similarity indicated by the attribution code of the sound feature value series of the conversion-source voice signal, the real voice similarity being identified by the voice identifier for input of the attribution code of the conversion-source voice signal, and

the learner learning the attribution identifier to minimize a value of a learning criterion represented using attribution code similarity of the sound feature value series of the conversion-source voice signal, the attribution code similarity being of the conversion-source voice signal identified by the attribution identifier.
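
For illustration only, the three learning criteria recited in claim 1 can be sketched as scalar loss functions. This sketch is not taken from the patent text. The negative-log-probability form of the similarity terms, the mean absolute error used for the reconversion error and the distance, the weights lam_cls, lam_cyc and lam_id, and all function and parameter names are assumptions. The converter, voice identifier and attribution identifier are passed in as plain callables, with the two identifiers assumed to return a probability strictly between 0 and 1.

```python
import math
from typing import Callable, List

Series = List[List[float]]  # sound feature value series: frames x feature dims
Convert = Callable[[Series, int], Series]   # (series, attribution code) -> series
Score = Callable[[Series, int], float]      # (series, attribution code) -> probability


def mean_abs_error(a: Series, b: Series) -> float:
    """Mean absolute difference between two equally shaped series (assumed metric)."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        for x, y in zip(frame_a, frame_b):
            total += abs(x - y)
            count += 1
    return total / count


def converter_criterion(x: Series, c_src: int, c_tgt: int, convert: Convert,
                        real_prob: Score, attr_prob: Score,
                        lam_cls: float = 1.0, lam_cyc: float = 1.0,
                        lam_id: float = 1.0) -> float:
    """Criterion the converter minimizes (four terms, hypothetical weighting)."""
    y = convert(x, c_tgt)                            # convert to the target attribution
    adv = -math.log(real_prob(y, c_tgt))             # real voice similarity of converted series
    cls = -math.log(attr_prob(y, c_tgt))             # attribution code similarity of converted series
    cyc = mean_abs_error(convert(y, c_src), x)       # error after reconversion to the source code
    idt = mean_abs_error(convert(x, c_src), x)       # distance when converting with the source code
    return adv + lam_cls * cls + lam_cyc * cyc + lam_id * idt


def voice_identifier_criterion(x: Series, c_src: int, c_tgt: int,
                               convert: Convert, real_prob: Score) -> float:
    """Criterion the voice identifier minimizes: score real as real, converted as synthetic."""
    y = convert(x, c_tgt)
    return -math.log(real_prob(x, c_src)) - math.log(1.0 - real_prob(y, c_tgt))


def attribution_identifier_criterion(x: Series, c_src: int, attr_prob: Score) -> float:
    """Criterion the attribution identifier minimizes on conversion-source series."""
    return -math.log(attr_prob(x, c_src))


# Toy stand-ins (hypothetical): the "converter" adds the code to every value,
# and both identifiers always answer 0.5.
def toy_convert(series: Series, code: int) -> Series:
    return [[v + code for v in frame] for frame in series]


def toy_prob(series: Series, code: int) -> float:
    return 0.5


x_src = [[0.0, 1.0], [2.0, 3.0]]
g_loss = converter_criterion(x_src, 0, 1, toy_convert, toy_prob, toy_prob)
d_loss = voice_identifier_criterion(x_src, 0, 1, toy_convert, toy_prob)
c_loss = attribution_identifier_criterion(x_src, 0, toy_prob)
```

In a practical system the callables would be neural networks and the three criteria would be minimized in alternation by gradient descent; the toy stand-ins above only serve to make the arithmetic easy to check by hand.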