US 12,249,320 B2
Utterance evaluation apparatus, utterance evaluation method, and program
Sadao Hiroya, Tokyo (JP)
Assigned to NIPPON TELEGRAPH AND TELEPHONE CORPORATION, Tokyo (JP)
Appl. No. 17/622,675
Filed by NIPPON TELEGRAPH AND TELEPHONE CORPORATION, Tokyo (JP)
PCT Filed Jun. 25, 2019, PCT No. PCT/JP2019/025048
§ 371(c)(1), (2) Date Dec. 23, 2021,
PCT Pub. No. WO2020/261357, PCT Pub. Date Dec. 30, 2020.
Prior Publication US 2022/0366895 A1, Nov. 17, 2022
Int. Cl. G10L 21/003 (2013.01); G09B 19/06 (2006.01); G10L 15/01 (2013.01); G10L 15/02 (2006.01); G10L 15/16 (2006.01); G10L 25/30 (2013.01); G10L 25/60 (2013.01); G10L 25/90 (2013.01)
CPC G10L 15/16 (2013.01) [G10L 15/01 (2013.01); G10L 15/02 (2013.01); G10L 2015/025 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A speech evaluation device comprising circuitry configured to execute a method comprising:
extracting an acoustic feature from an input voice signal of speech spoken by a speaker in a first group;
converting, using a neural network, the acoustic feature of the input voice signal to an acoustic feature obtained when a speaker in a second group speaks the same text as the text of the input voice signal; and
determining a score indicating a higher evaluation as a distance between the acoustic feature before the conversion and the acoustic feature after the conversion becomes shorter, wherein
the neural network obtains an estimated value of a converted speech rhythm using the input voice signal and updates a plurality of parameters of the neural network based on a comparison result between the estimated value of the converted speech rhythm and speech rhythm information in learning data,
storing a Gaussian mixture model representing an acoustic feature conversion rule vector learned from a first acoustic feature extracted from a first voice signal of speech spoken by a speaker in the first group and a second acoustic feature extracted from a second voice signal of speech spoken by a speaker in the second group; and
obtaining, by using a Gaussian mixture model of a dimension corresponding to the first acoustic feature as a first Gaussian mixture model, a weight of the first Gaussian mixture model such that the first Gaussian mixture model best fits the acoustic feature extracted from the input voice signal.
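Below is a minimal, non-authoritative sketch (in Python) of the kind of pipeline the claim describes: a neural network converts first-group acoustic features toward second-group features, the score rises as the distance between the pre- and post-conversion features shrinks, and mixture weights of a first Gaussian mixture model are fitted to the input features. The names (RhythmConverter, evaluate_utterance, gmm_weights), the feature dimension of 24, the Euclidean distance, the score mapping 1/(1 + d), the diagonal covariances, and the single EM-style weight update are illustrative assumptions, not details taken from the patent.

```python
import numpy as np
import torch
import torch.nn as nn

FEAT_DIM = 24  # assumed dimensionality of the per-frame acoustic feature

class RhythmConverter(nn.Module):
    """Maps first-group acoustic features to second-group features for the same text."""
    def __init__(self, dim: int = FEAT_DIM, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def evaluate_utterance(features: np.ndarray, converter: RhythmConverter) -> float:
    """Return a score that rises as the pre/post-conversion feature distance shrinks."""
    x = torch.as_tensor(features, dtype=torch.float32)
    with torch.no_grad():
        y = converter(x)
    distance = torch.linalg.norm(y - x, dim=-1).mean().item()  # mean per-frame Euclidean distance
    return 1.0 / (1.0 + distance)  # shorter distance -> higher evaluation

def gmm_weights(features: np.ndarray, means: np.ndarray, covs: np.ndarray) -> np.ndarray:
    """One EM-style update of the mixture weights of the first GMM so that it
    best fits the input features (diagonal covariances assumed)."""
    n_mix, dim = means.shape
    w = np.full(n_mix, 1.0 / n_mix)                          # start from uniform weights
    diff = features[:, None, :] - means[None, :, :]          # (frames, mixtures, dim)
    log_lik = (-0.5 * np.sum(diff**2 / covs[None, :, :], axis=-1)
               - 0.5 * (np.sum(np.log(covs), axis=-1) + dim * np.log(2 * np.pi)))
    resp = w[None, :] * np.exp(log_lik)                      # per-frame mixture responsibilities
    resp /= resp.sum(axis=1, keepdims=True)
    return resp.mean(axis=0)                                 # updated mixture weights

if __name__ == "__main__":
    frames = np.random.randn(100, FEAT_DIM)                  # stand-in for extracted features
    print("score:", evaluate_utterance(frames, RhythmConverter()))
    means, covs = np.random.randn(4, FEAT_DIM), np.ones((4, FEAT_DIM))
    print("mixture weights:", gmm_weights(frames, means, covs))
```

The mapping 1/(1 + d) is only one monotonically decreasing choice; any score that grows as the distance becomes shorter matches the wording of the claim.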