US 12,249,320 B2
Utterance evaluation apparatus, utterance evaluation method, and program
Sadao Hiroya, Tokyo (JP)
Assigned to NIPPON TELEGRAPH AND TELEPHONE CORPORATION, Tokyo (JP)
Appl. No. 17/622,675
Filed by NIPPON TELEGRAPH AND TELEPHONE CORPORATION, Tokyo (JP)
PCT Filed Jun. 25, 2019, PCT No. PCT/JP2019/025048
§ 371(c)(1), (2) Date Dec. 23, 2021,
PCT Pub. No. WO2020/261357, PCT Pub. Date Dec. 30, 2020.
Prior Publication US 2022/0366895 A1, Nov. 17, 2022
Int. Cl. G10L 21/003 (2013.01); G09B 19/06 (2006.01); G10L 15/01 (2013.01); G10L 15/02 (2006.01); G10L 15/16 (2006.01); G10L 25/30 (2013.01); G10L 25/60 (2013.01); G10L 25/90 (2013.01)
CPC G10L 15/16 (2013.01) [G10L 15/01 (2013.01); G10L 15/02 (2013.01); G10L 2015/025 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A speech evaluation device comprising circuitry configured to execute a method comprising:
extracting an acoustic feature from an input voice signal of speech spoken by a speaker in a first group;
converting, using a neural network, the acoustic feature of the input voice signal to an acoustic feature obtained when a speaker in a second group speaks the same text as the text of the input voice signal; and
determining a score indicating a higher evaluation as a distance between the acoustic feature before the conversion and the acoustic feature after the conversion becomes shorter, wherein
the neural network obtains an estimated value of a converted speech rhythm using the input voice signal and updates a plurality of parameters of the neural network based on a comparison result between the estimated value of the converted speech rhythm and speech rhythm information in learning data,
storing a Gaussian mixture model representing an acoustic feature conversion rule vector learned from a first acoustic feature extracted from a first voice signal of speech spoken by a speaker in the first group and a second acoustic feature extracted from a second voice signal of speech spoken by a speaker in the second group; and
obtaining, by using a Gaussian mixture model of a dimension corresponding to the first acoustic feature as a first Gaussian mixture model, a weight of the first Gaussian mixture model such that the first Gaussian mixture model best fits the acoustic feature extracted from the input voice signal.
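Below is a minimal, non-authoritative sketch (in Python) of the kind of pipeline the claim describes: a neural network converts first-group acoustic features toward second-group features, the score rises as the distance between the pre- and post-conversion features shrinks, and mixture weights of a first Gaussian mixture model are fitted to the input features. The names (RhythmConverter, evaluate_utterance, gmm_weights), the feature dimension of 24, the Euclidean distance, the score mapping 1/(1 + d), the diagonal covariances, and the single EM-style weight update are illustrative assumptions, not details taken from the patent.

```python
import numpy as np
import torch
import torch.nn as nn

FEAT_DIM = 24  # assumed dimensionality of the per-frame acoustic feature

class RhythmConverter(nn.Module):
    """Maps first-group acoustic features to second-group features for the same text."""
    def __init__(self, dim: int = FEAT_DIM, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def evaluate_utterance(features: np.ndarray, converter: RhythmConverter) -> float:
    """Return a score that rises as the pre/post-conversion feature distance shrinks."""
    x = torch.as_tensor(features, dtype=torch.float32)
    with torch.no_grad():
        y = converter(x)
    distance = torch.linalg.norm(y - x, dim=-1).mean().item()  # mean per-frame Euclidean distance
    return 1.0 / (1.0 + distance)  # shorter distance -> higher evaluation

def gmm_weights(features: np.ndarray, means: np.ndarray, covs: np.ndarray) -> np.ndarray:
    """One EM-style update of the mixture weights of the first GMM so that it
    best fits the input features (diagonal covariances assumed)."""
    n_mix, dim = means.shape
    w = np.full(n_mix, 1.0 / n_mix)                          # start from uniform weights
    diff = features[:, None, :] - means[None, :, :]          # (frames, mixtures, dim)
    log_lik = (-0.5 * np.sum(diff**2 / covs[None, :, :], axis=-1)
               - 0.5 * (np.sum(np.log(covs), axis=-1) + dim * np.log(2 * np.pi)))
    resp = w[None, :] * np.exp(log_lik)                      # per-frame mixture responsibilities
    resp /= resp.sum(axis=1, keepdims=True)
    return resp.mean(axis=0)                                 # updated mixture weights

if __name__ == "__main__":
    frames = np.random.randn(100, FEAT_DIM)                  # stand-in for extracted features
    print("score:", evaluate_utterance(frames, RhythmConverter()))
    means, covs = np.random.randn(4, FEAT_DIM), np.ones((4, FEAT_DIM))
    print("mixture weights:", gmm_weights(frames, means, covs))
```

The mapping 1/(1 + d) is only one monotonically decreasing choice; any score that grows as the distance becomes shorter matches the wording of the claim.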