CPC G10L 15/16 (2013.01) [G10L 15/01 (2013.01); G10L 15/02 (2013.01); G10L 2015/025 (2013.01)] | 20 Claims |
1. A speech evaluation device comprising circuitry configured to execute a method comprising:
extracting an acoustic feature from an input voice signal of speech spoken by a speaker in a first group;
converting, using a neural network, the acoustic feature of the input voice signal to an acoustic feature when a speaker in a second group speaks the same text as text of the input voice signal; and
determining a score indicating a higher evaluation as a distance between the acoustic feature before the conversion and the acoustic feature after the conversion becomes shorter, wherein
the neural network obtains an estimated value of a converted speech rhythm using the input voice signal and updates a plurality of parameters of the neural network based on a comparison result between the estimated value of the converted speech rhythm and speech rhythm information in learning data,
storing a Gaussian mixture model representing an acoustic feature conversion rule vector learned from a first acoustic feature extracted from a first voice signal of speech spoken by a speaker in the first group and a second acoustic feature extracted from a second voice signal of speech spoken by a speaker in the second group; and
obtaining, by using a Gaussian mixture model of a dimension corresponding to the first acoustic feature as a first Gaussian mixture model, a weight of the first Gaussian mixture model such that the first Gaussian mixture model applies best to the acoustic feature extracted from the input voice signal.
|
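The scoring step of claim 1 (a score that indicates a higher evaluation as the distance between the acoustic feature before the conversion and the acoustic feature after the conversion becomes shorter) can be illustrated with a minimal sketch. The claim does not specify the distance measure or the mapping from distance to score; the Euclidean distance, the `1 / (1 + d)` mapping, and the function name `evaluation_score` below are assumptions chosen only for illustration.

```python
import numpy as np


def evaluation_score(feature_before, feature_after):
    """Map the distance between two acoustic feature vectors to a score.

    Assumptions (not taken from the claim): Euclidean distance, and the
    monotonically decreasing mapping 1 / (1 + d), so the score lies in
    (0, 1] and grows as the distance shrinks.
    """
    before = np.asarray(feature_before, dtype=float)
    after = np.asarray(feature_after, dtype=float)
    if before.shape != after.shape:
        raise ValueError("feature vectors must have the same shape")

    # Distance between the feature before and after the conversion.
    distance = float(np.linalg.norm(before - after))

    # Shorter distance gives a higher evaluation.
    return 1.0 / (1.0 + distance)


if __name__ == "__main__":
    # Hypothetical feature values, used only to show the ordering of scores.
    close = evaluation_score([1.0, 2.0], [1.1, 2.1])
    far = evaluation_score([1.0, 2.0], [4.0, 6.0])
    print(close > far)
```

Any other monotonically decreasing mapping would satisfy the same condition; the one above is picked only because it keeps the score bounded.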