| CPC G10L 15/197 (2013.01) [G10L 15/02 (2013.01); G10L 15/04 (2013.01)] | 7 Claims |

|
1. A speech recognition system comprising:
a processor and a memory, the processor coupled to the memory,
the processor is configured to:
input an audio signal;
calculate an acoustic feature for each subframe of the audio signal;
calculate, by using a first model, a hidden state series for each frame consisting of multiple subframes on the basis of the acoustic feature;
specify, by using a second model, whether each block is a voice segment or a non-voice segment on the basis of the hidden state series, the block consisting of a plurality of frames;
calculate, by using a third model, a probability for an utterance content candidate on the basis of a sequence of the hidden state series provided for each block having a single voice segment, to specify an utterance content; and
train the third model to calculate the probability for the utterance content candidate on the basis of the hidden state series; wherein
the processor is configured to:
specify a first frame subsequent to the non-voice segment as a beginning of the voice segment,
specify a second frame prior to a succeeding non-voice segment as an end of the voice segment,
adjust block arrangement of the audio signal by concatenating one or more frames up to the end of the voice segment, in a first block containing the end of the voice segment, to a second block preceding the first block, and by concatenating one or more frames from the beginning of the voice segment, in a third block containing the beginning of the voice segment, to a fourth block subsequent to the third block,
search for recognition results indicating an utterance content for each block arrangement of the audio signal based on the probability for the utterance content candidate calculated by using the third model, and
output the recognition results to an external device.
|
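
The boundary detection and block rearrangement recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes that a voice/non-voice flag is already available per frame (the claim makes this determination per block with the second model), that blocks lie on a fixed grid of `block_size` frames, and that frames are identified by integer index. The function names `find_voice_segments` and `adjust_blocks` are hypothetical.

```python
from typing import List, Tuple


def find_voice_segments(is_voice: List[bool]) -> List[Tuple[int, int]]:
    """Return (begin, end) frame indices, inclusive, of each voice segment.

    begin: first voice frame following a non-voice frame (or frame 0).
    end:   last voice frame before the succeeding non-voice frame
           (or the final frame of the signal).
    """
    segments = []
    begin = None
    for i, voiced in enumerate(is_voice):
        if voiced and begin is None:
            begin = i
        elif not voiced and begin is not None:
            segments.append((begin, i - 1))
            begin = None
    if begin is not None:
        segments.append((begin, len(is_voice) - 1))
    return segments


def adjust_blocks(segment: Tuple[int, int], block_size: int) -> List[List[int]]:
    """Rearrange the frames of one voice segment into blocks.

    Assumption: the original blocks are the fixed grid
    [0, block_size), [block_size, 2 * block_size), and so on.
    """
    begin, end = segment
    # Cut the segment along the original block grid.
    pieces = []
    start = begin
    while start <= end:
        grid_end = (start // block_size + 1) * block_size - 1
        stop = min(grid_end, end)
        pieces.append(list(range(start, stop + 1)))
        start = stop + 1
    # Frames from the beginning of the segment that only partly fill
    # their block are concatenated to the subsequent block.
    if len(pieces) > 1 and begin % block_size != 0:
        head = pieces.pop(0)
        pieces[0] = head + pieces[0]
    # Frames up to the end of the segment that only partly fill
    # their block are concatenated to the preceding block.
    if len(pieces) > 1 and (end + 1) % block_size != 0:
        tail = pieces.pop()
        pieces[-1] = pieces[-1] + tail
    return pieces


# Hypothetical input: 16 frames, voice in frames 2..13, blocks of 4 frames.
flags = [False] * 2 + [True] * 12 + [False] * 2
segments = find_voice_segments(flags)
blocks = adjust_blocks(segments[0], block_size=4)
```

In this example the partial blocks at the head (frames 2 and 3) and at the tail (frames 12 and 13) are absorbed by their neighbours, so every resulting block lies inside a single voice segment before it is passed to the recognition stage.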