US 12,482,459 B2
Speech recognition system, acoustic processing method, and non-temporary computer-readable medium
Yui Sudo, Wako (JP); Kazuhiro Nakadai, Wako (JP); and Muhammad Shakeel, Wako (JP)
Assigned to HONDA MOTOR CO., LTD., Tokyo (JP)
Filed by HONDA MOTOR CO., LTD., Tokyo (JP)
Filed on Aug. 29, 2022, as Appl. No. 17/897,352.
Prior Publication US 2024/0071379 A1, Feb. 29, 2024
Int. Cl. G10L 15/02 (2006.01); G10L 15/04 (2013.01); G10L 15/16 (2006.01); G10L 15/197 (2013.01); G10L 25/78 (2013.01)
CPC G10L 15/197 (2013.01) [G10L 15/02 (2013.01); G10L 15/04 (2013.01)]
7 Claims
 
1. A speech recognition system comprising:
a processor and a memory, the processor coupled to the memory,
the processor is configured to:
input an audio signal;
calculate an acoustic feature for each subframe of the audio signal;
calculate, by using a first model, a hidden state series for each frame consisting of multiple subframes on the basis of the acoustic feature;
specify, by using a second model, whether each block belongs to a voice segment or a non-voice segment on the basis of the hidden state series, the block consisting of a plurality of frames;
calculate, by using a third model, a probability for an utterance content candidate on the basis of a sequence of the hidden state series provided for each block having a single voice segment to specify an utterance content; and
train the third model to calculate the probability for the utterance content candidate on the basis of the hidden state series; wherein
the processor is configured to:
specify a first frame subsequent to the non-voice segment as a beginning of the voice segment,
specify a second frame prior to a succeeding non-voice segment as an end of the voice segment,
adjust a block arrangement of the audio signal, by concatenating one or more frames up to the end of the voice segment in a first block with the end of the voice segment to a second block preceding the first block, and concatenating one or more frames from the beginning of the voice segment in a third block with the beginning of the voice segment to a fourth block subsequent to the third block,
search for recognition results indicating an utterance content for each block arrangement of the audio signal based on the probability for the utterance content candidate calculated by using the third model, and
output the recognition results to an external device.
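
The boundary rule recited above (a voice segment begins at the first frame after a non-voice segment and ends at the frame before the succeeding non-voice segment) can be illustrated with a minimal sketch. This is not the claimed implementation: it assumes per-frame voice/non-voice decisions are already available as a list of booleans, whereas the claim derives the decision per block from the hidden state series by using the second model. The function name `voice_segments` and the handling of a voice segment that is still open when the input ends are assumptions made for illustration only.

```python
from typing import List, Tuple


def voice_segments(is_voice: List[bool]) -> List[Tuple[int, int]]:
    """Return (begin, end) frame indices, both inclusive, of each voice segment.

    is_voice[i] is a hypothetical per-frame voice/non-voice decision.
    """
    segments: List[Tuple[int, int]] = []
    begin = None  # index of the first frame of the currently open voice segment
    for i, voiced in enumerate(is_voice):
        if voiced and begin is None:
            # First voiced frame after a non-voice run (or at the start of input):
            # treated as the beginning of a voice segment.
            begin = i
        elif not voiced and begin is not None:
            # A non-voice run starts here, so the previous frame is the end.
            segments.append((begin, i - 1))
            begin = None
    if begin is not None:
        # Assumption: a segment still open at the end of input is closed
        # at the last frame. The claim does not address this case.
        segments.append((begin, len(is_voice) - 1))
    return segments


labels = [False, True, True, False, False, True]
print(voice_segments(labels))
```

With the example labels, frames 1 to 2 form one voice segment and frame 5 forms another, the latter being closed at the end of input under the stated assumption.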