US 12,136,435 B2
Utterance section detection device, utterance section detection method, and program
Ryo Masumura, Tokyo (JP); Takanobu Oba, Tokyo (JP); and Kiyoaki Matsui, Tokyo (JP)
Assigned to NIPPON TELEGRAPH AND TELEPHONE CORPORATION, Tokyo (JP)
Appl. No. 17/628,045
Filed by NIPPON TELEGRAPH AND TELEPHONE CORPORATION, Tokyo (JP)
PCT Filed Jul. 24, 2019, PCT No. PCT/JP2019/029035
§ 371(c)(1), (2) Date Jan. 18, 2022,
PCT Pub. No. WO2021/014612, PCT Pub. Date Jan. 28, 2021.
Prior Publication US 2022/0270637 A1, Aug. 25, 2022
Int. Cl. G10L 25/78 (2013.01); G10L 25/93 (2013.01)
CPC G10L 25/78 (2013.01) [G10L 25/93 (2013.01); G10L 2025/783 (2013.01)]
8 Claims
1. An utterance section detection device comprising:
processing circuitry configured to:
obtain a sequence of acoustic feature amounts for each short time frame of an acoustic signal, perform speech/non-speech determination, which is a determination as to whether each short time frame of the acoustic signal is speech or non-speech, and generate a speech/non-speech label sequence for the acoustic signal;
obtain a sequence of acoustic feature amounts of a certain section determined as corresponding to speech frames as a result of the speech/non-speech determination, perform utterance end determination, which is a determination as to whether or not an end of the certain section is an end of utterance, and generate a probability of the end of the certain section being an end of utterance;
based on the probability of the end of the certain section being the end of utterance, determine, on a basis of a result of the utterance end determination, a threshold for a duration of a non-speech section immediately after the certain section, and generate the threshold;
obtain the speech/non-speech label sequence and the threshold for the duration of the non-speech section immediately after the certain section, detect an utterance section by comparing the duration of the non-speech section immediately after the certain section with the corresponding threshold, and generate an utterance section label sequence; and
determine the non-speech section immediately after the certain section as a non-speech section within an utterance section in a case where the duration of the non-speech section is less than the corresponding threshold, and determine the non-speech section immediately after the certain section as a non-speech section outside an utterance section in a case where the duration of the non-speech section is equal to or greater than the corresponding threshold.
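
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation disclosed in the patent. The function names, the 1/0 label encoding, the frame-count thresholds (`min_frames`, `max_frames`), and the linear mapping from probability to threshold are all hypothetical. The claim only requires that the threshold be determined from the utterance-end probability and does not fix the mapping. The speech/non-speech labels and the utterance-end probability are taken as given inputs here, standing in for whatever models produce them.

```python
from itertools import groupby
from typing import Callable, List, Sequence, Tuple


def runs(labels: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Split a label sequence into (label, start, end) runs, end exclusive."""
    out, pos = [], 0
    for value, group in groupby(labels):
        length = len(list(group))
        out.append((value, pos, pos + length))
        pos += length
    return out


def threshold_from_probability(p_end: float,
                               min_frames: float = 10,
                               max_frames: float = 100) -> float:
    """Hypothetical mapping (an assumption, not taken from the claim):
    a higher utterance-end probability gives a shorter pause threshold."""
    return max_frames - p_end * (max_frames - min_frames)


def detect_utterance_sections(
    labels: Sequence[int],
    end_probability: Callable[[int, int], float],
    min_frames: float = 10,
    max_frames: float = 100,
) -> List[int]:
    """Return per-frame utterance labels (1 = inside an utterance section).

    labels: per-frame speech (1) / non-speech (0) labels.
    end_probability: assumed stand-in for the utterance end determination;
        it receives the (start, end) frames of a speech section and returns
        the probability that the section ends an utterance.
    """
    utterance = list(labels)  # speech frames are always inside an utterance
    segments = runs(labels)
    for i, (value, start, end) in enumerate(segments):
        if value != 1 or i + 1 >= len(segments):
            continue  # only speech sections followed by a pause matter
        _, pause_start, pause_end = segments[i + 1]
        threshold = threshold_from_probability(
            end_probability(start, end), min_frames, max_frames)
        # A pause shorter than the threshold stays inside the utterance;
        # a pause equal to or longer than it lies outside.
        if pause_end - pause_start < threshold:
            for t in range(pause_start, pause_end):
                utterance[t] = 1
    return utterance


# Toy input: two speech sections, a 2-frame pause, then a 5-frame pause.
labels = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0]


def toy_end_probability(start: int, end: int) -> float:
    # Pretend the first section is clearly unfinished and the second is final.
    return 0.0 if start == 2 else 1.0


result = detect_utterance_sections(labels, toy_end_probability,
                                   min_frames=1, max_frames=4)
print(result)
```

In this toy run the first speech section receives a threshold of 4 frames, so its 2-frame pause is kept inside the utterance. The second receives a threshold of 1 frame, so its 5-frame pause falls outside. Non-speech frames before the first speech section are left outside any utterance because no speech section precedes them.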