US 12,340,807 B2
Speech recognition apparatus, control method, and non-transitory storage medium
Shuji Komeiji, Tokyo (JP); and Hitoshi Yamamoto, Tokyo (JP)
Assigned to NEC CORPORATION, Tokyo (JP)
Appl. No. 17/908,292
Filed by NEC Corporation, Tokyo (JP)
PCT Filed Mar. 9, 2020, PCT No. PCT/JP2020/009979
§ 371(c)(1), (2) Date Aug. 31, 2022,
PCT Pub. No. WO2021/181451, PCT Pub. Date Sep. 16, 2021.
Prior Publication US 2023/0109867 A1, Apr. 13, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/26 (2006.01); G06F 40/166 (2020.01)
CPC G10L 15/26 (2013.01) [G06F 40/166 (2020.01)]
11 Claims
 
1. A speech recognition apparatus comprising:
at least one memory configured to store instructions; and
at least one processor configured to execute the instructions to perform operations comprising:
converting a source audio signal including an utterance into a text string by generating a time-series data of a plurality of audio frames from the source audio signal and converting each of the plurality of audio frames into a text; and
generating a concatenated text representing a content of the utterance by concatenating texts adjacent to each other in the text string,
wherein parts of audio signals corresponding to texts adjacent to each other in the text string overlap each other on a time axis,
wherein generating the concatenated text comprises, at a time of concatenating a preceding text and a succeeding text adjacent to each other in the text string, eliminating, from a preceding text, a part including a trailing portion of the preceding text, and eliminating, from a succeeding text, a part including a leading portion of the succeeding text, and
wherein generating the concatenated text comprises:
regarding a time section in which a preceding text and a succeeding text overlap, detecting a point of time at which a character of the preceding text and a character of the succeeding text do not match;
when a first difference being a difference between the detected point of time and an end point of time of an audio signal corresponding to the preceding text is more than a second difference being a difference between the detected point of time and a start point of time of an audio signal corresponding to the succeeding text, using, as a character of the concatenated text corresponding to the detected point of time, a character of the preceding text; and
when the first difference is less than the second difference, using, as a character of the concatenated text corresponding to the detected point of time, a character of the succeeding text.
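
The mismatch rule recited in the last two limitations can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name, its parameters, and the use of seconds as the time unit are assumptions made for illustration. The claim does not say what happens when the two differences are equal, so the tie-break below is likewise an assumption.

```python
def pick_character(t, prev_char, next_char, prev_end, next_start):
    """Choose the character to keep at a mismatch detected at time t.

    t          : detected point of time inside the overlap (seconds, assumed)
    prev_char  : character of the preceding text at time t
    next_char  : character of the succeeding text at time t
    prev_end   : end time of the audio signal for the preceding text
    next_start : start time of the audio signal for the succeeding text
    """
    first_diff = abs(prev_end - t)     # distance from t to the end of the preceding audio
    second_diff = abs(t - next_start)  # distance from t to the start of the succeeding audio
    if first_diff > second_diff:
        return prev_char
    if first_diff < second_diff:
        return next_char
    # Equal differences are not covered by the claim; keeping the preceding
    # character here is an arbitrary choice made for this sketch.
    return prev_char


# Overlap runs from next_start = 2.0 to prev_end = 3.0.
early = pick_character(2.2, "a", "b", prev_end=3.0, next_start=2.0)  # preceding text wins
late = pick_character(2.9, "a", "b", prev_end=3.0, next_start=2.0)   # succeeding text wins
```

In effect, a mismatch in the first half of the overlap is resolved in favor of the preceding text and one in the second half in favor of the succeeding text, so each character is taken from the audio segment in which it lies farther from that segment's boundary.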
 
6. A control method executed by a computer, comprising:
converting a source audio signal including an utterance into a text string by generating a time-series data of a plurality of audio frames from the source audio signal and converting each of the plurality of audio frames into a text; and
generating a concatenated text representing a content of the utterance by concatenating texts adjacent to each other in the text string, wherein
parts of audio signals corresponding to texts adjacent to each other in the text string overlap each other on a time axis, and
generating the concatenated text comprises, at a time of concatenating a preceding text and a succeeding text adjacent to each other in the text string, eliminating, from a preceding text, a part including a trailing portion of the preceding text, and eliminating, from a succeeding text, a part including a leading portion of the succeeding text,
wherein the generating the concatenated text comprises:
regarding a time section in which a preceding text and a succeeding text overlap, detecting a point of time at which a character of the preceding text and a character of the succeeding text do not match;
when a first difference being a difference between the detected point of time and an end point of time of an audio signal corresponding to the preceding text is more than a second difference being a difference between the detected point of time and a start point of time of an audio signal corresponding to the succeeding text, using, as a character of the concatenated text corresponding to the detected point of time, a character of the preceding text; and
when the first difference is less than the second difference, using, as a character of the concatenated text corresponding to the detected point of time, a character of the succeeding text.
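
The first step of the method, generating time-series audio frames whose audio overlaps on the time axis, could be sketched as follows. This is only an illustration under stated assumptions: the claim does not fix a frame length or hop size, and the function name, the sample-index representation, and the parameter values are hypothetical.

```python
def split_into_overlapping_frames(num_samples, frame_len, hop):
    """Return (start, end) sample ranges of frames that overlap their neighbors.

    A hop smaller than the frame length makes adjacent frames share
    frame_len - hop samples, which gives the overlap that the claim relies on.
    """
    if not 0 < hop < frame_len:
        raise ValueError("hop must be positive and smaller than frame_len")
    frames = []
    start = 0
    while start < num_samples:
        end = min(start + frame_len, num_samples)
        frames.append((start, end))
        if end == num_samples:
            break  # the last frame reached the end of the signal
        start += hop
    return frames


# Hypothetical values: a 10-sample signal, 4-sample frames, hop of 2 samples.
frames = split_into_overlapping_frames(10, frame_len=4, hop=2)
# frames == [(0, 4), (2, 6), (4, 8), (6, 10)]
```

Each frame would then be passed to a recognizer to produce one text of the text string; that conversion is outside the scope of this sketch.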
 
11. A non-transitory storage medium storing a program causing a computer to execute a control method, the control method comprising:
converting a source audio signal including an utterance into a text string by generating a time-series data of a plurality of audio frames from the source audio signal and converting each of the plurality of audio frames into a text; and
generating a concatenated text representing a content of the utterance by concatenating texts adjacent to each other in the text string, wherein
parts of audio signals corresponding to texts adjacent to each other in the text string overlap each other on a time axis, and
generating the concatenated text comprises, at a time of concatenating a preceding text and a succeeding text adjacent to each other in the text string, eliminating, from a preceding text, a part including a trailing portion of the preceding text, and eliminating, from a succeeding text, a part including a leading portion of the succeeding text,
wherein the generating the concatenated text comprises:
regarding a time section in which a preceding text and a succeeding text overlap, detecting a point of time at which a character of the preceding text and a character of the succeeding text do not match;
when a first difference being a difference between the detected point of time and an end point of time of an audio signal corresponding to the preceding text is more than a second difference being a difference between the detected point of time and a start point of time of an audio signal corresponding to the succeeding text, using, as a character of the concatenated text corresponding to the detected point of time, a character of the preceding text; and
when the first difference is less than the second difference, using, as a character of the concatenated text corresponding to the detected point of time, a character of the succeeding text.
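
The elimination step, which drops a trailing part of the preceding text and a leading part of the succeeding text before joining them, could be sketched as below. The sketch assumes each character carries a timestamp and cuts both texts at the midpoint of the overlap. The claims state neither of these, although a midpoint cut agrees with the first-difference and second-difference comparison. All names are hypothetical.

```python
def concatenate(prev_chars, next_chars, prev_end, next_start):
    """Join two time-stamped texts whose audio overlaps.

    prev_chars, next_chars : lists of (time, character) pairs (assumed format)
    prev_end               : end time of the audio for the preceding text
    next_start             : start time of the audio for the succeeding text
    """
    cut = (next_start + prev_end) / 2  # midpoint of the overlapping section
    # Eliminate the trailing portion of the preceding text (at or after the cut).
    kept_prev = [ch for t, ch in prev_chars if t < cut]
    # Eliminate the leading portion of the succeeding text (before the cut).
    kept_next = [ch for t, ch in next_chars if t >= cut]
    return "".join(kept_prev + kept_next)


# Hypothetical recognizer output: the preceding text ends with a wrong "x",
# and the succeeding text starts with a wrong "q".
prev_chars = [(0.0, "h"), (0.5, "e"), (1.0, "l"), (1.5, "l"), (2.0, "x")]
next_chars = [(1.0, "q"), (1.5, "l"), (2.0, "o"), (2.5, "!")]
joined = concatenate(prev_chars, next_chars, prev_end=2.5, next_start=1.0)
# joined == "hello!"
```

The characters most likely to be misrecognized sit at the edges of each audio segment, where the recognizer has the least context, so discarding them from both sides removes the "x" and "q" errors in this example.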