| CPC G10L 19/022 (2013.01) [G10L 15/22 (2013.01); G10L 25/87 (2013.01)] | 8 Claims |

1. A voice conversation reconstruction method performed by a voice conversation reconstruction apparatus, the method comprising:
acquiring a plurality of speaker-specific voice recognition data corresponding to a plurality of speakers in a voice conversation;
dividing each of the plurality of speaker-specific voice recognition data into a plurality of blocks using boundaries between tokens such that each of the divided plurality of blocks includes voice data by only a single speaker, wherein the divided plurality of blocks are not in chronological order;
arranging the plurality of blocks of all the speaker-specific voice recognition data in chronological order without distinction of speaker;
merging, among the arranged plurality of blocks, neighboring blocks whose speakers are the same such that the speaker-specific voice recognition data in each of the merged blocks are in chronological order and include voice data by only the same speaker; and
reconstructing the plurality of blocks subjected to the merging in a conversation format, in chronological order and speaker by speaker, such that the speaker-specific voice recognition data in each of the reconstructed blocks are in chronological order and include voice data by only the same speaker,
wherein the steps are performed in order,
wherein acquiring the plurality of speaker-specific voice recognition data includes:
acquiring a first speaker-specific recognition result generated upon an End Point Detection (EPD), and a second speaker-specific recognition result generated at every preset time, and
collecting the first speaker-specific recognition result and the second speaker-specific recognition result without overlap or redundancy therebetween to generate the speaker-specific voice recognition data, and
wherein the second speaker-specific recognition result is generated after occurrence of a last EPD at which the first speaker-specific recognition result is generated.
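The arranging, merging, and reconstructing steps of claim 1 can be sketched as the following minimal Python example. This is an illustrative assumption, not the patented implementation: the `Block` structure, the use of a start timestamp as the chronological key, and the string-based conversation format are all hypothetical choices made for clarity.

```python
from dataclasses import dataclass

@dataclass
class Block:
    speaker: str   # single speaker per block, per the claim
    start: float   # timestamp used for chronological ordering (assumed)
    text: str      # recognized tokens of this block

def merge_and_reconstruct(per_speaker):
    """per_speaker maps each speaker to that speaker's blocks,
    which need not be in chronological order."""
    # Arrange all blocks chronologically, without distinction of speaker.
    blocks = sorted(
        (b for bs in per_speaker.values() for b in bs),
        key=lambda b: b.start,
    )
    # Merge neighboring blocks whose speakers are the same.
    merged = []
    for b in blocks:
        if merged and merged[-1].speaker == b.speaker:
            prev = merged[-1]
            merged[-1] = Block(prev.speaker, prev.start, prev.text + " " + b.text)
        else:
            merged.append(Block(b.speaker, b.start, b.text))
    # Reconstruct in a conversation format, speaker by speaker.
    return [f"{b.speaker}: {b.text}" for b in merged]

per_speaker = {
    "A": [Block("A", 0.0, "hello"), Block("A", 1.0, "everyone")],
    "B": [Block("B", 2.0, "hi")],
}
print(merge_and_reconstruct(per_speaker))
# → ['A: hello everyone', 'B: hi']
```

Sorting by a single timestamp key is one possible reading of "arranging ... in chronological order"; the claim itself does not fix the ordering key, and the deduplicating collection of the EPD-based and interval-based recognition results (the "without overlap or redundancy" step) is omitted here.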