CPC G10L 15/02 (2013.01) [G06F 16/68 (2019.01); G10L 15/04 (2013.01); G10L 15/08 (2013.01)]
14 Claims

1. A speech recognition method comprising:
receiving a speech signal generated by an utterance of a user;
identifying a named entity from the received speech signal;
determining a speech signal portion, which corresponds to the identified named entity, from the received speech signal;
generating a first acoustic embedding vector corresponding to the speech signal portion, based on an acoustic embedding model;
determining an acoustic embedding vector closest to the first acoustic embedding vector from among a plurality of acoustic embedding vectors corresponding to a plurality of named entities included in an acoustic embedding database (DB), as a second acoustic embedding vector, based on the acoustic embedding model;
determining a corrected named entity corresponding to the second acoustic embedding vector, from among the plurality of named entities included in the acoustic embedding DB;
displaying the corrected named entity and a result of speech recognition with respect to the speech signal, based on the corrected named entity;
determining at least one candidate embedding vector including an acoustic embedding vector second closest to the first acoustic embedding vector and an acoustic embedding vector third closest to the first acoustic embedding vector from among the plurality of acoustic embedding vectors, based on the acoustic embedding model;
displaying at least one candidate named entity corresponding to the at least one candidate embedding vector from among the plurality of named entities included in the acoustic embedding DB; and
based on receiving a user input for selecting one of the at least one candidate named entity, displaying a result of the speech recognition corresponding to the selected candidate named entity and storing the first acoustic embedding vector in the acoustic embedding DB to correspond to the selected candidate named entity,
wherein the acoustic embedding model is trained using training data, the training data comprising texts representing one or more named entities, respective phoneme labels corresponding to the texts, and one or more speech signals corresponding to utterances of the one or more named entities; and wherein the training comprises:
performing a first training of the acoustic embedding model using the training data including the texts, the respective phoneme labels, and the one or more speech signals; and
performing a second training of the acoustic embedding model using a subset of the training data, the subset including the texts and the one or more speech signals.
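
As an illustration only, and not part of the claim language, the following minimal Python sketch shows one way the claimed nearest-neighbor correction, candidate ranking, and user-selection update could be realized. All identifiers here (`AcousticEmbeddingDB`, `correct_named_entity`, `on_candidate_selected`) are hypothetical assumptions, and cosine distance is an assumed similarity measure; the claim does not specify a distance metric.

```python
# Hypothetical sketch of the claimed correction flow; names and the
# cosine-distance metric are assumptions, not claim language.
import numpy as np

class AcousticEmbeddingDB:
    """Maps named-entity strings to stored acoustic embedding vectors."""
    def __init__(self):
        self.entities: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, entity: str, vector: np.ndarray) -> None:
        # Store a unit-normalized embedding under the given named entity.
        self.entities.append(entity)
        self.vectors.append(vector / np.linalg.norm(vector))

    def rank(self, query: np.ndarray, k: int = 3) -> list[str]:
        """Return the k named entities whose stored vectors are closest
        to the query, ordered from closest to farthest."""
        q = query / np.linalg.norm(query)
        dists = [1.0 - float(q @ v) for v in self.vectors]
        order = np.argsort(dists)[:k]
        return [self.entities[i] for i in order]

def correct_named_entity(db: AcousticEmbeddingDB, first_vector: np.ndarray):
    # Closest vector -> corrected named entity; second- and third-closest
    # vectors -> candidate named entities shown to the user.
    ranked = db.rank(first_vector, k=3)
    corrected, candidates = ranked[0], ranked[1:]
    return corrected, candidates

def on_candidate_selected(db: AcousticEmbeddingDB, first_vector: np.ndarray,
                          selected_entity: str) -> None:
    # After the user selects a candidate, store the query embedding under
    # that entity so future utterances of it match directly.
    db.add(selected_entity, first_vector)
```

Storing the first acoustic embedding vector under the user-selected entity makes the DB self-correcting: the next utterance of the same name should retrieve the selected entity as the closest match rather than as a lower-ranked candidate.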
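The two-stage training in the wherein clause could be sketched as below. This is a speculative reading: the model architecture, the embedding objective, and the use of the phoneme labels as an auxiliary classification loss in the first stage are all assumptions for illustration; the claim states only which training data each stage consumes.

```python
# Hedged sketch of the two-stage training regime; architecture and losses
# are assumptions, not claim language. text_encoder is a hypothetical
# module mapping entity texts into the same embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticEmbeddingModel(nn.Module):
    def __init__(self, feat_dim=80, embed_dim=256, num_phonemes=50):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, embed_dim, batch_first=True)
        # Auxiliary head driven by the phoneme labels; used in stage 1 only.
        self.phoneme_head = nn.Linear(embed_dim, num_phonemes)

    def forward(self, features):           # features: (B, T, feat_dim)
        _, h = self.encoder(features)      # h: (1, B, embed_dim)
        return h.squeeze(0)                # acoustic embedding: (B, embed_dim)

def train_stage(model, batches, text_encoder, optimizer, use_phoneme_labels):
    for feats, text_inputs, phoneme_labels in batches:
        emb = model(feats)
        # Pull the acoustic embedding toward the text embedding of the
        # named entity (texts + speech signals, used in both stages).
        loss = F.mse_loss(emb, text_encoder(text_inputs))
        if use_phoneme_labels:
            # Stage 1 additionally supervises with the phoneme labels
            # (simplified here to one label per utterance).
            loss = loss + F.cross_entropy(model.phoneme_head(emb),
                                          phoneme_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# First training: texts, phoneme labels, and speech signals.
#   train_stage(model, batches, text_encoder, opt, use_phoneme_labels=True)
# Second training: the subset with texts and speech signals only
# (pass phoneme_labels=None in each batch).
#   train_stage(model, batches, text_encoder, opt, use_phoneme_labels=False)
```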