US 12,444,402 B2
Speech recognition device and operating method thereof
Jakub Hoscilowicz, Warsaw (PL); and Kornel Jankowski, Warsaw (PL)
Assigned to SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed by SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed on Jun. 23, 2022, as Appl. No. 17/847,469.
Application 17/847,469 is a continuation of application No. PCT/KR2022/008311, filed on Jun. 13, 2022.
Claims priority of application No. 10-2021-0126707 (KR), filed on Sep. 24, 2021.
Prior Publication US 2023/0115538 A1, Apr. 13, 2023
Int. Cl. G10L 15/02 (2006.01); G06F 16/68 (2019.01); G06F 40/295 (2020.01); G10L 15/04 (2013.01); G10L 15/08 (2006.01)
CPC G10L 15/02 (2013.01) [G06F 16/68 (2019.01); G10L 15/04 (2013.01); G10L 15/08 (2013.01)] 14 Claims
OG exemplary drawing
 
1. A speech recognition method comprising:
receiving a speech signal generated by an utterance of a user;
identifying a named entity from the received speech signal;
determining a speech signal portion, which corresponds to the identified named entity, from the received speech signal;
generating a first acoustic embedding vector corresponding to the speech signal portion, based on an acoustic embedding model;
determining an acoustic embedding vector closest to the first acoustic embedding vector from among a plurality of acoustic embedding vectors corresponding to a plurality of named entities included in an acoustic embedding database (DB), as a second acoustic embedding vector, based on the acoustic embedding model;
determining a corrected named entity corresponding to the second acoustic embedding vector, from among the plurality of named entities included in the acoustic embedding DB;
displaying the corrected named entity and a result of speech recognition with respect to the speech signal, based on the corrected named entity;
determining at least one candidate embedding vector including an acoustic embedding vector second closest to the first acoustic embedding vector and an acoustic embedding vector third closest to the first acoustic embedding vector from among the plurality of acoustic embedding vectors, based on the acoustic embedding model;
displaying at least one candidate named entity corresponding to the at least one candidate embedding vector from among the plurality of named entities included in the acoustic embedding DB; and
based on receiving a user input for selecting one of the at least one candidate named entity, displaying a result of the speech recognition corresponding to the selected candidate named entity and storing the first acoustic embedding vector in the acoustic embedding DB to correspond to the selected candidate named entity,
wherein the acoustic embedding model is trained using training data, the training data comprising texts representing one or more named entities, respective phoneme labels corresponding to the texts, and one or more speech signals corresponding to utterances of the one or more named entities; and wherein the training comprises:
performing a first training of the acoustic embedding model using the training data including the texts, the respective phoneme labels, and the one or more speech signals; and
performing a second training of the acoustic embedding model using a subset of the training data, the subset including the texts and the one or more speech signals.
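The correction loop recited in the claim can be pictured as a nearest-neighbour search over the acoustic embedding DB: the closest stored vector gives the corrected named entity, the second- and third-closest give candidate alternatives, and a user's selection writes the new embedding back into the DB under the chosen entity. The sketch below is purely illustrative; the function names, the Euclidean distance metric, and the in-memory list-based "DB" are assumptions for demonstration, not the patented implementation.

```python
import math


def correct_named_entity(query_vec, db_vectors, db_names, num_candidates=2):
    """Return (corrected entity, candidate entities) for a query embedding.

    db_vectors: list of embedding vectors (one per stored named entity).
    db_names: list of named-entity strings, parallel to db_vectors.
    The closest vector yields the corrected entity; the next
    `num_candidates` vectors yield the candidate entities shown to the user.
    """
    def dist(a, b):
        # Euclidean distance; the actual model's similarity measure may differ.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    order = sorted(range(len(db_vectors)),
                   key=lambda i: dist(query_vec, db_vectors[i]))
    corrected = db_names[order[0]]
    candidates = [db_names[i] for i in order[1:1 + num_candidates]]
    return corrected, candidates


def store_selection(query_vec, chosen_name, db_vectors, db_names):
    """On user selection of a candidate, store the first acoustic embedding
    vector in the DB so it corresponds to the selected named entity."""
    db_vectors.append(list(query_vec))
    db_names.append(chosen_name)
    return db_vectors, db_names
```

For example, with a toy DB of four entities, a query embedding near "Alice" returns "Alice" as the corrected entity and the next-closest names as candidates; selecting a candidate grows the DB by one entry keyed to that entity.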