US 12,469,487 B2
Method and apparatus for multilingual speech recognition based on artificial intelligence models
Geun Bae Lee, Pohang-si (KR); and Won Jun Lee, Pohang-si (KR)
Assigned to POSTECH RESEARCH AND BUSINESS DEVELOPMENT FOUNDATION, Pohang-si (KR)
Filed by POSTECH Research and Business Development Foundation, Pohang-si (KR)
Filed on Sep. 27, 2022, as Appl. No. 17/954,185.
Claims priority of application No. 10-2021-0161358 (KR), filed on Nov. 22, 2021; and application No. 10-2022-0076334 (KR), filed on Jun. 22, 2022.
Prior Publication US 2023/0162727 A1, May 25, 2023
Int. Cl. G10L 15/16 (2006.01); G06N 3/08 (2023.01); G10L 15/00 (2013.01); G10L 15/02 (2006.01); G10L 15/22 (2006.01)

CPC G10L 15/16 (2013.01) [G10L 15/005 (2013.01); G10L 15/02 (2013.01); G10L 15/22 (2013.01)]

20 Claims

1. A method for automatic multilingual speech recognition based on artificial intelligence of a single model and performed by a speech recognition apparatus comprising at least one processor and a memory storing instructions executed by the processor, the method comprising:

recognizing, by a speech recognizer including a convolutional neural network (CNN)-based feature extractor and a transformer encoder, input audio data and converting the input audio data into feature representations;

classifying, by a speech language classifier connected to the speech recognizer, a language corresponding to the input audio data based on the feature representations, and generating language classification information including a confidence level;

selecting and activating, by an output layer selector operatively coupled to the speech recognizer, one projection output layer from among a plurality of projection output layers based on the language classification information, wherein each projection output layer is configured to correspond to a specific language and includes a set of language-specific character units represented in a byte format; and

generating, by the activated projection output layer, output values in units of bytes corresponding to the classified language to produce a speech recognition result for the input audio data,

wherein the plurality of projection output layers includes a general-purpose projection output layer configured to be activated when the confidence level is below a predetermined threshold.
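
The selection step recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The names `LanguageClassification`, `classify_language`, `select_output_layer`, `decode_bytes`, and `GENERAL`, the softmax-based confidence, and the threshold values are all assumptions introduced for illustration. The CNN feature extractor, the transformer encoder, and the projection layers themselves are omitted, and per-language scores stand in for the classifier's output.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence

# Hypothetical key for the general-purpose projection output layer.
GENERAL = "general"


@dataclass
class LanguageClassification:
    """Language classification information: a language label plus a confidence level."""
    language: str
    confidence: float


def classify_language(scores: Dict[str, float]) -> LanguageClassification:
    """Turn raw per-language scores into a top language and a confidence.

    Assumption: confidence is the softmax probability of the top-scoring
    language. The claim does not say how the confidence level is computed.
    """
    top = max(scores.values())
    exps = {lang: math.exp(s - top) for lang, s in scores.items()}  # stable softmax
    total = sum(exps.values())
    best = max(exps, key=exps.get)
    return LanguageClassification(language=best, confidence=exps[best] / total)


def select_output_layer(info: LanguageClassification, threshold: float = 0.5) -> str:
    """Pick which projection output layer to activate.

    The general-purpose layer is chosen when the confidence level is below
    the threshold; otherwise the layer for the classified language is chosen.
    """
    if info.confidence < threshold:
        return GENERAL
    return info.language


def decode_bytes(byte_ids: Sequence[int]) -> str:
    """Convert byte-unit output values (0..255) into text.

    Assumption: the byte format is UTF-8. Invalid sequences are replaced
    rather than raising, since a recognizer may emit partial characters.
    """
    return bytes(byte_ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    confident = classify_language({"ko": 5.0, "en": 0.0})
    print(select_output_layer(confident))                  # language-specific layer
    uncertain = classify_language({"ko": 0.1, "en": 0.0})
    print(select_output_layer(uncertain, threshold=0.6))   # general-purpose layer
    print(decode_bytes("안녕".encode("utf-8")))
```

Keeping the selector as a pure function of the classification information separates the fallback rule from the model, so the threshold can be tuned without touching the recognizer.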