| CPC G10L 15/16 (2013.01) [G10L 15/005 (2013.01); G10L 15/02 (2013.01); G10L 15/22 (2013.01)] | 20 Claims |

|
1. A method for automatic multilingual speech recognition based on artificial intelligence of a single model and performed by a speech recognition apparatus comprising at least one processor and a memory storing instructions executed by the processor, the method comprising:
recognizing, by a speech recognizer including a convolutional neural network (CNN)-based feature extractor and a transformer encoder, input audio data and converting the input audio data into feature representations;
classifying, by a speech language classifier connected to the speech recognizer, a language corresponding to the input audio data based on the feature representations, and generating language classification information including a confidence level;
selecting and activating, by an output layer selector operatively coupled to the speech recognizer, one projection output layer from among a plurality of projection output layers based on the language classification information, wherein each projection output layer is configured to correspond to a specific language and includes a set of language-specific character units represented in a byte format; and
generating, by the activated projection output layer, output values in units of bytes corresponding to the classified language to produce a speech recognition result for the input audio data,
wherein the plurality of projection output layers includes a general-purpose projection output layer configured to be activated when the confidence level is below a predetermined threshold.
|
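The selection step and the fallback condition recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the names `LanguageClassification`, `select_output_layer`, `decode_byte_outputs`, and `GENERAL_PURPOSE`, the language codes, and the threshold value of 0.5 are all hypothetical. Two further assumptions go beyond the claim text: the byte format is taken to be UTF-8, and a language with no dedicated layer is routed to the general-purpose layer.

```python
from dataclasses import dataclass

# Hypothetical key for the general-purpose projection output layer.
GENERAL_PURPOSE = "general"


@dataclass
class LanguageClassification:
    """Language classification information, including a confidence level."""
    language: str      # e.g. "ko" or "en" (illustrative codes)
    confidence: float  # assumed to lie in [0.0, 1.0]


def select_output_layer(classification, available_layers, threshold=0.5):
    """Return the key of the single projection output layer to activate.

    The general-purpose layer is chosen when the confidence level is
    below the threshold, as recited in the claim. The threshold value
    itself is an assumption of this sketch.
    """
    if classification.confidence < threshold:
        return GENERAL_PURPOSE
    if classification.language not in available_layers:
        # Assumption not stated in the claim: unknown languages also
        # fall back to the general-purpose layer.
        return GENERAL_PURPOSE
    return classification.language


def decode_byte_outputs(byte_values):
    """Turn byte-unit output values (ints in 0..255) into text.

    UTF-8 is assumed here; the claim only says "byte format".
    """
    return bytes(byte_values).decode("utf-8", errors="replace")


layers = {"ko", "en", GENERAL_PURPOSE}

confident = select_output_layer(LanguageClassification("ko", 0.92), layers)
uncertain = select_output_layer(LanguageClassification("ko", 0.31), layers)

# Byte-unit outputs for a two-character Korean word (six UTF-8 bytes).
text = decode_byte_outputs(list("안녕".encode("utf-8")))
```

In this sketch `confident` is `"ko"` and `uncertain` is `"general"`. Because the claim recites activation when the confidence level is "below" the threshold, the comparison is strict, so a confidence exactly equal to the threshold keeps the language-specific layer.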