| CPC G10L 15/063 (2013.01) [G10L 15/16 (2013.01)] | 20 Claims |

|
1. A system comprising:
a processor that executes computer-executable components stored in a non-transitory computer-readable memory, wherein the computer-executable components comprise:
a deriving component that derives one or more speech-based embeddings from an utterance via a speech encoder;
a cross-attention component that aligns, at a token level, one or more Large Language Model (LLM) based sentence embeddings with the one or more speech-based embeddings;
a loss component that combines an alignment loss and an Automatic Speech Recognition (ASR) loss; and
a training component that trains an ASR system with an end-to-end framework using the loss component and the cross-attention component to produce one or more enriched embeddings.
|
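For illustration only, the token-level alignment and combined loss recited in claim 1 might be sketched as below. This is a minimal sketch under stated assumptions, not the claimed implementation: the claim does not specify the attention form, the alignment loss, or how the two losses are combined. The sketch assumes scaled dot-product attention with the LLM token embeddings as queries and the speech-encoder frames as keys and values, a shared embedding dimension with no learned projections, a mean squared error alignment loss, and a weighted sum with a hypothetical `weight` parameter. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(text_tokens, speech_frames):
    """Assumed attention form: each LLM token embedding (query) receives a
    softmax-weighted average of the speech frames (keys and values)."""
    dim = len(speech_frames[0])
    scale = math.sqrt(dim)
    aligned = []
    for query in text_tokens:
        weights = softmax([dot(query, key) / scale for key in speech_frames])
        aligned.append(
            [sum(w * frame[i] for w, frame in zip(weights, speech_frames))
             for i in range(dim)]
        )
    return aligned


def alignment_loss(text_tokens, aligned_speech):
    """Assumed loss form: mean squared error between each token embedding
    and the speech representation attended for that token."""
    total, count = 0.0, 0
    for token, speech in zip(text_tokens, aligned_speech):
        for a, b in zip(token, speech):
            total += (a - b) ** 2
            count += 1
    return total / count


def combined_loss(asr_loss, align_loss, weight=0.5):
    """Assumed combination: ASR loss plus a weighted alignment loss."""
    return asr_loss + weight * align_loss


# Toy usage: two text tokens, three speech frames, dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
aligned = cross_attend(tokens, frames)
# The ASR loss value here is a placeholder, not a computed quantity.
total = combined_loss(asr_loss=1.2, align_loss=alignment_loss(tokens, aligned))
```

In an end-to-end setup, a scalar such as `total` would be the quantity minimized during training, so that gradients from both terms reach the speech encoder.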