US 12,444,405 B2
Textual knowledge transfer for improved speech recognition and understanding
Samuel Thomas, White Plains, NY (US); Vishal Sunder, Columbus, OH (US); Hong-Kwang Kuo, Pleasantville, NY (US); Brian E. D. Kingsbury, Cortlandt Manor, NY (US); Eric Fosler-Lussier, Columbus, OH (US); and George Andrei Saon, Stamford, CT (US)
Assigned to International Business Machines Corporation, Armonk, NY (US); and Ohio State Innovation Foundation, Columbus, OH (US)
Filed by International Business Machines Corporation, Armonk, NY (US); and Ohio State Innovation Foundation, Columbus, OH (US)
Filed on May 2, 2023, as Appl. No. 18/310,598.
Prior Publication US 2024/0371361 A1, Nov. 7, 2024
Int. Cl. G10L 15/06 (2013.01); G10L 15/16 (2006.01)

CPC G10L 15/063 (2013.01) [G10L 15/16 (2013.01)]

20 Claims

1. A system comprising:

a processor that executes computer-executable components stored in a non-transitory computer-readable memory, wherein the computer-executable components comprise:

a deriving component that derives one or more speech-based embeddings from an utterance via a speech encoder;

a cross-attention component that aligns, at a token level, one or more Large Language Model (LLM) based sentence embeddings with the one or more speech-based embeddings;

a loss component that combines an alignment loss and an Automatic Speech Recognition (ASR) loss; and

a training component that trains an ASR system with an end-to-end framework using the loss component and the cross-attention component to produce one or more enriched embeddings.
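
The components recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes scaled dot-product cross-attention in which each LLM token embedding acts as a query over the speech-encoder frames, a cosine-distance alignment loss, and a weighted sum as the way the two losses are combined. None of these choices is specified in the claim. The function names, the toy vectors, the weight of 0.5, and the placeholder ASR loss value are all hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(token_queries, speech_frames):
    """For each token embedding (query), return an attention-weighted
    average of the speech frames (keys and values). Assumed form:
    scaled dot-product attention with no learned projections."""
    dim = len(speech_frames[0])
    scale = math.sqrt(dim)
    aligned = []
    for query in token_queries:
        weights = softmax([dot(query, frame) / scale for frame in speech_frames])
        aligned.append([
            sum(w * frame[i] for w, frame in zip(weights, speech_frames))
            for i in range(dim)
        ])
    return aligned

def cosine_alignment_loss(aligned, token_embeddings):
    """Mean (1 - cosine similarity) between each aligned speech vector
    and its LLM token embedding. Assumed loss; the claim names none."""
    distances = []
    for a, t in zip(aligned, token_embeddings):
        cos = dot(a, t) / (math.sqrt(dot(a, a)) * math.sqrt(dot(t, t)))
        distances.append(1.0 - cos)
    return sum(distances) / len(distances)

def combined_loss(asr_loss, alignment_loss, weight=0.5):
    """Weighted sum of the ASR loss and the alignment loss.
    The weight is a hypothetical hyperparameter."""
    return asr_loss + weight * alignment_loss

# Toy data: four speech frames and two token embeddings, all of dimension 2.
speech_frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
token_embeddings = [[1.0, 0.0], [0.0, 1.0]]

aligned = cross_attend(token_embeddings, speech_frames)
align_loss = cosine_alignment_loss(aligned, token_embeddings)

# Placeholder value; a real system would compute this with its ASR objective.
asr_loss = 2.0
total_loss = combined_loss(asr_loss, align_loss)
```

In this sketch the cross-attention output has one vector per token, so the alignment is at the token level regardless of how many speech frames the encoder emits. In a trained system the gradient of the combined loss would flow back into the speech encoder, which is how the claim's "enriched embeddings" would arise. The sketch only evaluates the loss and performs no training.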