US 11,915,686 B2
Speaker adaptation for attention-based encoder-decoder
Zhong Meng, Seattle, WA (US); Yashesh Gaur, Redmond, WA (US); Jinyu Li, Redmond, WA (US); and Yifan Gong, Sammamish, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Jan. 5, 2022, as Appl. No. 17/568,875.
Application 17/568,875 is a continuation of application No. 16/675,515, filed on Nov. 6, 2019, granted, now 11,232,782.
Claims priority of provisional application 62/893,967, filed on Aug. 30, 2019.
Prior Publication US 2022/0130376 A1, Apr. 28, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/065 (2013.01); G10L 15/06 (2013.01); G10L 15/22 (2006.01); G10L 19/00 (2013.01)
CPC G10L 15/065 (2013.01) [G10L 15/063 (2013.01); G10L 15/22 (2013.01); G10L 19/00 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A system comprising:
a processing unit; and
a memory storage device including program code that when executed by the processing unit causes the system to:
input first speech frames of a target speaker to an adapted speaker-independent attention-based encoder-decoder model; and
output token posteriors corresponding to the input first speech frames from the adapted speaker-independent attention-based encoder-decoder model,
the adapted speaker-independent attention-based encoder-decoder model having been generated by training a speaker-independent attention-based encoder-decoder model to classify output units based on second input speech frames, the trained speaker-independent attention-based encoder-decoder model associated with a first output distribution, and by adapting the trained speaker-independent attention-based encoder-decoder model to classify output tokens based on input speech frames of the target speaker while simultaneously training the trained speaker-independent attention-based encoder-decoder model to maintain a similarity between the first output distribution and a second output distribution of the adapted speaker-independent attention-based encoder-decoder model.
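The claim describes adapting a trained speaker-independent model on target-speaker frames while keeping the adapted model's output distribution close to the original one. A minimal sketch of such a regularized adaptation objective is below; the use of KL divergence as the similarity measure, the `rho` interpolation weight, and all function names are illustrative assumptions, not language from the claim:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis (token posteriors).
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adaptation_loss(adapted_logits, si_logits, labels, rho=0.5):
    """Illustrative adaptation objective: cross-entropy on target-speaker
    labels plus a KL term that keeps the adapted model's output
    distribution (second distribution) close to the frozen
    speaker-independent model's output distribution (first distribution).
    The KL regularizer and rho weight are assumptions for illustration."""
    p_adapt = softmax(adapted_logits)   # adapted model's token posteriors
    p_si = softmax(si_logits)           # frozen speaker-independent posteriors
    n = adapted_logits.shape[0]
    # Cross-entropy of the adapted model on the target-speaker labels.
    ce = -np.log(p_adapt[np.arange(n), labels]).mean()
    # KL(p_si || p_adapt): penalizes drift away from the SI distribution.
    kl = (p_si * (np.log(p_si) - np.log(p_adapt))).sum(axis=-1).mean()
    return (1.0 - rho) * ce + rho * kl
```

When the adapted and speaker-independent logits coincide, the KL term vanishes and the loss reduces to the weighted cross-entropy alone, so the regularizer only activates as the adapted distribution drifts from the original.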