US 12,230,259 B2
Array geometry agnostic multi-channel personalized speech enhancement
Sefik Emre Eskimez, Bellevue, WA (US); Takuya Yoshioka, Bellevue, WA (US); Huaming Wang, Clyde Hill, WA (US); Hassan Taherian, Columbus, OH (US); Zhuo Chen, Redmond, WA (US); and Xuedong Huang, Redmond, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Dec. 17, 2021, as Appl. No. 17/555,332.
Claims priority of provisional application 63/252,493, filed on Oct. 5, 2021.
Prior Publication US 2023/0116052 A1, Apr. 13, 2023
Int. Cl. G10L 15/20 (2006.01); G10L 15/22 (2006.01); G10L 21/0208 (2013.01)
CPC G10L 15/20 (2013.01) [G10L 15/22 (2013.01); G10L 21/0208 (2013.01)]
18 Claims
 
1. A system comprising:
a processor; and
a computer-readable medium storing instructions that are operative upon execution by the processor to:
extract speaker embeddings from enrollment data for a first target speaker;
extract spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker;
provide the input audio, the extracted speaker embeddings, and the extracted spatial features to a geometry-agnostic personalized speech enhancement (PSE) model trained with a virtual microphone signal comprising a combination of outputs of microphones in a multi-channel microphone array having different array geometries; and
produce output data using the geometry-agnostic PSE model without geometry information for the microphone array, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
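
The spatial-feature and virtual-microphone steps recited in claim 1 can be illustrated with a minimal sketch. The claim does not say how the microphone outputs are combined or which spatial features are extracted, so everything below is an assumption made for illustration: the virtual microphone is taken as a plain average of the channels, the spatial feature is an inter-channel phase difference (IPD) measured against that virtual microphone, and the function names `stft`, `virtual_microphone`, and `spatial_features` are hypothetical. The speaker-embedding extractor and the PSE model itself are omitted.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Naive STFT of a 1-D signal; returns a (frames, n_fft // 2 + 1) complex array."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def virtual_microphone(channels):
    """Combine all microphone outputs into one signal.

    Assumption: a plain average over channels; the claim only says "a combination".
    """
    return channels.mean(axis=0)

def spatial_features(channels, n_fft=512, hop=256):
    """Cosine and sine of the IPD of each mic relative to the virtual mic.

    channels: (num_mics, num_samples) array.
    Returns a (2, frames, bins) array. Averaging over microphones (an
    assumption) keeps the output shape independent of the microphone count.
    """
    ref = stft(virtual_microphone(channels), n_fft, hop)
    specs = np.stack([stft(c, n_fft, hop) for c in channels])
    ipd = np.angle(specs * np.conj(ref)[None])  # phase of each mic minus phase of ref
    return np.stack([np.cos(ipd).mean(axis=0), np.sin(ipd).mean(axis=0)])
```

Because the features are pooled over microphones, a 3-microphone recording and a 7-microphone recording of the same length yield arrays of the same shape. That is one simple way a downstream model could consume input without being told the array geometry. It is not presented here as the method the patent uses.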