| CPC G10L 15/20 (2013.01) [G10L 15/22 (2013.01); G10L 21/0208 (2013.01)] | 18 Claims |

|
1. A system comprising:
a processor; and
a computer-readable medium storing instructions that are operative upon execution by the processor to:
extract speaker embeddings from enrollment data for a first target speaker;
extract spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker;
provide the input audio, the extracted speaker embeddings, and the extracted spatial features to a geometry-agnostic personalized speech enhancement (PSE) model trained with a virtual microphone signal comprising a combination of outputs of microphones in a multi-channel microphone array having different array geometries; and
produce output data using the geometry-agnostic PSE model without geometry information for the microphone array, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
|
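The two signal-level ideas recited in claim 1, a virtual microphone signal formed by combining microphone outputs and spatial features extracted from multi-channel audio, can be illustrated with a minimal sketch. The claim does not specify the combination or the feature type, so the choices here are assumptions made only for illustration: the combination is a weighted sum (defaulting to the mean), and the spatial feature is the inter-channel phase difference relative to the first channel at a single frequency bin. The function names `virtual_microphone`, `dft_bin`, and `phase_difference_features` are hypothetical and are not taken from the patent.

```python
import cmath
import math


def virtual_microphone(channels, weights=None):
    """Combine per-microphone sample lists into one virtual signal.

    Assumption: the "combination" is a weighted sum; with no weights
    given, every microphone contributes equally (a plain average).
    """
    n_mics = len(channels)
    if weights is None:
        weights = [1.0 / n_mics] * n_mics
    n_samples = len(channels[0])
    return [
        sum(w * ch[t] for w, ch in zip(weights, channels))
        for t in range(n_samples)
    ]


def dft_bin(frame, k):
    """Return the k-th discrete Fourier transform coefficient of one frame."""
    n = len(frame)
    return sum(
        x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
    )


def phase_difference_features(channels, k):
    """Phase of each channel relative to channel 0 at frequency bin k.

    Assumption: inter-channel phase difference stands in for the
    "spatial features" of the claim, which does not name a feature type.
    """
    ref = dft_bin(channels[0], k)
    return [cmath.phase(dft_bin(ch, k) * ref.conjugate()) for ch in channels]


# Illustrative two-microphone frame: the second channel lags by pi/4 radians.
N, K = 16, 2
mic0 = [math.cos(2 * math.pi * K * t / N) for t in range(N)]
mic1 = [math.cos(2 * math.pi * K * t / N - math.pi / 4) for t in range(N)]

virtual = virtual_microphone([mic0, mic1])
features = phase_difference_features([mic0, mic1], K)
```

Neither function takes microphone positions as input, which mirrors the claim's requirement that output be produced without geometry information for the array. The speaker-embedding extraction and the PSE model itself are outside the scope of this sketch.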