| CPC G10L 21/0224 (2013.01) [G10L 21/0216 (2013.01); G10L 2021/02166 (2013.01)] | 23 Claims |

|
1. A method for voice control, comprising:
transforming, using a short-time Fourier transform (STFT) applied to data in each of a plurality of windows aligned across each input channel of a multichannel audio stream, the multichannel audio stream into a complex-valued frequency-domain representation,
wherein for a current one of the plurality of windows, the method comprises:
updating a first complex-valued covariance matrix corresponding to a slowly-adapting beamformer and forming a single-channel denoised estimate for each frequency band in the STFT;
calculating a voice activity detection (VAD) estimate for each frequency band in the STFT by comparing a magnitude of the single-channel denoised estimate to a magnitude of each input channel of the multichannel audio stream; and
selectively updating or refraining from updating, responsive to the VAD estimate respectively indicating a presence or an absence of speech, a second complex-valued covariance matrix corresponding to a quickly-adapting beamformer; and
controlling, by a hardware processor, a voice user interface based device to perform a user perceptible action, responsive to an output of at least the quickly-adapting beamformer.
|
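
The per-window steps recited in claim 1 can be illustrated with a minimal NumPy sketch. It is not the patented implementation, and the claim does not fix most of the details used here. The following are all assumptions made for illustration: the minimum-variance (MVDR-style) weight formula, the all-ones steering vector, the forgetting factors `alpha_slow` and `alpha_fast`, the threshold `vad_ratio`, the Hann window, and every function name. The device-control step is omitted because it depends on the target hardware.

```python
import numpy as np


def stft_frames(x, win_len=256, hop=128):
    """STFT over windows aligned across channels.

    x: real array (channels, samples) -> complex array (frames, channels, bins).
    """
    window = np.hanning(win_len)
    n_frames = 1 + (x.shape[1] - win_len) // hop
    frames = np.stack(
        [x[:, i * hop:i * hop + win_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def mvdr_weights(R, d, eps=1e-6):
    """Assumed beamformer: w = R^-1 d / (d^H R^-1 d), computed per frequency band.

    R: (bins, channels, channels), d: (bins, channels).
    """
    n_ch = R.shape[-1]
    loaded = R + eps * np.eye(n_ch)  # diagonal loading keeps the solve stable
    num = np.linalg.solve(loaded, d[..., None])[..., 0]
    den = np.einsum("fc,fc->f", d.conj(), num)
    return num / den[:, None]


def init_state(n_ch, n_bins):
    """Identity covariances and a hypothetical broadside steering vector."""
    eye = np.tile(np.eye(n_ch, dtype=complex), (n_bins, 1, 1))
    return {
        "R_slow": eye.copy(),
        "R_fast": eye.copy(),
        "steer": np.ones((n_bins, n_ch), dtype=complex),
    }


def process_frame(X, state, alpha_slow=0.99, alpha_fast=0.7, vad_ratio=0.5):
    """Process one STFT window. X: (channels, bins)."""
    Xf = X.T  # (bins, channels)
    outer = Xf[:, :, None] * Xf[:, None, :].conj()  # x x^H for each band

    # 1. Always update the slowly-adapting covariance, then form the
    #    single-channel denoised estimate y = w^H x for each band.
    state["R_slow"] = alpha_slow * state["R_slow"] + (1 - alpha_slow) * outer
    w_slow = mvdr_weights(state["R_slow"], state["steer"])
    y = np.einsum("fc,fc->f", w_slow.conj(), Xf)

    # 2. Per-band VAD: compare |y| with the magnitude of each input channel.
    #    (Assumed rule: speech if |y| reaches vad_ratio * |x_c| on every channel.)
    speech = np.all(np.abs(y)[:, None] >= vad_ratio * np.abs(Xf), axis=1)

    # 3. Update the quickly-adapting covariance only in bands flagged as
    #    speech; leave it untouched in the other bands.
    R_fast = state["R_fast"]
    R_fast[speech] = alpha_fast * R_fast[speech] + (1 - alpha_fast) * outer[speech]

    # Output of the quickly-adapting beamformer for this window.
    w_fast = mvdr_weights(R_fast, state["steer"])
    out = np.einsum("fc,fc->f", w_fast.conj(), Xf)
    return out, speech


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal((2, 4096))  # synthetic two-channel input
    spectra = stft_frames(audio)
    state = init_state(n_ch=2, n_bins=spectra.shape[-1])
    for frame in spectra:
        out, speech = process_frame(frame, state)
```

Using two forgetting factors is one simple way to obtain a "slowly-adapting" and a "quickly-adapting" estimate from the same recursive update; a real system would likely tune these values and the VAD rule to the microphone array and the acoustic environment.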