| CPC G10L 21/0216 (2013.01) [G10L 2021/02166 (2013.01)] | 9 Claims |

|
1. An artificial intelligence device comprising:
a plurality of microphones; and
a processor configured to:
receive a video signal and a plurality of voice signals each respectively input from a corresponding microphone among the plurality of microphones;
obtain, based on the received video signal, an angle between a reference microphone and a specific speaker corresponding to a specific speaker image from the received video signal;
determine a first output value by performing adaptive beamforming based on the received plurality of voice signals and the obtained angle;
determine a second output value by performing fixed beamforming based on two voice signals input through two preset microphones among the received plurality of voice signals and the obtained angle;
generate a mask value based on the determined first output value, the determined second output value, and a video zooming magnification;
generate an enhancement signal based on the generated mask value and a phase of the second output value;
convert each of the two voice signals into a power spectrum;
obtain the second output value by performing the fixed beamforming to increase power of a point corresponding to the obtained angle from the converted power spectrum; and
generate the mask value according to Equation 1 below:
[Equation 1]
wherein E_Adaptive(k,l) denotes the first output value according to a k-th frequency and an l-th adaptive beamformer,
|E_Adaptive(k,l)| denotes a square root value of gain of the first output value,
E_Fixed(k,l) denotes the second output value according to a k-th frequency and an l-th fixed beamformer,
|E_Fixed(k,l)| denotes a square root value of gain of the second output value,
β is set to 0 in case of a minimum magnification, to |E_Fixed(k,l)| in case of a maximum magnification, and to MAX(α)/α in the other case, and
α denotes the video zooming magnification.
|
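The three-way rule for β recited in claim 1 can be illustrated with a minimal Python sketch for a single frequency bin. This is not the patented implementation: the function name `select_beta` and the parameters `alpha_min`, `alpha_max`, and `e_fixed_mag` are hypothetical, and the sketch assumes that MAX(α) means the maximum available zoom magnification. Equation 1 itself is not reproduced in the text, so the mask computation is not shown.

```python
def select_beta(alpha, alpha_min, alpha_max, e_fixed_mag):
    """Return beta for one (k, l) bin following the claim's piecewise rule.

    alpha       -- current video zooming magnification
    alpha_min   -- minimum magnification (hypothetical parameter)
    alpha_max   -- maximum magnification, assumed to be MAX(alpha)
    e_fixed_mag -- |E_Fixed(k,l)| for the bin being processed
    """
    if not alpha_min <= alpha <= alpha_max:
        raise ValueError("alpha must lie within [alpha_min, alpha_max]")
    if alpha == alpha_min:
        # Minimum magnification: beta is set to 0.
        return 0.0
    if alpha == alpha_max:
        # Maximum magnification: beta equals |E_Fixed(k,l)|.
        return float(e_fixed_mag)
    # Any intermediate magnification: beta = MAX(alpha) / alpha.
    return alpha_max / alpha
```

For example, under these assumptions a zoom range of 1x to 4x with the current magnification at 2x gives β = 4 / 2 = 2.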