| CPC G10L 21/0232 (2013.01) [G06N 3/0455 (2023.01); G10L 21/02 (2013.01); G10L 21/038 (2013.01)] | 30 Claims |

|
1. A device to perform speech enhancement, the device comprising:
one or more processors configured to:
process image data to detect at least one of an emotion, a speaker identification, or a noise type;
generate context data that represents the at least one of the emotion, the speaker identification, or the noise type;
obtain input spectral data based on an input signal that corresponds to the image data, the input signal representing sound that includes speech;
provide the input spectral data to a first encoder of a multi-encoder transformer to generate first encoded data;
provide the context data to at least a second encoder of the multi-encoder transformer to generate second encoded data;
provide the first encoded data and the second encoded data to a decoder of the multi-encoder transformer to generate output spectral data that represents a speech enhanced version of the input signal; and
perform speech synthesis on the output spectral data to generate an output waveform corresponding to an enhanced version of the speech.
|
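The data flow recited in claim 1 (two encoders feeding one decoder) can be illustrated with a minimal sketch. This is not the claimed implementation: the single-layer attention, the `CONTEXT_EMBEDDINGS` table, the label strings, the 4-dimensional vectors, and the names `encode`, `decode`, and `enhance` are all hypothetical, chosen only to make the example runnable. The image analysis that detects the emotion, speaker identification, or noise type, and the speech synthesis step that turns the output spectral data into a waveform, are outside the sketch.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]

# Hypothetical context embeddings; in the claim, context data is derived
# from image data (emotion, speaker identification, or noise type).
CONTEXT_EMBEDDINGS: Dict[str, Vector] = {
    "emotion:calm": [1.0, 0.0, 0.0, 0.0],
    "noise:traffic": [0.0, 1.0, 0.0, 0.0],
    "speaker:alice": [0.0, 0.0, 1.0, 0.0],
}


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Matrix, memory: Matrix) -> Matrix:
    """Scaled dot-product attention with keys and values both set to memory."""
    dim = len(memory[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in memory]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, memory)) for i in range(dim)])
    return out


def encode(x: Matrix) -> Matrix:
    """Toy encoder: one self-attention layer with a residual connection."""
    attended = attend(x, x)
    return [[a + b for a, b in zip(row, att)] for row, att in zip(x, attended)]


def decode(first_encoded: Matrix, second_encoded: Matrix) -> Matrix:
    """Toy decoder: attend to each encoder output, then average the results.

    Assumption: the spectral encoding doubles as the decoder queries.
    """
    from_spectral = attend(first_encoded, first_encoded)
    from_context = attend(first_encoded, second_encoded)
    return [[(a + b) / 2.0 for a, b in zip(r1, r2)]
            for r1, r2 in zip(from_spectral, from_context)]


def enhance(input_spectral: Matrix, context_labels: List[str]) -> Matrix:
    """Map input spectral frames plus context labels to output spectral frames."""
    context = [CONTEXT_EMBEDDINGS[label] for label in context_labels]
    first_encoded = encode(input_spectral)   # first encoder: spectral data
    second_encoded = encode(context)         # second encoder: context data
    return decode(first_encoded, second_encoded)


# Three spectral frames of four bins each (made-up values).
frames = [[0.2, 0.1, 0.4, 0.3], [0.5, 0.2, 0.1, 0.2], [0.3, 0.3, 0.2, 0.2]]
output_spectral = enhance(frames, ["emotion:calm", "noise:traffic"])
```

The output has one frame per input frame, so a downstream vocoder (not shown) would receive a spectral sequence of the same length as the input.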