US 11,670,292 B2
Electronic device, method and computer program
Fabien Cardinaux, Stuttgart (DE); and Marc Ferras Font, Stuttgart (DE)
Assigned to SONY CORPORATION, Tokyo (JP)
Filed by Sony Corporation, Tokyo (JP)
Filed on Feb. 6, 2020, as Appl. No. 16/783,183.
Claims priority of application No. 19166137 (EP), filed on Mar. 29, 2019.
Prior Publication US 2020/0312322 A1, Oct. 1, 2020
Int. Cl. G10L 15/22 (2006.01); G10L 25/24 (2013.01); G10L 19/00 (2013.01); G06N 3/08 (2023.01); G10L 15/16 (2006.01); G10L 15/02 (2006.01); G10L 13/00 (2006.01)

CPC G10L 15/22 (2013.01) [G06N 3/08 (2013.01); G10L 13/00 (2013.01); G10L 15/02 (2013.01); G10L 15/16 (2013.01); G10L 19/00 (2013.01); G10L 25/24 (2013.01)]

14 Claims

1. An electronic device, comprising circuitry configured to:

window an audio input signal to extract features from the audio input signal,

generate a first senone symbol sequence by propagating the extracted features through a neural network,

decode the first senone symbol sequence to obtain a transcript of the audio input signal,

perform forced alignment on the transcript to obtain a second senone symbol sequence,

determine differences between the second senone symbol sequence and the first senone symbol sequence by computing a gradient of a loss function,

modify the gradient of the loss function by multiplying the gradient of the loss function with a predefined multiplication factor,

perform gradient descent based on the extracted features and the modified gradient of the loss function to generate enhanced features for the extracted features from the transcript of the audio input signal, and

perform vocoding based on the enhanced features to produce an enhanced audio signal.
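
The gradient steps recited in claim 1 (computing a gradient of a loss function, multiplying it by a predefined factor, and performing gradient descent on the extracted features) can be illustrated with a minimal sketch. It is not the patented implementation. It assumes a toy single-layer softmax model in place of the neural network, a cross-entropy loss, and a target senone sequence supplied directly rather than obtained by decoding and forced alignment. The names `enhance_features`, `weights`, `factor`, `step` and `n_iter` are hypothetical.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(features, weights, target_senones):
    """Mean cross-entropy between model posteriors and target senones."""
    posteriors = softmax(features @ weights)
    picked = posteriors[np.arange(len(target_senones)), target_senones]
    return float(-np.mean(np.log(picked + 1e-12)))


def enhance_features(features, weights, target_senones,
                     factor=1.0, step=0.05, n_iter=50):
    """Gradient descent on the input features, not on the model weights.

    features:       (T, D) array, one feature vector per windowed frame
    weights:        (D, S) array, toy linear stand-in for the network
    target_senones: (T,) integer array, stand-in for the aligned
                    (second) senone symbol sequence
    factor:         assumed predefined multiplication factor
    """
    x = features.astype(float).copy()
    one_hot = np.eye(weights.shape[1])[target_senones]
    for _ in range(n_iter):
        posteriors = softmax(x @ weights)
        # Gradient of the per-frame cross-entropy with respect to x.
        grad = (posteriors - one_hot) @ weights.T
        # Scale the gradient by the multiplication factor.
        grad = factor * grad
        # Move the features toward lower loss.
        x = x - step * grad
    return x
```

In this sketch the model weights stay fixed and only the features change, so the updated features are the ones the toy model scores as closer to the target senone sequence. A vocoder would then synthesize audio from those features; that stage is omitted here.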