US 11,854,558 B2
System and method for training a transformer-in-transformer-based neural network model for audio data
Wei Tsung Lu, Los Angeles, CA (US); Ju-Chiang Wang, Los Angeles, CA (US); Minz Won, Los Angeles, CA (US); Keunwoo Choi, Los Angeles, CA (US); and Xuchen Song, Los Angeles, CA (US)
Assigned to Lemon Inc., Grand Cayman (KY)
Filed by Lemon Inc., Grand Cayman (KY)
Filed on Oct. 15, 2021, as Appl. No. 17/502,863.
Prior Publication US 2023/0124006 A1, Apr. 20, 2023
Int. Cl. G10L 19/02 (2013.01); G10L 25/30 (2013.01)
CPC G10L 19/02 (2013.01) [G10L 25/30 (2013.01)] 16 Claims
OG exemplary drawing
 
1. An apparatus comprising:
at least one processor and a non-transitory computer-readable medium storing therein computer program code including instructions for one or more programs that, when executed by the processor, cause the processor to:
obtain audio data;
generate a time-frequency representation of the audio data to be applied as input for a transformer-based neural network model, the transformer-based neural network model including a spectral transformer and a temporal transformer;
determine spectral embeddings and first temporal embeddings of the audio data based on the time-frequency representation of the audio data, the spectral embeddings including a first frequency class token (FCT);
determine each vector of a second FCT by passing each vector of the first FCT in the spectral embeddings through the spectral transformer;
determine second temporal embeddings by adding a linear projection of the second FCT to the first temporal embeddings;
determine third temporal embeddings by passing the second temporal embeddings through the temporal transformer; and
generate music information of the audio data based on the third temporal embeddings.
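The claimed data flow can be sketched as follows. This is a hypothetical illustration of the transformer-in-transformer pattern the claim recites, not the patented implementation: the array shapes, the single-head attention stand-in for a full transformer, and all variable names (`spec_with_fct`, `second_fct`, `W_proj`, `music_info`) are assumptions introduced for clarity.

```python
# Hypothetical sketch of the claimed steps (assumed shapes and names):
# 1. spectral embeddings per frame, with a first frequency class token (FCT)
# 2. spectral transformer over each frame's frequency axis -> second FCT
# 3. linear projection of the second FCT added to first temporal embeddings
# 4. temporal transformer over the time axis -> third temporal embeddings
# 5. music information generated from the third temporal embeddings
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention: a minimal
    stand-in for a full transformer encoder layer."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

T, F, d = 8, 16, 32                      # time frames, freq bins, embed dim
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

# Step 1: spectral embeddings from a time-frequency representation,
# prepended with a learnable first FCT vector per frame.
spec = rng.standard_normal((T, F, d))
first_fct = np.zeros((T, 1, d))
spec_with_fct = np.concatenate([first_fct, spec], axis=1)   # (T, F+1, d)

# Step 2: spectral transformer applied per frame over the frequency axis;
# each frame's FCT output vector becomes the second FCT.
spec_out = np.stack(
    [self_attention(spec_with_fct[t], Wq, Wk, Wv) for t in range(T)]
)
second_fct = spec_out[:, 0, :]                              # (T, d)

# Step 3: second temporal embeddings = first temporal embeddings
# plus a linear projection of the second FCT.
W_proj = rng.standard_normal((d, d)) * 0.1
first_temporal = rng.standard_normal((T, d))
second_temporal = first_temporal + second_fct @ W_proj

# Step 4: temporal transformer over the time axis.
third_temporal = self_attention(second_temporal, Wq, Wk, Wv)

# Step 5: generate music information (here, per-frame logits over
# an assumed set of 10 music tags) from the third temporal embeddings.
W_out = rng.standard_normal((d, 10)) * 0.1
music_info = third_temporal @ W_out
print(music_info.shape)                                     # (8, 10)
```

The sketch makes the nesting explicit: the spectral transformer runs once per frame across frequency, and only its FCT output crosses over (via the linear projection) into the temporal transformer's input, which then attends across frames.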