US 12,236,975 B2
Bi-directional recurrent encoders with multi-hop attention for speech emotion recognition
Trung Bui, San Jose, CA (US); Subhadeep Dey, Martigny (CH); and Seunghyun Yoon, Seoul (KR)
Assigned to Adobe Inc., San Jose, CA (US)
Filed by Adobe Inc., San Jose, CA (US)
Filed on Nov. 15, 2021, as Appl. No. 17/526,810.
Application 17/526,810 is a continuation of application No. 16/543,342, filed on Aug. 16, 2019, granted, now Pat. No. 11,205,444.
Prior Publication US 2022/0076693 A1, Mar. 10, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 25/00 (2013.01); G06F 17/16 (2006.01); G06F 17/18 (2006.01); G06N 3/047 (2023.01); G10L 25/30 (2013.01); G10L 25/63 (2013.01)
CPC G10L 25/63 (2013.01) [G06F 17/16 (2013.01); G06F 17/18 (2013.01); G06N 3/047 (2023.01); G10L 25/30 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A system comprising:
one or more memory devices comprising:
an audio bi-directional recurrent encoder that generates an audio feature vector for one or more words in an acoustic sequence;
a textual bi-directional recurrent encoder that generates a textual feature vector for the one or more words in a textual sequence corresponding to the acoustic sequence;
a multi-hop neural attention model that generates an attention output at each hop, alternating between utilizing the textual feature vector and the audio feature vector as context; and
a hidden feature vector generator that generates a hidden feature vector based on the attention output and one or more of the audio feature vector and the textual feature vector; and
one or more processors configured to cause the system to determine an emotion of the acoustic sequence based on the hidden feature vector.
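The architecture recited in claim 1 can be illustrated with a minimal NumPy sketch. This is not the patented implementation; the toy simple-RNN encoder, dot-product attention, the tanh combination step, the hop count, dimensions, and the four-class emotion head are all illustrative assumptions chosen to show the claimed data flow: two bi-directional recurrent encoders, attention hops that alternate which modality supplies the context, a hidden feature vector carried across hops, and a classifier over the final hidden vector.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # per-direction hidden size (illustrative assumption)

def bidirectional_encoder(x, Wf, Wb):
    """Toy bi-directional recurrent encoder: a forward and a backward
    simple-RNN pass whose per-step states are concatenated."""
    T, d_in = x.shape
    hf, hb = np.zeros((T, D)), np.zeros((T, D))
    sf, sb = np.zeros(D), np.zeros(D)
    for t in range(T):                      # forward pass
        sf = np.tanh(x[t] @ Wf[:d_in] + sf @ Wf[d_in:])
        hf[t] = sf
    for t in reversed(range(T)):            # backward pass
        sb = np.tanh(x[t] @ Wb[:d_in] + sb @ Wb[d_in:])
        hb[t] = sb
    return np.concatenate([hf, hb], axis=1)  # (T, 2*D) feature vectors

def attention(features, context):
    """Dot-product attention over `features` with `context` as the query."""
    scores = features @ context
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ features               # attention output, shape (2*D,)

# Random stand-ins for acoustic frames and word embeddings.
audio_in = rng.standard_normal((20, 5))     # acoustic sequence
text_in = rng.standard_normal((6, 5))       # corresponding textual sequence

Wf_a, Wb_a = rng.standard_normal((2, 5 + D, D))
Wf_t, Wb_t = rng.standard_normal((2, 5 + D, D))

A = bidirectional_encoder(audio_in, Wf_a, Wb_a)  # audio feature vectors
X = bidirectional_encoder(text_in, Wf_t, Wb_t)   # textual feature vectors

# Multi-hop attention: each hop alternates which modality is attended to,
# and the hidden feature vector carries context from hop to hop.
hidden = A.mean(axis=0)                     # initial context from audio
for hop in range(3):
    feats = X if hop % 2 == 0 else A        # alternate text / audio
    out = attention(feats, hidden)
    hidden = np.tanh(out + hidden)          # hidden feature vector generator

# Emotion determined from the final hidden feature vector.
W_cls = rng.standard_normal((2 * D, 4))     # 4 illustrative emotion classes
logits = hidden @ W_cls
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

In this sketch the "hidden feature vector generator" is a simple tanh of the attention output plus the previous context; the claim leaves the exact combination open, requiring only that it depend on the attention output and one or both modality feature vectors.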
 
10. A system comprising:
one or more memory devices comprising:
an audio encoder that generates an audio feature vector for one or more words in an acoustic sequence;
a textual encoder that generates a textual feature vector for the one or more words in a textual sequence corresponding to the acoustic sequence;
a first neural attention model that generates a first attention output by applying attention to the textual feature vector using the audio feature vector as context;
a first hidden feature vector generator that generates a first hidden feature vector based on the first attention output;
a second neural attention model that generates a second attention output by applying attention to the audio feature vector using the first hidden feature vector as context; and
a second hidden feature vector generator that generates a second hidden feature vector based on the second attention output and the audio feature vector; and
one or more processors configured to cause the system to determine an emotion of the acoustic sequence based on the first hidden feature vector and the second hidden feature vector.
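Claim 10 fixes the hop order explicitly: text is attended first with audio as context, then audio is attended with the first hidden vector as context, and both hidden vectors feed the emotion decision. A self-contained sketch of that two-hop flow, again with assumed details (dot-product attention, mean pooling for the initial context, tanh combination, dimensions, and a four-class head are illustrative, and random matrices stand in for the encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # feature dimension (illustrative assumption)

def attention(features, context):
    """Dot-product attention: `context` is the query over `features`."""
    s = features @ context
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ features

# Stand-ins for the audio and textual encoder outputs.
A = rng.standard_normal((20, D))   # audio feature vectors, one per frame
X = rng.standard_normal((6, D))    # textual feature vectors, one per word

# First hop: attend to the textual features using the audio as context.
ctx = A.mean(axis=0)
h1 = np.tanh(attention(X, ctx))                  # first hidden feature vector

# Second hop: attend to the audio features using h1 as context, then
# combine the attention output with the (pooled) audio features.
h2 = np.tanh(attention(A, h1) + A.mean(axis=0))  # second hidden feature vector

# Emotion determined from both hidden feature vectors.
W_cls = rng.standard_normal((2 * D, 4))          # 4 illustrative classes
logits = np.concatenate([h1, h2]) @ W_cls
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

Note how the second hidden feature vector generator, per the claim, combines the second attention output with the audio feature vector itself, while the first generator depends only on the first attention output.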