US 12,087,320 B1
Acoustic event detection
Qin Zhang, Boston, MA (US); Qingming Tang, Cambridge, MA (US); Ming Sun, Winchester, MA (US); Chao Wang, Newton, MA (US); Steve Mark Lorusso, Weston, MA (US); Andrew Thomas Bydlon, Cambridge, MA (US); James Garnet Droppo, Carnation, WA (US); Viktor Rozgic, Belmont, MA (US); Sripal Mehta, San Francisco, CA (US); and Yang Liu, Los Altos, CA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Feb. 14, 2022, as Appl. No. 17/671,194.
Int. Cl. G10L 25/51 (2013.01); G10L 15/18 (2013.01); G10L 15/22 (2006.01); G10L 15/30 (2013.01)
CPC G10L 25/51 (2013.01) [G10L 15/1815 (2013.01); G10L 15/22 (2013.01); G10L 15/30 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
receiving a first user input representing a first natural language description of a first acoustic event to be detected for a user profile;
determining, using graph data, first audio embedding data corresponding to the first natural language description, the first audio embedding data being determined using audio data that was available prior to receipt of the first user input, and the graph data representing an association between at least the first audio embedding data and at least the first natural language description;
receiving, from a first device associated with the user profile, first audio data;
processing the first audio data with respect to the first audio embedding data to determine first similarity data;
based at least in part on the first similarity data, determining that the first audio data is a first potential sample of the first acoustic event;
in response to determining that the first audio data is the first potential sample, determining acoustic event profile data using the first audio data;
associating the acoustic event profile data with the user profile;
after determining the acoustic event profile data, receiving, from the first device, second audio data;
processing the second audio data with respect to the acoustic event profile data to determine second similarity data;
based at least in part on the second similarity data, determining that the second audio data represents occurrence of the first acoustic event; and
in response to determining that the second audio data represents occurrence of the first acoustic event, sending, to a second device, first output data indicating that the first acoustic event occurred.
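The detection steps recited in claim 1 (looking up a stored audio embedding for a described event, then comparing incoming audio against it to decide whether it is a potential sample) can be sketched as follows. This is a minimal illustrative sketch only: the mapping `EVENT_GRAPH`, the embedding vectors, the cosine-similarity measure, and the threshold value are all assumptions for illustration and are not taken from the patent.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical stand-in for the claimed "graph data": an association between
# a natural language description and audio embedding data that existed
# before the user input was received.
EVENT_GRAPH = {
    "glass breaking": [0.9, 0.1, 0.2],
    "dog barking":    [0.1, 0.8, 0.3],
}

SIMILARITY_THRESHOLD = 0.8  # assumed tunable value

def is_potential_sample(description, audio_embedding):
    """Decide whether captured audio is a potential sample of the event
    described in natural language, per the claimed similarity test."""
    reference = EVENT_GRAPH[description]
    similarity = cosine(audio_embedding, reference)
    return similarity >= SIMILARITY_THRESHOLD
```

Under these assumptions, audio whose embedding lies close to the stored reference is accepted as a potential sample and would then feed the construction of the acoustic event profile data recited in the later claim steps.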
 
5. A computer-implemented method comprising:
receiving a first user input representing a first natural language description corresponding to a first acoustic event to be detected;
determining, based at least in part on stored data, first audio data corresponding to the first natural language description, wherein the stored data was available prior to receipt of the first user input;
determining, using the first audio data, first acoustic event profile data corresponding to the first acoustic event, the first acoustic event profile data associated with a user profile;
after determining the first acoustic event profile data, receiving second audio data associated with the user profile;
determining first similarity data using the first acoustic event profile data and the second audio data;
based at least in part on the first similarity data, determining that the second audio data represents occurrence of the first acoustic event; and
causing presentation of first output data indicating occurrence of the first acoustic event.
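Claim 5 recites building "acoustic event profile data" from retrieved audio and later comparing new audio against that profile. One way such a profile could plausibly be formed is as the mean of sample embeddings; the sketch below shows that idea. The mean-embedding profile, the cosine measure, and the threshold are illustrative assumptions, not a description of the patented implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def build_profile(sample_embeddings):
    """Form acoustic event profile data as the elementwise mean of the
    embeddings of audio determined to correspond to the described event
    (an assumed, simple form of profile)."""
    n = len(sample_embeddings)
    dim = len(sample_embeddings[0])
    return [sum(e[i] for e in sample_embeddings) / n for i in range(dim)]

def represents_event(profile, audio_embedding, threshold=0.8):
    """Claimed comparison step: similarity data between the profile and
    later-received audio decides whether the event occurred."""
    return cosine(audio_embedding, profile) >= threshold
```

A usage pattern consistent with the claim: embeddings of the first (retrieved) audio samples are averaged into a profile, and second audio received afterward is tested against that profile before output data is presented.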
 
13. A system comprising:
at least one processor; and
at least one memory including instructions that, when executed by the at least one processor, cause the system to:
receive a first user input representing a first natural language description corresponding to a first acoustic event to be detected;
determine, based at least in part on stored data, first audio data corresponding to the first natural language description, wherein the stored data was available prior to receipt of the first user input;
determine, using the first audio data, first acoustic event profile data corresponding to the first acoustic event, the first acoustic event profile data associated with a user profile;
after determining the first acoustic event profile data, receive second audio data associated with the user profile;
determine first similarity data using the first acoustic event profile data and the second audio data;
based at least in part on the first similarity data, determine that the second audio data represents occurrence of the first acoustic event; and
cause presentation of first output data indicating occurrence of the first acoustic event.
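Claim 13 recasts the method of claim 5 as a processor-and-memory system. The steps above, taken together, can be sketched as one small detector object: register an event from pre-existing stored data, then match later audio against the registered profiles. Class and attribute names, the stored embeddings, and the threshold are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class AcousticEventDetector:
    """Illustrative sketch of the claimed system, not the patented design."""

    def __init__(self, stored_embeddings, threshold=0.8):
        # Stand-in for "stored data ... available prior to receipt of the
        # first user input": description -> reference embedding.
        self.stored = stored_embeddings
        self.threshold = threshold
        self.profiles = {}  # description -> acoustic event profile data

    def register(self, description):
        """Seed an acoustic event profile from pre-existing stored data,
        keyed by the natural language description in the user input."""
        self.profiles[description] = list(self.stored[description])

    def detect(self, audio_embedding):
        """Return descriptions of registered events whose profiles are
        sufficiently similar to the received audio embedding."""
        return [d for d, p in self.profiles.items()
                if cosine(audio_embedding, p) >= self.threshold]
```

In this sketch, a non-empty result from `detect` corresponds to the final claim step of causing presentation of output data indicating that the event occurred.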