US 12,272,377 B2
Audio event detection with window-based prediction
Lihi Ahuva Shiloh Perl, Tel-Aviv (IL); Ben Fishman, Herzelya (IL); Gilad Pundak, Rehovot (IL); and Yonit Hoffman, Herzeliya (IL)
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed by MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed on Mar. 5, 2024, as Appl. No. 18/596,075.
Application 18/596,075 is a continuation of application No. 17/647,318, filed on Jan. 6, 2022, granted, now 11,948,599.
Prior Publication US 2024/0363139 A1, Oct. 31, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 25/93 (2013.01); G06N 3/048 (2023.01); G06N 3/08 (2023.01); G10L 25/45 (2013.01)

CPC G10L 25/93 (2013.01) [G06N 3/048 (2023.01); G06N 3/08 (2013.01); G10L 25/45 (2013.01)]

20 Claims

1. A computing system, comprising:

one or more processors; and

memory storing instructions that, when executed, cause the one or more processors to:

tag respective portions of an audio signal with ground truth labels for a plurality of audio event classes;

generate a consolidated audio signal by augmenting the audio signal with semi-synthetic audio signals;

divide the consolidated audio signal into a plurality of segments, wherein:

each segment of the plurality of segments overlaps an adjacent segment by an overlap amount; and

each segment of the plurality of segments that is associated with a respective portion of the audio signal that is tagged with a ground truth label retains the ground truth label;

form a training data set by generating a normalized time domain representation of each segment of the plurality of segments; and

train, based on the training data set and for the normalized time domain representation of each segment of the plurality of segments, an artificial intelligence model to predict a classification score for each audio event class of the plurality of audio event classes.
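
The segmentation and normalization steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and the claim does not specify these details. The function names (split_with_overlap, peak_normalize, build_training_set), the fixed segment length, the rule that a segment retains a label when it overlaps a tagged portion by at least one sample, and the use of peak normalization are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# A tagged portion: (start_sample, end_sample_exclusive, ground_truth_label).
Tag = Tuple[int, int, str]
Segment = Tuple[List[float], Optional[str]]


def split_with_overlap(samples: Sequence[float], tags: Sequence[Tag],
                       segment_len: int, overlap: int) -> List[Segment]:
    """Divide a signal into fixed-length segments; adjacent segments share
    `overlap` samples. A segment retains the label of the first tagged
    portion it intersects (hypothetical rule), otherwise None."""
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")
    hop = segment_len - overlap  # distance between consecutive segment starts
    segments: List[Segment] = []
    for start in range(0, len(samples) - segment_len + 1, hop):
        end = start + segment_len
        label: Optional[str] = None
        for tag_start, tag_end, name in tags:
            if tag_start < end and start < tag_end:  # half-open intervals intersect
                label = name
                break
        segments.append((list(samples[start:end]), label))
    return segments


def peak_normalize(segment: Sequence[float]) -> List[float]:
    """Scale a segment so its largest absolute sample is 1.0 (assumed
    normalization; an all-zero segment is returned unchanged)."""
    peak = max((abs(x) for x in segment), default=0.0)
    if peak == 0.0:
        return [float(x) for x in segment]
    return [x / peak for x in segment]


def build_training_set(samples: Sequence[float], tags: Sequence[Tag],
                       segment_len: int, overlap: int) -> List[Segment]:
    """Pair each normalized time-domain segment with its retained label."""
    return [(peak_normalize(seg), label)
            for seg, label in split_with_overlap(samples, tags, segment_len, overlap)]
```

With a segment length of 4 and an overlap of 2, an 8-sample signal yields three segments starting at samples 0, 2, and 4. The resulting (normalized segment, label) pairs would then serve as input to whatever model is trained to output a classification score per audio event class.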