US 12,249,346 B2
Method for detecting and classifying coughs or other non-semantic sounds using audio feature set learned from speech
Jacob Garrison, Seattle, WA (US); Jacob Scott Peplinski, Chandler, AZ (US); and Joel Shor, Tokyo (JP)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Nov. 15, 2023, as Appl. No. 18/509,722.
Application 18/509,722 is a continuation of application No. 17/507,461, filed on Oct. 21, 2021, granted, now Pat. No. 11,862,188.
Claims priority of provisional application 63/104,291, filed on Oct. 22, 2020.
Prior Publication US 2024/0161769 A1, May 16, 2024
Int. Cl. G10L 25/66 (2013.01); A61B 5/00 (2006.01); A61B 5/08 (2006.01); G10L 15/02 (2006.01); G10L 15/04 (2013.01); G10L 15/06 (2013.01); G10L 25/30 (2013.01); G10L 25/51 (2013.01); G10L 25/78 (2013.01); G16H 40/67 (2018.01)
CPC G10L 25/66 (2013.01) [A61B 5/0823 (2013.01); A61B 5/4803 (2013.01); A61B 5/7267 (2013.01); A61B 5/7282 (2013.01); G10L 15/02 (2013.01); G10L 15/04 (2013.01); G10L 15/063 (2013.01); G10L 25/30 (2013.01); G10L 25/51 (2013.01); G10L 25/78 (2013.01); G16H 40/67 (2018.01)] 22 Claims
OG exemplary drawing
 
1. A computer-implemented method of detecting a non-semantic and paralinguistic event in an audio stream comprising:
performing one or more pre-processing steps on the audio stream to generate an input audio sequence comprising a plurality of time-separated audio segments;
generating, by a student model, an embedding for the plurality of time-separated audio segments, the student model having been trained using knowledge distillation applied to a self-supervised triplet loss embedding model, the self-supervised triplet loss embedding model having been trained to learn an audio feature set in a self-supervised triplet loss manner from a plurality of speech audio clips; and
providing the embedding for the plurality of audio segments to an inference model performing inference to detect the non-semantic and paralinguistic event.
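The claimed pipeline has three stages: pre-processing the audio stream into time-separated segments, embedding those segments with a distilled student model, and passing the embeddings to an inference model that detects the non-semantic event. A minimal sketch of that data flow is below; the framing parameters, the energy-based features, the random-projection "student" model, and the logistic inference head are all illustrative stand-ins chosen for self-containment, not the patented implementation.

```python
import numpy as np

def frame_audio(audio, sr=16000, frame_s=0.96, hop_s=0.96):
    """Pre-processing: split a mono waveform into time-separated
    segments. Frame/hop lengths here are hypothetical choices."""
    frame = int(sr * frame_s)
    hop = int(sr * hop_s)
    n = max(0, 1 + (len(audio) - frame) // hop)
    return np.stack([audio[i * hop : i * hop + frame] for i in range(n)])

class StudentEmbedder:
    """Stand-in for the distilled student model: a fixed random
    projection of crude log-energy features. A real student would be
    a small network trained by knowledge distillation from the
    triplet-loss teacher embedding."""
    def __init__(self, in_dim, emb_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((in_dim, emb_dim)) / np.sqrt(in_dim)

    def __call__(self, segments):
        feats = np.log1p(np.abs(segments))   # per-sample log-magnitude features
        return feats @ self.W                # -> (num_segments, emb_dim)

def detect_event(embeddings, w, b=0.0):
    """Toy inference model: mean-pool the segment embeddings and
    apply a logistic score; threshold at 0.5 to declare detection."""
    pooled = embeddings.mean(axis=0)
    score = 1.0 / (1.0 + np.exp(-(pooled @ w + b)))
    return bool(score > 0.5), float(score)

# End-to-end pass over 3 seconds of synthetic audio.
sr = 16000
audio = np.random.default_rng(1).standard_normal(sr * 3)
segments = frame_audio(audio, sr)                  # (3, 15360)
embeddings = StudentEmbedder(segments.shape[1])(segments)  # (3, 64)
detected, score = detect_event(embeddings, np.zeros(embeddings.shape[1]))
```

With zero inference weights the logistic score is exactly 0.5, so no event is declared; a trained detector would learn `w` and `b` from labeled examples of the target sound (e.g., coughs) in embedding space.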