| CPC G10L 25/66 (2013.01) [A61B 5/0823 (2013.01); A61B 5/4803 (2013.01); A61B 5/7267 (2013.01); A61B 5/7282 (2013.01); G10L 15/02 (2013.01); G10L 15/04 (2013.01); G10L 15/063 (2013.01); G10L 25/30 (2013.01); G10L 25/51 (2013.01); G10L 25/78 (2013.01); G16H 40/67 (2018.01)] | 22 Claims |

|
1. A computer-implemented method of detecting a non-semantic and paralinguistic event in an audio stream comprising:
performing one or more pre-processing steps on the audio stream to generate an input audio sequence comprising a plurality of time-separated audio segments;
generating, by a student model, an embedding for the plurality of time-separated audio segments, the student model having been trained using knowledge distillation applied to a self-supervised triplet loss embedding model, the self-supervised triplet loss embedding model having been trained to learn an audio feature set in a self-supervised triplet loss manner from a plurality of speech audio clips; and
providing the embedding for the plurality of audio segments to an inference model performing inference to detect the non-semantic and paralinguistic event.
|
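The data flow recited in claim 1 can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the claimed implementation: the names `segment_audio`, `LinearEmbedder`, `triplet_loss`, `distillation_loss`, and `detect_event` are hypothetical, the teacher and student are single random tanh layers standing in for trained networks, the distillation objective is assumed to be a mean squared error between student and teacher embeddings, and the inference model is assumed to be a logistic classifier over mean-pooled embeddings. No training loop is shown; the two loss functions only indicate which quantities such training might minimize.

```python
import numpy as np

rng = np.random.default_rng(0)

def segment_audio(samples, segment_len, hop):
    """Pre-processing: split a 1-D signal into time-separated fixed-length segments."""
    if len(samples) < segment_len:
        raise ValueError("audio is shorter than one segment")
    n = 1 + (len(samples) - segment_len) // hop
    return np.stack([samples[i * hop : i * hop + segment_len] for i in range(n)])

class LinearEmbedder:
    """Hypothetical stand-in for an embedding network: one random tanh layer."""
    def __init__(self, in_dim, out_dim, rng):
        self.w = rng.normal(scale=1.0 / np.sqrt(in_dim), size=(in_dim, out_dim))

    def __call__(self, segments):
        return np.tanh(segments @ self.w)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge on squared distances: pull anchor toward positive, away from negative.

    In a self-supervised setup the positive is often taken from the same clip
    as the anchor and the negative from a different clip (assumed convention).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

def distillation_loss(student_emb, teacher_emb):
    """Assumed distillation objective: MSE between student and teacher embeddings."""
    return float(np.mean((student_emb - teacher_emb) ** 2))

def detect_event(embeddings, weights, bias=0.0):
    """Assumed inference model: logistic classifier on the mean-pooled embedding."""
    pooled = embeddings.mean(axis=0)
    return float(1.0 / (1.0 + np.exp(-(pooled @ weights + bias))))

# Synthetic stand-in for an audio stream: 16000 samples, four 4000-sample segments.
audio = rng.normal(size=16000)
segments = segment_audio(audio, segment_len=4000, hop=4000)

teacher = LinearEmbedder(4000, 8, rng)   # stands in for the triplet-loss model
student = LinearEmbedder(4000, 8, rng)   # stands in for the distilled student

student_emb = student(segments)                         # embedding step
kd = distillation_loss(student_emb, teacher(segments))  # quantity distillation would minimize
prob = detect_event(student_emb, rng.normal(size=8))    # event probability in (0, 1)
```

In this sketch the student has the same shape as the teacher so that the embeddings can be compared directly. A distilled student would usually be smaller than its teacher, which would require matching embedding dimensions or adding a projection before the comparison.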