US 12,437,751 B2
Systems and methods of speaker-independent embedding for identification and verification from audio
Kedar Phatak, Atlanta, GA (US); and Elie Khoury, Atlanta, GA (US)
Assigned to Pindrop Security, Inc., Atlanta, GA (US)
Filed by PINDROP SECURITY, INC., Atlanta, GA (US)
Filed on Feb. 23, 2024, as Appl. No. 18/585,366.
Application 18/585,366 is a continuation of application No. 17/192,464, filed on Mar. 4, 2021, granted, now 11,948,553.
Claims priority of provisional application 62/985,757, filed on Mar. 5, 2020.
Prior Publication US 2024/0233709 A1, Jul. 11, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/06 (2013.01); G06N 3/045 (2023.01); G06N 20/00 (2019.01); G10L 15/16 (2006.01); G10L 25/27 (2013.01)
CPC G10L 15/063 (2013.01) [G06N 3/045 (2023.01); G06N 20/00 (2019.01); G10L 15/16 (2013.01); G10L 25/27 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method for authenticating audio signals using deep phoneprint (DP) embedding vectors, the method comprising:
executing, by the computer, a plurality of task-specific machine learning models using a plurality of features of speech and non-speech portions of an enrollment audio signal having one or more enrollment speaker-independent characteristics as an input to extract a plurality of enrollment speaker-independent embeddings for the enrollment audio signal using one or more embedding extraction layers of each of the plurality of task-specific machine learning models, the plurality of features of the enrollment audio signal including at least one of a spectro-temporal feature of the enrollment audio signal and metadata associated with the enrollment audio signal;
extracting, by the computer, an enrollment DP vector for the enrollment audio signal based upon the plurality of enrollment speaker-independent embeddings extracted for the enrollment audio signal;
executing, by the computer, the plurality of task-specific machine learning models using a plurality of features of speech and non-speech portions of an inbound audio signal having one or more inbound speaker-independent characteristics as the input to extract a plurality of inbound speaker-independent embeddings for the inbound audio signal using one or more embedding extraction layers of each of the plurality of task-specific machine learning models, the plurality of features of the inbound audio signal including at least one of a spectro-temporal feature of the inbound audio signal and metadata associated with the inbound audio signal;
extracting, by the computer, an inbound DP vector for the inbound audio signal based upon the plurality of inbound speaker-independent embeddings extracted for the inbound audio signal; and
generating, by the computer, one or more similarity scores for the inbound audio signal using the inbound DP vector and the enrollment DP vector for the enrollment audio signal.
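The claim recites three operations per signal: extract speaker-independent embeddings from several task-specific models, combine them into a single deep phoneprint (DP) vector, and score an inbound DP vector against an enrollment DP vector. The sketch below illustrates one plausible realization of the combination and scoring steps only; the choice of concatenation plus L2 normalization for DP-vector extraction, and cosine similarity for scoring, are illustrative assumptions and are not fixed by the claim, which leaves both operations open. The `extract_dp_vector` and `similarity_score` names are hypothetical, and the random arrays stand in for embeddings that the task-specific models would produce.

```python
import numpy as np

def extract_dp_vector(embeddings):
    # Combine per-model speaker-independent embeddings into one DP vector.
    # Concatenation + L2 normalization is an assumed combination step,
    # not one mandated by the claim.
    dp = np.concatenate(embeddings)
    return dp / np.linalg.norm(dp)

def similarity_score(enroll_dp, inbound_dp):
    # Cosine similarity of unit-normalized DP vectors: one plausible
    # "similarity score" between enrollment and inbound signals.
    return float(np.dot(enroll_dp, inbound_dp))

# Stand-ins for embeddings from three task-specific models (64-dim each).
rng = np.random.default_rng(0)
enroll_embeddings = [rng.standard_normal(64) for _ in range(3)]
# Inbound embeddings: the same underlying signal with small perturbations.
inbound_embeddings = [e + 0.05 * rng.standard_normal(64)
                      for e in enroll_embeddings]

enroll_dp = extract_dp_vector(enroll_embeddings)
inbound_dp = extract_dp_vector(inbound_embeddings)
score = similarity_score(enroll_dp, inbound_dp)
```

Because both DP vectors are unit-normalized, the score lies in [-1, 1], and lightly perturbed inbound embeddings yield a score near 1, which a downstream threshold could use for authentication.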