US 12,217,549 B1
System and method for matching a voice sample to a facial image based on voice and image metaproperties
Nir Schwartz, Ramat-Gan (IL); and Arkady Krishtul, Zichron Yakov (IL)
Assigned to CORSOUND AI LTD., Tel Aviv (IL)
Filed by Corsound AI Ltd, Tel Aviv (IL)
Filed on Apr. 11, 2024, as Appl. No. 18/632,397.
Int. Cl. G06V 40/70 (2022.01); G06V 10/764 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 40/16 (2022.01); G10L 17/02 (2013.01); G10L 17/04 (2013.01); G10L 17/10 (2013.01)
CPC G06V 40/70 (2022.01) [G06V 10/764 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 40/172 (2022.01); G06V 40/179 (2022.01); G10L 17/02 (2013.01); G10L 17/04 (2013.01); G10L 17/10 (2013.01); G06V 2201/10 (2022.01)] 12 Claims
OG exemplary drawing
 
1. A method for matching a voice sample to a facial image, the method comprising, using a processor:
obtaining a voice sample and a facial image;
calculating a plurality of voice metaproperties from the voice sample;
calculating a plurality of image metaproperties from the facial image, wherein each of the image metaproperties corresponds to one of the voice metaproperties, wherein each of the voice metaproperties and the image metaproperties comprises a probability distribution providing the probabilities that the voice metaproperty or the image metaproperty equals certain values of the metaproperty; and
determining a level of match between the voice sample and the facial image, based on the plurality of voice metaproperties and the plurality of image metaproperties, wherein determining whether the voice sample matches the facial image is performed by:
calculating a distance between each of the voice metaproperties and the corresponding image metaproperty;
calculating weights for a weighted sum operation by training a classifier and deriving the weights from the parameters of the classifier;
calculating the weighted sum of the distances; and
determining that the voice sample matches the facial image if the weighted sum satisfies a threshold condition, and that the voice sample does not match the facial image otherwise,
wherein the classifier is trained by:
obtaining a labelled dataset comprising a plurality of matching pairs, labelled as matching pairs, and a plurality of unmatching pairs, labelled as unmatching pairs, wherein each of the matching pairs comprises a matching labelled voice sample and labelled facial image, and each of the unmatching pairs comprises an unmatching labelled voice sample and labelled facial image;
calculating, for each of the labelled voice samples, the plurality of voice metaproperties from the labelled voice sample;
calculating, for each of the labelled facial images, the plurality of image metaproperties from the labelled facial image; and
using the plurality of voice metaproperties and the plurality of image metaproperties of the plurality of matching pairs and the plurality of unmatching pairs, and the associated labels, to train the classifier.
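The inference and training steps recited in claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the claim does not specify the metaproperties, the distance measure, or the classifier, so this sketch assumes each metaproperty is a discrete probability distribution over the same value bins for voice and image, uses a Jensen-Shannon divergence as the distance, and trains a logistic-regression classifier whose coefficients are taken as the weights of the weighted sum.

```python
import numpy as np

def metaproperty_distance(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability distributions over
    the same metaproperty values (illustrative choice of distance)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pair_distances(voice_props, image_props):
    """One distance per corresponding (voice, image) metaproperty pair."""
    return np.array([metaproperty_distance(v, i)
                     for v, i in zip(voice_props, image_props)])

def train_weights(pairs, labels, lr=0.5, steps=2000):
    """Train a logistic-regression classifier on the distance vectors of
    labelled matching (1) and unmatching (0) pairs by gradient descent,
    then derive the weights from the classifier's parameters."""
    X = np.stack([pair_distances(v, i) for v, i in pairs])
    y = np.asarray(labels, float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted match probability
        grad = p - y                            # gradient of the log loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def is_match(voice_props, image_props, w, b, threshold=0.0):
    """Weighted sum of the per-metaproperty distances; the pair matches
    if the sum satisfies the threshold condition."""
    return (w @ pair_distances(voice_props, image_props) + b) >= threshold

# Synthetic labelled dataset: matching pairs have near-identical
# distributions, unmatching pairs are drawn independently.
rng = np.random.default_rng(0)
pairs, labels = [], []
for _ in range(50):
    v = rng.dirichlet(np.ones(3), size=2)
    j = np.abs(v + rng.normal(0, 0.02, v.shape))
    j /= j.sum(axis=1, keepdims=True)
    pairs.append((v, j)); labels.append(1)                         # matching
    pairs.append((v, rng.dirichlet(np.ones(3), size=2))); labels.append(0)

w, b = train_weights(pairs, labels)
```

The hypothetical names (`metaproperty_distance`, `train_weights`, `is_match`) and the two-metaproperty, three-bin setup are invented for this sketch; the claim covers any number of metaproperties and leaves the classifier architecture open.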