US 12,254,669 B1
Systems and methods for training a multi-modal machine learning architecture for content generation
Elham Saraee, Medford, MA (US); Jehan Hamedi, Wellesley, MA (US); and Zachary Halloran, Franklin, MA (US)
Assigned to VIZIT LABS, INC., Boston, MA (US)
Filed by VIZIT LABS, INC., Boston, MA (US)
Filed on Nov. 14, 2024, as Appl. No. 18/948,425.
Application 18/948,425 is a continuation in part of application No. 18/943,693, filed on Nov. 11, 2024.
Application 18/943,693 is a continuation of application No. 18/782,569, filed on Jul. 24, 2024, granted, now 12,142,027.
Application 18/782,569 is a continuation in part of application No. 18/414,148, filed on Jan. 16, 2024, granted, now 12,080,046.
Application 18/414,148 is a continuation in part of application No. 18/494,483, filed on Oct. 25, 2023, granted, now 11,922,675.
Application 18/494,483 is a continuation of application No. 17/833,671, filed on Jun. 6, 2022, granted, now 11,804,028.
Application 17/833,671 is a continuation of application No. 17/548,341, filed on Dec. 10, 2021, granted, now 11,417,085.
Application 17/548,341 is a continuation in part of application No. 16/537,426, filed on Aug. 9, 2019, abandoned.
Application 16/537,426 is a division of application No. 15/727,044, filed on Oct. 6, 2017, granted, now 10,380,650.
Claims priority of provisional application 63/606,210, filed on Dec. 5, 2023.
Claims priority of provisional application 63/529,588, filed on Jul. 28, 2023.
Claims priority of provisional application 62/537,428, filed on Jul. 26, 2017.
This patent is subject to a terminal disclaimer.
Int. Cl. G06V 10/82 (2022.01); G06F 16/438 (2019.01); G06N 3/045 (2023.01); G06V 10/40 (2022.01); G06V 10/74 (2022.01)

CPC G06V 10/761 (2022.01) [G06F 16/438 (2019.01); G06N 3/045 (2023.01); G06V 10/40 (2022.01); G06V 10/82 (2022.01)]

30 Claims

1. A method, comprising:

receiving, by one or more processors, a plurality of training images;

executing, by the one or more processors, a feature extraction machine learning model using the plurality of training images as input to generate a plurality of training embeddings for the plurality of training images each in an embedding space;

training, by the one or more processors, a content scoring machine learning model using the plurality of training embeddings for the plurality of training images to generate performance scores for content items based on embeddings in the embedding space;

receiving, by the one or more processors, a set of text;

executing, by the one or more processors, the feature extraction machine learning model using the set of text to generate a text embedding in the same embedding space as the training embeddings for the plurality of training images and corresponding to one or more features of the set of text;

generating, by the one or more processors using the content scoring machine learning model, a text performance score for the set of text using the text embedding in the embedding space and corresponding to the one or more features of the set of text; and

generating, by the one or more processors, a record identifying the text performance score for the set of text.
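
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and everything named in it is an assumption made for illustration: the helper names (`embed_image`, `embed_text`, `train_scorer`, `score`), the embedding size `DIM`, the representation of each training image as a list of descriptive tags, the hashing encoder that stands in for a real multi-modal feature extraction model, the linear scoring model, and the example target scores attached to the training images. The claim itself does not specify any of these.

```python
import hashlib
import math

DIM = 8  # assumed size of the shared embedding space


def _hashed_embedding(tokens):
    """Toy stand-in for a multi-modal encoder: hash tokens into DIM buckets."""
    vec = [0.0] * DIM
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit-length embedding


def embed_image(image_tags):
    # Hypothetical: an image is represented by tags instead of pixels.
    return _hashed_embedding(image_tags)


def embed_text(text):
    # Text is mapped into the same DIM-dimensional space as the images.
    return _hashed_embedding(text.lower().split())


def train_scorer(embeddings, targets, epochs=200, lr=0.1):
    """Fit an assumed linear scoring model with per-sample gradient steps."""
    weights, bias = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, targets):
            err = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def score(model, embedding):
    weights, bias = model
    return sum(w * xi for w, xi in zip(weights, embedding)) + bias


# Receive training images (tags and target scores are made-up examples).
training_images = [
    (["beach", "sunset"], 0.9),
    (["beach", "ocean"], 0.8),
    (["invoice", "spreadsheet"], 0.2),
    (["spreadsheet", "office"], 0.1),
]

# Generate training embeddings, then train the content scoring model on them.
training_embeddings = [embed_image(tags) for tags, _ in training_images]
model = train_scorer(training_embeddings, [y for _, y in training_images])

# Receive a set of text, embed it in the same space, and score it.
text = "beach sunset"
text_embedding = embed_text(text)
text_score = score(model, text_embedding)

# Generate a record identifying the text performance score.
record = {"text": text, "text_performance_score": text_score}
print(record)
```

The point of the sketch is that the scoring model only ever sees embeddings, so a model trained on image embeddings can score text once the text is mapped into the same space. In practice the hashing encoder would be replaced by a jointly trained image and text encoder.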