| CPC G06N 3/0455 (2023.01) [G06F 40/20 (2020.01); G06F 40/295 (2020.01); G06N 3/09 (2023.01); G06Q 30/0202 (2013.01); G06Q 30/0225 (2013.01)] | 10 Claims |

|
1. A computer-implemented method of using machine learning to extract data from electronic communications, the computer-implemented method comprising:
initially training, by at least one processor using a masked language modeling task and a set of training data, a machine learning model, wherein the set of training data comprises a set of training electronic communications labeled with a set of training designated HyperText Markup Language (HTML) tags that define how content included in the set of training electronic communications should be displayed, wherein the content identifies or describes purchased products or services;
further training, by the at least one processor using a named entity recognition task and a task-specific training dataset, the machine learning model, wherein the task-specific training dataset identifies a set of predefined labels associated with the purchased products or services;
training, by the at least one processor using the set of training data, an entity linking model;
creating, using the entity linking model that was trained, a set of distinct groups associated with the purchased products or services;
accessing, by the at least one processor, an electronic communication indicating a purchase of a product or service by an individual;
parsing, by the at least one processor, the electronic communication to extract, from the electronic communication, a set of HTML tags that define how content included in the electronic communication should be displayed;
generating, by the at least one processor, a series of input tokens, including:
identifying (i) a portion of the set of HTML tags to remove, and (ii) another portion of the set of HTML tags to replace with a set of designated HTML tags,
removing the portion of the set of HTML tags from the set of HTML tags,
replacing the another portion of the set of HTML tags with the set of designated HTML tags, and
consolidating consecutive HTML tags included in the set of designated HTML tags that replaced the another portion of the set of HTML tags, wherein (i) the removing and replacing enables at least some information to be preserved and enables the machine learning model to learn from different formats of different electronic communications, and (ii) the series of input tokens results from the removing, replacing, and consolidating;
analyzing, by the machine learning model that was trained, the series of input tokens to output a set of token-level predictions respectively corresponding to at least some of the series of input tokens, wherein each of the set of token-level predictions is in a labeled format;
for each token-level prediction in the set of token-level predictions, converting, by the at least one processor, that token-level prediction into a predicted value for a defined category, of a set of defined categories, associated with the purchase of the product or service by the individual; and
determining, by the at least one processor based on (i) the set of predicted values for the set of defined categories, and (ii) the set of distinct groups created using the entity linking model, a digital reward for the purchase of the product or service by the individual.
|
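
The token-generation step recited in claim 1 (identify, remove, replace, consolidate) can be illustrated with a minimal sketch. It is not the claimed implementation. The `REMOVE` and `REPLACE` groupings, the designated tag names (`<block>`, `<row>`, `<cell>`), the handling of tags that fall in neither group, and the whitespace split that stands in for a real tokenizer are all assumptions made for illustration. The claim does not enumerate any of them. Consolidation is interpreted here as collapsing a run of identical designated tags into one.

```python
from html.parser import HTMLParser

# Hypothetical groupings: the claim does not say which tags belong where.
REMOVE = {"span", "font", "b", "i", "a"}            # tags dropped outright
REPLACE = {                                         # tags mapped to designated tags
    "div": "<block>", "p": "<block>",
    "tr": "<row>", "td": "<cell>",
}


class TagNormalizer(HTMLParser):
    """Collects text tokens and designated tags from an HTML message body."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in REMOVE:
            return                                  # removing step
        designated = REPLACE.get(tag)               # replacing step
        if designated is None:
            return                                  # assumption: unlisted tags are dropped too
        if self.tokens and self.tokens[-1] == designated:
            return                                  # consolidating step: collapse repeats
        self.tokens.append(designated)

    def handle_data(self, data):
        # Whitespace split is a stand-in for a real subword tokenizer.
        self.tokens.extend(data.split())


def to_input_tokens(html):
    """Return the series of input tokens for one electronic communication."""
    parser = TagNormalizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


sample = (
    "<div><div><p><b>Order</b> total</p></div></div>"
    "<table><tr><td>Widget</td><td>$5.00</td></tr></table>"
)
print(to_input_tokens(sample))
```

In this sketch the three nested layout tags at the start of `sample` collapse into a single `<block>` token, while the row and cell markers survive, so some layout information is preserved even though the sender-specific markup is discarded.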