US 11,995,519 B2
Method of and server for converting categorical feature value into a numeric representation thereof and for generating a split value for the categorical feature
Andrey Vladimirovich Gulin, Moscow region (RU)
Assigned to Direct Cursus Technology L.L.C, Dubai (AE)
Filed by YANDEX EUROPE AG, Lucerne (CH)
Filed on Jun. 6, 2018, as Appl. No. 16/000,977.
Claims priority of application No. RU2017140973 (RU), filed on Nov. 24, 2017.
Prior Publication US 2019/0164085 A1, May 30, 2019
Int. Cl. G06N 20/00 (2019.01); G06F 16/901 (2019.01); G06N 5/045 (2023.01)

CPC G06N 20/00 (2019.01) [G06F 16/9027 (2019.01); G06N 5/045 (2013.01)]

9 Claims

1. A method of converting a value of a categorical feature into a numeric representation thereof, the categorical feature being associated with a training object used for training a Machine Learning Algorithm (MLA), the MLA being executable by a first server to predict a target value for an in-use object, the MLA comprising a set of models hosted by a plurality of second servers, each model of the set of models being based on an ensemble of decision trees, the training object being processed in a node of a given level of a decision tree of the ensemble of decision trees, the decision tree having at least one prior level of the decision tree, the at least one prior level having at least one prior training object having at least one prior categorical feature value having been converted to a prior numeric representation thereof for the at least one prior level of the decision tree, the MLA executable by an electronic device to predict a value for an in-use object, the method comprising:

accessing, from a non-transitory computer-readable medium of the first server, a set of training objects, wherein each training object of the set of training objects contains a document and an event indicator associated with the document, and wherein each document is associated with a categorical feature;

organizing the set of training objects into an ordered list of training objects, wherein when the training objects are associated with an inherent temporal order, the ordered list of training objects is ordered in accordance with the temporal order and when the training objects do not have an inherent temporal order, generating the ordered list of training objects in a random order of the training objects to be used as the ordered list of training objects;

storing the set of training objects in a plurality of databases, wherein each database of the plurality of databases comprises a copy of the set of training objects, and wherein each second server of the plurality of second servers is associated with a respective database of the plurality of databases;

generating, by the plurality of second servers operating in parallel, each second server accessing the set of training objects on its respective database of the plurality of databases, the set of models for the MLA, wherein generating each model of the set of models comprises generating the numeric representation of the categorical feature value by:

retrieving the prior numeric representation of the at least one prior categorical feature value for a given object of the set of training objects at the at least one prior level of the decision tree;

generating, for each combination of the at least one prior categorical feature value at the at least one prior level of the decision tree and at least one of the categorical feature values of the set of training objects, a current numeric representation for the given level of the decision tree, the generating the current numeric representation being done while generating the decision tree, wherein the current numeric representation is generated by:

(i) counting a first number of training objects that precede the training object in the ordered list and have both the at least one of the categorical feature values and event indicators with positive outcomes,

(ii) counting a second number of total training objects that precede the training object in the ordered list with the at least one of the categorical feature values, and

(iii) dividing the first number by the second number; and

after building the set of models for the MLA, transmitting, by the plurality of second servers and to the first server, indications that the set of models for the MLA has been generated.
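
The ordering step and the counting steps (i) to (iii) recited above can be illustrated with a minimal sketch. It is not the patented implementation: the function names `order_objects` and `ordered_ratios`, the `(category, positive)` tuple layout, the `seed` parameter, and the `default` returned when no earlier object shares the category (a case the claim does not address) are all assumptions made for illustration. A combination of a prior categorical feature value and a current one could be represented here by using a tuple as the category key.

```python
import random
from collections import defaultdict


def order_objects(objects, timestamps=None, seed=0):
    """Order objects by timestamp if given, otherwise shuffle them (assumed helper)."""
    if timestamps is not None:
        # Sort indices by timestamp so the objects themselves need not be comparable.
        order = sorted(range(len(objects)), key=lambda i: timestamps[i])
        return [objects[i] for i in order]
    shuffled = list(objects)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def ordered_ratios(ordered, default=None):
    """Numeric representation per object using only objects that precede it.

    `ordered` is a list of (category, positive) pairs in list order.
    """
    positives = defaultdict(int)  # (i) preceding objects with this category and a positive outcome
    totals = defaultdict(int)     # (ii) preceding objects with this category
    result = []
    for category, positive in ordered:
        seen = totals[category]
        # (iii) divide the first number by the second; `default` is an assumed
        # fallback for the first occurrence of a category.
        result.append(positives[category] / seen if seen else default)
        totals[category] += 1
        if positive:
            positives[category] += 1
    return result


sample = [("a", True), ("b", False), ("a", False), ("a", True), ("b", True)]
ratios = ordered_ratios(sample)
```

Because each value depends only on objects earlier in the list, the event indicator of the object being converted never contributes to its own numeric representation.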