US 12,353,479 B2
Classifying documents using a domain-specific natural language processing model
Sameen Mayur Desai, Morris Plains, NJ (US); and Grigoriy Aleksandrovich Serbarinov, Summit, NJ (US)
Assigned to Bristol-Myers Squibb Company, Princeton, NJ (US)
Filed by Bristol-Myers Squibb Company, Summit, NJ (US)
Filed on Dec. 9, 2021, as Appl. No. 17/547,017.
Claims priority of provisional application 63/123,336, filed on Dec. 9, 2020.
Prior Publication US 2022/0179906 A1, Jun. 9, 2022
Int. Cl. G06F 16/906 (2019.01); G06F 16/93 (2019.01); G06F 40/216 (2020.01); G06F 40/295 (2020.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2023.01)

CPC G06F 16/906 (2019.01) [G06F 16/93 (2019.01); G06F 40/216 (2020.01); G06F 40/295 (2020.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01)]

16 Claims

1. A method comprising:

receiving, by one or more computing devices, a set of documents and metadata for each document in the set of documents, wherein the set of documents corresponds to a domain;

generating, by the one or more computing devices, a set of word embeddings for each document of the set of documents, each word embedding including one or more words from a respective document;

tokenizing, by the one or more computing devices, each word embedding of the set of word embeddings into a set of segments, each segment including a word from the word embedding;

training, by the one or more computing devices, a learning model to classify each document of the set of documents of the domain by recursively, during each of a number of iterations of the training:

breaking down, by the one or more computing devices, each of the segments of the set of segments of each document of the set of documents into a set of features;

assigning, by the one or more computing devices, a part-of-speech tag to each of the segments of the set of segments for each document of the set of documents based on predetermined weights assigned to each feature of the set of features of a corresponding segment;

assigning, by the one or more computing devices, a dependency tag to each of the segments of the set of segments of each document of the set of documents based on the part-of-speech tag assigned to the corresponding segment and the predetermined weights assigned to each feature of the set of features of the corresponding segment;

assigning, by the one or more computing devices, a Named Entity Recognition (NER) label from a set of predefined labels corresponding to the domain to each of the segments of the set of segments of each document of the set of documents based on the part-of-speech tag and the dependency tag assigned to the corresponding segment and the predetermined weights assigned to each feature of the set of features of the corresponding segment; and

validating, by the one or more computing devices, the assigned NER labels by comparing the metadata for each document to the assigned NER labels of the respective document.
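
For illustration only, the tagging cascade recited in claim 1 (features, then part-of-speech tag, then dependency tag, then NER label, then validation against metadata) might be sketched as below. This is a minimal toy under stated assumptions, not the patented implementation: the feature names, the weight tables, the label set (DRUG, DOSE, O), the helper names, and the sample sentence are all hypothetical, and the iterative training loop of the claim is not modeled. The weights are simply fixed up front, standing in for the "predetermined weights" of the claim.

```python
# Hypothetical, hand-picked weight tables (feature -> {label: weight}).
# None of these values come from the patent; they only illustrate the cascade.
POS_WEIGHTS = {
    "is_capitalized": {"PROPN": 2.0},
    "has_digit": {"NUM": 3.0},
    "suffix=ed": {"VERB": 2.0},
    "is_lower": {"NOUN": 1.0},
}
DEP_WEIGHTS = {
    "pos=PROPN": {"nsubj": 1.0},
    "pos=VERB": {"ROOT": 1.0},
    "pos=NUM": {"nummod": 1.0},
    "pos=NOUN": {"obj": 1.0},
}
NER_WEIGHTS = {
    "pos=PROPN": {"DRUG": 1.0},
    "dep=nummod": {"DOSE": 1.0},
    "has_digit": {"DOSE": 1.0},
}


def features(segment):
    """Break one segment (a single word) into a set of surface features."""
    feats = set()
    if segment[:1].isupper():
        feats.add("is_capitalized")
    if any(ch.isdigit() for ch in segment):
        feats.add("has_digit")
    if segment.islower():
        feats.add("is_lower")
    if segment.endswith("ed"):
        feats.add("suffix=ed")
    return feats


def best_label(feats, weights, default):
    """Sum the weights of the active features and return the top-scoring label."""
    scores = {}
    for feat in sorted(feats):
        for label, weight in weights.get(feat, {}).items():
            scores[label] = scores.get(label, 0.0) + weight
    if not scores:
        return default
    # Iterating labels in sorted order makes tie-breaking deterministic.
    return max(sorted(scores), key=scores.get)


def annotate(text):
    """Tokenize text into segments and assign POS, dependency, and NER labels."""
    rows = []
    for raw in text.split():
        segment = raw.strip(".,;:")
        if not segment:
            continue
        feats = features(segment)
        pos = best_label(feats, POS_WEIGHTS, "X")
        # The dependency tag sees the features plus the POS tag.
        dep = best_label(feats | {"pos=" + pos}, DEP_WEIGHTS, "dep")
        # The NER label sees the features plus the POS and dependency tags.
        ner = best_label(feats | {"pos=" + pos, "dep=" + dep}, NER_WEIGHTS, "O")
        rows.append((segment, pos, dep, ner))
    return rows


def validate(rows, metadata):
    """Return True if every metadata value was assigned its expected NER label."""
    assigned = {(segment, ner) for segment, _pos, _dep, ner in rows}
    return all(
        (value, label) in assigned
        for label, values in metadata.items()
        for value in values
    )


rows = annotate("Aspirin reduced fever at 500mg.")
is_valid = validate(rows, {"DRUG": {"Aspirin"}, "DOSE": {"500mg"}})
```

Each stage reuses the same weighted-sum scorer and only widens the feature set with the tags produced by the previous stage, which mirrors the dependency order recited in the claim. A real system would learn the weights rather than hard-code them.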