| CPC G06F 40/205 (2020.01) [G06F 40/289 (2020.01); G06F 40/30 (2020.01)] | 49 Claims |

|
1. A method comprising:
receiving, by a computing device, a first entity and a second entity;
accessing a corpus;
preprocessing, by the computing device, the corpus by:
grouping the corpus into a plurality of chunks at a head node;
distributing the plurality of chunks to a plurality of worker nodes configured as parallel processing units within a distributed cluster, wherein each worker node is executed on a separate physical or virtual machine and operates asynchronously;
retrieving a second plurality of sentences from one of the plurality of chunks of the corpus by one of the plurality of worker nodes;
extracting then sending a plurality of extracted entities and extracted relational phrases to the head node;
mapping the extracted relational phrases in a pretrained vector space using a pretrained model to generate a plurality of extracted relational phrase embeddings;
clustering the extracted relational phrases in the pretrained vector space to generate clustering information for the plurality of extracted relational phrase embeddings; and
storing a mapping of the plurality of extracted entities, the plurality of extracted relational phrase embeddings, and the clustering information for the plurality of extracted relational phrase embeddings;
retrieving, by the computing device, a first plurality of sentences containing the first entity and the second entity from the corpus;
identifying, by the computing device, a plurality of relational phrases by extracting a relational phrase from each of the first plurality of sentences; and
identifying, by the computing device, one or more relationships between the first entity and the second entity.
|
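
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation, and it rests on several assumptions: a local thread pool stands in for the distributed worker nodes, the text between two known entities stands in for a real relational phrase extractor, a tiny hand-written vector table (`TOY_VECTORS`) stands in for the pretrained model, and a greedy cosine-similarity threshold stands in for whatever clustering method is actually used. All function names, entity names, and parameters are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a pretrained embedding model: a tiny hand-written
# word-vector table. A real system would load actual pretrained vectors.
TOY_VECTORS = {
    "inhibits": (1.0, 0.0),
    "blocks": (0.9, 0.1),
    "activates": (0.0, 1.0),
    "stimulates": (0.1, 0.9),
}

def chunk_corpus(sentences, chunk_size):
    """Head-node step: group the corpus into fixed-size chunks."""
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]

def extract_triples(chunk, entities):
    """Worker step (naive assumption): treat the text between two adjacent
    known entities in a sentence as the relational phrase."""
    triples = []
    for sentence in chunk:
        found = sorted((sentence.find(e), e) for e in entities if e in sentence)
        for (pos_a, a), (pos_b, b) in zip(found, found[1:]):
            phrase = sentence[pos_a + len(a):pos_b].strip(" ,.")
            if phrase:
                triples.append((a, phrase, b))
    return triples

def embed(phrase):
    """Average the toy vectors of the known words in the phrase."""
    vecs = [TOY_VECTORS[w] for w in phrase.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def cosine(u, v):
    norm = math.hypot(*u) * math.hypot(*v)
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def cluster(embeddings, threshold=0.8):
    """Greedy clustering: join the first cluster whose seed vector is
    similar enough, otherwise start a new cluster."""
    seeds, labels = [], []
    for vec in embeddings:
        for idx, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(idx)
                break
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels

def preprocess(sentences, entities, chunk_size=2, workers=2):
    """Chunk, extract in parallel, embed, cluster, and store the mapping."""
    chunks = chunk_corpus(sentences, chunk_size)
    # Threads are a local stand-in for worker nodes on separate machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: extract_triples(c, entities), chunks)
    triples = [t for chunk_triples in results for t in chunk_triples]
    embeddings = [embed(p) for _, p, _ in triples]
    labels = cluster(embeddings)
    return [
        {"entities": (a, b), "phrase": p, "embedding": e, "cluster": l}
        for (a, p, b), e, l in zip(triples, embeddings, labels)
    ]

def relationships(index, first, second):
    """Query step: cluster labels of phrases linking the two entities."""
    return sorted({r["cluster"] for r in index if set(r["entities"]) == {first, second}})
```

A possible usage, with made-up entity names: calling `preprocess` on sentences such as "DrugA inhibits GeneB." and "DrugA blocks GeneB." with `entities=["DrugA", "GeneB"]` builds the stored mapping, after which `relationships(index, "DrugA", "GeneB")` returns the cluster labels shared by those phrases. Treating each cluster label as one relationship is a simplification chosen for this sketch.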