CPC G16H 50/70 (2018.01) [G16H 10/20 (2018.01)] | 20 Claims |
1. A method performed by one or more computers, the method comprising:
obtaining a collection of structured data records, wherein:
each structured data record is generated by processing a corresponding input text sequence using an extraction neural network;
each structured data record represents information from the corresponding input text sequence in a format that is structured with reference to a predefined schema of semantic categories; and
each structured data record comprises, for each semantic category in the schema, a respective text string that expresses information from the corresponding input text sequence that is relevant to the semantic category;
selecting a semantic category from the schema of semantic categories;
clustering the text strings included in the semantic category across the collection of structured data records, comprising:
generating a set of text strings that comprises, for each structured data record in the collection of structured data records, the text string included in the semantic category in the structured data record;
processing each text string in the set of text strings using a text embedding neural network and in accordance with current values of a set of text embedding neural network parameters to generate an embedding of the text string in a latent space, wherein the text embedding neural network has been trained by a machine learning training technique to perform a text embedding task; and
clustering the embeddings of the text strings in the latent space using an iterative numerical clustering technique to generate a partition of the set of text strings into a plurality of clusters; and
identifying, for each of the plurality of clusters, a text string included in the cluster as a standardized text string representing the cluster; and
normalizing the collection of structured data records based on the clustering, comprising, for each of a plurality of structured data records, updating the semantic category in the structured data record to include a corresponding standardized text string;
receiving a user query; and
processing the normalized collection of structured data records to generate a response to the user query, comprising aggregating over standardized text strings included in the normalized collection of structured data records.
|
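The clustering, normalizing, and aggregating steps recited in claim 1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the `diagnosis` and `medication` categories, the sample records, and the function names are hypothetical; `embed` is a toy character-trigram vectorizer standing in for the trained text embedding neural network; k-means with a deterministic farthest-point initialization stands in for the "iterative numerical clustering technique"; and the standardized text string is chosen as the cluster member closest to the cluster centroid, which is only one possible way to identify a representative string.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for the trained text embedding neural network:
    an L2-normalized bag of character trigrams, stored as a sparse dict."""
    padded = f" {text.lower().strip()} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tri: c / norm for tri, c in counts.items()}

def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))

def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for v in vectors:
        for k, x in v.items():
            total[k] = total.get(k, 0.0) + x
    return {k: x / len(vectors) for k, x in total.items()}

def kmeans(vectors, k, max_iters=20):
    """Plain k-means (assumed here as the iterative clustering technique)."""
    # Farthest-point initialization keeps the sketch deterministic.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids

def normalize(records, category, k):
    """Cluster one semantic category and rewrite it with standardized strings."""
    strings = [r[category] for r in records]
    vectors = [embed(s) for s in strings]
    labels, centroids = kmeans(vectors, k)
    standard = {}
    for j in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == j]
        # Assumption: the representative is the member nearest the centroid.
        best = min(members, key=lambda i: sq_dist(vectors[i], centroids[j]))
        standard[j] = strings[best]
    return [{**r, category: standard[lab]} for r, lab in zip(records, labels)]

# Hypothetical structured data records sharing one schema of categories.
records = [
    {"diagnosis": "type 2 diabetes", "medication": "metformin"},
    {"diagnosis": "type 2 diabetes mellitus", "medication": "metformin"},
    {"diagnosis": "type 2 diabetes", "medication": "insulin"},
    {"diagnosis": "hypertension", "medication": "lisinopril"},
    {"diagnosis": "arterial hypertension", "medication": "amlodipine"},
    {"diagnosis": "hypertension", "medication": "lisinopril"},
]

normalized = normalize(records, "diagnosis", k=2)

# Responding to a query such as "how many records per diagnosis?" by
# aggregating over the standardized strings.
counts = Counter(r["diagnosis"] for r in normalized)
print(counts)
```

In this sketch the variants "type 2 diabetes mellitus" and "arterial hypertension" are rewritten to the representative strings of their clusters, so the aggregation counts each diagnosis once per record rather than splitting the count across spelling variants. A practical system would replace `embed` with a trained embedding model and would likely choose the number of clusters from the data rather than fixing `k` in advance.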