US 12,423,521 B2
Using unsupervised clustering and language model to normalize attribute tuples of items in a database
Shih-Ting Lin, Santa Clara, CA (US); Prithvishankar Srinivasan, Seattle, WA (US); Saurav Manchanda, Seattle, WA (US); Shishir Kumar Prasad, Fremont, CA (US); and Min Xie, Santa Clara, CA (US)
Assigned to Maplebear Inc., San Francisco, CA (US)
Filed by Maplebear Inc., San Francisco, CA (US)
Filed on Jun. 28, 2023, as Appl. No. 18/215,505.
Prior Publication US 2025/0005279 A1, Jan. 2, 2025
Int. Cl. G06F 40/247 (2020.01); G06F 16/21 (2019.01); G06F 16/215 (2019.01); G06F 16/28 (2019.01)
CPC G06F 40/247 (2020.01) [G06F 16/211 (2019.01); G06F 16/215 (2019.01); G06F 16/285 (2019.01)]
20 Claims
 
1. A method comprising, at a computer system comprising a processor and a non-transitory tangible computer-readable medium:
obtaining a plurality of attribute tuples stored in a database, each of the plurality of attribute tuples comprising an attribute type and an attribute value for a corresponding item of a plurality of items;
applying a clustering algorithm to the plurality of attribute tuples to group the plurality of attribute tuples into a first plurality of clusters, wherein applying the clustering algorithm comprises:
obtaining an embedding for each of the plurality of attribute tuples,
computing, using the embedding for each of the plurality of attribute tuples, a similarity score for each pair of the plurality of attribute tuples, and
applying a clustering model to group each pair of the plurality of attribute tuples having the similarity score above a threshold score to form the first plurality of clusters;
generating a plurality of prompts for input into a language model, wherein each prompt of the plurality of prompts is generated to include a respective subset of attribute tuples from the plurality of attribute tuples, and wherein attribute tuples from the respective subset of attribute tuples were grouped into a respective cluster of the first plurality of clusters;
requesting the language model to generate, based on each of the plurality of prompts input into the language model, one or more clusters of a second plurality of clusters, each cluster of the second plurality of clusters including one or more attribute tuples of the plurality of attribute tuples that have a common attribute type and a common attribute value;
generating, for each cluster of the second plurality of clusters, a respective normalized attribute tuple of a plurality of normalized attribute tuples, the respective normalized attribute tuple comprising a normalized attribute type and a normalized attribute value that are based on the common attribute type and the common attribute value;
mapping each of the one or more attribute tuples that belongs to each cluster of the second plurality of clusters to the respective normalized attribute tuple; and
rewriting each of the plurality of attribute tuples in the database to a corresponding normalized attribute tuple of the plurality of normalized attribute tuples to generate a respective rewritten attribute tuple of a plurality of rewritten attribute tuples.
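
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The character-trigram `embed` function, the union-find grouping used as the clustering model, the 0.5 threshold, the prompt wording, the lowercase rule for choosing the normalized tuple, and `stub_language_model` (a deterministic stand-in for a call to a real language model) are all assumptions introduced for illustration only.

```python
import json
import math
from collections import Counter, defaultdict


def embed(tup):
    """Toy embedding (assumption): character-trigram counts of 'type: value'."""
    text = f"{tup[0]}: {tup[1]}".lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def first_clusters(tuples, threshold):
    """First plurality of clusters: merge every pair scoring above the threshold."""
    parent = list(range(len(tuples)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    vecs = [embed(t) for t in tuples]
    for i in range(len(tuples)):
        for j in range(i + 1, len(tuples)):
            if cosine(vecs[i], vecs[j]) > threshold:
                parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i, t in enumerate(tuples):
        groups[find(i)].append(t)
    return list(groups.values())


def build_prompt(cluster):
    """One prompt per first-stage cluster (the wording is hypothetical)."""
    return ("Group the attribute tuples below that share the same attribute type "
            "and value. Answer with a JSON list of lists.\n" + json.dumps(cluster))


def stub_language_model(prompt):
    """Stand-in for a real language model: groups tuples whose type and value
    match after dropping case, whitespace, and punctuation."""
    cluster = json.loads(prompt.split("\n", 1)[1])
    groups = defaultdict(list)
    for typ, val in cluster:
        key = ("".join(typ.lower().split()),
               "".join(c for c in val.lower() if c.isalnum()))
        groups[key].append([typ, val])
    return json.dumps(list(groups.values()))


def normalize(tuples, language_model, threshold=0.5):
    """Return a mapping from each original tuple to its normalized tuple."""
    mapping = {}
    for cluster in first_clusters(tuples, threshold):
        # Second plurality of clusters comes from the language model's answer.
        for group in json.loads(language_model(build_prompt(cluster))):
            # Normalized tuple (assumption): most common lowercased spelling.
            canonical = Counter((t.lower(), v.lower()) for t, v in group).most_common(1)[0][0]
            for t, v in group:
                mapping[(t, v)] = canonical
    return mapping


def rewrite(tuples, mapping):
    """Rewrite each stored tuple to its normalized form."""
    return [mapping[t] for t in tuples]


items = [("Fat Content", "Low Fat"), ("fat content", "low-fat"), ("Flavor", "Vanilla")]
rewritten = rewrite(items, normalize(items, stub_language_model))
```

In this sketch the embedding step only narrows the candidates: it decides which tuples appear together in a prompt, while the final grouping into clusters with a common attribute type and value is left to the language model, mirroring the two-stage structure of the claim. The pairwise comparison is quadratic in the number of tuples, so a real system would presumably use an approximate nearest-neighbor index instead.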