| CPC G06F 16/35 (2019.01) [G06F 16/3347 (2019.01)] | 13 Claims |

|
1. A computer implemented method of determining which of a plurality of stored data items are most similar to a received data item, said method comprising, for each stored data item:
generating a first vector;
generating a second vector;
comparing the first vector and the second vector to generate a similarity score;
classifying one or more stored data items as being similar to the received data item in accordance with the generated similarity scores, wherein the first vector and second vector are defined within the same vector space;
generating the first vector comprises generating a first component indicative of a number of unique N-grams of the received data item relative to the stored data item, and a further component indicative of a number of common N-grams in the received data item and the stored data item, wherein the first component of the first vector is a first integer count value of the number of unique N-grams in the received data item relative to the stored data item, and the further component of the first vector is a further integer count value of the number of common N-grams in the received data item and the stored data item, and the first vector comprises a first 3-component vector ordered as: the first component of the first vector, a null value, and the further component of the first vector;
generating the second vector comprises generating a first component indicative of a number of unique N-grams of the stored data item, and a further component indicative of a number of common N-grams in the received data item and the stored data item, wherein the first component of the second vector is a first integer count value of the number of unique N-grams in the stored data item relative to the received data item, and the further component of the second vector is the integer count value of the number of common N-grams in the received data item and the stored data item, and the second vector comprises a second 3-component vector ordered as a null value, the first component of the second vector and the further component of the second vector.
|
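The vector construction recited in claim 1 can be illustrated with a minimal sketch. The claim fixes the layout of the two 3-component vectors but does not specify the N-gram type, the value of N, how the null component is handled numerically, or how the two vectors are compared. The sketch therefore assumes character N-grams counted as distinct sets, treats the null component as zero, and uses cosine similarity as the comparison. The function names `ngrams`, `build_vectors`, `cosine`, and `most_similar`, and the `threshold` parameter, are hypothetical and do not come from the claim.

```python
import math


def ngrams(text, n=3):
    """Return the set of distinct character N-grams in text (assumed N-gram type)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_vectors(received, stored, n=3):
    """Build the two 3-component vectors in the order recited in claim 1.

    first  = (unique to received, null, common)
    second = (null, unique to stored, common)
    """
    r = ngrams(received, n)
    s = ngrams(stored, n)
    common = len(r & s)
    first = (len(r - s), None, common)
    second = (None, len(s - r), common)
    return first, second


def cosine(first, second):
    """Cosine similarity, treating null (None) as 0. The metric is an assumption."""
    a = [0 if x is None else x for x in first]
    b = [0 if x is None else x for x in second]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(received, stored_items, threshold=0.5, n=3):
    """Score every stored item and keep those at or above a hypothetical threshold."""
    scored = [
        (item, cosine(*build_vectors(received, item, n)))
        for item in stored_items
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, the only component where both vectors are non-zero is the shared third one. The score therefore rises with the number of common N-grams and falls as either item accumulates N-grams the other lacks. Two items with identical N-gram sets score 1, and items with no common N-grams score 0.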