| CPC G06F 40/284 (2020.01) [G06F 40/263 (2020.01); G06V 30/153 (2022.01)] | 20 Claims |

|
1. A method, comprising:
identifying, by a processing device, a document comprising a plurality of words in one or more natural languages;
for each word of at least a subset of words of the document:
generating a plurality of sets of tokens representing the word, wherein each set of tokens of the plurality of sets of tokens represents the word using a corresponding plurality of tokens defined for a corresponding natural language of a set of natural languages, and
identifying, based on the plurality of sets of tokens, a primary natural language associated with the word;
associating each natural language of the set of natural languages with a corresponding word count indicating a number of words of the subset of words for which the natural language has been identified as the primary natural language;
identifying, among the set of natural languages, a natural language associated with a maximum word count; and
associating the identified natural language with the document.
|
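The steps of claim 1 can be illustrated with a minimal sketch. The claim does not specify how the per-language token sets are produced or how the primary natural language of a word is selected, so the sketch makes two labeled assumptions: each language has a small hypothetical token vocabulary (`VOCABS`) applied by greedy longest-match splitting, and the primary language of a word is the one whose vocabulary represents the word with the fewest tokens. All names (`VOCABS`, `tokenize`, `primary_language`, `document_language`) are illustrative and do not come from the patent.

```python
from collections import Counter

# Hypothetical per-language token vocabularies (illustrative assumption only).
VOCABS = {
    "en": {"the", "cat", "sat", "on", "mat", "ing", "tion", "d"},
    "de": {"die", "sch", "ung", "ein", "k", "atz", "e"},
}


def tokenize(word, vocab):
    """Split a word by greedy longest match against one language's vocabulary.

    Characters not covered by any vocabulary entry become single-character tokens.
    """
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Accept the longest known token, or fall back to one character.
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def primary_language(word, vocabs):
    """Generate one token set per language and pick the primary language.

    Assumption: the language that needs the fewest tokens fits the word best.
    Ties go to the language listed first in `vocabs`.
    """
    token_sets = {lang: tokenize(word, vocab) for lang, vocab in vocabs.items()}
    return min(token_sets, key=lambda lang: len(token_sets[lang]))


def document_language(text, vocabs=VOCABS):
    """Count primary-language votes per word and return the language with the maximum count."""
    counts = Counter({lang: 0 for lang in vocabs})
    for word in text.lower().split():
        counts[primary_language(word, vocabs)] += 1
    return max(counts, key=counts.get), counts


if __name__ == "__main__":
    language, counts = document_language("the cat sat on the mat")
    print(language, dict(counts))
```

Counting one vote per word, rather than pooling token statistics over the whole document, means a few long foreign words cannot outweigh many short words in the dominant language. A practical system would presumably use trained subword tokenizers and a scoring model in place of the toy vocabularies and the fewest-tokens rule used here.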