US 12,271,692 B2
Systems and methods for language identification in binary file formats
Jason Rayles, Brookline, MA (US)
Assigned to NBCUniversal Media, LLC, New York, NY (US)
Filed by NBCUniversal Media, LLC, New York, NY (US)
Filed on Dec. 13, 2021, as Appl. No. 17/549,469.
Prior Publication US 2023/0186020 A1, Jun. 15, 2023
Int. Cl. G06F 40/263 (2020.01); G06F 16/11 (2019.01); G06F 16/14 (2019.01)
CPC G06F 40/263 (2020.01) [G06F 16/116 (2019.01); G06F 16/148 (2019.01)]
19 Claims
 
1. A tangible, non-transitory, computer-readable medium, comprising computer-readable instructions that, when executed by one or more processors of an electronic device, cause the electronic device to:
prior to determining an encoding scheme of a file, identify a language of the file, wherein the file is encoded in a language-specific encoding scheme, by:
identifying, for each potential language of the file, a language score of one or more bit sequences of the file using a term frequency (TF) indicating a frequency of the one or more bit sequences within a training document of the potential language and an inverse document frequency (IDF) based upon a frequency of the one or more bit sequences in a collection of training documents of a plurality of the potential languages; and
selecting a language associated with a highest one of the language scores as the language of the file;
associate the language of the file with the file; and
decode the file using a decoding scheme that corresponds to the language associated with the file.
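
The scoring and selection steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that the "bit sequences" are byte bigrams, that there is one training document per potential language, that IDF is computed as log(N / document frequency), and that a file's language score is the sum of TF x IDF over its bigrams. The function names and the language-to-codec table are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def byte_ngrams(data: bytes, n: int = 2) -> List[bytes]:
    """Split raw bytes into overlapping n-byte sequences (no decoding needed)."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def train(training_docs: Dict[str, bytes], n: int = 2):
    """Build per-language TF tables and a shared IDF table.

    training_docs maps a language label to one raw (still encoded) document.
    """
    tf: Dict[str, Dict[bytes, float]] = {}
    for lang, doc in training_docs.items():
        counts = Counter(byte_ngrams(doc, n))
        total = sum(counts.values()) or 1
        # TF: relative frequency of the sequence within this language's document.
        tf[lang] = {gram: c / total for gram, c in counts.items()}

    # Document frequency: in how many training documents each sequence occurs.
    df = Counter(gram for table in tf.values() for gram in table)
    num_docs = len(training_docs)
    # IDF (assumed form): sequences shared by every language get weight 0.
    idf = {gram: math.log(num_docs / d) for gram, d in df.items()}
    return tf, idf


def identify_language(
    data: bytes,
    tf: Dict[str, Dict[bytes, float]],
    idf: Dict[bytes, float],
    n: int = 2,
) -> Tuple[str, Dict[str, float]]:
    """Score the file against each potential language and pick the highest."""
    grams = byte_ngrams(data, n)
    scores = {
        lang: sum(table.get(g, 0.0) * idf.get(g, 0.0) for g in grams)
        for lang, table in tf.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical mapping from identified language to a decoding scheme.
DECODERS = {"ja": "shift_jis", "ru": "koi8_r"}


def decode_file(data: bytes, tf, idf) -> Tuple[str, str]:
    """Identify the language first, then decode with the matching codec."""
    lang, _ = identify_language(data, tf, idf)
    return lang, data.decode(DECODERS[lang], errors="replace")
```

As a usage example under the same assumptions, calling `train` on one Shift_JIS-encoded Japanese document and one KOI8-R-encoded Russian document, and then calling `decode_file` on Shift_JIS bytes, selects the Japanese label before any codec is applied and decodes the bytes with `shift_jis`. Because sequences common to all training documents receive an IDF of zero, only sequences that discriminate between languages contribute to the score.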