US 11,995,400 B2
Rapid language detection for characters in images of documents
Zhong Fang Yuan, Xi'an (CN); Tong Liu, Xi'an (CN); Li Juan Gao, Xi'an (CN); Xiang Yu Yang, Xi'an (CN); Qiang He, Ningbo (CN); and Yu Pan, Shanghai (CN)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Sep. 7, 2021, as Appl. No. 17/468,474.
Prior Publication US 2023/0073932 A1, Mar. 9, 2023
Int. Cl. G06F 40/279 (2020.01); G06N 3/08 (2023.01); G06N 20/10 (2019.01); G06V 30/19 (2022.01); G06V 30/41 (2022.01)

CPC G06F 40/279 (2020.01) [G06N 3/08 (2013.01); G06N 20/10 (2019.01); G06V 30/41 (2022.01); G06V 30/19 (2022.01)]

20 Claims

1. A computer-implemented method for rapid language detection of documents, comprising:

receiving an image of a document having characters that correspond to a given language;

using a text recognition algorithm to determine a first language believed to correspond to the characters in the image of the document by:

using a generated mixed language corpus to convert an input sequence of the characters in the image of the document into one or more n-gram sequences, and

computing a hidden layer vector for the one or more n-gram sequences,

wherein the text recognition algorithm is trained using a simple shallow neural network and the generated mixed language corpus,

wherein the generated mixed language corpus is formed by:

randomly sampling one or more libraries having vocabulary and/or characters therein, and

combining the randomly sampled vocabulary and/or characters from the one or more libraries to form the generated mixed language corpus;

computing a first confidence level associated with the first language believed to correspond to the characters in the image of the document;

determining whether the first confidence level associated with the first language is outside a predetermined range; and

in response to determining that the first confidence level associated with the first language is not outside the predetermined range, outputting the first language as the given language.
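
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the claim specifies a text recognition algorithm trained with a simple shallow neural network, whereas this sketch substitutes a count-based n-gram table for the trained weights so that it stays short and runnable. All function names, the bigram size, the padding markers, the sample vocabularies, and the confidence range of 0.8 to 1.0 are assumptions chosen for illustration. The input is assumed to be characters already extracted from the document image.

```python
import random


def build_mixed_corpus(libraries, samples_per_library, seed=0):
    """Randomly sample each vocabulary library and combine the samples."""
    rng = random.Random(seed)
    corpus = []
    for language, vocabulary in libraries.items():
        k = min(samples_per_library, len(vocabulary))
        corpus.extend((word, language) for word in rng.sample(vocabulary, k))
    rng.shuffle(corpus)
    return corpus


def char_ngrams(text, n=2):
    """Convert a character sequence into padded character n-grams."""
    padded = f"<{text}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_ngram_table(corpus, languages, n=2):
    """Stand-in for training: map each n-gram to per-language frequencies."""
    index = {language: i for i, language in enumerate(languages)}
    counts = {}
    for word, language in corpus:
        for gram in char_ngrams(word, n):
            row = counts.setdefault(gram, [0.0] * len(languages))
            row[index[language]] += 1.0
    return {g: [c / sum(row) for c in row] for g, row in counts.items()}


def hidden_vector(ngrams, table, dim):
    """Average the vectors of the known n-grams (a hidden-layer analogue)."""
    known = [table[g] for g in ngrams if g in table]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def detect_language(text, table, languages, n=2):
    """Return (first language, confidence), or (None, 0.0) if nothing matches."""
    vec = hidden_vector(char_ngrams(text, n), table, len(languages))
    total = sum(vec)
    if total == 0.0:
        return None, 0.0
    best = max(range(len(languages)), key=lambda i: vec[i])
    return languages[best], vec[best] / total


def confidence_in_range(confidence, low=0.8, high=1.0):
    """True when the confidence is NOT outside the predetermined range."""
    return low <= confidence <= high


# Hypothetical libraries; a real system would sample far larger vocabularies.
libraries = {
    "en": ["document", "language", "image", "text"],
    "ru": ["документ", "язык", "текст", "образ"],
}
languages = sorted(libraries)
corpus = build_mixed_corpus(libraries, samples_per_library=4)
table = train_ngram_table(corpus, languages)

language, confidence = detect_language("test", table, languages)
if confidence_in_range(confidence):
    print("detected language:", language)
else:
    print("confidence outside range; first language not output")
```

In this sketch the confidence is the share of the averaged n-gram vector assigned to the best-scoring language. The claim leaves open how the confidence level is computed and what happens when it falls outside the range, so the `else` branch only reports that case.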