US 11,853,431 B2
Use of word embeddings to locate sensitive text in computer programming scripts
Vincent Pham, Champaign, IL (US); Kenneth Taylor, Champaign, IL (US); Jeremy Edward Goodsitt, Champaign, IL (US); Fardin Abdi Taghi Abad, Champaign, IL (US); Austin Grant Walters, Savoy, IL (US); Reza Farivar, Champaign, IL (US); Anh Truong, Champaign, IL (US); and Mark Louis Watson, Sedona, AZ (US)
Assigned to Capital One Services, LLC, McLean, VA (US)
Filed by Capital One Services, LLC, McLean, VA (US)
Filed on Aug. 13, 2020, as Appl. No. 16/992,371.
Application 16/992,371 is a continuation of application No. 16/722,867, filed on Dec. 20, 2019, granted, now Pat. No. 10,783,257.
Prior Publication US 2021/0192054 A1, Jun. 24, 2021
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 21/57 (2013.01); G06N 3/08 (2023.01); G06F 21/31 (2013.01); G06N 3/04 (2023.01)
CPC G06F 21/577 (2013.01) [G06F 21/31 (2013.01); G06N 3/04 (2013.01); G06N 3/08 (2013.01); G06F 2221/033 (2013.01)]
20 Claims
 
1. A computer-implemented method, comprising:
training a neural network on a corpus of content, wherein the training results in word embeddings for words in the corpus, wherein each of the word embeddings is a numeric vector in a vector or matrix space;
identifying an initial word of interest;
locating a vector that encodes the initial word of interest in the vector or matrix space;
identifying vectors in the vector or matrix space that lie in a specified proximity to the vector for the initial word of interest and identifying words encoded by the identified vectors as additional words of interest, wherein the identifying comprises one of calculating distances between the vector of the initial word of interest and vectors for other encoded words in the vector or matrix space or calculating cosine values between the vector of the initial word of interest and the vectors for the other encoded words in the vector or matrix space and identifying ones of the vectors in the vector or matrix space that have distances or cosine values within a specified range as being in the specified proximity;
performing a security scan of a set of input to identify instances of the initial word of interest and instances of the additional words of interest in the input; and
generating output that specifies the identified instances of the initial word of interest in the set of input and that specifies that the instances of the additional words of interest in the set of input may be of interest.
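
For illustration only, the steps recited in claim 1 might be sketched as follows. This is not the patented implementation. The embeddings are a small hand-written table standing in for vectors that the claim obtains by training a neural network on a corpus, and the names `EMBEDDINGS`, `cosine`, `neighbors`, and `scan`, the 0.9 cosine threshold, and the sample script lines are all hypothetical. The sketch uses the cosine alternative of the proximity step; a distance-based variant would compare distances against a range instead.

```python
import math
import re

# Hypothetical toy embeddings. In the claim these vectors result from
# training a neural network on a corpus; here they are hard-coded.
EMBEDDINGS = {
    "password": [0.90, 0.10, 0.05],
    "passwd":   [0.88, 0.12, 0.07],
    "secret":   [0.80, 0.25, 0.10],
    "token":    [0.75, 0.30, 0.15],
    "print":    [0.05, 0.90, 0.20],
    "loop":     [0.10, 0.20, 0.95],
}


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neighbors(word, embeddings, min_cosine):
    """Words whose vectors lie within the specified proximity of `word`."""
    target = embeddings[word]
    return sorted(
        other
        for other, vec in embeddings.items()
        if other != word and cosine(target, vec) >= min_cosine
    )


def scan(lines, initial, additional):
    """Report (line number, word, label) for each instance found in the input."""
    findings = []
    for lineno, line in enumerate(lines, start=1):
        # Split on anything that is not a letter, so db_password -> db, password.
        for tok in re.findall(r"[a-z]+", line.lower()):
            if tok == initial:
                findings.append((lineno, tok, "word of interest"))
            elif tok in additional:
                findings.append((lineno, tok, "may be of interest"))
    return findings


if __name__ == "__main__":
    additional = neighbors("password", EMBEDDINGS, min_cosine=0.9)
    script = ['db_password = "hunter2"', "api_token = load()", "print(total)"]
    for finding in scan(script, "password", set(additional)):
        print(finding)
```

The two labels in the output mirror the final step of the claim: instances of the initial word are reported as identified, while instances of the nearby words are reported only as possibly of interest.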