US 12,259,917 B2
Method of retrieving document and apparatus for retrieving document
Dong Hwan Kim, Seoul (KR); Hyun Wuk Son, Suwon-si (KR); Hyun Ok Kim, Gwangmyeong-si (KR); You Kyung Kwon, Seoul (KR); In Je Seong, Seoul (KR); Yong Sun Choi, Seoul (KR); and Ha Kyeom Moon, Seoul (KR)
Assigned to 42Maru Inc., Seoul (KR)
Filed by 42Maru Inc., Seoul (KR)
Filed on Nov. 29, 2022, as Appl. No. 18/071,105.
Claims priority of application No. 10-2022-0155113 (KR), filed on Nov. 18, 2022.
Prior Publication US 2024/0168984 A1, May 23, 2024
Int. Cl. G06F 16/34 (2019.01); G06F 16/334 (2025.01); G06F 16/35 (2019.01)
CPC G06F 16/345 (2019.01) [G06F 16/3347 (2019.01); G06F 16/35 (2019.01)]
7 Claims
 
1. A method of retrieving, by an apparatus for retrieving a document, a document based on a user retrieval query, the method comprising:
acquiring a user retrieval query;
calculating a user inquiry vector in a unit of sentence from the user retrieval query;
acquiring a first document candidate group including first documents from a retrieval database through a bi-encoder type deep learning model based on similarity between the calculated user inquiry vector and an embedding vector of a document stored in the retrieval database;
acquiring a second document candidate group including second documents from the retrieval database through a text matching-based retrieval based on similarity between a text included in the user retrieval query and a text of the document stored in the retrieval database; and
determining a summarization target document by using a cross-encoder type deep learning model and a score calculation algorithm based on a primary document candidate group including the first documents of the first document candidate group and the second documents of the second document candidate group,
wherein the determining of the summarization target document comprises determining the summarization target document by inputting a passage of a document in the primary document candidate group and the user retrieval query to a cross-encoder of the cross-encoder type deep learning model, and
wherein the acquiring of the first document candidate group through the bi-encoder type deep learning model includes:
extracting a key sentence of a passage of the document stored in the retrieval database from the passage of the document,
calculating a first similarity score between the user inquiry vector and a sentence vector corresponding to the key sentence extracted from the passage of the document by inputting the user inquiry vector and the sentence vector to a bi-encoder of the bi-encoder type deep learning model,
calculating a second similarity score between the user inquiry vector and a sentence vector corresponding to a sentence summarizing the passage of the document stored in the retrieval database,
generating a question from the passage of the document stored in the retrieval database through a generation model,
calculating a third similarity score between a question vector corresponding to the question generated from the passage stored in the retrieval database and the user inquiry vector, and
calculating a first weighted score based on the first similarity score, the second similarity score, and the third similarity score, and
determining the first document candidate group based on the calculated first weighted score, and
wherein the acquiring of the second document candidate group through the text matching-based retrieval includes:
calculating a first score indicating similarity between the user retrieval query and a passage stored in the retrieval database through a phrase matching;
calculating a second score indicating similarity between key query information including a keyword of the user retrieval query extracted through a user query analysis module and a keyword included in the passage stored in the retrieval database;
calculating a third score indicating similarity between the user retrieval query and the passage stored in the retrieval database through a shingle matching; and
calculating a second weighted score based on the first score, the second score, and the third score and determining the second document candidate group based on the calculated second weighted score.
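
The scoring steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not specify the similarity measure, the shingle size, the weights, or how the two candidate groups are merged. The sketch therefore assumes cosine similarity for the vector comparisons, Jaccard overlap of word-level shingles for the shingle matching, a normalized weighted average for both weighted scores, and an order-preserving union for the primary document candidate group. All function names and parameters (`cosine`, `shingles`, `shingle_score`, `weighted_score`, `top_k`, `primary_candidates`, `k`) are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (assumed similarity measure)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def shingles(text: str, k: int = 2) -> Set[Tuple[str, ...]]:
    """Set of contiguous k-word windows in the text (k is an assumption)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def shingle_score(query: str, passage: str, k: int = 2) -> float:
    """Jaccard overlap of shingle sets, standing in for 'shingle matching'."""
    q, p = shingles(query, k), shingles(passage, k)
    if not q or not p:
        return 0.0
    return len(q & p) / len(q | p)


def weighted_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Normalized weighted average of component scores (weights are hypothetical)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(s * w for s, w in zip(scores, weights)) / total


def top_k(scored: Dict[str, float], k: int) -> List[str]:
    """Document ids with the k highest scores; ties are broken by id."""
    return sorted(scored, key=lambda doc_id: (-scored[doc_id], doc_id))[:k]


def primary_candidates(first: List[str], second: List[str]) -> List[str]:
    """Order-preserving union of the first and second candidate groups."""
    seen: Set[str] = set()
    merged: List[str] = []
    for doc_id in first + second:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged
```

In this sketch, the first weighted score for a passage would be `weighted_score([s_key, s_summary, s_question], w)`, where each `s_*` is a `cosine` value against the user inquiry vector, and the second weighted score would combine the phrase, keyword, and shingle scores in the same way. The merged list from `primary_candidates` is what a cross-encoder would then re-score together with the user retrieval query.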