US 11,874,864 B2
Method and system for creating a domain-specific training corpus from generic domain corpora
Henghui Zhu, Boston, MA (US); Amir Mohammad Tahmasebi Maraghoosh, Arlington, MA (US); and Ioannis Paschalidis, Lincoln, MA (US)
Assigned to Koninklijke Philips N.V., Eindhoven (NL)
Appl. No. 17/290,444
Filed by KONINKLIJKE PHILIPS N.V., Eindhoven (NL); and TRUSTEES OF BOSTON UNIVERSITY, Boston, MA (US)
PCT Filed Nov. 26, 2019, PCT No. PCT/EP2019/082519
§ 371(c)(1), (2) Date Apr. 30, 2021,
PCT Pub. No. WO2020/109277, PCT Pub. Date Jun. 4, 2020.
Claims priority of provisional application 62/772,661, filed on Nov. 29, 2018.
Prior Publication US 2021/0383066 A1, Dec. 9, 2021
Int. Cl. G06F 40/20 (2020.01); G06F 16/33 (2019.01); G06F 16/36 (2019.01); G06N 20/00 (2019.01); G06F 40/284 (2020.01); G06F 40/205 (2020.01); G06F 18/214 (2023.01); G06F 18/10 (2023.01); G06F 40/211 (2020.01); G06F 40/279 (2020.01); G06F 40/295 (2020.01)
CPC G06F 16/3344 (2019.01) [G06F 16/367 (2019.01); G06F 18/214 (2023.01); G06F 40/205 (2020.01); G06F 40/284 (2020.01); G06N 20/00 (2019.01); G06F 16/36 (2019.01); G06F 18/10 (2023.01); G06F 40/211 (2020.01); G06F 40/279 (2020.01); G06F 40/295 (2020.01)] 15 Claims
 
1. A method for generating a domain-specific training set, comprising:
generating a generic corpus comprising a plurality of tokenized documents obtained from one or more sources, comprising: (i) parsing a document retrieved from the generic corpus or from another source of documents; (ii) preprocessing the parsed document; (iii) tokenizing the preprocessed document; and (iv) storing the tokenized document in the generic corpus;
generating an ontology database of tokenized entries, comprising: (i) parsing an ontology entry retrieved from an ontology; (ii) preprocessing the parsed entry; (iii) tokenizing the preprocessed entry; and (iv) storing the tokenized entry in the ontology database;
querying, using one or more domain-specific tokenized entries from the ontology database, the tokenized documents in the generic corpus;
identifying, based on the query, a plurality of tokenized documents specific to the domain; and
storing, in a training set database, the identified plurality of tokenized documents as a training set specific to the domain.
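The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: documents and ontology entries are plain-text strings, preprocessing is lowercasing plus punctuation removal, tokenization is whitespace splitting, a query matches a document when the entry's tokens appear as a contiguous sequence, and the in-memory dictionaries stand in for the generic corpus, the ontology database, and the training set database. All function names, identifiers, and sample texts are hypothetical.

```python
import re
from typing import Dict, List, Sequence


def tokenize_text(raw: str) -> List[str]:
    """Parse, preprocess, and tokenize one text (steps (i) to (iii))."""
    parsed = raw.strip()  # (i) parsing: input is assumed to be plain text
    cleaned = re.sub(r"[^a-z0-9\s]", " ", parsed.lower())  # (ii) preprocessing
    return cleaned.split()  # (iii) tokenizing on whitespace


def build_store(raw_items: Dict[str, str]) -> Dict[str, List[str]]:
    """Step (iv): store each tokenized item under its identifier."""
    return {item_id: tokenize_text(text) for item_id, text in raw_items.items()}


def contains_entry(doc_tokens: Sequence[str], entry_tokens: Sequence[str]) -> bool:
    """Return True if entry_tokens occurs contiguously in doc_tokens."""
    n = len(entry_tokens)
    if n == 0:
        return False
    target = list(entry_tokens)
    return any(
        list(doc_tokens[i:i + n]) == target
        for i in range(len(doc_tokens) - n + 1)
    )


def select_training_set(
    corpus: Dict[str, List[str]],
    ontology_db: Dict[str, List[str]],
    domain_entry_ids: Sequence[str],
) -> Dict[str, List[str]]:
    """Query the corpus with domain-specific entries and keep the matches."""
    queries = [ontology_db[entry_id] for entry_id in domain_entry_ids]
    return {
        doc_id: tokens
        for doc_id, tokens in corpus.items()
        if any(contains_entry(tokens, query) for query in queries)
    }


# Hypothetical sample data, invented for this sketch.
documents = {
    "d1": "Chest X-ray shows pleural effusion.",
    "d2": "The stock market closed higher today.",
    "d3": "No pleural effusion; mild cardiomegaly noted.",
}
ontology = {
    "R1": "Pleural effusion",
    "R2": "Cardiomegaly",
    "F1": "Stock market",
}

corpus = build_store(documents)        # generic corpus of tokenized documents
ontology_db = build_store(ontology)    # ontology database of tokenized entries
training_set = select_training_set(corpus, ontology_db, ["R1", "R2"])
print(sorted(training_set))
```

Selecting the ontology entries "R1" and "R2" as the domain-specific queries keeps the two documents that mention those terms and drops the unrelated one. A production system would likely replace the linear scan with an inverted index or a search engine, which the claim language leaves open.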