US 11,874,864 B2
Method and system for creating a domain-specific training corpus from generic domain corpora
Henghui Zhu, Boston, MA (US); Amir Mohammad Tahmasebi Maraghoosh, Arlington, MA (US); and Ioannis Paschalidis, Lincoln, MA (US)
Assigned to Koninklijke Philips N.V., Eindhoven (NL)
Appl. No. 17/290,444
Filed by KONINKLIJKE PHILIPS N.V., Eindhoven (NL); and TRUSTEES OF BOSTON UNIVERSITY, Boston, MA (US)
PCT Filed Nov. 26, 2019, PCT No. PCT/EP2019/082519 § 371(c)(1), (2) Date Apr. 30, 2021, PCT Pub. No. WO2020/109277, PCT Pub. Date Jun. 4, 2020.
Claims priority of provisional application 62/772,661, filed on Nov. 29, 2018.
Prior Publication US 2021/0383066 A1, Dec. 9, 2021
Int. Cl. G06F 40/20 (2020.01); G06F 16/33 (2019.01); G06F 16/36 (2019.01); G06N 20/00 (2019.01); G06F 40/284 (2020.01); G06F 40/205 (2020.01); G06F 18/214 (2023.01); G06F 18/10 (2023.01); G06F 40/211 (2020.01); G06F 40/279 (2020.01); G06F 40/295 (2020.01)

CPC G06F 16/3344 (2019.01) [G06F 16/367 (2019.01); G06F 18/214 (2023.01); G06F 40/205 (2020.01); G06F 40/284 (2020.01); G06N 20/00 (2019.01); G06F 16/36 (2019.01); G06F 18/10 (2023.01); G06F 40/211 (2020.01); G06F 40/279 (2020.01); G06F 40/295 (2020.01)]

15 Claims

1. A method for generating a domain-specific training set, comprising:

generating a generic corpus comprising a plurality of tokenized documents obtained from one or more sources, comprising: (i) parsing a document retrieved from the generic corpus or from another source of documents; (ii) preprocessing the parsed document; (iii) tokenizing the preprocessed document; and (iv) storing the tokenized document in the generic corpus;

generating an ontology database of tokenized entries, comprising: (i) parsing an ontology entry retrieved from an ontology; (ii) preprocessing the parsed entry; (iii) tokenizing the preprocessed entry; and (iv) storing the tokenized entry in the ontology database;

querying, using one or more domain-specific tokenized entries from the ontology database, the tokenized documents in the generic corpus;

identifying, based on the query, a plurality of tokenized documents specific to the domain; and

storing in a training set database, the identified plurality of tokenized documents as a training set specific to the domain.
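
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the tag-stripping parser, the stopword list, the exact phrase-match query, and the in-memory lists and dicts standing in for the generic corpus, ontology database, and training set database are all assumptions made for illustration only.

```python
import re

# Hypothetical stopword list; the claim does not specify the preprocessing.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "with"}


def parse(raw: str) -> str:
    """Stand-in parser (assumption): strip markup-like tags from raw text."""
    return re.sub(r"<[^>]+>", " ", raw)


def preprocess(text: str) -> str:
    """Stand-in preprocessing (assumption): lowercase and drop punctuation."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text: str) -> list[str]:
    """Whitespace tokenization with stopword removal (assumption)."""
    return [tok for tok in text.split() if tok not in STOPWORDS]


def to_tokens(raw: str) -> list[str]:
    """Parse, preprocess, and tokenize: steps (i) to (iii) of the claim."""
    return tokenize(preprocess(parse(raw)))


def build_generic_corpus(raw_documents: list[str]) -> list[list[str]]:
    """Step (iv) for documents: store each tokenized document in the corpus."""
    return [to_tokens(doc) for doc in raw_documents]


def build_ontology_db(ontology: dict[str, list[str]]) -> dict[str, list[list[str]]]:
    """Step (iv) for ontology entries, keyed here by domain name (assumption)."""
    return {domain: [to_tokens(entry) for entry in entries]
            for domain, entries in ontology.items()}


def contains_phrase(doc_tokens: list[str], entry_tokens: list[str]) -> bool:
    """True if the entry's tokens occur contiguously in the document."""
    n = len(entry_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == entry_tokens
               for i in range(len(doc_tokens) - n + 1))


def query_corpus(corpus: list[list[str]],
                 entries: list[list[str]]) -> list[list[str]]:
    """Identify tokenized documents that match any domain-specific entry."""
    return [doc for doc in corpus
            if any(contains_phrase(doc, entry) for entry in entries)]


def build_training_set(raw_documents: list[str],
                       ontology: dict[str, list[str]],
                       domain: str) -> list[list[str]]:
    """Run the whole sketch; the returned list stands in for the training set database."""
    corpus = build_generic_corpus(raw_documents)
    ontology_db = build_ontology_db(ontology)
    return query_corpus(corpus, ontology_db[domain])


if __name__ == "__main__":
    docs = [
        "<p>The patient had a Myocardial Infarction.</p>",
        "Stock prices rose sharply.",
        "Chest X-ray shows pneumonia.",
    ]
    ontology = {
        "cardiology": ["myocardial infarction"],
        "radiology": ["Chest X-ray"],
    }
    print(build_training_set(docs, ontology, "cardiology"))
```

Exact phrase matching is used here only because it is the simplest query that selects documents by ontology entry; a real system could substitute any retrieval method consistent with the claim.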