CPC G06F 40/284 (2020.01) [G06F 16/35 (2019.01); G06F 40/117 (2020.01); G06F 40/205 (2020.01); G06F 40/221 (2020.01); G06F 40/279 (2020.01); G06F 40/295 (2020.01); G10L 15/18 (2013.01); G10L 15/1815 (2013.01); G10L 15/183 (2013.01)] | 12 Claims |
7. A method for processing natural language performed by a computing device that includes one or more processors and a memory for storing one or more programs executed by the one or more processors, the method comprising:
collecting documents having tags;
extracting text from the collected documents and extracting tag-related information on a tag surrounding each extracted text;
generating tokens of a preset unit by tokenizing each extracted text;
generating token position information for each token in full text of a document; and
setting the token and the token position information as training data by matching them with the tag-related information,
wherein the tag-related information includes structural position information of a tag in which each text is positioned,
wherein the structural position information of the tag includes depth information and relative position information of a corresponding tag,
wherein the depth information is information indicating a depth level of the corresponding tag,
wherein the relative position information includes a relative position information value that is assigned to tags having the same depth level to sequentially increase or decrease according to a relative position or order between the tags having the same depth level, and
wherein relative position information values of two adjacent tags having different depth levels do not represent a relative position or order between the two adjacent tags.
|
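The steps recited in claim 7 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes well-formed HTML input without void elements such as `<br>`, whitespace-delimited words as the "preset unit", a zero-based running word index as the token position, and one increasing counter per depth level for the relative position value. The names `TagAwareExtractor` and `build_training_records` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagAwareExtractor(HTMLParser):
    """Pairs each token with its position and the enclosing tag's structure."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # open tags as (name, depth, rel_pos)
        self.next_rel = defaultdict(int)   # assumption: one counter per depth level
        self.position = 0                  # running token index in the full text
        self.records = []

    def handle_starttag(self, tag, attrs):
        depth = len(self.stack) + 1        # depth level of the tag being opened
        rel_pos = self.next_rel[depth]     # increases only among same-depth tags
        self.next_rel[depth] += 1
        self.stack.append((tag, depth, rel_pos))

    def handle_endtag(self, tag):
        # Close the nearest matching open tag and anything left open inside it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text outside any tag is recorded with depth 0 and no relative position.
        tag, depth, rel_pos = self.stack[-1] if self.stack else (None, 0, None)
        for token in data.split():         # assumption: preset unit = word
            self.records.append({
                "token": token,
                "position": self.position,
                "tag": tag,
                "depth": depth,
                "relative_position": rel_pos,
            })
            self.position += 1


def build_training_records(html):
    """Return one record per token, matched with its tag-related information."""
    parser = TagAwareExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


sample = ("<html><body><h1>Patent news</h1>"
          "<p>Claims were <b>granted</b> today</p></body></html>")
for record in build_training_records(sample):
    print(record)
```

In this sketch `<h1>` and `<p>` share depth 3 and receive relative position values 0 and 1 in document order, while `<b>` at depth 4 starts its own sequence at 0. Because each depth level has a separate counter, comparing the values of `<p>` and `<b>` says nothing about their order, which mirrors the final "wherein" clause.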