US 12,242,820 B2
Generating synthetic code-switched data for training language models
Cesa Salaam, Upper Marlboro, MD (US); Seunghyun Yoon, San Jose, CA (US); Trung Huu Bui, San Jose, CA (US); and Franck Dernoncourt, San Jose, CA (US)
Assigned to Adobe Inc., San Jose, CA (US)
Filed by Adobe Inc., San Jose, CA (US)
Filed on Feb. 17, 2022, as Appl. No. 17/651,555.
Prior Publication US 2023/0259718 A1, Aug. 17, 2023
Int. Cl. G10L 15/22 (2006.01); G06F 40/47 (2020.01); G06F 40/58 (2020.01); G06N 3/045 (2023.01); G06N 3/08 (2023.01)
CPC G06F 40/58 (2020.01) [G06F 40/47 (2020.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01)] 15 Claims
OG exemplary drawing
 
1. A method of generating code-switched content for training a language model, the method comprising:
identifying one or more portions within textual content in a first language, the identified one or more portions each comprising one or more of offensive content or non-offensive content, the identifying comprising tagging, based on an output of a first trained language model, the one or more portions with at least one content tag;
translating the tagged one or more portions to a second language using a second trained language model; and
replacing, in the textual content, the tagged one or more portions with the translated one or more portions to generate code-switched textual content;
wherein:
the first trained language model comprises a context-aware language model configured to identify one or more salient portions from the textual content based at least on: one or more simplex or complex noun phrases of the textual content, one or more maximally frequent contiguous portions of the textual content, one or more salient portions of the textual content determined based at least on a training dataset used by the first trained language model, one or more denoised portions of the textual content, or a combination thereof; and
the second trained language model comprises a language model configured to translate between at least the first and second languages.
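The claimed pipeline (tag salient spans in first-language text, translate the tagged spans, then splice the translations back in place) can be sketched as below. The `tag_spans` and `translate` functions are hypothetical stand-ins for the first (tagging) and second (translation) trained language models recited in the claim; any span tagger and machine-translation model could fill those roles, and the toy heuristics here are illustrative only, not the patented models:

```python
# Sketch of the claimed tag -> translate -> replace pipeline.
# tag_spans() and translate() are hypothetical stubs standing in for the
# first (tagging) and second (translation) trained language models.

def tag_spans(text):
    """Stand-in for the first trained language model: return
    (start, end, tag) character spans judged salient.
    Toy heuristic: tag capitalized words as "salient" spans."""
    spans = []
    idx = 0
    for word in text.split():
        start = text.index(word, idx)
        if word[0].isupper():
            spans.append((start, start + len(word), "salient"))
        idx = start + len(word)
    return spans

def translate(segment):
    """Stand-in for the second trained language model (English -> Spanish
    here), reduced to a toy lexicon for illustration."""
    toy_lexicon = {"Hello": "Hola", "Friend": "Amigo"}
    return toy_lexicon.get(segment, segment)

def code_switch(text):
    """Replace each tagged span with its translation to produce
    code-switched text; spans are processed right to left so earlier
    character offsets remain valid after each replacement."""
    out = text
    for start, end, _tag in sorted(tag_spans(text), reverse=True):
        out = out[:start] + translate(out[start:end]) + out[end:]
    return out
```

For example, `code_switch("Hello my Friend")` yields the code-switched string `"Hola my Amigo"`: the two tagged spans are translated while the untagged first-language text between them is kept intact, which is the replacement step the claim describes.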