US 12,230,243 B2
Using token level context to generate SSML tags
Mikayel Mirzoyan, San Francisco, CA (US); André Aing, Dublin (IE); Aysar Khalid, Redmond, WA (US); Chad Joseph Lynch, Lansdale, PA (US); Graham Michael Reeve, Redmond, WA (US); Sadek Baroudi, Berkeley, CA (US); and Vidush Vishwanath, Santa Clara, CA (US)
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed by MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed on Dec. 30, 2021, as Appl. No. 17/566,554.
Prior Publication US 2023/0215417 A1, Jul. 6, 2023
Int. Cl. G10L 13/027 (2013.01); G10L 13/06 (2013.01); G10L 13/08 (2013.01)

CPC G10L 13/027 (2013.01) [G10L 13/06 (2013.01); G10L 13/08 (2013.01)]

19 Claims

1. A method comprising:

training a machine learning module to determine a rule for establishing a value for a particular speech output characteristic based on contextual information extracted from labeled training data;

accessing a corpus of text that includes a plurality of tokens;

analyzing, via natural language processing implemented by one or more processors, the corpus of text to categorize sentiment and part of speech for individual tokens of the plurality of tokens;

generating a data structure that indicates a category of the sentiment and a category of the part of speech for the individual tokens of the plurality of tokens, wherein the category of the part of speech comprises a noun, a verb, an adjective, an adverb, a pronoun, a preposition, a conjunction, or an interjection;

extracting, from the data structure, the category of the sentiment and the category of the part of speech for the individual tokens of the plurality of tokens;

applying an algorithm that uses the category of the sentiment and the category of the part of speech, extracted from the data structure, to produce speech output characteristics for the individual tokens of the plurality of tokens;

establishing, based on the rule used in the application of the algorithm, the value for the particular speech output characteristic based on the category of the sentiment and the category of the part of speech;

generating Speech Synthesis Markup Language (SSML) tags for the individual tokens of the plurality of tokens based on the speech output characteristics; and

providing the SSML tags and the corpus of text to a computing device configured to generate a computer-based voice output, wherein an individual SSML tag is associated with at least one token.
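
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, only one possible reading under stated assumptions: the toy lexicons `SENTIMENT` and `POS` stand in for a real natural language processing analyzer, the "training" step is a simple majority vote over labeled examples rather than an actual machine learning model, and pitch is chosen as the single speech output characteristic. The names `TokenInfo`, `analyze`, `train_rule`, and `to_ssml` are hypothetical. The `<speak>` and `<prosody pitch="...">` elements are standard SSML.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from xml.sax.saxutils import escape

# Hypothetical toy lexicons; a real system would use an NLP library here.
SENTIMENT = {"love": "positive", "great": "positive",
             "hate": "negative", "awful": "negative"}
POS = {"i": "pronoun", "love": "verb", "hate": "verb",
       "great": "adjective", "awful": "adjective", "movies": "noun"}


@dataclass
class TokenInfo:
    """Per-token record: the 'data structure' holding both categories."""
    text: str
    sentiment: str
    pos: str


def analyze(corpus):
    """Categorize sentiment and part of speech for each whitespace token."""
    records = []
    for tok in corpus.split():
        key = tok.lower().strip(".,!?")
        # Assumption: unknown words default to neutral sentiment and noun.
        records.append(TokenInfo(tok, SENTIMENT.get(key, "neutral"),
                                 POS.get(key, "noun")))
    return records


def train_rule(labeled):
    """Learn a rule mapping (sentiment, pos) to the most common pitch value.

    A majority vote stands in for the trained machine learning module.
    """
    votes = defaultdict(Counter)
    for sentiment, pos, pitch in labeled:
        votes[(sentiment, pos)][pitch] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def to_ssml(tokens, rule):
    """Wrap each token that matches a rule in an SSML prosody tag."""
    parts = []
    for t in tokens:
        pitch = rule.get((t.sentiment, t.pos))
        text = escape(t.text)  # keep the output well-formed XML
        parts.append(f'<prosody pitch="{pitch}">{text}</prosody>'
                     if pitch else text)
    return "<speak>" + " ".join(parts) + "</speak>"


# Hypothetical labeled training data: (sentiment, part of speech, pitch).
LABELED = [
    ("positive", "verb", "+10%"),
    ("positive", "verb", "+10%"),
    ("positive", "verb", "+5%"),
    ("negative", "adjective", "-10%"),
]

rule = train_rule(LABELED)
ssml = to_ssml(analyze("I love awful movies"), rule)
print(ssml)
```

In this sketch the resulting string, together with the original text it wraps, is what would be handed to a speech synthesis device. Tokens with no matching rule are emitted untagged, so each generated tag is associated with exactly one token.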