US 12,266,344 B2
Text information processing method and apparatus
Liumeng Xue, Beijing (CN); Wei Song, Beijing (CN); and Zhizheng Wu, Beijing (CN)
Assigned to BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY CO., LTD., Beijing (CN); and BEIJING JINGDONG CENTURY TRADING CO., LTD., Beijing (CN)
Appl. No. 17/789,513
Filed by BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY CO., LTD., Beijing (CN); and BEIJING JINGDONG CENTURY TRADING CO., LTD., Beijing (CN)
PCT Filed Jan. 15, 2021, PCT No. PCT/CN2021/072016
§ 371(c)(1), (2) Date Jun. 27, 2022,
PCT Pub. No. WO2021/179791, PCT Pub. Date Sep. 16, 2021.
Claims priority of application No. 202010172575.7 (CN), filed on Mar. 12, 2020.
Prior Publication US 2022/0406290 A1, Dec. 22, 2022
Int. Cl. G10L 13/08 (2013.01); G06F 40/30 (2020.01)
CPC G10L 13/08 (2013.01) [G06F 40/30 (2020.01)]
16 Claims
 
1. A text information processing method, wherein the method is applied in a smart device that synthesizes voice audio according to text information and is implemented by a processor of a text information processing apparatus in the smart device, the method comprising:
acquiring a phoneme vector corresponding to an individual phoneme and a semantic vector corresponding to the individual phoneme in the text information;
acquiring first semantic information output at a previous moment, wherein the first semantic information is semantic information corresponding to part of the text information in the text information, and the part of the text information is text information that has been converted into voice information;
processing the first semantic information and the semantic vector corresponding to the individual phoneme by a first preset model to obtain a semantic matching degree, wherein the first preset model is obtained by learning multiple groups of first samples, and each group of the multiple groups of first samples includes learning semantic information and learning semantic vectors;
determining a semantic context vector according to the semantic matching degree and the semantic vector corresponding to the individual phoneme;
determining a phoneme context vector according to the semantic matching degree and the phoneme vector corresponding to the individual phoneme;
combining the semantic context vector and the phoneme context vector to determine a context vector corresponding to a current moment;
determining voice information at the current moment according to the context vector and the first semantic information; and
performing voice synthesis processing on voice information at all moments to obtain voice audio corresponding to the text information.
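
The context-vector steps of claim 1 can be illustrated with a minimal sketch. This is not the patented implementation: the claim only requires a learned "first preset model", and the scaled dot product used here is an untrained, hypothetical stand-in for it. Normalizing the matching degrees with a softmax, sharing one set of degrees across both weighted sums, and "combining" by concatenation are likewise assumptions made for illustration, as are all function and variable names. The sketch also assumes the previous semantic information and each semantic vector have the same dimension.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1 (assumed normalization)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matching_degrees(prev_semantic, semantic_vectors):
    """Score each phoneme's semantic vector against the previous semantic information.

    Hypothetical stand-in for the claim's learned "first preset model":
    a scaled dot product followed by a softmax.
    """
    scale = math.sqrt(len(prev_semantic))
    scores = [
        sum(p * s for p, s in zip(prev_semantic, vec)) / scale
        for vec in semantic_vectors
    ]
    return softmax(scores)


def weighted_sum(weights, vectors):
    """Return the weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]


def context_vector(prev_semantic, phoneme_vectors, semantic_vectors):
    """Build the context vector for the current moment.

    The same matching degrees weight both the semantic vectors and the
    phoneme vectors; the two results are concatenated (assumed combination).
    """
    degrees = matching_degrees(prev_semantic, semantic_vectors)
    semantic_ctx = weighted_sum(degrees, semantic_vectors)
    phoneme_ctx = weighted_sum(degrees, phoneme_vectors)
    return semantic_ctx + phoneme_ctx, degrees


# Toy input: three phonemes, each with a 2-d phoneme vector and a 2-d semantic vector.
phoneme_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic_vectors = [[0.5, 0.1], [0.2, 0.9], [0.4, 0.4]]
prev_semantic = [0.0, 1.0]  # semantic information output at the previous moment

context, degrees = context_vector(prev_semantic, phoneme_vectors, semantic_vectors)
```

In this toy input the second phoneme receives the largest matching degree, because its semantic vector aligns best with the previous semantic information. In a full synthesizer the resulting context vector and the previous semantic information would feed a decoder step that produces the voice information for the current moment; that step is omitted here.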