US 12,112,739 B2
Information processing device and information processing method
Juri Yaeda, Tokyo (JP); Saki Yokoyama, Tokyo (JP); and Chiaki Miyazaki, Tokyo (JP)
Assigned to SONY GROUP CORPORATION, Tokyo (JP)
Appl. No. 17/309,314
Filed by SONY GROUP CORPORATION, Tokyo (JP)
PCT Filed Nov. 14, 2019, PCT No. PCT/JP2019/044631 § 371(c)(1), (2) Date May 18, 2021, PCT Pub. No. WO2020/110744, PCT Pub. Date Jun. 4, 2020.
Claims priority of application No. 2018-222407 (JP), filed on Nov. 28, 2018.
Prior Publication US 2022/0028368 A1, Jan. 27, 2022
Int. Cl. G10L 13/08 (2013.01); G10L 15/22 (2006.01); G10L 15/26 (2006.01)

CPC G10L 13/08 (2013.01) [G10L 15/22 (2013.01); G10L 15/26 (2013.01); G10L 2015/225 (2013.01)]

10 Claims

1. An information processing device, comprising:

a database configured to store a first character string and first pronunciation information that indicates a reading of the first character string; and

a dialogue management unit configured to:

acquire an utterance text indicating contents of an utterance of a user;

generate, based on the utterance text, a response text indicating contents of a response to the utterance of the user;

determine the first character string stored in the database is same as a second character string included in the response text;

add, to the response text, the first pronunciation information indicating the reading of the first character string as a reading of the second character string included in the response text,

wherein the addition of the first pronunciation information is based on the determination that the first character string stored in the database is same as the second character string included in the response text;

determine a third character string included in the response text is same as a fourth character string included in the utterance text;

add, to the response text, second pronunciation information as a reading of the third character string included in the response text, wherein

the second pronunciation information is associated with the fourth character string,

the second pronunciation information indicates a reading of the fourth character string as pronounced by the user, and

the addition of the second pronunciation information is based on the determination that the fourth character string in the utterance text is same as the third character string included in the response text;

integrate, into the response text, the first pronunciation information of the second character string and the second pronunciation information of the third character string;

output, using a neural network, the response text to which the first pronunciation information and the second pronunciation information is added, wherein the output of the response text is based on the integration; and

control, based on the output of the response text to which both the first pronunciation information and the second pronunciation information are added, a terminal to output a voice with the reading indicated by both the first pronunciation information and the second pronunciation information.
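The matching and integration steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names `AnnotatedResponse`, `annotate_response`, `database`, and `user_readings` are hypothetical, strings are matched by plain substring search, and letting the user's reading win when both sources cover the same string is an assumption the claim does not state. The neural-network output step and the voice output at the terminal are outside the scope of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedResponse:
    """Response text plus the readings attached to strings inside it."""
    text: str
    readings: dict[str, str] = field(default_factory=dict)


def annotate_response(
    response_text: str,
    database: dict[str, str],
    user_readings: dict[str, str],
) -> AnnotatedResponse:
    """Attach pronunciation information to a response text.

    database      -- stored character strings mapped to their readings
                     (the "first pronunciation information").
    user_readings -- character strings from the utterance text mapped to
                     the reading the user actually pronounced
                     (the "second pronunciation information").
    """
    # Readings for strings that appear both in the database and the response.
    from_database = {
        string: reading
        for string, reading in database.items()
        if string in response_text
    }
    # Readings for strings that appear both in the utterance and the response.
    from_user = {
        string: reading
        for string, reading in user_readings.items()
        if string in response_text
    }
    # Integrate both sources. Assumption: on a conflict the user's own
    # pronunciation takes precedence over the database entry.
    integrated = {**from_database, **from_user}
    return AnnotatedResponse(text=response_text, readings=integrated)


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    database = {"Shinjuku": "shin-joo-koo", "Osaka": "oh-sah-kah"}
    user_readings = {"Kyoto": "kyoh-toh"}
    result = annotate_response(
        "Kyoto is sunny, Shinjuku is rainy", database, user_readings
    )
    for string, reading in result.readings.items():
        print(f"{string}: {reading}")
```

In this sketch, "Osaka" is left out of the result because it does not occur in the response text, which mirrors the claim's condition that a reading is added only when the stored or uttered string is the same as a string in the response.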