US 12,456,464 B2
Electronic device and method for processing speech by classifying speech target
Minjung Park, Suwon-si (KR); Chulkwi Kim, Suwon-si (KR); Juyoung Yu, Suwon-si (KR); and Nammin Jo, Suwon-si (KR)
Assigned to SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed by SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed on Mar. 20, 2023, as Appl. No. 18/123,509.
Application 18/123,509 is a continuation of application No. PCT/KR2022/008593, filed on Jun. 17, 2022.
Claims priority of application No. 10-2021-0113794 (KR), filed on Aug. 27, 2021.
Prior Publication US 2023/0230593 A1, Jul. 20, 2023
Int. Cl. G10L 15/25 (2013.01); G06V 20/50 (2022.01); G06V 40/16 (2022.01); G06V 40/20 (2022.01); G10L 15/22 (2006.01); G10L 15/26 (2006.01); H04N 23/90 (2023.01); H04R 1/40 (2006.01); H04R 3/00 (2006.01)
CPC G10L 15/25 (2013.01) [G06V 20/50 (2022.01); G06V 40/171 (2022.01); G06V 40/20 (2022.01); G10L 15/22 (2013.01); G10L 15/26 (2013.01); H04N 23/90 (2023.01); H04R 1/406 (2013.01); H04R 3/005 (2013.01); G10L 2015/223 (2013.01)] 14 Claims
OG exemplary drawing
 
1. A head mounted electronic device comprising:
multiple cameras arranged at different positions, wherein the multiple cameras comprise:
at least one first camera obtaining an image of an object in a direction toward which a user's face is oriented, when the user wears the head mounted electronic device,
at least one second camera obtaining an image including at least a portion of the user's mouth, when the user wears the head mounted electronic device, and
at least one third camera tracking a gaze of the user, when the user wears the head mounted electronic device;
multiple microphones arranged at different positions;
a memory storing instructions; and
at least one processor operatively connected to at least one of the multiple cameras, the multiple microphones, and the memory,
wherein the instructions, when executed by the at least one processor, individually and/or collectively, cause the head mounted electronic device to:
determine, using the multiple cameras, whether the speaker is the user wearing the head mounted electronic device or a counterpart having a conversation with the user,
configure directivity of the multiple microphones based on the determination of the speaker,
obtain audio through at least one of the multiple microphones according to a distance and a direction, based on the configured directivity,
obtain a first image including a mouth shape of the counterpart from the at least one first camera, a second image including a mouth shape of the user from the at least one second camera, and a third image including a gaze of the user from the at least one third camera,
when the user is determined to be the speaker associated with an utterance for performing a function based on the obtained audio and the obtained first, second, and third images, perform deep learning by matching the audio of the user with the mouth shape of the user, and perform a function according to the utterance of the user based on the deep learning, and
when the counterpart is determined to be the speaker associated with an utterance for performing a function based on the obtained audio and the obtained first, second, and third images, perform deep learning by matching the audio of the counterpart with the mouth shape of the counterpart, and perform a function according to the utterance of the counterpart based on the deep learning.
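The claim recites a pipeline — classify the speaker from the camera images, steer microphone directivity toward the classified speaker, then match audio with the speaker's mouth shape — without disclosing an implementation. The following is a minimal illustrative sketch of that control flow only; all function names, thresholds, and the beam configuration are hypothetical assumptions, not the patented method.

```python
# Hypothetical sketch of the claimed speaker-classification and
# directivity-steering steps. Nothing here is from the patent itself;
# mouth-motion scores stand in for the first/second camera images and
# the gaze flag stands in for the third (gaze-tracking) camera.
from dataclasses import dataclass


@dataclass
class Frames:
    counterpart_mouth_motion: float  # derived from the first (outward-facing) camera
    user_mouth_motion: float         # derived from the second (mouth-facing) camera
    gaze_on_counterpart: bool        # derived from the third (gaze-tracking) camera


def classify_speaker(frames: Frames, threshold: float = 0.5) -> str:
    """Decide whether the wearer or the counterpart is speaking,
    using mouth motion from the cameras and the wearer's gaze."""
    if (frames.user_mouth_motion >= threshold
            and frames.user_mouth_motion >= frames.counterpart_mouth_motion):
        return "user"
    if frames.counterpart_mouth_motion >= threshold and frames.gaze_on_counterpart:
        return "counterpart"
    return "none"


def steer_microphones(speaker: str) -> dict:
    """Configure microphone-array directivity toward the classified speaker."""
    if speaker == "user":
        return {"beam": "downward", "gain": 1.0}  # toward the wearer's mouth
    if speaker == "counterpart":
        return {"beam": "forward", "gain": 1.0}   # toward the conversation partner
    return {"beam": "omni", "gain": 0.5}          # no speaker: fall back to omni


frames = Frames(counterpart_mouth_motion=0.8, user_mouth_motion=0.1,
                gaze_on_counterpart=True)
speaker = classify_speaker(frames)
config = steer_microphones(speaker)
print(speaker, config["beam"])  # counterpart forward
```

In the claim, the classification result then selects which audio/mouth-shape pair (user or counterpart) is matched by the deep-learning step; that matching is omitted here because the patent does not specify the model.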