US 12,142,279 B2
Speech processing device, speech processing method, and recording medium
Kazuyuki Sasaki, Tokyo (JP)
Assigned to NEC CORPORATION, Tokyo (JP)
Appl. No. 17/630,632
Filed by NEC Corporation, Tokyo (JP)
PCT Filed Jul. 29, 2020, PCT No. PCT/JP2020/028955
§ 371(c)(1), (2) Date Jan. 27, 2022,
PCT Pub. No. WO2021/024869, PCT Pub. Date Feb. 11, 2021.
Claims priority of application No. 2019-142951 (JP), filed on Aug. 2, 2019.
Prior Publication US 2022/0262363 A1, Aug. 18, 2022
Int. Cl. G10L 15/25 (2013.01); G06V 10/22 (2022.01); G06V 40/16 (2022.01); G10L 15/02 (2006.01); G10L 15/22 (2006.01)
CPC G10L 15/25 (2013.01) [G06V 10/22 (2022.01); G06V 40/171 (2022.01); G10L 15/02 (2013.01); G10L 15/22 (2013.01); G10L 2015/025 (2013.01)] 9 Claims
OG exemplary drawing
 
1. A speech recognition device comprising:
a memory storing a computer program; and
at least one processor configured to execute the computer program to:
extract a region of a speaker from among a plurality of speakers in an image, wherein at least two of the plurality of speakers are speaking simultaneously;
generate first utterance data showing contents of utterance of the speaker based on shapes of lips of the speaker;
generate second utterance data showing the contents of utterance of the speaker based on a speech signal associated with the utterance of the speaker; and
collate the first utterance data and the second utterance data,
wherein the at least one processor is configured to execute the computer program to:
generate speaker information to identify the speaker being extracted from the image;
generate a plurality of pieces of the first utterance data based on shapes of lips of the plurality of speakers in the image; and
collate each of the plurality of pieces of the first utterance data and the second utterance data, and
wherein the at least one processor is further configured to execute the computer program to:
associate the speaker information pertinent to any one of the plurality of speakers with the second utterance data based on a result of the collating of the first utterance data and the second utterance data; and
store the associated speaker information and second utterance data in a storage.
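The claimed steps form a pipeline: per-speaker lip-reading hypotheses (first utterance data) are collated against a transcript derived from the speech signal (second utterance data), and the best-matching speaker's information is associated with that transcript and stored. The sketch below illustrates that flow only; the function names, the use of `SequenceMatcher` as the collation measure, and the dict-based storage are assumptions for illustration, not the patent's disclosed implementation.

```python
from difflib import SequenceMatcher


def collate(first_utterances, second_utterance):
    """Collate each lip-reading hypothesis (first utterance data, keyed by
    speaker information) against the audio-derived transcript (second
    utterance data); return the best-matching speaker and its score."""
    best_speaker, best_score = None, -1.0
    for speaker_id, lip_text in first_utterances.items():
        # Illustrative similarity measure; the claim does not specify one.
        score = SequenceMatcher(None, lip_text, second_utterance).ratio()
        if score > best_score:
            best_speaker, best_score = speaker_id, score
    return best_speaker, best_score


def associate_and_store(storage, first_utterances, second_utterance):
    """Associate the second utterance data with the collation winner and
    store the pair, mirroring the final two claimed steps."""
    speaker_id, _ = collate(first_utterances, second_utterance)
    storage.setdefault(speaker_id, []).append(second_utterance)
    return speaker_id
```

With two simultaneous speakers, the transcript is attributed to whichever speaker's lip-shape hypothesis agrees with it most closely, which is the disambiguation the collating step exists to provide.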