| CPC G10L 15/25 (2013.01) [G06V 10/22 (2022.01); G06V 40/171 (2022.01); G10L 15/02 (2013.01); G10L 15/22 (2013.01); G10L 2015/025 (2013.01)] | 9 Claims |

|
1. A speech recognition device comprising:
a memory storing a computer program; and
at least one processor configured to execute the computer program to:
extract a region of a speaker from among a plurality of speakers in an image, wherein at least two of the plurality of speakers are speaking simultaneously;
generate first utterance data showing contents of utterance of the speaker based on shapes of lips of the speaker;
generate second utterance data showing the contents of utterance of the speaker based on a speech signal associated with the utterance of the speaker; and
collate the first utterance data and the second utterance data,
wherein the at least one processor is configured to execute the computer program to:
generate speaker information to identify the speaker being extracted from the image;
generate a plurality of pieces of the first utterance data based on shapes of lips of the plurality of speakers in the image; and
collate each of the plurality of pieces of the first utterance data and the second utterance data, and
wherein the at least one processor is further configured to execute the computer program to:
associate the speaker information pertinent to any one of the plurality of speakers and the second utterance data based on a result of the collating of the first utterance data and the second utterance data; and
store the associated speaker information and second utterance data in a storage.
|
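
The collating and associating steps recited in claim 1 can be illustrated with a minimal sketch. The claim does not specify how the first and second utterance data are compared, so this sketch assumes both are plain text and uses a character-level similarity ratio from Python's standard `difflib` module as a stand-in. The names `LipReading`, `collate`, `associate`, `storage`, and the `threshold` value are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class LipReading:
    speaker_id: str  # speaker information for a region extracted from the image
    text: str        # first utterance data, derived from lip shapes (assumed to be text)


def collate(first: str, second: str) -> float:
    """Return a similarity in [0, 1] between lip-derived and audio-derived text.

    Assumption: a character-level ratio stands in for the claimed collation.
    """
    return SequenceMatcher(None, first.lower(), second.lower()).ratio()


def associate(lip_readings, second_utterance, threshold=0.5):
    """Pick the speaker whose lip-derived text best matches the audio-derived text.

    Returns (speaker_id, second_utterance), or None if there are no candidates
    or the best match falls below the (hypothetical) threshold.
    """
    if not lip_readings:
        return None
    best = max(lip_readings, key=lambda r: collate(r.text, second_utterance))
    if collate(best.text, second_utterance) < threshold:
        return None
    return best.speaker_id, second_utterance


# In-memory stand-in for the claimed storage.
storage = []

# One piece of first utterance data per speaker visible in the image.
readings = [
    LipReading("speaker_A", "turn on the lights"),
    LipReading("speaker_B", "what time is it"),
]

# Second utterance data, e.g. the output of an audio speech recognizer.
record = associate(readings, "turn on the light")
if record is not None:
    storage.append(record)
```

With two speakers talking at once, each audio-derived utterance would be passed through `associate` separately, so every stored record pairs one utterance with the speaker whose lip movements match it best.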