US 11,881,224 B2
Multilingual speech recognition and translation method and related system for a conference which determines quantity of attendees according to their distances from their microphones
Yueh-Tung Wu, Taipei (TW); and Jun-Ying Li, Taipei (TW)
Assigned to PEGATRON CORPORATION, Taipei (TW)
Filed by PEGATRON CORPORATION, Taipei (TW)
Filed on Aug. 5, 2021, as Appl. No. 17/394,593.
Claims priority of application No. 109130562 (TW), filed on Sep. 7, 2020.
Prior Publication US 2022/0076679 A1, Mar. 10, 2022
Int. Cl. G10L 15/25 (2013.01); B60T 7/22 (2006.01); G06F 40/263 (2020.01); G06F 16/583 (2019.01); H04N 7/15 (2006.01); G10L 15/26 (2006.01); G10L 15/00 (2013.01); G06V 20/40 (2022.01); G06V 40/20 (2022.01); G06V 40/16 (2022.01)
CPC G10L 15/25 (2013.01) [G06V 20/41 (2022.01); G06V 40/161 (2022.01); G06V 40/171 (2022.01); G06V 40/172 (2022.01); G06V 40/20 (2022.01); G10L 15/005 (2013.01); G10L 15/26 (2013.01); B60T 7/22 (2013.01); G06F 16/583 (2019.01); G06F 40/263 (2020.01); H04N 7/15 (2013.01)] 16 Claims
OG exemplary drawing
 
1. A multilingual speech recognition and translation method for a conference, wherein the conference comprises at least one attendee, and the method comprises:
receiving, at a server, at least one piece of audio data and at least one piece of video data of the at least one attendee from at least one terminal apparatus during the conference;
analyzing the at least one piece of video data to generate a first video recognition result and a second video recognition result, wherein the first video recognition result comprises an attendance count and an ethnicity of the at least one attendee, and the second video recognition result comprises a body movement and a facial movement of the at least one attendee when talking;
generating at least one language family recognition result based on the first video recognition result and the at least one piece of audio data;
splitting the at least one piece of audio data based on the first video recognition result and the second video recognition result to generate a plurality of audio segments corresponding to the at least one attendee;
performing speech recognition on the audio segments according to the at least one language family recognition result to convert the audio segments to text content;
translating the text content according to the at least one language family recognition result; and
transmitting the translated text content for displaying on the at least one terminal apparatus;
wherein the method further comprises:
determining a quantity of speakers and a speaking time of each speaker according to a distance, in the at least one piece of video data, between the at least one attendee and a microphone to generate the second video recognition result; and
obtaining the plurality of audio segments according to the second video recognition result.
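The final two limitations determine how many attendees are speaking, and for how long, from each attendee's distance to the microphone in the video frames. A minimal sketch of that idea, assuming per-frame distance estimates are already extracted from the video (the distance threshold, frame rate, and function names below are illustrative assumptions, not taken from the patent):

```python
# Illustrative sketch: count speakers and accumulate speaking time from
# per-frame attendee-to-microphone distances (threshold and frame rate
# are assumed values, not from the patent).

SPEAKING_DISTANCE = 0.5   # metres: an attendee this close to the mic is "speaking"
FRAME_RATE = 30.0         # assumed video frames per second

def analyze_speakers(distance_frames):
    """distance_frames: list of dicts mapping attendee id -> distance (m),
    one dict per video frame.
    Returns (quantity_of_speakers, speaking_time_seconds_by_attendee)."""
    frame_counts = {}
    for frame in distance_frames:
        for attendee, distance in frame.items():
            if distance <= SPEAKING_DISTANCE:
                frame_counts[attendee] = frame_counts.get(attendee, 0) + 1
    # Convert frame counts to seconds of speaking time.
    speaking_time = {a: n / FRAME_RATE for a, n in frame_counts.items()}
    return len(speaking_time), speaking_time
```

An attendee who never comes within the threshold distance contributes no speaking time and is not counted as a speaker, which mirrors the claim's use of distance to determine the quantity of speakers.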
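With per-speaker speaking intervals in hand (the second video recognition result), the audio data can be split into attendee-specific segments. A sketch assuming the audio is a plain sequence of PCM samples at a fixed sample rate (the rate and the interval format are assumptions for illustration):

```python
# Illustrative sketch: slice an audio stream into per-attendee segments
# using speaking intervals derived from the video analysis.

SAMPLE_RATE = 16000  # assumed audio sample rate in Hz

def split_audio(samples, intervals):
    """samples: sequence of audio samples (e.g. a list of PCM values).
    intervals: list of (attendee_id, start_sec, end_sec) tuples.
    Returns a list of (attendee_id, segment) pairs."""
    segments = []
    for attendee, start, end in intervals:
        lo = int(start * SAMPLE_RATE)
        hi = int(end * SAMPLE_RATE)
        segments.append((attendee, samples[lo:hi]))
    return segments
```

Each resulting segment carries the identity of the attendee who produced it, so the later recognition and translation steps can be applied per speaker.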
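Downstream, the language family recognition result selects which recognizer and translator handle each segment. A dictionary-dispatch skeleton of that routing (the recognizer and translator functions here are placeholder stubs standing in for real speech and translation engines, which the patent does not specify):

```python
# Illustrative sketch: route each audio segment to a recognizer and
# translator chosen by its language family recognition result.
# The stubs below are placeholders, not real ASR/MT engines.

def recognize_en(segment):
    return "hello"            # stub: pretend ASR output for an English segment

def recognize_zh(segment):
    return "\u4f60\u597d"     # stub: pretend ASR output for a Chinese segment

RECOGNIZERS = {"en": recognize_en, "zh": recognize_zh}
TRANSLATORS = {
    ("zh", "en"): lambda text: {"\u4f60\u597d": "hello"}.get(text, text),
}

def transcribe_and_translate(segment, language_family, target="en"):
    """Convert one audio segment to text in the target language."""
    text = RECOGNIZERS[language_family](segment)
    if language_family != target:
        text = TRANSLATORS[(language_family, target)](text)
    return text
```

The translated text for each segment would then be transmitted back to the terminal apparatus for display, per the final transmitting step of the claim.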