US 11,961,524 B2
System and method for extracting and displaying speaker information in an ATC transcription
Jitender Kumar Agarwal, Bangalore (IN); and Mohan M. Thippeswamy, Bangalore (IN)
Assigned to HONEYWELL INTERNATIONAL INC., Charlotte, NC (US)
Filed by HONEYWELL INTERNATIONAL INC., Charlotte, NC (US)
Filed on Jul. 16, 2021, as Appl. No. 17/305,913.
Claims priority of application No. 202111023583 (IN), filed on May 27, 2021.
Prior Publication US 2022/0383879 A1, Dec. 1, 2022
Int. Cl. G10L 17/04 (2013.01); G06F 3/14 (2006.01); G10L 15/04 (2013.01); G10L 15/06 (2013.01); G10L 15/22 (2006.01); G10L 15/26 (2006.01)
CPC G10L 17/04 (2013.01) [G06F 3/14 (2013.01); G10L 15/04 (2013.01); G10L 15/06 (2013.01); G10L 15/22 (2013.01); G10L 15/26 (2013.01); G10L 2015/221 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A flight deck system for extracting speaker information in an ATC (Air Traffic Controller) conversation and displaying the speaker information on a graphical display unit, the flight deck system comprising a controller configured to:
segment a stream of audio received over radio from an ATC and other aircraft into a plurality of chunks, wherein each chunk has a speaker;
extract vocal cord and prosody based features from a chunk;
generate a plurality of similarity scores for the extracted vocal cord and prosody based features for the chunk, wherein each similarity score of the plurality of similarity scores is based on a comparison of the extracted vocal cord and prosody based features for the chunk with a different model file from a plurality of model files in an enrolled speaker database for a plurality of speakers from the enrolled speaker database, wherein the plurality of model files are associated with different speakers from the enrolled speaker database;
when a specific similarity score from the plurality of similarity scores determined based on the comparison of the extracted vocal cord and prosody based features with the plurality of model files in the enrolled speaker database for the plurality of speakers exceeds a threshold level, associate the chunk with a particular speaker from the enrolled speaker database, decode the chunk using a speaker-dependent automatic speech recognition (ASR) model that is specific for the speaker, and tag the chunk with a permanent name for the speaker;
when a specific similarity score from the plurality of similarity scores determined based on the comparison of the extracted vocal cord and prosody based features with the plurality of model files in the enrolled speaker database for the plurality of speakers does not exceed a threshold level, assign a temporary name for the speaker of the chunk, tag the chunk with the temporary name, and decode the chunk using a speaker independent speech recognition model;
format the decoded chunk as text;
signal the graphical display unit to display the formatted text along with an identity for the speaker of the formatted text, the identity comprising the permanent name of the speaker, or the temporary name assigned to the speaker; and
enroll a non-enrolled speaker into the speaker database and create a speaker-dependent ASR model for the non-enrolled speaker after a predetermined number of chunks of audio from the non-enrolled speaker are received.
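The claimed pipeline (score each chunk against enrolled model files, route to a speaker-dependent or speaker-independent decoder, tag with a permanent or temporary name, and auto-enroll after enough chunks) can be sketched as follows. This is an illustrative sketch only, not the patented implementation: the feature vectors, cosine-similarity comparison, threshold value, chunk count, and all names (`SpeakerDatabase`, `process_chunk`, `Speaker-1`, etc.) are assumptions introduced here, and the ASR decoders are passed in as stubs.

```python
import math
from dataclasses import dataclass, field

SIMILARITY_THRESHOLD = 0.85    # hypothetical "threshold level" from the claim
ENROLLMENT_CHUNK_COUNT = 3     # hypothetical "predetermined number of chunks"

def cosine_similarity(a, b):
    """Compare two feature vectors (stand-in for vocal-cord/prosody features)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class SpeakerDatabase:
    models: dict = field(default_factory=dict)   # permanent name -> model "file" (vector)
    pending: dict = field(default_factory=dict)  # temporary name -> chunks seen so far
    _temp_counter: int = 0

    def best_match(self, features):
        """Similarity score against each enrolled model file; return the best."""
        scores = {n: cosine_similarity(features, m) for n, m in self.models.items()}
        if not scores:
            return None, 0.0
        name = max(scores, key=scores.get)
        return name, scores[name]

    def pending_match(self, features):
        """Match an unknown chunk to an already-assigned temporary speaker."""
        best_name, best_score = None, 0.0
        for name, chunks in self.pending.items():
            centroid = [sum(col) / len(col) for col in zip(*chunks)]
            score = cosine_similarity(features, centroid)
            if score > best_score:
                best_name, best_score = name, score
        return best_name, best_score

    def new_temp_name(self):
        self._temp_counter += 1
        return f"Speaker-{self._temp_counter}"

    def record_unknown(self, temp_name, features):
        """Accumulate unknown chunks; enroll the speaker once enough arrive."""
        chunks = self.pending.setdefault(temp_name, [])
        chunks.append(features)
        if len(chunks) >= ENROLLMENT_CHUNK_COUNT:
            # Enrollment: average the accumulated vectors into a new model file.
            self.models[temp_name] = [sum(col) / len(col) for col in zip(*chunks)]
            del self.pending[temp_name]

def process_chunk(db, features, decode_speaker_dependent, decode_speaker_independent):
    """Route one audio chunk: identify the speaker, decode, and tag the text."""
    name, score = db.best_match(features)
    if name is not None and score > SIMILARITY_THRESHOLD:
        # Enrolled speaker: speaker-dependent ASR, tagged with the permanent name.
        return name, decode_speaker_dependent(name, features)
    temp, pending_score = db.pending_match(features)
    if temp is None or pending_score <= SIMILARITY_THRESHOLD:
        temp = db.new_temp_name()
    # Unknown speaker: speaker-independent ASR, tagged with a temporary name;
    # the chunk also counts toward this speaker's eventual enrollment.
    db.record_unknown(temp, features)
    return temp, decode_speaker_independent(features)
```

In use, the first chunk from an unknown voice is tagged `Speaker-1` and decoded with the generic model; after `ENROLLMENT_CHUNK_COUNT` chunks, `Speaker-1` is promoted into `db.models` and subsequent chunks route through the speaker-dependent path, mirroring the two branches and the enrollment step of the claim.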