US 12,334,074 B2
Method and apparatus for using image data to aid voice recognition
Robert A. Zurek, Antioch, IL (US); Adrian M. Schuster, West Olive, MI (US); Fu-Lin Shau, Lake Zurich, IL (US); and Jincheng Wu, Naperville, IL (US)
Assigned to Google Technology Holdings LLC, Mountain View, CA (US)
Filed by GOOGLE TECHNOLOGY HOLDINGS LLC, Mountain View, CA (US)
Filed on Mar. 15, 2024, as Appl. No. 18/606,066.
Application 18/606,066 is a continuation of application No. 17/147,991, filed on Jan. 13, 2021, granted, now 11,942,087.
Application 17/147,991 is a continuation of application No. 16/416,427, filed on May 20, 2019, granted, now 10,923,124, issued on Feb. 16, 2021.
Application 16/416,427 is a continuation of application No. 15/464,704, filed on Mar. 21, 2017, granted, now 10,311,868, issued on Jun. 4, 2019.
Application 15/464,704 is a continuation of application No. 14/164,354, filed on Jan. 27, 2014, granted, now 9,747,900, issued on Aug. 29, 2017.
Claims priority of provisional application 61/827,048, filed on May 24, 2013.
Prior Publication US 2024/0221745 A1, Jul. 4, 2024
Int. Cl. G10L 15/22 (2006.01); G06F 3/01 (2006.01); G06V 20/59 (2022.01); G06V 40/16 (2022.01); G06V 40/18 (2022.01); G06V 40/19 (2022.01); G06V 40/20 (2022.01); G10L 15/20 (2006.01); G10L 15/24 (2013.01); G10L 15/25 (2013.01); G10L 15/26 (2006.01); G10L 21/0208 (2013.01); G10L 21/0216 (2013.01); G10L 25/78 (2013.01)

CPC G10L 15/22 (2013.01) [G06F 3/013 (2013.01); G06V 20/59 (2022.01); G06V 40/166 (2022.01); G06V 40/19 (2022.01); G06V 40/20 (2022.01); G10L 15/20 (2013.01); G10L 15/25 (2013.01); G10L 15/26 (2013.01); G10L 21/0208 (2013.01); G06V 40/18 (2022.01); G10L 2015/223 (2013.01); G10L 2015/227 (2013.01); G10L 15/24 (2013.01); G10L 2021/02166 (2013.01); G10L 25/78 (2013.01); H04R 2430/20 (2013.01); H04R 2460/07 (2013.01); H04R 2499/11 (2013.01)]

20 Claims

11. A system comprising:

data processing hardware; and

memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising:

obtaining image data comprising a representation of a first user and a second user;

obtaining audio data comprising:

a first voice data corresponding to the first user speaking; and

a second voice data corresponding to the second user speaking;

associating, based on the image data, the first voice data to a first voice-recognition database of the first user speaking and the second voice data to a second voice-recognition database of the second user speaking;

generating, using speech-to-text conversion, a transcription of the audio data;

annotating, based on the first voice data associated with the first voice-recognition database, a first portion of the transcription corresponding to the first voice data with a first annotation identifying the first user; and

annotating, based on the second voice data associated with the second voice-recognition database, a second portion of the transcription corresponding to the second voice data with a second annotation identifying the second user.
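
The operations recited above can be illustrated with a minimal sketch. It is not the patented implementation. It assumes the image-analysis step has already produced time intervals during which each user appears to be speaking, and that the speech-to-text step returns timestamped segments. The names SpeakingInterval, TranscriptSegment, annotate_transcript, and render are hypothetical, and the per-user voice-recognition database is reduced here to a plain user label. Each transcript segment is annotated with the user whose visually derived speaking interval overlaps it the most.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeakingInterval:
    """Hypothetical output of image analysis: when a user appears to speak."""
    user: str
    start: float  # seconds from the start of the recording
    end: float


@dataclass
class TranscriptSegment:
    """Hypothetical timestamped output of a speech-to-text step."""
    text: str
    start: float
    end: float
    speaker: Optional[str] = None  # annotation filled in by annotate_transcript


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time ranges (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def annotate_transcript(
    segments: List[TranscriptSegment],
    intervals: List[SpeakingInterval],
) -> List[TranscriptSegment]:
    """Return copies of the segments, each labeled with the best-matching user.

    A segment that overlaps no visual speaking interval keeps speaker=None.
    """
    annotated = []
    for seg in segments:
        best_user, best_overlap = None, 0.0
        for iv in intervals:
            ov = overlap(seg.start, seg.end, iv.start, iv.end)
            if ov > best_overlap:
                best_user, best_overlap = iv.user, ov
        annotated.append(
            TranscriptSegment(seg.text, seg.start, seg.end, best_user)
        )
    return annotated


def render(segments: List[TranscriptSegment]) -> str:
    """Format the annotated transcription, one labeled line per segment."""
    return "\n".join(f"[{s.speaker or 'unknown'}] {s.text}" for s in segments)


if __name__ == "__main__":
    # Illustrative data only; the values are made up for this sketch.
    intervals = [
        SpeakingInterval("first user", 0.0, 2.0),
        SpeakingInterval("second user", 2.0, 4.5),
    ]
    segments = [
        TranscriptSegment("Shall we start?", 0.2, 1.8),
        TranscriptSegment("Yes, go ahead.", 2.3, 4.0),
    ]
    print(render(annotate_transcript(segments, intervals)))
```

Choosing the interval with the largest overlap is one simple design choice for this sketch. A real system would also have to handle simultaneous speech and uncertain visual cues, which the sketch does not attempt.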