US 12,256,173 B1
Method to control multiple cameras in a conference room based on audio tracking and head detection data
Gisle Langen Enstad, Oslo (NO)
Assigned to CISCO TECHNOLOGY, INC., San Jose, CA (US)
Filed by Cisco Technology, Inc., San Jose, CA (US)
Filed on Sep. 13, 2022, as Appl. No. 17/943,433.
Int. Cl. H04N 7/15 (2006.01); G06T 7/70 (2017.01); G10L 25/06 (2013.01); G10L 25/57 (2013.01); G10L 25/78 (2013.01); H04L 65/403 (2022.01); H04N 5/268 (2006.01); H04R 1/40 (2006.01); H04R 3/00 (2006.01)

CPC H04N 7/15 (2013.01) [G06T 7/70 (2017.01); G10L 25/06 (2013.01); G10L 25/57 (2013.01); G10L 25/78 (2013.01); H04L 65/403 (2013.01); H04N 5/268 (2013.01); H04R 1/406 (2013.01); H04R 3/005 (2013.01); G06T 2207/10016 (2013.01); G06T 2207/30201 (2013.01)]

20 Claims

1. A method performed by a video conference system having cameras and microphone arrays each co-located with a corresponding one of the cameras, the method comprising:

detecting a face of a participant, and estimating orientations of the face relative to the cameras, based on video captured by the cameras;

receiving, from each microphone array, at least two microphone signals that represent detected audio from the participant;

separately correlating the at least two microphone signals from each microphone array against each other using a correlation function to produce a correlation peak that indicates a time difference of arrival between the at least two microphone signals, wherein separately correlating produces correlation peaks for corresponding ones of the microphone arrays;

determining a preferred camera among the cameras based on the correlation peaks and the orientations of the face relative to the cameras; and

transmitting the video captured by the preferred camera to a network.
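
The correlating and determining steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not specify the correlation function or how correlation peaks and face orientations are combined. The sketch assumes a plain normalized cross-correlation, a face orientation expressed as a yaw angle in degrees (0 meaning the face points straight at the camera), and a hypothetical score equal to the peak value multiplied by the cosine of the yaw. The names `correlation_peak`, `preferred_camera`, `peaks`, and `face_yaw_deg` are illustrative only.

```python
import numpy as np


def correlation_peak(sig_a, sig_b, sample_rate):
    """Cross-correlate two microphone signals from one array.

    Returns (tdoa_seconds, peak_value). A positive TDOA means sig_a
    is delayed relative to sig_b.
    """
    a = np.asarray(sig_a, dtype=float)
    b = np.asarray(sig_b, dtype=float)
    corr = np.correlate(a, b, mode="full")
    # Normalize so peaks from different arrays are comparable (assumption).
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm > 0:
        corr = corr / norm
    idx = int(np.argmax(corr))
    # In "full" mode, index 0 corresponds to a lag of -(len(b) - 1) samples.
    lag = idx - (len(b) - 1)
    return lag / sample_rate, float(corr[idx])


def preferred_camera(peaks, face_yaw_deg):
    """Pick a camera from per-array peak values and per-camera face yaw.

    Hypothetical scoring: peak value weighted by how frontal the face is.
    """
    scores = {}
    for cam, peak in peaks.items():
        frontalness = max(0.0, float(np.cos(np.radians(face_yaw_deg[cam]))))
        scores[cam] = peak * frontalness
    return max(scores, key=scores.get)


# Usage: the second microphone hears the same signal 5 samples earlier.
rng = np.random.default_rng(0)
mic_b = rng.standard_normal(256)
mic_a = np.concatenate([np.zeros(5), mic_b[:-5]])  # mic_b delayed by 5 samples
tdoa, peak = correlation_peak(mic_a, mic_b, sample_rate=16000)

# Two cameras: cam0 has the stronger peak but sees the face nearly in profile.
choice = preferred_camera(
    peaks={"cam0": 0.9, "cam1": 0.8},
    face_yaw_deg={"cam0": 80.0, "cam1": 10.0},
)
```

In this example the stronger audio peak alone would favor `cam0`, but the orientation weighting selects `cam1`, which is the kind of trade-off the determining step combines. A production system would more likely use a generalized cross-correlation and a tuned decision rule.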