US 12,361,702 B2
Automatic composition of a presentation video of shared content and a rendering of a selected presenter
Defne Ayanoǧlu, Prague (CZ); and Nakul Madaan, Prague (CZ)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed on Dec. 29, 2021, as Appl. No. 17/565,442.
Prior Publication US 2023/0206621 A1, Jun. 29, 2023
Int. Cl. G06V 10/94 (2022.01); G06V 40/16 (2022.01); G06V 40/20 (2022.01); G10L 17/00 (2013.01); H04L 12/18 (2006.01); H04N 7/15 (2006.01)

CPC G06V 10/94 (2022.01) [G06V 40/168 (2022.01); G06V 40/20 (2022.01); G10L 17/00 (2013.01); H04L 12/1813 (2013.01); H04N 7/15 (2013.01)]

20 Claims

1. A computer-implemented method for composing a presentation video from an input video stream depicting a person and a content video stream of shared content, the computer-implemented method configured for execution on a computing system comprising:

obtaining, by the computing system, the content video stream of the shared content;

analyzing, by the computing system, the input video stream to select a real-world object meeting one or more criteria with respect to at least one of a size, a shape, or a position of the real-world object from a plurality of real-world objects depicted in the input video stream, wherein the selection of the real-world object causes a selection of the person as a presenter;

analyzing, by the computing system, the input video stream to select the person depicted in the input video stream as the presenter, wherein the selection of the person is in response to determining that the person meets criteria with respect to a predetermined distance or a predetermined position to the selected real-world object;

dynamically selecting the person as the presenter in response to determining that the person meets criteria with respect to the predetermined distance or the predetermined position to the real-world object, wherein the person is not selected as the presenter when the person no longer meets criteria with respect to the predetermined distance or the predetermined position to the real-world object;

generating the presentation video comprising the rendering of shared content and a filtered rendering of the person that is selected as the presenter in response to determining that the person meets the criteria with respect to the predetermined distance or the predetermined position to the selected real-world object, wherein the real-world object is selected in response to the selected real-world object meeting one or more criteria with respect to at least one of the size, the shape, or the position of the real-world object;

causing a display of the presentation video on a plurality of computing devices associated with a plurality of participants of a communication session, wherein the presentation video comprises the rendering of shared content and the filtered rendering of the person that is selected as the presenter in response to detecting that the real-world object meets the one or more criteria with respect to at least one of the size, the shape, or the position of the real-world object and determining that the person is at the predetermined distance or the predetermined position relative to the real-world object;

detecting a presence of an additional speaker of the communication session, wherein the additional speaker is detected based on an activity level of the additional speaker, the activity level being based on at least one of a threshold volume level generated by an audio stream generated by a computing device of the additional speaker, a frequency or quantity of words spoken by the additional speaker, a video stream received from a selected camera, or a position of the additional speaker relative to the real-world object, wherein the additional speaker is different than the person that is selected as the presenter; and

modifying the presentation video to add a rendering of the additional speaker in response to determining that the activity level of the additional speaker exceeds an activity threshold, wherein the modification of the presentation video causes a display of the rendering of the additional speaker concurrently with the rendering of shared content and the filtered rendering of the person that is selected as the presenter in response to determining association between the real-world object and the person, wherein the association can be based on a number or frequency of interactions with the real-world object.
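
The selection logic recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that object and person detections arrive as bounding boxes in normalized image coordinates, that "size" and "shape" criteria reduce to a minimum area and a minimum aspect ratio, and that activity levels are precomputed scores between 0 and 1. The names `Box`, `select_object`, `select_presenter`, and `compose_layout`, and all threshold values, are hypothetical and do not appear in the claim. The sketch returns a list of layer labels rather than producing or filtering actual video.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Mapping, Optional, Sequence


@dataclass(frozen=True)
class Box:
    """Hypothetical detection: center (x, y) and size, normalized to [0, 1]."""
    label: str
    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def select_object(objects: Sequence[Box],
                  min_area: float = 0.1,
                  min_aspect: float = 1.2) -> Optional[Box]:
    """Pick the largest object meeting the assumed size and shape criteria."""
    candidates = [
        o for o in objects
        if o.height > 0
        and o.area >= min_area
        and o.width / o.height >= min_aspect
    ]
    return max(candidates, key=lambda o: o.area, default=None)


def select_presenter(people: Sequence[Box],
                     target: Optional[Box],
                     max_distance: float = 0.3) -> Optional[Box]:
    """Pick the person closest to the selected object, if within max_distance.

    Calling this on every frame gives the dynamic behavior: a person who
    moves out of range is no longer returned as the presenter.
    """
    if target is None:
        return None

    def distance(p: Box) -> float:
        return hypot(p.x - target.x, p.y - target.y)

    near = [p for p in people if distance(p) <= max_distance]
    return min(near, key=distance, default=None)


def compose_layout(presenter: Optional[Box],
                   activity: Mapping[str, float],
                   activity_threshold: float = 0.6) -> List[str]:
    """List the layers of the presentation video as labels.

    Shared content is always present; the presenter is added when one is
    selected; any other participant whose activity level exceeds the
    threshold is added as an additional speaker.
    """
    layers = ["shared_content"]
    if presenter is not None:
        layers.append("presenter:" + presenter.label)
    for name, level in sorted(activity.items()):
        is_presenter = presenter is not None and name == presenter.label
        if level > activity_threshold and not is_presenter:
            layers.append("speaker:" + name)
    return layers


if __name__ == "__main__":
    objects = [Box("whiteboard", 0.5, 0.4, 0.6, 0.4), Box("mug", 0.9, 0.9, 0.05, 0.05)]
    people = [Box("alice", 0.6, 0.5, 0.1, 0.3), Box("bob", 0.05, 0.9, 0.1, 0.3)]
    target = select_object(objects)
    presenter = select_presenter(people, target)
    print(compose_layout(presenter, {"alice": 0.9, "bob": 0.8, "carol": 0.2}))
```

In this sketch the object is chosen first and the presenter follows from proximity to it, mirroring the order of the analyzing steps in the claim. Other activity signals named in the claim, such as word frequency or position relative to the object, would have to be folded into the activity score upstream.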