US 12,149,773 B2
Voice-based scene selection for video content on a computing device
Matthew Sharifi, Kilchberg (CH); and Victor Carbune, Zurich (CH)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by GOOGLE LLC, Mountain View, CA (US)
Filed on Sep. 2, 2022, as Appl. No. 17/902,601.
Claims priority of provisional application 63/399,921, filed on Aug. 22, 2022.
Prior Publication US 2024/0064363 A1, Feb. 22, 2024
Int. Cl. G10L 25/57 (2013.01); G06V 20/40 (2022.01); G10L 15/22 (2006.01); H04N 21/422 (2011.01); H04N 21/472 (2011.01)
CPC H04N 21/42204 (2013.01) [G06V 20/40 (2022.01); G10L 15/22 (2013.01); G10L 25/57 (2013.01); H04N 21/42203 (2013.01); H04N 21/472 (2013.01); G06V 2201/10 (2022.01); G10L 2015/223 (2013.01)]
20 Claims
 
1. A method implemented by one or more processors comprising:
receiving, from a user and via a computing device, a spoken utterance that includes a query;
identifying video content being presented in a vicinity of the user by a media player application when the spoken utterance is received from the user;
accessing scene metadata associated with the identified video content, wherein the scene metadata includes, for each of one or more respective scenes in the identified video content, semantic scene description data describing the respective scene and timestamp data identifying one or more locations in the identified video content corresponding to the respective scene;
determining, based on the query and the scene metadata associated with the identified video content, whether the query in the spoken utterance is a scene playback request directed to the media player application to play a requested scene in the identified video content;
in response to determining that the query in the spoken utterance is a scene playback request, causing a media control command to be issued to the media player application to cause the media player application to seek to a predetermined location in the identified video content corresponding to the requested scene and identified in the timestamp data of the scene metadata for the identified video content; and
in response to determining that the query in the spoken utterance is not a scene playback request directed to the media player application, causing a non-scene playback request operation to be executed for the query included in the spoken utterance.
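
The control flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not say how a query is matched against scene metadata, so the token-overlap heuristic, the 0.5 threshold, the stopword list, and the names Scene, match_scene, and handle_query are all assumptions made for illustration. The seek and fallback callables stand in for the media control command and the non-scene playback request operation.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Scene:
    """One hypothetical scene metadata entry."""
    description: str      # semantic scene description data
    start_seconds: float  # timestamp data: where the scene starts in the video


# Assumed filler words that say nothing about which scene is meant.
_STOPWORDS = {"the", "a", "an", "to", "of", "in", "where", "when",
              "scene", "part", "play", "go", "skip", "show", "me"}


def _tokens(text: str) -> set:
    """Lowercase alphanumeric tokens of text, minus the assumed stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in _STOPWORDS}


def match_scene(query: str, scenes: Sequence[Scene],
                threshold: float = 0.5) -> Optional[Scene]:
    """Return the best-matching scene, or None if nothing clears the threshold.

    Score = fraction of query tokens found in the scene description.
    This heuristic is an assumption, not something the claim specifies.
    """
    query_tokens = _tokens(query)
    if not query_tokens:
        return None
    best, best_score = None, 0.0
    for scene in scenes:
        overlap = len(query_tokens & _tokens(scene.description))
        score = overlap / len(query_tokens)
        if score > best_score:
            best, best_score = scene, score
    return best if best_score >= threshold else None


def handle_query(query: str, scenes: Sequence[Scene],
                 seek: Callable[[float], None],
                 fallback: Callable[[str], None]) -> str:
    """Seek to the matched scene, or hand the query to the fallback handler."""
    scene = match_scene(query, scenes)
    if scene is not None:
        seek(scene.start_seconds)  # stands in for the media control command
        return "seek"
    fallback(query)                # stands in for the non-scene operation
    return "fallback"
```

For example, given scenes described as "the car chase through the city" at 754.0 seconds and "the wedding speech" at 3120.5 seconds, the query "skip to the car chase" would call seek(754.0), while "what is the weather tomorrow" matches no scene and would be passed to fallback unchanged. A production system would more plausibly use a semantic matcher than token overlap; the overlap score is used here only to keep the sketch self-contained.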