US 12,406,500 B2
Moment localization in media stream
Houwen Peng, Redmond, WA (US); and Jianlong Fu, Beijing (CN)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Appl. No. 17/768,815
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
PCT Filed Oct. 19, 2020, PCT No. PCT/US2020/056390 § 371(c)(1), (2) Date Apr. 13, 2022, PCT Pub. No. WO2021/086676, PCT Pub. Date May 6, 2021.
Claims priority of application No. 201911059082.6 (CN), filed on Nov. 1, 2019.
Prior Publication US 2023/0351752 A1, Nov. 2, 2023
Int. Cl. G06V 20/40 (2022.01); G06F 16/334 (2025.01); G06F 40/10 (2020.01); G06V 10/62 (2022.01); G06V 10/77 (2022.01); G06V 10/82 (2022.01)

CPC G06V 20/48 (2022.01) [G06F 16/3344 (2019.01); G06F 40/10 (2020.01); G06V 10/62 (2022.01); G06V 10/7715 (2022.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/46 (2022.01); G06V 20/49 (2022.01)]

20 Claims

1. A computer-implemented method, comprising:

extracting, from a media stream, a two-dimensional temporal feature map representing a plurality of moments within the media stream, wherein the two-dimensional temporal feature map comprises a first dimension representing a start of a respective one of the plurality of moments and a second dimension representing an end of a respective one of the plurality of moments;

encoding a sentence feature extracted from an input;

fusing the encoded sentence feature with the two-dimensional temporal feature map into a unified subspace as a fused two-dimensional temporal map;

applying a convolutional layer to the two-dimensional temporal feature map to obtain a further feature map having a same dimension as the two-dimensional temporal feature map, wherein the convolutional layer comprises a dilated convolution, and strides of the dilated convolution are configured to increase as lengths of the respective moments increase;

generating a temporal adjacent network using the fused two-dimensional temporal map and the further feature map;

determining, using the temporal adjacent network, a correlation between the plurality of moments and an action in the media stream; and

identifying a matching set of candidate moments for the input using the temporal adjacent network.
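
The extracting, fusing, and identifying steps recited above can be illustrated with a minimal NumPy sketch. It is not the claimed implementation, and several choices in it are assumptions that the claim does not specify: max pooling over clip features to form each moment's feature, fusion by projecting both features into a shared space followed by an element-wise product and L2 normalization, and a plain top-k selection over a precomputed score map. The function names (`build_2d_temporal_map`, `fuse`, `top_k_moments`) and the projection matrices `w_video` and `w_sentence` are hypothetical. The dilated convolution and the temporal adjacent network itself are omitted.

```python
import numpy as np


def build_2d_temporal_map(clip_feats):
    """Build a 2D temporal feature map from per-clip features of shape (N, D).

    Entry [i, j] holds the feature of the moment that starts at clip i and
    ends at clip j (first dimension = start, second dimension = end).
    Only entries with i <= j are valid moments.
    Assumption: a moment's feature is the max pool of its clip features.
    """
    n, d = clip_feats.shape
    fmap = np.zeros((n, n, d), dtype=clip_feats.dtype)
    valid = np.zeros((n, n), dtype=bool)
    for start in range(n):
        running = clip_feats[start].copy()
        for end in range(start, n):
            # Extend the moment by one clip and update the running max.
            running = np.maximum(running, clip_feats[end])
            fmap[start, end] = running
            valid[start, end] = True
    return fmap, valid


def fuse(fmap, valid, sentence_feat, w_video, w_sentence):
    """Fuse an encoded sentence feature with the 2D map in a shared space.

    Assumption: both inputs are linearly projected to H dimensions, combined
    by element-wise product, and L2-normalized per moment.
    """
    v = fmap @ w_video              # (N, N, H)
    s = sentence_feat @ w_sentence  # (H,)
    fused = v * s                   # broadcast the sentence over all moments
    norm = np.linalg.norm(fused, axis=-1, keepdims=True)
    fused = fused / np.maximum(norm, 1e-12)
    fused[~valid] = 0.0             # zero out entries where start > end
    return fused


def top_k_moments(scores, valid, k):
    """Return the k highest-scoring valid (start, end) pairs."""
    masked = np.where(valid, scores, -np.inf)
    flat = np.argsort(masked, axis=None)[::-1][:k]
    starts, ends = np.unravel_index(flat, masked.shape)
    return list(zip(starts.tolist(), ends.tolist()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clips = rng.standard_normal((8, 16))     # 8 clips, 16-dim features
    sentence = rng.standard_normal(12)       # stand-in encoded sentence
    fmap, valid = build_2d_temporal_map(clips)
    fused = fuse(fmap, valid, sentence,
                 rng.standard_normal((16, 32)),
                 rng.standard_normal((12, 32)))
    # Stand-in scoring: a random linear read-out instead of a trained network.
    scores = fused @ rng.standard_normal(32)
    print(top_k_moments(scores, valid, k=3))
```

Storing moments on a start-by-end grid places moments that share or neighbor a boundary next to each other, which is what allows a convolutional layer over the map to relate adjacent candidate moments.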