| CPC G06V 20/48 (2022.01) [G06F 16/3344 (2019.01); G06F 40/10 (2020.01); G06V 10/62 (2022.01); G06V 10/7715 (2022.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/46 (2022.01); G06V 20/49 (2022.01)] | 20 Claims |

|
1. A computer-implemented method, comprising:
extracting, from a media stream, a two-dimensional temporal feature map representing a plurality of moments within the media stream, wherein the two-dimensional temporal feature map comprises a first dimension representing a start of a respective one of the plurality of moments and a second dimension representing an end of a respective one of the plurality of moments;
encoding a sentence feature extracted from an input;
fusing the encoded sentence feature with the two-dimensional temporal feature map into a unified subspace as a fused two-dimensional temporal map;
applying a convolutional layer to the two-dimensional temporal feature map to obtain a further feature map having a same dimension as the two-dimensional temporal feature map, wherein the convolutional layer comprises a dilated convolution and strides of the dilated convolution are configured to increase as lengths of the respective moments increase;
generating a temporal adjacent network using the fused two-dimensional temporal map and the further feature map;
determining, using the temporal adjacent network, a correlation between the plurality of moments and an action in the media stream; and
identifying a matching set of candidate moments for the input using the temporal adjacent network.
|
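
The extracting and fusing steps of claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the function names `build_2d_temporal_map` and `fuse`, the use of max pooling to summarize the clips inside a moment, and the element-wise product used for fusion are all assumptions chosen for illustration. The only part taken from the claim is the layout, in which the first index of the map is the start of a moment and the second index is its end.

```python
from typing import List, Optional

Vector = List[float]
Map2D = List[List[Optional[Vector]]]


def build_2d_temporal_map(clips: List[Vector]) -> Map2D:
    """Return an N x N map whose entry [i][j] describes the moment from clip i to clip j.

    The row index is the start clip and the column index is the end clip.
    Entries with start > end are not valid moments and stay None.
    Max pooling over the clips of a moment is an assumption, not from the claim.
    """
    n = len(clips)
    feature_map: Map2D = [[None] * n for _ in range(n)]
    for start in range(n):
        pooled = list(clips[start])
        for end in range(start, n):
            # Extend the running element-wise max so it also covers clip `end`.
            pooled = [max(p, c) for p, c in zip(pooled, clips[end])]
            feature_map[start][end] = list(pooled)
    return feature_map


def fuse(feature_map: Map2D, sentence: Vector) -> Map2D:
    """Fuse every valid moment feature with the sentence feature.

    An element-wise product stands in for the fusion into a unified subspace;
    this choice is an assumption made for the sketch.
    """
    return [
        [None if cell is None else [m * s for m, s in zip(cell, sentence)] for cell in row]
        for row in feature_map
    ]


# Three hypothetical clip features of dimension 2 and one hypothetical sentence feature.
clips = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
tmap = build_2d_temporal_map(clips)
fused = fuse(tmap, [0.5, 2.0])
```

In this sketch only the upper triangle of the map (start not later than end) holds candidate moments, so a stream of N clips yields N(N+1)/2 candidates. The dilated convolution and the temporal adjacent network of the later steps are not modeled here.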