CPC G11B 27/36 (2013.01) [G06F 18/2433 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/46 (2022.01); G10L 25/30 (2013.01); G10L 25/57 (2013.01)] | 21 Claims |
1. A neural network system implemented by one or more computers,
wherein the neural network system identifies one or more blocks in a media clip that include audio data that is misaligned with corresponding video data,
wherein the neural network system comprises:
a convolutional subnetwork that generates a plurality of feature maps by generating, for each block included in a plurality of blocks in the media clip, a corresponding feature map derived from both one or more audio features and also one or more video features of the block in the media clip;
an attention module that:
computes a first set of data based on the plurality of feature maps;
executes one or more convolution operations to compute a plurality of confidence values corresponding to the plurality of blocks in the media clip based on the first set of data;
generates a plurality of weight values corresponding to the plurality of blocks based on the plurality of confidence values; and
computes a weighted average based on the plurality of weight values to generate a global feature vector; and
an output layer that identifies, based on the global feature vector from the attention module, a first block included in the plurality of blocks that includes first audio data that is misaligned with corresponding first video data.
|
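The attention module recited in claim 1 can be illustrated with a minimal NumPy sketch. This is not the patented implementation. The claim does not specify how the first set of data is computed, the shape of the convolution, how confidence values become weights, or what is averaged. The sketch therefore assumes a tanh projection for the first set of data, a 1-D convolution along the block axis (computed as cross-correlation, as is conventional in neural network libraries), a softmax for the weights, and a weighted average over the per-block feature maps. The names `attention_pool`, `proj`, and `kernel` are hypothetical.

```python
import numpy as np

def attention_pool(feature_maps, proj, kernel):
    """Sketch of the claimed attention steps (all design choices are assumptions).

    feature_maps: (num_blocks, feat_dim), one flattened feature map per block
    proj:         (feat_dim, hidden_dim), hypothetical projection matrix
    kernel:       (k, hidden_dim), hypothetical 1-D conv kernel, k assumed odd
    """
    # "First set of data": assumed here to be a tanh projection of the feature maps.
    hidden = np.tanh(feature_maps @ proj)

    # Convolution along the block axis with zero padding, one confidence per block.
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(hidden, ((pad, pad), (0, 0)))
    num_blocks = feature_maps.shape[0]
    confidence = np.array(
        [np.sum(padded[i:i + k] * kernel) for i in range(num_blocks)]
    )

    # Weights from confidences: assumed softmax (shifted for numerical stability).
    shifted = confidence - confidence.max()
    weights = np.exp(shifted) / np.exp(shifted).sum()

    # Weighted average of the per-block feature maps gives the global feature vector.
    global_vec = weights @ feature_maps
    return confidence, weights, global_vec

# Example call with random placeholder data (6 blocks, 8 features per block).
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 8))
conf, w, g = attention_pool(
    feats, rng.normal(size=(8, 4)), rng.normal(size=(3, 4))
)
```

In this sketch the weights sum to one, so blocks with higher confidence contribute more to the global feature vector that the output layer would consume. The output layer itself is omitted because the claim gives no detail on its form.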