US 12,334,118 B2
Techniques for identifying synchronization errors in media titles
Rohit Puri, Campbell, CA (US); Naji Khosravan, Orlando, FL (US); and Shervin Ardeshir Behrostaghi, Campbell, CA (US)
Assigned to NETFLIX, INC., Los Gatos, CA (US)
Filed by NETFLIX, INC., Los Gatos, CA (US)
Filed on Nov. 18, 2019, as Appl. No. 16/687,209.
Claims priority of provisional application 62/769,515, filed on Nov. 19, 2018.
Prior Publication US 2020/0160889 A1, May 21, 2020
Int. Cl. G11B 27/36 (2006.01); G06F 18/2433 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2023.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01); G10L 25/30 (2013.01); G10L 25/57 (2013.01)

CPC G11B 27/36 (2013.01) [G06F 18/2433 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/46 (2022.01); G10L 25/30 (2013.01); G10L 25/57 (2013.01)]

21 Claims

1. A neural network system implemented by one or more computers,

wherein the neural network system identifies one or more blocks in a media clip that include audio data that is misaligned with corresponding video data,

wherein the neural network system comprises:

a convolutional subnetwork that generates a plurality of feature maps by generating, for each block included in a plurality of blocks in the media clip, a corresponding feature map derived from both one or more audio features and also one or more video features of the block in the media clip;

an attention module that:

computes a first set of data based on the plurality of feature maps;

executes one or more convolution operations to compute a plurality of confidence values corresponding to the plurality of blocks in the media clip based on the first set of data;

generates a plurality of weight values corresponding to the plurality of blocks based on the plurality of confidence values; and

computes a weighted average based on the plurality of weight values to generate a global feature vector; and

an output layer that identifies, based on the global feature vector from the attention module, a first block included in the plurality of blocks that includes first audio data that is misaligned with corresponding first video data.
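
The attention module and output layer recited in claim 1 can be illustrated with a minimal NumPy sketch. It is not the claimed implementation, and it rests on several assumptions the claim does not fix: the "first set of data" is taken to be one vector per block, obtained by spatially averaging each feature map; the convolution operation is a single 1x1 convolution, which reduces to a dot product per block; the weight values come from a softmax over the confidence values; and the output layer is a logistic classifier that flags the highest-weighted block when the clip-level probability crosses a threshold. The names attention_pool, output_layer, conv_w, conv_b, out_w, out_b, and threshold are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attention_pool(feature_maps, conv_w, conv_b):
    """Pool per-block feature maps into one global feature vector.

    feature_maps: array of shape (num_blocks, channels, height, width),
                  one fused audio/video feature map per block.
    conv_w:       array of shape (channels,), weights of an assumed 1x1 convolution.
    conv_b:       scalar bias of that convolution.
    """
    # Assumed "first set of data": one vector per block via spatial averaging.
    block_vecs = feature_maps.mean(axis=(2, 3))        # (num_blocks, channels)
    # A 1x1 convolution over channels yields one confidence value per block.
    confidences = block_vecs @ conv_w + conv_b         # (num_blocks,)
    # Assumed normalization: softmax turns confidences into weights summing to 1.
    weights = softmax(confidences)                     # (num_blocks,)
    # Weighted average of the block vectors gives the global feature vector.
    global_vec = weights @ block_vecs                  # (channels,)
    return confidences, weights, global_vec


def output_layer(global_vec, weights, out_w, out_b, threshold=0.5):
    """Assumed output layer: logistic score on the global vector.

    Returns the clip-level misalignment probability and, if it reaches the
    threshold, the index of the block with the largest attention weight
    (otherwise None).
    """
    logit = float(global_vec @ out_w + out_b)
    prob = 1.0 / (1.0 + np.exp(-logit))
    flagged = int(np.argmax(weights)) if prob >= threshold else None
    return prob, flagged
```

In this sketch the attention weights serve two purposes: they form the global feature vector, and they localize the block most responsible for the clip-level decision. Using the weights for localization is one plausible reading of the claim, not a detail the claim itself states.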