US 12,437,540 B2
System and method for automatic video summarization
Andrei Boiarov, Sofia (BG); Kseniia Alekseitseva, Sofia (BG); Anton Kivich, Belgrade (RS); Sergey Ulasen, Singapore (SG); Ilya Shimchik, Zurich (CH); Serg Bell, Singapore (SG); Stanislav Protasov, Singapore (SG); and Nikolay Dobrovolskiy, Alanya (TR)
Assigned to Constructor Technology AG, Schaffhausen (CH); and Constructor Education and Research Genossenschaft, Schaffhausen (CH)
Filed by Constructor Technology AG, Schaffhausen (CH); and Constructor Education and Research Genossenschaft, Schaffhausen (CH)
Filed on May 23, 2023, as Appl. No. 18/322,303.
Prior Publication US 2024/0395042 A1, Nov. 28, 2024
Int. Cl. G06V 20/40 (2022.01); G06F 40/40 (2020.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01)
CPC G06V 20/47 (2022.01) [G06F 40/40 (2020.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/49 (2022.01)] 18 Claims
OG exemplary drawing
 
10. A system for configuring a neural network to generate highlights of a video of a specified type using a machine learning module and a set of videos depicting events of the specified type to generate a teaching set, the system comprising:
a ranking network, having a self-attention layer, configured to:
produce self-attention embeddings, wherein the self-attention layer has a query head, a key head, and a value head, wherein the query head, the key head, and the value head are implemented in parallel with each other, and wherein the self-attention embeddings are generated by the query head, the key head, and the value head,
compute a scalar triple product of vectors of self-attention embeddings generated by the query head, the key head, and the value head, respectively,
produce self-attention weight vectors from the scalar triple product of the self-attention embedding vectors,
produce a self-attention result with dimension D by multiplying the self-attention weight vectors with the value vectors,
apply an activation function,
obtain a rank value for the selected clip,
calculate a final loss value with respect to the rank value,
backpropagate errors through the machine learning module,
compare the rank value of a main positive clip with the rank value of a main negative clip, and repeat the training until the rank value of the main positive clip produced by the system becomes higher than the rank value of the main negative clip produced by the system; and
an inference module configured to generate highlights at inference time for each video based on the rank value generated by the machine learning module, wherein, based on a threshold of the rank value, the clip is classified as a highlight or not a highlight.
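
A minimal sketch (in PyTorch, chosen here only as an illustrative framework) of the ranking network recited in claim 10. The claim's "scalar triple product" operation is not fully specified in this excerpt; the sketch reads it as the elementwise product of the query, key, and value embeddings summed over the feature dimension. The softmax weighting, ReLU activation, layer sizes, and class/parameter names are assumptions for illustration, not details disclosed in the patent text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripleProductSelfAttentionRanker(nn.Module):
    """Hypothetical ranking network with a triple-product self-attention layer."""

    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        # Query, key, and value heads applied in parallel to the clip features.
        self.query = nn.Linear(feat_dim, embed_dim)
        self.key = nn.Linear(feat_dim, embed_dim)
        self.value = nn.Linear(feat_dim, embed_dim)
        # Maps the attention result of dimension D (= embed_dim) to a scalar rank value.
        self.rank_head = nn.Linear(embed_dim, 1)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, seq_len, feat_dim) frame/segment features of a clip.
        q = self.query(clip_feats)   # (batch, seq_len, D)
        k = self.key(clip_feats)     # (batch, seq_len, D)
        v = self.value(clip_feats)   # (batch, seq_len, D)
        # "Scalar triple product" interpreted here as sum over D of q * k * v (assumption).
        triple = (q * k * v).sum(dim=-1)                    # (batch, seq_len)
        # Self-attention weight vectors derived from the triple-product scores.
        weights = F.softmax(triple, dim=-1)                 # (batch, seq_len)
        # Self-attention result with dimension D: weights applied to the value vectors.
        attended = torch.einsum("bs,bsd->bd", weights, v)   # (batch, D)
        # Activation, then a scalar rank value per clip.
        rank = self.rank_head(torch.relu(attended))         # (batch, 1)
        return rank.squeeze(-1)
```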
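A sketch of the pairwise training step from the same claim, assuming a margin ranking loss as the "final loss"; the actual loss function, optimizer, margin, and helper name `train_pair` are illustrative assumptions not given in this excerpt.

```python
def train_pair(model, optimizer, pos_clip, neg_clip, margin: float = 1.0) -> bool:
    """One update on a (main positive clip, main negative clip) pair.

    Returns True once the positive clip is ranked above the negative clip,
    so the caller can repeat training on the pair until that condition holds.
    """
    model.train()
    rank_pos = model(pos_clip)   # rank value of the main positive clip
    rank_neg = model(neg_clip)   # rank value of the main negative clip
    # Final loss with respect to the rank values (margin ranking loss is an assumption).
    target = torch.ones_like(rank_pos)
    loss = F.margin_ranking_loss(rank_pos, rank_neg, target, margin=margin)
    optimizer.zero_grad()
    loss.backward()              # backpropagate errors through the machine learning module
    optimizer.step()
    return bool((rank_pos > rank_neg).all())
```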
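A sketch of the inference module's thresholding of the rank value; the threshold of 0.5 and the function name `classify_highlights` are placeholder assumptions.

```python
@torch.no_grad()
def classify_highlights(model, clips, threshold: float = 0.5) -> torch.Tensor:
    """Return a boolean highlight / not-highlight decision per clip."""
    model.eval()
    ranks = torch.cat([model(clip) for clip in clips])  # rank value per clip
    return ranks > threshold
```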