US 12,271,814 B2
Automatic digital content captioning using spatial relationships method and apparatus
Simao Herdade, San Francisco, CA (US); Armin Kappeler, Thalwil (CH); Kofi Boakye, Oakland, CA (US); and Joao Vitor Baldini Soares, New York, NY (US)
Assigned to YAHOO ASSETS LLC, New York, NY (US)
Filed by YAHOO ASSETS LLC, Dulles, VA (US)
Filed on Jun. 13, 2022, as Appl. No. 17/838,345.
Application 17/838,345 is a continuation of application No. 16/729,982, filed on Dec. 30, 2019, granted, now 11,361,550.
Prior Publication US 2022/0309791 A1, Sep. 29, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 3/08 (2023.01); G06N 3/04 (2023.01); G06T 7/73 (2017.01); G06T 9/00 (2006.01); G06T 11/20 (2006.01); G06V 10/44 (2022.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01); G10L 13/00 (2006.01)

CPC G06N 3/08 (2013.01) [G06N 3/04 (2013.01); G06T 7/73 (2017.01); G06T 9/00 (2013.01); G06T 11/20 (2013.01); G06V 10/454 (2022.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01); G06V 20/47 (2022.01); G10L 13/00 (2013.01); G06T 2207/20084 (2013.01); G06T 2210/12 (2013.01)]

21 Claims

1. A method comprising:

analyzing, via a computing device, a digital content item comprising a plurality of objects to detect the plurality of objects depicted in the digital content item, the analysis comprising determining a respective bounding box for each object of the plurality of objects of the digital content item;

determining, via the computing device, a set of geometry features for each object, of the plurality of objects of the digital content item, using the object's respective bounding box;

analyzing, via the computing device, the digital content item to determine an appearance vector for each of the plurality of objects; and

automatically creating, via the computing device and using a trained image captioning machine model, a caption comprising a sequence of words that is descriptive of the digital content item, the automatic caption creation comprising determining, by the trained image captioning machine model, each word of the caption and a position of each word in the sequence of words of the caption that is descriptive of the digital content item using the appearance vector and the set of geometry features determined for each object of the plurality, and the spatial relationships among the plurality of objects identified using each object's set of geometry features.
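
For illustration only, the geometry-feature and spatial-relationship steps recited in claim 1 might be sketched as follows. The claim does not say which geometry features are derived from a bounding box or how spatial relationships are encoded, so everything here is an assumption: the `Box` type, the function names, the choice of center, width, and height as the per-object features, and the log-scaled pairwise offsets (an encoding commonly used in object-relation models). It is not the claimed implementation, and it omits object detection, appearance vectors, and the trained captioning model.

```python
import math
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """Hypothetical axis-aligned bounding box in pixel coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


Features = Tuple[float, float, float, float]


def geometry_features(box: Box) -> Features:
    """Return (center_x, center_y, width, height) for one bounding box.

    Assumed feature set; the claim only requires features derived from the box.
    """
    width = box.x_max - box.x_min
    height = box.y_max - box.y_min
    return (box.x_min + width / 2.0, box.y_min + height / 2.0, width, height)


def relative_geometry(a: Box, b: Box, eps: float = 1e-3) -> Features:
    """Encode the position and size of box b relative to box a.

    Assumes both boxes have positive width and height. Center offsets are
    clamped to eps so that coincident centers do not produce log(0).
    """
    xa, ya, wa, ha = geometry_features(a)
    xb, yb, wb, hb = geometry_features(b)
    return (
        math.log(max(abs(xa - xb), eps) / wa),  # horizontal offset, scaled by a's width
        math.log(max(abs(ya - yb), eps) / ha),  # vertical offset, scaled by a's height
        math.log(wb / wa),                      # width ratio
        math.log(hb / ha),                      # height ratio
    )


def pairwise_relations(boxes: List[Box]) -> List[List[Features]]:
    """Build an N x N table of relative-geometry features for all object pairs."""
    return [[relative_geometry(a, b) for b in boxes] for a in boxes]


if __name__ == "__main__":
    detected = [Box(0, 0, 2, 2), Box(4, 0, 6, 4)]  # made-up example boxes
    for row in pairwise_relations(detected):
        print(row)
```

In a captioning model of the kind claimed, a table like the one returned by `pairwise_relations` would typically be combined with each object's appearance vector before words are predicted; how that combination is done is not specified in claim 1 and is not shown here.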