US 12,437,565 B2
Apparatus and method for automatically generating image caption by applying deep learning algorithm to an image
Seung Ho Han, Daejeon (KR); and Ho Jin Choi, Daejeon (KR)
Assigned to KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY, Daejeon (KR)
Appl. No. 17/925,354
Filed by KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY, Daejeon (KR)
PCT Filed Jun. 16, 2020, PCT No. PCT/KR2020/007755 § 371(c)(1), (2) Date Nov. 15, 2022, PCT Pub. No. WO2021/256578, PCT Pub. Date Dec. 23, 2021.
Prior Publication US 2023/0177854 A1, Jun. 8, 2023
Int. Cl. G06V 20/70 (2022.01); G06F 40/44 (2020.01); G06T 7/90 (2017.01); G06V 10/44 (2022.01); G06N 3/02 (2006.01); G06V 10/70 (2022.01)

CPC G06V 20/70 (2022.01) [G06F 40/44 (2020.01); G06T 7/90 (2017.01); G06V 10/44 (2022.01); G06N 3/02 (2013.01); G06T 2207/10024 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06V 10/768 (2022.01); G06V 2201/07 (2022.01)]

7 Claims

1. An apparatus for automatically generating an image caption, the apparatus comprising:

an automatic caption generation module configured to generate a caption by applying a deep learning algorithm to an image received from a client;

a caption basis generation module configured to generate a basis for the caption by mapping a partial area in the image received from the client with respect to important words in the caption received from the automatic caption generation module; and

a visualization module configured to visualize the caption received from the automatic caption generation module and the basis for the caption received from the caption basis generation module to return the visualized caption and basis to the client,

wherein the caption basis generation module includes:

an object recognition module configured to recognize one or more objects included in the image received from the client and extract one or more object areas;

an image area-word mapping module configured to train a relevance between words in the caption generated by the automatic caption generation module and each of the object areas extracted by the object recognition module using a deep learning algorithm, and output a weight matrix as a result of the training; and

an interpretation reinforcement module configured to extract a word having a highest weight for each object area from the weight matrix received from the image area-word mapping module, and calculate a posterior probability for each word.
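
The final step of claim 1 (taking the highest-weight word for each object area from the weight matrix, then calculating a posterior probability for each word) can be illustrated with a minimal sketch. The claim does not specify how the posterior is computed, so the sketch assumes non-negative weights and a uniform prior over object areas, giving P(area | word) as a word's weight in that area divided by its total weight across all areas. The function names, the sample words, and the sample weights are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Tuple


def top_word_per_area(
    weights: List[List[float]], words: List[str]
) -> List[Tuple[str, float]]:
    """For each object area (one row of the weight matrix), return the
    caption word with the highest weight together with that weight."""
    result = []
    for row in weights:
        best = max(range(len(words)), key=lambda j: row[j])
        result.append((words[best], row[best]))
    return result


def word_posteriors(
    weights: List[List[float]], words: List[str]
) -> Dict[str, List[float]]:
    """ASSUMPTION (not stated in the claim): the posterior of each object
    area given a word is that word's column of weights normalized to sum
    to 1, i.e. Bayes' rule with a uniform prior over areas."""
    posteriors = {}
    for j, word in enumerate(words):
        column = [row[j] for row in weights]
        total = sum(column)
        posteriors[word] = [v / total if total > 0 else 0.0 for v in column]
    return posteriors


# Hypothetical example: 2 object areas (rows) x 3 caption words (columns).
words = ["dog", "ball", "grass"]
weights = [
    [0.7, 0.2, 0.1],  # object area 0
    [0.1, 0.6, 0.3],  # object area 1
]

top = top_word_per_area(weights, words)
post = word_posteriors(weights, words)
print(top)
print(post)
```

Under these assumptions, each object area is paired with its most relevant caption word, and the per-word posteriors indicate how strongly that word is tied to one area rather than the others. A different normalization (for example, a softmax over words) would be an equally plausible reading of the claim language.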