US 12,136,434 B2
Apparatus and method for generating audio-embedded image
Su-Wan Park, Daejeon (KR); Geon-Woo Kim, Daejeon (KR); and Seon-Ho Oh, Daejeon (KR)
Assigned to Electronics and Telecommunications Research Institute, Daejeon (KR)
Filed by ELECTRONICS AND TELECOMMUNICATIONS RESEARCH INSTITUTE, Daejeon (KR)
Filed on Feb. 10, 2022, as Appl. No. 17/668,531.
Claims priority of application No. 10-2021-0023670 (KR), filed on Feb. 22, 2021; and application No. 10-2021-0064301 (KR), filed on May 18, 2021.
Prior Publication US 2022/0277760 A1, Sep. 1, 2022
Int. Cl. G06V 10/40 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01); G10L 21/10 (2013.01)

CPC G10L 21/10 (2013.01) [G06V 10/40 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01)]

8 Claims

1. An apparatus for generating an audio-embedded image, comprising:

one or more processors; and

an execution memory for storing at least one program that is executed by the one or more processors,

wherein the at least one program is configured to:

receive audio and an image,

convert the audio into audio information having a preset image format,

generate the audio-embedded image in which the audio information is embedded in the image,

discriminate the audio-embedded image using the audio information and discrimination audio information extracted from the audio-embedded image,

generate the audio-embedded image such that a result value of a first loss function, which minimizes a visual difference between the received image and the audio-embedded image using a first neural network, is minimized,

generate the audio-embedded image such that a result value of a second loss function, which minimizes an acoustic difference between the audio information and the discrimination audio information using the first neural network, is minimized, and

learn the received image and the audio-embedded image such that an image feature corresponding to identical image classification, in which a result value of a third loss function between the received image and the audio-embedded image is minimized, is extracted using a second neural network.
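
The conversion step and the three loss terms recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The claim does not state how audio is mapped to the preset image format, what form the loss functions take, or how they are combined, so everything below is an assumption made for illustration: the 8-bit sample packing, the use of mean squared error for all three terms, the weighted sum, and the hypothetical names audio_to_image_format, mse, and combined_loss. The first and second neural networks are left out; their outputs are passed in as plain arrays.

```python
import numpy as np


def audio_to_image_format(samples, height, width):
    """Pack 1-D audio samples in [-1, 1] into a (height, width) uint8 array.

    Assumed mapping only: the claim says "a preset image format" without
    specifying one. Samples beyond height*width are dropped; missing
    samples are filled with silence (0.0).
    """
    samples = np.asarray(samples, dtype=np.float64)
    needed = height * width
    padded = np.zeros(needed, dtype=np.float64)
    n = min(samples.size, needed)
    padded[:n] = np.clip(samples[:n], -1.0, 1.0)
    # Map [-1, 1] onto the 8-bit pixel range [0, 255].
    pixels = np.round((padded + 1.0) * 127.5).astype(np.uint8)
    return pixels.reshape(height, width)


def mse(a, b):
    """Mean squared error between two equally shaped arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))


def combined_loss(image, embedded_image,
                  audio_info, discrimination_audio_info,
                  image_feature, embedded_feature,
                  weights=(1.0, 1.0, 1.0)):
    """Return the three loss terms and their weighted sum.

    first:  visual difference (received image vs. audio-embedded image)
    second: acoustic difference (audio information vs. audio information
            extracted back out of the audio-embedded image)
    third:  feature difference (features of both images, which in the
            claim would come from the second neural network)
    MSE and the weighted sum are assumptions, not taken from the claim.
    """
    first = mse(image, embedded_image)
    second = mse(audio_info, discrimination_audio_info)
    third = mse(image_feature, embedded_feature)
    total = weights[0] * first + weights[1] * second + weights[2] * third
    return {"first": first, "second": second, "third": third, "total": total}
```

In this sketch, an embedding that leaves the image unchanged, allows the audio information to be recovered exactly, and preserves the extracted image features drives all three terms to zero. In a training loop the three terms would be computed on network outputs and minimized by an optimizer, which is outside the scope of this example.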