US 12,347,072 B2
Methods for enhancement of low-light images based on reinforcement learning and aesthetic evaluation
Dong Liang, Nanjing (CN); Ling Li, Nanjing (CN); Shengjun Huang, Nanjing (CN); and Songcan Chen, Nanjing (CN)
Assigned to NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS, Nanjing (CN)
Filed by NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS, Jiangsu (CN)
Filed on Dec. 10, 2024, as Appl. No. 18/976,298.
Application 18/976,298 is a continuation in part of application No. PCT/CN2023/074843, filed on Feb. 7, 2023.
Claims priority of application No. 202210650946.7 (CN), filed on Jun. 10, 2022.
Prior Publication US 2025/0139745 A1, May 1, 2025
Int. Cl. G06T 5/60 (2024.01); G06T 5/92 (2024.01); G06T 5/94 (2024.01)
CPC G06T 5/60 (2024.01) [G06T 5/92 (2024.01); G06T 5/94 (2024.01); G06T 2207/10024 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01)] 4 Claims
OG exemplary drawing
 
1. A method for enhancement of a low-light image based on reinforcement learning and aesthetic evaluation, comprising:
S1, generating images of non-normal luminance under different lighting scenes, and constructing a training dataset for a reinforcement learning system based on the images of non-normal luminance;
S2, initializing the training dataset, a policy network, and a value network in the reinforcement learning system;
S3, updating, based on a no-reference reward score and an aesthetic assessment reward score, the policy network and the value network;
S4, completing model training when all samples are trained and all training iterations are completed;
S5, outputting an image result after the enhancement of the low-light image;
wherein the initializing a policy network and a value network in the operation S2 includes:
inputting a current state s(t) into the policy network and the value network, wherein s(t) denotes a state at a time t; an output of the policy network is a policy π(a(t)|s(t)) for taking an action a(t); and an output of the value network is a value network output value V(s(t)), representing an expected total reward from the current state s(t);
the updating the policy network and the value network in S3 includes:
S3.1, training the training dataset based on historical phase images to obtain an environmental reward value, denoted as R(t), using the following equation:

R(t) = \sum_{i=0}^{\infty} \gamma^{i} r(t+i)
wherein γ^i denotes an ith power of a discount factor γ and r(t) represents an immediate environmental reward value at the time t; wherein
the following influence factors are taken into account for obtaining the environmental reward value:
a spatial consistency loss, denoted as Lspa:

L_{spa} = \frac{1}{K} \sum_{i=1}^{K} \sum_{j \in \Omega(i)} \left( \left| Y_{i} - Y_{j} \right| - \left| I_{i} - I_{j} \right| \right)^{2}
wherein K represents a count of local regions; Ω(i) represents four neighboring regions centered on a region i; Y represents an average grayscale value of pixels in a local region of an enhanced image; and I represents an average grayscale value of pixels in a local region of an input image;
an exposure control loss, denoted as Lexp:

L_{exp} = \frac{1}{M} \sum_{K=1}^{M} \left| Y_{K} - E \right|
wherein E represents a grayscale level of an image pixel in an RGB color space; M represents a count of non-overlapping local regions; Y represents the average grayscale value of the pixels in a local region K of the enhanced image, wherein K∈[1, M];
a color constancy loss, denoted as Lcol:

L_{col} = \sum_{(p,q) \in \varepsilon} \left( J^{p} - J^{q} \right)^{2}
wherein Jp represents an average grayscale value of pixels in a channel p of the enhanced image, Jq represents an average grayscale value of pixels in a channel q of the enhanced image; (p, q) represents any pair of channels selected from (R,G), (R,B), (G,B), and ε represents a set of (R,G), (R,B), (G,B);
a luminance smoothness loss, denoted as Ltv:

L_{tv} = \frac{1}{N} \sum_{n=1}^{N} \sum_{c \in \xi} \left( \left| \nabla_{x} E_{n}^{c} \right| + \left| \nabla_{y} E_{n}^{c} \right| \right)^{2}
wherein E_n^c represents a parametric curve mapping of a channel c in an nth enhancement state; N represents a count of iterations for image enhancement in the reinforcement learning; ∇x represents a horizontal gradient operation; ∇y represents a vertical gradient operation; ξ denotes a set of R, G, and B channels in the enhanced image; and
an aesthetic quality loss, denoted as Leva;
in order to score the aesthetic quality of the enhanced image, two additional image aesthetic scoring deep learning network models, denoted as a Model1 and a Model2, are introduced to calculate the aesthetic quality loss; a color and luminance attribute of the enhanced image and a quality attribute of the enhanced image are used to train the Model1 and the Model2, respectively; and the aesthetic quality loss is scored using an additionally introduced aesthetic evaluation model including the following equation:

L_{eva} = \alpha f_{1} + \beta f_{2}
wherein f1 denotes a score of the color and luminance attribute of the enhanced image, which is a score output by the Model1 when the enhanced image is input to the Model1; f2 denotes a score of the quality attribute of the enhanced image, which is a score output by the Model2 when the enhanced image is input to the Model2, and the higher the score is, the better the quality of the enhanced image is; α and β are weight coefficients;
a goal of image enhancement is to make the immediate environmental reward value r(t) as large as possible; the smaller the spatial consistency loss, the exposure control loss, the color constancy loss, and the luminance smoothness loss are, the better the quality of the enhanced image is; the larger the aesthetic quality loss is, the better the quality of the enhanced image is; thus, the immediate environmental reward value r(t) at the time t is represented as follows:

r(t) = L_{eva} - \left( L_{spa} + L_{exp} + L_{col} + L_{tv} \right)
the environmental reward value at the time t, taking into account the influence factors, is expressed as follows:

R(t) = \sum_{i=0}^{\infty} \gamma^{i} \left[ L_{eva}(t+i) - L_{spa}(t+i) - L_{exp}(t+i) - L_{col}(t+i) - L_{tv}(t+i) \right]
S3.2, training the training dataset based on the historical phase images to obtain the value network output value;
S3.3, updating the value network using the following equation based on the environmental reward value and the value network output value:

d\theta_{v} = \frac{\partial \left( R(t) - V(s(t)) \right)^{2}}{\partial \theta_{v}}
wherein θv represents a value network parameter;
S3.4, updating the policy network based on the environmental reward value and a predicted value using the following equations:

\delta(t) = R(t) - V(s(t)), \qquad d\theta_{p} = \nabla_{\theta_{p}} \log \pi\left( a(t) \mid s(t) \right) \delta(t)
wherein θp represents a parameter of the policy network; the output of the policy network is the policy π(a(t)|s(t)) for taking the action a(t)∈A; π(a(t)|s(t)) is a probability calculated by a softmax function; A represents an action space; and an output dimension of the policy network is |A|.
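
The four no-reference terms recited in S3.1 (Lspa, Lexp, Lcol, Ltv) follow the general form of zero-reference enhancement losses. The NumPy sketch below is a minimal, non-authoritative illustration of how such terms can be computed for an input/enhanced image pair; the 16×16 region size, the target exposure level E = 0.6, and all function names are assumptions made for illustration rather than values fixed by the claim.

```python
import numpy as np

def region_means(img, size=16):
    # Average grayscale value of non-overlapping size x size local regions
    # (hypothetical helper; the region size is an assumption).
    gray = img.mean(axis=2)                               # H x W mean over R, G, B
    h, w = gray.shape[0] // size, gray.shape[1] // size
    return gray[:h * size, :w * size].reshape(h, size, w, size).mean(axis=(1, 3))

def spatial_consistency_loss(enhanced, original, size=16):
    # Lspa: squared difference between neighbouring-region contrasts of the
    # enhanced image (Y) and of the input image (I); each horizontal and
    # vertical neighbour pair is counted once.
    Y, I = region_means(enhanced, size), region_means(original, size)
    dy_h, di_h = np.abs(np.diff(Y, axis=1)), np.abs(np.diff(I, axis=1))
    dy_v, di_v = np.abs(np.diff(Y, axis=0)), np.abs(np.diff(I, axis=0))
    return ((dy_h - di_h) ** 2).mean() + ((dy_v - di_v) ** 2).mean()

def exposure_control_loss(enhanced, size=16, E=0.6):
    # Lexp: mean distance of each local region's average intensity Y_K from a
    # target well-exposedness level E (E = 0.6 is an assumed value).
    return np.abs(region_means(enhanced, size) - E).mean()

def color_constancy_loss(enhanced):
    # Lcol: squared differences between the per-channel means J^p, J^q over the
    # channel pairs (R,G), (R,B), (G,B).
    J = enhanced.reshape(-1, 3).mean(axis=0)
    return sum((J[p] - J[q]) ** 2 for p, q in ((0, 1), (0, 2), (1, 2)))

def luminance_smoothness_loss(curve_maps):
    # Ltv: total-variation style penalty on the per-iteration curve parameter
    # maps (one H x W x 3 map per enhancement iteration); squared horizontal
    # and vertical gradients are averaged, a common simplification.
    loss = 0.0
    for A in curve_maps:
        loss += (np.diff(A, axis=1) ** 2).mean() + (np.diff(A, axis=0) ** 2).mean()
    return loss / len(curve_maps)

# toy usage with random images in [0, 1]
rng = np.random.default_rng(0)
low, enh = 0.2 * rng.random((256, 256, 3)), rng.random((256, 256, 3))
print(spatial_consistency_loss(enh, low), exposure_control_loss(enh),
      color_constancy_loss(enh), luminance_smoothness_loss([enh]))
```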
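S3.1 then combines an aesthetic term Leva = α·f1 + β·f2 with the four quality losses, so that smaller quality losses and a larger aesthetic score both increase the immediate reward r(t), and R(t) is the discounted sum of future immediate rewards. The sketch below shows one way to wire this up, assuming Model1 and Model2 are arbitrary callables returning scalar scores and that the terms are combined by plain subtraction with α = β = 0.5; the claim fixes only the direction of each term, not these weights.

```python
from typing import Callable, Sequence

def aesthetic_quality_loss(image, model1: Callable, model2: Callable,
                           alpha: float = 0.5, beta: float = 0.5) -> float:
    # Leva = alpha * f1 + beta * f2, where f1/f2 are the colour-and-luminance
    # and quality scores of the two aesthetic models (higher is better).
    return alpha * model1(image) + beta * model2(image)

def immediate_reward(l_spa: float, l_exp: float, l_col: float,
                     l_tv: float, l_eva: float) -> float:
    # r(t): smaller no-reference losses and a larger aesthetic score give a
    # larger immediate environmental reward (an assumed unweighted combination).
    return l_eva - (l_spa + l_exp + l_col + l_tv)

def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    # R(t) = sum_i gamma**i * r(t + i), the environmental reward value.
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# toy usage with constant stand-in aesthetic scorers
l_eva = aesthetic_quality_loss(None, lambda img: 7.0, lambda img: 6.0)
r0 = immediate_reward(0.10, 0.20, 0.05, 0.02, l_eva)
print(r0, discounted_return([r0, 0.9 * r0, 0.8 * r0]))
```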
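Steps S3.3 and S3.4 update the value and policy networks from the environmental reward value R(t) and the predicted value V(s(t)), which matches a standard actor-critic update. The PyTorch sketch below is an assumed minimal instantiation: the state dimension, the action-space size |A|, the two-layer network shapes, the optimizer, and the learning rate are illustrative choices not specified by the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, NUM_ACTIONS = 64, 8            # assumed dimensions of s(t) and |A|

policy_net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                           nn.Linear(128, NUM_ACTIONS))   # outputs |A| logits
value_net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                          nn.Linear(128, 1))              # outputs V(s(t))
optimizer = torch.optim.Adam(list(policy_net.parameters()) +
                             list(value_net.parameters()), lr=1e-4)

def actor_critic_step(state: torch.Tensor, action: int, R_t: float):
    # One update of theta_p and theta_v from a state s(t), the action a(t)
    # actually taken, and the environmental reward value R(t).
    logits = policy_net(state)
    log_pi = F.log_softmax(logits, dim=-1)[action]   # log pi(a(t)|s(t))
    value = value_net(state).squeeze()               # V(s(t))
    advantage = R_t - value                          # R(t) - V(s(t))

    value_loss = advantage.pow(2)                    # S3.3: (R(t) - V(s(t)))^2
    policy_loss = -log_pi * advantage.detach()       # S3.4: policy-gradient term
    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()

# toy usage with a random state vector
actor_critic_step(torch.randn(STATE_DIM), action=3, R_t=1.2)
```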
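Read as a whole, S1 to S5 describe a train-then-apply workflow: build a dataset of non-normal-luminance images, initialize the networks, iterate reward-driven updates over every sample and training iteration, and finally run the trained policy on a low-light input. The skeleton below only shows that control flow; every function body is a stub (the dataset generator just produces dark random images), and the epoch and image counts are arbitrary assumptions.

```python
import numpy as np

def generate_training_set(num_images=8, size=64):
    # S1: stand-in for synthesising non-normal-luminance images of different scenes.
    rng = np.random.default_rng(0)
    return [0.2 * rng.random((size, size, 3)) for _ in range(num_images)]

def init_networks():
    # S2: initialise the policy and value networks (placeholders here).
    return {"policy": None, "value": None}

def update_networks(nets, image):
    # S3: compute the no-reference and aesthetic rewards and update both
    # networks; see the loss/reward and actor-critic sketches above.
    return nets

def enhance(nets, image):
    # S5: apply the learned enhancement policy (identity stub).
    return image

dataset = generate_training_set()          # S1
nets = init_networks()                     # S2
for epoch in range(3):                     # S4: all training iterations
    for img in dataset:                    # S4: all samples
        nets = update_networks(nets, img)  # S3
result = enhance(nets, dataset[0])         # S5
```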