US 12,347,072 B2
Methods for enhancement of low-light images based on reinforcement learning and aesthetic evaluation
Dong Liang, Nanjing (CN); Ling Li, Nanjing (CN); Shengjun Huang, Nanjing (CN); and Songcan Chen, Nanjing (CN)
Assigned to NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS, Nanjing (CN)
Filed by NANJING UNIVERSITY OF AERONAUTICS AND ASTRONAUTICS, Jiangsu (CN)
Filed on Dec. 10, 2024, as Appl. No. 18/976,298.
Application 18/976,298 is a continuation in part of application No. PCT/CN2023/074843, filed on Feb. 7, 2023.
Claims priority of application No. 202210650946.7 (CN), filed on Jun. 10, 2022.
Prior Publication US 2025/0139745 A1, May 1, 2025
Int. Cl. G06T 5/60 (2024.01); G06T 5/92 (2024.01); G06T 5/94 (2024.01)
CPC G06T 5/60 (2024.01) [G06T 5/92 (2024.01); G06T 5/94 (2024.01); G06T 2207/10024 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01)] 4 Claims
OG exemplary drawing
 
1. A method for enhancement of a low-light image based on reinforcement learning and aesthetic evaluation, comprising:
S1, generating images of non-normal luminance under different lighting scenes, and constructing a training dataset for a reinforcement learning system based on the images of non-normal luminance;
S2, initializing the training dataset, a policy network, and a value network in the reinforcement learning system;
S3, updating, based on a no-reference reward score and an aesthetic assessment reward score, the policy network and the value network;
S4, completing model training when all samples are trained and all training iterations are completed;
S5, outputting an image result after the enhancement of the low-light image;
wherein the initializing a policy network and a value network in the operation S2 includes:
inputting a current state s(t) into the policy network and the value network, wherein s(t) denotes a state at a time t; an output of the policy network is a policy π(a(t)|s(t)) for taking an action a(t); and an output of the value network is a value network output value V(s(t)), representing an expected total reward from the current state s(t);
the updating the policy network and the value network in S3 includes:
S3.1, training the training dataset based on historical phase images to obtain an environmental reward value, denoted as R(t), using the following equation:

R(t) = \sum_{i=0}^{\infty} \gamma^{i} r(t+i)
wherein γ^i denotes an ith power of a discount factor γ and r(t) represents an immediate environmental reward value at the time t; wherein
the following influence factors are taken into account for obtaining the environmental reward value:
a spatial consistency loss, denoted as Lspa:

L_{spa} = \frac{1}{K} \sum_{i=1}^{K} \sum_{j \in \Omega(i)} \left( \left| Y_{i} - Y_{j} \right| - \left| I_{i} - I_{j} \right| \right)^{2}
wherein K represents a count of local regions; Ω(i) represents four neighboring regions centered on a region i; Y represents an average grayscale value of pixels in a local region of an enhanced image; and I represents an average grayscale value of pixels in a local region of an input image;
an exposure control loss, denoted as Lexp:

L_{exp} = \frac{1}{M} \sum_{K=1}^{M} \left| Y_{K} - E \right|
wherein E represents a grayscale level of an image pixel in an RGB color space; M represents a count of non-overlapping local regions; Y represents the average grayscale value of the pixels in a local region K of the enhanced image, wherein K∈[1, M];
a color constancy loss, denoted as Lcol:

L_{col} = \sum_{(p,q) \in \varepsilon} \left( J^{p} - J^{q} \right)^{2}
wherein Jp represents an average grayscale value of pixels in a channel p of the enhanced image, Jq represents an average grayscale value of pixels in a channel q of the enhanced image; (p, q) represents any pair of channels selected from (R,G), (R,B), (G,B), and ε represents a set of (R,G), (R,B), (G,B);
a luminance smoothness loss, denoted as Ltv:

L_{tv} = \frac{1}{N} \sum_{n=1}^{N} \sum_{c \in \xi} \left( \left| \nabla_{x} E_{n}^{c} \right| + \left| \nabla_{y} E_{n}^{c} \right| \right)^{2}
wherein E_n^c represents a parametric curve mapping of a channel c in an nth enhancement state; N represents a count of iterations for image enhancement in the reinforcement learning; ∇x represents a horizontal gradient operation; ∇y represents a vertical gradient operation; ξ denotes a set of R, G, and B channels in the enhanced image; and
an aesthetic quality loss, denoted as Leva;
in order to score the aesthetic quality of the enhanced image, two additional image aesthetic scoring deep learning network models, denoted as a Model1 and a Model2, are introduced to calculate the aesthetic quality loss; a color and luminance attribute of the enhanced image and a quality attribute of the enhanced image are used to train the Model1 and the Model2, respectively; and the aesthetic quality loss is scored using an additionally introduced aesthetic evaluation model including the following equation:

L_{eva} = \alpha f_{1} + \beta f_{2}
wherein f1 denotes a score of the color and luminance attribute of the enhanced image, which is a score output by the Model1 when the enhanced image is input to the Model1; f2 denotes a score of the quality attribute of the enhanced image, which is a score output by the Model2 when the enhanced image is input to the Model2, and the higher the score is, the better the quality of the enhanced image is; α and β are weight coefficients;
a goal of image enhancement is to make the immediate environmental reward value r(t) as large as possible; the smaller the spatial consistency loss, the exposure control loss, the color constancy loss, and the luminance smoothness loss are, the better the quality of the enhanced image is; the larger the aesthetic quality loss is, the better the quality of the enhanced image is; thus, the immediate environmental reward value r(t) at the time t is represented as follows:

r(t) = L_{eva} - \left( L_{spa} + L_{exp} + L_{col} + L_{tv} \right)
the environmental reward value at the time t, taking into account the influence factors, is expressed as follows:

R(t) = \sum_{i=0}^{\infty} \gamma^{i} \left[ L_{eva}(t+i) - L_{spa}(t+i) - L_{exp}(t+i) - L_{col}(t+i) - L_{tv}(t+i) \right]
S3.2, training the training dataset based on the historical phase images to obtain the value network output value;
S3.3, updating the value network using the following equation based on the environmental reward value and the value network output value:

d\theta_{v} = \frac{\partial \left( R(t) - V(s(t)) \right)^{2}}{\partial \theta_{v}}
wherein θv represents a value network parameter;
S3.4, updating the policy network based on the environmental reward value and a predicted value using the following equations:

\delta(t) = R(t) - V(s(t)), \qquad d\theta_{p} = \nabla_{\theta_{p}} \log \pi\left( a(t) \mid s(t) \right) \delta(t)
wherein θp represents a parameter of the policy network; the output of the policy network is the policy π(a(t)|s(t)) for taking the action a(t)∈A; π(a(t)|s(t)) is a probability calculated by a softmax function; A represents an action space; and an output dimension of the policy network is |A|.
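
The four no-reference terms recited in S3.1 (Lspa, Lexp, Lcol, Ltv) follow the general form of zero-reference enhancement losses. The NumPy sketch below is a minimal, non-authoritative illustration of how such terms can be computed for an input/enhanced image pair; the 16×16 region size, the target exposure level E = 0.6, and all function names are assumptions made for illustration rather than values fixed by the claim.

```python
import numpy as np

def region_means(img, size=16):
    # Average grayscale value of non-overlapping size x size local regions
    # (hypothetical helper; the region size is an assumption).
    gray = img.mean(axis=2)                               # H x W mean over R, G, B
    h, w = gray.shape[0] // size, gray.shape[1] // size
    return gray[:h * size, :w * size].reshape(h, size, w, size).mean(axis=(1, 3))

def spatial_consistency_loss(enhanced, original, size=16):
    # Lspa: squared difference between neighbouring-region contrasts of the
    # enhanced image (Y) and of the input image (I); each horizontal and
    # vertical neighbour pair is counted once.
    Y, I = region_means(enhanced, size), region_means(original, size)
    dy_h, di_h = np.abs(np.diff(Y, axis=1)), np.abs(np.diff(I, axis=1))
    dy_v, di_v = np.abs(np.diff(Y, axis=0)), np.abs(np.diff(I, axis=0))
    return ((dy_h - di_h) ** 2).mean() + ((dy_v - di_v) ** 2).mean()

def exposure_control_loss(enhanced, size=16, E=0.6):
    # Lexp: mean distance of each local region's average intensity Y_K from a
    # target well-exposedness level E (E = 0.6 is an assumed value).
    return np.abs(region_means(enhanced, size) - E).mean()

def color_constancy_loss(enhanced):
    # Lcol: squared differences between the per-channel means J^p, J^q over the
    # channel pairs (R,G), (R,B), (G,B).
    J = enhanced.reshape(-1, 3).mean(axis=0)
    return sum((J[p] - J[q]) ** 2 for p, q in ((0, 1), (0, 2), (1, 2)))

def luminance_smoothness_loss(curve_maps):
    # Ltv: total-variation style penalty on the per-iteration curve parameter
    # maps (one H x W x 3 map per enhancement iteration); squared horizontal
    # and vertical gradients are averaged, a common simplification.
    loss = 0.0
    for A in curve_maps:
        loss += (np.diff(A, axis=1) ** 2).mean() + (np.diff(A, axis=0) ** 2).mean()
    return loss / len(curve_maps)

# toy usage with random images in [0, 1]
rng = np.random.default_rng(0)
low, enh = 0.2 * rng.random((256, 256, 3)), rng.random((256, 256, 3))
print(spatial_consistency_loss(enh, low), exposure_control_loss(enh),
      color_constancy_loss(enh), luminance_smoothness_loss([enh]))
```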
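S3.1 then combines an aesthetic term Leva = α·f1 + β·f2 with the four quality losses, so that smaller quality losses and a larger aesthetic score both increase the immediate reward r(t), and R(t) is the discounted sum of future immediate rewards. The sketch below shows one way to wire this up, assuming Model1 and Model2 are arbitrary callables returning scalar scores and that the terms are combined by plain subtraction with α = β = 0.5; the claim fixes only the direction of each term, not these weights.

```python
from typing import Callable, Sequence

def aesthetic_quality_loss(image, model1: Callable, model2: Callable,
                           alpha: float = 0.5, beta: float = 0.5) -> float:
    # Leva = alpha * f1 + beta * f2, where f1/f2 are the colour-and-luminance
    # and quality scores of the two aesthetic models (higher is better).
    return alpha * model1(image) + beta * model2(image)

def immediate_reward(l_spa: float, l_exp: float, l_col: float,
                     l_tv: float, l_eva: float) -> float:
    # r(t): smaller no-reference losses and a larger aesthetic score give a
    # larger immediate environmental reward (an assumed unweighted combination).
    return l_eva - (l_spa + l_exp + l_col + l_tv)

def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    # R(t) = sum_i gamma**i * r(t + i), the environmental reward value.
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# toy usage with constant stand-in aesthetic scorers
l_eva = aesthetic_quality_loss(None, lambda img: 7.0, lambda img: 6.0)
r0 = immediate_reward(0.10, 0.20, 0.05, 0.02, l_eva)
print(r0, discounted_return([r0, 0.9 * r0, 0.8 * r0]))
```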
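Steps S3.3 and S3.4 update the value and policy networks from the environmental reward value R(t) and the predicted value V(s(t)), which matches a standard actor-critic update. The PyTorch sketch below is an assumed minimal instantiation: the state dimension, the action-space size |A|, the two-layer network shapes, the optimizer, and the learning rate are illustrative choices not specified by the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, NUM_ACTIONS = 64, 8            # assumed dimensions of s(t) and |A|

policy_net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                           nn.Linear(128, NUM_ACTIONS))   # outputs |A| logits
value_net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                          nn.Linear(128, 1))              # outputs V(s(t))
optimizer = torch.optim.Adam(list(policy_net.parameters()) +
                             list(value_net.parameters()), lr=1e-4)

def actor_critic_step(state: torch.Tensor, action: int, R_t: float):
    # One update of theta_p and theta_v from a state s(t), the action a(t)
    # actually taken, and the environmental reward value R(t).
    logits = policy_net(state)
    log_pi = F.log_softmax(logits, dim=-1)[action]   # log pi(a(t)|s(t))
    value = value_net(state).squeeze()               # V(s(t))
    advantage = R_t - value                          # R(t) - V(s(t))

    value_loss = advantage.pow(2)                    # S3.3: (R(t) - V(s(t)))^2
    policy_loss = -log_pi * advantage.detach()       # S3.4: policy-gradient term
    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()

# toy usage with a random state vector
actor_critic_step(torch.randn(STATE_DIM), action=3, R_t=1.2)
```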
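Read as a whole, S1 to S5 describe a train-then-apply workflow: build a dataset of non-normal-luminance images, initialize the networks, iterate reward-driven updates over every sample and training iteration, and finally run the trained policy on a low-light input. The skeleton below only shows that control flow; every function body is a stub (the dataset generator just produces dark random images), and the epoch and image counts are arbitrary assumptions.

```python
import numpy as np

def generate_training_set(num_images=8, size=64):
    # S1: stand-in for synthesising non-normal-luminance images of different scenes.
    rng = np.random.default_rng(0)
    return [0.2 * rng.random((size, size, 3)) for _ in range(num_images)]

def init_networks():
    # S2: initialise the policy and value networks (placeholders here).
    return {"policy": None, "value": None}

def update_networks(nets, image):
    # S3: compute the no-reference and aesthetic rewards and update both
    # networks; see the loss/reward and actor-critic sketches above.
    return nets

def enhance(nets, image):
    # S5: apply the learned enhancement policy (identity stub).
    return image

dataset = generate_training_set()          # S1
nets = init_networks()                     # S2
for epoch in range(3):                     # S4: all training iterations
    for img in dataset:                    # S4: all samples
        nets = update_networks(nets, img)  # S3
result = enhance(nets, dataset[0])         # S5
```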