US 12,254,385 B2
	Method for multi-time scale voltage quality control based on reinforcement learning in a power distribution network
Wenchuan Wu, Beijing (CN); Haotian Liu, Beijing (CN); Hongbin Sun, Beijing (CN); Bin Wang, Beijing (CN); and Qinglai Guo, Beijing (CN)
Assigned to Tsinghua University, Beijing (CN)
Filed by Tsinghua University, Beijing (CN)
Filed on Jul. 30, 2021, as Appl. No. 17/389,558.
Claims priority of application No. 202110672200.1 (CN), filed on Jun. 17, 2021.
Prior Publication US 2022/0405633 A1, Dec. 22, 2022
Int. Cl. G06N 20/00 (2019.01)

CPC G06N 20/00 (2019.01)

1 Claim

1. A method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network, comprising:

determining a multi-time scale reactive voltage control object based on a reactive voltage control object of a slow discrete device and a reactive voltage control object of a fast continuous device in a controlled power distribution network, and establishing constraints for multi-time scale reactive voltage optimization, to constitute an optimization model for multi-time scale reactive voltage control in the power distribution network;

constructing a hierarchical interaction training framework based on a two-layer Markov decision process based on the model;

setting a slow agent for the slow discrete device and setting a fast agent for the fast continuous device; and

performing online control with the slow agent and the fast agent, in which action values of the controlled devices are decided by each agent based on measurement information inputted, so as to realize the multi-time scale reactive voltage control while the slow agent and the fast agent perform continuous online learning and updating,

wherein a power distribution network model of the power distribution network is incomplete, the power distribution network comprises the slow discrete device and the fast continuous device, the slow discrete device comprises an on-load tap changer and a capacitor station, and the fast continuous device comprises a distributed generation and a static var compensator; and

wherein the method further comprises:

1) determining the multi-time scale reactive voltage control object and establishing the constraints for multi-time scale reactive voltage optimization, to constitute the optimization model for multi-time scale reactive voltage control in the power distribution network comprises:

1-1) determining the multi-time scale reactive voltage control object of the controlled power distribution network:

where T is a number of control cycles of the slow discrete device in one day; k is an integer which represents a multiple of a number of control cycles of the fast continuous device to the number of control cycles of the slow discrete device in one day; T=kT is the number of control cycles of the fast continuous device in one day; t is a number of control cycles of the slow discrete device; T_Ois a gear of the on-load tap changer OLTC; T_Bis a gear of the capacitor station; Q_Gis a reactive power output of the distributed generation DG; Q_Cis a reactive power output of the static var compensator SVC; C_O, C_B, C_Prespectively are an OLTC adjustment cost, a capacitor station adjustment cost and an active power network loss cost; P_loss^(kt+τ)is a power distribution network loss at the moment kt+τ, τ being an integer, loss τ=0, 1, 2, . . . , k−1; and

T_O,loss^(kt)loss is a gear change adjusted by the OLTC at the moment kt, and T_B,loss^(kt)is a gear change adjusted by the capacitor station at the moment kt, which are respectively calculated by the following formulas:

where T_O,i^(kt)is a gear set value of an i^thOLTC device at the moment kt, n_OLTCis a total number of OLTC devices; T_B,i^(kt)is a gear set value of an i^thcapacitor station at the moment kt, and n_CBis a total number of capacitor stations;

1-2) establishing the constraints for multi-time scale reactive voltage optimization in the controlled power distribution network which include:

voltage constraints and output constraints:

V≤V_i^(kt+τ)≤V,

|Q_Gi^(kt+τ)|≤√S_Gi²−(P_Gi^{(k{tilde over (t+τ)})²)},

Q_Ci≤Q_Ci^(kt+τ)≤Q_Ci,

∀i∈N,t∈[0,T),τ∈[0,k) (0.3)

where N is a set of all nodes in the power distribution network, V_i^(kt+τ)is a voltage magnitude of the node i at the moment kt+τ, V,V are a lower limit and an upper limit of the node voltage respectively; Q_Gi^(kt+τ)is the DG reactive power output of the node i at the moment kt+τ; Q_Ci^(kt+τ)is the SVC reactive power output of the node i at the moment kt+τ; Q_Ci, Q_Ci are a lower limit and an upper limit of the SVC reactive power output of the node i; S_Gis a DG installed capacity of the node i; P_Gi^(kt+τ)is a DG active power output at the moment kt+τ of the node i;

adjustment constraints:

1≤T_O,i^(kt)≤T_O,i,t>0i∈[1,n_OLTC]

1≤T_B,i^(kt)≤T_B,i,t>0i∈[1,n_CB] (0.4)

where T_O,i, is a number of gears of the i^thOLTC device, and T_B,i is a number of gears of the i^thcapacitor station;

2) constructing the hierarchical interaction training framework based on the two-layer Markov decision process based on the optimization model established in step 1) and actual configuration of the power distribution network, comprises:

2-1) corresponding to system measurements of the power distribution network, constructing a state observation S at the moment t shown in the following formula:

s=(P,Q, V,T_O, T_B)_t (0.5)

where P, Q are vectors composed of active power injections and reactive power injections at respective nodes in the power distribution network respectively; V is a vector composed of respective node voltages in the power distribution network; T_Ois a vector composed of respective OLTC gears, and T_Bis a vector composed of respective capacitor station gears; t is a discrete time variable of the control process, (·)_trepresents a value measured at the moment t;

2-2) corresponding to the multi-time scale reactive voltage optimization object, constructing feedback variable r_fof the fast continuous device shown in the following formula:

where s, a, s′ are a state observation at the moment t, an action of the fast continuous device at the moment t and a state observation at the moment t+1 respectively; P_loss(s′) is a network loss at the moment t+1; V_loss(s′) is a voltage deviation rate at the moment t+1; P_i(s′) is the active power output of the node i at the moment t+1; V_i(s′) is a voltage magnitude of the node i at the moment t+1; [x]₊=max (0,x); C_Vis a cost coefficient of voltage violation probability;

2-3) corresponding to the multi-time scale reactive voltage optimization object, constructing feedback variable r of the slow discrete device shown in the following formula:

r_s=−C_OT_O,loss(s,s′)−C_BT_B,loss(s,s′)−R_f({s_τ,a_τ|τ∈[0, k)},s_k) (0.7)

where s, s′ are a state observation at the moment kt and a state observation at the moment kt+k; T_O,loss(s,s′) is an OLTC adjustment cost generated by actions at the moment kt; T_B,loss(s, s′) is a capacitor station adjustment cost generated by actions at the moment kt; R_f({s_τ, a_τ|τ∈[0,k)}, s_k) is a feedback value of the fast continuous device accumulated between two actions of the slow discrete device, the calculation expression of which is as follows:

2-4) constructing an action variable q, of the fast agent and an action variable ã_tof the slow agent at the moment t shown in the following formula:

a_t=(Q_G, Q_C)_t

ã_t=(T_O, T_B)_t (0.9)

where Q_G, Q_Care vectors of the DG reactive power output and the SVC reactive power output in the power distribution network;

3) setting the slow agent to control the slow discrete device and setting the fast agent to control the fast continuous device, comprise:

3-1) the slow agent is a deep neural network including a slow strategy network π and a slow evaluation network Q_s^π, wherein an input of the slow strategy network π is s, an output is probability distribution of an action ã, and a parameter of the slow strategy network π is denoted as θ_s; an input of the slow evaluation network Q_s^π f is s, an output is an evaluation value of each action, and a parameter of the slow evaluation network Q_s^π are denoted as ϕ_s;

3-2) the fast agent is a deep neural network including a fast strategy network π and a fast evaluation network Q_f^π, wherein an input of the fast strategy network π is s, an output is probability distribution of the action a, and a parameter of the fast strategy network π is denoted as θ_f; an input of the fast evaluation network Q_f^π is (s,a), an output is an evaluation value of actions, and a parameter of the fast evaluation network Q_f^π is denoted as ϕ_f;

4) initializing parameters:

4-1) randomly initializing parameters of the neural networks corresponding to respective agents θ_s, θ_f, ϕ_s, ϕ_f;

4-2) inputting a maximum entropy parameter as of the slow agent and a maximum entropy parameter α_fof the fast agent;

4-3) initializing the discrete time variable as t=0, an actual time interval between two steps of the fast agent is Δt, and an actual time interval between two steps of the slow agent is kΔt;

4-4) initializing an action probability of the fast continuous device as p=−1;

4-5) initializing cache experience database as D_l=Ø and initializing agent experience database as D=Ø;

5) executing by the slow agent and the fast agent, the following control steps at the moment t:

5-1) judging if t mod k≠0: if yes, going to step 5-5) and if no, going to step 5-2);

5-2) obtaining by the slow agent, state information from measurement devices in the power distribution network;

5-3) judging if D_l+Ø: if yes, calculating r_s, adding an experience sample to D, updating D←D═{(s,ã,r_s,s′,D_l)} and going to step 5-4); if no, directly going to step 5-4);

5-4) updating s to s′;

5-5) generating the action ã of the slow discrete device with the slow strategy network π of the slow agent according to the state information s;

5-6) distributing ã to each slow discrete device to realize the reactive voltage control of each slow discrete device at the moment t;

5-7) obtaining by the fast agent, state information s′ from measurement devices in the power distribution network;

5-8) judging if p ≥0: if yes, calculating r_f, adding an experience sample to D_l, updating D_l←D_l∪{(s,a,r_f,s′, p)}, and going to step 5-9); if no, directly going to step 5-9);

5-9) updating s′ to s;

5-10) generating the action a of the fast continuous device with the fast strategy network π of the fast agent according to the state information s and updating p=π(a| s);

5-11) distributing a to each fast continuous device to realize the reactive voltage control of each fast continuous device at the moment t and going to step 6);

6) judging t mod k=0: if yes, going to step 6-1); if no, going to step 7);

6-1) randomly selecting a set of experiences D^B∈D from the agent experience database D, wherein a number of samples in the set of experiences is B;

6-2) calculating a loss function of the parameter ϕ_swith each sample in D^B: