US 12,254,385 B2
Method for multi-time scale voltage quality control based on reinforcement learning in a power distribution network
Wenchuan Wu, Beijing (CN); Haotian Liu, Beijing (CN); Hongbin Sun, Beijing (CN); Bin Wang, Beijing (CN); and Qinglai Guo, Beijing (CN)
Assigned to Tsinghua University, Beijing (CN)
Filed by Tsinghua University, Beijing (CN)
Filed on Jul. 30, 2021, as Appl. No. 17/389,558.
Claims priority of application No. 202110672200.1 (CN), filed on Jun. 17, 2021.
Prior Publication US 2022/0405633 A1, Dec. 22, 2022
Int. Cl. G06N 20/00 (2019.01)
CPC G06N 20/00 (2019.01) 1 Claim
 
1. A method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network, comprising:
determining a multi-time scale reactive voltage control object based on a reactive voltage control object of a slow discrete device and a reactive voltage control object of a fast continuous device in a controlled power distribution network, and establishing constraints for multi-time scale reactive voltage optimization, to constitute an optimization model for multi-time scale reactive voltage control in the power distribution network;
constructing, based on the model, a hierarchical interaction training framework built on a two-layer Markov decision process;
setting a slow agent for the slow discrete device and setting a fast agent for the fast continuous device; and
performing online control with the slow agent and the fast agent, in which action values of the controlled devices are decided by each agent based on measurement information inputted, so as to realize the multi-time scale reactive voltage control while the slow agent and the fast agent perform continuous online learning and updating,
wherein a power distribution network model of the power distribution network is incomplete, the power distribution network comprises the slow discrete device and the fast continuous device, the slow discrete device comprises an on-load tap changer and a capacitor station, and the fast continuous device comprises a distributed generation and a static var compensator; and
wherein the method further comprises:
1) determining the multi-time scale reactive voltage control object and establishing the constraints for multi-time scale reactive voltage optimization, to constitute the optimization model for multi-time scale reactive voltage control in the power distribution network comprises:
1-1) determining the multi-time scale reactive voltage control object of the controlled power distribution network:

min_{TO,TB,QG,QC} Σ_{t=0}^{T−1} [CO·TO,loss(kt) + CB·TB,loss(kt) + CP·Σ_{τ=0}^{k−1} Ploss(kt+τ)]  (0.1)
where T is a number of control cycles of the slow discrete device in one day; k is an integer which represents a multiple of the number of control cycles of the fast continuous device to the number of control cycles of the slow discrete device, so that kT is the number of control cycles of the fast continuous device in one day; t is an index of the control cycles of the slow discrete device; TO is a gear of the on-load tap changer OLTC; TB is a gear of the capacitor station; QG is a reactive power output of the distributed generation DG; QC is a reactive power output of the static var compensator SVC; CO, CB, CP respectively are an OLTC adjustment cost, a capacitor station adjustment cost and an active power network loss cost; Ploss(kt+τ) is a power distribution network loss at the moment kt+τ, τ being an integer, τ=0, 1, 2, . . . , k−1; and
TO,loss(kt) is a gear change adjusted by the OLTC at the moment kt, and TB,loss(kt) is a gear change adjusted by the capacitor station at the moment kt, which are respectively calculated by the following formulas:

TO,loss(kt) = Σ_{i=1}^{nOLTC} |TO,i(kt) − TO,i(kt−k)|,
TB,loss(kt) = Σ_{i=1}^{nCB} |TB,i(kt) − TB,i(kt−k)|  (0.2)
where TO,i(kt) is a gear set value of an ith OLTC device at the moment kt, nOLTC is a total number of OLTC devices; TB,i(kt) is a gear set value of an ith capacitor station at the moment kt, and nCB is a total number of capacitor stations;
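
For illustration, a minimal numerical sketch of the daily objective of step 1-1), assuming it accumulates the OLTC and capacitor adjustment costs together with the active-power-loss cost over all slow control cycles; the cost coefficients and array layouts below are hypothetical:

    import numpy as np

    def gear_change(gears_now: np.ndarray, gears_prev: np.ndarray) -> float:
        """Total gear movement of a device group between two slow control moments."""
        return float(np.abs(gears_now - gears_prev).sum())

    def daily_objective(oltc_gears, cb_gears, p_loss, c_o=1.0, c_b=1.0, c_p=1.0, k=4):
        """oltc_gears, cb_gears: (T+1, n_devices) gear settings per slow cycle;
        p_loss: (k*T,) network loss at every fast control moment;
        c_o, c_b, c_p: hypothetical adjustment and loss cost coefficients."""
        T = oltc_gears.shape[0] - 1
        cost = 0.0
        for t in range(T):
            cost += c_o * gear_change(oltc_gears[t + 1], oltc_gears[t])   # OLTC adjustment cost
            cost += c_b * gear_change(cb_gears[t + 1], cb_gears[t])       # capacitor adjustment cost
            cost += c_p * p_loss[k * t: k * (t + 1)].sum()                # active power loss cost
        return cost
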
1-2) establishing the constraints for multi-time scale reactive voltage optimization in the controlled power distribution network which include:
voltage constraints and output constraints:
V̲ ≤ Vi(kt+τ) ≤ V̄,
|QGi(kt+τ)| ≤ √(SGi² − PGi(kt+τ)²),
Q̲Ci ≤ QCi(kt+τ) ≤ Q̄Ci,
i∈N, t∈[0,T), τ∈[0,k)  (0.3)
where N is a set of all nodes in the power distribution network, Vi(kt+τ) is a voltage magnitude of the node i at the moment kt+τ, V̲, V̄ are a lower limit and an upper limit of the node voltage respectively; QGi(kt+τ) is the DG reactive power output of the node i at the moment kt+τ; QCi(kt+τ) is the SVC reactive power output of the node i at the moment kt+τ; Q̲Ci, Q̄Ci are a lower limit and an upper limit of the SVC reactive power output of the node i; SGi is a DG installed capacity of the node i; PGi(kt+τ) is a DG active power output of the node i at the moment kt+τ;
adjustment constraints:
1 ≤ TO,i(kt) ≤ T̄O,i, t>0, i∈[1,nOLTC]
1 ≤ TB,i(kt) ≤ T̄B,i, t>0, i∈[1,nCB]  (0.4)
where T̄O,i is the number of gears of the ith OLTC device, and T̄B,i is the number of gears of the ith capacitor station;
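
A small helper, under hypothetical per-unit voltage limits, that checks whether the voltage and output constraints of step 1-2) and the gear-range constraints hold at one control moment:

    import numpy as np

    def check_outputs(v, q_g, p_g, s_g, q_c, q_c_min, q_c_max, v_min=0.95, v_max=1.05):
        """Voltage and output constraints at one moment; v_min/v_max are hypothetical
        per-unit limits, the other arguments are nodal vectors."""
        voltage_ok = np.all((v >= v_min) & (v <= v_max))
        # DG reactive output is limited by the remaining apparent-power capacity
        dg_ok = np.all(np.abs(q_g) <= np.sqrt(np.maximum(s_g ** 2 - p_g ** 2, 0.0)))
        svc_ok = np.all((q_c >= q_c_min) & (q_c <= q_c_max))
        return bool(voltage_ok and dg_ok and svc_ok)

    def check_gears(gears, gear_counts):
        """Adjustment constraints: every gear set value lies between 1 and the
        number of gears of its device."""
        return bool(np.all((gears >= 1) & (gears <= gear_counts)))
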
2) constructing the hierarchical interaction training framework built on the two-layer Markov decision process, based on the optimization model established in step 1) and an actual configuration of the power distribution network, comprises:
2-1) corresponding to system measurements of the power distribution network, constructing a state observation s at the moment t shown in the following formula:
s=(P,Q, V,TO, TB)t  (0.5)
where P, Q are vectors composed of active power injections and reactive power injections at respective nodes in the power distribution network respectively; V is a vector composed of respective node voltages in the power distribution network; TO is a vector composed of respective OLTC gears, and TB is a vector composed of respective capacitor station gears; t is a discrete time variable of the control process, (·)t represents a value measured at the moment t;
2-2) corresponding to the multi-time scale reactive voltage optimization object, constructing feedback variable rf of the fast continuous device shown in the following formula:

rf(s,a,s′) = −Ploss(s′) − CV·Vloss(s′),
Ploss(s′) = Σ_{i∈N} Pi(s′),
Vloss(s′) = Σ_{i∈N} ([Vi(s′) − V̄]+ + [V̲ − Vi(s′)]+)  (0.6)
where s, a, s′ are a state observation at the moment t, an action of the fast continuous device at the moment t and a state observation at the moment t+1 respectively; Ploss(s′) is a network loss at the moment t+1; Vloss(s′) is a voltage deviation rate at the moment t+1; Pi(s′) is the active power output of the node i at the moment t+1; Vi(s′) is a voltage magnitude of the node i at the moment t+1; [x]+=max (0,x); CV is a cost coefficient of voltage violation probability;
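
A minimal sketch of the fast feedback rf of step 2-2), assuming it combines the negative network loss with a penalty on voltage-limit violations weighted by CV; the voltage limits, the coefficient value, and the use of summed nodal injections as the loss are assumptions:

    import numpy as np

    def fast_feedback(p_inj, v, v_min=0.95, v_max=1.05, c_v=10.0):
        """Hypothetical fast feedback r_f: negative network loss minus a weighted
        voltage-violation term. p_inj: nodal active power injections (their sum is
        taken here as the network loss); v: nodal voltage magnitudes."""
        p_loss = float(np.sum(p_inj))
        v_loss = float(np.sum(np.maximum(v - v_max, 0.0) + np.maximum(v_min - v, 0.0)))
        return -p_loss - c_v * v_loss
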
2-3) corresponding to the multi-time scale reactive voltage optimization object, constructing feedback variable rs of the slow discrete device shown in the following formula:
rs=−COTO,loss(s,s′)−CBTB,loss(s,s′)−Rf({sτ,aτ|τ∈[0, k)},sk)  (0.7)
where s, s′ are a state observation at the moment kt and a state observation at the moment kt+k; TO,loss(s,s′) is an OLTC adjustment cost generated by actions at the moment kt; TB,loss (s, s′) is a capacitor station adjustment cost generated by actions at the moment kt; Rf({sτ, aτ|τ∈[0,k)}, sk) is a feedback value of the fast continuous device accumulated between two actions of the slow discrete device, the calculation expression of which is as follows:

Rf({sτ,aτ|τ∈[0,k)},sk) = Σ_{τ=0}^{k−1} rf(sτ,aτ,sτ+1)  (0.8)
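
A corresponding sketch of the slow feedback rs, mirroring formula (0.7), with the fast feedback accumulated over the cache Dl between two slow actions; the cost coefficients are hypothetical:

    def slow_feedback(oltc_gear_change, cb_gear_change, cached_fast_feedbacks,
                      c_o=1.0, c_b=1.0):
        """Hypothetical slow feedback r_s following formula (0.7): adjustment costs
        of the OLTC and capacitor stations combined with the fast feedback R_f
        accumulated in the cache D_l between two slow actions."""
        r_f_acc = sum(cached_fast_feedbacks)      # R_f over the k fast steps
        return -c_o * oltc_gear_change - c_b * cb_gear_change - r_f_acc
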
2-4) constructing an action variable at of the fast agent and an action variable ãt of the slow agent at the moment t shown in the following formula:
at=(QG, QC)t
ãt=(TO, TB)t  (0.9)
where QG, QC are vectors of the DG reactive power output and the SVC reactive power output in the power distribution network;
3) setting the slow agent to control the slow discrete device and setting the fast agent to control the fast continuous device, comprise:
3-1) the slow agent is a deep neural network including a slow strategy network π and a slow evaluation network Qsπ, wherein an input of the slow strategy network π is s, an output is probability distribution of an action ã, and a parameter of the slow strategy network π is denoted as θs; an input of the slow evaluation network Qsπ is s, an output is an evaluation value of each action, and a parameter of the slow evaluation network Qsπ is denoted as ϕs;
3-2) the fast agent is a deep neural network including a fast strategy network π and a fast evaluation network Qfπ, wherein an input of the fast strategy network π is s, an output is probability distribution of the action a, and a parameter of the fast strategy network π is denoted as θf; an input of the fast evaluation network Qfπ is (s,a), an output is an evaluation value of actions, and a parameter of the fast evaluation network Qfπ is denoted as ϕf;
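
A minimal PyTorch sketch of the four networks of step 3); the hidden sizes, the flat categorical encoding of the discrete gear actions, and the Gaussian parameterization of the continuous policy are assumptions:

    import torch
    import torch.nn as nn

    class SlowAgentNets(nn.Module):
        """Slow agent (step 3-1): strategy network π(ã|s) over discrete gear actions
        (parameters θ_s) and evaluation network Q_s(s,·) with one value per action
        (parameters φ_s)."""
        def __init__(self, state_dim, n_discrete_actions, hidden=128):
            super().__init__()
            self.policy = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, n_discrete_actions),
                                        nn.Softmax(dim=-1))
            self.q = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_discrete_actions))

    class FastAgentNets(nn.Module):
        """Fast agent (step 3-2): Gaussian strategy network π(a|s) over continuous
        reactive-power set values (parameters θ_f) and evaluation network Q_f(s,a)
        scoring a state-action pair (parameters φ_f)."""
        def __init__(self, state_dim, action_dim, hidden=128):
            super().__init__()
            self.policy_mu = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                           nn.Linear(hidden, action_dim))
            self.log_std = nn.Parameter(torch.zeros(action_dim))
            self.q = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
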
4) initializing parameters:
4-1) randomly initializing parameters of the neural networks corresponding to respective agents θs, θf, ϕs, ϕf;
4-2) inputting a maximum entropy parameter αs of the slow agent and a maximum entropy parameter αf of the fast agent;
4-3) initializing the discrete time variable as t=0, an actual time interval between two steps of the fast agent is Δt, and an actual time interval between two steps of the slow agent is kΔt;
4-4) initializing an action probability of the fast continuous device as p=−1;
4-5) initializing cache experience database as Dl=Ø and initializing agent experience database as D=Ø;
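
A short initialization sketch for step 4); every numeric value is a hypothetical placeholder:

    import collections

    alpha_s, alpha_f = 0.2, 0.2                    # maximum entropy parameters α_s, α_f
    k, delta_t = 4, 300                            # fast steps per slow step; fast interval Δt (s)
    t = 0                                          # discrete time variable
    p = -1.0                                       # fast action probability; -1 marks "no action yet"
    cache_db = collections.deque()                 # D_l: fast experiences cached between slow actions
    agent_db = collections.deque(maxlen=100_000)   # D: agent experience database
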
5) executing by the slow agent and the fast agent, the following control steps at the moment t:
5-1) judging if t mod k≠0: if yes, going to step 5-7) and if no, going to step 5-2);
5-2) obtaining by the slow agent, state information from measurement devices in the power distribution network;
5-3) judging if Dl≠Ø: if yes, calculating rs, adding an experience sample to D, updating D←D∪{(s,ã,rs,s′,Dl)} and going to step 5-4); if no, directly going to step 5-4);
5-4) updating s to s′;
5-5) generating the action ã of the slow discrete device with the slow strategy network π of the slow agent according to the state information s;
5-6) distributing ã to each slow discrete device to realize the reactive voltage control of each slow discrete device at the moment t;
5-7) obtaining by the fast agent, state information s′ from measurement devices in the power distribution network;
5-8) judging if p ≥0: if yes, calculating rf, adding an experience sample to Dl, updating Dl←Dl∪{(s,a,rf,s′, p)}, and going to step 5-9); if no, directly going to step 5-9);
5-9) updating s to s′;
5-10) generating the action a of the fast continuous device with the fast strategy network π of the fast agent according to the state information s and updating p=π(a| s);
5-11) distributing a to each fast continuous device to realize the reactive voltage control of each fast continuous device at the moment t and going to step 6);
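
The branching of steps 5-1) through 5-11) can be summarized as one schematic pass; measure, dispatch_slow, dispatch_fast and the agent methods used here are hypothetical stand-ins for the measurement and set-point distribution interfaces:

    def control_step(t, k, slow_agent, fast_agent, measure, dispatch_slow,
                     dispatch_fast, mem):
        """One pass through step 5 at discrete time t. mem holds s, the last fast
        action a and its probability p, the last slow pair (s, ã), and the
        databases D_l and D."""
        if t % k == 0:                                       # 5-1): slow branch on every k-th step
            s_new = measure()                                # 5-2)
            if mem["D_l"]:                                   # 5-3): close the previous slow transition
                r_s = slow_agent.feedback(mem["slow_pair"], s_new, mem["D_l"])
                mem["D"].append((*mem["slow_pair"], r_s, s_new, list(mem["D_l"])))
                mem["D_l"].clear()
            mem["s"] = s_new                                 # 5-4)
            a_slow = slow_agent.act(mem["s"])                # 5-5)
            dispatch_slow(a_slow)                            # 5-6)
            mem["slow_pair"] = (mem["s"], a_slow)
        s_new = measure()                                    # 5-7)
        if mem["p"] >= 0:                                    # 5-8): close the previous fast transition
            r_f = fast_agent.feedback(mem["s"], mem["a"], s_new)
            mem["D_l"].append((mem["s"], mem["a"], r_f, s_new, mem["p"]))
        mem["s"] = s_new                                     # 5-9)
        mem["a"], mem["p"] = fast_agent.act(mem["s"])        # 5-10): action and its probability
        dispatch_fast(mem["a"])                              # 5-11)
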
6) judging t mod k=0: if yes, going to step 6-1); if no, going to step 7);
6-1) randomly selecting a set of experiences DB⊆D from the agent experience database D, wherein a number of samples in the set of experiences is B;
6-2) calculating a loss function of the parameter ϕs with each sample in DB:

L(ϕs) = (1/B)·Σ_{(s,ã,rs,s′,Dl)∈DB} (Qsπ(s,ã) − ys)²  (0.10)
where
ys=rs+γ[Qsπ(s′,ã′)−αs log π(ã′|s′)]  (0.11)
where ã′∼π(·|s′) and γ is a discount factor;

OG Complex Work Unit Math
6-3) updating the parameter ϕs:
ϕs←ϕs−ρs∇ϕsL(ϕs)  (0.13)
where ρs is a learning step length of the slow discrete device;
6-4) calculating a loss function of the parameter θs:

L(θs) = (1/B)·Σ_{s∈DB} E_{ã∼π(·|s)}[αs·log π(ã|s) − Qsπ(s,ã)]  (0.14)
6-5) updating the parameter θs:
θs←θs−ρs∇θsL(θs)  (0.15)
and going to step 7);
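
A minimal soft-actor-critic-style sketch of the slow learning steps 6-2) through 6-5), built around the target ys of formula (0.11); the batch layout, the expectation over discrete actions, the hyperparameter values and the omission of the cached Dl field are assumptions:

    import torch
    import torch.nn.functional as F

    def update_slow_agent(nets, batch, alpha_s=0.2, gamma=0.99, rho_s=1e-3):
        """SAC-style update of the slow agent. batch = (s, a, r_s, s_next) tensors,
        with a holding discrete action indices; nets is a SlowAgentNets instance."""
        for prm in nets.parameters():
            prm.grad = None
        s, a, r_s, s_next = batch
        with torch.no_grad():                                  # target y_s, cf. (0.11)
            probs_next = nets.policy(s_next)
            q_next = nets.q(s_next)
            log_next = probs_next.clamp_min(1e-8).log()
            y_s = r_s + gamma * (probs_next * (q_next - alpha_s * log_next)).sum(-1)
        q_pred = nets.q(s).gather(1, a.unsqueeze(1)).squeeze(1)
        critic_loss = F.mse_loss(q_pred, y_s)                  # L(φ_s)
        critic_loss.backward()
        with torch.no_grad():                                  # φ_s ← φ_s − ρ_s ∇ L(φ_s)
            for prm in nets.q.parameters():
                prm -= rho_s * prm.grad
                prm.grad = None
        probs = nets.policy(s)
        log_probs = probs.clamp_min(1e-8).log()
        actor_loss = (probs * (alpha_s * log_probs - nets.q(s).detach())).sum(-1).mean()  # L(θ_s)
        actor_loss.backward()
        with torch.no_grad():                                  # θ_s ← θ_s − ρ_s ∇ L(θ_s)
            for prm in nets.policy.parameters():
                prm -= rho_s * prm.grad
                prm.grad = None
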
7) executing by the fast agent, the following learning steps at the moment t:
7-1) randomly selecting a set of experiences DB⊆D from the agent experience database D, wherein a number of samples in the set of experiences is B;
7-2) calculating a loss function of the parameter ϕf with each sample in DB:

L(ϕf) = (1/B)·Σ_{(s,a,rf,s′,p)∈DB} (Qfπ(s,a) − yf)²  (0.16)
where
yf=rf+γ[Qfπ(s′,a′)−αf log π(a′|s′)]  (0.17)
where a′∼π(·|s′);
7-3) updating the parameter ϕf:
ϕf←ϕf−ρf∇ϕfL(ϕf)  (0.18)
where ρf is a learning step length of the fast continuous device;
7-4) calculating a loss function of the parameter θf:

L(θf) = (1/B)·Σ_{s∈DB} E_{a∼π(·|s)}[αf·log π(a|s) − Qfπ(s,a)]  (0.19)
7-5) updating the parameter θf:
θf←θf−ρf∇θfL(θf)  (0.20)
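
An analogous sketch of the fast learning steps 7-2) through 7-5), built around the target yf of formula (0.17); the batch layout, the Gaussian policy and the hyperparameter values are again assumptions:

    import torch
    import torch.nn.functional as F
    from torch.distributions import Normal

    def update_fast_agent(nets, batch, alpha_f=0.2, gamma=0.99, rho_f=1e-3):
        """SAC-style update of the fast agent. batch = (s, a, r_f, s_next) tensors
        with continuous actions; nets is a FastAgentNets instance."""
        for prm in nets.parameters():
            prm.grad = None
        s, a, r_f, s_next = batch
        with torch.no_grad():                                  # target y_f, cf. (0.17)
            dist_next = Normal(nets.policy_mu(s_next), nets.log_std.exp())
            a_next = dist_next.sample()
            logp_next = dist_next.log_prob(a_next).sum(-1)
            q_next = nets.q(torch.cat([s_next, a_next], dim=-1)).squeeze(-1)
            y_f = r_f + gamma * (q_next - alpha_f * logp_next)
        q_pred = nets.q(torch.cat([s, a], dim=-1)).squeeze(-1)
        critic_loss = F.mse_loss(q_pred, y_f)                  # L(φ_f)
        critic_loss.backward()
        with torch.no_grad():                                  # φ_f ← φ_f − ρ_f ∇ L(φ_f)
            for prm in nets.q.parameters():
                prm -= rho_f * prm.grad
                prm.grad = None
        dist = Normal(nets.policy_mu(s), nets.log_std.exp())
        a_new = dist.rsample()                                 # reparameterized action sample
        logp = dist.log_prob(a_new).sum(-1)
        q_new = nets.q(torch.cat([s, a_new], dim=-1)).squeeze(-1)
        actor_loss = (alpha_f * logp - q_new).mean()           # L(θ_f)
        actor_loss.backward()
        with torch.no_grad():                                  # θ_f ← θ_f − ρ_f ∇ L(θ_f)
            for prm in list(nets.policy_mu.parameters()) + [nets.log_std]:
                prm -= rho_f * prm.grad
                prm.grad = None
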
8) letting t=t+1 and returning to step 5).