US 12,149,078 B2
Method for intelligently adjusting power flow based on Q-learning algorithm
Jian Yang, Zhejiang (CN); Dongbo Zhang, Zhejiang (CN); Xinjian Chen, Zhejiang (CN); Yilun Zhu, Zhejiang (CN); Jie Yu, Zhejiang (CN); Daojian Hong, Zhejiang (CN); Zhouhong Wang, Zhejiang (CN); Chenghuai Hong, Zhejiang (CN); Zihuai Zheng, Zhejiang (CN); Huiying Gao, Zhejiang (CN); Minyan Xia, Zhejiang (CN); Bingren Wang, Zhejiang (CN); Guode Ying, Zhejiang (CN); and Yizhi Zhu, Zhejiang (CN)
Assigned to STATE GRID ZHEJIANG ELECTRIC POWER CO., LTD., Taizhou (CN); and TAIZHOU POWER SUPPLY COMPANY, Taizhou (CN)
Appl. No. 15/734,989
Filed by STATE GRID ZHEJIANG ELECTRIC POWER CO., LTD. TAIZHOU POWER SUPPLY COMPANY, Zhejiang (CN)
PCT Filed Oct. 11, 2020, PCT No. PCT/CN2020/120259 § 371(c)(1), (2) Date Dec. 3, 2020, PCT Pub. No. WO2021/093493, PCT Pub. Date May 20, 2021.
Claims priority of application No. 201911123269.8 (CN), filed on Nov. 16, 2019.
Prior Publication US 2021/0367426 A1, Nov. 25, 2021
Int. Cl. G06N 20/00 (2019.01); G06Q 50/06 (2012.01); H02J 3/06 (2006.01)

CPC H02J 3/06 (2013.01) [G06N 20/00 (2019.01); G06Q 50/06 (2013.01)]

2 Claims

1. A method for intelligently adjusting a power flow based on a Q-learning algorithm, comprising:

step 1: converting a variable, an action, and a goal in a power grid to a state, an action, and a reward in the algorithm, respectively;

step 2: selecting an action from an action space, giving an immediate reward based on a result of power flow calculation, and correcting a next state;

step 3: forwardly observing a next exploration action based on a strategy in the Q-learning algorithm;

step 4: updating a Q value in a corresponding position in a Q-value table based on the obtained reward;

step 5: if a final state is not reached, going back to step 2, or if a final state is reached, increasing the number of iterations by 1;

step 6: if the number of iterations does not reach a predetermined value K, that is, Episode < K, going back to step 2, or if the number of iterations reaches the predetermined value K, that is, Episode = K, outputting the Q-value table; and

step 7: outputting an optimal unit combination, and adjusting power flow for an IEEE 39-bus standard test system and an actually-operating power system based on the optimal unit combination;
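For illustration only, steps 2 to 7 can be read as a tabular Q-learning loop. The following is a minimal sketch under stated assumptions and is not the claimed implementation: the power flow calculation is replaced by a toy supply-versus-demand check, and the names CAPACITY, DEMAND, valid_actions, step, train and greedy_combination, the epsilon-greedy strategy, and the parameters alpha, gamma and epsilon are all hypothetical choices that do not appear in the claim.

```python
import random
from collections import defaultdict

# Hypothetical data: initial power outputs (MW) and the load the toy check must cover.
CAPACITY = [30.0, 50.0, 20.0]
DEMAND = 70.0

def valid_actions(state):
    # state is a tuple of 0/1 flags; each unit only offers the action that flips it.
    return [(i, 1 - on) for i, on in enumerate(state)]

def step(state, action):
    # Toy stand-in for the power flow calculation (assumption, not a real solver).
    unit, target = action
    nxt = list(state)
    nxt[unit] = target
    nxt = tuple(nxt)
    supply = sum(c for c, on in zip(CAPACITY, nxt) if on)
    converged = supply >= DEMAND              # stand-in for "power flow converges"
    # Surplus penalty stands in for network loss; a failed check costs -1.
    reward = 10.0 - 0.1 * (supply - DEMAND) if converged else -1.0
    return nxt, reward, converged

def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.3, max_steps=20, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)                    # Q-value table keyed by (state, action)
    for _ in range(episodes):                 # step 6: repeat until Episode = K
        state = (0,) * len(CAPACITY)
        for _ in range(max_steps):
            actions = valid_actions(state)
            if rng.random() < epsilon:        # step 2: select an action
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            nxt, reward, done = step(state, action)
            # step 3: look ahead to the best action available in the next state
            best_next = 0.0 if done else max(q[(nxt, a)] for a in valid_actions(nxt))
            # step 4: update the Q value at the corresponding table position
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
            if done:                          # step 5: final state reached
                break
    return q

def greedy_combination(q, max_steps=20):
    # step 7: follow the learned table greedily to read out a unit combination.
    state = (0,) * len(CAPACITY)
    for _ in range(max_steps):
        action = max(valid_actions(state), key=lambda a: q[(state, a)])
        state, _, done = step(state, action)
        if done:
            break
    return state
```

Calling greedy_combination(train()) returns a tuple of on/off flags, one per unit. In a real study the step function would call an actual power flow solver, and the reward would follow formula (3).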

wherein the step 1 comprises:

using active power output of each generation unit as the state and establishing an expression of a state space representing a combination of states of all generation units, as shown in formula (1):

S = {Power output of generation unit 1, Power output of generation unit 2, ..., Power output of generation unit N} (1),

wherein

S denotes the state space, and a value of N is a positive integer, and power output of each generation unit is a respective initial power output or 0;
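As a hedged illustration of formula (1), a state can be held as a tuple in which each generation unit contributes either its initial power output or 0. The function names state_from_status and enumerate_states and the sample outputs are hypothetical and are not taken from the claim.

```python
from itertools import product

def state_from_status(initial_output, status):
    # Each unit contributes its initial output when on, otherwise 0.
    return tuple(p if on else 0.0 for p, on in zip(initial_output, status))

def enumerate_states(initial_output):
    # All on/off combinations of N units, giving 2**N candidate states.
    n = len(initial_output)
    return [state_from_status(initial_output, s) for s in product((0, 1), repeat=n)]
```

For N units this enumeration grows as 2^N, so for larger systems a sketch like this would typically build states on demand rather than list them all.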

establishing an expression of an action space in which powering-on or powering-off of each generation unit is used as the action, as shown in formula (2):

A = {Powering on of generation unit 1, Powering off of generation unit 1, ..., Powering off of generation unit N} (2),

wherein

A denotes the action space, and a value of N is a positive integer, and wherein when each generation unit reaches a power-on state or a power-off state, an action corresponding to the state of the generation unit is removed from the action space; and
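One possible reading of formula (2) and the removal rule is sketched below: the full action space holds a powering-on and a powering-off action for every unit, and the action that matches a unit's current status is dropped because it would change nothing. The names full_action_space and available_actions and the "on"/"off" labels are assumptions made for this example.

```python
def full_action_space(n_units):
    # Two actions per unit: powering on and powering off (2 * N entries).
    return [(i, act) for i in range(n_units) for act in ("on", "off")]

def available_actions(status):
    # status holds one 0/1 flag per unit; a unit that is already on only
    # offers "off", and a unit that is already off only offers "on".
    return [(i, "off" if on else "on") for i, on in enumerate(status)]
```

Under this reading, available_actions always returns exactly one action per unit, so the agent never selects an action that leaves the state unchanged.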

reward design: establishing expression R of a reward design for adjusting power flow from non-convergent to convergent and simultaneously minimizing a network loss of the power grid, as shown in formula (3):