US 12,248,327 B1
Method for UAV path planning in urban airspace based on safe reinforcement learning
Xuejun Zhang, Beijing (CN); Yan Li, Beijing (CN); and Yuanjun Zhu, Beijing (CN)
Assigned to Beihang University, Beijing (CN)
Appl. No. 18/556,353
Filed by Beihang University, Beijing (CN)
PCT Filed Feb. 23, 2023, PCT No. PCT/CN2023/077843
§ 371(c)(1), (2) Date Oct. 20, 2023,
PCT Pub. No. WO2024/164367, PCT Pub. Date Aug. 15, 2024.
Claims priority of application No. 202310081273.2 (CN), filed on Feb. 8, 2023.
Int. Cl. G05D 1/622 (2024.01); G05D 1/46 (2024.01); G05D 1/49 (2024.01)
CPC G05D 1/637 (2024.01) [G05D 1/46 (2024.01); G05D 1/49 (2024.01)] 5 Claims
OG exemplary drawing
 
1. A method for UAV path planning in urban airspace based on safe reinforcement learning, comprising:
S1, collecting state information of a UAV, an urban airspace and an urban ground environment, and defining a state of the UAV at any moment t as s_t, wherein s_t = [x_t, y_t, z_t];
S2, constituting a safe reinforcement learning algorithm, called a shield-DDPG architecture, from four functional modules: an environment module, a neural network module, a shield module, and a replay buffer; and conducting training by the neural network module according to the state s_t, the neural network module comprising a main network and a target network; the shield module being constructed by linear temporal logic and specifically comprising a finite-state reactive system, a state trace, a safety specification, a Markov decision process, a safety automaton and an observe function, the shield module acting between a main actor network and a main critic network, the main actor network outputting an action u_t;
S3, determining, by the shield module, safety of an action a_t = u_t + f_t = [a_t^x, a_t^y, a_t^z], in which f_t = ε·D_t^D is an attractive force, ε is an attractive coefficient, and D_t^D is the distance between the UAV's current position and a destination point;
S4, verifying the safety of the action a_t by the shield module, and finally outputting a safe action a_t′;
S5, performing the obtained final safe action a_t′ to carry out a state transition and obtain a next state s_{t+1} as well as a reward Reward_t; and
S6, storing the current state s_t, the final safe action a_t′, the reward Reward_t, the next state s_{t+1}, and a training flag d_t in the replay buffer, and sampling a random minibatch of transitions from the replay buffer for updating the neural network.
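
The action composition in step S3 can be sketched as follows. This is a minimal illustration only: the claim states f_t = ε·D_t^D with D_t^D described as the distance to the destination point; here D_t^D is interpreted as the displacement vector from the UAV's current position to the destination, and the value of ε is an arbitrary placeholder.

import numpy as np

def attractive_force(position, destination, epsilon=0.1):
    """Attractive term f_t = epsilon * D_t^D (step S3).

    D_t^D is taken here as the displacement vector from the UAV's current
    position to the destination point; this vector interpretation and the
    value of epsilon are illustrative assumptions, not taken from the claim.
    """
    return epsilon * (np.asarray(destination, dtype=float) - np.asarray(position, dtype=float))

def compose_action(u_t, position, destination, epsilon=0.1):
    """Candidate action a_t = u_t + f_t = [a_t^x, a_t^y, a_t^z]."""
    return np.asarray(u_t, dtype=float) + attractive_force(position, destination, epsilon)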
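
Steps S3 and S4 route the candidate action through the shield module before it reaches the environment. The sketch below assumes a generic safety predicate and fallback rule in place of the safety automaton, observe function and LTL safety specification named in the claim; those components are not detailed in this entry, so both callables are hypothetical placeholders.

def shield(state, candidate_action, is_safe, fallback_action):
    """Return the final safe action a_t' (step S4).

    is_safe(state, action) stands in for the LTL-derived safety automaton
    and observe function; fallback_action(state) supplies a substitute
    action when the candidate is rejected. Both are placeholders.
    """
    if is_safe(state, candidate_action):
        return candidate_action
    return fallback_action(state)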
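
Steps S5 and S6 follow the standard DDPG interaction pattern: execute the safe action, observe the next state and reward, store the transition, and sample a random minibatch for network updates. A minimal replay-buffer sketch, with the environment, actor network and network-update routine assumed to exist elsewhere:

import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Stores (s_t, a_t', Reward_t, s_{t+1}, d_t) transitions (step S6)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
        return states, actions, rewards, next_states, dones

# One interaction step (S2-S6), assuming env, actor and update_networks
# are defined elsewhere (hypothetical names for illustration):
#
#   u_t = actor(s_t)
#   a_t = compose_action(u_t, position, destination)
#   a_t_safe = shield(s_t, a_t, is_safe, fallback_action)
#   s_next, reward, done = env.step(a_t_safe)
#   buffer.store(s_t, a_t_safe, reward, s_next, done)
#   update_networks(buffer.sample(batch_size=64))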