US 12,248,327 B1
Method for UAV path planning in urban airspace based on safe reinforcement learning
Xuejun Zhang, Beijing (CN); Yan Li, Beijing (CN); and Yuanjun Zhu, Beijing (CN)
Assigned to Beihang University, Beijing (CN)
Appl. No. 18/556,353
Filed by Beihang University, Beijing (CN)
PCT Filed Feb. 23, 2023, PCT No. PCT/CN2023/077843
§ 371(c)(1), (2) Date Oct. 20, 2023,
PCT Pub. No. WO2024/164367, PCT Pub. Date Aug. 15, 2024.
Claims priority of application No. 202310081273.2 (CN), filed on Feb. 8, 2023.
Int. Cl. G05D 1/622 (2024.01); G05D 1/46 (2024.01); G05D 1/49 (2024.01)
CPC G05D 1/637 (2024.01) [G05D 1/46 (2024.01); G05D 1/49 (2024.01)] 5 Claims
OG exemplary drawing
 
1. A method for UAV path planning in urban airspace based on safe reinforcement learning, comprising:
S1, collecting state information of a UAV, an urban airspace and an urban ground environment, and defining a state of the UAV at any moment t as s_t, wherein s_t = [x_t, y_t, z_t];
S2, constituting a safe reinforcement learning algorithm, called a shield-DDPG architecture, from four functional modules: an environment module, a neural network module, a shield module, and a replay buffer; and conducting training by the neural network module according to the state s_t, the neural network module comprising a main network and a target network; the shield module being constructed by linear temporal logic and specifically comprising a finite-state reactive system, a state trace, a safety specification, a Markov decision process, a safety automaton and an observe function, the shield module acting between a main actor network and a main critic network, the main actor network outputting an action u_t;
S3, determining, by the shield module, safety of an action a_t = u_t + f_t = [a_t^x, a_t^y, a_t^z], in which f_t = ε·D_t^D is an attractive force, ε is an attractive coefficient, and D_t^D is the distance between the UAV's current position and a destination point;
S4, verifying the safety of the action a_t by the shield module, and finally outputting a safe action a_t′;
S5, performing the obtained final safe action a_t′ to carry out a state transition and obtain a next state s_{t+1} as well as a reward Reward_t; and
S6, storing the current state s_t, the final safe action a_t′, the reward Reward_t, the next state s_{t+1}, and a training flag d_t in the replay buffer, and sampling a random minibatch of transitions from the replay buffer for updating the neural network.
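
The action composition in step S3 can be sketched as follows. This is a minimal illustration only: the claim states f_t = ε·D_t^D with D_t^D described as the distance to the destination point; here D_t^D is interpreted as the displacement vector from the UAV's current position to the destination, and the value of ε is an arbitrary placeholder.

import numpy as np

def attractive_force(position, destination, epsilon=0.1):
    """Attractive term f_t = epsilon * D_t^D (step S3).

    D_t^D is taken here as the displacement vector from the UAV's current
    position to the destination point; this vector interpretation and the
    value of epsilon are illustrative assumptions, not taken from the claim.
    """
    return epsilon * (np.asarray(destination, dtype=float) - np.asarray(position, dtype=float))

def compose_action(u_t, position, destination, epsilon=0.1):
    """Candidate action a_t = u_t + f_t = [a_t^x, a_t^y, a_t^z]."""
    return np.asarray(u_t, dtype=float) + attractive_force(position, destination, epsilon)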
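
Steps S3 and S4 route the candidate action through the shield module before it reaches the environment. The sketch below assumes a generic safety predicate and fallback rule in place of the safety automaton, observe function and LTL safety specification named in the claim; those components are not detailed in this entry, so both callables are hypothetical placeholders.

def shield(state, candidate_action, is_safe, fallback_action):
    """Return the final safe action a_t' (step S4).

    is_safe(state, action) stands in for the LTL-derived safety automaton
    and observe function; fallback_action(state) supplies a substitute
    action when the candidate is rejected. Both are placeholders.
    """
    if is_safe(state, candidate_action):
        return candidate_action
    return fallback_action(state)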
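
Steps S5 and S6 follow the standard DDPG interaction pattern: execute the safe action, observe the next state and reward, store the transition, and sample a random minibatch for network updates. A minimal replay-buffer sketch, with the environment, actor network and network-update routine assumed to exist elsewhere:

import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Stores (s_t, a_t', Reward_t, s_{t+1}, d_t) transitions (step S6)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
        return states, actions, rewards, next_states, dones

# One interaction step (S2-S6), assuming env, actor and update_networks
# are defined elsewhere (hypothetical names for illustration):
#
#   u_t = actor(s_t)
#   a_t = compose_action(u_t, position, destination)
#   a_t_safe = shield(s_t, a_t, is_safe, fallback_action)
#   s_next, reward, done = env.step(a_t_safe)
#   buffer.store(s_t, a_t_safe, reward, s_next, done)
#   update_networks(buffer.sample(batch_size=64))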