US 11,990,044 B2
Intelligent collision avoidance method for a swarm of unmanned surface vehicles based on deep reinforcement learning
Yong Ma, Wuhan (CN); Yujiao Zhao, Wuhan (CN); and Yulong Wang, Wuhan (CN)
Assigned to WUHAN UNIVERSITY OF TECHNOLOGY, Wuhan (CN)
Appl. No. 17/292,601
Filed by Wuhan University of Technology, Wuhan (CN)
PCT Filed Sep. 30, 2020, PCT No. PCT/CN2020/119188 § 371(c)(1), (2) Date Mar. 3, 2022, PCT Pub. No. WO2021/082864, PCT Pub. Date May 6, 2021.
Claims priority of application No. 201911043840.5 (CN), filed on Oct. 30, 2019.
Prior Publication US 2022/0189312 A1, Jun. 16, 2022
Int. Cl. G05D 1/00 (2024.01); G06N 3/08 (2023.01); G08G 3/02 (2006.01)

CPC G08G 3/02 (2013.01) [G05D 1/0088 (2013.01); G05D 1/0206 (2013.01); G06N 3/08 (2013.01)]

6 Claims

1. A collision avoidance method for a swarm of unmanned surface vehicles based on deep reinforcement learning, comprising:

S1: establishing a vehicle coordinate system, and designing an unmanned surface vehicle (USV) motion model based on the vehicle coordinate system, wherein the USV motion model is used to reflect a motion state of a USV;

S2: based on the USV motion model, expressing environmental features in the vehicle coordinate system as environmental observation values of a fixed dimension, and designing a reward and punishment function of collision avoidance effect based on the environmental observation values, wherein the reward and punishment function of collision avoidance effect is used to judge a perfection degree of a collision avoidance decision of the USV;

wherein based on the USV motion model, the expressing environmental features in the vehicle coordinate system as environmental observation values of a fixed dimension comprises:

according to international regulations for preventing collisions at sea (COLREGS), dividing n number of fields based on a vehicle-side angle of a USV_j as an obstacle, and obtaining a USV environmental observation value of each of the n number of fields, wherein the USV environmental observation value of each of the n number of fields comprises the vehicle-side angle, swarm movement characteristics, movement trends and danger degrees;

combining the USV environmental observation value in each of the n number of fields to form a four-dimensional matrix, and adding weights to the four-dimensional matrix corresponding to each of the n number of fields to obtain a final USV environmental observation value;

wherein the USV environmental observation value in each of the n number of fields is obtained through

where θ_max and θ_min respectively represent a maximum vehicle-side angle and a minimum vehicle-side angle of the obstacle in each of the n fields; C_j represents a movement direction of the USV_j as the obstacle; CRI_j represents a danger degree of the USV_j; d_j represents a distance between a USV_j and the USV_j as the obstacle, V_jx and V_jy respectively represent a x horizontal component and a y vertical component of a velocity of the USV_j as the obstacle;

wherein the four-dimensional matrix of each of the n fields is formed by S_u = [Δθ C CRI′ d V_x V_y], where u = (1, 2, . . . , n), and the final USV environmental observation value is formed by s_t = [V_i C_i χ₁S₁ χ₂S₂ χ₃S₃ . . . χ_nS_n]^T, where χ₁, χ₂, χ₃, . . . , χ_n are the weights corresponding to the n fields respectively;
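
The assembly of the final observation value from the per-field matrices can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name build_observation, the flat-list layout, the zero-filling of an empty field, and all numeric values are assumptions made for illustration. The per-field features are taken as given inputs in the order S_u = [Δθ C CRI′ d V_x V_y], so the sketch does not assume how each feature is computed.

```python
from typing import List, Sequence

# Number of per-field features, in the order S_u = [Δθ, C, CRI′, d, V_x, V_y].
FIELD_FEATURES = 6


def build_observation(v_i: float, c_i: float,
                      fields: Sequence[Sequence[float]],
                      weights: Sequence[float]) -> List[float]:
    """Assemble s_t = [V_i, C_i, χ_1·S_1, ..., χ_n·S_n] as a flat list.

    Hypothetical helper: fields[u] holds the six features of field u,
    weights[u] is the weight χ_u applied to that field.
    """
    if len(fields) != len(weights):
        raise ValueError("need exactly one weight per field")
    s_t = [v_i, c_i]
    for s_u, chi in zip(fields, weights):
        if len(s_u) != FIELD_FEATURES:
            raise ValueError("each field needs 6 features")
        # Scale every feature of the field by its weight and append it.
        s_t.extend(chi * x for x in s_u)
    return s_t


# Illustrative values for n = 2 fields (not from the patent).
fields = [
    [30.0, 90.0, 0.4, 120.0, 1.5, -0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # field without an obstacle, zero-filled (assumption)
]
obs = build_observation(v_i=2.0, c_i=45.0, fields=fields, weights=[1.0, 0.5])
print(len(obs))  # length is 2 + 6 * n
```

Because every field contributes the same six entries regardless of how many vehicles it contains, the length of s_t depends only on n, which is what gives the observation its fixed dimension.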

wherein the reward and punishment function is designed by

where CRI_t′ represents a danger degree of field at time t, d_t represents an average distance of the obstacle of field at time t, a_t(C) represents a course change in a decision set, a_t(V) represents a velocity change of the decision set, R_danger represents a changing trend of the danger degree considering reward and punishment calculations of encounter situation, R_rule represents a deviation between a current decision and the COLREGS and the deviation is a punitive negative value when the COLREGS is violated, R_action represents a continuity of decision-making actions and calculates an impact of decision-making shock in collision avoidance, r_t represents a reward and punishment value formed by combining the factors R_danger, R_rule and R_action, DCPA_i represents a minimum encounter distance between the USV and the USV_i, D_s represents a safe distance between the USV and the USV_i to avoid collision, k_CRI represents an impact coefficient of collision risk, D_LMA represents a shortest distance between the USV and the USV_i to avoid collision when one USV implements full rudder rotation to avoid collision and the other USV is going straight, V₀ represents an initial velocity of the USV, V_s represents a safe velocity of the USV calculated considering an influence of USV velocity, a safety distance between vehicles and a navigation environment, k_action represents a coefficient of action influence, and δ_t represents a rudder angle of the USV;

S3: integrating a long short-term memory (LSTM) neural network and deep reinforcement learning principles to build a collision avoidance training model for the swarm of USVs; and

S4: training the USV to avoid collision in an environment with the swarm of USVs based on the collision avoidance training model for the swarm of USVs.
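
Step S3 names an LSTM network but the claim does not specify its architecture. As a rough illustration of how a sequence of observation values s_t could be summarized into one state vector for a reinforcement learning agent, the following Python sketch implements one step of the standard LSTM cell and applies it over a short history. The function names, the gate ordering (input, forget, candidate, output), the weight layout, and all numeric values are assumptions, not details from the patent.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _affine(w: Matrix, x: Vector, b: Vector) -> Vector:
    """Return w @ x + b for plain Python lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]


def lstm_step(x: Vector, h: Vector, c: Vector,
              w: Matrix, b: Vector) -> Tuple[Vector, Vector]:
    """One standard LSTM step.

    w has 4*H rows (gates i, f, g, o stacked) and len(x) + H columns,
    where H = len(h). This layout is an assumption of the sketch.
    """
    hidden = len(h)
    z = _affine(w, x + h, b)  # pre-activations on the concatenated [x, h]
    i = [_sigmoid(v) for v in z[0:hidden]]                 # input gate
    f = [_sigmoid(v) for v in z[hidden:2 * hidden]]        # forget gate
    g = [math.tanh(v) for v in z[2 * hidden:3 * hidden]]   # candidate cell
    o = [_sigmoid(v) for v in z[3 * hidden:4 * hidden]]    # output gate
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def encode_history(observations: List[Vector], hidden: int,
                   w: Matrix, b: Vector) -> Vector:
    """Run the cell over a history of observations; return the last hidden state."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    for x in observations:
        h, c = lstm_step(x, h, c, w, b)
    return h


# Illustrative run: 3 time steps of a 4-dimensional observation, H = 2.
hidden_size, obs_size = 2, 4
w = [[0.1] * (obs_size + hidden_size) for _ in range(4 * hidden_size)]
b = [0.0] * (4 * hidden_size)
history = [[0.5, 0.1, 0.0, 0.2], [0.4, 0.1, 0.1, 0.2], [0.3, 0.2, 0.1, 0.1]]
state = encode_history(history, hidden_size, w, b)
print(len(state))  # one value per hidden unit
```

In a trained model the weights would be learned rather than fixed, and the resulting state vector would feed the decision network; here it only shows how the recurrent state carries information across successive observations.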