CPC H04N 13/161 (2018.05) [G06T 7/20 (2013.01)]; 12 Claims

1. A method of processing video data, applied to a display device, comprising:
constructing a curved grid matching a display screen of the display device and acquiring a coordinate transformation relationship between the display screen and the curved grid;
acquiring an image stream and an audio stream in the video data and identifying motion track coordinates of sound source objects in the image stream corresponding to different audio elements in the audio stream according to the image stream and the audio stream;
acquiring spatial track coordinates of the sound source object corresponding to each of the audio elements on the curved grid according to the motion track coordinates of the sound source object and the coordinate transformation relationship; and
constructing a stereo video based on the image stream, each of the audio elements in the audio stream, and the spatial track coordinates of the sound source object corresponding to each of the audio elements;
wherein identifying the motion track coordinates of the sound source objects in the image stream corresponding to the different audio elements in the audio stream according to the image stream and the audio stream comprises:
performing audio data separation on the audio stream to obtain the audio elements;
for a target audio element in the audio stream, intercepting, from the image stream, a first image stream synchronized with the target audio element;
inputting the target audio element and each frame image of the first image stream into a sound source localization model and obtaining sound source position coordinates of the sound source object corresponding to the target audio element in each frame image; and
determining the motion track coordinates of the sound source object corresponding to the target audio element in the first image stream according to the sound source position coordinates in each frame image of the first image stream;
wherein inputting the target audio element and each frame image of the first image stream into the sound source localization model and obtaining the sound source position coordinates of the sound source object corresponding to the target audio element in each frame image comprises:
acquiring a target frame image and a historical frame image corresponding to a current prediction step from the first image stream;
inputting the target audio element and the historical frame image into the sound source localization model and obtaining confidence degrees of different prediction regions of the sound source object corresponding to the target audio element in the target frame image;
if a maximum confidence degree among the confidence degrees of the prediction regions is greater than a preset confidence threshold value, determining the sound source position coordinates of the sound source object corresponding to the target audio element in the target frame image according to position information of the prediction region corresponding to the maximum confidence degree; and
setting the sound source position coordinates of the sound source object corresponding to the target audio element in the target frame image to a null value if the maximum confidence degree among the confidence degrees of the prediction regions is less than or equal to the preset confidence threshold value;
wherein determining the motion track coordinates of the sound source object corresponding to the target audio element in the first image stream according to the sound source position coordinates in each frame image of the first image stream comprises:
acquiring invalid frame images whose sound source position coordinates of the sound source object corresponding to the target audio element are a null value; and
if the invalid frame images comprise consecutive invalid frame images whose number is less than a preset value, acquiring the sound source position coordinates in the invalid frame images according to the sound source position coordinates of the sound source object corresponding to the target audio element in a previous frame image and the sound source position coordinates in a subsequent frame image.
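
The claim leaves the geometry of the curved grid and the exact form of the coordinate transformation relationship open. As a minimal sketch, assuming a cylindrically curved screen viewed from its axis, the mapping from flat pixel coordinates to points on the curved grid could be expressed as follows (the cylindrical model, the `radius` value, and the physical screen height are illustrative assumptions, not limitations of the claim):

```python
import math

def screen_to_curved_grid(x_px, y_px, width_px, height_px,
                          radius=2.0, screen_height_m=0.6):
    """Map a flat pixel coordinate to a 3D point on a cylindrical curved grid.

    Illustrative assumption: the display is modeled as a vertical cylinder
    segment of the given radius with the viewer on the cylinder axis.
    """
    # Physical arc width of the screen, assuming square pixels.
    arc_width_m = screen_height_m * width_px / height_px
    # Horizontal pixel offset -> angle subtended at the cylinder axis.
    theta = (x_px / width_px - 0.5) * (arc_width_m / radius)
    # Vertical pixel offset -> height on the cylinder.
    y_m = (0.5 - y_px / height_px) * screen_height_m
    return (radius * math.sin(theta), y_m, radius * math.cos(theta))

# Spatial track coordinates follow by applying the same relationship to each
# motion track coordinate of a sound source object.
track_px = [(320, 180), (352, 176), (400, 170)]
spatial_track = [screen_to_curved_grid(x, y, 1280, 720) for (x, y) in track_px]
```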
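
One way to read the interception step is as cutting the frame sequence down to the time span during which a separated audio element is active. The sketch below assumes the audio data separation step yields, per element, the separated samples together with start and end times; the `AudioElement` structure and its fields are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class AudioElement:
    """Hypothetical result of audio data separation for one separated source."""
    samples: list   # separated audio samples
    start_s: float  # time (seconds) at which the element becomes audible
    end_s: float    # time (seconds) at which the element falls silent

def intercept_first_image_stream(frames, fps, element):
    """Return the sub-sequence of frames synchronized with the target audio element."""
    first = max(0, int(element.start_s * fps))
    last = min(len(frames), int(element.end_s * fps) + 1)
    return frames[first:last]
```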
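
The confidence handling in the localization step amounts to taking the maximum over the prediction regions and testing it against the preset threshold, with a null value as the fallback. The sketch below assumes the sound source localization model returns, per prediction region, a bounding box and a confidence degree; the model itself and this output format are assumptions:

```python
def locate_sound_source(predictions, confidence_threshold=0.5):
    """Select the sound source position coordinates from per-region predictions.

    `predictions` is assumed to be a list of (box, confidence) pairs, where
    box = (x_min, y_min, x_max, y_max) in target-frame pixel coordinates.
    Returns the centre of the best region if its confidence exceeds the
    threshold, otherwise None (the null value used for invalid frames).
    """
    if not predictions:
        return None
    box, confidence = max(predictions, key=lambda p: p[1])
    if confidence > confidence_threshold:
        x_min, y_min, x_max, y_max = box
        # Reduce the position information of the best region to its centre.
        return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)
    return None
```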
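
For the invalid-frame handling, using the previous and subsequent valid coordinates naturally suggests interpolating across short runs of null values. Linear interpolation is an assumption here; the claim only requires that the missing coordinates be acquired from the neighbouring frames:

```python
def fill_short_null_runs(track, max_run_length=5):
    """Fill runs of None shorter than max_run_length by linear interpolation.

    `track` is the per-frame list of sound source position coordinates for one
    audio element, with None marking invalid frames; longer runs stay None.
    """
    filled = list(track)
    i = 0
    while i < len(filled):
        if filled[i] is not None:
            i += 1
            continue
        # Measure the extent of this run of invalid frames.
        j = i
        while j < len(filled) and filled[j] is None:
            j += 1
        run = j - i
        has_prev = i > 0 and filled[i - 1] is not None
        has_next = j < len(filled)
        if run < max_run_length and has_prev and has_next:
            (x0, y0), (x1, y1) = filled[i - 1], filled[j]
            for k in range(i, j):
                t = (k - i + 1) / (run + 1)  # fraction between the two neighbours
                filled[k] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
        i = j
    return filled
```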