CPC G10L 15/02 (2013.01) [G06V 40/171 (2022.01); G10L 15/25 (2013.01); G10L 25/78 (2013.01); G10L 2015/025 (2013.01)] | 18 Claims |
1. A speech recognition method, comprising:
obtaining a video stream and an audio stream within a preset time period, the video stream within the preset time period including a current frame image and a historical frame image before the current frame image, the audio stream within the preset time period including current frame audio and historical frame audio before the current frame audio;
obtaining at least one first lip region of a user in the historical frame image, and determining a second lip region of the user in the current frame image based on the current frame image and the at least one first lip region;
obtaining at least one first speech feature of the historical frame audio, and obtaining a second speech feature of the current frame audio based on the current frame audio and the at least one first speech feature, wherein the second lip region in the current frame image corresponds to the second speech feature;
obtaining a phoneme probability distribution of the current frame according to the at least one first lip region, the second lip region, the at least one first speech feature and the second speech feature; and
obtaining a speech recognition result of the current frame audio according to the phoneme probability distribution,
wherein obtaining the phoneme probability distribution of the current frame according to the at least one first lip region, the second lip region, the at least one first speech feature and the second speech feature comprises:
extracting a first lip visual feature from the at least one first lip region, and extracting a second lip visual feature from the second lip region, by processing an input lip region picture through a combination of a convolutional neural network and a pooling network;
matching the first lip visual feature with the at least one first speech feature in a time dimension and performing feature fusion, and matching the second lip visual feature with the second speech feature in the time dimension and performing the feature fusion;
recognizing features which are obtained after the feature fusion; and
obtaining the phoneme probability distribution of the current frame.
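The claimed pipeline (per-frame lip visual features, time-matched fusion with speech features, then a phoneme probability distribution for the current frame) can be sketched with toy NumPy stand-ins. Everything below is a hypothetical illustration, not the patented implementation: the 2x2 max pool stands in for the claimed CNN-plus-pooling extractor, and a single linear layer plus softmax stands in for the recognizer; the feature sizes and phoneme inventory size are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PHONEMES = 40           # assumed phoneme inventory size (hypothetical)
VIS_DIM, AUD_DIM = 16, 8  # toy feature widths (hypothetical)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def visual_feature(lip_region):
    # Stand-in for the claimed CNN + pooling-network extractor:
    # a 2x2 max pool over an 8x8 lip crop, flattened to VIS_DIM values.
    h, w = lip_region.shape
    pooled = lip_region.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
    return pooled.ravel()

def fuse(vis, aud):
    # Time-matched feature fusion: the lip visual feature of a frame is
    # paired with the speech feature of the same frame and concatenated.
    return np.concatenate([vis, aud])

# Toy aligned streams: 3 historical frames + 1 current frame.
lip_regions = [rng.random((8, 8)) for _ in range(4)]    # first + second lip regions
speech_feats = [rng.random(AUD_DIM) for _ in range(4)]  # first + second speech features

# Hypothetical recognizer: one linear layer + softmax over phonemes.
W = rng.standard_normal((N_PHONEMES, VIS_DIM + AUD_DIM)) * 0.1
b = np.zeros(N_PHONEMES)

fused = [fuse(visual_feature(l), a) for l, a in zip(lip_regions, speech_feats)]
# Phoneme probability distribution of the current frame, conditioned on the
# historical frames by averaging logits over the window (a simplification).
logits = np.mean([W @ f + b for f in fused], axis=0)
phoneme_probs = softmax(logits)
recognized = int(phoneme_probs.argmax())  # index into the phoneme inventory

print(phoneme_probs.shape, float(phoneme_probs.sum()), recognized)
```

In this sketch the "recognition result" of the claim reduces to an argmax over the fused-window distribution; a real decoder would instead run the per-frame distributions through a language or lexicon model.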