US 12,437,775 B2
Speech processing method, computer storage medium, and electronic device
Hanyu Ding, Zhejiang (CN); Yue Lin, Zhejiang (CN); and Duisheng Chen, Zhejiang (CN)
Assigned to NETEASE (HANGZHOU) NETWORK CO., LTD., Zhejiang (CN)
Appl. No. 18/248,528
Filed by NETEASE (HANGZHOU) NETWORK CO., LTD., Zhejiang (CN)
PCT Filed Feb. 22, 2021, PCT No. PCT/CN2021/077309
§ 371(c)(1), (2) Date Apr. 11, 2023,
PCT Pub. No. WO2022/083039, PCT Pub. Date Apr. 28, 2022.
Claims priority of application No. 202011128423.3 (CN), filed on Oct. 20, 2020.
Prior Publication US 2023/0395094 A1, Dec. 7, 2023
Int. Cl. G10L 25/60 (2013.01); G10L 25/45 (2013.01); G10L 25/78 (2013.01); G10L 25/84 (2013.01)
CPC G10L 25/60 (2013.01) [G10L 25/45 (2013.01); G10L 25/84 (2013.01); G10L 2025/783 (2013.01)] 19 Claims
OG exemplary drawing
 
1. A speech processing method, comprising:
acquiring a speech sequence, obtaining a plurality of speech sub-sequences by performing framing processing on the speech sequence, and extracting a target feature of each speech sub-sequence of the plurality of speech sub-sequences;
detecting each speech sub-sequence of the plurality of speech sub-sequences by a speech detection model according to each target feature, and determining valid speech based on a detection result;
inputting a target feature corresponding to the valid speech into a voiceprint recognition model, and screening out target speech from the valid speech by the voiceprint recognition model; and
controlling the target speech to be forwarded to a client;
wherein the voiceprint recognition model comprises a convolutional layer, a double-layer Long Short-Term Memory (LSTM) layer, a pooling layer, and an affine layer, and the inputting the target feature corresponding to the valid speech into the voiceprint recognition model and screening out the target speech from the valid speech by the voiceprint recognition model, comprises:
determining the target feature corresponding to the valid speech as a valid target feature;
inputting the valid target feature into the voiceprint recognition model, and obtaining a deep feature of the valid target feature by performing feature extraction on the valid target feature sequentially using the convolutional layer and the double-layer LSTM layer, wherein the deep feature comprises a time dimension and a feature dimension;
obtaining a maximum feature and a mean feature of the deep feature in the time dimension by inputting the deep feature into the pooling layer for feature extraction, and obtaining a hidden layer feature by summing the maximum feature and the mean feature; and
obtaining a speech representation vector of a valid speech sub-sequence corresponding to the valid target feature by inputting the hidden layer feature into the affine layer for affine transformation.
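The following is a minimal sketch of the overall processing flow recited in claim 1 (framing, per-sub-sequence feature extraction, speech detection, voiceprint screening, and forwarding to a client). The frame length, hop size, feature extractor, models, and forwarding callable are hypothetical stand-ins for illustration and are not taken from the patent.

```python
# Sketch of the claimed flow; all concrete parameters and callables are assumptions.
import numpy as np

FRAME_LEN = 400   # samples per speech sub-sequence (assumed 25 ms at 16 kHz)
FRAME_HOP = 160   # hop between sub-sequences (assumed 10 ms at 16 kHz)


def frame_speech(sequence: np.ndarray) -> list[np.ndarray]:
    """Split the acquired speech sequence into overlapping speech sub-sequences."""
    return [sequence[i:i + FRAME_LEN]
            for i in range(0, len(sequence) - FRAME_LEN + 1, FRAME_HOP)]


def process(sequence, extract_feature, detection_model, voiceprint_model,
            is_target, forward_to_client):
    # 1. Framing and extraction of a target feature for each sub-sequence.
    frames = frame_speech(sequence)
    features = [extract_feature(f) for f in frames]

    # 2. Detect each sub-sequence with the speech detection model;
    #    keep only the sub-sequences judged to be valid speech.
    valid = [(f, feat) for f, feat in zip(frames, features)
             if detection_model(feat)]

    # 3. Screen target speech from the valid speech via the voiceprint model.
    target = [f for f, feat in valid if is_target(voiceprint_model(feat))]

    # 4. Control the target speech to be forwarded to the client.
    for f in target:
        forward_to_client(f)
```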
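Below is a minimal sketch, in PyTorch, of the voiceprint recognition model structure as recited (convolutional layer, double-layer LSTM, pooling that sums the time-dimension maximum and mean, and an affine layer). The feature dimension, channel counts, hidden size, and embedding size are illustrative assumptions, not values from the patent.

```python
# Illustrative structure only; hyperparameters are assumed, not specified by the claim.
import torch
import torch.nn as nn


class VoiceprintModel(nn.Module):
    def __init__(self, feat_dim=40, conv_channels=64, lstm_hidden=128, embed_dim=256):
        super().__init__()
        # Convolutional layer over the input target features.
        self.conv = nn.Conv1d(feat_dim, conv_channels, kernel_size=3, padding=1)
        # Double-layer LSTM producing the deep feature (time x feature dimensions).
        self.lstm = nn.LSTM(conv_channels, lstm_hidden, num_layers=2, batch_first=True)
        # Affine layer yielding the speech representation vector.
        self.affine = nn.Linear(lstm_hidden, embed_dim)

    def forward(self, x):
        # x: (batch, time, feat_dim) -- valid target features.
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)   # (batch, time, conv_channels)
        x, _ = self.lstm(x)                                 # deep feature: (batch, time, lstm_hidden)
        # Pooling layer: maximum feature plus mean feature over the time dimension
        # gives the hidden layer feature.
        hidden = x.max(dim=1).values + x.mean(dim=1)
        # Affine transformation -> speech representation vector.
        return self.affine(hidden)
```

Feeding a batch of valid target features of shape (batch, time, feat_dim) yields one representation vector per valid speech sub-sequence; comparing these vectors against an enrolled voiceprint is one way the valid speech could then be screened for the target speech.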