CPC G10L 15/25 (2013.01) [G06V 10/82 (2022.01); G06V 40/172 (2022.01); G10L 15/05 (2013.01); G10L 15/26 (2013.01)] | 20 Claims |
1. A method comprising:
obtaining an audio signal and a face image, wherein a first photographing time point of the face image is the same as a first collection time point of the audio signal;
inputting the face image into a prediction model to predict whether a user intends to continue speaking;
processing the face image using the prediction model to obtain a prediction result;
outputting the prediction result; and
determining that the audio signal is a speech end point when the prediction result indicates that the user does not intend to continue speaking.
|
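The flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names `Sample`, `is_speech_end_point`, and `toy_model` are hypothetical, and the prediction model is assumed to be any callable that maps a face image to a boolean "intends to continue speaking" result. The toy model uses a mean pixel intensity threshold purely as a stand-in for a trained network.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Image = Sequence[Sequence[int]]  # grayscale pixel rows; an assumed representation


@dataclass(frozen=True)
class Sample:
    """An audio signal and a face image captured at the same time point."""
    timestamp_ms: int        # shared collection / photographing time point
    audio: Sequence[float]   # audio signal collected at timestamp_ms
    face_image: Image        # face image photographed at timestamp_ms


# Assumed model interface: returns True when the user intends to continue speaking.
PredictionModel = Callable[[Image], bool]


def is_speech_end_point(sample: Sample, model: PredictionModel) -> bool:
    """Treat the audio signal as a speech end point when the model
    predicts that the user does not intend to continue speaking."""
    intends_to_continue = model(sample.face_image)  # input and process the image
    return not intends_to_continue                  # output drives the decision


def toy_model(face_image: Image) -> bool:
    """Stand-in predictor (hypothetical): mean intensity above 127 is
    read as 'intends to continue'. A real system would use a trained model."""
    pixels = [p for row in face_image for p in row]
    return sum(pixels) / len(pixels) > 127


bright = Sample(timestamp_ms=1000, audio=[0.0, 0.1], face_image=[[200, 210], [220, 230]])
dark = Sample(timestamp_ms=1040, audio=[0.0, 0.0], face_image=[[10, 20], [30, 40]])

print(is_speech_end_point(bright, toy_model))
print(is_speech_end_point(dark, toy_model))
```

Passing the model in as a callable keeps the end point decision independent of how the prediction is produced, so the stand-in can be replaced without changing `is_speech_end_point`.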