US 12,249,322 B1
Intent recognition method, apparatus and storage medium
Peng Shen, Beijing (CN); Lizhao Guo, Beijing (CN); Fupo Wang, Beijing (CN); Mingxing Huang, Beijing (CN); and Xiaobo Zhou, Beijing (CN)
Assigned to Beijing Waterdrop Technology Group Co., Ltd., Beijing (CN)
Filed by Beijing Waterdrop Technology Group Co., Ltd., Beijing (CN)
Filed on Oct. 25, 2024, as Appl. No. 18/927,212.
Int. Cl. G10L 15/18 (2013.01); G10L 15/02 (2006.01); G10L 15/197 (2013.01); G10L 25/18 (2013.01); G10L 25/21 (2013.01); G10L 25/45 (2013.01)
CPC G10L 15/1815 (2013.01) [G10L 15/02 (2013.01); G10L 15/197 (2013.01); G10L 25/18 (2013.01); G10L 25/21 (2013.01); G10L 25/45 (2013.01)]
6 Claims
 
1. An intent recognition method, comprising:
acquiring a user's audio to be recognized;
determining multi-frame audio feature vectors corresponding to the audio to be recognized;
inputting the multi-frame audio feature vectors into a preset intent recognition model to obtain multiple output sequences corresponding to the multi-frame audio feature vectors, wherein the output sequences comprise blank characters and non-blank characters; and
determining a target intent corresponding to the audio to be recognized based on the multiple output sequences;
wherein the determining the target intent corresponding to the audio to be recognized based on the multiple output sequences comprises:
calculating probability values corresponding to the multiple output sequences respectively; and
determining a maximum probability value in the probability values, and then determining the target intent corresponding to the audio to be recognized based on the output sequence corresponding to the maximum probability value;
wherein the calculating probability values corresponding to the multiple output sequences respectively comprises:
removing the blank characters of an arbitrary output sequence in the multiple output sequences to obtain a processed output sequence;
merging duplicate non-blank characters in the processed output sequence to obtain a simplified output sequence;
tokenizing the simplified output sequence to obtain multiple tokens corresponding to the arbitrary output sequence;
determining token frequencies of the tokens in the arbitrary output sequence; and
multiplying the token frequencies corresponding to the tokens to obtain a probability value corresponding to the arbitrary output sequence;
wherein the determining the target intent corresponding to the audio to be recognized based on the output sequence corresponding to the maximum probability value comprises:
splicing the tokens of the output sequence corresponding to the maximum probability value in sequence to obtain the target intent corresponding to the audio to be recognized; and
wherein the determining token frequencies of the tokens in the arbitrary output sequence comprises:
determining a total quantity of characters corresponding to the arbitrary output sequence, wherein the characters comprise blank characters and non-blank characters;
determining a quantity of the blank characters that are adjacent in forward order to an arbitrary token in the tokens, and determining a quantity of duplicate characters that are the same as the arbitrary token;
dividing the quantity of the blank characters by the total quantity of the characters to obtain a blank token frequency corresponding to the arbitrary token;
dividing a sum of the quantity of the duplicate characters and 1 by the total quantity of the characters to obtain a duplicate token frequency corresponding to the arbitrary token; and
multiplying the blank token frequency by the duplicate token frequency to obtain the token frequency of the arbitrary token in the arbitrary output sequence.
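
The scoring steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, only one possible reading of the claim language under stated assumptions: the blank character is written as "_", each non-blank character is treated as one token, "blank characters adjacent in forward order" is read as the run of blanks immediately preceding a token, and a sequence with no tokens is scored 0. The function names token_stats, sequence_probability and recognize_intent are hypothetical.

```python
from math import prod

BLANK = "_"  # assumed blank symbol; the claim does not name one


def token_stats(seq, blank=BLANK):
    """Return [(token, blanks_before, duplicates)] for one raw output sequence.

    Blanks are dropped first and adjacent duplicates are merged afterwards,
    following the order of steps in the claim. Each character is assumed to
    be one token.
    """
    stats = []
    pending_blanks = 0
    for ch in seq:
        if ch == blank:
            pending_blanks += 1
            continue
        if stats and stats[-1][0] == ch:
            # Same character as the previous token once blanks are removed:
            # count it as a duplicate of that token.
            tok, blanks, dups = stats[-1]
            stats[-1] = (tok, blanks, dups + 1)
        else:
            stats.append((ch, pending_blanks, 0))
        pending_blanks = 0
    return stats


def sequence_probability(seq, blank=BLANK):
    """Multiply the per-token frequencies of one output sequence."""
    total = len(seq)  # blank and non-blank characters together
    stats = token_stats(seq, blank)
    if total == 0 or not stats:
        return 0.0  # assumption: a sequence without tokens scores 0
    # token frequency = (blanks / total) * ((duplicates + 1) / total)
    # Under this reading, a token with no preceding blank scores 0.
    return prod((b / total) * ((d + 1) / total) for _, b, d in stats)


def recognize_intent(sequences, blank=BLANK):
    """Pick the highest-scoring sequence and splice its tokens in order."""
    best = max(sequences, key=lambda s: sequence_probability(s, blank))
    return "".join(tok for tok, _, _ in token_stats(best, blank))


candidates = ["__hh_i", "_h_i__", "hh_i__"]
print(recognize_intent(candidates))
```

In this toy input the first candidate scores (2/6 × 2/6) × (1/6 × 1/6), the second scores (1/6 × 1/6) × (1/6 × 1/6), and the third scores 0 because its first token has no preceding blank, so the first candidate is selected and its tokens are spliced into the result.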