US 11,735,168 B2
Method and apparatus for recognizing voice
Xin Li, Beijing (CN); Bin Huang, Beijing (CN); Ce Zhang, Beijing (CN); Jinfeng Bai, Beijing (CN); and Lei Jia, Beijing (CN)
Assigned to BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY CO., LTD., Beijing (CN)
Filed by BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY CO., LTD., Beijing (CN)
Filed on Mar. 23, 2021, as Appl. No. 17/209,681.
Claims priority of application No. 202010697077.4 (CN), filed on Jul. 20, 2020.
Prior Publication US 2021/0233518 A1, Jul. 29, 2021
Int. Cl. G10L 15/16 (2006.01); G06N 3/08 (2023.01); G10L 15/06 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2006.01); G10L 15/32 (2013.01); G10L 25/18 (2013.01); G10L 15/20 (2006.01)
CPC G10L 15/16 (2013.01) [G06N 3/08 (2013.01); G10L 15/063 (2013.01); G10L 15/197 (2013.01); G10L 15/22 (2013.01); G10L 15/32 (2013.01); G10L 25/18 (2013.01); G10L 15/20 (2013.01); G10L 2015/0631 (2013.01)]
21 Claims
 
1. A method for recognizing a voice, the method comprising:
inputting a target voice into a pre-trained voice recognition model to obtain an initial text output by at least one recognition network in the voice recognition model, wherein the at least one recognition network comprises an omnidirectional network and a plurality of directional networks, each of the plurality of directional networks being obtained by training using a voice sample in a different direction interval, wherein inputting the target voice into the pre-trained voice recognition model to obtain the initial text output by the at least one recognition network in the voice recognition model comprises:
inputting a transformed voice obtained from the target voice into the omnidirectional network to obtain a given voice feature output by a complex linear transformation layer of the omnidirectional network, and
inputting the given voice feature into each of the plurality of directional networks to obtain an initial sub-text output by each directional network, wherein each of the plurality of directional networks comprises a long short-term memory network layer and a streaming multi-layer truncated attention layer; and
determining a voice recognition result of the target voice, based on the initial text comprising the initial sub-text output by each directional network.
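
The data flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The following are assumptions made only for illustration: a naive DFT stands in for the "transformed voice", the names `dft`, `complex_linear`, `DirectionalNetwork`, `make_toy_decoder`, and `recognize` are hypothetical, each directional network is reduced to a stub that returns a sub-text with a confidence score (the claim recites a long short-term memory layer and a streaming multi-layer truncated attention layer, which are not modeled here), and the final result is chosen by highest confidence, a rule the claim does not specify.

```python
import cmath
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Feature = List[complex]


def dft(frame: Sequence[float]) -> Feature:
    """Naive discrete Fourier transform (assumed stand-in for the transformed voice)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def complex_linear(spectrum: Feature, weights, bias) -> Feature:
    """Complex linear transformation layer: y = W x + b over complex numbers."""
    return [
        sum(w * x for w, x in zip(row, spectrum)) + b
        for row, b in zip(weights, bias)
    ]


@dataclass
class DirectionalNetwork:
    """Stub for one directional network trained on one direction interval."""
    interval_deg: Tuple[int, int]
    # Hypothetical decoder: maps the shared feature to (sub_text, confidence).
    decode: Callable[[Feature], Tuple[str, float]]


def make_toy_decoder(label: str, bin_index: int):
    """Toy decoder: confidence is the magnitude of one feature bin (assumption)."""
    def decode(feature: Feature) -> Tuple[str, float]:
        return label, abs(feature[bin_index])
    return decode


def recognize(frame, weights, bias, directional_networks):
    # Omnidirectional front end: one shared feature for all directions.
    shared = complex_linear(dft(frame), weights, bias)
    # The same feature is fed to every directional network.
    sub_texts = [net.decode(shared) for net in directional_networks]
    # Assumed selection rule: keep the sub-text with the highest confidence.
    best_text, _ = max(sub_texts, key=lambda pair: pair[1])
    return best_text, sub_texts
```

The point of the sketch is the sharing pattern: the omnidirectional front end runs once per input, and its complex-valued output is reused by every directional network, so adding a direction interval adds one decoder rather than a second full pipeline.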