US 11,727,927 B2
View-based voice interaction method, apparatus, server, terminal and medium
Zhou Shen, Beijing (CN); Dai Tan, Beijing (CN); Sheng Lv, Beijing (CN); Kaifang Wu, Beijing (CN); and Yudong Li, Beijing (CN)
Assigned to BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) CO., LTD., Beijing (CN)
Filed by BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) CO., LTD., Beijing (CN)
Filed on May 29, 2020, as Appl. No. 16/888,426.
Application 16/888,426 is a continuation of application No. PCT/CN2019/072339, filed on Jan. 18, 2019.
Claims priority of application No. 201810501073.7 (CN), filed on May 23, 2018.
Prior Publication US 2020/0294505 A1, Sep. 17, 2020
Int. Cl. G10L 15/22 (2006.01); G10L 15/18 (2013.01); G10L 15/26 (2006.01); G10L 15/30 (2013.01)
CPC G10L 15/22 (2013.01) [G10L 15/1815 (2013.01); G10L 15/26 (2013.01); G10L 15/30 (2013.01); G10L 2015/223 (2013.01)] 9 Claims
 
1. A view-based voice interaction method, which is applied to a server, comprising:
obtaining voice information of a user and voice-action description information of a voice-operable element in a currently displayed view on a terminal;
obtaining an operational intention of the user by performing semantic recognition on the voice information according to view description information of the voice-operable element, in which the view description information comprises an element name, a text label, and a coordinate distribution of the voice-operable element in the view;
locating, in the list of voice actions, a sequence of actions matched with the operational intention according to the voice-action description information; and
delivering the sequence of actions to the terminal for execution;
wherein the voice-action description information comprises a list of voice actions, a voice label of each voice action, and configuration information of each voice action, in which
each voice action is configured to describe a voice operation to be performed on the voice-operable element in the view,
the configuration information of each voice action is configured to indicate specific execution features corresponding to each voice action,
and the voice label of each voice action is configured to describe information about the voice-operable element in the view, and to distinguish the different function operations that the same voice action performs in different views;
wherein said obtaining an operational intention of the user comprises:
predicting acoustic features of an audio signal of the voice information by utilizing a pre-trained acoustic model, and generating a corresponding query text by decoding the acoustic features dynamically with a pre-trained language model based on an architecture of the view and a relationship among respective voice-operable elements in the view.
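
By way of illustration only, below is a minimal sketch of how the view description information and the voice-action description information recited in claim 1 might be serialized. The JSON-style layout and every field name are assumptions made for readability, not structures taken from the specification.

    # Hypothetical serialization of the data recited in claim 1; all field
    # names are illustrative assumptions.
    view_description = {
        "elements": [
            {
                "element_name": "play_button",  # element name
                "text_label": "Play",           # text label shown in the view
                # coordinate distribution of the element in the view
                "coordinates": {"x": 120, "y": 640, "w": 96, "h": 96},
            },
        ],
    }

    voice_action_description = {
        # the list of voice actions
        "voice_actions": [
            {
                "action": "click",  # voice operation to be performed on the element
                # voice label: describes the element and disambiguates the same
                # action across different views
                "voice_label": "play the video",
                # configuration information: specific execution features
                "config": {"element": "play_button", "event": "tap"},
            },
        ],
    }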
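
Continuing the illustration, here is a minimal sketch of the server-side steps (semantic recognition, locating the matched sequence of actions, and returning it for delivery to the terminal), reusing the hypothetical structures above; the string matching is a toy stand-in for the semantic recognition the claim describes, and all function names are assumptions.

    def semantic_recognize(query_text, view_description):
        # Toy semantic recognition: pick the voice-operable element whose
        # text label occurs in the recognized query text.
        for element in view_description["elements"]:
            if element["text_label"].lower() in query_text.lower():
                return {"operation": "click", "target": element["element_name"]}
        return None

    def locate_actions(intention, voice_action_description):
        # Locate, in the list of voice actions, the actions matched with
        # the operational intention.
        return [
            action for action in voice_action_description["voice_actions"]
            if action["action"] == intention["operation"]
            and action["config"]["element"] == intention["target"]
        ]

    def handle_voice_request(query_text, view_description, voice_action_description):
        intention = semantic_recognize(query_text, view_description)
        if intention is None:
            return []  # nothing to deliver
        # The result would be delivered to the terminal for execution.
        return locate_actions(intention, voice_action_description)

    print(handle_voice_request("play the video",
                               view_description, voice_action_description))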
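
Finally, a minimal sketch of the decoding step in the last clause: language-model scores are biased toward words tied to voice-operable elements in the current view, so the generated query text adapts dynamically to the view. The greedy word-level decoder and the bias term are illustrative assumptions; the claim's pre-trained acoustic and language models are represented here by toy stand-ins.

    import math

    def decode_with_view_bias(acoustic_posteriors, lm_score, view_words, bias=2.0):
        # acoustic_posteriors: one dict per step mapping word -> probability
        # from the acoustic model; lm_score(history, word) -> probability
        # from the language model.
        history = []
        for step in acoustic_posteriors:
            best_word, best_score = None, -math.inf
            for word, p_acoustic in step.items():
                score = math.log(p_acoustic) + math.log(lm_score(history, word))
                if word in view_words:
                    score += bias  # favor words labeling on-view elements
                if score > best_score:
                    best_word, best_score = word, score
            history.append(best_word)
        return " ".join(history)

    # Toy usage: "play" labels an on-view element, so the view bias
    # outweighs the acoustic preference for "pay".
    steps = [{"play": 0.4, "pay": 0.6}, {"video": 0.9, "radio": 0.1}]
    uniform_lm = lambda history, word: 0.5
    print(decode_with_view_bias(steps, uniform_lm, {"play", "video"}))
    # -> play video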