US 12,406,316 B2
Processing multimodal user input for assistant systems
Vivek Natarajan, Sunnyvale, CA (US); Shawn C. P. Mei, San Francisco, CA (US); and Zhengping Zuo, Medina, WA (US)
Assigned to Meta Platforms, Inc., Menlo Park, CA (US)
Filed by Meta Platforms, Inc., Menlo Park, CA (US)
Filed on Mar. 16, 2023, as Appl. No. 18/185,258.
Application 18/185,258 is a continuation of application No. 17/156,964, filed on Jan. 25, 2021, granted, now 11,676,220.
Application 17/156,964 is a continuation of application No. 16/053,600, filed on Aug. 2, 2018, granted, now 10,936,346.
Claims priority of provisional application 62/660,876, filed on Apr. 20, 2018.
Prior Publication US 2023/0222605 A1, Jul. 13, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 17/30 (2006.01); G02B 27/01 (2006.01); G06F 3/01 (2006.01); G06F 3/16 (2006.01); G06F 7/14 (2006.01); G06F 9/451 (2018.01); G06F 16/176 (2019.01); G06F 16/22 (2019.01); G06F 16/23 (2019.01); G06F 16/242 (2019.01); G06F 16/2455 (2019.01); G06F 16/2457 (2019.01); G06F 16/248 (2019.01); G06F 16/28 (2019.01); G06F 16/332 (2019.01); G06F 16/3329 (2025.01); G06F 16/334 (2025.01); G06F 16/338 (2019.01); G06F 16/438 (2019.01); G06F 16/903 (2019.01); G06F 16/9032 (2019.01); G06F 16/9038 (2019.01); G06F 16/904 (2019.01); G06F 16/951 (2019.01); G06F 16/9535 (2019.01); G06F 18/2411 (2023.01); G06F 40/205 (2020.01); G06F 40/30 (2020.01); G06F 40/40 (2020.01); G06N 3/006 (2023.01); G06N 3/08 (2023.01); G06N 20/00 (2019.01); G06Q 50/00 (2012.01); G06V 10/82 (2022.01); G06V 20/10 (2022.01); G06V 20/30 (2022.01); G06V 40/16 (2022.01); G06V 40/20 (2022.01); G10L 15/02 (2006.01); G10L 15/06 (2013.01); G10L 15/07 (2013.01); G10L 15/16 (2006.01); G10L 15/18 (2013.01); G10L 15/183 (2013.01); G10L 15/187 (2013.01); G10L 15/22 (2006.01); G10L 15/26 (2006.01); G10L 17/06 (2013.01); G10L 17/22 (2013.01); H04L 12/28 (2006.01); H04L 41/00 (2022.01); H04L 41/22 (2022.01); H04L 43/0882 (2022.01); H04L 43/0894 (2022.01); H04L 51/02 (2022.01); H04L 51/216 (2022.01); H04L 67/306 (2022.01); H04L 67/50 (2022.01); H04L 67/5651 (2022.01); H04L 67/75 (2022.01); H04W 12/08 (2021.01); G10L 13/00 (2006.01); G10L 13/04 (2013.01); H04L 51/046 (2022.01); H04L 67/10 (2022.01); H04L 67/53 (2022.01)
CPC G06Q 50/01 (2013.01) [G02B 27/017 (2013.01); G06F 3/011 (2013.01); G06F 3/013 (2013.01); G06F 3/017 (2013.01); G06F 3/167 (2013.01); G06F 7/14 (2013.01); G06F 9/453 (2018.02); G06F 16/176 (2019.01); G06F 16/2255 (2019.01); G06F 16/2365 (2019.01); G06F 16/243 (2019.01); G06F 16/24552 (2019.01); G06F 16/24575 (2019.01); G06F 16/24578 (2019.01); G06F 16/248 (2019.01); G06F 16/285 (2019.01); G06F 16/3323 (2019.01); G06F 16/3329 (2019.01); G06F 16/3344 (2019.01); G06F 16/338 (2019.01); G06F 16/4393 (2019.01); G06F 16/90332 (2019.01); G06F 16/90335 (2019.01); G06F 16/9038 (2019.01); G06F 16/904 (2019.01); G06F 16/951 (2019.01); G06F 16/9535 (2019.01); G06F 18/2411 (2023.01); G06F 40/205 (2020.01); G06F 40/30 (2020.01); G06F 40/40 (2020.01); G06N 3/006 (2013.01); G06N 3/08 (2013.01); G06N 20/00 (2019.01); G06V 10/82 (2022.01); G06V 20/10 (2022.01); G06V 20/30 (2022.01); G06V 40/172 (2022.01); G06V 40/28 (2022.01); G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/07 (2013.01); G10L 15/16 (2013.01); G10L 15/1815 (2013.01); G10L 15/1822 (2013.01); G10L 15/183 (2013.01); G10L 15/187 (2013.01); G10L 15/22 (2013.01); G10L 15/26 (2013.01); G10L 17/06 (2013.01); G10L 17/22 (2013.01); H04L 12/2816 (2013.01); H04L 41/20 (2013.01); H04L 41/22 (2013.01); H04L 43/0882 (2013.01); H04L 43/0894 (2013.01); H04L 51/02 (2013.01); H04L 51/216 (2022.05); H04L 67/306 (2013.01); H04L 67/535 (2022.05); H04L 67/5651 (2022.05); H04L 67/75 (2022.05); H04W 12/08 (2013.01); G02B 2027/0138 (2013.01); G02B 2027/014 (2013.01); G06F 2216/13 (2013.01); G10L 13/00 (2013.01); G10L 13/04 (2013.01); G10L 2015/223 (2013.01); G10L 2015/225 (2013.01); H04L 51/046 (2013.01); H04L 67/10 (2013.01); H04L 67/53 (2022.05)] 18 Claims
OG exemplary drawing
 
1. A method comprising, by a client system:
receiving, at the client system, a speech input from a user and a visual input captured by one or more cameras of the client system, wherein the speech input comprises a textual input spoken by the user, and wherein the visual input depicts a real-time view captured by the one or more cameras, wherein the real-time view comprises one or more visual concepts and one or more attributes associated with the one or more visual concepts, and wherein the textual input comprises a co-reference to one or more of the visual concepts;
resolving, based on the one or more attributes and the co-reference, one or more entities corresponding to the one or more visual concepts associated with the co-reference; and
presenting, at the client system, a communication content responsive to the speech input and the visual input, wherein the communication content comprises information associated with executing results of one or more tasks corresponding to the one or more resolved entities.
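The claimed flow — matching a spoken co-reference (e.g. "that red flower") against the attributes of visual concepts detected in the camera view, then presenting task results for the resolved entities — can be illustrated with a minimal sketch. All names, data structures, and the attribute-matching rule below are illustrative assumptions, not the patented implementation:

```python
from dataclasses import dataclass, field

@dataclass
class VisualConcept:
    """A concept detected in the real-time camera view, with its attributes
    (e.g. {"color": "red", "type": "flower"}) — an assumed representation."""
    entity_id: str
    attributes: dict = field(default_factory=dict)

def resolve_coreference(coreference_attrs, concepts):
    """Resolve a co-reference from the speech input to the visual concepts
    whose attributes match it (simple exact-match heuristic for illustration)."""
    return [
        c for c in concepts
        if all(c.attributes.get(k) == v for k, v in coreference_attrs.items())
    ]

def present_communication_content(task, resolved):
    """Build a communication content string from task results for the
    resolved entities (a stand-in for the client system's presentation step)."""
    names = ", ".join(c.entity_id for c in resolved)
    return f"{task} result for: {names}"

# Real-time view containing two visual concepts and their attributes.
view = [
    VisualConcept("rose_01", {"color": "red", "type": "flower"}),
    VisualConcept("daisy_02", {"color": "white", "type": "flower"}),
]

# Speech input "What is that red flower?" carries a co-reference whose
# parsed attributes are assumed to be {"color": "red", "type": "flower"}.
entities = resolve_coreference({"color": "red", "type": "flower"}, view)
print(present_communication_content("identify", entities))
```

This sketch compresses the claim's three steps (receive, resolve, present) into a single attribute-matching pass; the actual system would involve speech recognition, visual concept detection, and task execution stages not shown here.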