US 12,468,902 B2
Systems and methods for automated response to natural language instructions
Divyansh Garg, Stanford, CA (US); and Skanda Vaidyanath, Stanford, CA (US)
Assigned to The Board of Trustees of the Leland Stanford Junior University, Stanford, CA (US)
Filed by The Board of Trustees of the Leland Stanford Junior University, Stanford, CA (US)
Filed on Feb. 22, 2023, as Appl. No. 18/172,969.
Claims priority of provisional application 63/268,364, filed on Feb. 22, 2022.
Prior Publication US 2023/0267284 A1, Aug. 24, 2023
Int. Cl. G06F 40/00 (2020.01); G06F 40/216 (2020.01); G06F 40/30 (2020.01); G06F 40/44 (2020.01); G06N 3/045 (2023.01); G06N 20/00 (2019.01); G06N 20/20 (2019.01)
CPC G06F 40/44 (2020.01) [G06F 40/216 (2020.01); G06F 40/30 (2020.01); G06N 3/045 (2023.01); G06N 20/00 (2019.01); G06N 20/20 (2019.01)]
12 Claims
 
1. A method for enabling a machine to act upon natural language instructions, comprising:
obtaining a plurality of instruction and observation pairs;
generating language embeddings for each instruction in the plurality of instruction and observation pairs using a language encoder;
generating observation embeddings for each observation in the plurality of instruction and observation pairs;
predicting a set of skill codes for each given pair in the plurality of instruction and observation pairs based on a given language embedding and a given observation embedding generated from the given pair using a skill predictor;
predicting an action to correctly resolve the instruction of the given pair using a policy based on the set of skill codes; and
controlling a device to perform the predicted action.
 
7. A system for enabling a machine to act upon natural language instructions, comprising:
a processor;
a controllable device; and
a memory, comprising a natural language processing application that configures the processor to:
obtain a plurality of instruction and observation pairs;
generate language embeddings for each instruction in the plurality of instruction and observation pairs using a language encoder;
generate observation embeddings for each observation in the plurality of instruction and observation pairs;
predict a set of skill codes for each given pair in the plurality of instruction and observation pairs based on a given language embedding and a given observation embedding generated from the given pair using a skill predictor;
predict an action to correctly resolve the instruction of the given pair using a policy based on the set of skill codes; and
control the controllable device to perform the predicted action.
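
The sequence of steps recited in claims 1 and 7 can be illustrated with a minimal Python sketch. It is not the patented implementation: the hashing-based language encoder, the tanh observation encoder, the fixed sinusoidal scoring used as a skill predictor, the modular policy, the action names, and the LoggingDevice class are all hypothetical stand-ins chosen so the example runs with the standard library only. A real system would presumably use trained models for each component.

```python
import hashlib
import math
from dataclasses import dataclass, field

# All constants and names below are illustrative assumptions, not from the claims.
EMBED_DIM = 8          # size of each toy embedding
NUM_CODES = 4          # size of a hypothetical discrete skill codebook
CODES_PER_PAIR = 2     # number of skill codes predicted per pair
ACTIONS = ["move_left", "move_right", "pick_up", "put_down"]


def encode_language(instruction):
    """Toy language encoder: hash each token into a fixed-size count vector."""
    vec = [0.0] * EMBED_DIM
    for token in instruction.lower().split():
        digest = hashlib.sha256(token.encode()).digest()
        vec[digest[0] % EMBED_DIM] += 1.0
    return vec


def encode_observation(observation):
    """Toy observation encoder: pad or truncate to EMBED_DIM, then squash."""
    padded = (list(observation) + [0.0] * EMBED_DIM)[:EMBED_DIM]
    return [math.tanh(x) for x in padded]


def predict_skill_codes(lang_emb, obs_emb):
    """Toy skill predictor: score codebook entries on the joint embedding."""
    joint = lang_emb + obs_emb  # list concatenation of the two embeddings
    codes = []
    for k in range(CODES_PER_PAIR):
        # Fixed deterministic projection standing in for learned weights.
        scores = [
            sum(x * math.sin((i + 1) * (c + 1) * (k + 1))
                for i, x in enumerate(joint))
            for c in range(NUM_CODES)
        ]
        codes.append(max(range(NUM_CODES), key=scores.__getitem__))
    return codes


def policy(skill_codes):
    """Toy policy: choose an action from the set of skill codes only."""
    return ACTIONS[sum(skill_codes) % len(ACTIONS)]


@dataclass
class LoggingDevice:
    """Hypothetical controllable device that records the actions it performs."""
    executed: list = field(default_factory=list)

    def perform(self, action):
        self.executed.append(action)


def run(pairs, device):
    """Apply the claimed steps to each (instruction, observation) pair."""
    for instruction, observation in pairs:
        lang_emb = encode_language(instruction)
        obs_emb = encode_observation(observation)
        codes = predict_skill_codes(lang_emb, obs_emb)
        action = policy(codes)
        device.perform(action)
    return device.executed


if __name__ == "__main__":
    demo_pairs = [("pick up the red block", [0.1, 0.5]),
                  ("move left", [1.0, -1.0, 0.3])]
    print(run(demo_pairs, LoggingDevice()))
```

In this sketch the policy receives only the skill codes, mirroring the claim language that the action is predicted "based on the set of skill codes"; whether a practical policy would also consume the observation is left open here.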