CPC G06V 40/23 (2022.01) [G06V 10/82 (2022.01)] | 20 Claims
1. A system comprising:
a computing device including:
a non-transitory computer-readable storage medium storing machine readable instructions; and
a processor coupled to the non-transitory computer-readable storage medium and configured to execute the machine readable instructions to:
obtain an image of a worker performing a work task,
encode the image to generate an embedding vector,
transmit the embedding vector to an image-grounded text decoder, while generating first tokens to instruct the image-grounded text decoder to generate a first sentence indicating a root cause of an ergonomic risk identified in the image,
compute first relative sensitivity scores relating to the first tokens and extracted image features,
generate second tokens of the first sentence based on the first relative sensitivity scores, while generating third tokens to instruct a text decoder to generate a second sentence indicating a solution to the ergonomic risk identified in the image,
calculate second relative sensitivity scores relating to the second and third tokens,
generate the first and second sentences based on the first and second relative sensitivity scores, and the second and third tokens, and
apply a loss function to the first sentence indicating the root cause of the ergonomic risk and the second sentence indicating the solution to the ergonomic risk, wherein the loss function is determined based at least upon a total number of tokens in the first and second sentences, a vocabulary size for the first and second sentences in a training dataset, information related to target tokens, the image, and a label smoothing factor.
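The loss recited in the final clause depends on the total token count N, the vocabulary size V, the target tokens, the image, and a label smoothing factor ε. Those are the ingredients of a label-smoothed cross-entropy averaged over the tokens of both sentences: L = -(1/N) Σ_i Σ_v q_{i,v} log p(v | y_<i, image), with q_{i,v} = (1 - ε)·[v = y_i] + ε/V. The sketch below assumes that interpretation. The claim does not state the exact formula. The function names, the concatenation of the root-cause and solution sentences into one target sequence, and the ε/V smoothing convention (the one PyTorch's `CrossEntropyLoss(label_smoothing=...)` uses) are all assumptions. The image enters only through the logits, which are assumed to come from a decoder conditioned on the image embedding.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothed_loss(
    logits: Sequence[Sequence[float]],  # N rows, one per target token; each row has V entries
    targets: Sequence[int],             # N target token ids (both sentences concatenated; assumed)
    epsilon: float = 0.1,               # label smoothing factor
) -> float:
    """Mean label-smoothed cross-entropy over all N tokens.

    Assumed target distribution: (1 - epsilon) * one_hot(y) + epsilon / V.
    Hypothetical helper for illustration, not the patented implementation.
    """
    n = len(targets)  # total number of tokens N
    if n == 0 or len(logits) != n:
        raise ValueError("need exactly one logit row per target token")
    total = 0.0
    for row, y in zip(logits, targets):
        v = len(row)                  # vocabulary size V
        log_p = log_softmax(row)      # log-probabilities from the (image-conditioned) decoder
        nll = -log_p[y]               # cross-entropy against the hard target token
        smooth = -sum(log_p) / v      # cross-entropy against the uniform distribution
        total += (1.0 - epsilon) * nll + epsilon * smooth
    return total / n


# Hypothetical toy input: 3 target tokens drawn from a 4-token vocabulary.
toy_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 3.0, 0.2, 0.1], [0.3, 0.1, 0.0, 2.5]]
toy_targets = [0, 1, 3]
print(label_smoothed_loss(toy_logits, toy_targets, epsilon=0.1))
```

With ε = 0 the function reduces to ordinary token-level cross-entropy. Raising ε moves some probability mass onto non-target tokens, which discourages the decoder from becoming overconfident on the short root-cause and solution sentences.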