| CPC G06T 7/12 (2017.01) [G06T 3/40 (2013.01); G06V 10/44 (2022.01); G06V 10/764 (2022.01); G06V 10/771 (2022.01); G06T 2207/20084 (2013.01)] | 20 Claims |

|
1. A computing system for performing open-vocabulary panoptic segmentation, the computing system comprising:
a processor and memory storing instructions that, when executed by the processor, cause the processor to:
receive an image;
extract a plurality of feature maps from the image using a convolutional neural network-based (CNN-based) vision-language model;
generate a plurality of pixel features from the plurality of feature maps using a pixel decoder;
generate a plurality of mask predictions from the plurality of pixel features using a mask decoder;
generate a plurality of in-vocabulary class predictions corresponding to the plurality of mask predictions using the plurality of pixel features;
generate a plurality of out-of-vocabulary class predictions corresponding to the plurality of mask predictions using the plurality of feature maps;
perform geometric ensembling on the plurality of in-vocabulary class predictions and the plurality of out-of-vocabulary class predictions to generate a plurality of final class predictions; and
output the plurality of mask predictions and the plurality of final class predictions.
|
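The geometric ensembling step in claim 1 can be illustrated with a minimal sketch. The claim does not state a formula, so this example assumes a weighted geometric mean of the two per-mask class-probability vectors, renormalized to sum to one. The function names `geometric_ensemble` and `ensemble_per_mask` and the weight `alpha` are hypothetical and are not taken from the claim.

```python
# Illustrative sketch only: the claim does not specify the ensembling formula.
# Assumption: a weighted geometric mean with a hypothetical weight `alpha`.

def geometric_ensemble(p_in, p_out, alpha=0.5):
    """Fuse two class-probability vectors for one mask.

    p_in  : in-vocabulary class probabilities (from pixel features)
    p_out : out-of-vocabulary class probabilities (from feature maps)
    alpha : assumed weight on p_out, in [0, 1]
    """
    if len(p_in) != len(p_out):
        raise ValueError("both predictions must cover the same classes")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Element-wise weighted geometric mean: p_in^(1 - alpha) * p_out^alpha.
    fused = [(a ** (1.0 - alpha)) * (b ** alpha) for a, b in zip(p_in, p_out)]
    total = sum(fused)
    if total == 0.0:
        raise ValueError("fused scores are all zero; cannot normalize")
    # Renormalize so the final class prediction sums to one.
    return [f / total for f in fused]


def ensemble_per_mask(in_vocab, out_vocab, alpha=0.5):
    """Apply the fusion to every mask prediction (one vector per mask)."""
    if len(in_vocab) != len(out_vocab):
        raise ValueError("both inputs must have one vector per mask")
    return [geometric_ensemble(a, b, alpha) for a, b in zip(in_vocab, out_vocab)]


if __name__ == "__main__":
    # Two masks, three candidate classes each (made-up numbers).
    in_vocab = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    out_vocab = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
    for final in ensemble_per_mask(in_vocab, out_vocab, alpha=0.4):
        print([round(p, 3) for p in final])
```

Under this assumption, a class scores highly in the final prediction only when both classifiers give it non-negligible probability. That is the usual reason for preferring a geometric mean over an arithmetic mean.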