US 12,488,470 B2
Single-stage open-vocabulary panoptic segmentation
Qihang Yu, Los Angeles, CA (US); Ju He, Los Angeles, CA (US); Xueqing Deng, Los Angeles, CA (US); Xiaohui Shen, Los Angeles, CA (US); and Liang-Chieh Chen, Los Angeles, CA (US)
Assigned to Lemon Inc., Grand Cayman (KY)
Filed by Lemon Inc., Grand Cayman (KY)
Filed on Aug. 3, 2023, as Appl. No. 18/365,060.
Prior Publication US 2025/0045929 A1, Feb. 6, 2025
Int. Cl. G06K 9/00 (2022.01); G06T 3/40 (2006.01); G06T 7/12 (2017.01); G06V 10/44 (2022.01); G06V 10/764 (2022.01); G06V 10/771 (2022.01)
CPC G06T 7/12 (2017.01) [G06T 3/40 (2013.01); G06V 10/44 (2022.01); G06V 10/764 (2022.01); G06V 10/771 (2022.01); G06T 2207/20084 (2013.01)]
20 Claims
 
1. A computing system for performing open-vocabulary panoptic segmentation, the computing system comprising:
a processor and memory storing instructions that, when executed by the processor, cause the processor to:
receive an image;
extract a plurality of feature maps from the image using a convolutional neural network-based (CNN-based) vision-language model;
generate a plurality of pixel features from the plurality of feature maps using a pixel decoder;
generate a plurality of mask predictions from the plurality of pixel features using a mask decoder;
generate a plurality of in-vocabulary class predictions corresponding to the plurality of mask predictions using the plurality of pixel features;
generate a plurality of out-of-vocabulary class predictions corresponding to the plurality of mask predictions using the plurality of feature maps;
perform geometric ensembling on the plurality of in-vocabulary class predictions and the plurality of out-of-vocabulary class predictions to generate a plurality of final class predictions; and
output the plurality of mask predictions and the plurality of final class predictions.
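
The classification and ensembling steps recited in claim 1 can be illustrated with a minimal NumPy sketch. It is not the patented implementation. It assumes that class predictions are per-mask probability vectors, that out-of-vocabulary predictions come from averaging backbone features inside each mask (mask pooling), and that geometric ensembling is a weighted geometric mean that is then renormalized. The function names `mask_pool` and `geometric_ensemble`, the weight `alpha`, and the zero threshold on mask logits are all hypothetical choices made for illustration; none of them appear in the claim.

```python
import numpy as np


def mask_pool(features, mask_logits):
    """Average the feature vectors that fall inside each predicted mask.

    features:    (H, W, C) feature map (assumed layout).
    mask_logits: (N, H, W) one logit map per mask prediction.
    Returns an (N, C) array with one pooled embedding per mask.
    """
    # Assumption: a pixel belongs to a mask when its logit is positive.
    masks = (mask_logits > 0).astype(features.dtype)
    area = masks.sum(axis=(1, 2)) + 1e-8  # avoid division by zero
    pooled = np.einsum("nhw,hwc->nc", masks, features)
    return pooled / area[:, None]


def geometric_ensemble(p_in, p_out, alpha=0.5):
    """Fuse two sets of class probabilities with a weighted geometric mean.

    p_in:  (N, K) in-vocabulary class probabilities.
    p_out: (N, K) out-of-vocabulary class probabilities.
    alpha: hypothetical weight in [0, 1]; 0 keeps p_in, 1 keeps p_out.
    Returns (N, K) fused probabilities, renormalized so each row sums to 1.
    """
    fused = p_in ** (1.0 - alpha) * p_out ** alpha
    return fused / fused.sum(axis=-1, keepdims=True)


# Toy usage: two masks, three candidate classes.
p_in = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
p_out = np.array([[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]])
final = geometric_ensemble(p_in, p_out, alpha=0.4)
labels = final.argmax(axis=-1)  # one final class index per mask
```

A geometric mean, unlike an arithmetic mean, drives the fused score toward zero whenever either classifier assigns a class near-zero probability, so a class must be plausible under both predictions to score highly. Whether the claimed system weights the two predictions equally or differently for different categories is not stated in claim 1.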