Open-vocabulary segmentation poses significant challenges, as it requires segmenting and recognizing objects from an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models such as CLIP, recent efforts have sought to harness their zero-shot capabilities to recognize unseen categories. Despite notable performance gains, these models still struggle to generate precise mask proposals for unseen categories and scenarios, which ultimately limits segmentation performance. To address this challenge, we introduce Cross-Model Prior Fusion (CMPF), a framework that fuses visual knowledge from a localization foundation model (e.g., SAM) with text knowledge from a ViL model (e.g., CLIP), leveraging their complementary priors to overcome the inherent limitations of mask proposal generation. Taking the ViL model's visual encoder as the feature backbone, we propose a Query Injector and a Feature Injector that inject the visual localization features into the learnable queries and the CLIP features, respectively, within a transformer decoder. In addition, an OpenSeg Ensemble strategy is designed to further improve mask quality by incorporating SAM's universal segmentation masks during inference. To fully exploit pre-trained knowledge while minimizing training overhead, we freeze both foundation models and focus optimization solely on a lightweight transformer decoder for mask proposal generation, the performance bottleneck. Extensive experiments demonstrate that CMPF advances state-of-the-art results across various segmentation benchmarks, trained exclusively on COCO panoptic data and tested in a zero-shot manner.
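As a rough illustration of the architecture described above, here is a minimal PyTorch sketch of the two injection modules. The class names follow the paper's terminology, but all shapes, hyperparameters, and fusion details are illustrative assumptions rather than the repository's actual implementation.

```python
# Minimal sketch of the two injectors. All argument names, dimensions, and
# fusion details are illustrative assumptions, not this repository's real API.
import torch
import torch.nn as nn

class QueryInjector(nn.Module):
    """Injects SAM localization features into the learnable queries:
    the queries cross-attend to the (flattened) SAM feature map."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, queries: torch.Tensor, sam_feats: torch.Tensor) -> torch.Tensor:
        # queries: (B, N, C); sam_feats: (B, HW, C) flattened SAM feature map
        attn_out, _ = self.cross_attn(queries, sam_feats, sam_feats)
        return self.norm(queries + attn_out)

class FeatureInjector(nn.Module):
    """Fuses SAM localization features into the frozen CLIP feature map
    via a lightweight projection plus residual addition."""
    def __init__(self, d_sam: int = 256, d_clip: int = 256):
        super().__init__()
        self.proj = nn.Linear(d_sam, d_clip)
        self.norm = nn.LayerNorm(d_clip)

    def forward(self, clip_feats: torch.Tensor, sam_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats, sam_feats: (B, HW, C), assumed spatially aligned
        return self.norm(clip_feats + self.proj(sam_feats))

# Per the paper, both foundation models stay frozen and only the lightweight
# decoder (including these injectors) is optimized, e.g.:
#   for p in clip_backbone.parameters(): p.requires_grad = False
#   for p in sam_backbone.parameters():  p.requires_grad = False
```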
- Installation: see the installation instructions.
- Data preparation: see Preparing Datasets.
- Usage: see Getting Started.
Results of CMPF trained exclusively on COCO panoptic data; A-150 denotes ADE20K with 150 categories, and all benchmarks except COCO (the training dataset) are evaluated zero-shot.

| Model | A-150 PQ | A-150 mAP | A-150 mIoU | A-150 FWIoU | Cityscapes PQ | Cityscapes mAP | Cityscapes mIoU | Mapillary Vistas PQ | Mapillary Vistas mIoU | BDD100K PQ | BDD100K mIoU | A-847 mIoU | A-847 FWIoU | PC-459 mIoU | PC-459 FWIoU | PAS-21 mIoU | PAS-21 FWIoU | LVIS APr | COCO PQ | COCO mAP | COCO mIoU | Download |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CMPF (ResNet50x64) | 23.1 | 13.5 | 30.7 | 56.6 | 45.2 | 28.9 | 56.0 | 18.1 | 27.7 | 12.9 | 46.2 | 11.8 | 52.8 | 18.7 | 60.1 | 82.3 | 92.1 | 23.5 | 55.7 | 47.4 | 65.4 | checkpoint |
| CMPF (ConvNeXt-Large) | 25.9 | 16.5 | 34.4 | 59.9 | 45.8 | 28.4 | 56.8 | 18.5 | 27.3 | 19.3 | 52.3 | 14.8 | 51.4 | 19.7 | 60.2 | 82.5 | 92.1 | 25.6 | 56.2 | 47.3 | 65.5 | checkpoint |
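The OpenSeg Ensemble mentioned in the abstract incorporates SAM's universal (class-agnostic) masks at inference to improve the quality of the decoder's proposals. The sketch below shows one plausible realization based on greedy IoU matching; the matching rule and the mask-union refinement are assumptions, not the strategy actually used in this repository.

```python
# A hedged sketch of one way decoder proposals could be ensembled with SAM's
# class-agnostic masks at inference. The greedy IoU matching and mask-union
# refinement are illustrative assumptions.
import torch

def mask_iou(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise IoU between two sets of binary masks.
    a: (N, H, W) bool; b: (M, H, W) bool -> (N, M) float."""
    a = a.flatten(1).float()                    # (N, HW)
    b = b.flatten(1).float()                    # (M, HW)
    inter = a @ b.T                             # (N, M)
    union = a.sum(1, keepdim=True) + b.sum(1) - inter
    return inter / union.clamp(min=1)

def openseg_ensemble(proposals: torch.Tensor, sam_masks: torch.Tensor,
                     iou_thresh: float = 0.5) -> torch.Tensor:
    """Refine each decoder proposal with its best-matching SAM mask.
    proposals: (N, H, W) bool; sam_masks: (M, H, W) bool."""
    iou = mask_iou(proposals, sam_masks)        # (N, M)
    best_iou, best_idx = iou.max(dim=1)         # best SAM match per proposal
    refined = proposals.clone()
    keep = best_iou > iou_thresh
    # Union the proposal with its matched SAM mask to recover crisp boundaries.
    refined[keep] = proposals[keep] | sam_masks[best_idx[keep]]
    return refined
```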
This repository serves as the official implementation of both CMPF and FrozenSeg ("FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation"), which are the same work presented under different names.
This codebase builds on Detectron2, Mask2Former, Segment Anything, OpenCLIP, and FC-CLIP.