Description
A Better Evaluation Entry for OpenAI-Style API Models
In practice, deployment and evaluation are usually separated: this frees the evaluation framework from the dependency bloat that comes with supporting many different models (the more models supported, the more tangled the dependencies become) and makes it easier to evaluate extremely large models. Each run also evaluates only a single API endpoint, so testing multiple models simply means running the command multiple times.
I noticed that the framework already supports evaluating OpenAI-style API models through the `GPT4V` class. However, the user experience still needs improvement. Specifically:
- To test a model, you have to modify `config.py` and register a `model_name` (see the sketch below).
- Request parameters such as `temperature` and `timeout` have to be adjusted manually in code.
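For reference, registering an endpoint today means hand-writing an entry roughly like the following. This is only a minimal sketch that assumes the `supported_VLM` / `partial(GPT4V, ...)` registry pattern in `config.py`; the keyword arguments are illustrative and may not match the wrapper's actual signature.

```python
# Sketch of the kind of entry currently added to config.py by hand.
# Keyword names below are illustrative and may differ from GPT4V's real signature.
from functools import partial
from vlmeval.api import GPT4V  # existing OpenAI-style API wrapper

supported_VLM = {
    # One hand-written registration per endpoint, e.g. a Qwen2.5-VL-7B model
    # served behind an OpenAI-compatible vLLM server.
    'vllm_qwen_2.5-7b': partial(
        GPT4V,
        model='Qwen/Qwen2.5-VL-7B-Instruct',
        temperature=0.1,
        timeout=600,
        max_tokens=16000,
    ),
}
```

Every new endpoint, and even a change of sampling parameters, means editing the framework's source, which is what the interface proposed below would avoid.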
Could you provide an interface like this?
```bash
python run_api.py \
    --model-name "vllm_qwen_2.5-7b" \
    --base-url "xxxxxxxx" \
    --api-key "xxxxx" \
    --max-token-out 16000 \
    --min-pixels 3k \
    --max-pixels 100w \
    --temperature 0.1 \
    --top-p 0.9 \
    --data MME \
    --work-dir ./outputs
```
With such an entry point, models like the one in VLMEvalKit#1093 would be supported automatically, with no extra per-model changes, because serving frameworks such as vLLM, SGLang, and LMDeploy already expose far more models through OpenAI-compatible endpoints.
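To make the idea concrete, here is a rough sketch of what such an entry script might look like. It is only a sketch under assumptions: the flag names mirror the command above, `GPT4V` is assumed to accept `api_base`, `key`, `max_tokens`, `temperature`, and `timeout` keyword arguments, and `run_evaluation` is a placeholder for however the framework's existing inference and scoring pipeline would actually be invoked.

```python
#!/usr/bin/env python
"""run_api.py -- sketch of a config-free entry point for OpenAI-style endpoints.

Illustrative only: flag names follow the proposal above; how the model object
is handed to the evaluation loop is an assumption, not the framework's API.
"""
import argparse

from vlmeval.api import GPT4V  # existing OpenAI-style wrapper in VLMEvalKit


def parse_args():
    parser = argparse.ArgumentParser(description='Evaluate an OpenAI-compatible API model.')
    parser.add_argument('--model-name', required=True, help='Name used for logging and output folders.')
    parser.add_argument('--base-url', required=True, help='OpenAI-compatible endpoint, e.g. http://host:port/v1')
    parser.add_argument('--api-key', default='EMPTY', help='API key; a dummy value for local servers.')
    parser.add_argument('--max-token-out', type=int, default=4096)
    parser.add_argument('--min-pixels', type=int, default=None, help='Optional lower bound on image pixels.')
    parser.add_argument('--max-pixels', type=int, default=None, help='Optional upper bound on image pixels.')
    parser.add_argument('--temperature', type=float, default=0.0)
    parser.add_argument('--top-p', type=float, default=1.0)
    parser.add_argument('--data', nargs='+', required=True, help='Benchmark name(s), e.g. MME.')
    parser.add_argument('--work-dir', default='./outputs')
    return parser.parse_args()


def run_evaluation(model, dataset, work_dir):
    """Placeholder: a real implementation would reuse the framework's existing
    inference + evaluation pipeline for `dataset` and write results to work_dir."""
    raise NotImplementedError('wire this to the existing run logic')


def main():
    args = parse_args()

    # Build the API wrapper directly from CLI flags instead of a config.py entry.
    # Keyword names are assumptions and must match the wrapper's real signature;
    # top-p and min/max-pixels would be forwarded the same way if supported.
    model = GPT4V(
        model=args.model_name,
        api_base=args.base_url,
        key=args.api_key,
        max_tokens=args.max_token_out,
        temperature=args.temperature,
        timeout=600,
    )

    for dataset in args.data:
        run_evaluation(model, dataset, work_dir=args.work_dir)


if __name__ == '__main__':
    main()
```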