
refactor: extract shared helpers for auxiliary deployment srun and haproxy generation #839

Draft
wprazuch wants to merge 12 commits into main from wprazuch/refactor-auxiliary-deployments

Conversation

@wprazuch
Contributor

…proxy generation

  • Extract shared instance loop pattern between model and auxiliary deployment srun
  • Unify haproxy srun generation into parameterized function
  • Merge duplicate auxiliary_deployments iteration loops in helpers.py
  • All 137 tests pass, no behavior change
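The "parameterized function" mentioned above can be sketched roughly as follows. This is an illustrative guess at the shape of the refactor, not the repo's actual API: the function name, parameters, and srun flags shown are assumptions.

```python
# Hypothetical sketch of a single parameterized srun generator shared by
# the model deployment, auxiliary deployments, and haproxy. The real
# helper in executor.py may have a different name and signature.
def generate_srun_command(name, image, nodelist, n_tasks, launch_cmd):
    """Build one srun line; the same template serves every deployment type."""
    return (
        f"srun --job-name={name} "
        f"--nodelist={','.join(nodelist)} "
        f"--ntasks={n_tasks} "
        f"--container-image={image} "
        f"bash -c {launch_cmd!r}"
    )

# The haproxy case then becomes one more call site instead of a
# duplicated function:
haproxy_cmd = generate_srun_command(
    "haproxy", "haproxy:latest", ["node0"], 1, "haproxy -f /etc/haproxy.cfg"
)
```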

gchlebus and others added 12 commits March 4, 2026 21:00
Add judge_deployment config section that deploys a judge model alongside
the model under test, on dedicated SLURM nodes within the same allocation.

Config structure:
- judge_deployment.type: vllm | sglang | none (default: none)
- judge_deployment.image, checkpoint_path, served_model_name, port, etc.
- judge_deployment.num_nodes: number of nodes for the judge server
- execution.judge_deployment.n_tasks: srun task count for judge
- execution.mounts.judge_deployment: container mounts for judge
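Putting the fields above together, a minimal config fragment might look like this. The key names follow the description; the concrete values (image tag, paths, counts) are illustrative assumptions:

```
# Illustrative values only; field names follow the list above.
judge_deployment:
  type: vllm                               # vllm | sglang | none
  image: vllm/vllm-openai:latest           # assumed tag
  checkpoint_path: /checkpoints/judge      # assumed path
  served_model_name: judge-model
  port: 8001
  num_nodes: 1

execution:
  judge_deployment:
    n_tasks: 1                             # srun task count for the judge
  mounts:
    judge_deployment:
      - /checkpoints:/checkpoints          # assumed mount
```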

When judge_deployment.type != none:
- Total SLURM allocation = num_nodes (model) + judge_deployment.num_nodes
- Nodes are split: first N for model, remaining M for judge
- Model deployment uses --nodelist to stay on its assigned nodes
- Judge server starts, health-checks, then eval runs
- JUDGE_ENDPOINT_URL and JUDGE_MODEL_ID env vars are auto-exported
  to evaluation containers for downstream use
- Judge server is terminated after evaluation completes
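The node-split step described above is simple enough to sketch directly. The helper name and the env-var export shown are assumptions for illustration, not the repo's code:

```python
# Hypothetical sketch of splitting one SLURM allocation between the
# model under test and the judge: first N nodes to the model, the
# remaining M to the judge, each pinned via --nodelist.
import os

def split_nodes(all_nodes, model_num_nodes):
    """Return (model_nodes, judge_nodes) from the full allocation."""
    return all_nodes[:model_num_nodes], all_nodes[model_num_nodes:]

model_nodes, judge_nodes = split_nodes(["n0", "n1", "n2"], 2)
# model srun gets --nodelist=n0,n1; judge srun gets --nodelist=n2

# After the judge passes its health check, its endpoint is exported for
# the evaluation containers (variable names taken from the list above):
os.environ["JUDGE_ENDPOINT_URL"] = f"http://{judge_nodes[0]}:8001/v1"
os.environ["JUDGE_MODEL_ID"] = "judge-model"
```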

Motivation: Ultra posttraining team needs to run 20+ checkpoints x 3-4
judge-dependent evals (arenahard, LCR, tau2, HLE, IFBench). NVCF judge
capacity becomes a bottleneck at this scale, especially for heavy judges
like DeepSeek-3.2 and Qwen-235B. On-cluster deployment eliminates the
external dependency.

Ref: feedback from Ultra evaluation meeting (2026-03-04)
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…ployment

Replace duplicated judge_deployment and user_deployment code with a generic
auxiliary_deployments system. Adding a new deployment type is now config-only.

- Add AuxDeploymentState dataclass and auxiliary_deployments.py module
- Add configs/auxiliary_deployment/ shared templates (vllm, none)
- Refactor executor.py: single _generate_auxiliary_deployment_srun_command()
  with haproxy/multi-instance support replaces 2 duplicated functions
- Normalization shim translates legacy judge_deployment/user_deployment keys
- Validation for duplicate ports, prefixes, required fields
- 137/137 tests pass
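The validation bullet above (duplicate ports, prefixes, required fields) can be sketched as follows. Function name, field names, and the prefix convention are assumptions, not the repo's actual API:

```python
# Hypothetical sketch of auxiliary-deployment validation: every entry
# needs its required fields, and ports/prefixes must be unique across
# all deployments so they can share one SLURM allocation.
def validate_auxiliary_deployments(deployments: dict) -> None:
    seen_ports, seen_prefixes = set(), set()
    for name, cfg in deployments.items():
        for field in ("image", "port"):
            if field not in cfg:
                raise ValueError(f"{name}: missing required field {field!r}")
        if cfg["port"] in seen_ports:
            raise ValueError(f"{name}: duplicate port {cfg['port']}")
        seen_ports.add(cfg["port"])
        prefix = cfg.get("env_prefix", name.upper())  # assumed convention
        if prefix in seen_prefixes:
            raise ValueError(f"{name}: duplicate prefix {prefix!r}")
        seen_prefixes.add(prefix)
```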

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
- Remove unused collect_judge/user_deployment_env_vars from env_vars.py
- Remove unused get_judge_endpoint_url/get_judge_served_model_name from helpers.py
- Add shell variable resolution for config_ef.yaml (auxiliary_deployments refs)
- Fix normalization shim to also migrate execution.mounts.{name}_deployment
- 137/137 tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
No more per-type Hydra config groups. Auxiliary deployments are configured
inline via auxiliary_deployments: {} in the user's run config.
The normalization shim still handles legacy top-level judge_deployment/
user_deployment keys for backward compatibility at the config level.
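As an illustration of the inline style described above, a run config might carry the deployments directly, with no Hydra config group involved. The entry name and values here are assumed, not taken from the repo:

```
# Hypothetical run config; keys under each entry follow the fields
# mentioned elsewhere in this PR, values are illustrative.
auxiliary_deployments:
  judge:
    type: vllm
    image: vllm/vllm-openai:latest
    port: 8001
    num_nodes: 1
```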

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…n shim

These features were never merged to main, so backward compatibility code
is unnecessary. auxiliary_deployments is the only supported API.

- Remove normalize_auxiliary_deployments() from auxiliary_deployments.py
- Remove its call and import from executor.py
- Remove legacy mount validation fallback for judge/user_deployment keys
- Update tests to use auxiliary_deployments directly
- 137/137 tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…proxy generation

- Extract shared instance loop pattern between model and auxiliary deployment srun
- Unify haproxy srun generation into parameterized function
- Merge duplicate auxiliary_deployments iteration loops in helpers.py
- All 137 tests pass, no behavior change

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 11, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

wprazuch force-pushed the wprazuch/judge-deployment branch from d96608a to 7501aa4 on March 16, 2026 12:33
Base automatically changed from wprazuch/judge-deployment to main March 18, 2026 13:40

2 participants