
refactor: extract shared helpers for auxiliary deployment srun and haproxy generation #839

Draft
wprazuch wants to merge 12 commits into main from wprazuch/refactor-auxiliary-deployments

Conversation

@wprazuch
Contributor

…proxy generation

  • Extract shared instance loop pattern between model and auxiliary deployment srun
  • Unify haproxy srun generation into parameterized function
  • Merge duplicate auxiliary_deployments iteration loops in helpers.py
  • All 137 tests pass, no behavior change
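The "parameterized function" mentioned above can be sketched roughly as follows. This is an illustrative guess at the shape of the refactor, not the repo's actual API: the function name, parameters, and srun flags shown are assumptions.

```python
# Hypothetical sketch of a single parameterized srun generator shared by
# the model deployment, auxiliary deployments, and haproxy. The real
# helper in executor.py may have a different name and signature.
def generate_srun_command(name, image, nodelist, n_tasks, launch_cmd):
    """Build one srun line; the same template serves every deployment type."""
    return (
        f"srun --job-name={name} "
        f"--nodelist={','.join(nodelist)} "
        f"--ntasks={n_tasks} "
        f"--container-image={image} "
        f"bash -c {launch_cmd!r}"
    )

# The haproxy case then becomes one more call site instead of a
# duplicated function:
haproxy_cmd = generate_srun_command(
    "haproxy", "haproxy:latest", ["node0"], 1, "haproxy -f /etc/haproxy.cfg"
)
```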

gchlebus and others added 12 commits March 4, 2026 21:00
Add judge_deployment config section that deploys a judge model alongside
the model under test, on dedicated SLURM nodes within the same allocation.

Config structure:
- judge_deployment.type: vllm | sglang | none (default: none)
- judge_deployment.image, checkpoint_path, served_model_name, port, etc.
- judge_deployment.num_nodes: number of nodes for the judge server
- execution.judge_deployment.n_tasks: srun task count for judge
- execution.mounts.judge_deployment: container mounts for judge
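Putting the fields above together, a minimal config fragment might look like this. The key names follow the description; the concrete values (image tag, paths, counts) are illustrative assumptions:

```
# Illustrative values only; field names follow the list above.
judge_deployment:
  type: vllm                               # vllm | sglang | none
  image: vllm/vllm-openai:latest           # assumed tag
  checkpoint_path: /checkpoints/judge      # assumed path
  served_model_name: judge-model
  port: 8001
  num_nodes: 1

execution:
  judge_deployment:
    n_tasks: 1                             # srun task count for the judge
  mounts:
    judge_deployment:
      - /checkpoints:/checkpoints          # assumed mount
```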

When judge_deployment.type != none:
- Total SLURM allocation = num_nodes (model) + judge_deployment.num_nodes
- Nodes are split: first N for model, remaining M for judge
- Model deployment uses --nodelist to stay on its assigned nodes
- Judge server starts, health-checks, then eval runs
- JUDGE_ENDPOINT_URL and JUDGE_MODEL_ID env vars are auto-exported
  to evaluation containers for downstream use
- Judge server is terminated after evaluation completes
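The node-split step described above is simple enough to sketch directly. The helper name and the env-var export shown are assumptions for illustration, not the repo's code:

```python
# Hypothetical sketch of splitting one SLURM allocation between the
# model under test and the judge: first N nodes to the model, the
# remaining M to the judge, each pinned via --nodelist.
import os

def split_nodes(all_nodes, model_num_nodes):
    """Return (model_nodes, judge_nodes) from the full allocation."""
    return all_nodes[:model_num_nodes], all_nodes[model_num_nodes:]

model_nodes, judge_nodes = split_nodes(["n0", "n1", "n2"], 2)
# model srun gets --nodelist=n0,n1; judge srun gets --nodelist=n2

# After the judge passes its health check, its endpoint is exported for
# the evaluation containers (variable names taken from the list above):
os.environ["JUDGE_ENDPOINT_URL"] = f"http://{judge_nodes[0]}:8001/v1"
os.environ["JUDGE_MODEL_ID"] = "judge-model"
```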

Motivation: Ultra posttraining team needs to run 20+ checkpoints x 3-4
judge-dependent evals (arenahard, LCR, tau2, HLE, IFBench). NVCF judge
capacity becomes a bottleneck at this scale, especially for heavy judges
like DeepSeek-3.2 and Qwen-235B. On-cluster deployment eliminates the
external dependency.

Ref: feedback from Ultra evaluation meeting (2026-03-04)
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…ployment

Replace duplicated judge_deployment and user_deployment code with a generic
auxiliary_deployments system. Adding a new deployment type is now config-only.

- Add AuxDeploymentState dataclass and auxiliary_deployments.py module
- Add configs/auxiliary_deployment/ shared templates (vllm, none)
- Refactor executor.py: single _generate_auxiliary_deployment_srun_command()
  with haproxy/multi-instance support replaces 2 duplicated functions
- Normalization shim translates legacy judge_deployment/user_deployment keys
- Validation for duplicate ports, prefixes, required fields
- 137/137 tests pass
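The validation bullet above (duplicate ports, prefixes, required fields) can be sketched as follows. Function name, field names, and the prefix convention are assumptions, not the repo's actual API:

```python
# Hypothetical sketch of auxiliary-deployment validation: every entry
# needs its required fields, and ports/prefixes must be unique across
# all deployments so they can share one SLURM allocation.
def validate_auxiliary_deployments(deployments: dict) -> None:
    seen_ports, seen_prefixes = set(), set()
    for name, cfg in deployments.items():
        for field in ("image", "port"):
            if field not in cfg:
                raise ValueError(f"{name}: missing required field {field!r}")
        if cfg["port"] in seen_ports:
            raise ValueError(f"{name}: duplicate port {cfg['port']}")
        seen_ports.add(cfg["port"])
        prefix = cfg.get("env_prefix", name.upper())  # assumed convention
        if prefix in seen_prefixes:
            raise ValueError(f"{name}: duplicate prefix {prefix!r}")
        seen_prefixes.add(prefix)
```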

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
- Remove unused collect_judge/user_deployment_env_vars from env_vars.py
- Remove unused get_judge_endpoint_url/get_judge_served_model_name from helpers.py
- Add shell variable resolution for config_ef.yaml (auxiliary_deployments refs)
- Fix normalization shim to also migrate execution.mounts.{name}_deployment
- 137/137 tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
No more per-type Hydra config groups. Auxiliary deployments are configured
inline via auxiliary_deployments: {} in the user's run config.
The normalization shim still handles legacy top-level judge_deployment/
user_deployment keys for backward compatibility at the config level.
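As an illustration of the inline style described above, a run config might carry the deployments directly, with no Hydra config group involved. The entry name and values here are assumed, not taken from the repo:

```
# Hypothetical run config; keys under each entry follow the fields
# mentioned elsewhere in this PR, values are illustrative.
auxiliary_deployments:
  judge:
    type: vllm
    image: vllm/vllm-openai:latest
    port: 8001
    num_nodes: 1
```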

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…n shim

These features were never merged to main, so backward compatibility code
is unnecessary. auxiliary_deployments is the only supported API.

- Remove normalize_auxiliary_deployments() from auxiliary_deployments.py
- Remove its call and import from executor.py
- Remove legacy mount validation fallback for judge/user_deployment keys
- Update tests to use auxiliary_deployments directly
- 137/137 tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…proxy generation

- Extract shared instance loop pattern between model and auxiliary deployment srun
- Unify haproxy srun generation into parameterized function
- Merge duplicate auxiliary_deployments iteration loops in helpers.py
- All 137 tests pass, no behavior change

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 11, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

wprazuch force-pushed the wprazuch/judge-deployment branch from d96608a to 7501aa4 on March 16, 2026 12:33
Base automatically changed from wprazuch/judge-deployment to main March 18, 2026 13:40

2 participants