refactor: extract shared helpers for auxiliary deployment srun and ha…#839
Draft
Conversation
Add judge_deployment config section that deploys a judge model alongside the
model under test, on dedicated SLURM nodes within the same allocation.

Config structure:
- judge_deployment.type: vllm | sglang | none (default: none)
- judge_deployment.image, checkpoint_path, served_model_name, port, etc.
- judge_deployment.num_nodes: number of nodes for the judge server
- execution.judge_deployment.n_tasks: srun task count for judge
- execution.mounts.judge_deployment: container mounts for judge

When judge_deployment.type != none:
- Total SLURM allocation = num_nodes (model) + judge_deployment.num_nodes
- Nodes are split: first N for model, remaining M for judge
- Model deployment uses --nodelist to stay on its assigned nodes
- Judge server starts, health-checks, then eval runs
- JUDGE_ENDPOINT_URL and JUDGE_MODEL_ID env vars are auto-exported to
  evaluation containers for downstream use
- Judge server is terminated after evaluation completes

Motivation: Ultra posttraining team needs to run 20+ checkpoints x 3-4
judge-dependent evals (arenahard, LCR, tau2, HLE, IFBench). NVCF judge
capacity becomes a bottleneck at this scale, especially for heavy judges
like DeepSeek-3.2 and Qwen-235B. On-cluster deployment eliminates the
external dependency.

Ref: feedback from Ultra evaluation meeting (2026-03-04)
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
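The commit message above lists the new config keys. A run-config excerpt illustrating how they might fit together is sketched below; the key names come from the commit message, while all values (image name, paths, ports, mount syntax) are hypothetical:

```yaml
# Hypothetical example values; only the key names are taken from the
# commit message above.
judge_deployment:
  type: vllm                       # vllm | sglang | none (default: none)
  image: vllm/vllm-openai:latest
  checkpoint_path: /checkpoints/judge-model
  served_model_name: judge-model
  port: 8001
  num_nodes: 2                     # dedicated SLURM nodes for the judge

execution:
  judge_deployment:
    n_tasks: 2                     # srun task count for the judge
  mounts:
    judge_deployment:              # container mounts for the judge
      - /checkpoints:/checkpoints
```

With type set to none, no extra nodes are allocated and the judge-related env vars are not exported.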
…ployment

Replace duplicated judge_deployment and user_deployment code with a generic
auxiliary_deployments system. Adding a new deployment type is now config-only.

- Add AuxDeploymentState dataclass and auxiliary_deployments.py module
- Add configs/auxiliary_deployment/ shared templates (vllm, none)
- Refactor executor.py: single _generate_auxiliary_deployment_srun_command()
  with haproxy/multi-instance support replaces 2 duplicated functions
- Normalization shim translates legacy judge_deployment/user_deployment keys
- Validation for duplicate ports, prefixes, required fields
- 137/137 tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
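The actual fields of the AuxDeploymentState dataclass are not visible in this PR excerpt; the sketch below shows one plausible shape, inferred from the config keys the commits mention (type, port, num_nodes, node split, env-var prefix). Every field name and the endpoint_url helper are assumptions:

```python
from dataclasses import dataclass, field

# Plausible sketch only -- the real AuxDeploymentState in
# auxiliary_deployments.py may differ in names and fields.
@dataclass
class AuxDeploymentState:
    name: str                  # e.g. "judge" or "user"
    deployment_type: str       # "vllm" | "sglang" | "none"
    port: int
    num_nodes: int
    nodelist: list[str] = field(default_factory=list)  # assigned node split
    env_prefix: str = ""       # prefix for exported env vars, e.g. "JUDGE"

    def endpoint_url(self, host: str) -> str:
        """OpenAI-compatible base URL served by this deployment."""
        return f"http://{host}:{self.port}/v1"

state = AuxDeploymentState(name="judge", deployment_type="vllm",
                           port=8001, num_nodes=2, env_prefix="JUDGE")
print(state.endpoint_url("node-017"))  # http://node-017:8001/v1
```

Keeping per-deployment state in one dataclass is what lets a single generic code path replace the two duplicated judge/user functions.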
- Remove unused collect_judge/user_deployment_env_vars from env_vars.py
- Remove unused get_judge_endpoint_url/get_judge_served_model_name from helpers.py
- Add shell variable resolution for config_ef.yaml (auxiliary_deployments refs)
- Fix normalization shim to also migrate execution.mounts.{name}_deployment
- 137/137 tests pass
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
No more per-type Hydra config groups. Auxiliary deployments are configured
inline via auxiliary_deployments: {} in the user's run config.
The normalization shim still handles legacy top-level judge_deployment/
user_deployment keys for backward compatibility at the config level.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
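With the Hydra config groups gone, a user's run config might declare deployments inline roughly as follows; the top-level auxiliary_deployments key is from the commit message, and the nested entries and values are illustrative:

```yaml
# Hypothetical inline run-config excerpt; nested keys are illustrative.
auxiliary_deployments:
  judge:
    type: vllm
    checkpoint_path: /checkpoints/judge-model
    served_model_name: judge-model
    port: 8001
    num_nodes: 2
  user:
    type: sglang
    port: 8002
    num_nodes: 1
```

Each named entry gets its own node split within the same SLURM allocation, so adding a new deployment type is config-only.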
…n shim

These features were never merged to main, so backward compatibility code is
unnecessary. auxiliary_deployments is the only supported API.

- Remove normalize_auxiliary_deployments() from auxiliary_deployments.py
- Remove its call and import from executor.py
- Remove legacy mount validation fallback for judge/user_deployment keys
- Update tests to use auxiliary_deployments directly
- 137/137 tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
…proxy generation

- Extract shared instance loop pattern between model and auxiliary deployment srun
- Unify haproxy srun generation into parameterized function
- Merge duplicate auxiliary_deployments iteration loops in helpers.py
- All 137 tests pass, no behavior change

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
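The unification described above can be sketched as one parameterized srun command builder shared by the model and every auxiliary deployment. This is an illustration of the idea only: the function name, parameters, and srun/pyxis flag usage below are assumptions, not the PR's actual code:

```python
# Illustrative sketch of a single parameterized srun builder replacing
# per-deployment duplicates. Flag set and structure are assumptions.
def generate_deployment_srun_command(
    name: str,
    image: str,
    nodelist: list[str],
    n_tasks: int,
    mounts: list[str],
    launch_cmd: str,
) -> str:
    parts = [
        "srun",
        f"--job-name={name}",
        f"--nodelist={','.join(nodelist)}",  # pin to this deployment's node split
        f"--ntasks={n_tasks}",
        f"--container-image={image}",        # pyxis-style container flags
    ]
    if mounts:
        parts.append(f"--container-mounts={','.join(mounts)}")
    parts.append(launch_cmd)
    return " ".join(parts)

# Same builder serves the model and any auxiliary deployment (judge, user, ...).
cmd = generate_deployment_srun_command(
    name="judge", image="vllm.sqsh", nodelist=["node-017", "node-018"],
    n_tasks=2, mounts=["/ckpt:/ckpt"], launch_cmd="vllm serve /ckpt/judge",
)
print(cmd)
```

Multi-instance and haproxy support would then be a loop and an extra parameter over this one function rather than separate copies per deployment type.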
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Force-pushed from d96608a to 7501aa4.