fix: overhaul CI workflows for FSDP regression tests #1024
KeitaW left a comment
Review — Deployment Pipeline & Positives
Nice cleanup across the board — the sacct fix, SSH resilience, and heredoc corrections are all substantive improvements.
Security scanning (bandit) dropped without replacement
The old pr-review-and-slurm-test.yml ran bandit security scanning on all Python files. The new pr-lint.yml replaces it with only flake8 and bash -n. I understand the intent is a lightweight replacement (and the old workflow had YAML issues that made it non-functional), but this is a net reduction in security coverage.
I'd suggest either adding a bandit step to pr-lint.yml (scoped to changed files only, to keep it fast), or tracking the re-addition as a follow-up issue so it doesn't fall through the cracks.
Things That Look Great
- `sacct` for proper job status detection: The old pattern (`squeue` empty → "COMPLETED") silently treated FAILED/OOM_KILLED jobs as successes. Using `sacct` to check the actual exit code is the correct fix and will catch real failures.
- SSH retry wrapper with keepalive: The `ssh_cmd()` function with `ServerAliveInterval=60` and 3-retry logic is a smart pattern for the 6-hour monitoring loops where SSH connections would previously drop silently.
- Dedicated `cleanup-enroot` job: Moving enroot image cleanup to a separate job that `needs: [build, run-tests]` with `if: always()` eliminates the race condition where matrix entries could try to delete images still in use by sibling jobs.
- Heredoc indentation fixes: The `EOF` terminator placement was genuinely broken (unindented `EOF` inside indented YAML blocks). Fixing this across all 4 workflow files makes them actually parseable.
- ShellCheck compliance: Quoting `$GITHUB_OUTPUT`/`$GITHUB_ENV` redirects (SC2086) and grouping consecutive redirects into `{ } >>` blocks (SC2129) are good hygiene that prevents subtle word-splitting bugs in CI.
- Scoped PR linting: Only linting files changed in the PR (via `git diff --name-only --diff-filter=ACMR`) is much faster and avoids noise from pre-existing issues in the codebase.
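The `ssh_cmd`/`sacct` combination praised above can be sketched roughly as follows. This is illustrative only: the `CLUSTER_HOST` variable, the backoff policy, and the `sacct` field list are assumptions, not the PR's exact code.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names mirror the review, details are assumptions.
ssh_cmd() {
  # Keepalives stop long-idle monitoring loops from dropping the connection;
  # three attempts with a short backoff ride out transient SSH failures.
  local attempt
  for attempt in 1 2 3; do
    if ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
           "$CLUSTER_HOST" "$@"; then
      return 0
    fi
    echo "ssh attempt $attempt failed; retrying..." >&2
    sleep $((attempt * 5))
  done
  return 1
}

job_state() {
  # sacct reports the real terminal state and exit code; an empty squeue
  # answer only means the job left the queue, not that it COMPLETED.
  ssh_cmd sacct -j "$1" --format=State,ExitCode --noheader --parsable2 | head -n 1
}
```

A caller can then treat anything other than `COMPLETED|0:0` as a failure instead of assuming success when `squeue` goes quiet.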
```shell
ERRORS=0
echo "${{ steps.changed.outputs.sh_files }}" | while IFS= read -r f; do
  [ -z "$f" ] && continue
  [ -f "$f" ] || continue
  echo "  Checking: $f"
  if ! bash -n "$f" 2>&1; then
    ERRORS=$((ERRORS + 1))
  fi
done

if [ "$ERRORS" -gt 0 ]; then
  echo "Found syntax errors in $ERRORS shell script(s)"
  exit 1
fi

echo "All shell scripts passed syntax check"
```
Shell syntax check never fails due to subshell variable scope
The ERRORS counter is incremented inside a while loop that reads from a pipe (echo | while). In bash, the right side of a pipe runs in a subshell, so ERRORS is always 0 in the parent shell after the loop exits. The if [ "$ERRORS" -gt 0 ] check will never trigger, even when bash -n reports syntax errors.
I'd suggest using a here-string to keep the loop in the parent shell:
```diff
 ERRORS=0
-echo "${{ steps.changed.outputs.sh_files }}" | while IFS= read -r f; do
+while IFS= read -r f; do
   [ -z "$f" ] && continue
   [ -f "$f" ] || continue
   echo "  Checking: $f"
   if ! bash -n "$f" 2>&1; then
     ERRORS=$((ERRORS + 1))
   fi
-done
+done <<< "${{ steps.changed.outputs.sh_files }}"
 if [ "$ERRORS" -gt 0 ]; then
   echo "Found syntax errors in $ERRORS shell script(s)"
   exit 1
 fi
 echo "All shell scripts passed syntax check"
```
```yaml
- name: Lint Python Files
  if: steps.changed.outputs.py_count != '0'
  run: |
    pip install flake8
```
Unpinned flake8 version
pip install flake8 installs whatever version is current. A future major release could change rules or output format and break the workflow. I'd suggest pinning to a specific version.
```diff
-pip install flake8
+pip install flake8==7.1.1
```
Restructure and harden the GitHub Actions CI workflows that run FSDP
distributed training regression tests on a remote Slurm cluster via SSH.
Workflow structural fixes:
- Fix YAML syntax errors in all 4 Slurm workflows (heredoc EOF terminators
at column 0 inside YAML block scalars)
- Remove broken EKS regression workflow (fsdp-eks-regression.yml)
- Replace heavyweight 814-line PR review workflow with lightweight
path-aware linter (pr-lint.yml)
- Update actions/stale from v5 to v9
- Resolve all actionlint and shellcheck findings
Slurm job monitoring improvements:
- Replace broken squeue-based job status detection with sacct exit code
checking (squeue returning empty was incorrectly treated as COMPLETED)
- Add SSH retry wrapper (ssh_cmd) with exponential backoff
- Add SSH keepalive settings (ServerAliveInterval/ServerAliveCountMax)
- Add dedicated enroot cleanup job to avoid race conditions
- Add inline error log dump (last 200 lines) before job failure exit
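The inline error log dump described above can be sketched like this. It is a rough approximation: the function name and the exact wording of the group header are assumptions, and the real step's log path handling is not shown.

```shell
#!/usr/bin/env bash
# Rough sketch of the inline error log dump; the actual workflow step may differ.
dump_error_log() {
  local err_log="$1"
  echo "::group::Job error output (last 200 lines)"  # GitHub Actions log group
  if [ -f "$err_log" ]; then
    tail -n 200 "$err_log"
  else
    echo "(no error log found at $err_log)"
  fi
  echo "::endgroup::"
}
```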
Runtime fixes:
- Skip sudo apt install in create_venv.sh when sudo is unavailable
- Fix shell variable escaping for sbatch filename in heredoc
- Insert venv activation after last SBATCH directive (not line 2)
- Add /opt/slurm/bin to PATH for non-login SSH sessions
- Fix LD_PRELOAD path to system NCCL (/lib/x86_64-linux-gnu/libnccl.so)
- Pass HF_TOKEN to Slurm jobs via sed injection
HuggingFace data loading (fixes HTTP 429 rate limiting):
- Pre-download C4 dataset shards to local FSx storage on the cluster
- Generate a JSON manifest of local file paths per split
- Modify train_utils.py to use load_dataset('json', data_files=...)
when HF_DATA_FILES_MANIFEST env var points to a valid manifest
- Set HF_HOME and HF_DATASETS_CACHE for shared FSx caching
- Set HF_HUB_OFFLINE=1 in sbatch files to block runtime API calls
- Pre-cache tokenizers in the workflow pre-download step
Test matrix:
- Align both venv and container workflows to cluster: [p5],
model_config: [llama3_1_8b, llama3_1_70b]
- Set NCCL_DEBUG=WARN (reduced from INFO) in all sbatch files
Force-pushed 4307461 to ea5d883
KeitaW left a comment
Re-review — CI Failures That Need Fixing
All 5 CI checks are currently failing. Here's a breakdown of what needs fixing before merge.
Lint check: unused imports at lines 21-22 of train_utils.py
The top-level imports from transformers import LlamaForCausalLM, LlamaTokenizer, LlamaConfig (line 21) and from transformers.models.llama.modeling_llama import LlamaDecoderLayer (line 22) are unused — they're re-imported conditionally later at lines 141 and 282. I'd suggest removing the top-level imports to fix the F401 and F811 flake8 findings (4 of the 8 errors).
Container tests: invalid --container-mounts format (leading comma)
Both run-tests (p5, llama3_1_8b) and run-tests (p5, llama3_1_70b) fail with:
srun: error: pyxis: --container-mounts: invalid format:
srun: error: Invalid --container-mounts argument: ,/fsx/.../checkpoints:/checkpoints
The leading comma indicates an empty variable is being concatenated before the mount path. I'd suggest checking the sbatch template where --container-mounts is constructed — there's likely a pattern like ${OPTIONAL_MOUNT},/fsx/...:/checkpoints where OPTIONAL_MOUNT is empty.
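One way to rule out the leading comma is to make the separator conditional on the optional part being non-empty. A sketch of the idea; `OPTIONAL_MOUNT` is only the review's guess at the template variable name, and the paths below are placeholders:

```shell
#!/usr/bin/env bash
# Sketch: join optional and required mounts without a stray leading comma.
build_mounts() {
  local optional="$1" required="$2"
  # ${optional:+...} expands to "<optional>," only when optional is non-empty,
  # so an empty optional mount can never produce a leading comma.
  printf '%s%s' "${optional:+$optional,}" "$required"
}
```

With an empty first argument, `build_mounts "" "/fsx/ckpt:/checkpoints"` prints the required mount alone; a non-empty first argument is joined with exactly one comma.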
Venv tests: NCCL error on p5 cluster
Both venv regression jobs fail with:
torch.distributed.DistBackendError: NCCL error in: NCCLUtils.cpp:77, invalid usage, NCCL version 2.29.2
The LD_PRELOAD=/lib/x86_64-linux-gnu/libnccl.so forces system NCCL (2.29.2), but the error suggests a version or configuration mismatch with the PyTorch version in the venv. I'd suggest checking the .err log artifacts for NCCL debug details, and verifying system NCCL compatibility with the venv's PyTorch.
Previous finding still open: subshell variable scope bug in pr-lint.yml
The echo | while pipe + ERRORS counter bug from my first review (lines 85-101 of pr-lint.yml) is still present. The lint check passed this time only because flake8 caught errors first — if there were only shell scripts with syntax errors, the check would silently pass.
Things That Look Great
- The lint workflow caught real issues:
pr-lint.ymlimmediately proved its value by catching genuine unused imports and dead code intrain_utils.py. - Error log dump in monitor step: The
::group::Job error output (last 200 lines)pattern made it trivial to diagnose the NCCL and pyxis failures without downloading artifacts. - Manifest-based data loading: The
HF_DATA_FILES_MANIFESTapproach intrain_utils.pyis a clean solution to the HF rate-limiting problem. - Comprehensive scope: All changes coherently serve the goal of making FSDP regression tests actually pass on the p5 cluster.
```python
def validation(model, rank, world_size, val_loader):
    model.eval()
    correct = 0
```
Flake8 F841: unused variable correct
correct = 0 is assigned but never used in the validation() function. This is one of the 8 flake8 errors failing the lint check.
```diff
-    correct = 0
```
```diff
-check_fn_gpt = lambda submodule: isinstance(
-    submodule, transformer_layer
-)
+check_fn_gpt = lambda submodule: isinstance(submodule, transformer_layer)
```
Flake8 E731: lambda assigned to variable
Flake8 flags check_fn_gpt = lambda ... — the convention is to use def instead of assigning a lambda. This is one of the 8 flake8 errors failing the lint check.
```diff
-check_fn_gpt = lambda submodule: isinstance(submodule, transformer_layer)
+def check_fn_gpt(submodule):
+    return isinstance(submodule, transformer_layer)
```
Summary
Restructure and harden the GitHub Actions CI workflows that run FSDP distributed training regression tests on a remote Slurm cluster via SSH.
Changes
Workflow structural fixes
- Fix YAML syntax errors in all 4 Slurm workflows (heredoc `EOF` terminators at column 0 inside YAML block scalars)
- Remove broken EKS regression workflow (`fsdp-eks-regression.yml`)
- Replace heavyweight 814-line PR review workflow with lightweight path-aware linter (`pr-lint.yml`)
- Update `actions/stale` from v5 to v9
- Resolve all actionlint and shellcheck findings

Slurm job monitoring improvements
- Replace broken `squeue`-based job status detection with `sacct` exit code checking (`squeue` returning empty was incorrectly treated as COMPLETED)
- Add SSH retry wrapper (`ssh_cmd`) with exponential backoff
- Add SSH keepalive settings (`ServerAliveInterval`/`ServerAliveCountMax`)
- Add dedicated enroot cleanup job to avoid race conditions
- Add inline error log dump (last 200 lines) before job failure exit

Runtime fixes
- Skip `sudo apt install` in `create_venv.sh` when sudo is unavailable
- Fix shell variable escaping for sbatch filename in heredoc
- Insert venv activation after the last `#SBATCH` directive (not line 2)
- Add `/opt/slurm/bin` to `PATH` for non-login SSH sessions
- Fix `LD_PRELOAD` path to system NCCL (`/lib/x86_64-linux-gnu/libnccl.so`)
- Pass `HF_TOKEN` to Slurm jobs via sed injection

HuggingFace data loading (fixes HTTP 429 rate limiting)
- Pre-download C4 dataset shards to local FSx storage on the cluster
- Generate a JSON manifest of local file paths per split
- Modify `train_utils.py` to use `load_dataset("json", data_files=...)` when the `HF_DATA_FILES_MANIFEST` env var points to a valid manifest
- Set `HF_HOME` and `HF_DATASETS_CACHE` for shared FSx caching
- Set `HF_HUB_OFFLINE=1` in sbatch files to block runtime API calls
- Pre-cache tokenizers in the workflow pre-download step

Test matrix
- Align both venv and container workflows to `cluster: [p5]`, `model_config: [llama3_1_8b, llama3_1_70b]`
- Set `NCCL_DEBUG=WARN` (reduced from `INFO`) in all sbatch files

Files changed (19)
- Workflows: `fsdp-regression-test-venv.yml`, `fsdp-regression-test-container.yml`, `megatron-ci-slurm.yaml`, `closing-soon.yml`, `pr-lint.yml` (new)
- Removed: `fsdp-eks-regression.yml`, `pr-review-and-slurm-test.yml`
- `3.test_cases/pytorch/FSDP/slurm/*-training.sbatch`: LD_PRELOAD, HF caching, NCCL_DEBUG
- `create_venv.sh`: sudo fallback
- `train_utils.py`: manifest-based local data loading
- `.gitignore`: added `log.failed`

Testing
Validated through 15+ CI workflow runs on the p5 cluster, iteratively fixing issues from OIDC configuration through data loading.