
fix: overhaul CI workflows for FSDP regression tests#1024

Open
paragao wants to merge 1 commit into main from fix/ci-workflows

Conversation


@paragao paragao commented Mar 17, 2026

Summary

Restructure and harden the GitHub Actions CI workflows that run FSDP distributed training regression tests on a remote Slurm cluster via SSH.

Changes

Workflow structural fixes

  • Fix YAML syntax errors in all 4 Slurm workflows (heredoc EOF terminators at column 0 inside YAML block scalars)
  • Remove broken EKS regression workflow (fsdp-eks-regression.yml)
  • Replace heavyweight 814-line PR review workflow with lightweight path-aware linter (pr-lint.yml)
  • Update actions/stale from v5 to v9
  • Resolve all actionlint and shellcheck findings

Slurm job monitoring improvements

  • Replace broken squeue-based job status detection with sacct exit code checking (squeue returning empty was incorrectly treated as COMPLETED)
  • Add SSH retry wrapper (ssh_cmd) with exponential backoff
  • Add SSH keepalive settings (ServerAliveInterval/ServerAliveCountMax)
  • Add dedicated enroot cleanup job to avoid race conditions
  • Add inline error log dump (last 200 lines) before job failure exit
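
The monitoring changes above can be sketched roughly as follows; the function names (`retry_cmd`, `ssh_cmd`, `classify_state`) and the exact `sacct` flags are illustrative assumptions, not the PR's actual code:

```shell
# Retry an arbitrary command with exponential backoff (1s, 2s, 4s, ...).
retry_cmd() {
  local attempt=1 max=3 delay=1
  while true; do
    if "$@"; then return 0; fi
    [ "$attempt" -ge "$max" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# SSH with keepalives so long monitoring loops survive transient drops.
ssh_cmd() {
  retry_cmd ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
    "$SLURM_HOST" "$@"
}

# Map a sacct State value to a CI outcome; unlike the old squeue check,
# "job no longer listed" is never interpreted as success.
classify_state() {
  case "$1" in
    COMPLETED)                               echo ok ;;
    FAILED|OUT_OF_MEMORY|TIMEOUT|CANCELLED*) echo failed ;;
    *)                                       echo running ;;
  esac
}
# e.g.: classify_state "$(ssh_cmd sacct -j "$JOB_ID" -X --noheader -o State | tr -d ' ')"
```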

Runtime fixes

  • Skip sudo apt install in create_venv.sh when sudo is unavailable
  • Fix shell variable escaping for sbatch filename in heredoc
  • Insert venv activation after last #SBATCH directive (not line 2)
  • Add /opt/slurm/bin to PATH for non-login SSH sessions
  • Fix LD_PRELOAD path to system NCCL (/lib/x86_64-linux-gnu/libnccl.so)
  • Pass HF_TOKEN to Slurm jobs via sed injection
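
The "insert after the last #SBATCH directive" fix can be illustrated with a small awk sketch; the helper name and approach are assumptions about the intent, not the PR's implementation:

```shell
# Hypothetical helper: print a sbatch script with an extra line inserted
# right after the final #SBATCH directive, wherever it happens to be.
insert_after_sbatch() {  # $1 = sbatch file, $2 = line to insert
  awk -v ins="$2" '
    { lines[NR] = $0; if ($0 ~ /^#SBATCH/) last = NR }
    END {
      for (i = 1; i <= NR; i++) {
        print lines[i]
        if (i == last) print ins
      }
    }' "$1"
}
```

Inserting at a fixed line 2 breaks whenever a script carries more than one #SBATCH directive, since sbatch stops parsing directives at the first non-comment line; scanning for the last directive avoids that.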

HuggingFace data loading (fixes HTTP 429 rate limiting)

  • Pre-download C4 dataset shards to local FSx storage on the cluster
  • Generate a JSON manifest of local file paths per split
  • Modify train_utils.py to use load_dataset("json", data_files=...) when HF_DATA_FILES_MANIFEST env var points to a valid manifest
  • Set HF_HOME and HF_DATASETS_CACHE for shared FSx caching
  • Set HF_HUB_OFFLINE=1 in sbatch files to block runtime API calls
  • Pre-cache tokenizers in the workflow pre-download step
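
The per-split manifest generation might look something like this sketch; the directory layout, shard suffix, and helper name are all assumptions:

```shell
# Hypothetical sketch: write a JSON manifest listing pre-downloaded C4
# shards per split, for consumption by manifest-aware data loading.
build_manifest() {  # $1 = dataset root on FSx, $2 = output manifest path
  {
    echo '{'
    echo '  "train": ['
    find "$1/train" -name '*.json.gz' | sort | sed 's/.*/    "&",/' | sed '$ s/,$//'
    echo '  ],'
    echo '  "validation": ['
    find "$1/validation" -name '*.json.gz' | sort | sed 's/.*/    "&",/' | sed '$ s/,$//'
    echo '  ]'
    echo '}'
  } > "$2"
}
```

train_utils.py can then feed the manifest's per-split lists to `load_dataset("json", data_files=...)` whenever `HF_DATA_FILES_MANIFEST` points at this file.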

Test matrix

  • Align both venv and container workflows: cluster: [p5], model_config: [llama3_1_8b, llama3_1_70b]
  • Set NCCL_DEBUG=WARN (reduced from INFO) in all sbatch files

Files changed (19)

Category Files
Workflows fsdp-regression-test-venv.yml, fsdp-regression-test-container.yml, megatron-ci-slurm.yaml, closing-soon.yml, pr-lint.yml (new)
Workflows removed fsdp-eks-regression.yml, pr-review-and-slurm-test.yml
Sbatch files (9) All 3.test_cases/pytorch/FSDP/slurm/*-training.sbatch — LD_PRELOAD, HF caching, NCCL_DEBUG
Scripts create_venv.sh — sudo fallback
Training code train_utils.py — manifest-based local data loading
Other .gitignore — added log.failed

Testing

Validated through 15+ CI workflow runs on the p5 cluster, iteratively fixing issues from OIDC configuration through data loading.


@KeitaW KeitaW left a comment


Review — Deployment Pipeline & Positives

Nice cleanup across the board — the sacct fix, SSH resilience, and heredoc corrections are all substantive improvements.

Security scanning (bandit) dropped without replacement

The old pr-review-and-slurm-test.yml ran bandit security scanning on all Python files. The new pr-lint.yml replaces it with only flake8 and bash -n. I understand the intent is a lightweight replacement (and the old workflow had YAML issues that made it non-functional), but this is a net reduction in security coverage.

I'd suggest either adding a bandit step to pr-lint.yml (scoped to changed files only, to keep it fast), or tracking the re-addition as a follow-up issue so it doesn't fall through the cracks.
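
A changed-files-only bandit step could be sketched like this; the helper name is hypothetical, and the bandit invocation at the end is only suggested usage, not the PR's code:

```shell
# Hypothetical helper: read a newline-separated file list on stdin, keep
# only existing *.py files, and hand them to the given scanner command.
scan_py_files() {
  local files=() f
  while IFS= read -r f; do
    case "$f" in
      *.py) [ -f "$f" ] && files+=("$f") ;;
    esac
  done
  [ "${#files[@]}" -eq 0 ] && return 0   # nothing to scan, pass the step
  "$@" "${files[@]}"
}
# Intended CI usage (bandit flags are an assumption):
#   git diff --name-only --diff-filter=ACMR "$BASE_SHA"... \
#     | scan_py_files bandit -ll
```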


Things That Look Great

  • sacct for proper job status detection: The old pattern (squeue empty → "COMPLETED") silently treated FAILED/OOM_KILLED jobs as successes. Using sacct to check the actual exit code is the correct fix and will catch real failures.
  • SSH retry wrapper with keepalive: The ssh_cmd() function with ServerAliveInterval=60 and 3-retry logic is a smart pattern for the 6-hour monitoring loops where SSH connections would previously drop silently.
  • Dedicated cleanup-enroot job: Moving enroot image cleanup to a separate job that needs: [build, run-tests] with if: always() eliminates the race condition where matrix entries could try to delete images still in use by sibling jobs.
  • Heredoc indentation fixes: The EOF terminator placement was genuinely broken (unindented EOF inside indented YAML blocks). Fixing this across all 4 workflow files makes them actually parseable.
  • ShellCheck compliance: Quoting $GITHUB_OUTPUT / $GITHUB_ENV redirects (SC2086) and grouping consecutive redirects into { } >> blocks (SC2129) are good hygiene that prevents subtle word-splitting bugs in CI.
  • Scoped PR linting: Only linting files changed in the PR (via git diff --name-only --diff-filter=ACMR) is much faster and avoids noise from pre-existing issues in the codebase.

Comment on lines +85 to +101
ERRORS=0

echo "${{ steps.changed.outputs.sh_files }}" | while IFS= read -r f; do
  [ -z "$f" ] && continue
  [ -f "$f" ] || continue
  echo " Checking: $f"
  if ! bash -n "$f" 2>&1; then
    ERRORS=$((ERRORS + 1))
  fi
done

if [ "$ERRORS" -gt 0 ]; then
  echo "Found syntax errors in $ERRORS shell script(s)"
  exit 1
fi

echo "All shell scripts passed syntax check"

Shell syntax check never fails due to subshell variable scope

The ERRORS counter is incremented inside a while loop that reads from a pipe (echo | while). In bash, the right side of a pipe runs in a subshell, so ERRORS is always 0 in the parent shell after the loop exits. The if [ "$ERRORS" -gt 0 ] check will never trigger, even when bash -n reports syntax errors.

I'd suggest using a here-string to keep the loop in the parent shell:

Suggested change
ERRORS=0
echo "${{ steps.changed.outputs.sh_files }}" | while IFS= read -r f; do
  [ -z "$f" ] && continue
  [ -f "$f" ] || continue
  echo " Checking: $f"
  if ! bash -n "$f" 2>&1; then
    ERRORS=$((ERRORS + 1))
  fi
done
if [ "$ERRORS" -gt 0 ]; then
  echo "Found syntax errors in $ERRORS shell script(s)"
  exit 1
fi
echo "All shell scripts passed syntax check"
ERRORS=0
while IFS= read -r f; do
  [ -z "$f" ] && continue
  [ -f "$f" ] || continue
  echo " Checking: $f"
  if ! bash -n "$f" 2>&1; then
    ERRORS=$((ERRORS + 1))
  fi
done <<< "${{ steps.changed.outputs.sh_files }}"
if [ "$ERRORS" -gt 0 ]; then
  echo "Found syntax errors in $ERRORS shell script(s)"
  exit 1
fi
echo "All shell scripts passed syntax check"

- name: Lint Python Files
  if: steps.changed.outputs.py_count != '0'
  run: |
    pip install flake8

Unpinned flake8 version

pip install flake8 installs whatever version is current. A future major release could change rules or output format and break the workflow. I'd suggest pinning to a specific version.

Suggested change
pip install flake8
pip install flake8==7.1.1


@KeitaW KeitaW left a comment


Overall LGTM, a few nits.

@paragao paragao changed the title from "fix: overhaul CI workflows -- fix YAML errors, improve robustness, clean up linting" to "fix: overhaul CI workflows for FSDP regression tests" on Mar 19, 2026

@KeitaW KeitaW left a comment


Re-review — CI Failures That Need Fixing

All 5 CI checks are currently failing. Here's a breakdown of what needs fixing before merge.

Lint check: unused imports at lines 21-22 of train_utils.py

The top-level imports from transformers import LlamaForCausalLM, LlamaTokenizer, LlamaConfig (line 21) and from transformers.models.llama.modeling_llama import LlamaDecoderLayer (line 22) are unused — they're re-imported conditionally later at lines 141 and 282. I'd suggest removing the top-level imports to fix the F401 and F811 flake8 findings (4 of the 8 errors).

Container tests: invalid --container-mounts format (leading comma)

Both run-tests (p5, llama3_1_8b) and run-tests (p5, llama3_1_70b) fail with:

srun: error: pyxis: --container-mounts: invalid format:
srun: error: Invalid --container-mounts argument: ,/fsx/.../checkpoints:/checkpoints

The leading comma indicates an empty variable is being concatenated before the mount path. I'd suggest checking the sbatch template where --container-mounts is constructed — there's likely a pattern like ${OPTIONAL_MOUNT},/fsx/...:/checkpoints where OPTIONAL_MOUNT is empty.
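
One defensive way to build the mount list, sketched here with illustrative names (not the PR's template code):

```shell
# Join only the non-empty mount specs with commas, so an unset optional
# mount never produces a leading comma in --container-mounts.
join_mounts() {
  local out="" m
  for m in "$@"; do
    [ -z "$m" ] && continue
    if [ -z "$out" ]; then out="$m"; else out="$out,$m"; fi
  done
  printf '%s' "$out"
}
# e.g.: srun --container-mounts "$(join_mounts "$OPTIONAL_MOUNT" "/fsx/ckpt:/checkpoints")"
```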

Venv tests: NCCL error on p5 cluster

Both venv regression jobs fail with:

torch.distributed.DistBackendError: NCCL error in: NCCLUtils.cpp:77, invalid usage, NCCL version 2.29.2

The LD_PRELOAD=/lib/x86_64-linux-gnu/libnccl.so forces system NCCL (2.29.2), but the error suggests a version or configuration mismatch with the PyTorch version in the venv. I'd suggest checking the .err log artifacts for NCCL debug details, and verifying system NCCL compatibility with the venv's PyTorch.

Previous finding still open: subshell variable scope bug in pr-lint.yml

The echo | while pipe + ERRORS counter bug from my first review (lines 85-101 of pr-lint.yml) is still present. The lint check passed this time only because flake8 caught errors first — if there were only shell scripts with syntax errors, the check would silently pass.


Things That Look Great

  • The lint workflow caught real issues: pr-lint.yml immediately proved its value by catching genuine unused imports and dead code in train_utils.py.
  • Error log dump in monitor step: The ::group::Job error output (last 200 lines) pattern made it trivial to diagnose the NCCL and pyxis failures without downloading artifacts.
  • Manifest-based data loading: The HF_DATA_FILES_MANIFEST approach in train_utils.py is a clean solution to the HF rate-limiting problem.
  • Comprehensive scope: All changes coherently serve the goal of making FSDP regression tests actually pass on the p5 cluster.
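
The error-log dump pattern praised above might look roughly like this; the helper name and exact messages are assumptions:

```shell
# Dump the tail of a Slurm .err file inside a collapsible GitHub Actions
# log group before failing the monitor step.
dump_err_log() {  # $1 = path to the job's .err file
  echo "::group::Job error output (last 200 lines)"
  tail -n 200 "$1" 2>/dev/null || echo "no error log found at $1"
  echo "::endgroup::"
}
```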


def validation(model, rank, world_size, val_loader):
    model.eval()
    correct = 0

Flake8 F841: unused variable correct

correct = 0 is assigned but never used in the validation() function. This is one of the 8 flake8 errors failing the lint check.

Suggested change
correct = 0

check_fn_gpt = lambda submodule: isinstance(
    submodule, transformer_layer
)
check_fn_gpt = lambda submodule: isinstance(submodule, transformer_layer)

Flake8 E731: lambda assigned to variable

Flake8 flags check_fn_gpt = lambda ... — the convention is to use def instead of assigning a lambda. This is one of the 8 flake8 errors failing the lint check.

Suggested change
check_fn_gpt = lambda submodule: isinstance(submodule, transformer_layer)
def check_fn_gpt(submodule):
    return isinstance(submodule, transformer_layer)

