
fix: overhaul CI workflows for FSDP regression tests#1024

Open
paragao wants to merge 1 commit into main from fix/ci-workflows

Conversation


@paragao paragao commented Mar 17, 2026

Summary

Restructure and harden the GitHub Actions CI workflows that run FSDP distributed training regression tests on a remote Slurm cluster via SSH.

Changes

Workflow structural fixes

  • Fix YAML syntax errors in all 4 Slurm workflows (heredoc EOF terminators at column 0 inside YAML block scalars)
  • Remove broken EKS regression workflow (fsdp-eks-regression.yml)
  • Replace heavyweight 814-line PR review workflow with lightweight path-aware linter (pr-lint.yml)
  • Update actions/stale from v5 to v9
  • Resolve all actionlint and shellcheck findings

Slurm job monitoring improvements

  • Replace broken squeue-based job status detection with sacct exit code checking (squeue returning empty was incorrectly treated as COMPLETED)
  • Add SSH retry wrapper (ssh_cmd) with exponential backoff
  • Add SSH keepalive settings (ServerAliveInterval/ServerAliveCountMax)
  • Add dedicated enroot cleanup job to avoid race conditions
  • Add inline error log dump (last 200 lines) before job failure exit
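
The monitoring changes above can be sketched roughly as follows; the function names (`retry_cmd`, `ssh_cmd`, `classify_state`) and the exact `sacct` flags are illustrative assumptions, not the PR's actual code:

```shell
# Retry an arbitrary command with exponential backoff (1s, 2s, 4s, ...).
retry_cmd() {
  local attempt=1 max=3 delay=1
  while true; do
    if "$@"; then return 0; fi
    [ "$attempt" -ge "$max" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# SSH with keepalives so long monitoring loops survive transient drops.
ssh_cmd() {
  retry_cmd ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
    "$SLURM_HOST" "$@"
}

# Map a sacct State value to a CI outcome; unlike the old squeue check,
# "job no longer listed" is never interpreted as success.
classify_state() {
  case "$1" in
    COMPLETED)                               echo ok ;;
    FAILED|OUT_OF_MEMORY|TIMEOUT|CANCELLED*) echo failed ;;
    *)                                       echo running ;;
  esac
}
# e.g.: classify_state "$(ssh_cmd sacct -j "$JOB_ID" -X --noheader -o State | tr -d ' ')"
```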

Runtime fixes

  • Skip sudo apt install in create_venv.sh when sudo is unavailable
  • Fix shell variable escaping for sbatch filename in heredoc
  • Insert venv activation after last #SBATCH directive (not line 2)
  • Add /opt/slurm/bin to PATH for non-login SSH sessions
  • Fix LD_PRELOAD path to system NCCL (/lib/x86_64-linux-gnu/libnccl.so)
  • Pass HF_TOKEN to Slurm jobs via sed injection
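
The "insert after the last #SBATCH directive" fix can be illustrated with a small awk sketch; the helper name and approach are assumptions about the intent, not the PR's implementation:

```shell
# Hypothetical helper: print a sbatch script with an extra line inserted
# right after the final #SBATCH directive, wherever it happens to be.
insert_after_sbatch() {  # $1 = sbatch file, $2 = line to insert
  awk -v ins="$2" '
    { lines[NR] = $0; if ($0 ~ /^#SBATCH/) last = NR }
    END {
      for (i = 1; i <= NR; i++) {
        print lines[i]
        if (i == last) print ins
      }
    }' "$1"
}
```

Inserting at a fixed line 2 breaks whenever a script carries more than one #SBATCH directive, since sbatch stops parsing directives at the first non-comment line; scanning for the last directive avoids that.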

HuggingFace data loading (fixes HTTP 429 rate limiting)

  • Pre-download C4 dataset shards to local FSx storage on the cluster
  • Generate a JSON manifest of local file paths per split
  • Modify train_utils.py to use load_dataset("json", data_files=...) when HF_DATA_FILES_MANIFEST env var points to a valid manifest
  • Set HF_HOME and HF_DATASETS_CACHE for shared FSx caching
  • Set HF_HUB_OFFLINE=1 in sbatch files to block runtime API calls
  • Pre-cache tokenizers in the workflow pre-download step
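
The per-split manifest generation might look something like this sketch; the directory layout, shard suffix, and helper name are all assumptions:

```shell
# Hypothetical sketch: write a JSON manifest listing pre-downloaded C4
# shards per split, for consumption by manifest-aware data loading.
build_manifest() {  # $1 = dataset root on FSx, $2 = output manifest path
  {
    echo '{'
    echo '  "train": ['
    find "$1/train" -name '*.json.gz' | sort | sed 's/.*/    "&",/' | sed '$ s/,$//'
    echo '  ],'
    echo '  "validation": ['
    find "$1/validation" -name '*.json.gz' | sort | sed 's/.*/    "&",/' | sed '$ s/,$//'
    echo '  ]'
    echo '}'
  } > "$2"
}
```

train_utils.py can then feed the manifest's per-split lists to `load_dataset("json", data_files=...)` whenever `HF_DATA_FILES_MANIFEST` points at this file.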

Test matrix

  • Align both venv and container workflows: cluster: [p5], model_config: [llama3_1_8b, llama3_1_70b]
  • Set NCCL_DEBUG=WARN (reduced from INFO) in all sbatch files

Files changed (19)

Category Files
Workflows fsdp-regression-test-venv.yml, fsdp-regression-test-container.yml, megatron-ci-slurm.yaml, closing-soon.yml, pr-lint.yml (new)
Workflows removed fsdp-eks-regression.yml, pr-review-and-slurm-test.yml
Sbatch files (9) All 3.test_cases/pytorch/FSDP/slurm/*-training.sbatch — LD_PRELOAD, HF caching, NCCL_DEBUG
Scripts create_venv.sh — sudo fallback
Training code train_utils.py — manifest-based local data loading
Other .gitignore — added log.failed

Testing

Validated through 15+ CI workflow runs on the p5 cluster, iteratively fixing issues from OIDC configuration through data loading.


@KeitaW KeitaW left a comment


Review — Deployment Pipeline & Positives

Nice cleanup across the board — the sacct fix, SSH resilience, and heredoc corrections are all substantive improvements.

Security scanning (bandit) dropped without replacement

The old pr-review-and-slurm-test.yml ran bandit security scanning on all Python files. The new pr-lint.yml replaces it with only flake8 and bash -n. I understand the intent is a lightweight replacement (and the old workflow had YAML issues that made it non-functional), but this is a net reduction in security coverage.

I'd suggest either adding a bandit step to pr-lint.yml (scoped to changed files only, to keep it fast), or tracking the re-addition as a follow-up issue so it doesn't fall through the cracks.
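
A changed-files-only bandit step could be sketched like this; the helper name is hypothetical, and the bandit invocation at the end is only suggested usage, not the PR's code:

```shell
# Hypothetical helper: read a newline-separated file list on stdin, keep
# only existing *.py files, and hand them to the given scanner command.
scan_py_files() {
  local files=() f
  while IFS= read -r f; do
    case "$f" in
      *.py) [ -f "$f" ] && files+=("$f") ;;
    esac
  done
  [ "${#files[@]}" -eq 0 ] && return 0   # nothing to scan, pass the step
  "$@" "${files[@]}"
}
# Intended CI usage (bandit flags are an assumption):
#   git diff --name-only --diff-filter=ACMR "$BASE_SHA"... \
#     | scan_py_files bandit -ll
```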


Things That Look Great

  • sacct for proper job status detection: The old pattern (squeue empty → "COMPLETED") silently treated FAILED/OOM_KILLED jobs as successes. Using sacct to check the actual exit code is the correct fix and will catch real failures.
  • SSH retry wrapper with keepalive: The ssh_cmd() function with ServerAliveInterval=60 and 3-retry logic is a smart pattern for the 6-hour monitoring loops where SSH connections would previously drop silently.
  • Dedicated cleanup-enroot job: Moving enroot image cleanup to a separate job that needs: [build, run-tests] with if: always() eliminates the race condition where matrix entries could try to delete images still in use by sibling jobs.
  • Heredoc indentation fixes: The EOF terminator placement was genuinely broken (unindented EOF inside indented YAML blocks). Fixing this across all 4 workflow files makes them actually parseable.
  • ShellCheck compliance: Quoting $GITHUB_OUTPUT / $GITHUB_ENV redirects (SC2086) and grouping consecutive redirects into { } >> blocks (SC2129) are good hygiene that prevents subtle word-splitting bugs in CI.
  • Scoped PR linting: Only linting files changed in the PR (via git diff --name-only --diff-filter=ACMR) is much faster and avoids noise from pre-existing issues in the codebase.

Comment on lines +85 to +101
ERRORS=0

echo "${{ steps.changed.outputs.sh_files }}" | while IFS= read -r f; do
  [ -z "$f" ] && continue
  [ -f "$f" ] || continue
  echo " Checking: $f"
  if ! bash -n "$f" 2>&1; then
    ERRORS=$((ERRORS + 1))
  fi
done

if [ "$ERRORS" -gt 0 ]; then
  echo "Found syntax errors in $ERRORS shell script(s)"
  exit 1
fi

echo "All shell scripts passed syntax check"

Shell syntax check never fails due to subshell variable scope

The ERRORS counter is incremented inside a while loop that reads from a pipe (echo | while). In bash, the right side of a pipe runs in a subshell, so ERRORS is always 0 in the parent shell after the loop exits. The if [ "$ERRORS" -gt 0 ] check will never trigger, even when bash -n reports syntax errors.

I'd suggest using a here-string to keep the loop in the parent shell:

Suggested change
ERRORS=0
echo "${{ steps.changed.outputs.sh_files }}" | while IFS= read -r f; do
  [ -z "$f" ] && continue
  [ -f "$f" ] || continue
  echo " Checking: $f"
  if ! bash -n "$f" 2>&1; then
    ERRORS=$((ERRORS + 1))
  fi
done
if [ "$ERRORS" -gt 0 ]; then
  echo "Found syntax errors in $ERRORS shell script(s)"
  exit 1
fi
echo "All shell scripts passed syntax check"
ERRORS=0
while IFS= read -r f; do
  [ -z "$f" ] && continue
  [ -f "$f" ] || continue
  echo " Checking: $f"
  if ! bash -n "$f" 2>&1; then
    ERRORS=$((ERRORS + 1))
  fi
done <<< "${{ steps.changed.outputs.sh_files }}"
if [ "$ERRORS" -gt 0 ]; then
  echo "Found syntax errors in $ERRORS shell script(s)"
  exit 1
fi
echo "All shell scripts passed syntax check"

- name: Lint Python Files
  if: steps.changed.outputs.py_count != '0'
  run: |
    pip install flake8

Unpinned flake8 version

pip install flake8 installs whatever version is current. A future major release could change rules or output format and break the workflow. I'd suggest pinning to a specific version.

Suggested change
pip install flake8
pip install flake8==7.1.1


@KeitaW KeitaW left a comment


Overall LGTM, a few nits.

@paragao paragao changed the title from "fix: overhaul CI workflows -- fix YAML errors, improve robustness, clean up linting" to "fix: overhaul CI workflows for FSDP regression tests" on Mar 19, 2026

@KeitaW KeitaW left a comment


Re-review — CI Failures That Need Fixing

All 5 CI checks are currently failing. Here's a breakdown of what needs fixing before merge.

Lint check: unused imports at lines 21-22 of train_utils.py

The top-level imports from transformers import LlamaForCausalLM, LlamaTokenizer, LlamaConfig (line 21) and from transformers.models.llama.modeling_llama import LlamaDecoderLayer (line 22) are unused — they're re-imported conditionally later at lines 141 and 282. I'd suggest removing the top-level imports to fix the F401 and F811 flake8 findings (4 of the 8 errors).

Container tests: invalid --container-mounts format (leading comma)

Both run-tests (p5, llama3_1_8b) and run-tests (p5, llama3_1_70b) fail with:

srun: error: pyxis: --container-mounts: invalid format:
srun: error: Invalid --container-mounts argument: ,/fsx/.../checkpoints:/checkpoints

The leading comma indicates an empty variable is being concatenated before the mount path. I'd suggest checking the sbatch template where --container-mounts is constructed — there's likely a pattern like ${OPTIONAL_MOUNT},/fsx/...:/checkpoints where OPTIONAL_MOUNT is empty.
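
One defensive way to build the mount list, sketched here with illustrative names (not the PR's template code):

```shell
# Join only the non-empty mount specs with commas, so an unset optional
# mount never produces a leading comma in --container-mounts.
join_mounts() {
  local out="" m
  for m in "$@"; do
    [ -z "$m" ] && continue
    if [ -z "$out" ]; then out="$m"; else out="$out,$m"; fi
  done
  printf '%s' "$out"
}
# e.g.: srun --container-mounts "$(join_mounts "$OPTIONAL_MOUNT" "/fsx/ckpt:/checkpoints")"
```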

Venv tests: NCCL error on p5 cluster

Both venv regression jobs fail with:

torch.distributed.DistBackendError: NCCL error in: NCCLUtils.cpp:77, invalid usage, NCCL version 2.29.2

The LD_PRELOAD=/lib/x86_64-linux-gnu/libnccl.so forces system NCCL (2.29.2), but the error suggests a version or configuration mismatch with the PyTorch version in the venv. I'd suggest checking the .err log artifacts for NCCL debug details, and verifying system NCCL compatibility with the venv's PyTorch.

Previous finding still open: subshell variable scope bug in pr-lint.yml

The echo | while pipe + ERRORS counter bug from my first review (lines 85-101 of pr-lint.yml) is still present. The lint check passed this time only because flake8 caught errors first — if there were only shell scripts with syntax errors, the check would silently pass.


Things That Look Great

  • The lint workflow caught real issues: pr-lint.yml immediately proved its value by catching genuine unused imports and dead code in train_utils.py.
  • Error log dump in monitor step: The ::group::Job error output (last 200 lines) pattern made it trivial to diagnose the NCCL and pyxis failures without downloading artifacts.
  • Manifest-based data loading: The HF_DATA_FILES_MANIFEST approach in train_utils.py is a clean solution to the HF rate-limiting problem.
  • Comprehensive scope: All changes coherently serve the goal of making FSDP regression tests actually pass on the p5 cluster.
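
The error-log dump pattern praised above might look roughly like this; the helper name and exact messages are assumptions:

```shell
# Dump the tail of a Slurm .err file inside a collapsible GitHub Actions
# log group before failing the monitor step.
dump_err_log() {  # $1 = path to the job's .err file
  echo "::group::Job error output (last 200 lines)"
  tail -n 200 "$1" 2>/dev/null || echo "no error log found at $1"
  echo "::endgroup::"
}
```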


def validation(model, rank, world_size, val_loader):
    model.eval()
    correct = 0

Flake8 F841: unused variable correct

correct = 0 is assigned but never used in the validation() function. This is one of the 8 flake8 errors failing the lint check.

Suggested change
correct = 0

check_fn_gpt = lambda submodule: isinstance(
    submodule, transformer_layer
)
check_fn_gpt = lambda submodule: isinstance(submodule, transformer_layer)

Flake8 E731: lambda assigned to variable

Flake8 flags check_fn_gpt = lambda ... — the convention is to use def instead of assigning a lambda. This is one of the 8 flake8 errors failing the lint check.

Suggested change
check_fn_gpt = lambda submodule: isinstance(submodule, transformer_layer)
def check_fn_gpt(submodule):
    return isinstance(submodule, transformer_layer)

