
Add V-JEPA 2 (Meta FAIR) distributed training test case#1035

Open

paragao wants to merge 3 commits into main from feat/vjepa2-distributed-training

Conversation

paragao (Contributor) commented Mar 23, 2026

Summary

  • Add V-JEPA 2 (Meta FAIR) ViT-g/16 1B-parameter self-supervised video model as a new PyTorch distributed training test case
  • Includes Slurm (Pyxis/Enroot) and Kubernetes (PyTorchJob) deployment manifests
  • Benchmarked on 8x p5en.48xlarge (64x NVIDIA H200 GPUs)

What is V-JEPA 2?

V-JEPA 2 is Meta FAIR's self-supervised video model that learns visual representations by predicting masked video patches. It achieves state-of-the-art results on motion understanding and human action anticipation benchmarks. The ViT-g/16 variant has 1.03B encoder parameters.

Files Added

3.test_cases/pytorch/vjepa2/
├── vjepa2.Dockerfile                      # NVIDIA PyTorch 25.03 base (CUDA 13, Python 3.11)
├── README.md                              # Full walkthrough with benchmark results
├── slurm/
│   ├── benchmark_training.sbatch          # 200-iter benchmark (8 nodes)
│   ├── launch_training.sbatch             # Full 800-epoch pre-training
│   └── download_dataset.sbatch            # SSv2 dataset preparation
├── kubernetes/
│   └── vjepa2-benchmark.yaml              # PyTorchJob for EKS clusters
├── configs/
│   ├── benchmark-vitg-8nodes.yaml         # Quick benchmark config
│   └── pretrain-vitg-256px-16f.yaml       # Full pre-training config
└── scripts/
    ├── run_train.py                       # Thin srun-compatible launcher
    ├── generate_synthetic_dataset.py      # Synthetic video generator
    ├── prepare_ssv2.py                    # SSv2 CSV preparation
    ├── parse_benchmark.py                 # Log parser for throughput/MFU
    └── test_decord.py                     # Verify decord video loading
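
As a rough illustration of what a throughput parser like parse_benchmark.py might do, here is a minimal sketch; the log line format, regex, and function names below are assumptions for illustration, not the script's actual implementation:

```python
import re

# Hypothetical log line, e.g. "iter 42: 1234.5 samples/s"; the real
# training logs may use a different format entirely.
ITER_RE = re.compile(r"iter\s+(\d+).*?(\d+(?:\.\d+)?)\s*samples/s")

def mean_throughput(lines, warmup=10):
    """Average samples/s over iterations past the warmup window."""
    vals = [float(m.group(2))
            for line in lines
            if (m := ITER_RE.search(line)) and int(m.group(1)) >= warmup]
    return sum(vals) / len(vals) if vals else 0.0
```

Skipping a warmup window before averaging matters for benchmarks like this one, since the first iterations include CUDA graph capture, allocator growth, and NCCL communicator setup.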

Key Technical Details

Launch pattern: V-JEPA 2 uses srun directly (not srun + torchrun). The run_train.py launcher calls app.vjepa.train.main() directly, which reads SLURM_LOCALID/SLURM_NTASKS/SLURM_PROCID for distributed setup. This avoids a bug in app/main.py where its subprocess launcher passes world_size=1 regardless of SLURM configuration.
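
The env-var mapping this relies on can be sketched as follows; the helper name and exact shape are illustrative assumptions, not the actual run_train.py code:

```python
import os

def slurm_dist_env(env=None):
    """Derive distributed-training settings from Slurm-provided variables.

    Hypothetical helper illustrating the pattern described above; the
    actual launcher may consume these variables differently.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env["SLURM_PROCID"]),        # global rank of this task
        "world_size": int(env["SLURM_NTASKS"]),  # total tasks across all nodes
        "local_rank": int(env["SLURM_LOCALID"]), # task index on this node
    }

# With this mapping, each srun-launched process can call
# torch.distributed.init_process_group(...) itself, so no torchrun
# wrapper (and no hard-coded world_size=1 subprocess launcher) is needed.
```

Because srun already starts one process per GPU, each process only needs to read its own identity from the environment; this is exactly why layering torchrun on top is unnecessary here.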

Dataset: Supports both Something-Something v2 (SSv2) real data and synthetic generated videos for benchmarking.

Benchmark Results (8x p5en.48xlarge, 64x H200)

| Metric | Value |
| --- | --- |
| Global batch size | 1,536 |
| Precision | BF16 |
| Peak GPU memory | ~32.9 GB / 143 GB |

Testing

Validated on ParallelCluster with 8x p5en.48xlarge nodes running Slurm + Pyxis/Enroot with EFA networking. Job ran 200 iterations to completion with all 64 ranks correctly initialized via NCCL over EFA.

paragao added 2 commits March 23, 2026 12:24
Add V-JEPA 2 (Meta FAIR) ViT-g/16 1B-param self-supervised video model as
a new PyTorch test case with Slurm and Kubernetes support.

Includes:
- Dockerfile based on nvcr.io/nvidia/pytorch:25.03-py3 (CUDA 13 + Python 3.11)
- Slurm sbatch scripts for benchmark (200 iters) and full pre-training (800 epochs)
- Kubernetes PyTorchJob manifest for EKS clusters
- Thin srun-compatible launcher (run_train.py) that calls app.vjepa.train.main()
  directly, avoiding the subprocess world_size=1 bug in app/main.py
- Synthetic dataset generator for benchmarking without SSv2 download
- SSv2 dataset preparation scripts and decord verification
- YAML configs for ViT-g/16 with DDP, BF16, and activation checkpointing

Add V-JEPA 2.1 (Meta FAIR) ViT-g/16 1B-param benchmark alongside the existing
V-JEPA 2 test case. V-JEPA 2.1 introduces Dense Predictive Loss, Deep
Self-Supervision (4 intermediate layers), doubled predictor depth (24 vs 12),
and image+video co-training with 50/50 rank split.

Includes:
- Dockerfile and Enroot container setup (shared base with V-JEPA 2)
- Slurm sbatch scripts with /workspace code overlay for latest vjepa2 repo
- Kubernetes PyTorchJob manifest for EKS clusters
- Synthetic image generator for co-training benchmarks
- run_train.py launcher using app.scaffold.main() for dynamic dispatch
- YAML configs with img_data, img_mask, and rank_ratio settings

Key discovery: the container must have the latest vjepa2 repo code (post
March 2026) for app/vjepa_2_1/ to be available. The sbatch scripts mount
updated code at /workspace to overlay the container's stale PYTHONPATH.
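
For illustration, the co-training knobs named above might surface in a config roughly like this; the key names are taken from the commit message, but the nesting and values are assumptions, not the shipped configs:

```yaml
app: vjepa_2_1              # selects app/vjepa_2_1 via the scaffold dispatcher
data:
  img_data: /fsx/data/synthetic_images   # image branch for co-training (path is illustrative)
  rank_ratio: 0.5                        # 50/50 image/video rank split
mask:
  img_mask: true                         # image-specific masking settings
```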
paragao force-pushed the feat/vjepa2-distributed-training branch from 11b8971 to 92abb8c on March 23, 2026 12:27
KeitaW (Collaborator) left a comment

Review Batch 1/3 — Structure & Repository Hygiene

Thanks for this thorough contribution, Paulo! The utility scripts and READMEs are excellent quality. I have some structural and reproducibility findings below.

Significant code duplication between vjepa2/ and vjepa2.1/

These two directories share a large amount of identical code:

  • scripts/generate_synthetic_dataset.py — identical (same git blob 74f922445)
  • scripts/parse_benchmark.py — identical (same git blob 957b9efdf)
  • scripts/prepare_ssv2.py — identical (same git blob 633288d17)
  • scripts/test_decord.py — identical (same git blob 4881d1647)
  • scripts/run_train.py — nearly identical (V-JEPA 2.1 adds 4 lines)
  • Dockerfiles — nearly identical structure
  • Slurm sbatch scripts — same structure, differing only in paths/config references

The repo convention says to "extend the existing test case — add platform-specific subdirectories, parameterize scripts for additional models, or add configuration variants — rather than creating a parallel directory tree with duplicated Dockerfiles, training scripts, and utilities."

I'd suggest consolidating into a single vjepa2/ directory that supports both V-JEPA 2 and 2.1 via different configs. The run_train.py launcher already dispatches based on the app field in the config (vjepa vs vjepa_2_1), so both versions can share the same launcher, scripts, Dockerfile, and sbatch templates. The V-JEPA 2.1 additions (image co-training, synthetic image generator) would simply add to the existing directory.
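
A consolidated launcher's dispatch could be sketched like this; the dispatch-table entries mirror the app field values mentioned above (vjepa vs vjepa_2_1), but the module paths and function names are assumptions:

```python
import importlib

# Hypothetical dispatch table; module paths follow the app field values
# described above but may not match the real repository layout.
APP_MODULES = {
    "vjepa": "app.vjepa.train",
    "vjepa_2_1": "app.vjepa_2_1.train",
}

def resolve_app_module(config):
    """Return the dotted module path selected by the config's app field."""
    app = config.get("app")
    if app not in APP_MODULES:
        raise ValueError(f"unknown app: {app!r}")
    return APP_MODULES[app]

def launch(config):
    """Import the selected trainer lazily and hand it the parsed config."""
    module = importlib.import_module(resolve_app_module(config))
    return module.main(config)
```

Lazy import means a cluster whose container only ships V-JEPA 2 code still works for vjepa configs, and only fails (with a clear module-not-found error) when a vjepa_2_1 config is requested.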

Missing license headers on README and config files

Both README.md files and all 4 configs/*.yaml files are missing license headers. The Slurm scripts, Python files, K8s manifests, and Dockerfiles all have them, so this is just an oversight. I'd suggest adding the standard header as a YAML comment in configs and HTML comment in READMEs.

&& rm -rf /var/lib/apt/lists/*

# Install EFA
ARG EFA_INSTALLER_VERSION=latest

Unpinned EFA installer version

latest always pulls the newest EFA installer, making builds non-reproducible. The repo convention requires pinned versions. I'd suggest pinning to the version you tested:

Suggested change
ARG EFA_INSTALLER_VERSION=latest
ARG EFA_INSTALLER_VERSION=1.38.0

(Adjust to whichever version your build actually used.) Same issue in vjepa2_1.Dockerfile.

scikit-image ftfy eva-decord

# Clone V-JEPA 2
RUN git clone https://github.com/facebookresearch/vjepa2.git /vjepa2

Unpinned vjepa2 git clone

Cloning without a tag or commit hash means different builds may get different code. I'd suggest pinning to the commit or tag you tested against:

Suggested change
RUN git clone https://github.com/facebookresearch/vjepa2.git /vjepa2
RUN git clone --depth 1 --branch <TAG_OR_COMMIT> https://github.com/facebookresearch/vjepa2.git /vjepa2

Same issue in vjepa2_1.Dockerfile.

containers:
- name: vjepa2
# Replace with your ECR image URI
image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/vjepa2:latest

Image tag uses :latest

Even though this is a placeholder users will replace, the template should model best practice. I'd suggest using a versioned tag placeholder:

Suggested change
image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/vjepa2:latest
image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/vjepa2:<TAG>

Same in vjepa2.1/kubernetes/vjepa2-1-benchmark.yaml.

KeitaW (Collaborator) left a comment

Review Batch 2/3 — Deployment Pipeline

- name: FI_EFA_SET_CUDA_SYNC_MEMOPS
value: "0"
- name: NCCL_SOCKET_IFNAME
value: "^docker,lo,veth,eth"

NCCL_SOCKET_IFNAME excludes eth — may break socket bootstrap

The pattern ^docker,lo,veth,eth excludes all interfaces starting with eth, including eth0 — the primary ENI on EC2 instances. While NCCL data transfer uses EFA, the initial TCP bootstrap typically needs eth0. The repo convention recommends ^lo for K8s manifests.

See the EFA cheatsheet.

This same pattern appears in all 4 Slurm sbatch scripts. If eth exclusion was intentional for your cluster, a comment explaining why would help users on other setups.

Suggested change
value: "^docker,lo,veth,eth"
value: "^lo"

Comment on lines +26 to +30
RUN pip install --no-cache-dir \
tensorboard wandb iopath pyyaml \
opencv-python submitit braceexpand webdataset timm transformers \
peft decord pandas einops beartype psutil h5py fire python-box \
scikit-image ftfy eva-decord

Pip packages not version-pinned

All ~20 packages are installed without version pins, making builds non-reproducible. I'd suggest adding at least minimum version pins (e.g., timm>=0.9.0,<1.0), or better yet including a requirements.txt with pinned versions from a known-good build.

Same issue in vjepa2_1.Dockerfile.
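
To make the suggestion concrete, a requirements file might be sketched as below; the version pins are placeholders (in the same spirit as the <TAG_OR_COMMIT> placeholder above), not values from a tested build:

```text
# requirements.txt -- placeholder pins, to be filled from a known-good build
timm>=0.9.0,<1.0
decord==<PINNED_VERSION>
webdataset==<PINNED_VERSION>
transformers==<PINNED_VERSION>
```

The Dockerfile would then COPY this file and run `pip install --no-cache-dir -r requirements.txt`, so bumping a dependency becomes a one-line, reviewable change.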

KeitaW (Collaborator) left a comment

Review Batch 3/3 — Documentation Consistency


Things That Look Great

  • Comprehensive utility scripts: The synthetic data generators (video and image), SSv2 CSV preparer, benchmark log parser, and decord test script form a complete toolkit that makes this test case truly self-contained.
  • Excellent README documentation: Both READMEs walk through every step from dataset prep to result parsing, with clear architecture notes explaining the srun direct launch pattern and why app/main.py doesn't work with SLURM.
  • Smart launch pattern: Using app.scaffold.main() to dispatch based on the config's app field is elegant and avoids the world_size=1 bug in app/main.py.
  • Proper license headers on most files: Scripts, Dockerfiles, sbatch files, and K8s manifests all have the standard copyright header.
  • HyperPod auto-resume detection: The if [ -d "/opt/sagemaker_cluster" ] pattern in sbatch scripts correctly detects HyperPod clusters and enables auto-resume.
  • Both Slurm and Kubernetes deployment paths: Providing PyTorchJob manifests alongside Slurm scripts makes this accessible to EKS-based clusters too.
  • Well-structured config separation: Benchmark configs (200 iterations, no checkpointing) vs. full pre-training configs (800+ epochs, regular checkpoints) give users clear starting points for different use cases.
  • V-JEPA 2.1 comparison table: The feature comparison table in the V-JEPA 2.1 README clearly explains what changed between versions.

## 1. Clone this repository

```bash
git clone https://github.com/aws-samples/awsome-distributed-training.git
```

Stale repository URL

The repo was transferred from aws-samples to awslabs. GitHub redirects still work, but the canonical URL should be used.

Suggested change
git clone https://github.com/aws-samples/awsome-distributed-training.git
git clone https://github.com/awslabs/awsome-distributed-training.git

Same in vjepa2.1/README.md.

KeitaW (Collaborator) left a comment

A few comments

The Dockerfile-based container (pytorch:25.03-py3) ships NCCL 2.25 and an older aws-ofi-nccl plugin that are incompatible with B200 EFA networking. The B200 scripts instead use a NeMo container with NCCL 2.29+ and a matching OFI/EFA/libfabric stack, with V-JEPA dependencies installed to shared storage and added to PYTHONPATH at runtime.
