This script runs training iterations with DeepSpeed ZeRO and compares the loss values, gradients, and updated parameters against those produced by plain PyTorch.
Usage: deepspeed [DEEPSPEED_OPTIONS] compare_loss.py [OPTIONS]
Options
$ python compare_loss.py -h
[2024-02-22 08:38:22,353] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
usage: compare_loss.py [-h] [--local_rank LOCAL_RANK] [--dtype {torch.bfloat16,torch.float16,torch.float32}]
[--zero_stage {0,1,2,3}] [--offload_device {none,cpu,nvme}] [--use_torch_adam] [--rtol RTOL]
[--atol ATOL] [--compile] [--deepcompile] [--verbose_logging]
DeepSpeed ZeRO correctness test.
options:
-h, --help show this help message and exit
--local_rank LOCAL_RANK
Local rank
--dtype {torch.bfloat16,torch.float16,torch.float32}
Data type
--zero_stage {0,1,2,3}
ZeRO stage
--offload_device {none,cpu,nvme}
Offload device
--use_torch_adam Use torch adam optimizer
--rtol RTOL Relative tolerance
--atol ATOL Absolute tolerance
--compile Enable torch.compile() on the model
--deepcompile Enable deepcompile optimization
--verbose_logging Enable verbose debugging for recompilations
When the number of processes is 2, NCCL produces a deterministic result.
When --use_torch_adam is given, the script uses the PyTorch Adam optimizer instead of the DeepSpeed optimizer; the two optimizers produce slightly different parameter updates.
Given --use_torch_adam and --dtype torch.float32, this script is expected to show that the results from DeepSpeed and PyTorch are exactly the same (no tolerance) for all ZeRO stages:
deepspeed --num_gpus=2 compare_loss.py --use_torch_adam --dtype torch.float32 --zero_stage 1
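For reference, the check implied by this mode is a bit-for-bit equality test. The helper below is only an illustrative sketch, not the script's actual code, and it assumes the DeepSpeed parameters have already been gathered to their full shape (for example with deepspeed.utils.safe_get_full_fp32_param under ZeRO):

```python
import torch

def compare_exact(ds_named_params, ref_named_params):
    """Illustrative exact-match check; both arguments are iterables of
    (name, tensor) pairs taken after the same number of parameter updates."""
    for (name, p_ds), (_, p_ref) in zip(ds_named_params, ref_named_params):
        # torch.equal demands identical shape, dtype, and values -- no tolerance.
        assert torch.equal(p_ds.detach().cpu(), p_ref.detach().cpu()), \
            f"parameter {name} differs"
```

The same idea applies to the loss values and the gradients that the script compares.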
When dtype is not torch.float32, the script compares DeepSpeed's results with the results of PyTorch's AMP. This path is not fully tested and may not work as expected. I observed fairly large differences for torch.float16 and torch.bfloat16 with --use_torch_adam. You can roughly check whether the results match by setting --rtol and --atol to suitable values:
deepspeed --num_gpus=2 compare_loss.py --dtype torch.float16 --rtol 0.05 --atol 0.2
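The --rtol/--atol check is essentially an elementwise torch.allclose comparison. The helper below is a hedged sketch of that idea, using the tolerances from the example command above; it is not the script's actual implementation:

```python
import torch

def compare_approx(ds_val, ref_val, name, rtol=0.05, atol=0.2):
    # Passes when |ds - ref| <= atol + rtol * |ref| holds elementwise.
    # The loose defaults mirror the fp16 example above: fp16/bf16 accumulate
    # rounding error differently than the fp32 baseline.
    a, b = ds_val.float().cpu(), ref_val.float().cpu()
    if not torch.allclose(a, b, rtol=rtol, atol=atol):
        raise AssertionError(
            f"{name}: max abs diff {(a - b).abs().max().item():.4e} exceeds tolerance")
```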
Note: For debugging torch.compile recompilation issues, see the Debugging Options section below.
The script now supports PyTorch compilation features:
Enable standard PyTorch compilation for both baseline and target models:
deepspeed --num_gpus=2 compare_loss.py --compile --use_torch_adam --dtype torch.float32 --zero_stage 1
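Conceptually, --compile applies torch.compile() on both sides of the comparison. A minimal, self-contained sketch with a toy model (the model, data, and backend choice here are illustrative assumptions, not what compare_loss.py itself uses):

```python
import torch
import torch.nn as nn

# Toy stand-in for the test model; compare_loss.py uses its own model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
compiled_model = torch.compile(model)   # default TorchInductor backend

x = torch.randn(4, 16)
loss = compiled_model(x).sum()
loss.backward()                         # gradients flow through the compiled graph
```

The DeepSpeed-wrapped model is compiled analogously so that both sides run through compiled graphs.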
Enable DeepSpeed's DeepCompile optimization, which provides compiler-level optimizations for distributed training:
deepspeed --num_gpus=2 compare_loss.py --compile --use_torch_adam --deepcompile --zero_stage 3 --dtype torch.bfloat16
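DeepCompile is enabled through the DeepSpeed config rather than a model-side call. The fragment below is a hedged sketch of what that config might look like for the command above; the "compile" / "deepcompile" key names follow the DeepCompile documentation as I understand it, so verify them against your DeepSpeed version:

```python
# Illustrative DeepSpeed config fragment for the ZeRO-3 + bf16 example above.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
    "compile": {"deepcompile": True},   # turn on DeepCompile's optimization passes
}
```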
Use the --verbose_logging flag to enable detailed debugging information when using torch.compile(). This is particularly useful for analyzing the compiler's behavior and guard failures:
deepspeed --num_gpus=1 compare_loss.py --compile --verbose_logging --zero_stage 1
This option enables:
- Comprehensive logging of compiler events
- Guard failure analysis and debugging
- Environment variable setup for maximum torch.compile debugging
- Detailed output of compilation internals
Note: Verbose logging produces significant output and should be used primarily for debugging purposes.
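As a rough idea of what such debugging involves, the snippet below shows standard torch.compile logging knobs; the exact settings --verbose_logging applies may differ:

```python
import torch._dynamo
import torch._logging

# Environment-variable route (set before launching):
#   TORCH_LOGS="recompiles,guards,graph_breaks" deepspeed --num_gpus=1 compare_loss.py ...

# Programmatic route:
torch._logging.set_logs(
    recompiles=True,     # log every recompilation and the guard that triggered it
    guards=True,         # dump the guards installed for each compiled frame
    graph_breaks=True,   # report where Dynamo falls back to eager execution
)
torch._dynamo.config.verbose = True   # extra detail on guard failures
```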