Pass quantization_kwargs to CLIP builders #1994
Conversation
Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1994
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 3cdfd92 with merge base 4df97ad.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -193,6 +193,9 @@ def lora_llama3_2_vision_11b(
        lora_dropout=lora_dropout,
        use_dora=use_dora,
        quantize_base=quantize_base,
        # Update scaler block size to ensure that weights can be quantized evenly across 1, 2, 4, 6, 8 GPUs.
        # This is dependent on ``clip_embed_dim`` so if that is updated, this variable should be as well
        scaler_block_size=200 if quantize_base else None,
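To make the intent of the new kwargs concrete, here is a minimal, self-contained sketch of the pass-through pattern this PR adds. All names below (`clip_vision_builder`, `lora_vision_model_builder`) are hypothetical stand-ins, not the real torchtune signatures; the point is only that builder-level quantization kwargs such as `scaler_block_size` are forwarded to the CLIP builder rather than dropped.

```python
from typing import Any, Optional

import torch.nn as nn


def clip_vision_builder(embed_dim: int, quantize_base: bool = False,
                        **quantization_kwargs: Any) -> nn.Module:
    # Hypothetical stand-in for the CLIP builder; in torchtune the kwargs would
    # reach the NF4-quantized LoRA linears. Here we just record them.
    model = nn.Linear(embed_dim, embed_dim)
    model.quantize_base = quantize_base
    model.quantization_kwargs = quantization_kwargs
    return model


def lora_vision_model_builder(embed_dim: int = 1280,  # illustrative default
                              quantize_base: bool = False,
                              scaler_block_size: Optional[int] = None) -> nn.Module:
    # The fix is simply to thread the quantization kwargs through to the CLIP
    # builder instead of stopping at this level.
    return clip_vision_builder(
        embed_dim,
        quantize_base=quantize_base,
        scaler_block_size=scaler_block_size if quantize_base else None,
    )


encoder = lora_vision_model_builder(quantize_base=True, scaler_block_size=200)
print(encoder.quantization_kwargs)  # {'scaler_block_size': 200}
```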
So no negative perf impact to using this value on < 8 GPUs?
With the updated scaler block size (200):
(joe-torchtune) [[email protected] ~/projects/joe-torchtune (update-clip-with-quant-nums)]$ tune run --nproc-per-node 4 lora_finetune_distributed --config llama3_2_vision/90B_qlora max_steps_per_epoch=5 lr_scheduler.num_warmup_steps=0
Running with torchrun...
W1112 18:35:18.602000 4067602 site-packages/torch/distributed/run.py:793]
W1112 18:35:18.602000 4067602 site-packages/torch/distributed/run.py:793] *****************************************
W1112 18:35:18.602000 4067602 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1112 18:35:18.602000 4067602 site-packages/torch/distributed/run.py:793] *****************************************
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
lr_scheduler:
_component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
num_warmup_steps: 0
max_steps_per_epoch: 5
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: /tmp/Llama-3.2-90B-Vision-Instruct/logs
model:
_component_: torchtune.models.llama3_2_vision.qlora_llama3_2_vision_90b
apply_lora_to_mlp: true
apply_lora_to_output: false
decoder_trainable: frozen
encoder_trainable: lora
fusion_trainable: lora
image_size: 560
lora_alpha: 16
lora_attn_modules:
- q_proj
- v_proj
- output_proj
lora_dropout: 0.0
lora_rank: 8
optimizer:
_component_: torch.optim.AdamW
fused: true
lr: 0.0001
weight_decay: 0.01
output_dir: /tmp/qlora-llama3.2-vision-finetune
profiler:
_component_: torchtune.training.setup_torch_profiler
active_steps: 2
cpu: true
cuda: true
enabled: false
num_cycles: 1
output_dir: /tmp/qlora-llama3.2-vision-finetune/profiling_outputs
profile_memory: false
record_shapes: true
wait_steps: 5
warmup_steps: 3
with_flops: false
with_stack: false
resume_from_checkpoint: false
save_adapter_weights_only: false
seed: null
shuffle: true
tokenizer:
_component_: torchtune.models.llama3_2_vision.llama3_2_vision_transform
image_size: 560
max_seq_len: 8192
path: /tmp/Llama-3.2-90B-Vision-Instruct/original/tokenizer.model
NCCL version 2.21.5+cuda12.4
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 3681342200. Local seed is seed + rank = 3681342200 + 0
Writing logs to /tmp/Llama-3.2-90B-Vision-Instruct/logs/log_1731465330.txt
INFO:torchtune.utils._logging:FSDP is enabled. Instantiating model and loading checkpoint on Rank 0 ...
INFO:torchtune.utils._logging:Instantiating model and loading checkpoint took 128.29 secs
INFO:torchtune.utils._logging:Memory stats after model init:
GPU peak memory allocation: 13.40 GiB
GPU peak memory reserved: 14.53 GiB
GPU peak memory active: 13.40 GiB
INFO:torchtune.utils._logging:Optimizer is initialized.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:Learning rate scheduler is initialized.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|5|Loss: 0.8903459906578064: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [13:04<00:00, 156.15s/it]
INFO:torchtune.utils._logging:Saving checkpoint. This may take some time. Retrieving full model state dict...
INFO:torchtune.utils._logging:Getting full model state dict took 207.90 secs
INFO:torchtune.utils._logging:Model checkpoint of size 4.60 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0001_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0002_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 5.00 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0003_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.97 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0004_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0005_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0006_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0007_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 5.00 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0008_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.97 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0009_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0010_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0011_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0012_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 5.00 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0013_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.97 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0014_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0015_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0016_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0017_0.pt
With the default scaler block size:
(joe-torchtune) [[email protected] ~/projects/joe-torchtune (update-clip-with-quant-nums)]$ tune run --nproc-per-node 4 lora_finetune_distributed --config llama3_2_vision/90B_qlora max_steps_per_epoch=5 lr_scheduler.num_warmup_steps=0
Running with torchrun...
W1112 19:01:59.371000 237246 site-packages/torch/distributed/run.py:793]
W1112 19:01:59.371000 237246 site-packages/torch/distributed/run.py:793] *****************************************
W1112 19:01:59.371000 237246 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1112 19:01:59.371000 237246 site-packages/torch/distributed/run.py:793] *****************************************
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
lr_scheduler:
_component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
num_warmup_steps: 0
max_steps_per_epoch: 5
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: /tmp/Llama-3.2-90B-Vision-Instruct/logs
model:
_component_: torchtune.models.llama3_2_vision.qlora_llama3_2_vision_90b
apply_lora_to_mlp: true
apply_lora_to_output: false
decoder_trainable: frozen
encoder_trainable: lora
fusion_trainable: lora
image_size: 560
lora_alpha: 16
lora_attn_modules:
- q_proj
- v_proj
- output_proj
lora_dropout: 0.0
lora_rank: 8
optimizer:
_component_: torch.optim.AdamW
fused: true
lr: 0.0001
weight_decay: 0.01
output_dir: /tmp/qlora-llama3.2-vision-finetune
profiler:
_component_: torchtune.training.setup_torch_profiler
active_steps: 2
cpu: true
cuda: true
enabled: false
num_cycles: 1
output_dir: /tmp/qlora-llama3.2-vision-finetune/profiling_outputs
profile_memory: false
record_shapes: true
wait_steps: 5
warmup_steps: 3
with_flops: false
with_stack: false
resume_from_checkpoint: false
save_adapter_weights_only: false
seed: null
shuffle: true
tokenizer:
_component_: torchtune.models.llama3_2_vision.llama3_2_vision_transform
image_size: 560
max_seq_len: 8192
path: /tmp/Llama-3.2-90B-Vision-Instruct/original/tokenizer.model
NCCL version 2.21.5+cuda12.4
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1189370465. Local seed is seed + rank = 1189370465 + 0
Writing logs to /tmp/Llama-3.2-90B-Vision-Instruct/logs/log_1731466932.txt
INFO:torchtune.utils._logging:FSDP is enabled. Instantiating model and loading checkpoint on Rank 0 ...
INFO:torchtune.utils._logging:Instantiating model and loading checkpoint took 127.39 secs
INFO:torchtune.utils._logging:Memory stats after model init:
GPU peak memory allocation: 13.40 GiB
GPU peak memory reserved: 14.53 GiB
GPU peak memory active: 13.40 GiB
INFO:torchtune.utils._logging:Optimizer is initialized.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:Learning rate scheduler is initialized.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|5|Loss: 0.8870511651039124: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [13:37<00:00, 162.95s/it]
INFO:torchtune.utils._logging:Saving checkpoint. This may take some time. Retrieving full model state dict...
INFO:torchtune.utils._logging:Getting full model state dict took 201.79 secs
Looks to be no difference in speed or memory: both runs peak at 13.40 GiB allocated / 14.53 GiB reserved after model init, and per-step time is comparable (156.15 s/it vs. 162.95 s/it over 5 steps).
Context
What is the purpose of this PR?
Please link to any issues this PR addresses.
Changelog
What are the changes made in this PR?
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- pre-commit install
- pytest tests
- pytest tests -m integration_test
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.
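For the UX item, a rough sketch of what instantiating the quantized vision model looks like, using the builder and argument values shown in the resolved config printed in the test runs above (treat the exact keyword set as illustrative; defaults may differ):

```python
from torchtune.models.llama3_2_vision import qlora_llama3_2_vision_90b

# Arguments mirror the resolved config from the test runs above.
model = qlora_llama3_2_vision_90b(
    lora_attn_modules=["q_proj", "v_proj", "output_proj"],
    apply_lora_to_mlp=True,
    apply_lora_to_output=False,
    lora_rank=8,
    lora_alpha=16,
    lora_dropout=0.0,
    image_size=560,
    decoder_trainable="frozen",
    encoder_trainable="lora",
    fusion_trainable="lora",
)
```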