
Qwen 2.5 VL #2868


Draft · wants to merge 60 commits into base: main

Commits (60)
ede1463
qwen 2.5 vl code skeleton
albert-inflection May 19, 2025
3269545
model builder progress
albert-inflection May 19, 2025
6b013ec
more model building progress
albert-inflection May 20, 2025
5cb7421
airplane update
albert-inflection May 22, 2025
6d09f1f
WIP transform + rope
Jun 10, 2025
1c5dd67
image transform progress
Jun 10, 2025
8992e50
image transform progress
albert-inflection Jun 10, 2025
74614b2
Qwen2_5_VLImageTransform complete
lawrence-inflection Jun 11, 2025
59fe9cd
remove context.md from tracking
lawrence-inflection Jun 12, 2025
f9cdb83
Qwen2_5_VLTransform implemented
lawrence-inflection Jun 12, 2025
3032d75
module progress
Jun 12, 2025
c634a4b
batch size in ViT forward
Jun 13, 2025
423a268
rehaul modules, start from near HF
Jun 18, 2025
d3d4bd2
Rope + Window attn attempt 1
Jun 20, 2025
ad39ebb
_positional_embeddings.py implementation
lawrencefeng17 Jun 20, 2025
0193832
progress on _component_builders.py for decoder
lawrencefeng17 Jun 20, 2025
caa77ff
upstream cleanup
Jun 21, 2025
f1a235e
more cleanup
Jun 21, 2025
a2eacc9
merge temp branch onto albert/qwen2.5-vl
lawrencefeng17 Jun 23, 2025
16902fa
refactored Qwen25VLRotaryPositionalEmbeddings; passed test cases
lawrencefeng17 Jun 23, 2025
d4fb9c2
refactored Qwen25VLRotaryPositionalEmbeddings; added summary context.md
lawrencefeng17 Jun 23, 2025
f2c3a0e
feat: Qwen25VLEarlyFusionModel wrapper class
lawrencefeng17 Jun 23, 2025
896b070
rebase
Jul 3, 2025
3db79f9
clean up mlps
Jun 23, 2025
7024fdc
clean up encoder builder
Jun 23, 2025
20728a0
fix: removed raise condition; decoder bias fix
lawrencefeng17 Jun 24, 2025
bb3b4a6
checkpointing + edits
Jun 24, 2025
045f71b
init
Jun 24, 2025
b959286
convert weights final
Jun 24, 2025
7bf0a09
model builder slight fix
Jun 24, 2025
06ce596
fixes: minor changes, early end-to-end testing
lawrencefeng17 Jun 25, 2025
e8ab57c
fix: completely rewrote mrope
lawrencefeng17 Jun 26, 2025
4e44c1f
fix: minor fixes to mrope
lawrencefeng17 Jun 26, 2025
00e79f8
transform edits
Jun 26, 2025
257cbcf
feat: mrope cache implemented for decoder (#2)
lawrence-inflection Jun 26, 2025
801efb4
encoder forward pass edits
Jun 26, 2025
3df44cf
bug fixes, training works now
albert-inflection Jun 27, 2025
cc52ebb
tested and fixed _transform
lawrence-inflection Jun 27, 2025
5ab217b
weight saving fix + import
albert-inflection Jun 30, 2025
4928249
Lawrence/qwen2.5 vl/encoder tests
lawrence-inflection Jul 2, 2025
47a9e19
feat: added other qwen variants in model builders
lawrencefeng17 Jul 2, 2025
a8b00df
custom collation + init edits
albert-inflection Jul 2, 2025
e63202a
fix: removed default args to transform
lawrencefeng17 Jul 2, 2025
50314d3
nits
albert-inflection Jul 2, 2025
f6e75d3
7B config
albert-inflection Jul 2, 2025
b2b74bc
config nit
albert-inflection Jul 2, 2025
767b025
added test cases in torchtune style
lawrencefeng17 Jul 3, 2025
e03eb9c
cleanup
albert-inflection Jul 2, 2025
a82e72c
rm uv.lock
albert-inflection Jul 2, 2025
47c60c5
trainable params
albert-inflection Jul 2, 2025
df68e52
updated model builders
albert-inflection Jul 2, 2025
e98578c
rename rope
albert-inflection Jul 3, 2025
346987b
cleanup
lawrencefeng17 Jul 3, 2025
9438ca8
fix
lawrencefeng17 Jul 3, 2025
23e0640
cleanup:
lawrencefeng17 Jul 3, 2025
1ff7ffa
3B recipe and model builder edit
albert-inflection Jul 3, 2025
e7c8b85
32B config and modelbuilder changes'
albert-inflection Jul 3, 2025
d5ff0e9
72B config
albert-inflection Jul 3, 2025
43f1cbe
nit diffs
Jul 3, 2025
c09279c
fix padding token
Jul 3, 2025
recipes/configs/qwen2_5_vision/32B_full.yaml (new file, +109 lines)
# Config for multi-device full finetuning in full_finetune_distributed.py
# using a Qwen2.5 VL 32B
#
# This config assumes that you've run the following command before launching
# this run:
# tune download Qwen/Qwen2.5-VL-32B-Instruct --output-dir /tmp/Qwen2.5-VL-32B-Instruct
#
# To launch on 4 devices, run the following command from root:
# tune run --nnodes 1 --nproc_per_node 4 full_finetune_distributed --config qwen2_5_vision/32B_full
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune run --nnodes 1 --nproc_per_node 4 full_finetune_distributed --config qwen2_5_vision/32B_full checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config was only tested on a 4xH100 machine.

output_dir: /tmp/torchtune/qwen2_5_32B/full # /tmp may be deleted by your system. Change it to your preference.

# Tokenizer
tokenizer:
_component_: torchtune.models.qwen2_5_vision.Qwen2_5_VLTransform
path: /tmp/Qwen2.5-VL-32B-Instruct/vocab.json
merges_file: /tmp/Qwen2.5-VL-32B-Instruct/merges.txt
max_seq_len: null

# Dataset
dataset:
_component_: torchtune.datasets.multimodal.the_cauldron_dataset
packed: False # True increases speed
subset: ocrvqa
seed: null
shuffle: True
collate_fn: torchtune.models.qwen2_5_vision.qwen2_5_vl_padded_collate_images


# Model Arguments
model:
_component_: torchtune.models.qwen2_5_vision.qwen2_5_vl_32b

checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Qwen2.5-VL-32B-Instruct
checkpoint_files:
filename_format: model-{}-of-{}.safetensors
max_filename: "00018"
recipe_checkpoint: null
output_dir: ${output_dir}
model_type: QWEN2_5_VL
resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 2
epochs: 1
optimizer:
_component_: torch.optim.AdamW
lr: 5e-6
optimizer_in_bwd: True # True saves memory. Requires gradient_accumulation_steps=1
loss:
_component_: torchtune.modules.loss.LinearCrossEntropyLoss
max_steps_per_epoch: 100
gradient_accumulation_steps: 1 # Use to increase effective batch size
clip_grad_norm: null
compile: False # torch.compile the model + loss, True increases speed + decreases memory

# Training environment
device: cuda

# Memory management
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: False # True reduces memory
custom_sharded_layers: ['decoder.tok_embeddings', 'decoder.output']

# Reduced precision
dtype: bf16

# Logging
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: False
log_level: INFO # DEBUG, WARN, etc.


# Profiler (disabled)
profiler:
_component_: torchtune.training.setup_torch_profiler
enabled: False

#Output directory of trace artifacts
output_dir: ${output_dir}/profiling_outputs

#`torch.profiler.ProfilerActivity` types to trace
cpu: True
cuda: True

#trace options passed to `torch.profiler.profile`
profile_memory: False
with_stack: False
record_shapes: True
with_flops: False

# `torch.profiler.schedule` options:
# wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
wait_steps: 5
warmup_steps: 3
active_steps: 2
num_cycles: 1
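Unlike the 3B config below, which lists every checkpoint shard explicitly, this config describes the shards with a `filename_format` / `max_filename` pair. As a minimal sketch of how such a pattern expands into concrete filenames (hypothetical helper written for illustration, not the torchtune code that actually performs this expansion):

```python
def expand_checkpoint_files(filename_format: str, max_filename: str) -> list[str]:
    """Expand a shard pattern like 'model-{}-of-{}.safetensors' + '00018'
    into the full, zero-padded list of shard filenames."""
    num_files = int(max_filename)
    width = len(max_filename)
    return [
        filename_format.format(str(i).zfill(width), max_filename)
        for i in range(1, num_files + 1)
    ]

files = expand_checkpoint_files("model-{}-of-{}.safetensors", "00018")
print(files[0], files[-1])
# model-00001-of-00018.safetensors model-00018-of-00018.safetensors
```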
recipes/configs/qwen2_5_vision/3B_full_single_device.yaml (new file, +115 lines)
# Config for single device full finetuning in full_finetune_single_device.py
# using a Qwen2.5 VL 3B
#
# This config assumes that you've run the following command before launching
# this run:
# tune download Qwen/Qwen2.5-VL-3B-Instruct --output-dir /tmp/Qwen2.5-VL-3B-Instruct
#
# The default config uses an optimizer from bitsandbytes. If you do not have it installed,
# you can install it with
# pip install bitsandbytes
#
# To launch on a single device, run the following command from root:
# tune run full_finetune_single_device --config qwen2_5_vision/3B_full_single_device
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune run full_finetune_single_device --config qwen2_5_vision/3B_full_single_device checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works only for training on single device.

output_dir: /tmp/torchtune/qwen2_5_3B_VL/full_single_device # /tmp may be deleted by your system. Change it to your preference.

# Tokenizer
tokenizer:
_component_: torchtune.models.qwen2_5_vision.Qwen2_5_VLTransform
path: /tmp/Qwen2.5-VL-3B-Instruct/vocab.json
merges_file: /tmp/Qwen2.5-VL-3B-Instruct/merges.txt
max_seq_len: null

# Dataset
dataset:
_component_: torchtune.datasets.multimodal.the_cauldron_dataset
packed: False # True increases speed
subset: ocrvqa
seed: null
shuffle: True
collate_fn: torchtune.models.qwen2_5_vision.qwen2_5_vl_padded_collate_images


# Model Arguments
model:
_component_: torchtune.models.qwen2_5_vision.qwen2_5_vl_3b

checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Qwen2.5-VL-3B-Instruct
checkpoint_files: [
model-00001-of-00004.safetensors,
model-00002-of-00004.safetensors,
model-00003-of-00004.safetensors,
model-00004-of-00004.safetensors,
]
recipe_checkpoint: null
output_dir: ${output_dir}
model_type: QWEN2_5_VL
resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 1
epochs: 1
optimizer:
_component_: bitsandbytes.optim.PagedAdamW
lr: 5e-6
optimizer_in_bwd: True # True saves memory. Requires gradient_accumulation_steps=1
loss:
_component_: torchtune.modules.loss.LinearCrossEntropyLoss
max_steps_per_epoch: null
gradient_accumulation_steps: 1 # Use to increase effective batch size
clip_grad_norm: null
compile: False # torch.compile the model + loss, True increases speed + decreases memory

# Training environment
device: cuda

# Memory management
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: False # True reduces memory

# Reduced precision
dtype: bf16

# Logging
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: False
log_level: INFO # DEBUG, WARN, etc.


# Profiler (disabled)
profiler:
_component_: torchtune.training.setup_torch_profiler
enabled: False

#Output directory of trace artifacts
output_dir: ${output_dir}/profiling_outputs

#`torch.profiler.ProfilerActivity` types to trace
cpu: True
cuda: True

#trace options passed to `torch.profiler.profile`
profile_memory: False
with_stack: False
record_shapes: True
with_flops: False

# `torch.profiler.schedule` options:
# wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
wait_steps: 5
warmup_steps: 3
active_steps: 2
num_cycles: 1
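For orientation, the `_component_` paths in this config resolve to objects introduced in this PR. A rough sketch of the equivalent manual instantiation, assuming the constructor arguments mirror the YAML fields above (the exact signatures come from this PR and are not part of released torchtune):

```python
from torchtune.models.qwen2_5_vision import Qwen2_5_VLTransform, qwen2_5_vl_3b

# Mirrors the tokenizer section: vocab/merges taken from the downloaded HF checkpoint.
transform = Qwen2_5_VLTransform(
    path="/tmp/Qwen2.5-VL-3B-Instruct/vocab.json",
    merges_file="/tmp/Qwen2.5-VL-3B-Instruct/merges.txt",
    max_seq_len=None,
)

# Mirrors the model section: the builder returns the full vision encoder + decoder module.
model = qwen2_5_vl_3b()
```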
recipes/configs/qwen2_5_vision/72B_full.yaml (new file, +109 lines)
# Config for multi-device full finetuning in full_finetune_distributed.py
# using a Qwen2.5 VL 72B
#
# This config assumes that you've run the following command before launching
# this run:
# tune download Qwen/Qwen2.5-VL-72B-Instruct --output-dir /tmp/Qwen2.5-VL-72B-Instruct
#
# To launch on 8 devices, run the following command from root:
# tune run --nnodes 1 --nproc_per_node 8 full_finetune_distributed --config qwen2_5_vision/72B_full
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune run --nnodes 1 --nproc_per_node 8 full_finetune_distributed --config qwen2_5_vision/72B_full checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config was only tested on an 8xH100 machine.

output_dir: /tmp/torchtune/qwen2_5_72B/full # /tmp may be deleted by your system. Change it to your preference.

# Tokenizer
tokenizer:
_component_: torchtune.models.qwen2_5_vision.Qwen2_5_VLTransform
path: /tmp/Qwen2.5-VL-72B-Instruct/vocab.json
merges_file: /tmp/Qwen2.5-VL-72B-Instruct/merges.txt
max_seq_len: null

# Dataset
dataset:
_component_: torchtune.datasets.multimodal.the_cauldron_dataset
packed: False # True increases speed
subset: ocrvqa
seed: null
shuffle: True
collate_fn: torchtune.models.qwen2_5_vision.qwen2_5_vl_padded_collate_images


# Model Arguments
model:
_component_: torchtune.models.qwen2_5_vision.qwen2_5_vl_72b

checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Qwen2.5-VL-72B-Instruct
checkpoint_files:
filename_format: model-{}-of-{}.safetensors
max_filename: "00018"
recipe_checkpoint: null
output_dir: ${output_dir}
model_type: QWEN2_5_VL
resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 2
epochs: 1
optimizer:
_component_: torch.optim.AdamW
lr: 5e-6
optimizer_in_bwd: True # True saves memory. Requires gradient_accumulation_steps=1
loss:
_component_: torchtune.modules.loss.LinearCrossEntropyLoss
max_steps_per_epoch: 100
gradient_accumulation_steps: 1 # Use to increase effective batch size
clip_grad_norm: null
compile: False # torch.compile the model + loss, True increases speed + decreases memory

# Training environment
device: cuda

# Memory management
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: False # True reduces memory
custom_sharded_layers: ['decoder.tok_embeddings', 'decoder.output']

# Reduced precision
dtype: bf16

# Logging
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: False
log_level: INFO # DEBUG, WARN, etc.


# Profiler (disabled)
profiler:
_component_: torchtune.training.setup_torch_profiler
enabled: False

#Output directory of trace artifacts
output_dir: ${output_dir}/profiling_outputs

#`torch.profiler.ProfilerActivity` types to trace
cpu: True
cuda: True

#trace options passed to `torch.profiler.profile`
profile_memory: False
with_stack: False
record_shapes: True
with_flops: False

# `torch.profiler.schedule` options:
# wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
wait_steps: 5
warmup_steps: 3
active_steps: 2
num_cycles: 1
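Finally, a hedged sketch of how the dataset and custom collate named in these configs might come together inside a recipe's dataloader; `qwen2_5_vl_padded_collate_images` and `Qwen2_5_VLTransform` are added by this PR, so their behavior and signatures are assumed from the YAML fields rather than confirmed:

```python
from torch.utils.data import DataLoader
from torchtune.datasets.multimodal import the_cauldron_dataset
from torchtune.models.qwen2_5_vision import (
    Qwen2_5_VLTransform,
    qwen2_5_vl_padded_collate_images,
)

# Transform and dataset mirror the tokenizer/dataset sections of this config.
transform = Qwen2_5_VLTransform(
    path="/tmp/Qwen2.5-VL-72B-Instruct/vocab.json",
    merges_file="/tmp/Qwen2.5-VL-72B-Instruct/merges.txt",
    max_seq_len=None,
)
ds = the_cauldron_dataset(model_transform=transform, subset="ocrvqa")

# The recipe wires the config's collate_fn into its DataLoader roughly like this;
# the collate is assumed to pad text tokens and batch variable-size image patches.
loader = DataLoader(
    ds,
    batch_size=2,
    shuffle=True,
    collate_fn=qwen2_5_vl_padded_collate_images,
)
```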