
Qwen 2.5 VL #2868


Draft: wants to merge 60 commits into main

Conversation

@albert-inflection (Collaborator) commented on Jul 3, 2025

Context

What is the purpose of this PR? Is it to

  • add a new feature (this PR: Qwen 2.5 VL support)
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

This PR addresses #2699.

Changelog

What are the changes made in this PR?

  • custom modules for Qwen 2.5 VL
  • model and component builders for all variants (see the hedged usage sketch below)
  • transform and custom collation
  • weight loading
  • unit tests for the new components
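
As a rough illustration of the intended UX, here is a sketch following torchtune's existing builder pattern; the module path and builder names below are assumptions, not the final API (check the diff for the real ones):

```python
# Hypothetical names, shown only to illustrate the builder pattern.
from torchtune.models.qwen2_5_vl import (  # assumed module path
    qwen2_5_vl_7b,          # assumed model builder
    qwen2_5_vl_transform,   # assumed transform builder
)

model = qwen2_5_vl_7b()  # assembles vision encoder + text decoder
transform = qwen2_5_vl_transform(path="/path/to/tokenizer")  # tokenizer + image transform
```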

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any of these, just ask and we will happily help. We also have a contributing page for guidance on contributing.

So far we have:

  • compared logits against HF for the encoder, decoder, and combined models (combined comparison shown)
    [screenshot: combined-model logit comparison, Jul 3 2025]
  • completed successful E2E training runs for all model variants (7B run shown)
    [screenshot: 7B E2E training run, Jul 2 2025]
  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

Co-authored by @lawrencefeng17

albert-inflection and others added 30 commits July 3, 2025 15:34
fleshed out _positional_embeddings.py with the Qwen2_5_VLRotaryEmbedding
and Qwen2_5_VLCompatibleRotaryEmbedding classes.

Qwen2_5_VLRotaryEmbedding is used inside
Qwen2_5_VLCompatibleRotaryEmbedding, which inherits from nn.Module.

Qwen2_5_VLCompatibleRotaryEmbedding.forward() takes a query or key
tensor and an input_pos tensor, and applies MRoPE (a hedged sketch follows).
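
For context, a minimal sketch of the MRoPE application step, following the public Qwen2.5-VL description (cos/sin carry one row each for the temporal, height, and width position ids, interleaved across head-dim chunks). This is illustrative, not necessarily this PR's exact code:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_mrope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                mrope_section=(16, 24, 24)) -> torch.Tensor:
    # x: [b, s, n_heads, head_dim]; cos/sin: [3, b, s, head_dim], one row
    # each for the temporal / height / width position ids.
    sections = list(mrope_section) * 2  # cos/sin already span both halves
    cos = torch.cat([c[i % 3] for i, c in enumerate(cos.split(sections, dim=-1))], dim=-1)
    sin = torch.cat([c[i % 3] for i, c in enumerate(sin.split(sections, dim=-1))], dim=-1)
    return x * cos.unsqueeze(2) + rotate_half(x) * sin.unsqueeze(2)
```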
wrapper function around MultiHeadAttention with MRoPE

beginnings of implementation for qwen2_5_vl_text_decoder
* Qwen25VLEarlyFusionModel inherits from EarlyFusionModel
* forward() calls get_rope_index with input_ids

* Incorporated Qwen25VLEarlyFusionModel into _model_builders.py
* fixed an incorrect raise condition in _positional_embeddings.py
* set bias=False in text decoder MLP
Albert Luo and others added 25 commits July 3, 2025 15:38
* going home

* tests and fixes: transform and tokenizer

* qwen2_5 tokenizer modified to handle image tokens
	* computes number of patches (see the sketch after this commit note)
	* accounts for qwen2-5-vl special tokens
* tests have an HF dependency

---------

Co-authored-by: lawrencefeng17 <[email protected]>
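
For intuition on the patch-count step, a toy sketch of the arithmetic (constants are the published Qwen2.5-VL defaults; the helper name is ours, not the PR's):

```python
patch_size = 14   # ViT patch edge, per the Qwen2.5-VL report
merge_size = 2    # 2x2 spatial merge: four patches -> one image token

def num_image_tokens(height: int, width: int) -> int:
    # Number of image-pad placeholder tokens the tokenizer must insert
    # for an image of the given (already-resized) size.
    grid_h, grid_w = height // patch_size, width // patch_size
    return (grid_h * grid_w) // (merge_size ** 2)

print(num_image_tokens(336, 336))  # -> 144
```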
* Add test file for vision encoder

* fix: reshape error in vision encoder rope

* rope in Qwen does not apply to adjacent dimensions but instead to
  mirrored dimensions
* the head dimension is split in half
* for example, if the head dimension is 80, then the rope pairs are
  (0, 40), (1, 41), ... (see the sketch after this commit note)
* added an abundance of test cases and tensor saves

* cleanup

---------

Co-authored-by: lawrencefeng17 <[email protected]>
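
A small sketch of the distinction this fix addresses (illustrative only):

```python
import torch

head_dim = 80
x = torch.arange(head_dim, dtype=torch.float32)

# Mirrored pairing (what Qwen's rope uses): dim i rotates with dim i + 40,
# i.e. pairs (0, 40), (1, 41), ...
x1, x2 = x.chunk(2)
mirrored = torch.cat((-x2, x1))

# Adjacent pairing (the incorrect assumption): dim 2i rotates with 2i + 1,
# i.e. pairs (0, 1), (2, 3), ...
adjacent = torch.stack((-x[1::2], x[::2]), dim=-1).flatten()

print(mirrored[:4])  # tensor([-40., -41., -42., -43.])
print(adjacent[:4])  # tensor([-1.,  0., -3.,  2.])
```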
* also cleaned up comments and docstrings
* added batch testing in test_full_model
* deleted test files
* deleted qwen transform wrapper function in model_builders
* fixed embedding tying (see the sketch after this list)
* created new vl tokenizer, inherits from qwen2_5
* deleted test.py in models/qwen2_5_vision
* deleted some comments in _fusion.py? (not sure what you meant)
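
On the embedding-tying fix, a minimal sketch of the pattern (toy sizes, illustrative only, not the PR's code):

```python
import torch
from torch import nn

vocab_size, dim = 1_000, 64  # toy sizes
tok_embeddings = nn.Embedding(vocab_size, dim)
output_proj = nn.Linear(dim, vocab_size, bias=False)

# Tying: the LM head reuses the token-embedding weight, so the two
# modules share one parameter instead of holding separate copies.
output_proj.weight = tok_embeddings.weight
assert output_proj.weight.data_ptr() == tok_embeddings.weight.data_ptr()
```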
pytorch-bot (bot) commented on Jul 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2868

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 4 Cancelled Jobs

As of commit c09279c with merge base 38edb21:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jul 3, 2025 (the label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed).
@@ -185,6 +185,7 @@ def forward(
*,
mask: Optional[_MaskType] = None,
input_pos: Optional[torch.Tensor] = None,
window_index: Optional[torch.Tensor] = None,
@albert-inflection (Collaborator, Author) commented:
Just a placeholder. The main dilemma we've had is where the unique window and rope index calculations should live in the stack. Currently we've tried our best to preserve existing primitives and patterns, but it's clearly not perfect; outside input would be appreciated. (A purely illustrative sketch follows.)
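
To make the trade-off concrete, here is one purely illustrative shape for the attention-side half: window_index is computed once upstream (e.g. in the transform or a model wrapper), and attention only applies the reordering. Names and placement are assumptions, not a proposal the PR commits to:

```python
import torch

def reorder_for_windows(x: torch.Tensor, window_index: torch.Tensor) -> torch.Tensor:
    # x: [b, s, d]; window_index: [s] permutation that groups tokens by
    # attention window. Computed once upstream, applied wherever needed.
    return x.index_select(dim=1, index=window_index)

x = torch.randn(2, 6, 8)
idx = torch.tensor([0, 3, 1, 4, 2, 5])
print(reorder_for_windows(x, idx).shape)  # torch.Size([2, 6, 8])
```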

from torchtune.modules.model_fusion._early_fusion import EarlyFusionModel
from torchtune.modules import TransformerDecoder

class Qwen25VL(EarlyFusionModel):
    ...  # class body elided in this excerpt
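
For readers skimming the thread, a hedged sketch of the override described in the commit log ("forward() calls get_rope_index with input_ids"); the get_rope_index signature and kwarg names here are assumptions:

```python
class Qwen25VLSketch(EarlyFusionModel):
    # Illustrative only: derive the 3-D (t/h/w) rope positions from the
    # token ids before delegating to the fused decoder.
    def forward(self, tokens, *, encoder_input=None, **kwargs):
        input_pos = self.get_rope_index(tokens, encoder_input)  # assumed helper
        return super().forward(
            tokens, encoder_input=encoder_input, input_pos=input_pos, **kwargs
        )
```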
@albert-inflection (Collaborator, Author) commented:

Still a bit rocky; early looks appreciated, @joecummings.

Labels: CLA Signed