[CLIP ENCODER] Vision Transform for Clip encoder #1127


Merged: 48 commits merged into pytorch:main from clip_encoder on Jul 8, 2024

Conversation

@felipemello1 (Contributor) commented on Jun 27, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Added the Vision Transformer architecture, which can be used for the CLIP model.

Main points to focus on in the review:

  • Docstrings, especially of VisionTransformer. Are the examples clear? Are the names intuitive (tile, patch, token)?
  • Are we comfortable with leaving the CLS projection in the ViT if we are going to create a projection module somewhere else?

Changelog

  • LayerNorm (public)
  • VisionTransformer (public)
  • clip_vision_encoder builder (see the usage sketch after this list)
  • Positional embeddings to support tiled images
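
To make the builder concrete, a minimal usage sketch follows. The parameter names and import path are illustrative assumptions, not necessarily the exact API added in this PR:

```python
import torch
from torchtune.models.clip import clip_vision_encoder  # assumed import path

encoder = clip_vision_encoder(
    tile_size=224,   # spatial size of each tile
    patch_size=16,   # each tile is split into 16x16-pixel patches
    embed_dim=768,   # dimension of each token embedding
    num_layers=12,
    num_heads=12,
)

# Images always carry a tile dimension: (batch, n_tiles, channels, h, w).
# An untiled image is simply the n_tiles=1 case.
images = torch.randn(2, 1, 3, 224, 224)
tokens = encoder(images)
```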

Test plan

  • Shared notebook with a parity check against CLIP in torchmultimodal
  • unit tests asserting shapes and regression values for the positional encodings and the ViT
  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
    • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

pytorch-bot (bot) commented on Jun 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1127

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 348e681 with merge base 06a125e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jun 27, 2024
@felipemello1 marked this pull request as draft on June 27, 2024 01:12
@felipemello1 changed the title from [WIP][CLIP ENCODER] Vision Transform for Clip encoder to [CLIP ENCODER] Vision Transform for Clip encoder on Jul 1, 2024
@codecov-commenter commented on Jul 1, 2024

Codecov Report

Attention: Patch coverage is 29.62963% with 209 lines in your changes missing coverage. Please review.

Project coverage is 26.82%. Comparing base (f158577) to head (beb80c4).
Report is 2 commits behind head on main.

File | Patch % | Missing lines
torchtune/modules/vision_transformer.py | 18.51% | 66 ⚠️
tests/torchtune/modules/test_vision_transformer.py | 22.85% | 54 ⚠️
torchtune/models/clip/_position_embeddings.py | 23.91% | 35 ⚠️
...orchtune/models/clip/test_positional_embeddings.py | 26.31% | 28 ⚠️
torchtune/models/clip/_component_builders.py | 35.00% | 13 ⚠️
tests/torchtune/modules/test_layernorm.py | 62.96% | 10 ⚠️
torchtune/modules/layer_norm.py | 62.50% | 3 ⚠️

❗ There is a different number of reports uploaded between BASE (f158577) and HEAD (beb80c4). Click for more details.

HEAD has 1 upload less than BASE: BASE (f158577) has 4 uploads, HEAD (beb80c4) has 3.
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1127       +/-   ##
===========================================
- Coverage   65.98%   26.82%   -39.17%     
===========================================
  Files         194      212       +18     
  Lines        9023     9595      +572     
===========================================
- Hits         5954     2574     -3380     
- Misses       3069     7021     +3952     

☔ View full report in Codecov by Sentry.

@felipemello1 marked this pull request as ready for review on July 1, 2024 04:54
@kartikayk (Contributor) left a comment

Some comments, but overall this looks good to me. Thank you for patiently addressing all of them. I'll let @pbontrager and/or @ebsmothers take a pass and stamp.

@pbontrager (Contributor) left a comment

This is looking really good, and I really love the docstrings. I left a number of comments on a few things to either address or clarify, but I think it's close to ready to land.


logger = logging.getLogger(__name__)

def clip_vision_encoder(
Contributor commented:

I'm tempted to say that maybe we should have clip_vision_encoder and tiled_clip_vision_encoder builders

@felipemello1 (Author) replied:

I can see why, but we may have to pay the debt elsewhere. In the transforms, adapter, masking, and inference we may have to check the shape and see if it contains tiles. Assuming everything is tiled saves some downstream complexity.
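
An illustrative sketch (not code from this PR) of the convention described above: downstream code always sees a tile dimension, so an untiled image is just the n_tiles=1 case and no shape-checking branch is needed. The helper name to_tiled is hypothetical:

```python
import torch

def to_tiled(image: torch.Tensor) -> torch.Tensor:
    """Normalize (c, h, w) or (n_tiles, c, h, w) to (n_tiles, c, h, w)."""
    return image.unsqueeze(0) if image.ndim == 3 else image

single = torch.randn(3, 224, 224)    # a plain, untiled image
tiled = torch.randn(4, 3, 224, 224)  # an image cropped into 4 tiles

assert to_tiled(single).shape == (1, 3, 224, 224)
assert to_tiled(tiled).shape == (4, 3, 224, 224)
```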

from torch import nn, Tensor


class Fp32LayerNorm(nn.LayerNorm):
Contributor commented:

We don't currently support mixed-precision training in the library, so I don't believe we should include this. On top of that, I believe torch autocast already automatically converts LayerNorm to fp32 (ref).
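
For reference, a minimal sketch of what an Fp32LayerNorm typically does, mirroring the common pattern (e.g. in torchmultimodal); this is illustrative, not necessarily the exact code in this diff:

```python
from torch import nn, Tensor

class Fp32LayerNorm(nn.LayerNorm):
    """LayerNorm that computes in fp32 for numerical stability, then casts back."""

    def forward(self, x: Tensor) -> Tensor:
        output = nn.functional.layer_norm(
            x.float(),  # upcast activations to fp32
            self.normalized_shape,
            self.weight.float() if self.weight is not None else None,
            self.bias.float() if self.bias is not None else None,
            self.eps,
        )
        return output.type_as(x)  # cast back to the input dtype
```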


3) The patches will be flattened and transformed. We call them tokens, because that's how the transformer sees them.


Image: shape (8x8)
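
To make the tile/patch/token terminology concrete, a quick arithmetic sketch with assumed example sizes (the 8x8 image from the docstring, plus hypothetical 4x4 tiles and 2x2 patches):

```python
image_size, tile_size, patch_size = 8, 4, 2

n_tiles = (image_size // tile_size) ** 2           # 4 tiles per image
patches_per_tile = (tile_size // patch_size) ** 2  # 4 patches per tile
n_tokens = n_tiles * patches_per_tile              # 16 tokens seen by the transformer
print(n_tiles, patches_per_tile, n_tokens)         # 4 4 16
```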
Contributor commented:

I don't think any of these are actually appearing in the docs. I didn't see anything below "In summary".

@felipemello1 (Author) replied:

Weird. This is how it shows for me:

[screenshot of the rendered docstring]

@kartikayk (Contributor) left a comment

Thanks for patiently addressing all of the comments!

@felipemello1 merged commit 069b12b into pytorch:main on Jul 8, 2024
29 checks passed
@felipemello1 deleted the clip_encoder branch on July 8, 2024 22:39
maximegmd pushed a commit to maximegmd/torchtune that referenced this pull request Jul 13, 2024
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: Kartikay Khandelwal <[email protected]>
yinfan98 pushed a commit to yinfan98/sgl-tune-eagle that referenced this pull request May 26, 2025
…opk and routing (pytorch#1127)

This PR adds a single sort_tokens function (simplified from the earlier prepare_expert_routing). It removes the code duplication that was previously present in moe_forward and moe_on_device, both of which were doing the exact same expert-routing prep.

The goal is to avoid tech debt by streamlining this functionality into a single function so that any future updates are auto-propagated.

Testing: verified that inference works as before, with and without CUDA graphs.
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
8 participants