[CLIP ENCODER] Vision Transform for Clip encoder #1127
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1127
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 348e681 with merge base 06a125e.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##             main    #1127       +/-   ##
===========================================
- Coverage   65.98%   26.82%   -39.17%
===========================================
  Files         194      212       +18
  Lines        9023     9595      +572
===========================================
- Hits         5954     2574     -3380
- Misses       3069     7021     +3952
☔ View full report in Codecov by Sentry.
Some comments, but overall this looks good to me - thank you for patiently addressing all of the comments. I'll let @pbontrager and/or @ebsmothers take a pass and stamp
This is looking really good, and I really love the docstrings in this. I left a number of comments for a few things to either address or clarify, but I think it's close to ready to land.
logger = logging.getLogger(__name__)

def clip_vision_encoder(
I'm tempted to say that maybe we should have clip_vision_encoder and tiled_clip_vision_encoder builders
I can see why, but we may have to pay the debt elsewhere: in the transforms, adapters, masking, and inference we may have to check the shape and see if it contains tiles. Assuming everything is tiled saves some downstream complexity.
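For illustration, a minimal sketch of the "assume everything is tiled" convention discussed here; `ensure_tiled` is a hypothetical helper, not part of this PR:

```python
# Sketch only: `ensure_tiled` is a hypothetical helper, not part of this PR.
import torch

def ensure_tiled(images: torch.Tensor) -> torch.Tensor:
    """Treat an untiled batch (b, c, h, w) as a single-tile batch
    (b, n_tiles=1, c, h, w), so downstream transforms, masking, and
    inference only ever handle one shape."""
    if images.ndim == 4:
        return images.unsqueeze(1)  # add a tile dimension of size 1
    return images  # already (b, n_tiles, c, h, w)

batch = torch.randn(2, 3, 224, 224)   # untiled input
print(ensure_tiled(batch).shape)      # torch.Size([2, 1, 3, 224, 224])
```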
from torch import nn, Tensor

class Fp32LayerNorm(nn.LayerNorm):
We don't currently support mixed precision training in the library, so I don't believe we should include this. On top of that, I believe torch autocast already automatically converts LayerNorm to fp32 (ref).
Do you know why we keep RMSNorm and the float32 logic instead of using torch's native implementation?
torchtune: https://github.com/pytorch/torchtune/blob/main/torchtune/modules/rms_norm.py
torch: https://pytorch.org/torchtune/stable/generated/torchtune.modules.RMSNorm.html
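For context, a minimal sketch of what an fp32 norm wrapper of this kind typically looks like, assuming the intent is numerical stability: upcast to fp32 for the normalization, then cast back to the input dtype. This mirrors the pattern in torchtune's RMSNorm linked above, not necessarily the exact code in this PR:

```python
from torch import nn, Tensor

class Fp32LayerNorm(nn.LayerNorm):
    def forward(self, x: Tensor) -> Tensor:
        # run the normalization itself in fp32 regardless of input dtype
        output = nn.functional.layer_norm(
            x.float(),
            self.normalized_shape,
            self.weight.float() if self.weight is not None else None,
            self.bias.float() if self.bias is not None else None,
            self.eps,
        )
        return output.type_as(x)  # cast back to the original dtype
```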
3) The patches will be flattened and transformed. We call them tokens, because that's how the transformer sees them.

Image: shape (8x8)
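A minimal sketch of the patch-to-token step described in this docstring, using a hypothetical single-channel 8x8 image and 4x4 patches:

```python
# Illustrative only: a 1-channel 8x8 image split into non-overlapping 4x4
# patches, each flattened into a "token" vector.
import torch

image = torch.arange(64, dtype=torch.float32).reshape(1, 8, 8)  # (c=1, h=8, w=8)
patch_size = 4
# non-overlapping patches -> (c, h//p, w//p, p, p) = (1, 2, 2, 4, 4)
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
# flatten each patch into a token -> (n_tokens, c * p * p) = (4, 16), since c=1
tokens = patches.permute(1, 2, 0, 3, 4).reshape(-1, patch_size * patch_size)
print(tokens.shape)  # torch.Size([4, 16])
```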
I don't think any of these are actually appearing in the docs. I didn't see anything below "In summary".
Thanks for patiently addressing all of the comments!
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: Kartikay Khandelwal <[email protected]>
…opk and routing (pytorch#1127)
This PR adds a single sort_tokens function (simplified from the earlier prepare_expert_routing). It removes the code duplication that was previously present in moe_forward and moe_on_device, as both were doing exactly the same expert routing prep. The goal is to avoid tech debt by streamlining this functionality into a single function so that any future updates propagate automatically.
Testing: verified inference works as before, with and without CUDA graphs.
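A minimal sketch of what a `sort_tokens`-style routing prep might do, under the assumptions in the commit message (top-k expert selection followed by grouping tokens by expert); names and shapes are illustrative, not the actual pytorch#1127 implementation:

```python
# Illustrative sketch, not the actual implementation: compute top-k expert
# assignments per token, then sort so tokens destined for the same expert
# are contiguous.
import torch

def sort_tokens(router_logits: torch.Tensor, top_k: int):
    # per-token expert probabilities and top-k expert choices
    probs = router_logits.softmax(dim=-1)
    weights, expert_ids = torch.topk(probs, top_k, dim=-1)  # (n_tokens, top_k)
    flat_experts = expert_ids.flatten()
    order = torch.argsort(flat_experts)  # group entries by expert id
    token_ids = torch.arange(router_logits.size(0)).repeat_interleave(top_k)
    return token_ids[order], flat_experts[order], weights.flatten()[order]

logits = torch.randn(4, 8)              # 4 tokens routed across 8 experts
tok, exp, w = sort_tokens(logits, top_k=2)
print(exp)                              # expert ids in ascending order
```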
Context
What is the purpose of this PR? Is it to
Added Vision Transform architecture, which can be used for the CLIP model.
Main points to focus on in the review:
Changelog
Test plan
pre-commit install
pytest tests
pytest tests -m integration_test