Fix for Qwen with Yarn #85


Merged
merged 9 commits into from Jul 7, 2025

Conversation

giulio98
Contributor

@giulio98 giulio98 commented Jun 21, 2025

PR description

Fixes #84

Checklist

  • Tests are working (make test)
  • Code is formatted correctly (make style, on errors try fix with make format)
  • Copyright header is included
  • All commits are signed-off using git commit -s
  • (new press) mypress_press.py is in the presses directory
  • (new press) MyPress is in __init__.py
  • (new press) README.md is updated with a 1 liner about the new press in the Available presses section
  • (new press) new press is in the default_presses list in tests/default_presses.py

@alessiodevoto
Collaborator

Hi @giulio98, thanks for opening this PR! 🙂 I think the issue and the way you are addressing it make sense 👍 I left some comments concerning the rerotation logic and some style fixes, let me know what you think!

@alessiodevoto
Collaborator

@giulio98 we updated main, so please don't forget to pull before the next commit :)

@giulio98 giulio98 requested a review from alessiodevoto July 4, 2025 14:30
@alessiodevoto
Collaborator

@giulio98 thanks for updating the code! I left a couple of additional comments; after addressing those and the style issues (make style is failing), we'll be able to merge!

Signed-off-by: giulio98 <[email protected]>
@giulio98
Contributor Author

giulio98 commented Jul 7, 2025

Hi @alessiodevoto, I ran make style; however, I cannot see your latest additional comments.

@alessiodevoto
Collaborator

alessiodevoto commented Jul 7, 2025

@giulio98 forgot to publish the review 🤦 here it is

Signed-off-by: giulio98 <[email protected]>
@giulio98
Contributor Author

giulio98 commented Jul 7, 2025

@alessiodevoto Thanks for the feedback! I renamed the variables using the convention used in other files:
B -> bsz
H -> num_key_value_heads
L -> n_kept
D -> d

@giulio98
Contributor Author

giulio98 commented Jul 7, 2025

@alessiodevoto I noticed here: https://github.com/huggingface/transformers/blob/b283d52f7f89d9cf3c77cfef233c4cbf700959ff/src/transformers/models/llama/modeling_llama.py#L98 that I have to multiply the cos and sin by attention_scaling.

Edit: by multiplying cos and sin by attention_scaling I again get a drop in performance (7.47 F1 on NarrativeQA).
@alessiodevoto is it correct in this context to multiply by the attention scaling?

My thought is: I don't have to apply attention_scaling at this stage because it was already applied in the prefill, and if I apply it again it is like applying the scaling twice. But let me know your thoughts.
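For reference, a minimal NumPy sketch of the pattern being discussed (illustrative values and names, not the transformers source): in a YaRN-style setup, attention_scaling is folded into cos and sin once, when they are computed for the rotation.

```python
import numpy as np

# Illustrative sketch: attention_scaling is baked into cos/sin once,
# at the point where the rotary tables are built (YaRN-style setup).
def rotary_cos_sin(positions, dim, attention_scaling, base=10000.0):
    inv_freq = 1.0 / (base ** (np.arange(0, dim // 2) / (dim // 2)))
    angles = positions[:, None] * inv_freq
    emb = np.concatenate([angles, angles], axis=-1)  # tile half-dim angles to full head dim
    # attention_scaling applied here, so downstream rotations already carry it
    return np.cos(emb) * attention_scaling, np.sin(emb) * attention_scaling

cos, sin = rotary_cos_sin(np.arange(4, dtype=float), 8, 0.7)
assert cos.shape == (4, 8)
```

Any key rotated with these tables already includes one factor of attention_scaling, which is the crux of the double-scaling question above.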

@alessiodevoto
Collaborator

alessiodevoto commented Jul 7, 2025

Hi @giulio98, nice catch, let's see. When we get the keys, they are already rotated with
k_rope = (k * cos * attention_scaling) + (rotate_half(k) * sin * attention_scaling) = ((k * cos) + (rotate_half(k) * sin)) * attention_scaling.
We then compute a new sine and cosine, new_sin and new_cos, that shift the rotation again. If we apply the scaling again, we have
k_embed = ((k_rope * new_cos) + (rotate_half(k_rope) * new_sin)) * attention_scaling. Given that attention_scaling was already accounted for in k_rope, multiplying by it again we would end up with k_embed = ((k * composed_cos) + (rotate_half(k) * composed_sin)) * attention_scaling^2, so we would be applying a power of the scaling. Therefore, I think we should remove it!
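The argument above can be checked numerically. A minimal sketch (the scaling value, head dim, and positions are illustrative assumptions, not the kvpress implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
k = rng.standard_normal(d)

def rotate_half(x):
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([-x2, x1], axis=-1)

def rope_cos_sin(pos, d):
    inv_freq = 1.0 / (10000 ** (np.arange(0, d // 2) / (d // 2)))
    angles = pos * inv_freq
    ang = np.concatenate([angles, angles])  # tile half-dim angles to full head dim
    return np.cos(ang), np.sin(ang)

scaling = 0.7  # illustrative YaRN attention_scaling value

# Prefill: rotate k to position 5, with attention_scaling applied once
cos5, sin5 = rope_cos_sin(5.0, d)
k_rope = (k * cos5 * scaling) + (rotate_half(k) * sin5 * scaling)

# Rerotation: shift from position 5 back to position 2 (delta = -3)
cosd, sind = rope_cos_sin(-3.0, d)

# Correct: no extra scaling during the rerotation
k_rerot = (k_rope * cosd) + (rotate_half(k_rope) * sind)

# Reference: rotate the original k directly to position 2 (scaled once)
cos2, sin2 = rope_cos_sin(2.0, d)
k_ref = (k * cos2 * scaling) + (rotate_half(k) * sin2 * scaling)
assert np.allclose(k_rerot, k_ref)

# Wrong: multiplying by the scaling again squares it relative to the reference
k_wrong = ((k_rope * cosd) + (rotate_half(k_rope) * sind)) * scaling
assert np.allclose(k_wrong, k_ref * scaling)
```

The two assertions mirror the derivation: rerotating without extra scaling recovers the directly rotated reference, while applying the scaling again leaves an extra factor of attention_scaling.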

@giulio98
Contributor Author

giulio98 commented Jul 7, 2025

Hi @alessiodevoto, thanks! I thought the same. Maybe to be sure we should extend the test in tests/presses/test_key_rerotation_press_rope.py to also test other RoPE variants (like YaRN).

The test should:

  1. create random unrotated_keys of length q_len (e.g., context length 8).
  2. define selected_indices of length n_kept (e.g., [0, 1, 5, 7]).

Then compute rerotated_keys in two ways.
First way (reference)

  1. gather from unrotated_keys using selected_indices.
  2. apply rope to get reference_rotated_keys.

Second way (with KeyRerotationPress)

  1. apply rope on the full unrotated_keys to have rotated_keys.
  2. call the KeyRerotationPress rotate_keys function, passing rotated_keys and selected_indices, to obtain obtained_rotated_keys.

Compare reference_rotated_keys with obtained_rotated_keys.
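Under stated assumptions (standalone NumPy RoPE helpers, not the actual kvpress KeyRerotationPress API), the test logic above can be sketched as:

```python
import numpy as np

def rotate_half(x):
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([-x2, x1], axis=-1)

def rope(k, pos, d):
    # Rotate each row of k by its position's angle (standard RoPE, base 10000)
    inv_freq = 1.0 / (10000 ** (np.arange(0, d // 2) / (d // 2)))
    ang = np.concatenate([pos[:, None] * inv_freq] * 2, axis=-1)
    return k * np.cos(ang) + rotate_half(k) * np.sin(ang)

rng = np.random.default_rng(42)
q_len, d = 8, 16
unrotated = rng.standard_normal((q_len, d))
selected = np.array([0, 1, 5, 7])

# First way (reference): gather first, then apply RoPE at the compacted positions
new_pos = np.arange(len(selected), dtype=float)
reference = rope(unrotated[selected], new_pos, d)

# Second way: rotate the full sequence, gather, then rerotate by the position delta
old_pos = np.arange(q_len, dtype=float)
rotated = rope(unrotated, old_pos, d)
delta = new_pos - old_pos[selected]  # shift each kept key back to its compacted position
obtained = rope(rotated[selected], delta, d)

assert np.allclose(reference, obtained)
```

This works because RoPE rotations compose additively: rotating by old_pos and then by (new_pos - old_pos) equals rotating by new_pos directly, with no scaling factor involved in the rerotation step.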

@giulio98
Contributor Author

giulio98 commented Jul 7, 2025

Hi @alessiodevoto I extended the test to include a case with YaRN as well. As expected, the test passes when I don’t multiply by attention_scaling during the rerotation, and it fails when I do multiply.

This confirms our reasoning: since attention_scaling was already included in k_rope, applying it again in the rerotation would effectively square it (i.e., attention_scaling^2), which is not desired.

@alessiodevoto
Collaborator

LGTM! Thanks for adding this!

@alessiodevoto alessiodevoto merged commit 2770562 into NVIDIA:main Jul 7, 2025
3 checks passed
Successfully merging this pull request may close these issues.

Performance Drop with Qwen using Yarn