
cuda : fix bounds check for src0 rows in MMVQ kernel #2231

Merged
2 commits merged into master on Jun 11, 2024

Conversation

ggerganov (Member)

The non-EN models have a tensor with an odd number of rows:

decoder.token_embedding.weight - [ 1024, 51865,     1]

With BS > 1, and since d7e9f58, we have been doing out-of-bounds writes with quantum models because rows_per_cuda_block can be equal to 2, leading to significantly degraded transcription quality:

https://github.com/ggerganov/whisper.cpp/blob/20c542c71334b3e6c422789093a19157b110ca81/ggml-cuda/mmvq.cu#L92-L123
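
For reference, a minimal sketch of the failure mode, assuming the write-back pattern of the linked kernel (names like row0 and nrows_dst follow mmvq.cu; this is an illustration, not the actual kernel):

// Illustrative CUDA sketch, not the actual mmvq.cu code: each block writes
// rows_per_cuda_block consecutive rows of dst starting at row0.
template <int rows_per_cuda_block>
static __global__ void write_back(float * dst, const int nrows_dst) {
    const int row0 = blockIdx.x * rows_per_cuda_block;
    const float result = 0.0f; // stand-in for the reduced dot-product value

    if (threadIdx.x < rows_per_cuda_block) {
        // With nrows_dst == 51865 (odd) and rows_per_cuda_block == 2, the
        // last block has row0 == 51864, so the thread with threadIdx.x == 1
        // would write dst[51865], one element past the end of the buffer.
        if (row0 + threadIdx.x < nrows_dst) { // the bounds check this PR adds
            dst[row0 + threadIdx.x] = result;
        }
    }
}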

Repro on RTX 2060:

# any quantum model would reproduce the problem
rm models/ggml-base-q5_1.bin
./models/download-ggml-model.sh base-q5_1

make clean
WHISPER_CUDA=1 make -j && ./main -m models/ggml-base-q5_1.bin -f samples/gb0.wav

@ggerganov (Member, Author)

cc @JohannesGaessler for review

@JohannesGaessler (Contributor)

This is a more general issue (also for llama.cpp) and I think there's a better way to fix it. A PR to which repository would be the least amount of work for you to sync?

@JohannesGaessler (Contributor) left a comment


The performance difference from this change (for llama.cpp) is negligible, and for batch size 1 you can optimize out the check anyway. So I think it's fine to just fix it like this. Note that these out-of-bounds writes are potentially also causing issues for llama.cpp if whoever finetuned the model added an extra token to the vocabulary.
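
To illustrate the batch-size-1 point: assuming rows_per_cuda_block is known at compile time (the kernel is templated), the guard can be short-circuited so the rows_per_cuda_block == 1 instantiation pays nothing for it. A hedged sketch, not the actual patch:

// Sketch: for the rows_per_cuda_block == 1 instantiation the left operand
// of the || is a compile-time constant true, so the runtime comparison
// against nrows_dst is eliminated entirely by the compiler.
if (threadIdx.x < rows_per_cuda_block &&
    (rows_per_cuda_block == 1 || row0 + threadIdx.x < nrows_dst)) {
    dst[row0 + threadIdx.x] = result;
}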

@ggerganov (Member, Author)

Thanks, I'll merge it here and will sync back to llama.cpp soon

ggerganov merged commit 99804b0 into master on Jun 11, 2024
91 of 94 checks passed
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024
* cuda : fix bounds check for src0 rows in MMVQ kernel

* Update ggml-cuda/mmvq.cu

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>