Description
Dear llama.cpp team,
I am experiencing two issues with llama.cpp when using it with the following hardware:
CPU: Xeon Silver 4216 x 2
RAM: 383 GB
GPU: RTX 3090 x 4
The first issue is that although the log reports a total memory requirement of 41478.18 MB, the process only uses about 5 GB of RAM while the model is running. I would like to know if this is normal behavior or if something is wrong with my setup.
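For reference, this is roughly how I am checking the memory usage (a quick psutil sketch against the webui process; the PID is just a placeholder taken from Task Manager, and the 5 GB figure above is the resident/working-set size it reports):

```python
# quick check of the text-generation-webui process's resident memory on Windows
# assumes psutil is installed; 12345 is a placeholder PID from Task Manager
import psutil

proc = psutil.Process(12345)
rss_gib = proc.memory_info().rss / 1024**3  # resident (working set) size in GiB
print(f"Resident memory: {rss_gib:.2f} GiB")
```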
The second issue is the token generation speed. With two Xeon Silver 4216 processors, I am only getting 0.65 tokens/s, which seems slower than I would expect from this hardware. Could you please advise on how to improve the token generation speed?
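If it helps with troubleshooting, I can also run the model through the llama.cpp example binary directly, bypassing the webui. A sketch of what I have in mind is below; the `main.exe` binary and the `-m`/`-t`/`-n`/`-p` flags are from the llama.cpp README, and the thread count is only my guess at the physical core count of the two CPUs:

```python
# hypothetical direct run of the llama.cpp example binary, outside the webui,
# to see whether the thread count changes the tokens/s figure
import subprocess

subprocess.run([
    "main.exe",                                       # llama.cpp example binary (built separately)
    "-m", r"models\llama_cpp_65b\ggml-model-q4_0.bin",  # same quantized model file as below
    "-t", "32",                                       # CPU threads (guess: physical cores of 2x Xeon Silver 4216)
    "-n", "64",                                       # number of tokens to generate
    "-p", "Hello",                                    # prompt
], check=True)
```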
Here is the information you may need to help troubleshoot the issue:
[Software Env]
Python 3.9.16
Windows 10 21H2
oobabooga/text-generation-webui
[Output]
```
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Loading binary C:\Users\Lucy\.conda\envs\alpaca-serve\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
The following models are available:
1. alpaca-native
2. llama-30b-hf
3. llama-65b-hf
4. llama_cpp_65b
5. opt-1.3b
6. Salesforce_codegen-16B-multi
7. TianXxx_llama-65b-int4
Which one do you want to load? 1-7
4
Loading llama_cpp_65b...
llama.cpp weights detected: models\llama_cpp_65b\ggml-model-q4_0.bin
llama_model_load: loading model from 'models\llama_cpp_65b\ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 8192
llama_model_load: n_mult = 256
llama_model_load: n_head = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 22016
llama_model_load: n_parts = 8
llama_model_load: type = 4
llama_model_load: ggml map size = 38917.99 MB
llama_model_load: ggml ctx size = 201.25 KB
llama_model_load: mem required = 41478.18 MB (+ 10240.00 MB per state)
llama_model_load: loading tensors from 'models\llama_cpp_65b\ggml-model-q4_0.bin'
llama_model_load: model size = 38917.53 MB / num tensors = 723
llama_init_from_file: kv self size = 2560.00 MB
C:\Users\Lucy\AppData\Roaming\Python\Python39\site-packages\gradio\deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://127.0.0.1:7860
Running on public URL: https://f973be860f84965921.gradio.live
This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
Output generated in 71.19 seconds (0.65 tokens/s, 46 tokens, context 30)
```