Description
Dear llama.cpp team,
I am experiencing two issues with llama.cpp when using it with the following hardware:
CPU: Xeon Silver 4216 x 2
RAM: 383 GB
GPU: RTX 3090 x 4
The first issue is that although the log reports a total memory requirement of 41478.18 MB, the process only uses about 5 GB of RAM while the model is running. I would like to know if this is normal behavior or if something is wrong with my setup.
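For reference, this is roughly how I am checking the memory usage (a quick psutil sketch against the webui process; the PID is just a placeholder taken from Task Manager, and the 5 GB figure above is the resident/working-set size it reports):

```python
# quick check of the text-generation-webui process's resident memory on Windows
# assumes psutil is installed; 12345 is a placeholder PID from Task Manager
import psutil

proc = psutil.Process(12345)
rss_gib = proc.memory_info().rss / 1024**3  # resident (working set) size in GiB
print(f"Resident memory: {rss_gib:.2f} GiB")
```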
The second issue is the token generation speed. With two Xeon Silver 4216 processors, I am only getting 0.65 tokens/s, which seems slower than I would expect from this hardware. Could you please advise on how to improve the token generation speed?
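If it helps with troubleshooting, I can also run the model through the llama.cpp example binary directly, bypassing the webui. A sketch of what I have in mind is below; the `main.exe` binary and the `-m`/`-t`/`-n`/`-p` flags are from the llama.cpp README, and the thread count is only my guess at the physical core count of the two CPUs:

```python
# hypothetical direct run of the llama.cpp example binary, outside the webui,
# to see whether the thread count changes the tokens/s figure
import subprocess

subprocess.run([
    "main.exe",                                       # llama.cpp example binary (built separately)
    "-m", r"models\llama_cpp_65b\ggml-model-q4_0.bin",  # same quantized model file as below
    "-t", "32",                                       # CPU threads (guess: physical cores of 2x Xeon Silver 4216)
    "-n", "64",                                       # number of tokens to generate
    "-p", "Hello",                                    # prompt
], check=True)
```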
Here is the information you may need to help troubleshoot the issue:
[Software Env]
Python 3.9.16
Windows 10 21H2
oobabooga/text-generation-webui
[Output]
```
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Loading binary C:\Users\Lucy\.conda\envs\alpaca-serve\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
The following models are available:
1. alpaca-native
2. llama-30b-hf
3. llama-65b-hf
4. llama_cpp_65b
5. opt-1.3b
6. Salesforce_codegen-16B-multi
7. TianXxx_llama-65b-int4
Which one do you want to load? 1-7
4
Loading llama_cpp_65b...
llama.cpp weights detected: models\llama_cpp_65b\ggml-model-q4_0.bin
llama_model_load: loading model from 'models\llama_cpp_65b\ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 8192
llama_model_load: n_mult = 256
llama_model_load: n_head = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 22016
llama_model_load: n_parts = 8
llama_model_load: type = 4
llama_model_load: ggml map size = 38917.99 MB
llama_model_load: ggml ctx size = 201.25 KB
llama_model_load: mem required = 41478.18 MB (+ 10240.00 MB per state)
llama_model_load: loading tensors from 'models\llama_cpp_65b\ggml-model-q4_0.bin'
llama_model_load: model size = 38917.53 MB / num tensors = 723
llama_init_from_file: kv self size = 2560.00 MB
C:\Users\Lucy\AppData\Roaming\Python\Python39\site-packages\gradio\deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://127.0.0.1:7860
Running on public URL: https://f973be860f84965921.gradio.live
This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
Output generated in 71.19 seconds (0.65 tokens/s, 46 tokens, context 30)
```