
[User] Memory usage is extremely low when running 65B 4-bit models (only uses 5 GB) #864

Closed
@stuxnet147

Description


Dear llama.cpp team,

I am experiencing two issues with llama.cpp when using it with the following hardware:

CPU: Xeon Silver 4216 x 2ea
RAM: 383GB
GPU: RTX 3090 x 4ea

The first issue is that although the model reportedly requires a total of 41478.18 MB of memory, my machine only uses about 5 GB of RAM while running it. I would like to know whether this is normal behavior or whether something is wrong with my setup.
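
In case it helps, this is roughly how I would check whether the low number just reflects the model file being memory-mapped (the log below reports a "ggml map size") rather than actually read into resident memory. This is only a sketch: it assumes psutil is installed, and the PID is a placeholder.

```python
# Sketch: compare the process's resident memory against the size of the model
# file. With an mmap'd model, pages only become resident as they are touched,
# so RSS can stay far below the reported "mem required" figure.
import os
import psutil

PID = 12345  # placeholder: PID of the python process hosting llama.cpp
MODEL = r"models\llama_cpp_65b\ggml-model-q4_0.bin"

proc = psutil.Process(PID)
rss_mb = proc.memory_info().rss / 1024 ** 2
file_mb = os.path.getsize(MODEL) / 1024 ** 2

print(f"process RSS:     {rss_mb:,.0f} MB")
print(f"model file size: {file_mb:,.0f} MB")

# Any mapping backed by the model file indicates the weights are mmap'd,
# not copied into the process's private memory.
for m in proc.memory_maps():
    if m.path and m.path.lower().endswith("ggml-model-q4_0.bin"):
        print(f"mapped from model file: {m.rss / 1024 ** 2:,.0f} MB resident")
```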

The second issue is the token generation speed. Despite a fairly powerful CPU setup of two Xeon Silver 4216 processors, I am only getting about 0.65 tokens/s. This seems slower than I would expect from this hardware. Could you please advise on how to improve the generation speed?
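
For what it's worth, here is a minimal sketch of how I understand the thread count can be pinned when loading the same GGML file through the llama-cpp-python bindings. The n_threads value is only my guess for two 16-core CPUs, and text-generation-webui's loader may expose this setting differently.

```python
from llama_cpp import Llama

# Sketch: load the same q4_0 file directly and pin the thread count.
# n_threads=32 is a guess matching the 32 physical cores of two Xeon
# Silver 4216 CPUs; too few threads is a common cause of slow generation.
llm = Llama(
    model_path=r"models\llama_cpp_65b\ggml-model-q4_0.bin",
    n_ctx=512,
    n_threads=32,
)

out = llm("The quick brown fox", max_tokens=46)
print(out["choices"][0]["text"])
```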

Here is the information you may need to help troubleshoot the issue:

[Software Env]

Python 3.9.16
Windows 10 21H2
oobabooga/text-generation-webui

[Output]

```
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Loading binary C:\Users\Lucy\.conda\envs\alpaca-serve\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
The following models are available:

1. alpaca-native
2. llama-30b-hf
3. llama-65b-hf
4. llama_cpp_65b
5. opt-1.3b
6. Salesforce_codegen-16B-multi
7. TianXxx_llama-65b-int4

Which one do you want to load? 1-7

4

Loading llama_cpp_65b...
llama.cpp weights detected: models\llama_cpp_65b\ggml-model-q4_0.bin

llama_model_load: loading model from 'models\llama_cpp_65b\ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 8192
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 22016
llama_model_load: n_parts = 8
llama_model_load: type    = 4
llama_model_load: ggml map size = 38917.99 MB
llama_model_load: ggml ctx size = 201.25 KB
llama_model_load: mem required  = 41478.18 MB (+ 10240.00 MB per state)
llama_model_load: loading tensors from 'models\llama_cpp_65b\ggml-model-q4_0.bin'
llama_model_load: model size = 38917.53 MB / num tensors = 723
llama_init_from_file: kv self size  = 2560.00 MB
C:\Users\Lucy\AppData\Roaming\Python\Python39\site-packages\gradio\deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://f973be860f84965921.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
Output generated in 71.19 seconds (0.65 tokens/s, 46 tokens, context 30)
```
