Optionally, you can use the following command-line flags:

| Flag | Description |
|-------------|-------------|
| `-h`, `--help` | Show this help message and exit. |
| `--model MODEL` | Name of the model to load by default. |
| `--notebook` | Launch the web UI in notebook mode, where the output is written to the same text box as the input. |
| `--chat` | Launch the web UI in chat mode. |
| `--cai-chat` | Launch the web UI in chat mode with a style similar to Character.AI's. If the file `img_bot.png` or `img_bot.jpg` exists in the same folder as server.py, this image will be used as the bot's profile picture. Similarly, `img_me.png` or `img_me.jpg` will be used as your profile picture. |
| `--cpu` | Use the CPU to generate text. |
| `--load-in-8bit` | Load the model with 8-bit precision. |
| `--gptq-bits GPTQ_BITS` | Load a pre-quantized model with the specified precision. 2, 3, 4 and 8 bits are supported. Currently only works with LLaMA and OPT. |
| `--gptq-model-type MODEL_TYPE` | Model type of the pre-quantized model. Currently only LLaMA and OPT are supported. |
| `--bf16` | Load the model with bfloat16 precision. Requires an NVIDIA Ampere GPU. |
| `--auto-devices` | Automatically split the model across the available GPU(s) and CPU. |
| `--disk` | If the model is too large for your GPU(s) and CPU combined, send the remaining layers to the disk. |
| `--disk-cache-dir DISK_CACHE_DIR` | Directory to save the disk cache to. Defaults to `cache/`. |
| `--gpu-memory GPU_MEMORY [GPU_MEMORY ...]` | Maximum GPU memory in GiB to be allocated per GPU. Example: `--gpu-memory 10` for a single GPU, `--gpu-memory 10 5` for two GPUs. |
| `--cpu-memory CPU_MEMORY` | Maximum CPU memory in GiB to allocate for offloaded weights. Must be an integer. Defaults to 99. |
| `--flexgen` | Enable the use of FlexGen offloading. |
| `--percent PERCENT [PERCENT ...]` | FlexGen: allocation percentages. Must be 6 numbers separated by spaces (default: 0, 100, 100, 0, 100, 0). |
| `--compress-weight` | FlexGen: whether to compress the weights (default: False). |
| `--pin-weight [PIN_WEIGHT]` | FlexGen: whether to pin weights (setting this to False reduces CPU memory by 20%). |
| `--deepspeed` | Enable the use of DeepSpeed ZeRO-3 for inference via the Transformers integration. |
| `--nvme-offload-dir NVME_OFFLOAD_DIR` | DeepSpeed: Directory to use for ZeRO-3 NVMe offloading. |
| `--local_rank LOCAL_RANK` | DeepSpeed: Optional argument for distributed setups. |
| `--rwkv-strategy RWKV_STRATEGY` | RWKV: The strategy to use while loading the model. Examples: "cpu fp32", "cuda fp16", "cuda fp16i8". |
| `--rwkv-cuda-on` | RWKV: Compile the CUDA kernel for better performance. |
| `--no-stream` | Don't stream the text output in real time. This improves text generation performance. |
| `--settings SETTINGS_FILE` | Load the default interface settings from this JSON file. See `settings-template.json` for an example. If you create a file called `settings.json`, it will be loaded by default without the need to use the `--settings` flag. |
| `--extensions EXTENSIONS [EXTENSIONS ...]` | The list of extensions to load. If you want to load more than one extension, write the names separated by spaces. |
| `--listen` | Make the web UI reachable from your local network. |
| `--listen-port LISTEN_PORT` | The listening port that the server will use. |
| `--share` | Create a public URL. This is useful for running the web UI on Google Colab or similar. |
| `--verbose` | Print the prompts to the terminal. |
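
These flags are appended to the `python server.py` command used to launch the web UI. As a minimal sketch, the following starts chat mode with a model loaded in 8-bit precision (the model name `llama-7b` is only a placeholder; substitute whichever model folder you have downloaded):

```
# Placeholder model name: replace llama-7b with your own model folder
python server.py --chat --model llama-7b --load-in-8bit
```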
Out of memory errors? [Check this guide](https://github.com/oobabooga/text-generation-webui/wiki/Low-VRAM-guide).
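
As a quick sketch before reading the guide, the offloading flags above can be combined to cap GPU usage and spill the rest to CPU RAM and disk. The numbers and model name below are illustrative only, not recommendations:

```
# Illustrative only: limit the GPU to roughly 10 GiB, offload overflow to CPU RAM,
# and send any remaining layers to disk
python server.py --chat --model llama-7b --auto-devices --gpu-memory 10 --disk
```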