[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2, and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully independent, heterogeneous compute units called NeuronCores. This page describes how to set up your environment to run vLLM on Neuron.

!!! warning
    There are no pre-built wheels or images for this device, so you must build vLLM from source.
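
If you want a sense of what the from-source build involves, the following is a minimal sketch assuming the standard `VLLM_TARGET_DEVICE=neuron` build flow; the requirements file name and exact steps can differ between vLLM releases.

```bash
# Sketch only: clone vLLM and build it for the Neuron target.
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Install Neuron-specific Python dependencies
# (the file name varies by release, e.g. requirements-neuron.txt or requirements/neuron.txt).
pip install -U -r requirements-neuron.txt

# Build and install vLLM against the Neuron device backend.
VLLM_TARGET_DEVICE="neuron" pip install -e .
```
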
## Requirements

- OS: Linux
- Python: 3.9 or newer
- PyTorch: 2.5/2.6
- Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
- AWS Neuron SDK 2.23
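
A quick, illustrative way to check these prerequisites on a running instance (this assumes the `aws-neuronx-tools` package, which provides `neuron-ls`, is installed):

```bash
# Check OS, Python, and PyTorch versions.
uname -a
python3 --version
python3 -c "import torch; print(torch.__version__)"

# List the Neuron devices visible to this instance (from aws-neuronx-tools).
neuron-ls

# Show the installed Neuron SDK Python packages.
pip list | grep -i neuron
```
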
## Configure a new environment
### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies

The easiest way to launch a Trainium or Inferentia instance with pre-installed Neuron dependencies is to use a Neuron Deep Learning AMI (DLAMI), which ships with the Neuron SDK and ready-to-use virtual environments.

- After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance.
- Once inside your instance, activate the pre-installed virtual environment for inference by running the activation command shown below.
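
The exact virtual environment name and path depend on the DLAMI version; the path below is only an illustrative placeholder.

```bash
# Illustrative placeholder: the actual venv name/location depends on the DLAMI release.
source /opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate

# Sanity-check that the Neuron PyTorch integration is importable in this environment.
python3 -c "import torch_neuronx; print('Neuron environment OK')"
```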

Refer to the NxD Inference setup guide in the AWS Neuron documentation for alternative setup instructions, including using Docker and manually installing dependencies.

NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx) library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).

AWS Neuron maintains a [GitHub fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2), which contains several features in addition to what's available in vLLM V0. Use the AWS fork for the following features (an installation sketch follows the feature list):

- Quantization: use NxD Inference to quantize and compile your model, and then load the compiled artifacts into vLLM.
- Multi-LoRA serving: NxD Inference only supports loading of LoRA adapters at server startup. Dynamic loading of LoRA adapters at runtime is not currently supported. Refer to the [multi-LoRA example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py).
- Multi-modal support: multi-modal support is only available through the AWS Neuron fork. This feature has not been upstreamed to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support this feature.
- Multi-node support: distributed inference across multiple Trainium/Inferentia instances is only supported on the AWS Neuron fork. Refer to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node) for how to run it. Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main.
- Known edge case bug in speculative decoding: an edge case failure may occur in speculative decoding when the sequence length approaches the max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt to allocate an additional block to ensure there is enough memory for the number of lookahead slots, but since we do not have good support for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround (terminating one iteration early) is implemented in the AWS Neuron fork but is not upstreamed to vLLM main as it modifies core vLLM logic.
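
A minimal sketch of installing that fork, assuming the same general build flow as upstream vLLM (the branch name comes from the URL above; the requirements file name and flags may differ in the fork):

```bash
# Clone the Neuron 2.23 / vLLM v0.7.2 branch of the AWS Neuron fork.
git clone -b neuron-2.23-vllm-v0.7.2 https://github.com/aws-neuron/upstreaming-to-vllm.git
cd upstreaming-to-vllm

# Install Neuron requirements and build vLLM for the Neuron target
# (file name and flags are assumptions; check the fork's README for the exact steps).
pip install -U -r requirements-neuron.txt
VLLM_TARGET_DEVICE="neuron" pip install -e .
```
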
### Environment variables

- `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the artifacts under the `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set, but the directory does not exist or its contents are invalid, Neuron will fall back to a new compilation and store the artifacts under the specified path.
- `NEURON_CONTEXT_LENGTH_BUCKETS`: bucket sizes for context encoding. (Only applicable to the `transformers-neuronx` backend.)
- `NEURON_TOKEN_GEN_BUCKETS`: bucket sizes for token generation. (Only applicable to the `transformers-neuronx` backend.)
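
For example, to reuse previously compiled artifacts when starting the server (the artifacts path and model name below are placeholders, and any Neuron-specific serving flags are omitted):

```bash
# Point vLLM at pre-compiled Neuron artifacts so the server skips compilation at startup.
export NEURON_COMPILED_ARTIFACTS=/path/to/neuron-compiled-artifacts/my_hash

# Placeholder model name; add your usual serving options as needed.
vllm serve meta-llama/Llama-3.1-8B-Instruct
```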

### Pre-built images

See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.

### Build image from source
You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.
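
For example (run from the repository root; the image tag is arbitrary):

```bash
# Build a TPU-enabled vLLM image from the repository root.
docker build -f docker/Dockerfile.tpu -t vllm-tpu .
```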