
Commit 916b224

davidxia authored and jimpang committed
[doc] split "Other AI Accelerators" tabs (vllm-project#19708)
1 parent 0ed20d2 commit 916b224

6 files changed: +69 -217 lines changed


docs/getting_started/installation/.nav.yml

Lines changed: 3 additions & 1 deletion
@@ -2,4 +2,6 @@ nav:
   - README.md
   - gpu.md
   - cpu.md
-  - ai_accelerator.md
+  - google_tpu.md
+  - intel_gaudi.md
+  - aws_neuron.md

docs/getting_started/installation/README.md

Lines changed: 3 additions & 4 deletions
@@ -14,7 +14,6 @@ vLLM supports the following hardware platforms:
 - [ARM AArch64](cpu.md#arm-aarch64)
 - [Apple silicon](cpu.md#apple-silicon)
 - [IBM Z (S390X)](cpu.md#ibm-z-s390x)
-- [Other AI accelerators](ai_accelerator.md)
-    - [Google TPU](ai_accelerator.md#google-tpu)
-    - [Intel Gaudi](ai_accelerator.md#intel-gaudi)
-    - [AWS Neuron](ai_accelerator.md#aws-neuron)
+- [Google TPU](google_tpu.md)
+- [Intel Gaudi](intel_gaudi.md)
+- [AWS Neuron](aws_neuron.md)

docs/getting_started/installation/ai_accelerator.md

Lines changed: 0 additions & 117 deletions
This file was deleted.

docs/getting_started/installation/ai_accelerator/neuron.inc.md renamed to docs/getting_started/installation/aws_neuron.md
@@ -1,24 +1,22 @@
-# --8<-- [start:installation]
+# AWS Neuron

 [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
-generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
-and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
-This tab describes how to set up your environment to run vLLM on Neuron.
+generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
+and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
+This describes how to set up your environment to run vLLM on Neuron.

 !!! warning
     There are no pre-built wheels or images for this device, so you must build vLLM from source.

-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+## Requirements

 - OS: Linux
 - Python: 3.9 or newer
 - Pytorch 2.5/2.6
 - Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
 - AWS Neuron SDK 2.23

-# --8<-- [end:requirements]
-# --8<-- [start:configure-a-new-environment]
+## Configure a new environment

 ### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies

@@ -27,6 +25,7 @@ The easiest way to launch a Trainium or Inferentia instance with pre-installed N

 - After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
 - Once inside your instance, activate the pre-installed virtual environment for inference by running
+
 ```console
 source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate
 ```
@@ -38,20 +37,15 @@ for alternative setup instructions including using Docker and manually installin
 NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
 library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).

-# --8<-- [end:configure-a-new-environment]
-# --8<-- [start:set-up-using-python]
+## Set up using Python

-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+### Pre-built wheels

 Currently, there are no pre-built Neuron wheels.

-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
-
-#### Install vLLM from source
+### Build wheel from source

-Install vllm as follows:
+To build and install vLLM from source, run:

 ```console
 git clone https://github.com/vllm-project/vllm.git
@@ -61,8 +55,8 @@ VLLM_TARGET_DEVICE="neuron" pip install -e .
 ```

 AWS Neuron maintains a [Github fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
-[https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2), which contains several features in addition to what's
-available on vLLM V0. Please utilize the AWS Fork for the following features:
+<https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2>, which contains several features in addition to what's
+available on vLLM V0. Please utilize the AWS Fork for the following features:

 - Llama-3.2 multi-modal support
 - Multi-node distributed inference
@@ -81,25 +75,22 @@ VLLM_TARGET_DEVICE="neuron" pip install -e .

 Note that the AWS Neuron fork is only intended to support Neuron hardware; compatibility with other hardwares is not tested.

-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
+## Set up using Docker

-# --8<-- [end:set-up-using-docker]
-# --8<-- [start:pre-built-images]
+### Pre-built images

 Currently, there are no pre-built Neuron images.

-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+### Build image from source

 See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.

 Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dockerfile.

-# --8<-- [end:build-image-from-source]
-# --8<-- [start:extra-information]
+## Extra information

 [](){ #feature-support-through-nxd-inference-backend }
+
 ### Feature support through NxD Inference backend

 The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend
@@ -108,12 +99,15 @@ to perform most of the heavy lifting which includes PyTorch model initialization

 To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
 as a dictionary (or JSON object when starting vLLM from the CLI). For example, to disable auto bucketing, include
+
 ```console
 override_neuron_config={
     "enable_bucketing":False,
 }
 ```
+
 or when launching vLLM from the CLI, pass
+
 ```console
 --override-neuron-config "{\"enable_bucketing\":false}"
 ```
@@ -124,32 +118,30 @@ Alternatively, users can directly call the NxDI library to trace and compile you
 ### Known limitations

 - EAGLE speculative decoding: NxD Inference requires the EAGLE draft checkpoint to include the LM head weights from the target model. Refer to this
-[guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#eagle-checkpoint-compatibility)
-for how to convert pretrained EAGLE model checkpoints to be compatible for NxDI.
+  [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#eagle-checkpoint-compatibility)
+  for how to convert pretrained EAGLE model checkpoints to be compatible for NxDI.
 - Quantization: the native quantization flow in vLLM is not well supported on NxD Inference. It is recommended to follow this
-[Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
-to quantize and compile your model using NxD Inference, and then load the compiled artifacts into vLLM.
+  [Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
+  to quantize and compile your model using NxD Inference, and then load the compiled artifacts into vLLM.
 - Multi-LoRA serving: NxD Inference only supports loading of LoRA adapters at server startup. Dynamic loading of LoRA adapters at
-runtime is not currently supported. Refer to [multi-lora example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py)
+  runtime is not currently supported. Refer to [multi-lora example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py)
 - Multi-modal support: multi-modal support is only available through the AWS Neuron fork. This feature has not been upstreamed
-to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support this feature.
+  to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support this feature.
 - Multi-node support: distributed inference across multiple Trainium/Inferentia instances is only supported on the AWS Neuron fork. Refer
-to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node)
-to run. Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main.
+  to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node)
+  to run. Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main.
 - Known edge case bug in speculative decoding: An edge case failure may occur in speculative decoding when sequence length approaches
-max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt
-to allocate an additional block to ensure there is enough memory for number of lookahead slots, but since we do not have good support
-for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround fix (to terminate 1 iteration early) is
-implemented in the AWS Neuron fork but is not upstreamed to vLLM main as it modifies core vLLM logic.
-
+  max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt
+  to allocate an additional block to ensure there is enough memory for number of lookahead slots, but since we do not have good support
+  for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround fix (to terminate 1 iteration early) is
+  implemented in the AWS Neuron fork but is not upstreamed to vLLM main as it modifies core vLLM logic.

 ### Environment variables
+
 - `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid
-compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the
-artifacts under `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set,
-but the directory does not exist, or the contents are invalid, Neuron will also fallback to a new compilation and store the artifacts
-under this specified path.
+  compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the
+  artifacts under `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set,
+  but the directory does not exist, or the contents are invalid, Neuron will also fallback to a new compilation and store the artifacts
+  under this specified path.
 - `NEURON_CONTEXT_LENGTH_BUCKETS`: Bucket sizes for context encoding. (Only applicable to `transformers-neuronx` backend).
 - `NEURON_TOKEN_GEN_BUCKETS`: Bucket sizes for token generation. (Only applicable to `transformers-neuronx` backend).
-
-# --8<-- [end:extra-information]
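As a usage note for the settings this page documents, `NEURON_COMPILED_ARTIFACTS` and `--override-neuron-config` are typically combined on a single serve invocation. A minimal sketch follows; the model ID, artifacts path, and port are illustrative placeholders, not part of this commit:

```bash
# Sketch only: serve on a Trn1/Inf2 instance, reusing pre-compiled artifacts
# and disabling auto bucketing. Model ID, path, and port are placeholders.
export NEURON_COMPILED_ARTIFACTS=/opt/models/llama/neuron-compiled-artifacts
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 2048 \
    --override-neuron-config "{\"enable_bucketing\": false}" \
    --port 8000
```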

docs/getting_started/installation/ai_accelerator/tpu.inc.md renamed to docs/getting_started/installation/google_tpu.md

Lines changed: 11 additions & 25 deletions
@@ -1,4 +1,4 @@
-# --8<-- [start:installation]
+# Google TPU

 Tensor Processing Units (TPUs) are Google's custom-developed application-specific
 integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
@@ -33,8 +33,7 @@ information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp
 !!! warning
     There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.

-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+## Requirements

 - Google Cloud TPU VM
 - TPU versions: v6e, v5e, v5p, v4
@@ -63,8 +62,7 @@ For more information about using TPUs with GKE, see:
 - <https://cloud.google.com/kubernetes-engine/docs/concepts/tpus>
 - <https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus>

-# --8<-- [end:requirements]
-# --8<-- [start:configure-a-new-environment]
+## Configure a new environment

 ### Provision a Cloud TPU with the queued resource API

@@ -100,16 +98,13 @@ gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE
 [TPU VM images]: https://cloud.google.com/tpu/docs/runtimes
 [TPU regions and zones]: https://cloud.google.com/tpu/docs/regions-zones

-# --8<-- [end:configure-a-new-environment]
-# --8<-- [start:set-up-using-python]
+## Set up using Python

-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+### Pre-built wheels

 Currently, there are no pre-built TPU wheels.

-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+### Build wheel from source

 Install Miniconda:

@@ -142,7 +137,7 @@ Install build dependencies:

 ```bash
 pip install -r requirements/tpu.txt
-sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
+sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
 ```

 Run the setup script:
@@ -151,16 +146,13 @@ Run the setup script:
 VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
 ```

-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
+## Set up using Docker

-# --8<-- [end:set-up-using-docker]
-# --8<-- [start:pre-built-images]
+### Pre-built images

 See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.

-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+### Build image from source

 You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.

@@ -194,11 +186,5 @@ docker run --privileged --net host --shm-size=16G -it vllm-tpu
 Install OpenBLAS with the following command:

 ```console
-sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
+sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
 ```
-
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:extra-information]
-
-There is no extra information for this device.
-# --8<-- [end:extra-information]
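The renamed page keeps the pointer to <gh-file:docker/Dockerfile.tpu> for image builds. A minimal sketch of that flow, assuming the repository root as the build context; the local tag `vllm-tpu` is an arbitrary choice, and the `docker run` flags are the ones shown in the hunk above:

```bash
# Sketch only: build the TPU image from docker/Dockerfile.tpu and start a container.
# The tag "vllm-tpu" is a local naming choice, not mandated by the docs.
docker build -f docker/Dockerfile.tpu -t vllm-tpu .
docker run --privileged --net host --shm-size=16G -it vllm-tpu
```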
