Cannot run parallel inference with DDP

### Search before asking

- [X] I have searched the YOLOv5 [issues](https://github.com/ultralytics/yolov5/issues) and found no similar bug report.


### YOLOv5 Component

Validation, Detection

### Bug

I am trying to run predictions in parallel on multiple GPUs to speed up inference on large datasets.
From what I gathered, the best way to do this with PyTorch is to use `torch.nn.DataParallel`.
However, the model is first created on `cuda:0` and then copied to the requested GPUs. This either overloads `cuda:0` or, when the batch size is small enough to avoid that, leaves a copy of the same model on several GPUs. I then get the following exception:
`RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.`

See full error: 

```
YOLOv5 🚀 v6.2-145-gf8b7463 Python-3.9.13 torch-1.12.1+cu102 CUDA:4 (NVIDIA GeForce RTX 2080 Ti, 11019MiB)

Fusing layers...
Model summary: 416 layers, 140038156 parameters, 0 gradients, 208.0 GFLOPs
Adding AutoShape...
Traceback (most recent call last):
  File "/mnt/remote/data/users/thomasssajot/yolov5/notebooks/generate_classification_results.py", line 152, in <module>
    main(device=2)
  File "/mnt/remote/data/users/thomasssajot/yolov5/notebooks/generate_classification_results.py", line 136, in main
    model = get_model(model_path).to(f'cuda:{device}')
  File "/home/thomassajot/miniconda3/envs/yolov5/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/mnt/remote/data/users/thomasssajot/yolov5/models/common.py", line 621, in _apply
    self = super()._apply(fn)
  File "/home/thomassajot/miniconda3/envs/yolov5/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/thomassajot/miniconda3/envs/yolov5/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/mnt/remote/data/users/thomasssajot/yolov5/models/yolo.py", line 155, in _apply
    self = super()._apply(fn)
  File "/home/thomassajot/miniconda3/envs/yolov5/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/thomassajot/miniconda3/envs/yolov5/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/thomassajot/miniconda3/envs/yolov5/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/thomassajot/miniconda3/envs/yolov5/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/home/thomassajot/miniconda3/envs/yolov5/lib/python3.9/site-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
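A minimal diagnostic sketch for the `invalid device ordinal` error above. It assumes, without confirmation from the trace, that the process sees fewer GPUs than the ordinal passed to `.to(f'cuda:{device}')`, for example because `CUDA_VISIBLE_DEVICES` restricts the process to one physical GPU, which then becomes `cuda:0`. The helper names `visible_gpu_ids`, `is_valid_ordinal`, and `cuda_device_count` are hypothetical; only `torch.cuda.device_count()` is a PyTorch call.

```python
import os


def visible_gpu_ids(env=None):
    """Return the physical GPU ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [part.strip() for part in value.split(",") if part.strip()]


def is_valid_ordinal(ordinal, device_count):
    """True if cuda:<ordinal> exists in a process that sees `device_count` GPUs."""
    return 0 <= ordinal < device_count


def cuda_device_count():
    """Number of GPUs visible to this process (0 if torch or CUDA is unavailable)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


if __name__ == "__main__":
    count = cuda_device_count()
    print("CUDA_VISIBLE_DEVICES:", visible_gpu_ids())
    print("visible device count:", count)
    # 2 is the ordinal passed as main(device=2) in the traceback.
    for ordinal in (0, 2):
        print(f"cuda:{ordinal} valid:", is_valid_ordinal(ordinal, count))
```

If the visible device count printed here is 1, then `cuda:2` cannot exist in this process regardless of how many GPUs the machine has, which would be consistent with the error.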

### Environment

PyTorch version: 1.12.1+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~18.04) 9.4.0
Clang version: 13.0.1-++20220120110844+75e33f71c2da-1~exp1~20220120230854.66
CMake version: version 3.10.2
Libc version: glibc-2.27

### Minimal Reproducible Example
```python
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm


def get_model():
    # Load the pretrained YOLOv5s model (wrapped in AutoShape) from PyTorch Hub.
    model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
    model.eval()
    return model


def get_image_files():
    # Repeat a single image path to simulate a dataset of 64 images.
    image = 'path/to/image.jpeg'
    return [image] * 64


def main():
    images = get_image_files()
    model = get_model()
    # Replicate the model across GPUs 0 and 1.
    net = torch.nn.DataParallel(model, device_ids=[0, 1])

    # Each batch is a list of 4 image paths.
    loader = DataLoader(dataset=images[:64 * 4], batch_size=4, shuffle=False, num_workers=8)

    with torch.no_grad():
        for batch in tqdm(loader, ncols=140, desc='Predictions'):
            res = net(batch, size=1280)


if __name__ == "__main__":
    main()
```
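For comparison with the `DataParallel` example above, here is a rough sketch of a different approach: one worker per GPU, each holding its own model and processing its own shard of the image list. This is not something the example above does, and it is untested against YOLOv5. `shard`, `batches`, and `run_shard` are hypothetical helpers, and `load_model(rank)` is an assumed callable that returns a model already placed on the GPU assigned to that worker.

```python
def shard(items, rank, world_size):
    """Round-robin split: worker `rank` gets items rank, rank + world_size, ..."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world_size {world_size}")
    return items[rank::world_size]


def batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_shard(rank, world_size, image_paths, load_model, batch_size=4):
    """Run inference on the shard of `image_paths` that belongs to `rank`.

    `load_model(rank)` is a hypothetical callable supplied by the caller; it
    should return a model that already lives on the GPU for this worker.
    """
    model = load_model(rank)
    results = []
    for batch in batches(shard(image_paths, rank, world_size), batch_size):
        results.append(model(batch))
    return results
```

Each worker would call `run_shard` with its own `rank`, for instance from separate processes started with `torch.multiprocessing.spawn`, so that no single GPU has to hold the replicas for the others.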

### Additional

_No response_

### Are you willing to submit a PR?

- [ ] Yes I'd like to help by submitting a PR!


Cannot run parallel inference with DDP #9687
