Cannot run TorchTrainer on 3 nodes with GPU #59804

@matrixoneken

Description

What happened + What you expected to happen

I have set up a Ray cluster with 3 nodes, each with a GPU (one RTX TITAN per node). Please see the following image:

[screenshot]

I ran the example code from 'Train: Distributed Model Training'. With num_workers=2, the program cannot run; please see the following image:

[screenshot]

If I set num_workers=1, the program runs without any problem.
If I set num_workers=2 and use_gpu=False, the program also runs without any problem.

Can you help me solve this problem? Thank you.
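When a multi-worker GPU run hangs but a single worker (or CPU-only multi-worker) works, a common suspect is NCCL failing to establish cross-node communication. A hedged first debugging step, not a confirmed fix, is to enable verbose NCCL logging so the worker logs show whether the nodes can reach each other. Note that variables set on the driver do not automatically reach workers on other nodes; export them on every node before `ray start`, or pass them through Ray's `runtime_env` `env_vars`.

```python
import os

# Hedged debugging sketch: verbose NCCL logs usually reveal whether the
# distributed workers can open connections to each other during init.
# Set these on every node (or via Ray's runtime_env) before launching.
os.environ["NCCL_DEBUG"] = "INFO"             # verbose NCCL initialization logs
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus output on init and networking
```

If the logs show NCCL picking an unreachable network interface, pinning one via `NCCL_SOCKET_IFNAME` (set to whatever NIC your nodes actually share) is a common follow-up.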

Versions / Dependencies

Ray 2.53.0
Python 3.12.12
PyTorch 2.9.1+cu130
OS: Ubuntu 24.04.3 LTS

Reproduction script

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
import ray.train.torch
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

def get_dataset():
    return datasets.FashionMNIST(
         root="/tmp/data",
         train=True,
         download=True,
         transform=ToTensor(),
    )

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, inputs):
        inputs = self.flatten(inputs)
        logits = self.linear_relu_stack(inputs)
        return logits


def train_func_distributed():
    num_epochs = 3
    batch_size = 64
    dataset = get_dataset()
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    dataloader = ray.train.torch.prepare_data_loader(dataloader)
    model = NeuralNetwork()
    model = ray.train.torch.prepare_model(model)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(num_epochs):
        if ray.train.get_context().get_world_size() > 1:
            dataloader.sampler.set_epoch(epoch)
        for inputs, labels in dataloader:
            optimizer.zero_grad()
            pred = model(inputs)
            loss = criterion(pred, labels)
            loss.backward()
            optimizer.step()
        print(f"epoch: {epoch}, loss: {loss.item()}")


use_gpu = True

trainer = TorchTrainer(
    train_func_distributed,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=use_gpu)
)

results = trainer.fit()
print(results)
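Before calling trainer.fit(), it can help to confirm that the cluster actually advertises enough GPUs for the requested workers. The helper below is a hypothetical utility, not part of Ray's API; in a live cluster you would feed it the dict returned by `ray.cluster_resources()`.

```python
def has_enough_gpus(cluster_resources, num_workers):
    """Return True if the cluster advertises at least `num_workers` GPUs.

    `cluster_resources` is expected to be shaped like the dict returned by
    ray.cluster_resources(), e.g. {"CPU": 24.0, "GPU": 3.0, ...}.
    """
    return cluster_resources.get("GPU", 0) >= num_workers

# Example with a made-up resources dict (three nodes, one GPU each):
resources = {"CPU": 24.0, "GPU": 3.0}
assert has_enough_gpus(resources, 2)       # 2 workers fit on 3 GPUs
assert not has_enough_gpus(resources, 4)   # 4 workers would not
```

If the GPU count comes back as expected, scheduling is likely fine and the failure is more plausibly in worker-to-worker communication during training setup.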

Issue Severity

None


Labels

bug (Something that is supposed to be working; but isn't), community-backlog, gpu (GPU related issues), question (Just a question :)), stability, train (Ray Train Related Issue), triage (Needs triage, eg: priority, bug/not-bug, and owning component)
