What happened + What you expected to happen
I have set up a Ray cluster with 3 GPU nodes; each node has an RTX TITAN (please see the following images).
I ran the example code from 'Train: Distributed Model Training'. If I set num_workers=2, the program cannot run (please see the following images):
If I set num_workers=1, the program runs without any problem.
If I set num_workers=2 and use_gpu=False, the program also runs without any problem.
Can you help me solve this problem? Thank you.
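(Not in the original report, but since the failure only appears with num_workers=2 and use_gpu=True, a common first diagnostic for multi-node GPU hangs is to enable NCCL debug logging on every node before launching. NCCL_DEBUG and NCCL_SOCKET_IFNAME are standard NCCL environment variables; the interface name eth0 below is an assumption — substitute the actual network interface your nodes use.)

```shell
# Enable verbose NCCL logs so the cross-node communicator setup is visible
export NCCL_DEBUG=INFO
# Pin NCCL to the right network interface if autodetection picks a wrong one
# (eth0 is a placeholder — check `ip addr` on your nodes)
export NCCL_SOCKET_IFNAME=eth0
python reproduction_script.py
```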
Versions / Dependencies
ray 2.53.0
python 3.12.12
pytorch 2.9.1+cu130
OS: Ubuntu 24.04.3 LTS
Reproduction script
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

import ray.train.torch
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig


def get_dataset():
    return datasets.FashionMNIST(
        root="/tmp/data",
        train=True,
        download=True,
        transform=ToTensor(),
    )


class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, inputs):
        inputs = self.flatten(inputs)
        logits = self.linear_relu_stack(inputs)
        return logits


def train_func_distributed():
    num_epochs = 3
    batch_size = 64

    dataset = get_dataset()
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    dataloader = ray.train.torch.prepare_data_loader(dataloader)

    model = NeuralNetwork()
    model = ray.train.torch.prepare_model(model)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(num_epochs):
        if ray.train.get_context().get_world_size() > 1:
            dataloader.sampler.set_epoch(epoch)

        for inputs, labels in dataloader:
            optimizer.zero_grad()
            pred = model(inputs)
            loss = criterion(pred, labels)
            loss.backward()
            optimizer.step()
        print(f"epoch: {epoch}, loss: {loss.item()}")


use_gpu = True
trainer = TorchTrainer(
    train_func_distributed,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=use_gpu),
)
results = trainer.fit()
print(results)
Issue Severity
None