Cannot run TorchTrainer on 3 nodes with GPU #59804

@matrixoneken

Description

What happened + What you expected to happen

I have set up a Ray cluster with 3 nodes, each with a GPU (one RTX TITAN per node). Please see the following image:

[screenshot]

I ran the example code from 'Train: Distributed Model Training'. With num_workers=2, the program cannot run; please see the following image:

[screenshot]

If I set num_workers=1, the program runs without any problem.
If I set num_workers=2 and use_gpu=False, the program also runs without any problem.

Can you help me solve this problem? Thank you.
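When a multi-worker GPU run hangs but a single worker (or CPU-only multi-worker) works, a common suspect is NCCL failing to establish cross-node communication. A hedged first debugging step, not a confirmed fix, is to enable verbose NCCL logging so the worker logs show whether the nodes can reach each other. Note that variables set on the driver do not automatically reach workers on other nodes; export them on every node before `ray start`, or pass them through Ray's `runtime_env` `env_vars`.

```python
import os

# Hedged debugging sketch: verbose NCCL logs usually reveal whether the
# distributed workers can open connections to each other during init.
# Set these on every node (or via Ray's runtime_env) before launching.
os.environ["NCCL_DEBUG"] = "INFO"             # verbose NCCL initialization logs
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus output on init and networking
```

If the logs show NCCL picking an unreachable network interface, pinning one via `NCCL_SOCKET_IFNAME` (set to whatever NIC your nodes actually share) is a common follow-up.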

Versions / Dependencies

Ray 2.53.0
Python 3.12.12
PyTorch 2.9.1+cu130
OS: Ubuntu 24.04.3 LTS

Reproduction script

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
import ray.train.torch
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

def get_dataset():
    return datasets.FashionMNIST(
         root="/tmp/data",
         train=True,
         download=True,
         transform=ToTensor(),
    )

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, inputs):
        inputs = self.flatten(inputs)
        logits = self.linear_relu_stack(inputs)
        return logits


def train_func_distributed():
    num_epochs = 3
    batch_size = 64
    dataset = get_dataset()
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    dataloader = ray.train.torch.prepare_data_loader(dataloader)
    model = NeuralNetwork()
    model = ray.train.torch.prepare_model(model)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(num_epochs):
        if ray.train.get_context().get_world_size() > 1:
            dataloader.sampler.set_epoch(epoch)
        for inputs, labels in dataloader:
            optimizer.zero_grad()
            pred = model(inputs)
            loss = criterion(pred, labels)
            loss.backward()
            optimizer.step()
        print(f"epoch: {epoch}, loss: {loss.item()}")


use_gpu = True

trainer = TorchTrainer(
    train_func_distributed,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=use_gpu)
)

results = trainer.fit()
print(results)
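Before calling trainer.fit(), it can help to confirm that the cluster actually advertises enough GPUs for the requested workers. The helper below is a hypothetical utility, not part of Ray's API; in a live cluster you would feed it the dict returned by `ray.cluster_resources()`.

```python
def has_enough_gpus(cluster_resources, num_workers):
    """Return True if the cluster advertises at least `num_workers` GPUs.

    `cluster_resources` is expected to be shaped like the dict returned by
    ray.cluster_resources(), e.g. {"CPU": 24.0, "GPU": 3.0, ...}.
    """
    return cluster_resources.get("GPU", 0) >= num_workers

# Example with a made-up resources dict (three nodes, one GPU each):
resources = {"CPU": 24.0, "GPU": 3.0}
assert has_enough_gpus(resources, 2)       # 2 workers fit on 3 GPUs
assert not has_enough_gpus(resources, 4)   # 4 workers would not
```

If the GPU count comes back as expected, scheduling is likely fine and the failure is more plausibly in worker-to-worker communication during training setup.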

Issue Severity

None


Labels

bug (Something that is supposed to be working; but isn't), community-backlog, gpu (GPU related issues), question (Just a question :)), stability, train (Ray Train Related Issue), triage (Needs triage, eg: priority, bug/not-bug, and owning component)
