
CUDA graph implementation on TRT backend #1071



Open: zjuwyz wants to merge 8 commits into master

Conversation

zjuwyz commented Jun 6, 2025

No description provided.


zjuwyz commented Jun 6, 2025

See https://discord.com/channels/417022162348802048/583775968804732928/1380433026093289596


zsqdx commented Jun 19, 2025

@lightvector Hi. Do you think this branch is ready to merge? Thanks!

@lightvector (Owner) left a comment:


Thanks for the work! I got around to taking a look at this and left some comments. Aside from those, one question: has this code also been tested with CUDA graphs disabled, to sanity-check that the changes don't break the original option that has no CUDA graphs?

Additionally, does anything about the plan cache or similar things need to change with CUDA graphs or any of these changes? Should we bump the salt for that, and also add the CUDA graph status into the hash, so that the cache is differentiated between CUDA-graph and non-CUDA-graph plans?
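For illustration, a minimal sketch of that idea with hypothetical names (KataGo's actual plan cache code will differ):

#include <sstream>
#include <string>

// Hypothetical sketch: fold the CUDA graph setting and a bumped salt into the
// plan cache key so graph and non-graph plans can never collide in the cache.
std::string makePlanCacheKey(const std::string& modelHash, int maxBatchSize, bool usingCudaGraphs) {
  static const char* salt = "cudagraph-v2"; // bumped alongside this change
  std::ostringstream out;
  out << modelHash << "|batch=" << maxBatchSize
      << "|cudagraph=" << (usingCudaGraphs ? 1 : 0)
      << "|salt=" << salt;
  return out.str();
}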

Comment on lines +1617 to +1625
cudaMallocHost((void**)&maskInputs, maxBatchSize * singleMaskElts * sizeof(float));
cudaMallocHost((void**)&spatialInputs, maxBatchSize * singleInputElts * sizeof(float));
cudaMallocHost((void**)&globalInputs, maxBatchSize * singleInputGlobalElts * sizeof(float));
cudaMallocHost((void**)&metaInputs, maxBatchSize * singleInputMetaElts * sizeof(float));
cudaMallocHost((void**)&policyPassResults, maxBatchSize * singlePolicyPassResultElts * sizeof(float));
cudaMallocHost((void**)&policyResults, maxBatchSize * singlePolicyResultElts * sizeof(float));
cudaMallocHost((void**)&valueResults, maxBatchSize * singleValueResultElts * sizeof(float));
cudaMallocHost((void**)&scoreValueResults, maxBatchSize * singleScoreValueResultElts * sizeof(float));
cudaMallocHost((void**)&ownershipResults, maxBatchSize * singleOwnershipResultElts * sizeof(float));
@lightvector (Owner):

Since we're changing all of these to raw pointers, does this require a corresponding free operation somewhere?
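For reference, a sketch of the matching cleanup, assuming these buffers are released in the owning object's destructor (the exact location in the PR is a guess):

// One cudaFreeHost per cudaMallocHost above, e.g. in the destructor:
cudaFreeHost(maskInputs);
cudaFreeHost(spatialInputs);
cudaFreeHost(globalInputs);
cudaFreeHost(metaInputs);
cudaFreeHost(policyPassResults);
cudaFreeHost(policyResults);
cudaFreeHost(valueResults);
cudaFreeHost(scoreValueResults);
cudaFreeHost(ownershipResults);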

Comment on lines +1325 to +1333
{
int gpuId;
cudaGetDevice(&gpuId);
auto& mutex = mutexPerGpu[gpuId];
mutex.lock();
planBuffer.reset(builder->buildSerializedNetwork(*model->network, *config));
mutex.unlock();
}

@lightvector (Owner):

Shouldn't this code be gated behind the TENSORRT_CUDA_GRAPH define? Is there any other code that should be gated but isn't?
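A sketch of the gating being asked about, wrapping the snippet above (the #ifdef placement is the suggestion, not code from the PR):

#ifdef TENSORRT_CUDA_GRAPH
{
  int gpuId;
  cudaGetDevice(&gpuId);
  auto& mutex = mutexPerGpu[gpuId];
  mutex.lock();
  planBuffer.reset(builder->buildSerializedNetwork(*model->network, *config));
  mutex.unlock();
}
#else
// Original non-graph path, with no per-GPU serialization.
planBuffer.reset(builder->buildSerializedNetwork(*model->network, *config));
#endif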

@lightvector (Owner):

Also, is auto& mutex = mutexPerGpu[gpuId]; safe?

This is a global map that starts uninitialized, so the map itself is subject to race conditions, since retrieving a value out of it involves a mutation, right?

Nitpick: for the mutex lock/unlock, normally I think we would use a std::lock_guard so that RAII guarantees the unlock even if the build of the network throws (although admittedly, KataGo is generally written so that exceptions in this kind of code are fatal anyway). Is that easy to do?
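One possible shape for both fixes, sketched with illustrative names. A fixed-size array of mutexes avoids mutating a shared map on lookup, and std::lock_guard gives the RAII unlock:

#include <array>
#include <mutex>

// Fixed-size array: lookups never insert, so concurrent access to the
// container itself is safe. Assumes gpuId < 64.
static std::array<std::mutex, 64> mutexPerGpu;

{
  int gpuId;
  cudaGetDevice(&gpuId);
  // Unlocks on scope exit, even if buildSerializedNetwork throws.
  std::lock_guard<std::mutex> lock(mutexPerGpu[gpuId]);
  planBuffer.reset(builder->buildSerializedNetwork(*model->network, *config));
}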

Comment on lines +1718 to +1721
int gpuId;
cudaGetDevice(&gpuId);
auto& mutex = gpuHandle->mutexPerGpu[gpuId];
mutex.lock();
@lightvector (Owner):

If we're going to have repeated code that does cudaGetDevice and grabs a mutex, maybe that should be factored out into a helper, so that it can be written once, safely (with the unsafe access to mutexPerGpu itself fixed).
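For example, a helper along these lines (names hypothetical) would keep the device lookup and the locking in one place:

#include <mutex>

// Hypothetical helper: look up the current device and return an owning lock.
// std::unique_lock is movable, so the caller holds the lock until it leaves scope.
std::unique_lock<std::mutex> lockCurrentGpu() {
  int gpuId;
  cudaGetDevice(&gpuId);
  return std::unique_lock<std::mutex>(mutexPerGpu[gpuId]);
}

// At each call site:
// auto lock = lockCurrentGpu();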
