Open
Description
Well ... once again, I find myself in need of another feature. This time, dynamic parallelism.
Looks like this is also part of the C++ runtime API, similar to cooperative groups, for which I already have a PR.
I'm considering using a similar strategy for implementing this feature. I would love to just pin down the PTX, but that has proven to be a bit unclear; however, I will definitely start my search in the PTX ISA and see if there are any quick wins. If not, then probably a similar approach as was taken with the cooperative groups API.
Thoughts?
Activity
thedodd commented on Nov 14, 2022
The generated PTX from a C++ program using dynamic parallelism tends to include a set of `.extern` declarations (I annotated them with comments based on studying the PTX). These are inserted by nvcc when the device-side triple-chevron syntax is used. They appear to be updated `V2` ABIs compared to what is documented here.
The V2 ABIs are much simpler, and building up the PTX for these seems to be pretty straightforward. I should have a PTX-based solution for this in PR form quite soon.
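One piece any PTX-based launch has to get right is where each kernel argument lands in the parameter buffer. Below is a minimal host-side sketch of that offset calculation. It assumes the layout rule as I understand it from the CUDA programming guide's device runtime section: arguments keep their order, and each one starts at the next multiple of its own size. The function name `param_offsets` is hypothetical and not part of cuda_std or cust.

```rust
/// Computes the byte offset of each kernel argument in a device-side
/// parameter buffer, plus the total number of bytes used.
///
/// Assumption (not taken from this thread): arguments are never reordered,
/// and each one is placed at the smallest multiple of its own size that is
/// at or past the end of the previous argument.
pub fn param_offsets(sizes: &[usize]) -> (Vec<usize>, usize) {
    let mut offsets = Vec::with_capacity(sizes.len());
    let mut end = 0usize; // one past the last byte used so far
    for &size in sizes {
        assert!(size > 0, "zero-sized parameters are not handled in this sketch");
        // Round `end` up to the next multiple of `size`.
        let offset = (end + size - 1) / size * size;
        offsets.push(offset);
        end = offset + size;
    }
    (offsets, end)
}
```

For example, arguments of sizes 4, 8 and 4 bytes would land at offsets 0, 8 and 16 under this rule, so argument order affects how much padding the buffer needs.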
I will likely just copy/paste the launch macro that we currently have in cust, and maybe add it to a shared location, or just copy/paste directly to the cuda_std module. We can decide what to do with it in the PR. Just to clarify, the launch macro extracts block & grid size declarations quite nicely, which is why I want to use the macro and then feed that code into the PTX ASM.
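To illustrate the kind of extraction meant here, this is a rough host-side sketch of a `macro_rules!` macro that pulls the grid size, block size, and shared memory size out of a chevron-style launch expression. It is not the actual cust macro: `launch_parts!` and `LaunchConfig` are hypothetical names, and the sketch only collects the pieces rather than emitting any PTX.

```rust
/// Launch configuration captured from the chevron syntax.
/// Hypothetical type for this sketch; not a cust or cuda_std item.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LaunchConfig {
    pub grid: (u32, u32, u32),
    pub block: (u32, u32, u32),
    pub shared_mem_bytes: u32,
}

/// Hypothetical macro: splits `kernel<<<grid, block, shared, stream>>>(args...)`
/// into the kernel name, the launch configuration, and the argument count.
macro_rules! launch_parts {
    ($kernel:ident <<<$grid:expr, $block:expr, $shared:expr, $stream:ident>>>(
        $($arg:expr),* $(,)?
    )) => {{
        // The stream is matched as an identifier because an `expr` fragment
        // may not be followed directly by `>>>` in a macro pattern.
        let _ = &$stream;
        let config = LaunchConfig {
            grid: $grid,
            block: $block,
            shared_mem_bytes: $shared,
        };
        // Count the arguments; a real implementation would instead write
        // each one into the parameter buffer.
        let arg_count = 0usize $(+ { let _ = &$arg; 1usize })*;
        (stringify!($kernel), config, arg_count)
    }};
}
```

An invocation such as `launch_parts!(add<<<(4, 1, 1), (256, 1, 1), 0, stream>>>(a, b))` yields the kernel name, the captured configuration, and the number of arguments, which is the information a device-side launch would then feed into the inline assembly.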
thedodd commented on Nov 15, 2022
Well, as it turns out, a great deal of the code (if not all) is already in place: https://github.com/Rust-GPU/Rust-CUDA/tree/master/crates/cuda_std/src/rt . I had originally been searching for this stuff in the docs, and was not able to find it. Looking in the code, there it is.
I will enable that `rt` module and start experimenting with it. I'll compare the generated PTX with an equivalent C++ program compiled via nvcc.
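A comparison like that can be partly automated. The sketch below collects the names of `.extern .func` declarations from two PTX listings and reports which ones the reference has that the candidate lacks. The helper names `extern_funcs` and `missing_externs` are hypothetical, and the parsing is deliberately naive: it assumes each declaration starts on its own line, optionally with a parenthesised return parameter before the name.

```rust
use std::collections::BTreeSet;

/// Collects the names of all `.extern .func` declarations in a PTX listing.
/// Naive line-based parsing; an assumption of this sketch, not a PTX parser.
pub fn extern_funcs(ptx: &str) -> BTreeSet<String> {
    let mut names = BTreeSet::new();
    for line in ptx.lines() {
        let rest = match line.trim_start().strip_prefix(".extern") {
            Some(r) => r,
            None => continue,
        };
        let rest = match rest.trim_start().strip_prefix(".func") {
            Some(r) => r,
            None => continue,
        };
        let mut rest = rest.trim_start();
        // Skip an optional return parameter such as `(.param .b32 func_retval0)`.
        if rest.starts_with('(') {
            match rest.find(')') {
                Some(end) => rest = rest[end + 1..].trim_start(),
                None => continue,
            }
        }
        // The function name is the identifier that follows.
        let name: String = rest
            .chars()
            .take_while(|c| c.is_ascii_alphanumeric() || *c == '_' || *c == '$')
            .collect();
        if !name.is_empty() {
            names.insert(name);
        }
    }
    names
}

/// Names declared `.extern .func` in `reference` but not in `candidate`.
pub fn missing_externs(reference: &str, candidate: &str) -> Vec<String> {
    let have = extern_funcs(candidate);
    extern_funcs(reference)
        .into_iter()
        .filter(|name| !have.contains(name))
        .collect()
}
```

Passing the nvcc output as `reference` and the Rust-generated PTX as `candidate` gives a quick list of device runtime entry points that one side declares and the other does not.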