Feature gate: `#![feature(stdarch_nvptx)]` (probably; see #111199)
This is a tracking issue for access to shared memory from NVPTX device code.
There are two flavors of shared memory in NVPTX (see this NVIDIA Technical Blog (2013) for a friendly introduction):
- static shared memory provides a fixed amount of shared memory, independent of the block size. There can be multiple static shared arrays (declared in CUDA C++ using the `__shared__` attribute).
- dynamic shared memory provides one base pointer, with a length (in bytes) that must be specified in the kernel launch parameters. If multiple arrays are needed, the single buffer must be manually partitioned (in CUDA C++).
In practice, the required amount of shared memory for GPU algorithms will depend on the block size. Since many domains expect to choose the block size at run-time, static shared memory is often considered to be of limited use.
Public API
Note that other `core::arch::nvptx` intrinsics are covered by #111199.
```rust
use core::arch::nvptx;

#[no_mangle]
pub unsafe extern "ptx-kernel" fn reverse_kernel(n: u32, a: *mut f32) {
    let t = nvptx::_thread_idx_x() as usize;
    let tr = n as usize - t - 1;
    let (s, sn) = nvptx::_dynamic_smem(); // <--- this issue
    let s = s as *mut f32;
    assert_eq!(sn as u32, n * 4); // requirement for the algorithm below to be correct
    *s.add(t) = *a.add(t);
    nvptx::_syncthreads();
    *a.add(t) = *s.add(tr);
}
```
A possible implementation for dynamic shared memory
Shared memory can be exposed using inline assembly (reference).
```rust
use core::arch::asm;

// Declare the dynamic shared memory buffer as an external symbol in the
// shared state space.
core::arch::global_asm!(".extern .shared .align 16 .b8 _shared_data[];");

#[inline(always)]
pub fn _dynamic_smem() -> (*mut u8, u32) {
    // Dynamic shared memory.
    let size: u32;
    let saddr: u64;
    let ptr: *mut u8;
    unsafe {
        // Size (in bytes) of dynamic shared memory provided at launch.
        asm!("mov.u32 {}, %dynamic_smem_size;", out(reg32) size);
        // Address of the buffer in the shared state space ...
        asm!("mov.u64 {}, _shared_data;", out(reg64) saddr);
        // ... converted to a generic-address-space pointer.
        asm!("cvta.shared.u64 {ptr}, {saddr};", ptr = out(reg64) ptr, saddr = in(reg64) saddr);
    }
    (ptr, size)
}
```
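For illustration, here is a minimal sketch, assuming the proposed `_dynamic_smem` above, of how a kernel could manually partition the single dynamic buffer into two typed arrays; the kernel name and layout are hypothetical, not part of this proposal:

```rust
use core::arch::nvptx;

// Hypothetical kernel, assuming (like reverse_kernel above) that the block
// size equals n. The dynamic buffer is split into n f32 values followed by
// n u32 values, so the launch must provide at least n * 8 bytes.
#[no_mangle]
pub unsafe extern "ptx-kernel" fn partitioned_kernel(n: u32, a: *mut f32, b: *mut u32) {
    let t = nvptx::_thread_idx_x() as usize;
    let (base, len) = nvptx::_dynamic_smem(); // proposed in this issue
    assert!(len >= n * 8);

    // First partition: n f32 values starting at the (16-byte aligned) base.
    let floats = base as *mut f32;
    // Second partition: n u32 values immediately after the floats
    // (the offset n * 4 keeps 4-byte alignment).
    let ints = base.add(n as usize * 4) as *mut u32;

    *floats.add(t) = *a.add(t);
    *ints.add(t) = *b.add(t);
    nvptx::_syncthreads();
    // ... both shared arrays are now visible to every thread in the block ...
}
```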
Steps / History
Unresolved Questions
- Do we prefer the `_dynamic_smem() -> (*mut u8, u32)` signature, or should we have two separate intrinsics for accessing the base pointer and the size (see the sketch after this list)?
- Do we want to use `usize` for the length even though NVPTX does not (yet, and maybe ever) support that? I'm not sure about other vendors, if we're looking for the intrinsics to be as close as possible.
- Is it desirable to expose static shared memory, or shall we focus on dynamic (which is less intrusive to implement, but may require more thinking to launch kernels with valid parameters)?
- It has also been suggested that this be implemented as an LLVM intrinsic by introducing `llvm.nvvm.read.ptx.sreg.dynamic_smem_size`. That could in principle allow accessing the shared memory base pointer without introducing a new symbol (`_shared_data` above). I lean toward using the inline assembly for now and possibly migrating it later, but no other inline assembly is used in `core::arch::nvptx` and I don't know if it should be kept that way.
- Do we want to tackle launch bounds in this issue? If so, it should not be under the `stdarch_nvptx` feature gate since it would need to verify launch parameters on the host (better user experience) or handle failure on the device (i.e., prevent UB if an insufficient amount of shared memory is provided). I prefer to handle only the unsafe primitives here since there are many open questions about such ergonomics.
- Should this issue consider partitioning dynamic shared memory into parts (e.g., multiple arrays or more general data structures, parametrized by block size)? This interacts with more types and would also make it inappropriate for the `stdarch_nvptx` feature gate.
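For comparison, here is a hedged sketch of the two-separate-intrinsics alternative from the first question above; the names `_dynamic_smem_ptr` and `_dynamic_smem_size` are placeholders rather than an agreed API, and the sketch reuses the `_shared_data` symbol declared with `global_asm!` earlier:

```rust
use core::arch::asm;

// Hypothetical alternative: two intrinsics instead of a tuple return.
// Relies on the `.extern .shared ... _shared_data[];` declaration above.

#[inline(always)]
pub fn _dynamic_smem_ptr() -> *mut u8 {
    let saddr: u64;
    let ptr: *mut u8;
    unsafe {
        // Address of the buffer in the shared state space ...
        asm!("mov.u64 {}, _shared_data;", out(reg64) saddr);
        // ... converted to a generic-address-space pointer.
        asm!("cvta.shared.u64 {}, {};", out(reg64) ptr, in(reg64) saddr);
    }
    ptr
}

#[inline(always)]
pub fn _dynamic_smem_size() -> u32 {
    let size: u32;
    unsafe {
        // Size (in bytes) of dynamic shared memory provided at launch.
        asm!("mov.u32 {}, %dynamic_smem_size;", out(reg32) size);
    }
    size
}
```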
jedbrown commented on Jan 15, 2025
@rustbot label O-nvptx
workingjubilee commented on Jan 15, 2025
One of the concerns here is that we likely will want to consider an API that is amenable to being used on both amdgpu and ptx targets. The concept is identical to both, as far as I am aware. In that sense, the main issue with simply using inline assembly or intrinsics is that it might be too architecture-specific for what we want.
juntyr commented on Jan 15, 2025
In my crate rust-cuda I have wrapper types for both static (https://github.com/juntyr/rust-cuda/blob/main/src/utils/shared/static.rs) and dynamic (https://github.com/juntyr/rust-cuda/blob/main/src/utils/shared/slice.rs) shared memory, perhaps some of its Assembly hacks might be useful :)
jedbrown commented on Jan 16, 2025
@workingjubilee Does that comment apply to the rest of `core::arch::nvptx`? Do you envision something like a `core::arch::gpu` that has vendor-agnostic ways to access such things? Note that Intel's GPU execution model differs from NVIDIA's a bit more than AMD's does.
workingjubilee commented on Jan 16, 2025
Hmm. No, I don't think it applies to the rest of the architectural intrinsics, because many of them are about more-specific ideas about how the GPU's execution is organized, as opposed to a more common concept of "there's areas of memory that threads 0..N get access to but N..N*2 do not (because it has its own area)", which also may require more special consideration for how it will be elaborated relative to our opsem.
Flakebi commented on Jan 17, 2025
There are quite a few intrinsics that exist on both nvptx and amdgpu (and surely also intel). After all, they are built to run the same applications (thinking of Vulkan, DirectX, OpenCL, etc.). 🙂
I guess it would be nice to have the “common” intrinsics like `_thread_idx_x/y/z`, group and launch size, syncthreads/barrier. Of course, then the question comes up what the naming scheme should be 😉
The schemes that come to my mind:
I lean towards open standards, which would be the Khronos nomenclature.
(Should these be the only implementations, “replacing” the arch-specific ones, or should they only re-expose the arch-specific intrinsics?)
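As a purely hypothetical illustration of the “re-expose” option (neither `core::arch::gpu` nor these names exist today), a vendor-neutral intrinsic could simply forward to each vendor's existing one:

```rust
// Hypothetical sketch only; not a proposed API.
pub mod gpu {
    /// Vendor-neutral index of the thread/work-item within its block/work-group.
    #[cfg(target_arch = "nvptx64")]
    #[inline(always)]
    pub unsafe fn workitem_id_x() -> u32 {
        core::arch::nvptx::_thread_idx_x() as u32
    }
    // A #[cfg(target_arch = "amdgpu")] variant would forward to the
    // corresponding amdgpu intrinsic under the same name.
}
```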
workingjubilee commented on Jan 17, 2025
Sorry, I misspoke, the reason to not pursue those immediately in this issue is because ones that are not immediately about shared memory would be yet-another topic.
jedbrown commented on Jan 17, 2025
It's nigh impossible to use shared memory (or write any useful kernel) without the concept of thread indices, which are in `core::arch::nvptx`. I had in mind that there would be vendor-specific intrinsics using that vendor's preferred terminology (thus my initial suggestion that the feature gate would be `stdarch_nvptx`, same as #111199) and we could later define a common subset using a single consistent nomenclature. I think it would be confusing if users of vendor-specific features had to deal with two nomenclatures, so I'd be in favor of the "re-expose" strategy.
Note that operational semantics (e.g., with respect to forward progress) are different between vendors. Suppose one writes a kernel using `"ptx-kernel"` and `core::arch::nvptx` that relies on forward progress. If a generic `"gpu-kernel"` and `core::arch::gpu` were simple aliases, one could port the kernel by changing the `use` statement (substituting nomenclature as needed) and the kernel would continue to run equivalently on NVIDIA hardware, but would deadlock with other vendors. I think that would be bad. So it seems to me that the generic side should specify the weaker model that allows conforming kernels to be portable (and unsafe aspects of that could be checked via statically safe wrappers, tools like MIRI, or other formal methods).
kjetilkjeka commented on Jan 17, 2025
I agree that an ambition for a future user-friendly, safe way to deal with shared memory across GPU targets (that ideally would also be consistent on targets without thread groups) should not get in the way of exposing nvptx intrinsics related to shared memory.
I think there's also precedent for this already, as in how `std::simd` will be a killer feature in the future but arch SIMD intrinsics are still useful and accepted in the `core::arch` module.
workingjubilee commented on Jan 17, 2025
I don't think anyone relying on forward progress guarantees in the language without forward progress guarantees in its operational semantics is relying on anything to do about Rust's operational semantics.
Flakebi commented on Jan 17, 2025
Regarding implementing shared memory for nvptx, LLVM exposes it as `addrspace(3)` globals (inline assembly is quite hacky, I think we should use globals instead in the Rust implementation).
LLVM nvptx docs: https://llvm.org/docs/NVPTXUsage.html#address-spaces
LLVM amdgpu docs (also uses `addrspace(3)`): https://llvm.org/docs/AMDGPUUsage.html#address-spaces
For static shared allocations, we can declare an `addrspace(3)` global in LLVM IR; llc then lowers it to a `.shared` array declaration in the emitted PTX. For dynamic shared allocations, we declare the global as `external`, which lowers to an `.extern .shared` declaration.
Test in LLVM that uses it: https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/NVPTX/globals_lowering.ll
Test that demonstrates that it cannot be initialized: https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/NVPTX/gvar-init.ll