Tracking Issue for NVPTX shared memory

Feature gate: `#![feature(stdarch_nvptx)]` (probably; see https://github.com/rust-lang/rust/issues/111199)

This is a tracking issue for access to shared memory from NVPTX device code.

There are two flavors of shared memory in NVPTX (see [this NVIDIA Technical Blog (2013)](https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/) for a friendly introduction):
* **static** shared memory provides a fixed size amount of shared memory independent of the block size. There can be multiple static shared arrays (declared in CUDA C++ using the `__shared__` attribute).
* **dynamic** shared memory provides one base pointer, with a length (in bytes) that must be specified in the kernel launch parameters. If multiple arrays are needed, the single buffer must be manually partitioned (in CUDA C++).

In practice, the required amount of shared memory for GPU algorithms will depend on the block size. Since many domains expect to choose the block size at run-time, *static* shared memory is often considered to be of limited use.



### Public API


Note that other `core::arch::nvptx` intrinsics are covered by https://github.com/rust-lang/rust/issues/111199.

```rust
use core::arch::nvptx;

#[no_mangle]
pub unsafe extern "ptx-kernel" fn reverse_kernel(n: u32, a: *mut f32) {
    let t = nvptx::_thread_idx_x() as usize;
    let tr = n as usize - t - 1;
    let (s, sn) = nvptx::_dynamic_smem(); // <--- this issue
    let s = s as *mut f32;
    assert_eq!(sn as u32, n * 4); // requirement for the algorithm below to be correct
    *s.add(t) = *a.add(t);
    nvptx::_syncthreads();
    *a.add(t) = *s.add(tr);
}
```

#### A possible implementation for dynamic shared memory

Shared memory can be exposed using inline assembly ([reference](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-dynamic-smem-size)).
```rs
core::arch::global_asm!(".extern .shared .align 16 .b8 _shared_data[];");

#[inline(always)]
pub fn _dynamic_smem() -> (*mut u8, u32) {
    // Dynamic shared memory.
    let size: u32;
    let saddr: u64;
    let ptr: *mut u8;
    unsafe {
        asm!("mov.u32 {}, %dynamic_smem_size;", out(reg32) size);
        asm!("mov.u64 {}, _shared_data;", out(reg64) saddr);
        asm!("cvta.shared.u64 {ptr}, {saddr};", ptr = out(reg64) ptr, saddr = in(reg64) saddr);
    }
    (ptr, size)
}
```

### Steps / History



- [ ] Implementation: #...
- [ ] Final comment period (FCP)[^1]
- [ ] Stabilization PR



### Unresolved Questions



- Do we prefer the `_dynamic_smem() -> (*mut u8, u32)` or should we have two separate intrinsics for accessing the base pointer and the size?
- Do we want to use `usize` for the length even though NVPTX does not (yet, and maybe ever) support that? I'm not sure about other vendors, if we're looking for the intrinsics to be as close as possible.
- Is it desirable to expose static shared memory or shall we focus on dynamic (which is less intrusive to implement, but may require more thinking to launch kernels with valid parameters)?
- It has also been suggested that this be implemented as an LLVM intrinsic by introducing `llvm.nvvm.read.ptx.sreg.dynamic_smem_size`. That could in principle allow accessing the shared memory base pointer without introducing a new symbol (`_shared_data above`). I lean toward using the inline assembly for now and possibly migrating it later, but no other inline assembly is used in `core::arch::nvptx` and I don't know if it should be kept that way.
- Do we want to tackle launch bounds in this issue? If so, it should not be under the `stdarch_nvptx` feature gate since it would need to verify launch parameters on the host (better user experience) or handle failure on the device (i.e., prevent UB if an insufficient amount of shared memory is provided). I prefer to handle only the unsafe primitives here since there are many open questions about such ergonomics.
- Should this issue consider partitioning dynamic shared memory into parts (e.g., multiple arrays or more general data structures, parametrized by block size)? This interacts with more types and would also make it inappropriate for the `stdarch_nvptx` feature gate.

[^1]: https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Tracking Issue for NVPTX shared memory #135516

Public API

A possible implementation for dynamic shared memory

Steps / History

Unresolved Questions

12 remaining items

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Tracking Issue for NVPTX shared memory #135516

Description

Public API

A possible implementation for dynamic shared memory

Steps / History

Unresolved Questions

Footnotes

Activity

jedbrown commented on Jan 15, 2025

workingjubilee commented on Jan 15, 2025

juntyr commented on Jan 15, 2025

jedbrown commented on Jan 16, 2025

workingjubilee commented on Jan 16, 2025

Flakebi commented on Jan 17, 2025

workingjubilee commented on Jan 17, 2025

jedbrown commented on Jan 17, 2025

kjetilkjeka commented on Jan 17, 2025

workingjubilee commented on Jan 17, 2025

Flakebi commented on Jan 17, 2025

12 remaining items

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions