Description
Unfortunately, OpenMP is quite restrictive in its target teams distribute construct. The specification says:
"If a teams region is nested inside a target region, the corresponding target construct must not contain any statements, declarations or directives outside of the corresponding teams construct."
This leads to restrictions which are often not very practical.
One alternative, however, would be the unified shared memory of CUDA/AMD, which is available in OpenMP via
#pragma omp requires unified_shared_memory
yet the simple program below fails. I invoke the compiler with: -std=c++20 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -lgomp
#include "omp.h"
#include <math.h>
#include <vector>
#include <numeric>
using namespace std;
#pragma omp requires unified_shared_memory
int main()
{
std::vector<double> v2(10) ;
std::iota (std::begin(v2), std::end(v2), 0);
std::vector<double> v(10,0) ;
double*d1=v.data();
double*d2=v2.data();
//no mapping needed in unified memory
//#pragma omp target enter data map(to: d1[0:10],d2[0:10])
#pragma omp target teams distribute
for(size_t i=1;i<10;i++)
{
d2[i]=d1[i]+1;
}
}
"PluginInterface" error: Failure to synchronize stream (nil): Error in cuStreamSynchronize: an illegal memory access was encountered
omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
omptarget error: Source location information not present. Compile with -g or -gline-tables-only.
omptarget fatal error 1: failure of target construct while offloading is mandatory
Process returned -1 (0xFFFFFFFF)   execution time : 0.160 s
Press ENTER to continue.
Perhaps unified shared memory is simply not supported?
My GPU (a GTX 1660 SUPER) is, admittedly, a bit old.
However, according to nvidia, my device has unified shared memory.
https://developer.nvidia.com/blog/unified-memory-in-cuda-6/
In CUDA 6, Unified Memory is supported starting with the Kepler GPU architecture (Compute Capability 3.0 or higher)
/opt/cuda/extras/demo_suite/deviceQuery
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce GTX 1660 SUPER"
CUDA Driver Version / Runtime Version 12.4 / 12.6
CUDA Capability Major/Minor version number: 7.5
pradt2 commented on Feb 20, 2025
Can you re-run your code with the environment variable
LIBOMPTARGET_INFO=-1
and paste the output here?

bschulz81 commented on Feb 21, 2025
No problem. To add a bit more info, I changed the code slightly. Interestingly, I found that my host is device 1 and my target is device 0...
This is a bit confusing, since intuitively one thinks the device with which one starts the program (usually the host) should be device 0, and one then maps to device 1. But no: in my case, num_devices is 1, the host's device number is 1 and the default device is 0 when I start; within a target region (without specifying the device), omp_is_initial_device() returns false and the device number is 0.
So the GPU is device 0 and the host is device 1... that is somewhat counterintuitive. What if I had 2 GPUs? Is the host then device 2, with GPU 1 as device 0 and GPU 2 as device 1?
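The checks involved boil down to something like this (a minimal sketch of the queries, not my full test program):

#include <omp.h>
#include <stdio.h>

int main()
{
    // On my system this prints: num_devices=1 initial=1 default=0
    printf("num_devices=%d initial=%d default=%d\n",
           omp_get_num_devices(), omp_get_initial_device(), omp_get_default_device());

    #pragma omp target
    {
        // Inside the target region, omp_is_initial_device() returns false (0)
        // and the device number is 0, i.e. the GPU.
        printf("on target: initial=%d device_num=%d\n",
               omp_is_initial_device(), omp_get_device_num());
    }
}

But well, here is the code and then the output: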
When I put the mapping command in, then I get
When I also remove the
#pragma omp requires unified_shared_memory
line, then I get the following, which runs OK.
So I don't know, perhaps my NVIDIA GTX 1660 SUPER has no unified shared memory? But the documents from NVIDIA seem to show otherwise...
KaruroChori commented on Apr 14, 2025
Just to confirm: your code and build command do not work on a 3090/A2000 either.
On modern drivers they support unified memory for sure.
I also added -g as suggested, but it still fails, just with a different error.
Are you sure map definitions can be fully dropped just because you enabled unified shared memory?

KaruroChori commented on Apr 20, 2025
I found the issue, at least in my case. Debian's official kernel images are built without CONFIG_DEVICE_PRIVATE,
which is needed for this feature to work.
https://forums.developer.nvidia.com/t/how-to-enable-hmm-on-debian/316561
Time to rebuild the kernel.
Addendum: I can confirm that with the freshly baked kernel it works.
bschulz81 commented on May 11, 2025
Hi there, on my system,
CONFIG_HMM_MIRROR and CONFIG_DEVICE_PRIVATE are both enabled.
I then looked up what GCC has to say on this issue. GCC 15.1 has support for unified shared memory: https://gcc.gnu.org/gcc-15/changes.html
However, further inquiry shows that OpenMP code that has a requires directive with self_maps or unified_shared_memory runs on nvptx devices if and only if all of those devices support the pageableMemoryAccess property:
https://gcc.gnu.org/onlinedocs/gcc-15.1.0/libgomp/nvptx.html
My GPU is an NVIDIA GTX 1660 SUPER. Despite supporting CUDA capability 7.5 and managedMemory, it does not have pageableMemoryAccess; in that case GCC then omits the shared-memory directive...
So my current system only has managedMemory and not pageableMemoryAccess.
At least GCC interprets this as forbidding #pragma omp requires unified_shared_memory.
The question is whether that is really true.
According to NVIDIA's documentation
(https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-cc60):
"For devices with compute capability 6.x or higher but without pageable memory access (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements), CUDA Managed Memory is fully supported and coherent. The programming model and performance tuning of unified memory is largely similar to the model as described in Unified memory on devices with full CUDA Unified Memory support (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-pageable-systems), with the notable exception that system allocators cannot be used to allocate memory. Thus, the following sub-sections do not apply:
System-Allocated Memory: in-depth examples (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-system-allocator)
Hardware/Software Coherency (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-hw-coherency)"
This suggests that for such systems the compiler would just have to find out whether a pointer is needed within a target region and then replace all malloc and new calls with cudaMallocManaged, to get a shared pointer that can be used on both device and host...
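Roughly along these lines (a sketch of the idea only, calling the CUDA runtime API directly):

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    double *d = nullptr;
    // What such a compiler transformation would have to emit instead of
    // "new double[10]", so the same pointer is valid on host and device.
    cudaMallocManaged(&d, 10 * sizeof(double));
    for (int i = 0; i < 10; i++)
        d[i] = i;              // touched on the host
    // ... device kernels could now use the very same pointer d ...
    cudaDeviceSynchronize();
    printf("%f\n", d[9]);
    cudaFree(d);
}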
But, well, perhaps I just need a more recent graphics card...
bschulz81 commented on Jun 12, 2025
Hm, I have now bought an NVIDIA RTX 5060 Ti card, which supports a feature called HMM (heterogeneous memory management):
https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
By experimenting with GCC, I found out that HMM appears to translate into #pragma omp requires unified_address and not unified_shared_memory.
It is often noted in NVIDIA's developer forums that one may work with STL vectors when HMM is enabled.
The following seems to compile with GCC but fails with Clang:
#include "omp.h"
#include <math.h>
#include
#include
#include <stdio.h>
using namespace std;
#pragma omp requires unified_address
int main()
{
std::vector v(10,1) ;
double*d1=v.data();
// A first attempt with mapping the data field of an stl vector.
#pragma omp target enter data map(to: d1[0:10])
#pragma omp target teams distribute parallel for
for(size_t i=1;i<10;i++)
{
printf("%f\n",d1[i]);
}
#pragma omp target exit data map(delete: d1[0:10])
//Now we change the data on the host.
#pragma omp parallel for
for (size_t i=1;i<10;i++)
v[i]=20;
int b=omp_is_initial_device();
if (b==1) printf("is initial device called from a host region. This statement should appear\n");
//no mapping with unified_address necessary, so commented out.
//#pragma omp target enter data map(tofrom: d1[0:10])
//we create a target region, check whether we are really on a target,
//we call a function of stl vector on the device, (the function pointers
//should point to the same function on host and device),
//and we look whether the values changed at host by a loop appear changed in the device.
#pragma omp target teams distribute
for(size_t i=1;i<10;i++)
{
int b=omp_is_initial_device();
if (b==1) printf("from a target region: is initial device. This statement should not appear.");
//since the host and target pointer coincide, the data changed by the
// host should appear and the stl function should be accessible on the target.
printf("%f\n",v.at(i));
}
}
If one replaces the last
printf("%f\n", v.at(i));
by
printf("%f\n", v.data()[i]);
then it compiles, but the checks above report that the entire loop runs on the host...
This also happens when we replace it with
printf("%f\n", d1[i]);
In contrast, when compiled by GCC, the check in the loop reports that the last loop executes on the target (which is the desired result). It also shows the correctly changed values.
So it appears that the trick of using NVIDIA's HMM to work with STL vectors currently works only with NVIDIA's nvc++, and maybe now with GCC...
KaruroChori commented on Jun 12, 2025
I have not checked your code yet, but in theory the latest GCC introduced C++ standard library support for offload devices, from what I recall (I have not tested that myself).
In Clang the process is more complex, and I guess it requires a custom build of the toolchain.
bschulz81 commented on Jun 14, 2025
Hi KaruroChori,
where did you get that from? The C library perhaps, but I do not know anything about the introduction of C++ library support. Do you have a document on this?
However, with HMM, when the pointers are shared in memory, then so should be the function pointers, and everything should be automatically accessible, because the copying is done at the driver level, even though there is not much compiler support for manually copying (mapping) STL vectors onto the GPU. That is at least how I thought this works, and how it is described in the NVIDIA developer forum...
If GCC now had STL support on GPUs, that would be very interesting. But I can't find a document on this.
KaruroChori commented on Jun 14, 2025
I thought I had dreamt about that, but no, it is reported in the release notes:
https://gcc.gnu.org/gcc-15/changes.html#nvptx
https://gcc.gnu.org/gcc-15/changes.html#amdgcn
I have not tested this yet, since I am still on 14.2.
bschulz81 commented on Jun 14, 2025
Hi KaruroChori,
the statement
"The standard C++ library (libstdc++) is now supported and enabled."
from the GCC release notes (https://gcc.gnu.org/gcc-15/changes.html#amdgcn) is quite useless and dubious.
The C++ STL has many features which seem incompatible with the GPU, e.g. exceptions with non-local jumps, and tons of classes for which one would like a manual.
For example, the OpenMP device mapper wants variables. For a struct one has to map all its members sequentially and then indicate possible array lengths, whose pointers are then linked implicitly to the mapped variable slots of the struct.
Let's say we write
std::vector<double> v(10, 1);
For an ordinary array, I would have to do this: get a pointer,
double *d1 = v.data();
map that pointer,
#pragma omp target enter data map(to: d1[0:10])
do a loop, and unmap:
#pragma omp target exit data map(from: d1[0:10])
If I have a struct, I would have to map the members and the arrays with their lengths, along the lines of the sketch below...
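A sketch of what I mean, for a struct with a pointer member (standard OpenMP 5.x pointer attachment; the struct is made up for illustration):

#include <cstdio>

struct Buf { double *data; int n; };

int main()
{
    Buf b;
    b.n = 10;
    b.data = new double[b.n]();

    // First map the struct itself, then attach the pointed-to array
    // to the device copy of the pointer member.
    #pragma omp target enter data map(to: b)
    #pragma omp target enter data map(to: b.data[0:b.n])

    #pragma omp target teams distribute parallel for
    for (int i = 0; i < b.n; i++)
        b.data[i] += 1.0;

    #pragma omp target exit data map(from: b.data[0:b.n])
    #pragma omp target exit data map(release: b)

    printf("%f\n", b.data[0]);
    delete[] b.data;
}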
But now, if there is STL support... let's try to work purely with the STL vector.
Can this new GCC compile and run this:
With GCC 15, this even creates an executable which runs. And it runs even without me using a #pragma omp requires unified_shared_memory or a #pragma omp requires unified_address.
This is highly dubious.
Without declaring that all the memory is shared, how would the mapper know the vector length?
Have they implemented this? If so, it would be highly appreciated and desired that the code automatically uploads the entire class. But obviously, that would go far beyond the OpenMP specification, which says that the mapping pragma only maps variables or arrays, not entire STL templates...
So I think that without more documentation, what GCC does here is highly dubious.
But anyway, we are here on a Clang forum...
On Clang, if I use this one:
I get this here:
omptarget fatal error 2: '#pragma omp requires unified_shared_memory' not used consistently!
even though my Blackwell GPU should support NVIDIA HMM: https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
What works on Clang is this:
but not this one:
Here Clang complains about undefined symbols, so the STL member functions are not implemented.
But it also complains about this:
Copying? Well, I wrote
#pragma omp requires unified_address
so there should not be any copying needed by Clang. At least as far as I understand NVIDIA, this should be done by the GPU driver...
So it appears that one cannot use HMM as in NVIDIA's compiler, where one can indeed work with STL vectors in unified shared memory, despite the STL not being implemented entirely.
Unfortunately, NVIDIA's nvc++ has other bugs: it has difficulties because it assumes overlapping pointers everywhere, I ran into bugs with gang loops, and it has difficulties with OpenMP on the CPU.
So long...
Benjamin
KaruroChori commented on Jun 14, 2025
No time to read now, so I will just reply to the very first point:
Most of the standard C++ library can work perfectly fine when compiled in noexcept mode. On exceptional events it just panics and exits, and I assume the noexcept flag is used under the hood for all offload targets.
Operations which are not usually available in the context of the offload device (like IO) are handled via remote procedure calls. At least this is how Clang handles it; I assume GCC does something similar.
KaruroChori commented on Jun 14, 2025
Testing:
I can see load on the GPU when compiled with both Clang and GCC, so I assume it is working as intended.
And removing the unified-shared-memory directive still leaves it working for both compilers on my system.
With GCC (fragment):
With Clang (fragment):
Edited to also show the team number.
bschulz81 commented on Jun 14, 2025
Interesting. The numbers seem to be correct in my case. What is irritating is that the GPU in my case is device 0 and the CPU device 1. If I change the code as follows, one can clearly see that the CPU has just one team and many threads, and that the GPU (device 0) works mostly with teams and one thread for this small loop.
Also, in the snippet below I have not used the data pointer but the real STL vector, without any mapping.
Maybe GCC uses shared memory automatically if the GPU supports it... because copying an STL vector is certainly nontrivial.
For Clang in version 20, the example below fails.
It produces copying warnings and linker problems...
Maybe Clang 21 works better in this respect.
But it is interesting that apparently one does not need any mapping with GCC anymore...
The order of the outputs is of course different, but that is because printf needs to be flushed, which is done after the loop...
KaruroChori commented on Jun 14, 2025
Clang will fail out of the box even in the most recent version, as libstdc++ is not linked. How to do that is explained here: #132429 (comment)
In theory, since GCC allows linking it, we could do the same from Clang and it should work.
bschulz81 commented on Jun 14, 2025
Hm, in my case (Clang 20) this ends with...
Apart from that, your comment is very interesting.
When the mapping commands (which need a length for arrays) are not needed anymore and having just a pointer in the loop suffices, I wonder what happens if we give it not a vector but a pointer to a memory-mapped file.
In the loop,
v[i] could literally come from
double *v = (double *) mmap(NULL, length, prot, flags, fd, offset);
https://www.gnu.org/software/libc/manual/html_node/Memory_002dmapped-I_002fO.html
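Something like this sketch is what I have in mind (assuming a system where HMM/unified shared memory makes host pointers valid on the device; the file name is made up and error handling is omitted):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#pragma omp requires unified_shared_memory

int main()
{
    int fd = open("data.bin", O_RDONLY);  // hypothetical file full of doubles
    size_t length = 1UL << 30;            // assume 1 GiB, possibly larger than VRAM
    size_t n = length / sizeof(double);
    double *v = (double *) mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);

    double sum = 0;
    // If the driver pages the mapping in on demand, no explicit map() should be needed.
    #pragma omp target teams distribute parallel for reduction(+: sum)
    for (size_t i = 0; i < n; i++)
        sum += v[i];

    printf("%f\n", sum);
    munmap(v, length);
    close(fd);
}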
If that file is larger than the GPU memory, is it then loaded by the GPU driver on demand?
I.e., if we have a 60 GB file and VRAM is just 30 GB, does the driver then decide that the current iterations of our thread group need, say, 100 MB, upload those asynchronously, write everything back (in the background), and then load the next chunk for the next threads?
Or does it limit the threads such that they only load the file into VRAM until it is full?
And what happens with older cards that do not have HMM?
Still a bit nebulous how this works...
KaruroChori commented on Jun 14, 2025
It should be working. I am counting on it for a library I am writing.
My guess is that pages of that virtual address range are loaded and purged based on the order in which they are requested from the virtual memory space on the offload devices.
So blowing past the limit of physical VRAM should not stop the application from working, but it might slow it down a lot.
The same logic could work for cards without support for the unified memory space, but it would require software to emulate the behaviour of an MMU to translate addresses and handle synchronization via traps? Not the simplest of things, and I guess it would have a negative impact on performance.
bschulz81 commented on Jun 16, 2025
Hi KaruroChori,
I made some speed tests. CUDA should support asynchronous memory transfer:
https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/
That way, while parts of the loop are computed, data for the next chunk could be uploaded to the GPU, and even linear operations would then be considerably fast.
OpenMP supports asynchronous offloading with nowait clauses on mapping commands and depend clauses on target constructs, as in the sketch below.
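For reference, the pattern I tried looks roughly like this (a sketch; ordering the tasks through depend clauses on the pointer is one possible way):

#include <vector>
#include <stdio.h>

int main()
{
    const int n = 1 << 20;
    std::vector<double> v(n, 1.0);
    double *a = v.data();

    // Asynchronous upload: returns immediately, ordered through depend(a).
    #pragma omp target enter data map(to: a[0:n]) nowait depend(out: a)

    // The kernel waits for the upload and itself runs asynchronously.
    #pragma omp target teams distribute parallel for nowait depend(inout: a)
    for (int i = 0; i < n; i++)
        a[i] += 1.0;

    // Download after the kernel, then wait for the whole chain.
    #pragma omp target exit data map(from: a[0:n]) nowait depend(in: a)
    #pragma omp taskwait

    printf("%f\n", a[0]);
}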
However, it turns out that, at least with gcc-15.1, the automatic mechanism without mapping commands is currently very slow. When I upload manually, there is a speedup of around 8x for a naive upload. When I use a cleverer upload strategy (only alloc, and then download just the result), the speedup from manually calling target enter data (to: ...) and so on is around 16x. Sadly, when I use asynchronous offloading manually, there is no speed advantage on gcc-15.1.
When the program contains a #pragma omp requires unified_address, the speed is the same as without it on my Blackwell 5060 GPU with HMM. Sadly, when I instead use a #pragma omp requires unified_shared_memory, it slows down considerably. I do not know why.
Here is a benchmark program with some numbers for gcc-15.1:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120679
I suspect that at least GCC has something to improve there.
I have not yet installed clang-21.
When it is unmasked in Gentoo I will try it out and get the corresponding numbers for Clang in that situation...
Hopefully the two compiler teams learn from each other and improve in healthy competition...
KaruroChori commented on Jun 16, 2025
Hi, since you already made the effort, would you make the code you used for testing public?
Just saw you made it public as an attachment in the bug report. I can package it with some benchmarking library, configure a test suite, and ensure we can run it programmatically against several flags, compilers and hardware.
bschulz81 commented on Jun 17, 2025
Hi KaruroChori,
for a benchmark, I think one should add something that is not linear, like matrix multiplication, and then test it with various loop variants (teams distribute, parallel for, simd, and perhaps also the new loop construct).
I forgot to add a license to my code.
Consider it as under the MIT License and add code as you please. And also benchmark Clang, because this here is a Clang forum.
On the CPU, I found that valarray, if accessed with operator[] in a loop, is quite fast. valarray, however, exposes no data() pointer (though one can still take the address of the first element, as in the sketch below). I do not know how it works on the GPU.
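A minimal sketch of that workaround (&v[0] addresses contiguous storage; the mapping clause is only illustrative):

#include <valarray>
#include <stdio.h>

int main()
{
    const int n = 10;
    std::valarray<double> v(1.0, n);
    double *p = &v[0]; // contiguous storage, but no data() member

    #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
    for (int i = 0; i < n; i++)
        p[i] += 1.0;

    printf("%f\n", v[1]);
}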
I suspect that, especially with asynchronous offloading, there is much room for optimization in these compilers.
Offloading more data while a computation runs on parts of the data could really speed things up, if it worked properly... This holds especially for setups like unified_address or unified_shared_memory or default mapping, where the offloading is done somewhat automatically... As it is a time-critical process, benchmarks are certainly of value...
KaruroChori commented on Jun 17, 2025
Just a quick report, as things do not add up. There are two
operation_gpu_without_mapping_with_pointers
functions being overloaded, and the second does not make much sense to me, since you first unwrap the data from the vector and then use the vector regardless. Could you check that?

KaruroChori commented on Jun 17, 2025
I have to stop here for today, but I added your tests to a repo; I am slowly refactoring the code a bit.
https://github.com/KaruroChori/test-openmp
I also placed assertions to ensure the resulting vectors are correct, but one of the original tests was failing; it just was not detected before.
I wrote
build.sh
to fit my toolchains, so some changes might be needed on your side before running.

bschulz81 commented on Jun 17, 2025
Hi KaruroChori, please put the assertions only in the functions speedmeasurement and especially speedmeasurement2 after the last mapping statement, not immediately after the loops, since especially in speedmeasurement2 the data is not meant to be downloaded immediately after the loops... Perhaps the tests pass then?
KaruroChori commented on Jun 17, 2025
Right, without an explicit fence, in some of the test cases the checks can run before completion. I will change that later as you suggested.
KaruroChori commented on Jun 18, 2025
Ok, I have some results from my side with clang-21. You might want to rerun the same tests on gcc-15, as my copy broke and I cannot compile with OpenMP any longer :(.
I tested both with and without
-fopenmp-cuda-mode
but for this example there is no measurable difference. unified_shared_memory is enabled.
I also added two more tests:
on GPU already mapped vector with pointers
and with vectors.
Also, the compiler will complain a lot with warnings, since you applied simd to several loops which do not accept it. But those are just ignored.
It really seems like pre-mapping memory for std::vector is not sufficient to avoid some expensive calls between host and device.
Also, is just reading a page of memory enough to invalidate it on all the other devices/the host? I had to disable one of the assertion checks because of that. I have not been able to fully test this, but if so, it is overly restrictive.
bschulz81 commented on Jun 18, 2025
Hi KaruroChori,
in general, what the program does:
Of course, to have the loop called via a function pointer, I sometimes pass it to the function that executes the loop in a CUDA kernel as a vector, but that should not matter. For the mapping, it is relevant whether the addresses are mapped.
Well, actually what your test data shows is that pointer access in Clang is much faster than looping over vectors, but that this automatic mapping, which likely just keeps the vector in shared memory from the beginning, is very slow compared to mapping it to GPU memory, which has special sections for fast-access memory.
Unfortunately, unlike omp_target_alloc, the mapping pragmas of OpenMP have no qualifier for the kind of GPU memory in which the data should be placed.
What you can now try is whether it gets faster by using asynchronous commands better, e.g. remove the taskwait barrier before the loop and put a depend clause for the variables on the #pragma omp target teams distribute...
But there is always some offset for the copy. Has to be. It is sad that this offset for the copy is much slower than the loop itself.
So for linear problems, like a scalar product with one vector and then another scalar product with another vector, the GPU is not really the correct tool, unless perhaps one has an onboard GPU with really fast unified shared memory...
However, it is usually a compiler's job to analyze into which memory one has to place variables, or whether one should start a loop already on the parts of the data that have been copied, to make everything faster...
As CUDA has many copy methods and several kinds of memory that are not present on CPUs, there are novel problems for the optimizer of a compiler...
I hoped with my post to shed some light on this. It unfortunately does not suffice just to make the "STL work fast" or to produce working CUDA loops.
No, the copy into memory also has to be taken into consideration during optimization. And loops may be started such that they work on partial data while the copy finishes... This is a somewhat novel situation for a compiler optimizer.
bschulz81 commented on Jun 18, 2025
@KaruroChori
I have not yet looked at your code, but what is the difference between them?
The function speedmeasurement2 just measures the loop without taking the mapping into account.
What does "without mapping with pointers" mean?
In my benchmark, "without mapping" stands for not issuing a manual mapping command. That produced, for a loop over vectors and pointers, ultra-slow results like this:
Although on GCC the vector access is about as fast as access via pointers, this is apparently not yet so in Clang. Issuing the mapping command gave this:
on GPU with mapping with pointers: 26.9728
And when you mapped the pointers and took the copying time out of the benchmark, the result of the loop over pointers was this:
on GPU already mapped vector with pointers: 0.504666
So I wonder what this is in your benchmark. How did you speed that up additionally?
on GPU without mapping with pointers II: 0.419876 new