Description
Unfortunately, OpenMP is quite restrictive in its target teams distribute construct. The specification says:
"If a teams region is nested inside a target region, the corresponding target construct must not contain any statements, declarations or directives outside of the corresponding teams construct."
This leads to restrictions which are often not very practical.
One alternative, however, would be the unified shared memory of CUDA/AMD, which is available in OpenMP via
#pragma omp requires unified_shared_memory
yet the simple program below fails. I invoke the compiler with: -std=c++20 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -lgomp
#include "omp.h"
#include <math.h>
#include <vector>
#include <numeric>
using namespace std;
#pragma omp requires unified_shared_memory
int main()
{
std::vector<double> v2(10) ;
std::iota (std::begin(v2), std::end(v2), 0);
std::vector<double> v(10,0) ;
double*d1=v.data();
double*d2=v2.data();
//no mapping needed in unified memory
//#pragma omp target enter data map(to: d1[0:10],d2[0:10])
#pragma omp target teams distribute
for(size_t i=1;i<10;i++)
{
d2[i]=d1[i]+1;
}
}
"PluginInterface" error: Failure to synchronize stream (nil): Error in cuStreamSynchronize: an illegal memory access was encountered
omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
omptarget error: Source location information not present. Compile with -g or -gline-tables-only.
omptarget fatal error 1: failure of target construct while offloading is mandatory
Process returned -1 (0xFFFFFFFF)   execution time : 0.160 s
Press ENTER to continue.
Perhaps unified shared memory is simply not supported?
My GPU (a GTX 1660 SUPER) is, admittedly, a bit old.
However, according to nvidia, my device has unified shared memory.
https://developer.nvidia.com/blog/unified-memory-in-cuda-6/
In CUDA 6, Unified Memory is supported starting with the Kepler GPU architecture (Compute Capability 3.0 or higher)
/opt/cuda/extras/demo_suite/deviceQuery
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce GTX 1660 SUPER"
CUDA Driver Version / Runtime Version 12.4 / 12.6
CUDA Capability Major/Minor version number: 7.5
pradt2 commented on Feb 20, 2025
Can you re-run your code with the environment variable
LIBOMPTARGET_INFO=-1
and paste the output here?

bschulz81 commented on Feb 21, 2025
No problem. To add a bit more info, I changed the code slightly. Interestingly, I found that my host is device 1 and my target is device 0...
This is a bit confusing, since intuitively one thinks the device with which one starts the program (usually the host) should be device 0, and one then maps to device 1. But no: in my case, num_devices is 1, the host's device number is 1 and the default device is 0 when I start; within a target region (without specifying the device), omp_is_initial_device() returns false and the device number is 0.
So the GPU is device 0 and the host is device 1... that is somewhat counterintuitive. What if I had 2 GPUs? Is the host then device 2, with GPU 1 as device 0 and GPU 2 as device 1?
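The checks involved boil down to something like this (a minimal sketch of the queries, not my full test program):

#include <omp.h>
#include <stdio.h>

int main()
{
    // On my system this prints: num_devices=1 initial=1 default=0
    printf("num_devices=%d initial=%d default=%d\n",
           omp_get_num_devices(), omp_get_initial_device(), omp_get_default_device());

    #pragma omp target
    {
        // Inside the target region, omp_is_initial_device() returns false (0)
        // and the device number is 0, i.e. the GPU.
        printf("on target: initial=%d device_num=%d\n",
               omp_is_initial_device(), omp_get_device_num());
    }
}

But well, here is the code and then the output: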
When I put the mapping command in, then I get
When I also remove the
#pragma omp requires unified_shared_memory
line, then I get the following, which runs OK.
So I don't know, perhaps my NVIDIA GTX 1660 SUPER has no unified shared memory? But the documents from NVIDIA seem to show otherwise...
KaruroChori commented on Apr 14, 2025
Just to confirm: your code and build command do not work on a 3090/A2000 either.
On modern drivers they support unified memory for sure.
I also added -g as suggested, but it still fails, just with a different error.
Are you sure map definitions can be fully dropped just because you enabled unified shared memory?

KaruroChori commented on Apr 20, 2025
I found the issue, at least in my case. Debian's official kernel images are built without CONFIG_DEVICE_PRIVATE,
which is needed for this feature to work.
https://forums.developer.nvidia.com/t/how-to-enable-hmm-on-debian/316561
Time to rebuild the kernel.
Addendum: I can confirm that with the freshly baked kernel it works.
bschulz81 commented on May 11, 2025
Hi there, on my system,
CONFIG_HMM_MIRROR and CONFIG_DEVICE_PRIVATE are both enabled.
I then looked up what GCC has to say on this issue. GCC 15.1 has support for unified shared memory: https://gcc.gnu.org/gcc-15/changes.html
However, further inquiry shows that OpenMP code that has a requires directive with self_maps or unified_shared_memory runs on nvptx devices if and only if all of those devices support the pageableMemoryAccess property:
https://gcc.gnu.org/onlinedocs/gcc-15.1.0/libgomp/nvptx.html
My GPU is an NVIDIA GTX 1660 SUPER. Despite supporting CUDA capability 7.5 and managedMemory, it does not have pageableMemoryAccess; in that case GCC then omits the shared-memory directive...
So my current system only has managedMemory and not pageableMemoryAccess.
At least GCC interprets this as forbidding #pragma omp requires unified_shared_memory.
The question is whether that is really true.
According to NVIDIA's documentation
(https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-cc60):
"For devices with compute capability 6.x or higher but without pageable memory access (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements), CUDA Managed Memory is fully supported and coherent. The programming model and performance tuning of unified memory is largely similar to the model as described in Unified memory on devices with full CUDA Unified Memory support (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-pageable-systems), with the notable exception that system allocators cannot be used to allocate memory. Thus, the following sub-sections do not apply:
System-Allocated Memory: in-depth examples (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-system-allocator)
Hardware/Software Coherency (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-hw-coherency)"
This suggests that for such systems the compiler would just have to find out whether a pointer is needed within a target region and then replace all malloc and new calls with cudaMallocManaged, to get a shared pointer that can be used on both device and host...
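Roughly along these lines (a sketch of the idea only, calling the CUDA runtime API directly):

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    double *d = nullptr;
    // What such a compiler transformation would have to emit instead of
    // "new double[10]", so the same pointer is valid on host and device.
    cudaMallocManaged(&d, 10 * sizeof(double));
    for (int i = 0; i < 10; i++)
        d[i] = i;              // touched on the host
    // ... device kernels could now use the very same pointer d ...
    cudaDeviceSynchronize();
    printf("%f\n", d[9]);
    cudaFree(d);
}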
But, well, perhaps I just need a more recent graphics card...
bschulz81 commented on Jun 12, 2025
Hm, I have now bought an NVIDIA RTX 5060 Ti card, which supports a feature called HMM (heterogeneous memory management):
https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
By experimenting with GCC, I found out that HMM appears to translate into #pragma omp requires unified_address and not unified_shared_memory.
It is often noted in NVIDIA's developer forums that one may work with STL vectors when HMM is enabled.
The following seems to compile with GCC but fails with Clang:
#include "omp.h"
#include <math.h>
#include
#include
#include <stdio.h>
using namespace std;
#pragma omp requires unified_address
int main()
{
std::vector v(10,1) ;
double*d1=v.data();
// A first attempt with mapping the data field of an stl vector.
#pragma omp target enter data map(to: d1[0:10])
#pragma omp target teams distribute parallel for
for(size_t i=1;i<10;i++)
{
printf("%f\n",d1[i]);
}
#pragma omp target exit data map(delete: d1[0:10])
//Now we change the data on the host.
#pragma omp parallel for
for (size_t i=1;i<10;i++)
v[i]=20;
int b=omp_is_initial_device();
if (b==1) printf("is initial device called from a host region. This statement should appear\n");
//no mapping with unified_address necessary, so commented out.
//#pragma omp target enter data map(tofrom: d1[0:10])
//we create a target region, check whether we are really on a target,
//we call a function of stl vector on the device, (the function pointers
//should point to the same function on host and device),
//and we look whether the values changed at host by a loop appear changed in the device.
#pragma omp target teams distribute
for(size_t i=1;i<10;i++)
{
int b=omp_is_initial_device();
if (b==1) printf("from a target region: is initial device. This statement should not appear.");
//since the host and target pointer coincide, the data changed by the
// host should appear and the stl function should be accessible on the target.
printf("%f\n",v.at(i));
}
}
If one replaces the last
printf("%f\n", v.at(i));
by
printf("%f\n", v.data()[i]);
then it compiles, but the checks above report that the entire loop runs on the host...
This also happens when we replace it with
printf("%f\n", d1[i]);
In contrast, when compiled by GCC, the check in the loop reports that the last loop executes on the target (which is the desired result). It also shows the correctly changed values.
So it appears that the trick of using NVIDIA's HMM to work with STL vectors currently works only with NVIDIA's nvc++, and maybe now with GCC...
KaruroChori commented on Jun 12, 2025
I have not checked your code yet, but in theory the latest GCC introduced C++ standard library support for offload devices, from what I recall (I have not tested that myself).
In Clang the process is more complex, and I guess it requires a custom build of the toolchain.
bschulz81 commented on Jun 14, 2025
Hi KaruroChori,
where did you get that from? The C library perhaps, but I do not know anything about the introduction of C++ library support. Do you have a document on this?
However, with HMM, when the pointers are shared in memory, then so should be the function pointers, and everything should be automatically accessible, because the copying is done at the driver level, even though there is not much compiler support for manually copying (mapping) STL vectors onto the GPU. That is at least how I thought this works, and how it is described in the NVIDIA developer forum...
If GCC now had STL support on GPUs, that would be very interesting. But I can't find a document on this.
KaruroChori commented on Jun 14, 2025
I thought I had dreamt about that, but no, it is reported in the release notes:
https://gcc.gnu.org/gcc-15/changes.html#nvptx
https://gcc.gnu.org/gcc-15/changes.html#amdgcn
I have not tested this yet, since I am still on 14.2.
bschulz81 commented on Jun 14, 2025
Hi KaruroChori,
the statement
"The standard C++ library (libstdc++) is now supported and enabled."
from the GCC release notes (https://gcc.gnu.org/gcc-15/changes.html#amdgcn) is quite useless and dubious.
The C++ STL has many features which seem incompatible with the GPU, e.g. exceptions with non-local jumps, and tons of classes for which one would like a manual.
For example, the OpenMP device mapper wants variables. For a struct one has to map all its members sequentially and then indicate possible array lengths, whose pointers are then linked implicitly to the mapped variable slots of the struct.
Let's say we write
std::vector<double> v(10, 1);
For an ordinary array, I would have to do this: get a pointer,
double *d1 = v.data();
map that pointer,
#pragma omp target enter data map(to: d1[0:10])
do a loop, and unmap:
#pragma omp target exit data map(from: d1[0:10])
If I have a struct, I would have to map the members and the arrays with their lengths, along the lines of the sketch below...
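A sketch of what I mean, for a struct with a pointer member (standard OpenMP 5.x pointer attachment; the struct is made up for illustration):

#include <cstdio>

struct Buf { double *data; int n; };

int main()
{
    Buf b;
    b.n = 10;
    b.data = new double[b.n]();

    // First map the struct itself, then attach the pointed-to array
    // to the device copy of the pointer member.
    #pragma omp target enter data map(to: b)
    #pragma omp target enter data map(to: b.data[0:b.n])

    #pragma omp target teams distribute parallel for
    for (int i = 0; i < b.n; i++)
        b.data[i] += 1.0;

    #pragma omp target exit data map(from: b.data[0:b.n])
    #pragma omp target exit data map(release: b)

    printf("%f\n", b.data[0]);
    delete[] b.data;
}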
But now, if there is STL support... let's try to work purely with the STL vector.
Can this new GCC compile and run this:
With GCC 15, this even creates an executable which runs. And it runs even without me using a #pragma omp requires unified_shared_memory or a #pragma omp requires unified_address.
This is highly dubious.
Without declaring that all the memory is shared, how would the mapper know the vector length?
Have they implemented this? If so, it would be highly appreciated and desired that the code automatically uploads the entire class. But obviously, that would go far beyond the OpenMP specification, which says that the mapping pragma only maps variables or arrays, not entire STL templates...
So I think that without more documentation, what GCC does here is highly dubious.
But anyway, we are here on a Clang forum...
On Clang, if I use this one:
I get this here:
omptarget fatal error 2: '#pragma omp requires unified_shared_memory' not used consistently!
even though my Blackwell GPU should support NVIDIA HMM: https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
What works on Clang is this:
but not this one:
Here Clang complains about undefined symbols, so the STL member functions are not implemented.
But it also complains about this:
Copying? Well, I wrote
#pragma omp requires unified_address
so there should not be any copying needed by Clang. At least as far as I understand NVIDIA, this should be done by the GPU driver...
So it appears that one cannot use HMM as in NVIDIA's compiler, where one can indeed work with STL vectors in unified shared memory, despite the STL not being implemented entirely.
Unfortunately, NVIDIA's nvc++ has other bugs: it has difficulties because it assumes overlapping pointers everywhere, I ran into bugs with gang loops, and it has difficulties with OpenMP on the CPU.
So long...
Benjamin
KaruroChori commented on Jun 14, 2025
No time to read now, so I will just reply to the very first point:
Most of the standard C++ library can work perfectly fine when compiled in noexcept mode. On exceptional events it just panics and exits, and I assume the noexcept flag is used under the hood for all offload targets.
Operations which are not usually available in the context of the offload device (like IO) are handled via remote procedure calls. At least this is how Clang handles it; I assume GCC does something similar.
KaruroChori commented on Jun 14, 2025
Testing:
I can see load on the GPU when compiled with both Clang and GCC, so I assume it is working as intended.
And removing the unified-shared-memory directive still leaves it working for both compilers on my system.
With GCC (fragment):
With Clang (fragment):
Edited to also show the team number.
bschulz81 commented on Jun 14, 2025
Interesting. The numbers seem to be correct in my case. What is irritating is that the GPU in my case is device 0 and the CPU device 1. If I change the code as follows, one can clearly see that the CPU has just one team and many threads, and that the GPU (device 0) works mostly with teams and one thread for this small loop.
Also, in the snippet below I have not used the data pointer but the real STL vector, without any mapping.
Maybe GCC uses shared memory automatically if the GPU supports it... because copying an STL vector is certainly nontrivial.
For Clang in version 20, the example below fails.
It produces copying warnings and linker problems...
Maybe Clang 21 works better in this respect.
But it is interesting that apparently one does not need any mapping with GCC anymore...
The order of the outputs is of course different, but that is because printf needs to be flushed, which is done after the loop...
KaruroChori commented on Jun 14, 2025
Clang will fail out of the box even in the most recent version, as libstdc++ is not linked. How to do that is explained here: #132429 (comment)
In theory, since GCC allows linking it, we could do the same from Clang and it should work.
bschulz81 commented on Jun 14, 2025
Hm, in my case (Clang 20) this ends with...
Apart from that, your comment is very interesting.
When the mapping commands (which need a length for arrays) are not needed anymore and having just a pointer in the loop suffices, I wonder what happens if we give it not a vector but a pointer to a memory-mapped file.
In the loop,
v[i] could literally come from
double *v = (double *) mmap(NULL, length, prot, flags, fd, offset);
https://www.gnu.org/software/libc/manual/html_node/Memory_002dmapped-I_002fO.html
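Something like this sketch is what I have in mind (assuming a system where HMM/unified shared memory makes host pointers valid on the device; the file name is made up and error handling is omitted):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#pragma omp requires unified_shared_memory

int main()
{
    int fd = open("data.bin", O_RDONLY);  // hypothetical file full of doubles
    size_t length = 1UL << 30;            // assume 1 GiB, possibly larger than VRAM
    size_t n = length / sizeof(double);
    double *v = (double *) mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);

    double sum = 0;
    // If the driver pages the mapping in on demand, no explicit map() should be needed.
    #pragma omp target teams distribute parallel for reduction(+: sum)
    for (size_t i = 0; i < n; i++)
        sum += v[i];

    printf("%f\n", sum);
    munmap(v, length);
    close(fd);
}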
If that file is larger than the GPU memory, is it then loaded by the GPU driver on demand?
I.e., if we have a 60 GB file and VRAM is just 30 GB, does the driver then decide that the current iterations of our thread group need, say, 100 MB, upload those asynchronously, write everything back (in the background), and then load the next chunk for the next threads?
Or does it limit the threads such that they only load the file into VRAM until it is full?
And what happens with older cards that do not have HMM?
Still a bit nebulous how this works...
KaruroChori commented on Jun 14, 2025
It should be working. I am counting on it for a library I am writing.
My guess is that pages of that virtual address range are loaded and purged based on the order in which they are requested from the virtual memory space on the offload devices.
So blowing past the limit of physical VRAM should not stop the application from working, but it might slow it down a lot.
The same logic could work for cards without support for the unified memory space, but it would require software to emulate the behaviour of an MMU to translate addresses and handle synchronization via traps? Not the simplest of things, and I guess it would have a negative impact on performance.
bschulz81 commented on Jun 16, 2025
Hi KaruroChori,
I made some speed tests. CUDA should support asynchronous memory transfer:
https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/
That way, while parts of the loop are computed, data for the next chunk could be uploaded to the GPU, and even linear operations would then be considerably fast.
OpenMP supports asynchronous offloading with nowait clauses on mapping commands and depend clauses on target constructs, as in the sketch below.
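For reference, the pattern I tried looks roughly like this (a sketch; ordering the tasks through depend clauses on the pointer is one possible way):

#include <vector>
#include <stdio.h>

int main()
{
    const int n = 1 << 20;
    std::vector<double> v(n, 1.0);
    double *a = v.data();

    // Asynchronous upload: returns immediately, ordered through depend(a).
    #pragma omp target enter data map(to: a[0:n]) nowait depend(out: a)

    // The kernel waits for the upload and itself runs asynchronously.
    #pragma omp target teams distribute parallel for nowait depend(inout: a)
    for (int i = 0; i < n; i++)
        a[i] += 1.0;

    // Download after the kernel, then wait for the whole chain.
    #pragma omp target exit data map(from: a[0:n]) nowait depend(in: a)
    #pragma omp taskwait

    printf("%f\n", a[0]);
}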
However, it turns out that, at least with gcc-15.1, the automatic mechanism without mapping commands is currently very slow. When I upload manually, there is a speedup of around 8x for a naive upload. When I use a cleverer upload strategy (only alloc, and then download just the result), the speedup from manually calling target enter data (to: ...) and so on is around 16x. Sadly, when I use asynchronous offloading manually, there is no speed advantage on gcc-15.1.
When the program contains a #pragma omp requires unified_address, the speed is the same as without it on my Blackwell 5060 GPU with HMM. Sadly, when I instead use a #pragma omp requires unified_shared_memory, it slows down considerably. I do not know why.
Here is a benchmark program with some numbers for gcc-15.1:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120679
I suspect that at least GCC has something to improve there.
I have not yet installed clang-21.
When it is unmasked in Gentoo I will try it out and get the corresponding numbers for Clang in that situation...
Hopefully the two compiler teams learn from each other and improve in healthy competition...
KaruroChori commented on Jun 16, 2025
Hi, since you already made the effort, would you make the code you used for testing public?
Just saw you made it public as an attachment in the bug report. I can package it with some benchmarking library, configure a test suite, and ensure we can run it programmatically against several flags, compilers and hardware.
bschulz81 commented on Jun 17, 2025
Hi KaruroChori,
for a benchmark, I think one should add something that is not linear, like matrix multiplication, and then test it with various loop variants (teams distribute, parallel for, simd, and perhaps also the new loop construct).
I forgot to add a license to my code.
Consider it as under the MIT License and add code as you please. And also benchmark Clang, because this here is a Clang forum.
On the CPU, I found that valarray, if accessed with operator[] in a loop, is quite fast. valarray, however, exposes no data() pointer (though one can still take the address of the first element, as in the sketch below). I do not know how it works on the GPU.
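A minimal sketch of that workaround (&v[0] addresses contiguous storage; the mapping clause is only illustrative):

#include <valarray>
#include <stdio.h>

int main()
{
    const int n = 10;
    std::valarray<double> v(1.0, n);
    double *p = &v[0]; // contiguous storage, but no data() member

    #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
    for (int i = 0; i < n; i++)
        p[i] += 1.0;

    printf("%f\n", v[1]);
}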
I suspect that, especially with asynchronous offloading, there is much room for optimization in these compilers.
Offloading more data while a computation runs on parts of the data could really speed things up, if it worked properly... This holds especially for setups like unified_address or unified_shared_memory or default mapping, where the offloading is done somewhat automatically... As it is a time-critical process, benchmarks are certainly of value...
KaruroChori commented on Jun 17, 2025
Just a quick report, as things do not add up. There are two
operation_gpu_without_mapping_with_pointers
functions being overloaded, and the second does not make much sense to me, since you first unwrap the data from the vector and then use the vector regardless. Could you check that?

KaruroChori commented on Jun 17, 2025
I have to stop here for today, but I added your tests to a repo; I am slowly refactoring the code a bit.
https://github.com/KaruroChori/test-openmp
I also placed assertions to ensure the resulting vectors are correct, but one of the original tests was failing; it just was not detected before.
I wrote
build.sh
to fit my toolchains, so some changes might be needed on your side before running.

bschulz81 commented on Jun 17, 2025
Hi KaruroChori, please put the assertions only in the functions speedmeasurement and especially speedmeasurement2 after the last mapping statement, not immediately after the loops, since especially in speedmeasurement2 the data is not meant to be downloaded immediately after the loops... Perhaps the tests pass then?
KaruroChori commented on Jun 17, 2025
Right, without an explicit fence, in some of the test cases the checks can run before completion. I will change that later as you suggested.
KaruroChori commented on Jun 18, 2025
Ok, I have some results from my side with clang-21. You might want to rerun the same tests on gcc-15, as my copy broke and I cannot compile with OpenMP any longer :(.
I tested both with and without
-fopenmp-cuda-mode
but for this example there is no measurable difference. unified_shared_memory is enabled.
I also added two more tests:
on GPU already mapped vector with pointers
and with vectors.
Also, the compiler will complain a lot with warnings, since you applied simd to several loops which do not accept it. But those are just ignored.
It really seems like pre-mapping memory for std::vector is not sufficient to avoid some expensive calls between host and device.
Also, is just reading a page of memory enough to invalidate it on all the other devices/the host? I had to disable one of the assertion checks because of that. I have not been able to fully test this, but if so, it is overly restrictive.
bschulz81 commented on Jun 18, 2025
Hi KaruroChori,
in general, what the program does:
Of course, to have the loop called via a function pointer, I sometimes pass it to the function that executes the loop in a CUDA kernel as a vector, but that should not matter. For the mapping, it is relevant whether the addresses are mapped.
Well, actually what your test data shows is that pointer access in Clang is much faster than looping over vectors, but that this automatic mapping, which likely just keeps the vector in shared memory from the beginning, is very slow compared to mapping it to GPU memory, which has special sections for fast-access memory.
Unfortunately, unlike omp_target_alloc, the mapping pragmas of OpenMP have no qualifier for the kind of GPU memory in which the data should be placed.
What you can now try is whether it gets faster by using asynchronous commands better, e.g. remove the taskwait barrier before the loop and put a depend clause for the variables on the #pragma omp target teams distribute...
But there is always some offset for the copy. Has to be. It is sad that this offset for the copy is much slower than the loop itself.
So for linear problems, like a scalar product with one vector and then another scalar product with another vector, the GPU is not really the correct tool, unless perhaps one has an onboard GPU with really fast unified shared memory...
However, it is usually a compiler's job to analyze into which memory one has to place variables, or whether one should start a loop already on the parts of the data that have been copied, to make everything faster...
As CUDA has many copy methods and several kinds of memory that are not present on CPUs, there are novel problems for the optimizer of a compiler...
I hoped with my post to shed some light on this. It unfortunately does not suffice just to make the "STL work fast" or to produce working CUDA loops.
No, the copy into memory also has to be taken into consideration during optimization. And loops may be started such that they work on partial data while the copy finishes... This is a somewhat novel situation for a compiler optimizer.
bschulz81 commented on Jun 18, 2025
@KaruroChori
I have not yet looked at your code, but what is the difference between them?
The function speedmeasurement2 just measures the loop without taking the mapping into account.
What does "without mapping with pointers" mean?
In my benchmark, "without mapping" stands for not issuing a manual mapping command. That produced, for a loop over vectors and pointers, ultra-slow results like this:
Although on GCC the vector access is about as fast as access via pointers, this is apparently not yet so in Clang. Issuing the mapping command gave this:
on GPU with mapping with pointers: 26.9728
And when you mapped the pointers and took the copying time out of the benchmark, the result of the loop over pointers was this:
on GPU already mapped vector with pointers: 0.504666
So I wonder what this is in your benchmark. How did you speed that up additionally?
on GPU without mapping with pointers II: 0.419876 new