Description
Based on this discussion: https://groups.google.com/g/x86-64-abi/c/FMhl2vDl1D8
Currently, llvm passes `__m256` and `__m512` parameters/return values when it cannot use ymm/zmm registers as follows:
- Parameters are passed on the stack
- Return values are spread across 2-4 `xmm` registers.
Further, when the avx/avx512f features are enabled at the function level (not globally, using `__attribute__((target))`), it passes parameters/return values:
- Parameters are passed on the stack
- Return values are placed in a single `ymm`/`zmm` register.
In contrast, the behaviour of gcc (which is apparently the correct behaviour in both cases) is:
When ymm/zmm registers are unavailable:
- Parameters are passed on the stack
- Return values in memory (return pointer in rdi)
When ymm/zmm registers are available at the function level (using `__attribute__((target))`), it passes and returns values as it does when the feature is available globally via a `-m` flag.
The difference in behaviour can be demonstrated by https://godbolt.org/z/8sYcn6654.
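For readers who cannot open the link, here is a minimal sketch of the kind of test case involved. It is not the code behind the godbolt link; the function names are hypothetical, and it assumes an x86-64 target compiled without `-mavx`, so that only the functions marked with `__attribute__((target("avx")))` have ymm registers available.

```c
#include <immintrin.h>

/* Case 1: AVX is not enabled for this function (assuming no -mavx on the
   command line), so ymm registers are unavailable here. Compilers may warn
   that using __m256 without AVX changes the ABI. */
__m256 second_arg_no_avx(__m256 a, __m256 b) {
    (void)a;
    return b;
}

/* Case 2: AVX is enabled for this function only. */
__attribute__((target("avx")))
__m256 second_arg_local_avx(__m256 a, __m256 b) {
    (void)a;
    return b;
}

/* Hypothetical driver. It is also AVX-enabled, so caller and callee agree
   on the available features. Returns 1 if the second argument came back
   unchanged, 0 otherwise. */
__attribute__((target("avx")))
int local_avx_roundtrip(void) {
    float out[8];
    __m256 r = second_arg_local_avx(_mm256_set1_ps(1.0f), _mm256_set1_ps(2.0f));
    _mm256_storeu_ps(out, r);
    for (int i = 0; i < 8; i++) {
        if (out[i] != 2.0f) {
            return 0;
        }
    }
    return 1;
}
```

Compiling this with `-O2 -S` (and no `-mavx`) under both gcc and clang and comparing how `second_arg_no_avx` and `second_arg_local_avx` receive `b` and return it should show the difference described above.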
Based on a short discussion on the x86-64 psABI mailing list, llvm's behaviour appears to be incorrect in both cases. When returning without the registers available, the value must be returned in memory: the ABI requires the 2nd SSEUP eightbyte to be placed in the 3rd eightbyte of `xmm0`, which fails (an `xmm` register only holds two eightbytes), and that sends the entire value to memory. In the locally-enabled case, the registers are available, so it should be passing fully in `ymm1` and returning fully in `ymm0` (llvm seems to think that the register is available, given that it does return in `ymm0`).