Description
Looking for potential violations of `simd_*` intrinsic preconditions, I found this in stdarch:
```rust
/// Compute dot-product of BF16 (16-bit) floating-point pairs in a and b,
/// accumulating the intermediate single-precision (32-bit) floating-point elements
/// with elements in src, and store the results in dst using zeromask k
/// (elements are zeroed out when the corresponding mask bit is not set).
/// [Intel's documentation](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=1769,1651,1654,1657,1660&avx512techs=AVX512_BF16&text=_mm_maskz_dpbf16_ps)
#[inline]
#[target_feature(enable = "avx512bf16,avx512vl")]
#[unstable(feature = "stdarch_x86_avx512", issue = "111137")]
#[cfg_attr(test, assert_instr("vdpbf16ps"))]
pub fn _mm_maskz_dpbf16_ps(k: __mmask8, src: __m128, a: __m128bh, b: __m128bh) -> __m128 {
    unsafe {
        let rst = _mm_dpbf16_ps(src, a, b).as_f32x4();
        let zero = _mm_set1_ps(0.0_f32).as_f32x4();
        transmute(simd_select_bitmask(k, rst, zero))
    }
}
```
`simd_select_bitmask` is documented to require that all the "extra"/"padding" bits in the mask (those not corresponding to a vector element) must be 0. Here, `rst` and `zero` are vectors of length 4, and the mask `k` is a `u8`, meaning there are 4 bits in `k` that must be 0. However, nothing in the function actually ensures that.
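For concreteness, here is a hypothetical caller (my own sketch, not from the issue; it needs nightly because the intrinsic is unstable) showing that the safe signature lets the padding bits reach `simd_select_bitmask` set to 1:

```rust
#![feature(stdarch_x86_avx512)] // nightly-only: the intrinsic is unstable
use std::arch::x86_64::*;

// Hypothetical caller (sketch): the safe signature accepts any u8,
// including one whose upper four bits are set, even though the
// __m128 result only has four lanes.
#[target_feature(enable = "avx512bf16,avx512vl")]
fn demo(src: __m128, a: __m128bh, b: __m128bh) -> __m128 {
    let k: __mmask8 = 0b1111_0101; // bits 4..7 are "padding" for 4 lanes
    // This reaches simd_select_bitmask with nonzero padding bits.
    _mm_maskz_dpbf16_ps(k, src, a, b)
}
```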
I don't know the intended behavior of the intrinsic for that case (probably Intel promises to just ignore the extra bits?), but this function recently got marked as safe (in rust-lang/stdarch#1714), and that is clearly in contradiction with our intrinsic docs. I assume the safety is correct, as probably the intrinsic should have no precondition; in that case we have to
- either explicitly mask out the higher bits (see the sketch after this list),
- or figure out whether we can remove the UB from `simd_select_bitmask`.
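For illustration, a minimal sketch of the first option (my sketch, not an actual stdarch patch; attributes omitted, and assuming the low four bits are the ones that map to lanes 0..3):

```rust
pub fn _mm_maskz_dpbf16_ps(k: __mmask8, src: __m128, a: __m128bh, b: __m128bh) -> __m128 {
    unsafe {
        let rst = _mm_dpbf16_ps(src, a, b).as_f32x4();
        let zero = _mm_set1_ps(0.0_f32).as_f32x4();
        // Clear the four padding bits so the documented precondition of
        // simd_select_bitmask (all non-lane bits zero) holds for any caller.
        transmute(simd_select_bitmask(k & 0x0F, rst, zero))
    }
}
```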
Activity
RalfJung commented on Mar 3, 2025
This is a similar case:
I did not do an exhaustive search.
Amanieu commented on Mar 3, 2025
`__mmask8` is a type alias for `u8` where each bit represents one vector element. (misread your comment)
Amanieu commented on Mar 3, 2025
Right, I believe the intent is to ignore the unused bits here.
RalfJung commented on Mar 3, 2025
A quick look at Intel's docs confirms this.
So, the question is: can we change the implementation to use `k & 0xF` to mask out the higher bits (assuming I got the bit order right)? Will LLVM know that it can contract `simd_select_bitmask(k & 0xF, ...)` into a single instruction on x86, based on how that architecture behaves?
Or do we have to dig into the `simd_select_bitmask` implementation and see if we can remove the UB? If I understand correctly what our LLVM backend does, it `trunc`s the `i8` to an `i4` and then `bitcast`s that to `<4 x i1>`, so it does indeed ignore the other bits. But that also means it is likely that the bitwise-and followed by the `trunc` would get optimized to the Right Thing by LLVM.
jhorstmann commented on Mar 3, 2025
That is also my understanding of the LLVM implementation. It will truncate the mask to an integer with the number of bits corresponding to the number of lanes, then bitcast that to a vector:

rust/compiler/rustc_codegen_llvm/src/intrinsic.rs (line 1284 at 81d8edc)

So any higher bits will be ignored.
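In other words, here is a scalar model of the 4-lane case (my own sketch for illustration, not the actual backend code, assuming bit `i` of the mask maps to lane `i` as the trunc-then-bitcast lowering implies):

```rust
// Illustrative scalar model of the trunc-then-bitcast lowering described
// above (assumption: bit i of the mask selects lane i).
fn select_bitmask_4(k: u8, if_true: [f32; 4], if_false: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for lane in 0..4 {
        // Bit `lane` of k picks that lane's source; bits 4..8 are never
        // read, matching the `trunc i8 -> i4` before the bitcast to <4 x i1>.
        out[lane] = if (k >> lane) & 1 != 0 { if_true[lane] } else { if_false[lane] };
    }
    out
}
```

Under this model, `k` and `k & 0x0F` produce identical results, which is why the explicit masking should be a no-op for this backend.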
RalfJung commented on Mar 3, 2025
I wonder if our other backends work the same; Cc @bjorn3 @GuillaumeGomez
bjorn3 commented on Mar 3, 2025
cg_clif currently just checks if the respective lane in the mask is equal to 0 or not: https://github.com/rust-lang/rustc_codegen_cranelift/blob/0f9c09fb3a64ff11ea81446a96907cd5e86490c2/src/intrinsics/simd.rs#L788-L790
RalfJung commented on Mar 3, 2025
Okay, so that is also fine with arbitrary data in the "extra" bits.
bjorn3 commented on Mar 3, 2025
Yeah, extra bits are ignored.