Description
When creating references to #[repr(align)] types wrapped in enums, LLVM generates suboptimal assembly with redundant memory operations, despite the reference being unused. This occurs even at opt-level=3.
I tried this code (opt-level=3):
https://godbolt.org/z/P8E4hsdbn
#![crate_type = "lib"]

#[repr(align(64))]
pub struct Align64(i32);

pub enum Enum64 {
    A(Align64),
    B(i32),
}

/// Processes data and returns an Enum64 variant.
/// Logs intermediate state for debugging purposes.
#[no_mangle]
pub fn process_data(a: Align64) -> Enum64 {
    let result = Enum64::A(a);
    // Common debugging pattern: logging intermediate values
    log_intermediate(&result);
    result
}

#[inline(never)]
fn log_intermediate(_e: &Enum64) {
    // Even an empty function forces the reference to be materialized
}
I expected to see this happen:
process_data:
mov rax, rdi
movaps xmm0, xmmword ptr [rsi]
movaps xmm1, xmmword ptr [rsi + 16]
movaps xmm2, xmmword ptr [rsi + 32]
movaps xmm3, xmmword ptr [rsi + 48]
movaps xmmword ptr [rdi + 112], xmm3
movaps xmmword ptr [rdi + 96], xmm2
movaps xmmword ptr [rdi + 80], xmm1
movaps xmmword ptr [rdi + 64], xmm0
mov dword ptr [rdi], 0
ret
Instead, this happened:
process_data:
mov rax, rdi
movups xmm0, xmmword ptr [rsi]
movups xmm1, xmmword ptr [rsi + 16]
movups xmm2, xmmword ptr [rsi + 32]
movups xmm3, xmmword ptr [rsi + 48]
movups xmmword ptr [rsp - 16], xmm3
movups xmmword ptr [rsp - 32], xmm2
movups xmmword ptr [rsp - 48], xmm1
movups xmmword ptr [rsp - 64], xmm0
mov dword ptr [rdi], 0
movups xmm0, xmmword ptr [rsp - 124]
movups xmm1, xmmword ptr [rsp - 108]
movups xmm2, xmmword ptr [rsp - 92]
movups xmm3, xmmword ptr [rsp - 76]
movups xmmword ptr [rdi + 4], xmm0
movups xmmword ptr [rdi + 20], xmm1
movups xmmword ptr [rdi + 36], xmm2
movups xmmword ptr [rdi + 52], xmm3
movups xmm0, xmmword ptr [rsp - 60]
movups xmmword ptr [rdi + 68], xmm0
movups xmm0, xmmword ptr [rsp - 44]
movups xmmword ptr [rdi + 84], xmm0
movups xmm0, xmmword ptr [rsp - 28]
movups xmmword ptr [rdi + 100], xmm0
movups xmm0, xmmword ptr [rsp - 16]
movups xmmword ptr [rdi + 112], xmm0
ret
Performance Impact
1. Instruction count: 27 vs 11 instructions in the listings above (roughly 2.5x increase)
2. Memory operations:
   - 2x bandwidth usage (128B vs 64B transferred): the payload is copied to the stack and back
   - Unnecessary stack spills
3. Instruction selection:
   - Uses movups (unaligned) instead of movaps (aligned)
   - Missed opportunity for aligned vector ops
Real-World Relevance
This pattern occurs in:
1. Debug logging (even when logs are disabled)
2. Generic code passing references
3. Derive macros (e.g., #[derive(Debug)])
4. Error handling paths
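As a concrete illustration of items 2 and 3, the same reference-taking shape appears when a generic helper or a derived Debug impl borrows the value. The names below (trace, the gating comment) are hypothetical; this is a sketch of the pattern, not a second minimized reproduction:

```rust
use std::fmt::Debug;

#[repr(align(64))]
#[derive(Debug)]
pub struct Align64(i32);

#[derive(Debug)]
pub enum Enum64 {
    A(Align64),
    B(i32),
}

// Generic helper taking &T: the same shape as log_intermediate above.
#[inline(never)]
fn trace<T: Debug>(_value: &T) {
    // In real code this would be gated logging; even when the log level
    // is disabled, the reference is still created at the call site.
}

pub fn process_data(a: Align64) -> Enum64 {
    let result = Enum64::A(a);
    trace(&result);
    result
}

fn main() {
    let out = process_data(Align64(7));
    // The value round-trips unchanged; only the codegen quality differs.
    assert!(matches!(out, Enum64::A(Align64(7))));
    println!("{:?}", out);
}
```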
Could you please review the situation? Thank you!
Meta
rustc 1.85.0-nightly (d117b7f21 2024-12-31)
binary: rustc
commit-hash: d117b7f211835282b3b177dc64245fff0327c04c
commit-date: 2024-12-31
host: x86_64-unknown-linux-gnu
release: 1.85.0-nightly
LLVM version: 19.1.6