Skip to content

Possible performance loss with f32 arithmetic #91447

Closed
@Yuri6037

Description

@Yuri6037

I've tried, out of curiosity, a floating point arithmetic test and found quite a big difference between C++ and Rust.

The code used in rust

pub struct Stats
{
    x: f32,
    y: f32,
    z: f32
}

pub fn sum(a: &Stats, b: &Stats) -> Stats
{
    Stats {
        x: a.x + b.x,
        y: a.y + b.y,
        z: a.z + b.z
    }
}

The code used in C++

struct Stats
{
    float x;
    float y;
    float z;
};

Stats sum(const Stats &a, const Stats &b)
{
    return Stats {
        a.x + b.x,
        a.y + b.y,
        a.z + b.z
    };
}

Here is a link to a godbolt for side-by-side comparision of assembly output: https://godbolt.org/z/dqc4b74rv

Rust seem to absolutely want the floats back into e* registers instead of keeping them in xmm registers, C++ leaves them into the xmm registers. In some cases it might more advantageous to leave the floats in xmm registers for future operations on them rather then passing them back into the e* registers.

Activity

Urgau

Urgau commented on Dec 2, 2021

@Urgau
Member

After a quick godbolt I find that clang and rustc use the same representation (%Stats = type { float, float, float }) expect for rustc <= 1.47 (%Stats = type { [0 x i32], float, [0 x i32], float, [0 x i32], float, [0 x i32] }). Not sure if that change something because their all at 0.
Another difference between the two rustc is the return type: void for rustc <= 1.47 and i96 for rustc > 1.47 which cause some bitcast and load at the end, which might explain the difference in codegen. Also why is this i96 and not float like the rest ?
The last noticeable difference is that clang process the data in <2 x float>, float instead of float, float, float which allow more optimizations to be done.

My hypothesis is that this codegen issue is probably cause by the i96 return type, which is probably cause by a regression in codegen in rustc > 1.47 but I'm not an expert in rustc_codegen_llvm or llvm itself so this needs to be confirm by the rustc llvm workgroup.

clang (trunk with -O0)

Assembly with -O3
sum(Stats const&, Stats const&):                      # @sum(Stats const&, Stats const&)
        movss   xmm0, dword ptr [rdi]           # xmm0 = mem[0],zero,zero,zero
        movss   xmm1, dword ptr [rdi + 4]       # xmm1 = mem[0],zero,zero,zero
        addss   xmm0, dword ptr [rsi]
        addss   xmm1, dword ptr [rsi + 4]
        unpcklps        xmm0, xmm1                      # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
        movss   xmm1, dword ptr [rdi + 8]       # xmm1 = mem[0],zero,zero,zero
        addss   xmm1, dword ptr [rsi + 8]
        ret
%struct.Stats = type { float, float, float }

define dso_local { <2 x float>, float } @_Z3sumRK5StatsS1_(%struct.Stats* nonnull align 4 dereferenceable(12) %0, %struct.Stats* nonnull align 4 dereferenceable(12) %1) #0 !dbg !105 {
  %3 = alloca %struct.Stats, align 4
  %4 = alloca %struct.Stats*, align 8
  %5 = alloca %struct.Stats*, align 8
  %6 = alloca { <2 x float>, float }, align 8
  store %struct.Stats* %0, %struct.Stats** %4, align 8
  call void @llvm.dbg.declare(metadata %struct.Stats** %4, metadata !111, metadata !DIExpression()), !dbg !112
  store %struct.Stats* %1, %struct.Stats** %5, align 8
  call void @llvm.dbg.declare(metadata %struct.Stats** %5, metadata !113, metadata !DIExpression()), !dbg !114
  %7 = getelementptr inbounds %struct.Stats, %struct.Stats* %3, i32 0, i32 0, !dbg !115
  %8 = load %struct.Stats*, %struct.Stats** %4, align 8, !dbg !116
  %9 = getelementptr inbounds %struct.Stats, %struct.Stats* %8, i32 0, i32 0, !dbg !117
  %10 = load float, float* %9, align 4, !dbg !117
  %11 = load %struct.Stats*, %struct.Stats** %5, align 8, !dbg !118
  %12 = getelementptr inbounds %struct.Stats, %struct.Stats* %11, i32 0, i32 0, !dbg !119
  %13 = load float, float* %12, align 4, !dbg !119
  %14 = fadd float %10, %13, !dbg !120
  store float %14, float* %7, align 4, !dbg !115
  %15 = getelementptr inbounds %struct.Stats, %struct.Stats* %3, i32 0, i32 1, !dbg !115
  %16 = load %struct.Stats*, %struct.Stats** %4, align 8, !dbg !121
  %17 = getelementptr inbounds %struct.Stats, %struct.Stats* %16, i32 0, i32 1, !dbg !122
  %18 = load float, float* %17, align 4, !dbg !122
  %19 = load %struct.Stats*, %struct.Stats** %5, align 8, !dbg !123
  %20 = getelementptr inbounds %struct.Stats, %struct.Stats* %19, i32 0, i32 1, !dbg !124
  %21 = load float, float* %20, align 4, !dbg !124
  %22 = fadd float %18, %21, !dbg !125
  store float %22, float* %15, align 4, !dbg !115
  %23 = getelementptr inbounds %struct.Stats, %struct.Stats* %3, i32 0, i32 2, !dbg !115
  %24 = load %struct.Stats*, %struct.Stats** %4, align 8, !dbg !126
  %25 = getelementptr inbounds %struct.Stats, %struct.Stats* %24, i32 0, i32 2, !dbg !127
  %26 = load float, float* %25, align 4, !dbg !127
  %27 = load %struct.Stats*, %struct.Stats** %5, align 8, !dbg !128
  %28 = getelementptr inbounds %struct.Stats, %struct.Stats* %27, i32 0, i32 2, !dbg !129
  %29 = load float, float* %28, align 4, !dbg !129
  %30 = fadd float %26, %29, !dbg !130
  store float %30, float* %23, align 4, !dbg !115
  %31 = bitcast { <2 x float>, float }* %6 to i8*, !dbg !131
  %32 = bitcast %struct.Stats* %3 to i8*, !dbg !131
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %31, i8* align 4 %32, i64 12, i1 false), !dbg !131
  %33 = load { <2 x float>, float }, { <2 x float>, float }* %6, align 8, !dbg !131
  ret { <2 x float>, float } %33, !dbg !131
}

rustc (1.56 with -C opt-level=0 --emit=llvm-ir)

Assembly with -O
example::sum:
        movss   xmm0, dword ptr [rdi]
        addss   xmm0, dword ptr [rsi]
        movsd   xmm1, qword ptr [rdi + 4]
        movsd   xmm2, qword ptr [rsi + 4]
        addps   xmm2, xmm1
        movd    eax, xmm0
        movd    ecx, xmm2
        shufps  xmm2, xmm2, 85
        movd    edx, xmm2
        shl     rcx, 32
        or      rax, rcx
        ret
%Stats = type { float, float, float }

define i96 @_ZN7example3sum17h2dbae3ec3349ec5dE(%Stats* align 4 dereferenceable(12) %a, %Stats* align 4 dereferenceable(12) %b) unnamed_addr #0 !dbg !6 {
  %0 = alloca %Stats, align 4
  %1 = bitcast %Stats* %a to float*, !dbg !10
  %_4 = load float, float* %1, align 4, !dbg !10
  %2 = bitcast %Stats* %b to float*, !dbg !11
  %_5 = load float, float* %2, align 4, !dbg !11
  %_3 = fadd float %_4, %_5, !dbg !10
  %3 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 1, !dbg !12
  %_7 = load float, float* %3, align 4, !dbg !12
  %4 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 1, !dbg !13
  %_8 = load float, float* %4, align 4, !dbg !13
  %_6 = fadd float %_7, %_8, !dbg !12
  %5 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 2, !dbg !14
  %_10 = load float, float* %5, align 4, !dbg !14
  %6 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 2, !dbg !15
  %_11 = load float, float* %6, align 4, !dbg !15
  %_9 = fadd float %_10, %_11, !dbg !14
  %7 = bitcast %Stats* %0 to float*, !dbg !16
  store float %_3, float* %7, align 4, !dbg !16
  %8 = getelementptr inbounds %Stats, %Stats* %0, i32 0, i32 1, !dbg !16
  store float %_6, float* %8, align 4, !dbg !16
  %9 = getelementptr inbounds %Stats, %Stats* %0, i32 0, i32 2, !dbg !16
  store float %_9, float* %9, align 4, !dbg !16
  %10 = bitcast %Stats* %0 to i96*, !dbg !17
  %11 = load i96, i96* %10, align 4, !dbg !17
  ret i96 %11, !dbg !17
}

rustc (1.47 with -C opt-level=0 --emit=llvm-ir)

Assembly with -O
example::sum:
        movss   xmm0, dword ptr [rsi]
        movss   xmm1, dword ptr [rsi + 4]
        addss   xmm0, dword ptr [rdx]
        addss   xmm1, dword ptr [rdx + 4]
        movss   xmm2, dword ptr [rsi + 8]
        addss   xmm2, dword ptr [rdx + 8]
        mov     rax, rdi
        movss   dword ptr [rdi], xmm0
        movss   dword ptr [rdi + 4], xmm1
        movss   dword ptr [rdi + 8], xmm2
        ret
%Stats = type { [0 x i32], float, [0 x i32], float, [0 x i32], float, [0 x i32] }

define void @_ZN7example3sum17hdebae303c906e224E(%Stats* noalias nocapture sret dereferenceable(12) %0, %Stats* noalias readonly align 4 dereferenceable(12) %a, %Stats* noalias readonly align 4 dereferenceable(12) %b) unnamed_addr #0 !dbg !6 {
  %1 = bitcast %Stats* %a to float*, !dbg !10
  %_4 = load float, float* %1, align 4, !dbg !10
  %2 = bitcast %Stats* %b to float*, !dbg !11
  %_5 = load float, float* %2, align 4, !dbg !11
  %_3 = fadd float %_4, %_5, !dbg !10
  %3 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 3, !dbg !12
  %_7 = load float, float* %3, align 4, !dbg !12
  %4 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 3, !dbg !13
  %_8 = load float, float* %4, align 4, !dbg !13
  %_6 = fadd float %_7, %_8, !dbg !12
  %5 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 5, !dbg !14
  %_10 = load float, float* %5, align 4, !dbg !14
  %6 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 5, !dbg !15
  %_11 = load float, float* %6, align 4, !dbg !15
  %_9 = fadd float %_10, %_11, !dbg !14
  %7 = bitcast %Stats* %0 to float*, !dbg !16
  store float %_3, float* %7, align 4, !dbg !16
  %8 = getelementptr inbounds %Stats, %Stats* %0, i32 0, i32 3, !dbg !16
  store float %_6, float* %8, align 4, !dbg !16
  %9 = getelementptr inbounds %Stats, %Stats* %0, i32 0, i32 5, !dbg !16
  store float %_9, float* %9, align 4, !dbg !16
  ret void, !dbg !17
}
Urgau

Urgau commented on Dec 2, 2021

@Urgau
Member

I can confirm that the problem here is the return type i96 in the LLVM IR because if I use an out parameter the assembly is what I would expect (without the vector optimization done by clang) (godbolt):

pub fn sum(a: &Stats, b: &Stats, res: &mut Stats)
{
    res.x = a.x + b.x;
    res.y = a.y + b.y;
    res.z = a.z + b.z;
}
example::sum:
        movss   xmm0, dword ptr [rdi]
        addss   xmm0, dword ptr [rsi]
        movss   dword ptr [rdx], xmm0
        movss   xmm0, dword ptr [rdi + 4]
        addss   xmm0, dword ptr [rsi + 4]
        movss   dword ptr [rdx + 4], xmm0
        movss   xmm0, dword ptr [rdi + 8]
        addss   xmm0, dword ptr [rsi + 8]
        movss   dword ptr [rdx + 8], xmm0
        ret
LLVM IR at -C opt-level = 0
%Stats = type { float, float, float }

define void @_ZN7example3sum17h258bf76ba01aa602E(%Stats* align 4 dereferenceable(12) %a, %Stats* align 4 dereferenceable(12) %b, %Stats* align 4 dereferenceable(12) %res) unnamed_addr #0 !dbg !6 {
  %0 = bitcast %Stats* %a to float*, !dbg !10
  %_4 = load float, float* %0, align 4, !dbg !10
  %1 = bitcast %Stats* %b to float*, !dbg !11
  %_5 = load float, float* %1, align 4, !dbg !11
  %2 = bitcast %Stats* %res to float*, !dbg !12
  %3 = fadd float %_4, %_5, !dbg !12
  store float %3, float* %2, align 4, !dbg !12
  %4 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 1, !dbg !13
  %_6 = load float, float* %4, align 4, !dbg !13
  %5 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 1, !dbg !14
  %_7 = load float, float* %5, align 4, !dbg !14
  %6 = getelementptr inbounds %Stats, %Stats* %res, i32 0, i32 1, !dbg !15
  %7 = fadd float %_6, %_7, !dbg !15
  store float %7, float* %6, align 4, !dbg !15
  %8 = getelementptr inbounds %Stats, %Stats* %a, i32 0, i32 2, !dbg !16
  %_8 = load float, float* %8, align 4, !dbg !16
  %9 = getelementptr inbounds %Stats, %Stats* %b, i32 0, i32 2, !dbg !17
  %_9 = load float, float* %9, align 4, !dbg !17
  %10 = getelementptr inbounds %Stats, %Stats* %res, i32 0, i32 2, !dbg !18
  %11 = fadd float %_8, %_9, !dbg !18
  store float %11, float* %10, align 4, !dbg !18
  ret void, !dbg !19
}
Yuri6037

Yuri6037 commented on Dec 2, 2021

@Yuri6037
Author

The question then becomes why is rustc forcing an i96 as return type?

added
A-codegenArea: Code generation
A-floating-pointArea: Floating point numbers and arithmetic
A-LLVMArea: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-bugCategory: This is a bug.
I-slowIssue: Problems and improvements with respect to performance of generated code.
T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.
on Dec 2, 2021
MSxDOS

MSxDOS commented on Dec 3, 2021

@MSxDOS

This seems to be the same regression as #85265

17 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-LLVMArea: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.A-codegenArea: Code generationA-floating-pointArea: Floating point numbers and arithmeticC-bugCategory: This is a bug.I-slowIssue: Problems and improvements with respect to performance of generated code.T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Participants

      @Urgau@Yuri6037@MSxDOS@scottmcm@inquisitivecrystal

      Issue actions

        Possible performance loss with f32 arithmetic · Issue #91447 · rust-lang/rust