
[X86] Worse runtime performance on Zen CPU when optimizing for Zen #90985

Closed
@Systemcluster

Description

@Systemcluster

The following code compiled with -O3 -march=znver4 (or any other znver) runs around 25% slower on Zen hardware than when compiled with -O3 -march=x86-64-v4 or the baseline x86-64.

bool check_prime(int64_t n) {
    if (n < 2) {
        return true;
    }
    int64_t lim = (int64_t)ceil((double)n / 2.0);
    for (int64_t i = 2; i < lim; i++) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}
Full code
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

bool check_prime(int64_t n) {
    if (n < 2) {
        return true;
    }
    int64_t lim = (int64_t)ceil((double)n / 2.0);
    for (int64_t i = 2; i < lim; i++) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}

int main() {
    clock_t now = clock();
    int sum = 0;
    for (int i = 0; i < 1000000; i++) {
        if (check_prime(i)) {
            sum += 1;
        }
    }
    printf("%f, %d\n", (double)(clock() - now) / CLOCKS_PER_SEC, sum);
    return 0;
}

Running on a Ryzen 7950X:

> clang.exe -std=c11 -O3 -march=znver4 ./src/perf.c && ./a.exe
24.225000 seconds, 78501

> clang.exe -std=c11 -O3 -march=x86-64-v4 ./src/perf.c && ./a.exe
20.866000 seconds, 78501

> clang.exe -std=c11 -O3 ./src/perf.c && ./a.exe                  
20.819000 seconds, 78501
> clang.exe --version
clang version 18.1.4
Target: x86_64-pc-windows-msvc
Thread model: posix
InstalledDir: C:\Program Files\LLVM\bin

Disassembly here: https://godbolt.org/z/orssnKP74

I originally noticed the issue with Rust: https://godbolt.org/z/Kh1v3G74K

Activity

RKSimon

RKSimon commented on May 3, 2024

@RKSimon
Collaborator

unrolling seems to have gone out of control - most likely due to the insane LoopMicroOpBufferSize value znver3/4 scheduler model uses

ganeshgit

ganeshgit commented on May 3, 2024

@ganeshgit
Contributor

> unrolling seems to have gone out of control - most likely due to the insane LoopMicroOpBufferSize value znver3/4 scheduler model uses

@RKSimon It's a conscious decision to have some value for LoopMicroOpBufferSize. The value we use does not really represent the actual buffer size that this parameter is meant to describe. I would prefer to remove the dependency on this parameter altogether rather than keep incorrect values. Let me know your opinion.

Systemcluster

Systemcluster commented on May 3, 2024

@Systemcluster
Author

The result is the same with znver1, and I don't see LoopMicroOpBufferSize being set in https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86ScheduleZnver1.td

> clang.exe -std=c11 -O3 -march=znver1 ./src/perf.c && ./a.exe
24.384000 seconds, 78501

The disassembly also looks the same regardless of which znver is targeted.

llvmbot

llvmbot commented on May 4, 2024

@llvmbot
Member

@llvm/issue-subscribers-backend-x86

Author: Chris (Systemcluster)


self-assigned this
on May 4, 2024
nikic

nikic commented on May 5, 2024

@nikic
Contributor

Related patch: #67657

RKSimon

RKSimon commented on May 5, 2024

@RKSimon
Collaborator

OK, I've got an idea of what's going on now. This is a combination of things. As well as the LoopMicroOpBufferSize issue making this a whole lot messier, Zen CPUs don't include the TuningSlowDivide64 flag, meaning there's no attempt to check whether the i64 div args can be represented with i32. The 25% regression on znver4 makes sense, as the r32 vs r64 div latency is 14 vs 19 cycles on znver3/4 according to uops.info.

I'll create PRs for this shortly.

Systemcluster

Systemcluster commented on May 6, 2024

@Systemcluster
Author

There's no noticeable runtime difference between optimization targets when using i32 instead of i64 in the check_prime example, so it seems that this indeed accounts for the majority of the regression there.

I found another example where optimizing for znver4 runs over 300% slower on Zen 4 than when optimizing for znver3. I assume that one is mainly caused by the aggressive unrolling? https://godbolt.org/z/zdMrP6aG7

RKSimon

RKSimon commented on May 6, 2024

@RKSimon
Collaborator

That second case might be due to excessive gather instructions in the znver4 codegen.

