Closed
Description
The following code compiled with `-O3 -march=znver4` (or any other `znver`) runs around 25% slower on Zen hardware than when compiled with `-O3 -march=x86-64-v4` or the baseline `x86-64`.
```c
bool check_prime(int64_t n) {
    if (n < 2) {
        return true;
    }
    int64_t lim = (int64_t)ceil((double)n / 2.0);
    for (int64_t i = 2; i < lim; i++) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}
```
Full code
```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

bool check_prime(int64_t n) {
    if (n < 2) {
        return true;
    }
    int64_t lim = (int64_t)ceil((double)n / 2.0);
    for (int64_t i = 2; i < lim; i++) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}

int main() {
    clock_t now = clock();
    int sum = 0;
    for (int i = 0; i < 1000000; i++) {
        if (check_prime(i)) {
            sum += 1;
        }
    }
    printf("%f, %d\n", (double)(clock() - now) / CLOCKS_PER_SEC, sum);
    return 0;
}
```
Running on a Ryzen 7950X:
```
> clang.exe -std=c11 -O3 -march=znver4 ./src/perf.c && ./a.exe
24.225000 seconds, 78501
> clang.exe -std=c11 -O3 -march=x86-64-v4 ./src/perf.c && ./a.exe
20.866000 seconds, 78501
> clang.exe -std=c11 -O3 ./src/perf.c && ./a.exe
20.819000 seconds, 78501
> clang.exe --version
clang version 18.1.4
Target: x86_64-pc-windows-msvc
Thread model: posix
InstalledDir: C:\Program Files\LLVM\bin
```
Disassembly here: https://godbolt.org/z/orssnKP74
I originally noticed the issue with Rust: https://godbolt.org/z/Kh1v3G74K
Activity
RKSimon commented on May 3, 2024
Unrolling seems to have gone out of control, most likely due to the insane LoopMicroOpBufferSize value the znver3/znver4 scheduler models use.
ganeshgit commented on May 3, 2024
@RKSimon It's a conscious decision to have some value for LoopMicroOpBufferSize. The value we use does not really represent the actual buffer size that this parameter is intended to model. I would prefer to remove the dependency on this parameter altogether rather than keep incorrect values. Let me know your opinion.
Systemcluster commented on May 3, 2024
The result is the same with `znver1`; I don't see `LoopMicroOpBufferSize` being set in https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86ScheduleZnver1.td
The disassembly looks to be the same as well, regardless of which `znver` is targeted.
llvmbot commented on May 4, 2024
@llvm/issue-subscribers-backend-x86
Author: Chris (Systemcluster)
nikic commented on May 5, 2024
Related patch: #67657
RKSimon commented on May 5, 2024
OK, got an idea of what's going on now. This is a combination of things: as well as the LoopMicroOpBufferSize issue making this a whole lot messier, Zen CPUs don't include the TuningSlowDivide64 flag (meaning there's no attempt to check whether the i64 div args can be represented as i32). The 25% regression on znver4 makes sense, as the r32 vs r64 divide latency is 14 vs 19cy on znver3/znver4 according to uops.info.
I'll create PRs for this shortly.
[X86] Add slow div64/lea3 tuning flags to Nehalem target
Systemcluster commented on May 6, 2024
There's no noticeable runtime difference between optimization targets when using `i32` instead of `i64` in the `check_prime` example, so it seems that indeed accounts for the majority of the regression there.
I found another example where optimizing for `znver4` runs over 300% slower on Zen 4 than when optimizing for `znver3`; I assume there it's mainly caused by the aggressive unrolling? https://godbolt.org/z/zdMrP6aG7
RKSimon commented on May 6, 2024
That second case might be due to excessive gather instructions on znver4 codegen