Performance regression: rustc failed to optimize specific x86-64 SIMD intrinsics after 1.75.0 #124216

Open, opened by @Nugine (Contributor)

Description

Code

I tried this code:

https://rust.godbolt.org/z/KG4cT6aPK

use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
pub unsafe fn decode(
    x: __m256i,
    ch: __m256i,
    ct: __m256i,
    dh: __m256i,
    dt: __m256i,
) -> Result<__m256i, ()> {
    let shr3 = _mm256_srli_epi32::<3>(x);

    let h1 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(ch, x));
    let h2 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(dh, x));

    let o1 = _mm256_shuffle_epi8(ct, h1);
    let o2 = _mm256_shuffle_epi8(dt, h2);

    let c1 = _mm256_adds_epi8(x, o1);
    let c2 = _mm256_add_epi8(x, o2);

    if _mm256_movemask_epi8(c1) != 0 {
        return Err(());
    }

    Ok(c2)
}

I expected to see this happen: This code should emit two vpavgb instructions.

Instead, this happened: One of the vpavgb instructions is missing.

Nugine/simd#43

Version it worked on

It most recently worked on: 1.74.1

Version with regression

1.75.0 ~ nightly

rustc 1.79.0-nightly (dbce3b43b 2024-04-20)
binary: rustc
commit-hash: dbce3b43b6cb34dd3ba12c3ec6f708fe68e9c3df
commit-date: 2024-04-20
host: x86_64-unknown-linux-gnu
release: 1.79.0-nightly
LLVM version: 18.1.4

@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged

Activity

added
C-bug (Category: This is a bug.)
regression-untriaged (Untriaged performance or correctness regression.)
on Apr 21, 2024

added
I-prioritize (Issue: Indicates that prioritization has been requested for this issue.)
needs-triage (This issue may need triage. Remove it if it has been sufficiently triaged.)
regression-from-stable-to-stable (Performance or correctness regression from one stable version to another.)
and removed
regression-untriaged (Untriaged performance or correctness regression.)
on Apr 21, 2024

added
A-SIMD (Area: SIMD (Single Instruction Multiple Data))
T-libs (Relevant to the library team, which will review and decide on the PR/issue.)
on Apr 21, 2024

@saethlin (Member) commented on Apr 21, 2024

Blaming rust-lang/stdarch#1477

Did you confirm that this is the responsible change or are you guessing?

added
E-needs-bisection (Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc)
I-heavy (Issue: Problems and improvements with respect to binary size of generated code.)
on Apr 21, 2024

@workingjubilee (Member) commented on Apr 21, 2024

@Nugine This is definitely more instructions and more bytes on each, so I'm marking it with I-heavy, but it appears this comes with a performance regression. Can you be precise about which of the ~19 benchmarks you appear to run have regressed, and on what architecture?

I would rather we not make the 2nd vpavgb instruction come back only for your algorithm to still be dog-slow because some of the other instructions are different.

Also, can you be more precise about what architectures and what target features you're testing with? GitHub is allowed to change the CPU you run benchmarks on, and does, because their fleet is not perfectly uniform, so -Ctarget-cpu=native makes it more likely that your benchmarks are inconsistent from run to run and from job to job.

@Nugine (Contributor, Author) commented on Apr 21, 2024

Base64 decoding in base64-simd has been slower than radix64 since Rust 1.75.0. By comparing the asm generated by 1.74.1 and 1.75.0, I found that one of the vpavgb instructions is missing. For one of the two _mm256_avg_epu8 calls, LLVM doesn't emit vpavgb but a longer sequence of equivalent instructions.

rust-lang/stdarch#1477 made the change. However, the root cause may be elsewhere, possibly LLVM.
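
For reference, vpavgb (and therefore _mm256_avg_epu8) computes the rounded-up average of unsigned bytes, (a + b + 1) >> 1, with the intermediate sum kept wide enough not to overflow. The scalar sketch below models a single byte lane; the helper name `avg_epu8_lane` is hypothetical and not part of std. When the operation is written out as widened arithmetic like this, the backend has to recognize the pattern to select vpavgb. Whether that is what goes wrong here is an assumption, not something confirmed in this thread.

```rust
/// Scalar model of one byte lane of `vpavgb` / `_mm256_avg_epu8`:
/// the rounded-up average of two unsigned bytes.
/// `avg_epu8_lane` is a hypothetical helper name, not part of std.
pub fn avg_epu8_lane(a: u8, b: u8) -> u8 {
    // Widen to u16 so `a + b + 1` cannot overflow (max 255 + 255 + 1 = 511).
    let sum = a as u16 + b as u16 + 1;
    // Halve and narrow back; the result always fits in 8 bits (511 >> 1 = 255).
    (sum >> 1) as u8
}
```

For example, `avg_epu8_lane(1, 2)` is 2 (rounded up from 1.5) and `avg_epu8_lane(255, 255)` is 255.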

To see the asm, you can use the following commands.

git clone https://github.com/Nugine/simd.git
cd simd
rustup override set 1.74.1 # or 1.75.0
RUSTFLAGS="--cfg vsimd_dump_symbols" cargo asm -p base64-simd --lib --simplify --target x86_64-unknown-linux-gnu  --context 1 -- base64_simd::multiversion::decode::avx2 > base64-decode-avx2.asm
cat base64-decode-avx2.asm

Target: x86_64-unknown-linux-gnu
Instruction set: AVX2

I have extracted the decode function and reproduced the regression. https://rust.godbolt.org/z/KG4cT6aPK
I'm looking for:

  • a stable workaround method to generate vpavgb
  • why the optimization is missing

@workingjubilee (Member) commented on Apr 21, 2024

@Nugine re: the workaround: On current Rust, stable, the decode_asm function here recovers exactly equivalent output to what you had before: https://rust.godbolt.org/z/fGEaYME1h
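
One general way to pin a specific instruction on stable Rust is inline assembly. The sketch below only illustrates that idea and is an assumption: it is not the contents of the linked decode_asm, and the wrapper name `avg_epu8_asm` is hypothetical. It compiles only for x86-64 targets and must only be called when AVX2 is available.

```rust
use std::arch::asm;
use std::arch::x86_64::*;

/// Hypothetical wrapper that always emits `vpavgb` for a 256-bit average.
/// Safety: the caller must ensure the CPU supports AVX2.
#[target_feature(enable = "avx2")]
pub unsafe fn avg_epu8_asm(a: __m256i, b: __m256i) -> __m256i {
    let out: __m256i;
    // Intel syntax (the default for `asm!` on x86): destination first.
    asm!(
        "vpavgb {out}, {a}, {b}",
        out = lateout(ymm_reg) out,
        a = in(ymm_reg) a,
        b = in(ymm_reg) b,
        // No memory access, no stack use, flags untouched, and the result
        // depends only on the inputs.
        options(pure, nomem, nostack, preserves_flags),
    );
    out
}
```

The trade-off is that the compiler treats the asm block as opaque: it will not constant-fold it or fuse it with neighbouring operations, so it is worth benchmarking before keeping such a wrapper.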

21 remaining items


Metadata

Assignees

No one assigned

    Labels

    A-LLVM (Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.)
    A-SIMD (Area: SIMD (Single Instruction Multiple Data))
    C-bug (Category: This is a bug.)
    I-heavy (Issue: Problems and improvements with respect to binary size of generated code.)
    I-slow (Issue: Problems and improvements with respect to performance of generated code.)
    P-medium (Medium priority)
    T-libs (Relevant to the library team, which will review and decide on the PR/issue.)
    regression-from-stable-to-stable (Performance or correctness regression from one stable version to another.)


    Participants

    @jhorstmann, @eduardosm, @Deniskore, @apiraino, @moxian
