Code
I tried this code:
https://rust.godbolt.org/z/KG4cT6aPK
use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
pub unsafe fn decode(
    x: __m256i,
    ch: __m256i,
    ct: __m256i,
    dh: __m256i,
    dt: __m256i,
) -> Result<__m256i, ()> {
    let shr3 = _mm256_srli_epi32::<3>(x);
    let h1 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(ch, x));
    let h2 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(dh, x));
    let o1 = _mm256_shuffle_epi8(ct, h1);
    let o2 = _mm256_shuffle_epi8(dt, h2);
    let c1 = _mm256_adds_epi8(x, o1);
    let c2 = _mm256_add_epi8(x, o2);
    if _mm256_movemask_epi8(c1) != 0 {
        return Err(());
    }
    Ok(c2)
}
I expected to see this happen: This code should emit two vpavgb instructions.
Instead, this happened: One of the vpavgb instructions is missing.
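For readers unfamiliar with the instruction, here is a minimal scalar sketch of what each byte lane of `vpavgb` (and therefore `_mm256_avg_epu8`) computes: the unsigned average rounded up, with the sum held in a wider type so it cannot overflow. The function names `avg_round_u8` and `avg_round_lanes` are illustrative and not part of any library.

```rust
/// Rounding average of two unsigned bytes: (a + b + 1) >> 1.
/// The sum is computed in 16 bits so it cannot overflow.
pub fn avg_round_u8(a: u8, b: u8) -> u8 {
    ((a as u16 + b as u16 + 1) >> 1) as u8
}

/// Applies the rounding average to each of the 32 byte lanes,
/// mirroring the shape of a 256-bit vector operation.
pub fn avg_round_lanes(a: [u8; 32], b: [u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = avg_round_u8(a[i], b[i]);
    }
    out
}
```

Any instruction sequence the compiler picks must produce these values, so the missing `vpavgb` is a code-size and speed concern rather than a correctness one.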
Version it worked on
It most recently worked on: 1.74.1
Version with regression
1.75.0 ~ nightly
rustc 1.79.0-nightly (dbce3b43b 2024-04-20)
binary: rustc
commit-hash: dbce3b43b6cb34dd3ba12c3ec6f708fe68e9c3df
commit-date: 2024-04-20
host: x86_64-unknown-linux-gnu
release: 1.79.0-nightly
LLVM version: 18.1.4
@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged
Labels
Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
Area: SIMD (Single Instruction Multiple Data)
Category: This is a bug.
Issue: Problems and improvements with respect to binary size of generated code.
Issue: Problems and improvements with respect to performance of generated code.
Medium priority
Relevant to the library team, which will review and decide on the PR/issue.
Performance or correctness regression from one stable version to another.
saethlin commented on Apr 21, 2024
Did you confirm that this is the responsible change or are you guessing?
workingjubilee commented on Apr 21, 2024
@Nugine This is definitely more instructions and more bytes on each, so I'm marking it with I-heavy, but it appears this comes with a performance regression. Can you be precise about which of the ~19 benchmarks you appear to run have regressed, and on what architecture?
I would rather we not make the 2nd vpavgb instruction come back only for your algorithm to still be dog-slow because some of the other instructions are different.
Also, can you be more precise on what architectures and with what target features you're testing on? GitHub is allowed to change the CPU you run benchmarks on, and does, because their fleet is not perfectly uniform, so -Ctarget-cpu=native makes it more likely your benchmarks can be run-to-run and job-to-job inconsistent.
Nugine commented on Apr 21, 2024
Base64-decode in base64-simd has been slower than radix64 since Rust 1.75.0. By comparing the asm generated by 1.74.1 and 1.75.0, I found that one of the vpavgb instructions is missing. LLVM doesn't emit vpavgb for one of the _mm256_avg_epu8 calls, but emits a longer sequence of equivalent instructions instead.
rust-lang/stdarch#1477 made the change. However, the root cause may be elsewhere, possibly LLVM.
To see the asm, you can use the following commands.
Target: x86_64-unknown-linux-gnu
Instruction: AVX2
I have extracted the decode function and reproduced the regression. https://rust.godbolt.org/z/KG4cT6aPK
I'm looking for:
vpavgb
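The commands themselves are not reproduced here. As a rough aid for the comparison described above, the following sketch counts how often a mnemonic appears as the first token on a line of assembly text, for example text saved from the Godbolt output. The helper `count_mnemonic` is hypothetical, and so is whatever sample text you feed it.

```rust
/// Counts the lines of `asm` whose first whitespace-separated token
/// equals `mnemonic`. Labels and blank lines are skipped naturally,
/// because their first token (if any) differs from the mnemonic.
pub fn count_mnemonic(asm: &str, mnemonic: &str) -> usize {
    asm.lines()
        .filter_map(|line| line.split_whitespace().next())
        .filter(|first| *first == mnemonic)
        .count()
}
```

Running it on the output of both compiler versions with `"vpavgb"` as the mnemonic would be one way to confirm the drop from two occurrences to one.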
workingjubilee commented on Apr 21, 2024
@Nugine re: the workaround: On current Rust, stable, the decode_asm function here recovers exactly equivalent output to what you had before: https://rust.godbolt.org/z/fGEaYME1h
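The linked decode_asm is not reproduced here. As a minimal sketch of the general idea, assuming the workaround pins the instruction with inline assembly, a wrapper like the one below forces a single `vpavgb` regardless of how the compiler lowers `_mm256_avg_epu8`. The module `forced_avg` and the names `avg_epu8_asm` and `check` are hypothetical, not taken from the link.

```rust
#[cfg(target_arch = "x86_64")]
pub mod forced_avg {
    use std::arch::asm;
    use std::arch::x86_64::*;

    /// Emits exactly one `vpavgb` via inline asm (Intel operand order:
    /// destination first). Hypothetical helper, not the linked code.
    #[target_feature(enable = "avx2")]
    pub unsafe fn avg_epu8_asm(a: __m256i, b: __m256i) -> __m256i {
        let out: __m256i;
        unsafe {
            asm!(
                "vpavgb {out}, {a}, {b}",
                out = lateout(ymm_reg) out,
                a = in(ymm_reg) a,
                b = in(ymm_reg) b,
                options(pure, nomem, nostack, preserves_flags),
            );
        }
        out
    }

    /// Compares the asm wrapper against the intrinsic on sample input.
    /// Returns `None` when the CPU lacks AVX2.
    pub fn check() -> Option<bool> {
        if !is_x86_feature_detected!("avx2") {
            return None;
        }
        unsafe {
            let a = _mm256_set1_epi8(7);
            let b = _mm256_set1_epi8(10);
            let forced = avg_epu8_asm(a, b);
            let reference = _mm256_avg_epu8(a, b);
            // All 32 lanes equal means every mask bit is set, i.e. -1 as i32.
            let eq = _mm256_cmpeq_epi8(forced, reference);
            Some(_mm256_movemask_epi8(eq) == -1)
        }
    }
}

#[cfg(not(target_arch = "x86_64"))]
pub mod forced_avg {
    /// Non-x86_64 targets cannot run the comparison.
    pub fn check() -> Option<bool> {
        None
    }
}
```

The trade-off is that an `asm!` block is opaque to the optimizer, so it cannot be constant-folded or merged with neighbouring operations. That makes it reasonable as a stopgap for a hot loop, but the intrinsic is preferable again once the codegen regression is fixed.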