Result that uses niches resulting in a final size of 16 bytes emits poor LLVM IR due to ABI #97540

Closed

Description

@asquared31415 (Contributor)

The following code:

pub fn bad() -> Result<u64, core::ptr::NonNull<()>> {
    Ok(0)
}

#[allow(improper_ctypes_definitions)]
pub extern "C" fn good() -> Result<u64, core::ptr::NonNull<()>> {
    Ok(0)
}

godbolt LLVM IR
godbolt ASM

Both functions should compile to identical code, returning the 128-bit value in RAX and RDX per the sysv-64 ABI, which is the default for the environment godbolt uses.

Instead, when returning the value without specifying any ABI, the compiler chooses to return the value using an out pointer argument.

Note: This affects Result<usize, std::io::Error>, which is a particularly useful type, and so affects a significant portion of std::io, such as Read and Write.
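For reference, the sizes of the types involved can be checked with a minimal sketch like the one below. The function names are illustrative and not part of the report. Enum layout is not guaranteed by the language, so the 16-byte figure is an observation about current rustc on 64-bit targets, not a promise.

```rust
use core::mem::size_of;
use core::ptr::NonNull;

/// Size of the type from the report. Observed to be 16 bytes on 64-bit
/// targets with current rustc (assumption: not a language guarantee).
pub fn reported_type_size() -> usize {
    size_of::<Result<u64, NonNull<()>>>()
}

/// Size of the `std::io` result type mentioned in the note above.
pub fn io_result_size() -> usize {
    size_of::<Result<usize, std::io::Error>>()
}

/// `Option<NonNull<T>>` is documented to be pointer-sized, because the
/// null pointer serves as the niche for `None`.
pub fn option_nonnull_size() -> usize {
    size_of::<Option<NonNull<()>>>()
}
```

Calling these from a small `main` and printing the results is enough to compare layouts across compiler versions.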

Meta

This has improved and regressed since 1.43.0, the first version I could find which would use 128 bits for the type.
All versions tested used the expected code generation for the extern "C" function.
All tests were done using https://godbolt.org with no target overrides, which in practice is an x86_64 Linux machine using the sysv-64 ABI.

..= 1.42.0: niche not used, cannot be compared
1.43.0 ..= 1.47.0: codegen like today, uses out ptr for return
1.48.0 ..= 1.60.0: returns using an LLVM i128, which produces the most desirable code on sysv-64
1.61.0 .. (current 2022-05-28 nightly): poor codegen that uses an out ptr for return

@rustbot label +regression-from-stable-to-stable +A-codegen

Activity

Labels added on May 30, 2022:
A-codegen (Area: Code generation)
regression-from-stable-to-stable (Performance or correctness regression from one stable version to another.)
I-prioritize (Issue: Indicates that prioritization has been requested for this issue.)

Label added on May 30, 2022:
I-slow (Issue: Problems and improvements with respect to performance of generated code.)

thomcc (Member) commented on May 30, 2022

To be clear, it's debatable if this is a regression (if it is, it's just a performance issue), since we make no guarantee about the ABI of extern "Rust".

That said, this likely applies to all 128-bit Result types, so it could have a non-negligible impact on overall performance.

erikdesjardins (Contributor) commented on May 30, 2022

This is #26494 / #77434, reverted by #85265 / #94570

asquared31415 (Contributor, Author) commented on May 30, 2022

This is interesting, because tuples and structs seem to have an exception specifically for two usize-sized elements. (u64, u64) and (u64, *mut ()) return by value, while (u32, u32, u32, u32), (u64, u32), and [u32; 4] (and all arrays) return with a pointer. Additionally, structs containing exactly two usize values are returned by value (both with and without #[repr(C)]).

The Result niche usage is specifically inhibiting this optimization, since otherwise the data returned would look like a (u64, *mut ()) in the Result<u64, core::ptr::NonNull<()>> or std::io::Result<usize> cases.
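The shapes listed in this comment can be reproduced with a sketch like the following, suitable for pasting into godbolt. The function and struct names are made up for illustration, and the "reported" remarks in the comments only restate the observations above; they are not re-verified here and may differ between compiler versions.

```rust
// Reported above: returned by value.
pub fn pair_u64() -> (u64, u64) {
    (1, 2)
}

// Reported above: returned by value.
pub fn pair_ptr() -> (u64, *mut ()) {
    (1, core::ptr::null_mut())
}

// Reported above: returned through a pointer.
pub fn four_u32() -> (u32, u32, u32, u32) {
    (1, 2, 3, 4)
}

// Reported above: returned through a pointer.
pub fn mixed() -> (u64, u32) {
    (1, 2)
}

// Reported above: returned through a pointer (as are all arrays).
pub fn array() -> [u32; 4] {
    [1, 2, 3, 4]
}

// Hypothetical struct with exactly two usize fields.
// Reported above: returned by value, with or without #[repr(C)].
#[repr(C)]
#[derive(Debug, PartialEq, Eq)]
pub struct TwoWords {
    pub a: usize,
    pub b: usize,
}

pub fn two_words() -> TwoWords {
    TwoWords { a: 1, b: 2 }
}
```

Comparing the LLVM IR signatures of these functions shows which ones gain an out pointer parameter on a given compiler version.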

nbdd0121 (Contributor) commented on May 30, 2022

> To be clear, it's debatable if this is a regression (if it is, it's just a performance issue), since we make no guarantee about the ABI of extern "Rust".

A performance regression is a regression. And I think this is a very bad one.

If I replace the Ok(0) with Ok(1), it even starts loading from memory.

I don't think this is related to the niche, though; Result<u64, core::num::NonZeroUsize> still has the good old codegen. Somehow Result<u64, core::ptr::NonNull<()>> ceased to be passed as a scalar pair, while Result<u64, core::num::NonZeroUsize> still is.

asquared31415 (Contributor, Author) commented on May 30, 2022

Ah, NonZeroUsize or NonZeroU64 don't reproduce it, but NonZeroU{32, 16, 8} do return via an out ptr.
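A minimal set of functions for comparing these cases side by side might look like this. The function names are illustrative; the comments restate what was reported in this thread for the compiler versions discussed.

```rust
use core::num::{NonZeroU32, NonZeroU64, NonZeroUsize};

// Reported: still returned as a scalar pair.
pub fn err_nonzero_usize() -> Result<u64, NonZeroUsize> {
    Ok(0)
}

// Reported: still returned as a scalar pair.
pub fn err_nonzero_u64() -> Result<u64, NonZeroU64> {
    Ok(0)
}

// Reported: returned via an out pointer (likewise NonZeroU16 and NonZeroU8).
pub fn err_nonzero_u32() -> Result<u64, NonZeroU32> {
    Ok(0)
}
```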

nbdd0121 (Contributor) commented on May 30, 2022

Okay, I located the reason. Abi::ScalarPair is only produced if the data parts of all variants are scalars and have the same abi::Primitive. NonZeroUsize and NonZeroU64 are both Primitive::Int(Integer::I64, false), which matches that of u64, so the type is treated as a scalar pair. However, NonZeroU{32,16,8} have a different integer size, and NonNull has Primitive::Pointer, which never matches any integer, so ScalarPair is not produced and the type is treated as Aggregate.
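The rule described here can be modelled with a small toy sketch. This is not rustc's code: `Primitive`, `Abi`, and `classify` below are simplified, hypothetical stand-ins that capture only the "all variants must share the same primitive" condition.

```rust
/// Simplified stand-in for rustc's `abi::Primitive` (hypothetical, not the real type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Primitive {
    Int { bits: u8, signed: bool },
    Pointer,
}

/// Simplified stand-in for the two layout outcomes discussed above.
#[derive(Debug, PartialEq, Eq)]
pub enum Abi {
    ScalarPair,
    Aggregate,
}

/// `payloads` holds the primitive of each variant's single scalar field.
/// A scalar pair is chosen only when every variant uses the same primitive.
pub fn classify(payloads: &[Primitive]) -> Abi {
    match payloads.split_first() {
        Some((first, rest)) if rest.iter().all(|p| p == first) => Abi::ScalarPair,
        _ => Abi::Aggregate,
    }
}
```

In this model, `Result<u64, NonZeroU64>` corresponds to two `Int { bits: 64, signed: false }` payloads and classifies as `ScalarPair`, while replacing the second payload with `Pointer` or a 32-bit integer yields `Aggregate`.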

Before #94570, something this small was passed in registers, so it was optimized to the same code as if it were passed as a scalar pair. It is not one, though; it is actually just a small aggregate passed directly. #94570 forces it to be passed indirectly, causing the perf regression.

Given that the motivation of #94570 is to work around LLVM's inability to autovectorize things passed in registers, I think a fix would be to use some heuristics to determine whether a 2-usize type can be a target of autovectorization: if so, pass it indirectly; otherwise, pass it by value.

Urgau (Member) commented on May 30, 2022

@nbdd0121 Look at #93564, which was a more targeted fix. Maybe it could be reopened?

EDIT: Looking into it right now.

nbdd0121 (Contributor) commented on May 30, 2022

Thanks for the pointer. That PR looks like a better approach to me. I think the compile-time perf regression in that PR results from excessive homogeneous_aggregate calls when the type is smaller than or equal to pointer size (which includes almost all primitives!). I suspect that if you change the nesting of the ifs so that it's only called on types whose size is between 1 and 2 usizes, the perf regression should be mitigated.
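The suggested nesting can be illustrated with a rough sketch. Everything here is hypothetical: `PassMode`, `POINTER_BYTES`, and `choose_pass_mode` are not rustc APIs, the closure stands in for the expensive homogeneous_aggregate query, and mapping "homogeneous" to indirect passing is an assumption based on the autovectorization motivation mentioned earlier in the thread.

```rust
/// Hypothetical, simplified pass modes (not rustc's real `PassMode`).
#[derive(Debug, PartialEq, Eq)]
pub enum PassMode {
    Direct,
    Indirect,
}

/// Assumed pointer size for a 64-bit target.
pub const POINTER_BYTES: u64 = 8;

/// The expensive check is modelled as a closure so that it only runs
/// for sizes strictly between one and two pointers (inclusive of two).
pub fn choose_pass_mode(
    size: u64,
    is_homogeneous_aggregate: impl FnOnce() -> bool,
) -> PassMode {
    if size <= POINTER_BYTES {
        // Small types: decide without running the expensive query.
        PassMode::Direct
    } else if size <= 2 * POINTER_BYTES {
        // Only this band pays for the query.
        if is_homogeneous_aggregate() {
            // Assumption: treated as an autovectorization candidate.
            PassMode::Indirect
        } else {
            PassMode::Direct
        }
    } else {
        // Larger aggregates: passed indirectly, again without the query.
        PassMode::Indirect
    }
}
```

The point of the ordering is that the size comparisons are cheap, so the query is skipped for the most common case of pointer-sized or smaller types.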

Urgau (Member) commented on May 30, 2022

@nbdd0121 I've reopened #93564 as #97559. I've confirmed that with the PR the regression is fixed and hopefully the perf regression of the PR is also fixed.


Metadata

Labels

A-ABI (Area: Concerning the application binary interface (ABI))
A-codegen (Area: Code generation)
C-bug (Category: This is a bug.)
I-slow (Issue: Problems and improvements with respect to performance of generated code.)
P-medium (Medium priority)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)
regression-from-stable-to-stable (Performance or correctness regression from one stable version to another.)

Participants

@nikic, @riking, @thomcc, @Urgau, @nbdd0121