Description
The following code:
```rust
pub fn bad() -> Result<u64, core::ptr::NonNull<()>> {
    Ok(0)
}

#[allow(improper_ctypes_definitions)]
pub extern "C" fn good() -> Result<u64, core::ptr::NonNull<()>> {
    Ok(0)
}
```
Both functions should generate identical code, returning the 128-bit value in RAX and RDX per the sysv-64 ABI, which is the default for the environment Godbolt uses.
Instead, when no ABI is specified, the compiler chooses to return the value through an out pointer argument.
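As a quick sanity check that the niche really packs this `Result` into 128 bits, a minimal sketch, assuming a 64-bit target (the assertion is illustrative, not from the original report):

```rust
use core::mem::size_of;
use core::ptr::NonNull;

// With the NonNull niche, the whole Result occupies 16 bytes,
// i.e. exactly two 64-bit registers' worth (assumes a 64-bit target).
const _: () = assert!(size_of::<Result<u64, NonNull<()>>>() == 16);
```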
Note: This affects `Result<usize, std::io::Error>`, a particularly useful type that appears throughout a significant portion of `std::io`, such as `Read` and `Write`.
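To see how pervasive that type is, a minimal sketch (the helper function is illustrative, not from the issue):

```rust
use std::io::{self, Read};

// Every `Read::read` (and `Write::write`) call returns
// `io::Result<usize>`, i.e. `Result<usize, std::io::Error>`,
// so this calling convention sits on most I/O hot paths.
fn read_some<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    reader.read(buf)
}
```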
Meta
This has improved and regressed since 1.43.0, the first version I could find which would use 128 bits for the type.
All versions tested used the expected code generation for the `extern "C"` function.
All tests were done using https://godbolt.org with no target overrides, which in practice means an x86_64 Linux machine using the sysv-64 ABI.
..= 1.42.0: niche not used, cannot be compared
1.43.0 ..= 1.47.0: codegen like today, uses out ptr for return
1.48.0 ..= 1.60.0: returns using an LLVM i128, which produces the most desirable code on sysv-64
1.61.0 .. (current 2022-05-28 nightly): poor codegen that uses an out ptr for return
@rustbot label +regression-from-stable-to-stable +A-codegen
Activity
thomcc commented on May 30, 2022
To be clear, it's debatable whether this is a regression (if it is, it's just a performance issue), since we make no guarantee about the ABI of `extern "Rust"`. That said, this likely applies to all 128-bit `Result` types, so it could have a non-negligible impact on overall performance.

erikdesjardins commented on May 30, 2022
This is #26494 / #77434, reverted by #85265 / #94570
asquared31415 commented on May 30, 2022
This is interesting, because tuples or structs seem to have some exceptions for two usize-sized elements specifically. `(u64, u64)` and `(u64, *mut ())` return by value, while `(u32, u32, u32, u32)`, `(u64, u32)`, and `[u32; 4]` (and all arrays) return with a pointer. Additionally, structs containing exactly two usize-sized values are returned by value (both with and without `#[repr(C)]`).

The `Result` niche usage is specifically inhibiting this optimization, since otherwise the returned data would look like a `(u64, *mut ())` in the `Result<u64, core::ptr::NonNull<()>>` or `std::io::Result<usize>` cases.
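For concreteness, a minimal set of functions reproducing those cases on Godbolt (function names are illustrative):

```rust
// Illustrative repros: per the above, the first two return by
// value in RAX:RDX, the remaining three through an out pointer.
pub fn two_u64() -> (u64, u64) { (0, 0) }
pub fn u64_ptr() -> (u64, *mut ()) { (0, core::ptr::null_mut()) }
pub fn four_u32() -> (u32, u32, u32, u32) { (0, 0, 0, 0) }
pub fn u64_u32() -> (u64, u32) { (0, 0) }
pub fn arr_u32() -> [u32; 4] { [0; 4] }
```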
nbdd0121 commented on May 30, 2022
Performance regression is a regression. And I think this is a very bad one.

If I replace the `Ok(0)` with `Ok(1)` then it even starts loading from memory.

I don't think this is related to the niche though; `Result<u64, core::num::NonZeroUsize>` still has the good old codegen. Somehow `Result<u64, core::ptr::NonNull<()>>` ceased to be passed as a scalar pair, while `Result<u64, core::num::NonZeroUsize>` still is.
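A minimal pair to compare on Godbolt (function names are illustrative):

```rust
use core::num::NonZeroUsize;
use core::ptr::NonNull;

// Illustrative comparison: on 1.61 the NonZeroUsize version still
// returns as a scalar pair in registers, while the NonNull version
// returns through an out pointer.
pub fn still_good() -> Result<u64, NonZeroUsize> { Ok(0) }
pub fn now_bad() -> Result<u64, NonNull<()>> { Ok(0) }
```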
asquared31415 commented on May 30, 2022
Ah, `NonZeroUsize` or `NonZeroU64` don't reproduce it, but `NonZeroU{32, 16, 8}` do return via an out ptr.
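A minimal repro of that width dependence (function names are illustrative):

```rust
use core::num::{NonZeroU32, NonZeroU64};

// Same Ok payload, different error width: the u64-wide error still
// returns in registers, the narrower one goes through an out pointer.
pub fn wide_err() -> Result<u64, NonZeroU64> { Ok(0) }
pub fn narrow_err() -> Result<u64, NonZeroU32> { Ok(0) }
```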
nbdd0121 commented on May 30, 2022
Okay, I located the reason. `Abi::ScalarPair` is only produced if all variants' data parts are scalars and have the same `abi::Primitive`. `NonZeroUsize` and `NonZeroU64` are both `Primitive::Int(Integer::I64, false)`, which matches that of `u64`, so the type is treated as a scalar pair. However, `NonZeroU{32,16,8}` have a different integer size, and `NonNull` has `Primitive::Pointer`, which never matches any integer, so `ScalarPair` is not produced and the type is treated as `Aggregate`.

Before #94570, something this small was passed in registers, so it was optimized to be the same as if it were passed as a scalar pair; but it isn't one, it's actually just a small aggregate passed directly. #94570 forces it to be passed indirectly, causing the perf regression.
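A simplified model of that rule (not rustc's actual layout code; the types here are heavily abridged stand-ins):

```rust
// Abridged stand-in for rustc's abi::Primitive (illustrative).
#[derive(Clone, Copy, PartialEq)]
enum Primitive {
    Int { bits: u16, signed: bool },
    Pointer,
}

// ScalarPair is only chosen when every variant's data part is a
// scalar with the same Primitive; Int(64) vs Pointer fails this,
// as does Int(64) vs Int(32), so those types become Aggregate.
fn scalar_pair_ok(variant_data: &[Primitive]) -> bool {
    match variant_data {
        [first, rest @ ..] => rest.iter().all(|p| p == first),
        [] => false,
    }
}
```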
Given that the motivation of #94570 is to work around LLVM's inability to autovectorize things passed in registers, I think a fix would be to use some heuristic to determine whether a 2-usize type can be a target of autovectorization: if so, pass it indirectly; otherwise, pass it by value.
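A minimal sketch of one such heuristic, using homogeneous runs of sub-pointer-sized fields as the proxy for vectorizability (names and thresholds are hypothetical, not rustc code):

```rust
const POINTER_BITS: u64 = 64; // assumption: 64-bit target

// Hypothetical: pass a 2-usize aggregate indirectly only when its
// leaf fields look vectorizable, i.e. three or more fields of one
// size smaller than a pointer (e.g. (u32, u32, u32, u32), [u32; 4]);
// everything else, such as (u64, *mut ()), stays by value.
fn pass_indirectly(field_bits: &[u64]) -> bool {
    match field_bits {
        [first, rest @ ..] if rest.len() >= 2 => {
            *first < POINTER_BITS && rest.iter().all(|s| s == first)
        }
        _ => false,
    }
}
```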
Urgau commented on May 30, 2022
@nbdd0121 Look at #93564, which was a more targeted fix. Maybe it could be reopened?
EDIT: Looking into it right now.
nbdd0121 commented on May 30, 2022
Thanks for the pointer. That PR looks like a better approach to me. I think the compile-time perf regression in that PR results from excessive `homogeneous_aggregate` calls when the type is smaller than or equal to pointer size (which includes almost all primitives!). I suspect that if you change the nesting of the ifs, so that it's only called on types of size between 1 usize and 2 usize, then the perf regression should be mitigated.
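A sketch of that suggested restructuring (illustrative helper, not the actual PR code):

```rust
// Hypothetical gate: only run the expensive homogeneous_aggregate
// check on types strictly larger than one pointer and at most two
// pointers wide; smaller types (almost all primitives) skip it.
fn needs_homogeneity_check(size: u64, pointer_size: u64) -> bool {
    size > pointer_size && size <= 2 * pointer_size
}
```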
Urgau commented on May 30, 2022
@nbdd0121 I've reopened #93564 as #97559. I've confirmed that with the PR this regression is fixed, and hopefully the compile-time perf regression of the original PR is fixed as well.