Closed
Description
Using iter().skip() functions leads to poor optimization compared to the manually done loop with range.
https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=b7ed8bf9e4fc3341a92f301fa5185cc5
pub fn test_1(a: [i32; 10]) -> i32 {
let mut sum = 0;
for v in a.iter().skip(8) {
sum += v;
}
sum
}
pub fn test_2(a: [i32; 10]) -> i32 {
let mut sum = 0;
for index in 8..10 {
sum += a[index];
}
sum
}
This produces the following asm output:
playground::test_1:
movq %rdi, %r8
addq $40, %r8
xorl %esi, %esi
movl $8, %edx
xorl %eax, %eax
testb $1, %sil
jne .LBB0_2
.LBB0_5:
leaq -1(%rdx), %rsi
movq %r8, %rcx
subq %rdi, %rcx
shrq $2, %rcx
cmpq %rsi, %rcx
jbe .LBB0_4
leaq (%rdi,%rdx,4), %rdi
.LBB0_2:
cmpq %r8, %rdi
je .LBB0_4
testq %rdi, %rdi
je .LBB0_4
addl (%rdi), %eax
addq $4, %rdi
movb $1, %sil
xorl %edx, %edx
testb $1, %sil
je .LBB0_5
jmp .LBB0_2
.LBB0_4:
retq
playground::test_2:
movl 36(%rdi), %eax
addl 32(%rdi), %eax
retq
Considering the zero-cost abstraction rule and the fact that the compiler knows the size of the array, it should optimize test_1 to at least the same form as test_2 where it correctly detected that we only need two values summed. Instead, there's quite a chunk of asm with lots of branches.
The issue is present both in the stable version (1.63.0) and nightly/beta channels.
Metadata
Metadata
Assignees
Labels
Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.Category: This is a bug.Call for participation: An issue has been fixed and does not reproduce, but no test has been added.Issue: Problems and improvements with respect to performance of generated code.Relevant to the compiler team, which will review and decide on the PR/issue.
Activity
MatiF100 commentedon Sep 14, 2022
Rewriting the function the following way produces the same assembly as the better optimized variant. Seems like the issue happens when using both iterators and for loop at the same time.
https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=bcb4551c22ef765a682c1d6c41eb285f
nikic commentedon Sep 14, 2022
This general class of problem is well known -- optimization of exterior iteration in Rust is very challenging. Using interior iteration (as in the previous comment) will generally optimize much better.
That said, in this case optimization is likely feasible. Looking at the IR (https://rust.godbolt.org/z/cevdKWcTn) there is a clear opportunity for peeling based on phi invariance here, which should allow follow-on optimization. Would have to investigate closer to find out why it does not trigger.
Tearth commentedon Sep 14, 2022
Thanks, I wasn't aware that the compiler can have this kind of trouble with exterior iterations, but it's understandable - I will leave this issue open if you're saying that this case has the potential to improve.
Remove #[inline(always)] from CompressorOxide::default()
Remove #[inline(always)] from CompressorOxide::default() (#125)
nikic commentedon Sep 27, 2022
I took a closer look, and the reason why this doesn't peel are multiple checks in canPeel(): https://github.com/llvm/llvm-project/blob/2769ceb0e7a4b4f11c2bf5bd21fd69c154c17ff8/llvm/lib/Transforms/Utils/LoopPeel.cpp#L88 We have a non-exiting latch here, and because of that the non-latch exits are also not terminated by unreachable. It should be possible to relax these requirements, but would need some effort to support branch weight updates.
nikic commentedon Sep 28, 2022
Upstream patch: https://reviews.llvm.org/D134803
iter.for_each()
instead offor x in iter
nnethercote/perf-book#51nikic commentedon Apr 3, 2023
Fixed by the LLVM 16 upgrade.
Add codegen tests for issues fixed by LLVM 16
Rollup merge of rust-lang#109895 - nikic:llvm-16-tests, r=cuviper
.next()
vs internal.try_fold()
iteration nnethercote/perf-book#70Some rewrites, prefer using iters over .collect::<Vec<_>>