Poor unrolling prevents vectorization opportunities #50452

| Field | Value |
| --- | --- |
| Bugzilla Link | [51108](https://llvm.org/bz51108) |
| Version | trunk |
| OS | Windows NT |
| CC | @adibiagio,@fhahn,@LebedevRI,@MattPD,@OCHyams,@rotateright |

## Extended Description 
https://godbolt.org/z/jE1e9rT5j
(NOTE: FMA is disabled on GCC to prevent an fmul+fadd -> fma difference in the comparison.)
```
constexpr int SIZE = 128;
float A[SIZE][16];
float B[SIZE][16];

__attribute__((__noinline__))
float foo()
{
    float sum = 0.0f;
    for (int i = 1; i < 32; ++i)
        for (int j = 0; j < 4; ++j)
            sum += A[i][j] * B[i][j];
    
    return sum;
}
```
clang -g0 -O3 -march=znver2
```
_Z3foov:
        vxorps  %xmm0, %xmm0, %xmm0
        movq    $-1984, %rax                    # imm = 0xF840
.LBB0_1:
        vmovss  A+2048(%rax), %xmm1             # xmm1 = mem[0],zero,zero,zero
        vmovsd  B+2052(%rax), %xmm2             # xmm2 = mem[0],zero
        vmulss  B+2048(%rax), %xmm1, %xmm1
        vaddss  %xmm1, %xmm0, %xmm0
        vmovsd  A+2052(%rax), %xmm1             # xmm1 = mem[0],zero
        vmulps  %xmm2, %xmm1, %xmm1
        vaddss  %xmm1, %xmm0, %xmm0
        vmovshdup       %xmm1, %xmm1            # xmm1 = xmm1[1,1,3,3]
        vaddss  %xmm1, %xmm0, %xmm0
        vmovss  A+2060(%rax), %xmm1             # xmm1 = mem[0],zero,zero,zero
        vmulss  B+2060(%rax), %xmm1, %xmm1
        addq    $64, %rax
        vaddss  %xmm1, %xmm0, %xmm0
        jne     .LBB0_1
        retq
```
The clang code has several issues:

1 - With a better indvar we could have avoided the very large offsets in the address math (put the addresses of A and B in registers and use a better range/increment for %rax).

2 - GCC recognises that the array is fully dereferenceable, which allows it to use fewer (vector) loads and then extract/shuffle the elements that it requires.

3 - We fail to ensure the per-loop reduction is in a form that can use HADDPS (on targets where it's fast).

4 - The LoopMicroOpBufferSize in the znver3 model has a VERY unexpected effect on unrolling. I'm not sure clang interprets the buffer size in a way that matches a value copied directly from AMD's hardware specs.
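
To make points 1 and 2 concrete, here is a minimal hand-written sketch of the loop shape they describe. It is an illustration only, not what GCC or clang actually emit. The names `foo_reference`, `foo_restructured` and `fill_small_integers` are hypothetical and do not appear in the original report. The restructured version walks both arrays with row pointers and a simple 0..30 counter, computes the four products of a row first (the part that could map to one 16-byte load per array and one 4-wide multiply), and then adds the lanes in the original order, so the floating-point summation order is unchanged.

```cpp
#include <cstddef>

constexpr int SIZE = 128;
float A[SIZE][16];
float B[SIZE][16];

// Address math seen in the clang output, assuming 4-byte float:
// each row is 64 bytes, the loop covers 31 rows, and %rax starts at -1984,
// so A+2048(%rax) initially points at A[1][0] (byte offset 64).
static_assert(sizeof(float) == 4, "sketch assumes 4-byte float");
static_assert(31 * sizeof(A[0]) == 1984, "31 rows of 64 bytes");
static_assert(2048 - 1984 == sizeof(A[0]), "first access is A[1][0]");

// Reference: the loop nest from the report, unchanged.
float foo_reference() {
    float sum = 0.0f;
    for (int i = 1; i < 32; ++i)
        for (int j = 0; j < 4; ++j)
            sum += A[i][j] * B[i][j];
    return sum;
}

// Hand-restructured sketch (hypothetical): row pointers plus a small counter,
// products computed as a group of four, lanes added in the original order.
float foo_restructured() {
    const float (*rowA)[16] = A + 1;  // first row the loop reads
    const float (*rowB)[16] = B + 1;
    float sum = 0.0f;
    for (int n = 0; n < 31; ++n, ++rowA, ++rowB) {
        float prod[4];
        for (int j = 0; j < 4; ++j)   // candidate for a single 4-wide multiply
            prod[j] = (*rowA)[j] * (*rowB)[j];
        for (int j = 0; j < 4; ++j)   // in-order adds keep the FP result identical
            sum += prod[j];
    }
    return sum;
}

// Hypothetical helper: fills the arrays with small integers so every product
// and partial sum is exactly representable in float.
void fill_small_integers() {
    for (int i = 0; i < SIZE; ++i)
        for (int j = 0; j < 16; ++j) {
            A[i][j] = static_cast<float>(i + j);
            B[i][j] = static_cast<float>(1 + (j % 2));
        }
}
```

After calling `fill_small_integers()`, both functions return the same value, since the restructuring changes only how the operands are addressed and grouped, not the order of the additions. Point 3 is different in kind: a HADDPS-style reduction adds lanes pairwise, which reassociates the sum and so is not covered by this sketch.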
