Poor unrolling prevents vectorization opportunities #50452

Open
@RKSimon

Description

Bugzilla Link 51108
Version trunk
OS Windows NT
CC @adibiagio,@fhahn,@LebedevRI,@MattPD,@OCHyams,@rotateright

Extended Description

https://godbolt.org/z/jE1e9rT5j
(NOTE: disabled fma on gcc to prevent fmul+fadd->fma diff)

constexpr int SIZE = 128;
float A[SIZE][16];
float B[SIZE][16];

__attribute__((__noinline__))
float foo()
{
    float sum = 0.0f;
    for (int i = 1; i < 32; ++i)
        for (int j = 0; j < 4; ++j)
            sum += A[i][j] * B[i][j];
    
    return sum;
}

clang -g0 -O3 -march=znver2

_Z3foov:
        vxorps  %xmm0, %xmm0, %xmm0
        movq    $-1984, %rax                    # imm = 0xF840
.LBB0_1:
        vmovss  A+2048(%rax), %xmm1             # xmm1 = mem[0],zero,zero,zero
        vmovsd  B+2052(%rax), %xmm2             # xmm2 = mem[0],zero
        vmulss  B+2048(%rax), %xmm1, %xmm1
        vaddss  %xmm1, %xmm0, %xmm0
        vmovsd  A+2052(%rax), %xmm1             # xmm1 = mem[0],zero
        vmulps  %xmm2, %xmm1, %xmm1
        vaddss  %xmm1, %xmm0, %xmm0
        vmovshdup       %xmm1, %xmm1            # xmm1 = xmm1[1,1,3,3]
        vaddss  %xmm1, %xmm0, %xmm0
        vmovss  A+2060(%rax), %xmm1             # xmm1 = mem[0],zero,zero,zero
        vmulss  B+2060(%rax), %xmm1, %xmm1
        addq    $64, %rax
        vaddss  %xmm1, %xmm0, %xmm0
        jne     .LBB0_1
        retq

The clang code has several issues:

1 - with a better indvar we could avoid the very large offsets in the address math (keep A and B in registers and use a better range/increment for %rax).

2 - GCC recognises that the array is fully dereferenceable, allowing it to use fewer (vector) loads and then extract/shuffle the elements that it requires.

3 - we fail to ensure the per-loop reduction is in a form that can use HADDPS (on targets where it's fast).

4 - the LoopMicroOpBufferSize in the znver3 model has a VERY unexpected effect on unrolling - I'm not sure that clang interprets the buffer size in a way that matches simply copying the value from AMD's hardware specs.
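
For reference, points 1-3 can be pictured at source level with a rough sketch like the one below. It is not what either compiler emits. The names foo_scalar, foo_rowwise, fill_inputs and check_equivalence are made up for illustration, and the input values are arbitrary small integers chosen so that every product and sum is exact in float. Note that the pairwise reduction reassociates the floating-point adds, so a compiler may only make this transformation itself when reassociation is permitted.

```cpp
#include <cassert>

constexpr int SIZE = 128;
float A[SIZE][16];
float B[SIZE][16];

// Same loop nest as foo() in the report: strictly ordered scalar adds.
float foo_scalar() {
    float sum = 0.0f;
    for (int i = 1; i < 32; ++i)
        for (int j = 0; j < 4; ++j)
            sum += A[i][j] * B[i][j];
    return sum;
}

// Hypothetical hand-written shape (an assumption, not compiler output):
// walk row pointers instead of a biased byte offset (point 1), form all
// four products of a row together (point 2), then reduce them pairwise,
// which is the pattern a horizontal add computes (point 3).
float foo_rowwise() {
    float sum = 0.0f;
    float (*a)[16] = &A[1];
    float (*b)[16] = &B[1];
    for (int n = 0; n < 31; ++n, ++a, ++b) {
        float p[4];
        for (int j = 0; j < 4; ++j)
            p[j] = (*a)[j] * (*b)[j];
        // (p0 + p1) + (p2 + p3): reassociated relative to foo_scalar().
        sum += (p[0] + p[1]) + (p[2] + p[3]);
    }
    return sum;
}

// Fill the inputs with small integers so no rounding occurs in either version.
void fill_inputs() {
    for (int i = 0; i < SIZE; ++i)
        for (int j = 0; j < 16; ++j) {
            A[i][j] = static_cast<float>((i % 7) + j);
            B[i][j] = static_cast<float>((j % 3) + 1);
        }
}

// With exact inputs both orderings must give the same result.
void check_equivalence() {
    fill_inputs();
    assert(foo_scalar() == foo_rowwise());
}
```

Calling check_equivalence() from a small driver confirms the two versions agree on these inputs. With general float data the results can differ in the last bits because of the changed addition order.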
