
[AArch64] treat @llvm.ssub.sat the same as @llvm.aarch64.neon.sqsub #140454


Merged

Conversation

folkertdev
Contributor

fixes #94463

This is probably not ready to merge, but I figured it'd be easier if we could look at and comment on the code. I know all commits should eventually be squashed, but for now, having multiple commits makes it easier to revert changes.

cc @davemgreen @SpencerAbson
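
For illustration, a minimal sketch of the two forms this patch makes equivalent (the function names are hypothetical, not from the patch). With this change, both are expected to select the same sqsub instruction:

declare <4 x i32> @llvm.ssub.sat.v4i32(<4 x i32>, <4 x i32>)
declare <4 x i32> @llvm.aarch64.neon.sqsub.v4i32(<4 x i32>, <4 x i32>)

; generic target-independent saturating subtract
define <4 x i32> @generic_ssub_sat(<4 x i32> %a, <4 x i32> %b) {
  %r = call <4 x i32> @llvm.ssub.sat.v4i32(<4 x i32> %a, <4 x i32> %b)
  ret <4 x i32> %r
}

; AArch64-specific form (typically what clang emits for vqsubq_s32)
define <4 x i32> @neon_sqsub(<4 x i32> %a, <4 x i32> %b) {
  %r = call <4 x i32> @llvm.aarch64.neon.sqsub.v4i32(<4 x i32> %a, <4 x i32> %b)
  ret <4 x i32> %r
}

; both are expected to select: sqsub v0.4s, v0.4s, v1.4s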


Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository, in which case you can instead tag reviewers by name in a comment using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@folkertdev folkertdev force-pushed the neon-optimize-generic-saturating-intrinsics branch from d95389d to ef9b896 on May 18, 2025 14:01
@folkertdev
Contributor Author

Yeah, two tests fail in llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll; I believe this is a regression:

define <4 x float> @optimize_dup(<4 x float> %a, <4 x float> %b, <4 x float> %c, <4 x float> %v) {
; CHECK-LABEL: optimize_dup:
; CHECK:       // %bb.0: // %entry
; CHECK-NEXT:    fmla v0.4s, v1.4s, v3.s[3]
; CHECK-NEXT:    fmls v0.4s, v2.4s, v3.s[3]
; CHECK-NEXT:    ret
entry:
  %lane1 = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>
  %0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane1, <4 x float> %b, <4 x float> %a)
  %lane2 = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>
  %1 = fmul <4 x float> %lane2, %c
  %s = fsub <4 x float> %0, %1
  ret <4 x float> %s
}

define <4 x float> @no_optimize_dup(<4 x float> %a, <4 x float> %b, <4 x float> %c, <4 x float> %v) {
; CHECK-LABEL: no_optimize_dup:
; CHECK:       // %bb.0: // %entry
; CHECK-NEXT:    fmla v0.4s, v1.4s, v3.s[3]
; CHECK-NEXT:    fmls v0.4s, v2.4s, v3.s[1]
; CHECK-NEXT:    ret
entry:
  %lane1 = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>
  %0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane1, <4 x float> %b, <4 x float> %a)
  %lane2 = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
  %1 = fmul <4 x float> %lane2, %c
  %s = fsub <4 x float> %0, %1
  ret <4 x float> %s
}
        3738:  .type optimize_dup,@function 
        3739: optimize_dup: // @optimize_dup 
        3740:  .cfi_startproc 
        3741: // %bb.0: // %entry 
        3742:  fmul v2.4s, v2.4s, v3.s[3] 
        3743:  fmla v0.4s, v1.4s, v3.s[3] 
next:4507      !~~~~~~~~~~~~~~~~~~~~~~~~~  error: match on wrong line
        3744:  fsub v0.4s, v0.4s, v2.4s 
        3745:  ret 
        3746: .Lfunc_end298: 
        3747:  .size optimize_dup, .Lfunc_end298-optimize_dup 
        3748:  .cfi_endproc 
        3749:  // -- End function 
        3750:  .globl no_optimize_dup // -- Begin function no_optimize_dup 
        3751:  .p2align 2 
        3752:  .type no_optimize_dup,@function 
        3753: no_optimize_dup: // @no_optimize_dup 
        3754:  .cfi_startproc 
        3755: // %bb.0: // %entry 
        3756:  fmul v2.4s, v2.4s, v3.s[1] 
        3757:  fmla v0.4s, v1.4s, v3.s[3] 
next:4522      !~~~~~~~~~~~~~~~~~~~~~~~~~  error: match on wrong line
        3758:  fsub v0.4s, v0.4s, v2.4s 
        3759:  ret 
        3760: .Lfunc_end299: 
        3761:  .size no_optimize_dup, .Lfunc_end299-no_optimize_dup 
        3762:  .cfi_endproc 

It looks like the test assumes that the operations are fused, and with these changes they no longer are.

Does that sound right? I guess there is some other rule with fma that we've missed?

@folkertdev
Contributor Author

Hmm, apparently I messed up some flags. On CI, and with the command from the top of the file, it looks like some movs get inserted in an unexpected place. I have no idea why that would be the case.

Also, a bunch of the ; CHECK-GI-NEXT: warning: lines added by the first command have disappeared when I run the tests now?!


github-actions bot commented May 18, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@davemgreen
Collaborator

Also, a bunch of the ; CHECK-GI-NEXT: warning: lines added by the first command have disappeared when I run the tests now?!

That sounds good - it means that GISel is falling back to SDAG less often. If the tests show less efficient code (on the CHECK-GI lines), that's OK as long as it is falling back less. There are a number of issues in GISel that we are still working through.

@SpencerAbson
Contributor

Thank you for doing this, I'll take a look as soon as I can.

@nikic
Contributor

nikic commented May 24, 2025

Would it be possible to entirely remove these neon intrinsics and auto-upgrade them to the generic ones?

@nikic nikic changed the title [LLVM] treat @llvm.ssub.sat the same as @llvm.aarch64.neon.sqsub [AArch64] treat @llvm.ssub.sat the same as @llvm.aarch64.neon.sqsub May 24, 2025
@davemgreen
Collaborator

Would it be possible to entirely remove these neon intrinsics and auto-upgrade them to the generic ones?

For other intrinsics (abs, min/max) that is a good way to go. For these, though, the instructions technically set a Q flag on saturation that users might try to use. That is not something we support at the moment (we don't model the side effects), but having the intrinsics present allows us to pivot if we need to in the future.

@davemgreen
Collaborator

This is still a draft, but probably deserves to be a proper review at this point.

@folkertdev folkertdev marked this pull request as ready for review May 27, 2025 16:07
@llvmbot
Member

llvmbot commented May 27, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Folkert de Vries (folkertdev)

Changes

fixes #94463

This is probably not ready to merge, but I figured it'd be easier if we could look at and comment on the code. I know all commits should eventually be squashed, but for now, having multiple commits makes it easier to revert changes.

cc @davemgreen @SpencerAbson


Patch is 26.54 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/140454.diff

6 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+20)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrFormats.td (+16-33)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.td (+8-16)
  • (modified) llvm/lib/Target/AArch64/GISel/AArch64LegalizerInfo.cpp (+24-13)
  • (modified) llvm/test/CodeGen/AArch64/arm64-neon-3vdiff.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/arm64-vmul.ll (+184)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 293292d47dd48..0905901c5f69b 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -6235,6 +6235,26 @@ SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
           DAG.getNode(
               AArch64ISD::URSHR_I, dl, Op.getOperand(1).getValueType(), Op.getOperand(1), Op.getOperand(2)));
     return SDValue();
+  case Intrinsic::aarch64_neon_sqadd:
+    if (Op.getValueType().isVector())
+      return DAG.getNode(ISD::SADDSAT, dl, Op.getValueType(), Op.getOperand(1),
+                         Op.getOperand(2));
+    return SDValue();
+  case Intrinsic::aarch64_neon_sqsub:
+    if (Op.getValueType().isVector())
+      return DAG.getNode(ISD::SSUBSAT, dl, Op.getValueType(), Op.getOperand(1),
+                         Op.getOperand(2));
+    return SDValue();
+  case Intrinsic::aarch64_neon_uqadd:
+    if (Op.getValueType().isVector())
+      return DAG.getNode(ISD::UADDSAT, dl, Op.getValueType(), Op.getOperand(1),
+                         Op.getOperand(2));
+    return SDValue();
+  case Intrinsic::aarch64_neon_uqsub:
+    if (Op.getValueType().isVector())
+      return DAG.getNode(ISD::USUBSAT, dl, Op.getValueType(), Op.getOperand(1),
+                         Op.getOperand(2));
+    return SDValue();
   case Intrinsic::aarch64_sve_whilelt:
     return optimizeIncrementingWhile(Op.getNode(), DAG, /*IsSigned=*/true,
                                      /*IsEqual=*/false);
diff --git a/llvm/lib/Target/AArch64/AArch64InstrFormats.td b/llvm/lib/Target/AArch64/AArch64InstrFormats.td
index 5489541fcb318..6adf84879052f 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrFormats.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrFormats.td
@@ -6256,24 +6256,6 @@ multiclass SIMDThreeSameVector<bit U, bits<5> opc, string asm,
          [(set (v2i64 V128:$Rd), (OpNode (v2i64 V128:$Rn), (v2i64 V128:$Rm)))]>;
 }
 
-multiclass SIMDThreeSameVectorExtraPatterns<string inst, SDPatternOperator OpNode> {
-  def : Pat<(v8i8 (OpNode V64:$LHS, V64:$RHS)),
-          (!cast<Instruction>(inst#"v8i8") V64:$LHS, V64:$RHS)>;
-  def : Pat<(v4i16 (OpNode V64:$LHS, V64:$RHS)),
-          (!cast<Instruction>(inst#"v4i16") V64:$LHS, V64:$RHS)>;
-  def : Pat<(v2i32 (OpNode V64:$LHS, V64:$RHS)),
-          (!cast<Instruction>(inst#"v2i32") V64:$LHS, V64:$RHS)>;
-
-  def : Pat<(v16i8 (OpNode V128:$LHS, V128:$RHS)),
-          (!cast<Instruction>(inst#"v16i8") V128:$LHS, V128:$RHS)>;
-  def : Pat<(v8i16 (OpNode V128:$LHS, V128:$RHS)),
-          (!cast<Instruction>(inst#"v8i16") V128:$LHS, V128:$RHS)>;
-  def : Pat<(v4i32 (OpNode V128:$LHS, V128:$RHS)),
-          (!cast<Instruction>(inst#"v4i32") V128:$LHS, V128:$RHS)>;
-  def : Pat<(v2i64 (OpNode V128:$LHS, V128:$RHS)),
-          (!cast<Instruction>(inst#"v2i64") V128:$LHS, V128:$RHS)>;
-}
-
 // As above, but D sized elements unsupported.
 multiclass SIMDThreeSameVectorBHS<bit U, bits<5> opc, string asm,
                                   SDPatternOperator OpNode> {
@@ -9861,14 +9843,15 @@ multiclass SIMDIndexedLongSD<bit U, bits<4> opc, string asm,
 }
 
 multiclass SIMDIndexedLongSQDMLXSDTied<bit U, bits<4> opc, string asm,
-                                       SDPatternOperator Accum> {
+                                       SDPatternOperator VecAcc,
+                                       SDPatternOperator ScalAcc> {
   def v4i16_indexed : BaseSIMDIndexedTied<0, U, 0, 0b01, opc,
                                       V128, V64,
                                       V128_lo, VectorIndexH,
                                       asm, ".4s", ".4s", ".4h", ".h",
     [(set (v4i32 V128:$dst),
-          (Accum (v4i32 V128:$Rd),
-                 (v4i32 (int_aarch64_neon_sqdmull
+          (VecAcc (v4i32 V128:$Rd),
+                  (v4i32 (int_aarch64_neon_sqdmull
                              (v4i16 V64:$Rn),
                              (dup_v8i16 (v8i16 V128_lo:$Rm),
                                          VectorIndexH:$idx)))))]> {
@@ -9883,8 +9866,8 @@ multiclass SIMDIndexedLongSQDMLXSDTied<bit U, bits<4> opc, string asm,
                                       V128_lo, VectorIndexH,
                                       asm#"2", ".4s", ".4s", ".8h", ".h",
     [(set (v4i32 V128:$dst),
-          (Accum (v4i32 V128:$Rd),
-                 (v4i32 (int_aarch64_neon_sqdmull
+          (VecAcc (v4i32 V128:$Rd),
+                  (v4i32 (int_aarch64_neon_sqdmull
                             (extract_high_v8i16 (v8i16 V128:$Rn)),
                             (extract_high_dup_v8i16 (v8i16 V128_lo:$Rm), VectorIndexH:$idx)))))]> {
     bits<3> idx;
@@ -9898,8 +9881,8 @@ multiclass SIMDIndexedLongSQDMLXSDTied<bit U, bits<4> opc, string asm,
                                       V128, VectorIndexS,
                                       asm, ".2d", ".2d", ".2s", ".s",
     [(set (v2i64 V128:$dst),
-        (Accum (v2i64 V128:$Rd),
-               (v2i64 (int_aarch64_neon_sqdmull
+        (VecAcc (v2i64 V128:$Rd),
+                (v2i64 (int_aarch64_neon_sqdmull
                           (v2i32 V64:$Rn),
                           (dup_v4i32 (v4i32 V128:$Rm), VectorIndexS:$idx)))))]> {
     bits<2> idx;
@@ -9912,8 +9895,8 @@ multiclass SIMDIndexedLongSQDMLXSDTied<bit U, bits<4> opc, string asm,
                                       V128, VectorIndexS,
                                       asm#"2", ".2d", ".2d", ".4s", ".s",
     [(set (v2i64 V128:$dst),
-          (Accum (v2i64 V128:$Rd),
-                 (v2i64 (int_aarch64_neon_sqdmull
+          (VecAcc (v2i64 V128:$Rd),
+                  (v2i64 (int_aarch64_neon_sqdmull
                             (extract_high_v4i32 (v4i32 V128:$Rn)),
                             (extract_high_dup_v4i32 (v4i32 V128:$Rm), VectorIndexS:$idx)))))]> {
     bits<2> idx;
@@ -9930,8 +9913,8 @@ multiclass SIMDIndexedLongSQDMLXSDTied<bit U, bits<4> opc, string asm,
     let Inst{20} = idx{0};
   }
 
-  def : Pat<(i32 (Accum (i32 FPR32Op:$Rd),
-                        (i32 (vector_extract
+  def : Pat<(i32 (ScalAcc (i32 FPR32Op:$Rd),
+                          (i32 (vector_extract
                                     (v4i32 (int_aarch64_neon_sqdmull
                                                 (v4i16 V64:$Rn),
                                                 (v4i16 V64:$Rm))),
@@ -9942,8 +9925,8 @@ multiclass SIMDIndexedLongSQDMLXSDTied<bit U, bits<4> opc, string asm,
                         (INSERT_SUBREG (IMPLICIT_DEF), V64:$Rm, dsub),
                         (i64 0))>;
 
-  def : Pat<(i32 (Accum (i32 FPR32Op:$Rd),
-                        (i32 (vector_extract
+  def : Pat<(i32 (ScalAcc (i32 FPR32Op:$Rd),
+                          (i32 (vector_extract
                                     (v4i32 (int_aarch64_neon_sqdmull
                                                 (v4i16 V64:$Rn),
                                                 (dup_v8i16 (v8i16 V128_lo:$Rm),
@@ -9959,8 +9942,8 @@ multiclass SIMDIndexedLongSQDMLXSDTied<bit U, bits<4> opc, string asm,
                                       FPR64Op, FPR32Op, V128, VectorIndexS,
                                       asm, ".s", "", "", ".s",
     [(set (i64 FPR64Op:$dst),
-          (Accum (i64 FPR64Op:$Rd),
-                 (i64 (int_aarch64_neon_sqdmulls_scalar
+          (ScalAcc (i64 FPR64Op:$Rd),
+                   (i64 (int_aarch64_neon_sqdmulls_scalar
                             (i32 FPR32Op:$Rn),
                             (i32 (vector_extract (v4i32 V128:$Rm),
                                                  VectorIndexS:$idx))))))]> {
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index 010c7c391527f..9b256b2a7a878 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -5811,12 +5811,12 @@ defm SMAXP    : SIMDThreeSameVectorBHS<0,0b10100,"smaxp", int_aarch64_neon_smaxp
 defm SMAX     : SIMDThreeSameVectorBHS<0,0b01100,"smax", smax>;
 defm SMINP    : SIMDThreeSameVectorBHS<0,0b10101,"sminp", int_aarch64_neon_sminp>;
 defm SMIN     : SIMDThreeSameVectorBHS<0,0b01101,"smin", smin>;
-defm SQADD    : SIMDThreeSameVector<0,0b00001,"sqadd", int_aarch64_neon_sqadd>;
+defm SQADD    : SIMDThreeSameVector<0,0b00001,"sqadd", saddsat>;
 defm SQDMULH  : SIMDThreeSameVectorHS<0,0b10110,"sqdmulh",int_aarch64_neon_sqdmulh>;
 defm SQRDMULH : SIMDThreeSameVectorHS<1,0b10110,"sqrdmulh",int_aarch64_neon_sqrdmulh>;
 defm SQRSHL   : SIMDThreeSameVector<0,0b01011,"sqrshl", int_aarch64_neon_sqrshl>;
 defm SQSHL    : SIMDThreeSameVector<0,0b01001,"sqshl", int_aarch64_neon_sqshl>;
-defm SQSUB    : SIMDThreeSameVector<0,0b00101,"sqsub", int_aarch64_neon_sqsub>;
+defm SQSUB    : SIMDThreeSameVector<0,0b00101,"sqsub", ssubsat>;
 defm SRHADD   : SIMDThreeSameVectorBHS<0,0b00010,"srhadd", avgceils>;
 defm SRSHL    : SIMDThreeSameVector<0,0b01010,"srshl", int_aarch64_neon_srshl>;
 defm SSHL     : SIMDThreeSameVector<0,0b01000,"sshl", int_aarch64_neon_sshl>;
@@ -5830,10 +5830,10 @@ defm UMAXP    : SIMDThreeSameVectorBHS<1,0b10100,"umaxp", int_aarch64_neon_umaxp
 defm UMAX     : SIMDThreeSameVectorBHS<1,0b01100,"umax", umax>;
 defm UMINP    : SIMDThreeSameVectorBHS<1,0b10101,"uminp", int_aarch64_neon_uminp>;
 defm UMIN     : SIMDThreeSameVectorBHS<1,0b01101,"umin", umin>;
-defm UQADD    : SIMDThreeSameVector<1,0b00001,"uqadd", int_aarch64_neon_uqadd>;
+defm UQADD    : SIMDThreeSameVector<1,0b00001,"uqadd", uaddsat>;
 defm UQRSHL   : SIMDThreeSameVector<1,0b01011,"uqrshl", int_aarch64_neon_uqrshl>;
 defm UQSHL    : SIMDThreeSameVector<1,0b01001,"uqshl", int_aarch64_neon_uqshl>;
-defm UQSUB    : SIMDThreeSameVector<1,0b00101,"uqsub", int_aarch64_neon_uqsub>;
+defm UQSUB    : SIMDThreeSameVector<1,0b00101,"uqsub", usubsat>;
 defm URHADD   : SIMDThreeSameVectorBHS<1,0b00010,"urhadd", avgceilu>;
 defm URSHL    : SIMDThreeSameVector<1,0b01010,"urshl", int_aarch64_neon_urshl>;
 defm USHL     : SIMDThreeSameVector<1,0b01000,"ushl", int_aarch64_neon_ushl>;
@@ -5842,12 +5842,6 @@ defm SQRDMLAH : SIMDThreeSameVectorSQRDMLxHTiedHS<1,0b10000,"sqrdmlah",
 defm SQRDMLSH : SIMDThreeSameVectorSQRDMLxHTiedHS<1,0b10001,"sqrdmlsh",
                                                     int_aarch64_neon_sqrdmlsh>;
 
-// Extra saturate patterns, other than the intrinsics matches above
-defm : SIMDThreeSameVectorExtraPatterns<"SQADD", saddsat>;
-defm : SIMDThreeSameVectorExtraPatterns<"UQADD", uaddsat>;
-defm : SIMDThreeSameVectorExtraPatterns<"SQSUB", ssubsat>;
-defm : SIMDThreeSameVectorExtraPatterns<"UQSUB", usubsat>;
-
 defm AND : SIMDLogicalThreeVector<0, 0b00, "and", and>;
 defm BIC : SIMDLogicalThreeVector<0, 0b01, "bic",
                                   BinOpFrag<(and node:$LHS, (vnot node:$RHS))> >;
@@ -6563,10 +6557,8 @@ defm SMLAL   : SIMDLongThreeVectorTiedBHS<0, 0b1000, "smlal",
 defm SMLSL   : SIMDLongThreeVectorTiedBHS<0, 0b1010, "smlsl",
     TriOpFrag<(sub node:$LHS, (AArch64smull node:$MHS, node:$RHS))>>;
 defm SMULL   : SIMDLongThreeVectorBHS<0, 0b1100, "smull", AArch64smull>;
-defm SQDMLAL : SIMDLongThreeVectorSQDMLXTiedHS<0, 0b1001, "sqdmlal",
-                                               int_aarch64_neon_sqadd>;
-defm SQDMLSL : SIMDLongThreeVectorSQDMLXTiedHS<0, 0b1011, "sqdmlsl",
-                                               int_aarch64_neon_sqsub>;
+defm SQDMLAL : SIMDLongThreeVectorSQDMLXTiedHS<0, 0b1001, "sqdmlal", saddsat>;
+defm SQDMLSL : SIMDLongThreeVectorSQDMLXTiedHS<0, 0b1011, "sqdmlsl", ssubsat>;
 defm SQDMULL : SIMDLongThreeVectorHS<0, 0b1101, "sqdmull",
                                      int_aarch64_neon_sqdmull>;
 defm SSUBL   : SIMDLongThreeVectorBHS<0, 0b0010, "ssubl",
@@ -8125,9 +8117,9 @@ defm SMLAL : SIMDVectorIndexedLongSDTied<0, 0b0010, "smlal",
 defm SMLSL : SIMDVectorIndexedLongSDTied<0, 0b0110, "smlsl",
     TriOpFrag<(sub node:$LHS, (AArch64smull node:$MHS, node:$RHS))>>;
 defm SMULL : SIMDVectorIndexedLongSD<0, 0b1010, "smull", AArch64smull>;
-defm SQDMLAL : SIMDIndexedLongSQDMLXSDTied<0, 0b0011, "sqdmlal",
+defm SQDMLAL : SIMDIndexedLongSQDMLXSDTied<0, 0b0011, "sqdmlal", saddsat,
                                            int_aarch64_neon_sqadd>;
-defm SQDMLSL : SIMDIndexedLongSQDMLXSDTied<0, 0b0111, "sqdmlsl",
+defm SQDMLSL : SIMDIndexedLongSQDMLXSDTied<0, 0b0111, "sqdmlsl", ssubsat,
                                            int_aarch64_neon_sqsub>;
 defm SQRDMLAH : SIMDIndexedSQRDMLxHSDTied<1, 0b1101, "sqrdmlah",
                                           int_aarch64_neon_sqrdmlah>;
diff --git a/llvm/lib/Target/AArch64/GISel/AArch64LegalizerInfo.cpp b/llvm/lib/Target/AArch64/GISel/AArch64LegalizerInfo.cpp
index 7ff2e55e802c5..fdaad067fc69d 100644
--- a/llvm/lib/Target/AArch64/GISel/AArch64LegalizerInfo.cpp
+++ b/llvm/lib/Target/AArch64/GISel/AArch64LegalizerInfo.cpp
@@ -1622,8 +1622,10 @@ bool AArch64LegalizerInfo::legalizeSmallCMGlobalValue(
 
 bool AArch64LegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
                                              MachineInstr &MI) const {
-  auto LowerBinOp = [&MI](unsigned Opcode) {
-    MachineIRBuilder MIB(MI);
+  MachineIRBuilder MIB(MI);
+  MachineRegisterInfo &MRI = *MIB.getMRI();
+
+  auto LowerBinOp = [&MI, &MIB](unsigned Opcode) {
     MIB.buildInstr(Opcode, {MI.getOperand(0)},
                    {MI.getOperand(2), MI.getOperand(3)});
     MI.eraseFromParent();
@@ -1642,7 +1644,6 @@ bool AArch64LegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
     MachineFunction &MF = *MI.getMF();
     auto Val = MF.getRegInfo().createGenericVirtualRegister(
         LLT::scalar(VaListSize * 8));
-    MachineIRBuilder MIB(MI);
     MIB.buildLoad(Val, MI.getOperand(2),
                   *MF.getMachineMemOperand(MachinePointerInfo(),
                                            MachineMemOperand::MOLoad,
@@ -1664,14 +1665,12 @@ bool AArch64LegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
     assert(MI.getOpcode() == TargetOpcode::G_INTRINSIC_W_SIDE_EFFECTS);
     // Anyext the value being set to 64 bit (only the bottom 8 bits are read by
     // the instruction).
-    MachineIRBuilder MIB(MI);
     auto &Value = MI.getOperand(3);
     Register ExtValueReg = MIB.buildAnyExt(LLT::scalar(64), Value).getReg(0);
     Value.setReg(ExtValueReg);
     return true;
   }
   case Intrinsic::aarch64_prefetch: {
-    MachineIRBuilder MIB(MI);
     auto &AddrVal = MI.getOperand(1);
 
     int64_t IsWrite = MI.getOperand(2).getImm();
@@ -1694,8 +1693,6 @@ bool AArch64LegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
   case Intrinsic::aarch64_neon_smaxv:
   case Intrinsic::aarch64_neon_uminv:
   case Intrinsic::aarch64_neon_sminv: {
-    MachineIRBuilder MIB(MI);
-    MachineRegisterInfo &MRI = *MIB.getMRI();
     bool IsSigned = IntrinsicID == Intrinsic::aarch64_neon_saddv ||
                     IntrinsicID == Intrinsic::aarch64_neon_smaxv ||
                     IntrinsicID == Intrinsic::aarch64_neon_sminv;
@@ -1720,8 +1717,6 @@ bool AArch64LegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
   }
   case Intrinsic::aarch64_neon_uaddlp:
   case Intrinsic::aarch64_neon_saddlp: {
-    MachineIRBuilder MIB(MI);
-
     unsigned Opc = IntrinsicID == Intrinsic::aarch64_neon_uaddlp
                        ? AArch64::G_UADDLP
                        : AArch64::G_SADDLP;
@@ -1732,9 +1727,6 @@ bool AArch64LegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
   }
   case Intrinsic::aarch64_neon_uaddlv:
   case Intrinsic::aarch64_neon_saddlv: {
-    MachineIRBuilder MIB(MI);
-    MachineRegisterInfo &MRI = *MIB.getMRI();
-
     unsigned Opc = IntrinsicID == Intrinsic::aarch64_neon_uaddlv
                        ? AArch64::G_UADDLV
                        : AArch64::G_SADDLV;
@@ -1790,11 +1782,30 @@ bool AArch64LegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
     return LowerBinOp(AArch64::G_UMULL);
   case Intrinsic::aarch64_neon_abs: {
     // Lower the intrinsic to G_ABS.
-    MachineIRBuilder MIB(MI);
     MIB.buildInstr(TargetOpcode::G_ABS, {MI.getOperand(0)}, {MI.getOperand(2)});
     MI.eraseFromParent();
     return true;
   }
+  case Intrinsic::aarch64_neon_sqadd: {
+    if (MRI.getType(MI.getOperand(0).getReg()).isVector())
+      return LowerBinOp(TargetOpcode::G_SADDSAT);
+    break;
+  }
+  case Intrinsic::aarch64_neon_sqsub: {
+    if (MRI.getType(MI.getOperand(0).getReg()).isVector())
+      return LowerBinOp(TargetOpcode::G_SSUBSAT);
+    break;
+  }
+  case Intrinsic::aarch64_neon_uqadd: {
+    if (MRI.getType(MI.getOperand(0).getReg()).isVector())
+      return LowerBinOp(TargetOpcode::G_UADDSAT);
+    break;
+  }
+  case Intrinsic::aarch64_neon_uqsub: {
+    if (MRI.getType(MI.getOperand(0).getReg()).isVector())
+      return LowerBinOp(TargetOpcode::G_USUBSAT);
+    break;
+  }
 
   case Intrinsic::vector_reverse:
     // TODO: Add support for vector_reverse
diff --git a/llvm/test/CodeGen/AArch64/arm64-neon-3vdiff.ll b/llvm/test/CodeGen/AArch64/arm64-neon-3vdiff.ll
index 9fb8e4c8fe031..bd28d13973f9c 100644
--- a/llvm/test/CodeGen/AArch64/arm64-neon-3vdiff.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-neon-3vdiff.ll
@@ -2539,16 +2539,16 @@ define <8 x i16> @cmplx_mul_combined_re_im(<8 x i16> noundef %a, i64 %scale.coer
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    lsr x8, x0, #16
 ; CHECK-NEXT:    movi v1.2d, #0xffff0000ffff0000
-; CHECK-NEXT:    fmov d5, x0
 ; CHECK-NEXT:    rev32 v4.8h, v0.8h
 ; CHECK-NEXT:    dup v2.8h, w8
 ; CHECK-NEXT:    sqneg v3.8h, v2.8h
 ; CHECK-NEXT:    bsl v1.16b, v2.16b, v3.16b
-; CHECK-NEXT:    sqdmull v2.4s, v0.4h, v5.h[0]
-; CHECK-NEXT:    sqdmull2 v0.4s, v0.8h, v5.h[0]
-; CHECK-NEXT:    sqdmlal v2.4s, v4.4h, v1.4h
-; CHECK-NEXT:    sqdmlal2 v0.4s, v4.8h, v1.8h
-; CHECK-NEXT:    uzp2 v0.8h, v2.8h, v0.8h
+; CHECK-NEXT:    fmov d3, x0
+; CHECK-NEXT:    sqdmull v2.4s, v4.4h, v1.4h
+; CHECK-NEXT:    sqdmull2 v1.4s, v4.8h, v1.8h
+; CHECK-NEXT:    sqdmlal v2.4s, v0.4h, v3.h[0]
+; CHECK-NEXT:    sqdmlal2 v1.4s, v0.8h, v3.h[0]
+; CHECK-NEXT:    uzp2 v0.8h, v2.8h, v1.8h
 ; CHECK-NEXT:    ret
 entry:
   %scale.sroa.2.0.extract.shift23 = lshr i64 %scale.coerce, 16
diff --git a/llvm/test/CodeGen/AArch64/arm64-vmul.ll b/llvm/test/CodeGen/AArch64/arm64-vmul.ll
index 499786470d4ac..937a17ca6c1e0 100644
--- a/llvm/test/CodeGen/AArch64/arm64-vmul.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-vmul.ll
@@ -2863,3 +2863,187 @@ define <1 x i64> @test_mul_v1i64(<1 x i64> %lhs, <1 x i64> %rhs) nounwind {
   %prod = mul <1 x i64> %lhs, %rhs
   ret <1 x i64> %prod
 }
+
+define <4 x i32> @sqdmlal4s_lib(<4 x i32> %dst, <4 x i16> %v1, <4 x i16> %v2) {
+; CHECK-LABEL: sqdmlal4s_lib:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sqdmlal.4s v0, v1, v2
+; CHECK-NEXT:    ret
+  %tmp  = call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %v1, <4 x i16> %v2)
+  %sum = call <4 x i32> @llvm.sadd.sat.v4i32(<4 x i32> %dst, <4 x i32> %tmp)
+  ret <4 x i32> %sum
+}
+
+define <2 x i64> @sqdmlal2d_lib(<2 x i64> %dst, <2 x i32> %v1, <2 x i32> %v2) {
+; CHECK-LABEL: sqdmlal2d_lib:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sqdmlal.2d v0, v1, v2
+; CHECK-NEXT:    ret
+  %tmp  = call <2 x i64> @llvm.aarch64.neon.sqdmull.v2i64(<2 x i32> %v1, <2 x i32> %v2)
+  %sum = call <2 x i64> @llvm.sadd.sat.v2i64(<2 x i64> %dst, <2 x i64> %tmp)
+  ret <2 x i64> %sum
+}
+
+define <4 x i32> @sqdmlal2_4s_lib(<4 x i32> %dst, <8 x i16> %v1, <8 x i16> %v2) {
+; CHECK-LABEL: sqdmlal2_4s_lib:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sqdmlal2.4s v0, v1, v2
+; CHECK-NEXT:    ret
+  %tmp0 = shufflevector <8 x i16> %v1, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
+  %tmp1 = shufflevector <8 x i16> %v2, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
+  %tmp2  = call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %tmp0, <4 x i16> %tmp1)
+  %sum = call <4 x i32> @llvm.sadd.sat.v4i32(<4 x i32> %dst, <4 x i32> %tmp2)
+  ret <4 x i32> %sum
+}
+
+define <2 x i64> @sqdmlal2_2d_lib(<2 x i64> %dst, <4 x i32> %v1, <4 x i32> %v2) {
+; CHECK-LABEL: sqdmlal2_2d_lib:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sqdmlal2.2d v0, v1, v2
+; CHECK-NEXT:    ret
+  %tmp0 = shufflevector <4 x i32> %v1, <4 x i32> poison, <2 x i32> <i32 2, i32 3>
+  %tmp1 = shufflevector <4 x i32> %v2, <4 x i32> poison, <2 x i32> <i32 2, i32 3>
+  %tmp2  = call <2 x i64> @llvm.aarch64.neon.sqdmull.v2i64(<2 x i32> %tmp0, <2 x i...
[truncated]

@folkertdev
Contributor Author

I marked it as ready. There is currently a request for a comment above; otherwise, I think all previous comments have been handled.

@davemgreen davemgreen left a comment (Collaborator)

Thanks - this LGTM if there are no other comments from @SpencerAbson.

@nikic
Contributor

nikic commented May 28, 2025

Would it be possible to entirely remove these neon intrinsics and auto-upgrade them to the generic ones?

For other intrinsics (abs, min/max) that is a good way to go. For these, though, the instructions technically set a Q flag on saturation that users might try to use. That is not something we support at the moment (we don't model the side effects), but having the intrinsics present allows us to pivot if we need to in the future.

Hm, if you wanted to model the Q flag, I'd assume you would change the intrinsic to return something like { <4 x i32>, i1 } instead?

That seems like a reasonable motivation for a separate intrinsic -- but for the case where the i1 result is unused, wouldn't you still want to use the target-independent ones anyway?
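
For illustration, such an intrinsic might look like the sketch below. The name and signature are hypothetical (no such intrinsic exists today), returning the saturated value together with an i1 for the Q flag:

; hypothetical pseudo-IR; the intrinsic name is invented for illustration
declare { <4 x i32>, i1 } @llvm.aarch64.neon.sqsub.q.v4i32(<4 x i32>, <4 x i32>)

define <4 x i32> @sqsub_with_q(<4 x i32> %a, <4 x i32> %b, ptr %q.out) {
  %pair = call { <4 x i32>, i1 } @llvm.aarch64.neon.sqsub.q.v4i32(<4 x i32> %a, <4 x i32> %b)
  %res = extractvalue { <4 x i32>, i1 } %pair, 0
  %q = extractvalue { <4 x i32>, i1 } %pair, 1
  store i1 %q, ptr %q.out          ; record whether saturation occurred
  ret <4 x i32> %res
}

; when the i1 result is unused, this could fold back to plain @llvm.ssub.sat.v4i32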

@davemgreen
Collaborator

With the way that the intrinsics are defined in https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&q=qadd, they are stand-alone and the Q flag would be a global side effect. I agree that if the dependency between the intrinsic and the flag were known, then it would want to be modelled as a data dependency.

I'm not against removing the intrinsics, TBH. We just haven't done so yet because we might want to reverse the decision and model Q flags conservatively, as GCC does. We can always re-add them if necessary. The neon.abs and neon.smin/smax/umin/umax intrinsics are more obviously OK to remove (provided the scalar types do not get worse).

@SpencerAbson SpencerAbson left a comment (Contributor)

Thanks, minor comment but LGTM.

@folkertdev folkertdev force-pushed the neon-optimize-generic-saturating-intrinsics branch from 849468d to f5b04f3 on May 29, 2025 13:27
@folkertdev
Contributor Author

I think this PR is ready to merge?

Thanks for all the help here btw!

@davemgreen davemgreen merged commit 3a98934 into llvm:main May 31, 2025
11 checks passed

@folkertdev Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

davemgreen added a commit to davemgreen/llvm-project that referenced this pull request Jun 2, 2025
Add basic handling for v1i64 saddsat, ssubsat, uaddsat and usubsat. We missed
that these were not upgrading in llvm#140454 due to a lack of test coverage.

Fixes llvm#142323
davemgreen added a commit that referenced this pull request Jun 3, 2025
Add basic handling for v1i64 saddsat, ssubsat, uaddsat and usubsat. We
missed that these were not upgrading in #140454 due to a lack of test
coverage, and for some reason the generic v1i64 nodes were not being
treated as legal like they should.

Fixes #142323
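
For reference, a minimal sketch of the 1-element vector case the follow-up addresses (the function name is illustrative):

define <1 x i64> @ssubsat_v1i64(<1 x i64> %a, <1 x i64> %b) {
  ; expected to select the scalar form: sqsub d0, d0, d1
  %r = call <1 x i64> @llvm.ssub.sat.v1i64(<1 x i64> %a, <1 x i64> %b)
  ret <1 x i64> %r
}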
rorth pushed a commit to rorth/llvm-project that referenced this pull request Jun 11, 2025

DhruvSrivastavaX pushed a commit to DhruvSrivastavaX/lldb-for-aix that referenced this pull request Jun 12, 2025

DhruvSrivastavaX pushed a commit to DhruvSrivastavaX/lldb-for-aix that referenced this pull request Jun 12, 2025
Successfully merging this pull request may close these issues.

generic @llvm.ssub.sat optimizes less well than target-specific @llvm.aarch64.neon.sqsub