Notable changes
Compiler
- Added support for AMD Radeon 9060XT and Radeon AI PRO R9700 GPUs #21035.
- Defined the new gfx950 target within AMDGPU support, incorporating several novel MFMAs (Matrix-Fused Multiply-Add operations) #20623.
- Introduced CombineLayoutTransformation to consolidate transpose, reshape, and slice into a single iree_linalg_ext.map_scatter operation #20655.
- Supported medium-sized expanded-shape FP8 in the pingpong strategy #20735 and removed dynamic M bounds checks for pingpong strategies in AMDGPU support #20738.
- Enhanced CombineLayoutTransformation to support folding tensor.pad operations into map_scatter operations #20797.
- Constrained the GPUAllocPrivateMemoryForDPSOps pass to pure tensor semantics, addressing recent implementation issues #20939.
- Enhanced the SpecializeEncodings pass to handle pad-based encodings, allowing non-serializable encodings to be converted into serializable forms #20845.
- Prevented the hoisting of set_encoding and unset_encoding operations that use padding encodings #20733.
- Updated dispatch creation to use patterns that bubble expand_shape operations up across collapse_shape operations #20648.
- Optimized the attention operation by removing unit dimensions from the mask operand #20796.
- Set limits on when padding encodings are applied during dispatch creation #20732.
- Added support for the F8E8M0FNU type, ensuring validity as a HAL element type #20783. Expanded the range of valid HAL element types to include various scaled MFMA types, specifically f8E8M0FNU, f6E3M2FN, and f4E2M1FN.
- Linalg Extensions Dialect improvements (#20688, #20728, #20747, #20776, #19719, #20827, #20863, #20568, #20916).
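For readers unfamiliar with the scaled MFMA element types named above, here is a minimal Python sketch of how their bit patterns map to values. It is not IREE or MLIR code. The function names are hypothetical, and the layouts are assumed to follow the usual microscaling conventions: f8E8M0FNU as an unsigned, exponent-only scale with bias 127 and 0xFF reserved for NaN; f4E2M1FN and f6E3M2FN as sign/exponent/mantissa formats with biases 1 and 3 and no Inf or NaN encodings.

```python
import math


def decode_e8m0(byte: int) -> float:
    """Decode an f8E8M0FNU scale: 8 exponent bits, no sign, no mantissa."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected an 8-bit value")
    if byte == 0xFF:
        return math.nan  # the only non-finite encoding
    return math.ldexp(1.0, byte - 127)  # 2 ** (byte - 127)


def decode_small_float(bits: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode a sign/exponent/mantissa format that has no Inf or NaN encodings."""
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * math.ldexp(man, 1 - bias - man_bits)
    # Normal: implicit leading one folded into the integer significand.
    return sign * math.ldexp((1 << man_bits) + man, exp - bias - man_bits)


# Largest f4E2M1FN value (6.0) multiplied by an E8M0 scale of 2**3.
element = decode_small_float(0b0111, exp_bits=2, man_bits=1, bias=1)
scale = decode_e8m0(130)
print(element * scale)  # 48.0
```

The same helper covers f6E3M2FN by passing exp_bits=3, man_bits=2, bias=3.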
Runtime
- Introduced hal.executable.export condition regions, enhancing dispatch decision-making at each site based on device capabilities and workload parameters #20739.
- Added user-defined IREE_ALLOCATOR_SYSTEM support #20727, providing the ability for external override of the allocator control function.
- Added mimalloc v3 as an optional system allocator #20730; setting -DIREE_ALLOCATOR_SYSTEM=mimalloc statically links mimalloc into iree::base.
- Enabled the import of external streams into HIP for scenarios requiring close integration with external applications #20972.
- Heterogeneous device support is under development, which will allow compiled programs to allocate buffers across compatible devices and synchronize operations via semaphores. This initial phase will support CPU-only configurations, aiming for seamless integration #20851.
- The IREE PJRT plugin now supports memory-related APIs and logging control, and has been updated to the latest API version #20911.
- Implemented an experimental #hal.device.optimal<...> affinity attribute for runtime-resolvable device affinities, initially focusing on allocation-related operations #20879.
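As a rough mental model of the export condition regions mentioned above, the Python sketch below picks an entry point per dispatch by evaluating a predicate over device capabilities and workload. It is not IREE's API or IR. The Device and Export classes, the select_export helper, and the first-match-wins ordering are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Workload = Tuple[int, ...]


@dataclass(frozen=True)
class Device:
    """Hypothetical stand-in for device capabilities queried at runtime."""
    name: str
    subgroup_size: int


@dataclass(frozen=True)
class Export:
    """Hypothetical stand-in for one exported entry point and its condition."""
    name: str
    condition: Callable[[Device, Workload], bool]


def select_export(
    exports: Sequence[Export], device: Device, workload: Workload
) -> Optional[Export]:
    """Return the first export whose condition holds (assumed ordering)."""
    for export in exports:
        if export.condition(device, workload):
            return export
    return None


exports = [
    # Specialized variant: only valid for wide subgroups and large workloads.
    Export("matmul_large_tiles", lambda d, w: d.subgroup_size == 64 and w[0] >= 256),
    # Generic fallback that accepts anything.
    Export("matmul_generic", lambda d, w: True),
]

chosen = select_export(exports, Device("example-gpu", 64), (512, 512))
print(chosen.name)  # matmul_large_tiles
```

In this model the same dispatch site falls back to matmul_generic on a device with a subgroup size of 32, which is the kind of per-site decision the condition regions are described as enabling.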
New Contributors
- @javidcf made their first contribution in #19781
- @TheCBaH made their first contribution in #20719
- @RSchwan made their first contribution in #20761
- @AaronStGeorge made their first contribution in #20795
- @erieaton-amd made their first contribution in #20997
Full changelog
List of changes
- Move ASM to the end of languages list in CMakeLists.txt by @javidcf in #19781
- Add MathToROCDL patterns in ConvertToROCLPass. by @benvanik in #20684
- [Codegen] Add pass to bufferize dispatch.tensor.load/store ops by @Max191 in #20627
- Raise an error in demotion passes if illegal extern funcs are present. by @benvanik in #20679
- Properly handle unaligned refs in VM ABI marshaling. by @benvanik in #20671
- Integrate llvm/llvm-project@7b70fc7 by @IanWood1 in #20674
- [Codegen][GPU] Add placeholder op for buffer casts on tensors by @qedawkins in #20589
- Call IREE_TRACE_APP_ENTER/EXIT in compiler tool main functions. by @benvanik in #20686
- Revert "Call IREE_TRACE_APP_ENTER/EXIT in compiler tool main functions." by @benvanik in #20691
- Revert "Update workflows to run on macOS 15 (#20675)" by @marbre in #20690
- Add flow.tensor.bitcast for torch view as complex/real. by @benvanik in #20689
- [Dispatch Creation] Fix infinite reshape loop by @IanWood1 in #20162
- Fold iree_tensor_ext.dispatch.workload.ordinal on constants. by @benvanik in #20687
- Add relative error for buffer comparison by @nirvedhmeshram in #19464
- [Encoding] Allow PadEncodingAttribute to support dynamic padding. by @MaheshRavishankar in #20662
- [Codegen] Clean up TilingInterfaceUtils. NFC. by @kuhar in #20661
- [iree-benchmark] Ensure destructors run before IREE_TRACE_APP_EXIT by @rkayaith in #20694
- [DispatchCreation] Set padding encodings on intermediate tensors. by @MaheshRavishankar in #20634
- [GPU] Cross lane reduction rather than serial by @pashu123 in #20680
- [Codegen] Drop read_only from LoadFromMemrefOp. by @hanhanW in #20693
- [Im2col] Remain batch dimension untiled during decomposition when it is contiguous and innermost by @yzhang93 in #20633
- Runtime float type conversion helpers: Fix handling of denormals. by @bjacob in #20676
- [NFC] Converting the VM dialect to use tablegen passes. by @benvanik in #20698
- [LLVMGPU] Vector distribute config to handle dyn dims by @pashu123 in #20603
- [VectorDistribution] Improve vector.broadcast distribution by @Groverkss in #20652
- [Codegen][GPU] Improve intrinsic based Attention heuristics by @Groverkss in #20695
- Adding pass documentation for IREE dialects and pipelines. by @benvanik in #20705
- [AMDGPU] Support mask optimization for multiple users by @nirvedhmeshram in #20697
- [Im2col] Fix bug when there is no batch dimension by @yzhang93 in #20711
- Runtime float conversion helpers: add FP6, FP4 and E8M0 types. by @bjacob in #20707
- [Codegen][DerivedConfig] Add support to set outermost tile size as vector size by @yzhang93 in #20692
- Fixing missing allocator arg on the vulkan dynamic symbol table. by @benvanik in #20712
- [Encoding] Add convertType interface to generalize type conversion by @jtuyls in #20700
- [Codegen][GPU] Keep range and divisibility annotations on push constants by @krzysz00 in #19348
- [Codegen][AMDGPU] Add pingpong to default gfx942 tuning by @qedawkins in #20678
- [Encoding] Drop resolver interface implementation for SpecializedEncodingAttr by @hanhanW in #20718
- Bump version to 3.5.0 after 3.4.0 release. by @ScottTodd in #20721
- [LinalgExt] Implement tiling interface for map_scatter by @Max191 in #20688
- [NFC][Encoding] Move materializeEncodingValueFn to type converter by @jtuyls in #20720
- Adding user-defined IREE_ALLOCATOR_SYSTEM support. by @benvanik in #20727
- Added missing dependencies for the bazel build by @TheCBaH in #20719
- Print opt flags to more accurately reproduce (.linked -> .optimized) by @newling in #20716
- [NFC] Use ShapedType::isDynamicShape when possible. by @hanhanW in #20731
- [Codegen][Common] Add transform op to check for lowering configs when matching by @qedawkins in #20724
- [DispatchCreation] Avoid hoisting set encoding operations with padding encodings by @MaheshRavishankar in #20733
- [Codegen] Add pass to combine layout transformations by @Max191 in #20655
- [NFC] Move pack/unpack e2e tests to linalg/. by @hanhanW in #20728
- Adding mimalloc v3 as an optional system allocator. by @benvanik in #20730
- [Docs] Make op/attr/type summary styles consistent by @qedawkins in #20726
- Integrate llvm/llvm-project@15f7c6e by @IanWood1 in #20725
- [Codegen] Fix invalid use of iterators in PropagateReshapesByExpansion by @rkayaith in #20740
- [AMDGPU] Drop dynamic M bounds checks for pingpong by @qedawkins in #20738
- [DispatchCreation] Use patterns to bubble up expand shape across collapse shapes. by @MaheshRavishankar in #20648
- Continue trying executable loaders when a loader reports NOT_FOUND. by @benvanik in #20745
- Adding a hal.executable.export condition region. by @benvanik in #20739
- [LinalgExt][NFC] Move transformtion method declarations to Transforms.h by @hanhanW in #20747
- Pingpong: add medium-sized expanded-shape FP8 by @bjacob in #20735
- Add regression test for #20740 / #20736 by @rkayaith in #20750
- [Encoding] Use struct directive for encodingAttr assembly format by @jtuyls in #20746
- [hip] Add flag for disabling caching for async allocations. by @AWoloszyn in #20753
- e2e matmul test improvements: faster diagnostics, finer control with environment variables by @bjacob in #20755
- [Flow] Improve DumpDispatchGraph pass for programs at model level. by @hanhanW in #20756
- [GPU] Enable vector distribute pipeline for Matvecs by default by @pashu123 in #20706
- Adding quality and benchmark config docs by @geomin12 in #20759
- Integrate LLVM to llvm/llvm-project@8404b29 by @MaheshRavishankar in #20757
- Fix iree-codegen-llvmcpu-configuration-pipeline registration by @RSchwan in #20761
- Metal HAL: remove shadowed variable by @ziereis in #20760
- [AMDGPU] Define gfx950 target and its MFMAs by @krzysz00 in #20623
- Emit a warning when one of the iree-input-demote-* passes is used. by @benvanik in #20784
- Workaround for stack overflow in stream refine usage. by @benvanik in #20749
- [NFC][Codegen] Move EncodingNop LayoutAttrInterface to external model by @jtuyls in #20778
- [HAL] Add F8E8M0FNU by @tgymnich in #20783
- [Codegen][NFC] Move bufferization test out from LLVMCPU/test. by @hanhanW in #20789
- [DispatchCreation] White list ops that can be cloned. by @MaheshRavishankar in #20791
- [CPU][NFC] Lit tests cleanup and improvements. by @hanhanW in #20790
- [NFC][Codegen] Move getEncodingInfo to PackedLayoutAttrInterface by @jtuyls in #20780
- [NFC] Move LayoutAttrInterface to Encoding by @jtuyls in #20782
- [LinalgExt] Remove region from LinalgExt::GatherOp by @Groverkss in #20776
- [NFC][Encoding] Move convertType to LayoutAttrInterface by @jtuyls in #20794
- [Codegen][LLVMGPU] Optionally linearize the number of workgroups specified by @MaheshRavishankar in #20787
- [GPU] Enable vector distribute on reduction operations by default by @pashu123 in #20751
- [Flow] Set known dimensions on concat output by @AaronStGeorge in #20795
- [DispatchCreation] Remove unit dim from attn mask by @IanWood1 in #20796
- [ROCm] Set ABI version control variable correctly by @krzysz00 in #20800
- Removing invalid folder for vm.add + vm.sub ops. by @benvanik in #20808
- [Encoding] Add getOffsetSizesStrides interface for load/store materialization by @jtuyls in #20741
- [NFC] Refactor duplicated getEncodingInfo logic by @jtuyls in #20820
- [GPU] Increase the VAE benchmark threshold by @pashu123 in #20809
- [PJRT] Fix tensor element type for signed integers by @PragmaTwice in #19496
- [Codegen] split-k on argmax op by @bangtianliu in #20717
- [GlobalOptimization] Do not hoist fill-like operations by @Groverkss in #19719
- [NFC] Cleaning up flow canonicalize pass. by @benvanik in #20826
- [DispatchCreation] Remove CollapseReductionDimensionsPass by @IanWood1 in #20829
- [DispatchCreation] Set limits on when padding encoding is applied. by @MaheshRavishankar in #20732
- [iree-test-suite] Update the sharktank models benchmark time by @pashu123 in #20830
- [LinalgExt] Add a canonicalization pattern to drop unused results from sort op by @Muzammiluddin-Syed-ECE in #20827
- Update hanhanW for CODEOWNERS based on recent activities. by @hanhanW in #20840
- Sink cast-like flow ops across flow.tensor.transfer/barrier. by @benvanik in #20839
- [Codegen][GPU] Add support for allocating private memory for unused DPS results by @Muzammiluddin-Syed-ECE in #20793
- [Codegen] Make ReconcileTranslationInfo work with multiple exports by @qedawkins in #20801
- Cleaning up iree_hal_module_debug_sink_t destroy. by @benvanik in #20841
- [Codegen][ROCDL] Drop nominal support for dynamic shared mem by @qedawkins in #20805
- Integrate llvm-project@faf5d747f174cc by @krzysz00 in #20828
- [Codegen][NFC] Refresh remove_single_iteration_loop.mlir test. by @hanhanW in #20842
- [TensorExt] Drop space from count_from_slice printer by @qedawkins in #20850
- [LinalgExt] Clone iree_linalg_ext.gather (5/5) by @IanWood1 in #20563
- Fix logic for yieldReplacements in tileDispatchUsingForall by @pashu123 in #20844
- [Dispatch Creation] Handle linalg.fill in collapse dimensions by @IanWood1 in #20863
- [VectorExt] Fix transfer_gather printer by @IanWood1 in #20860
- [Codegen] Fix dominance issue in collapse shape fusion by @jtuyls in #20864
- [VectorExt] Vectorize iree_linalg_ext.gather by @IanWood1 in #20807
- [Dispatch Creation] Clone iree_linalg_ext.gather for attn by @IanWood1 in #20866
- [LinalgExt] Add map_scatter e2e tests for CPU and VMVX backends. by @hanhanW in #20861
- [Codegen][GPU] Support padding in CombineLayoutTransformation by @Max191 in #20797
- Fix padding to nop encoding specialization by @jtuyls in #20837
- [CodeGen] Fix a MemoryEffectsOpInterface bug in FuseConsumerOp. by @hanhanW in #20869
- [GPU] Vector distribution support for multiple stores by @pashu123 in #20816
- [AMDGPU] Rewrite some gpu.shuffle xor to ds_swizzle, per upstream by @krzysz00 in #20868
- [Codegen] Support multiple forall ops in ReconcileTranslationInfo by @Max191 in #20848
- [Codegen] Add ukernel support for argmax on BF16 and enable optional max value return by @bangtianliu in #20768
- [VectorExt] Fix illegal transfer_read during gather vectorization by @IanWood1 in #20876
- Align iree_hal_sync_device_t allocation to 16 bytes. by @FantasqueX in #20773
- [LinalgExt] Canonicalize gather to an extract_slice by @IanWood1 in #20878
- [NFC][Codegen] Rename early bufferization op operands by @Max191 in #20874
- [LinalgExt] Fold unit dims for iree_linalg_ext.gather by @Groverkss in #20877
- [Preprocessing] Add SinkReshapesPass in MakeSingleDispatchPassPipeline by @yzhang93 in #20882
- [Codegen] Add patterns to fold reshapes into load_from/store_to_memref by @Max191 in #20881
- [Flow] Dump affinity info in DumpDispatchGraph pass. by @hanhanW in #20888
- [Dispatch Creation] Fix GatherFusionPattern crash by @IanWood1 in #20887
- Temporary automatic reference counting(ish) pass for inserting async deallocations. by @benvanik in #20765
- Adding tryLookupResourceUsageAffinity. by @benvanik in #20891
- Adding support for #hal.device.optimal<...> through to runtime. by @benvanik in #20879
- Fix Link error when IREECompiler.lib hits 4GiB by @amd-justchen in #20892
- [Codegen][NFC] Make namespace usage follow IREE::[Encoding|Codegen]. by @hanhanW in #20894
- Integrate llvm-project@7a8090c037255b54895d61df2eb141fee48d6d83 by @Groverkss in #20873
- Add support for dynamic unit trip scf.for to scf.if by @nirvedhmeshram in #20880
- Adding --iree-rocm-container-type= flag. by @benvanik in #20902
- [NFC] Rename load_from/store_to_memref to load_from/store_to_buffer by @Max191 in #20897
- [Codegen] Add pass for specializing executable variants by @qedawkins in #20771
- Lower linalg.copy to direct global load by @lialan in #20568
- [PJRT] Support PJRT_Memory related APIs in IREE PJRT plugin by @PragmaTwice in #20911
- Removing canonicalization from CloneToConsumersPass. by @benvanik in #20917
- Speeding up two hotspots in large programs. by @benvanik in #20909
- Ignoring single-user ops in CloneToConsumers. by @benvanik in #20921
- Cleaning up some solver/affinity logging/comments. by @benvanik in #20922
- [Integrate] Make IREE compatible with the new memref.assume_alignment semantic change. by @hanhanW in #20913
- [NFC] Simplify constant checks with isZeroInteger and isOneInteger utils. by @hanhanW in #20915
- Refresh the uses of memref.assume_alignment in lit tests. by @hanhanW in #20925
- [PJRT] Enable CUDA build for PJRT plugin in pkgci by @PragmaTwice in #20927
- [LinalgExt] Add gather reshape propagation by @IanWood1 in #20916
- Adding affinity solver max iterations flag and upping default. by @benvanik in #20923
- Integrate llvm-project@d45031ce5281 by @hanhanW in #20924
- [PJRT] Add simple level control to the logger in PJRT plugin by @PragmaTwice in #20932
- [CPU][RISCV][NFC] Trim IRs from lowering strategy selection tests. by @hanhanW in #20933
- [Integrate] Drop two reverts for getBackwardSlice changes. by @hanhanW in #20934
- [GPU] Fix reduction kernel config for vectordistribute by @pashu123 in #20903
- [LinalgExt][NFC] Check for tensor semantics directly by @Muzammiluddin-Syed-ECE in #20936
- Revert "[Integrate] Drop two reverts for getBackwardSlice changes." by @hanhanW in #20941
- Register ub dialect and gpu passes by @rkayaith in #20938
- [Codegen] Propagate relayout ops before combining by @Max191 in #20901
- [Codegen] Enable reshape into buffer folding in BlockDynamicDimensions by @Max191 in #20898
- Restrict GPUAllocPrivateMemoryForDPSOps to pure tensor semantics by @bjacob in #20939
- [PJRT] Update PJRT API version to 0.68 by @PragmaTwice in #20930
- [NFC][LinalgExt] Remove duplicate logic from ReshapeFusion by @IanWood1 in #20940
- Adds a limit to the solver update-on-initialize recursion depth. by @benvanik in #20944
- Integrate llvm-project@28eb66b79413 by @hanhanW in #20942
- [NFC][PJRT] Add a notice for log level setting to PJRT README by @PragmaTwice in #20951
- Remove reference to linalg op tests in iree-test-suites. by @ScottTodd in #20956
- Precomputing pinned value affinities during analysis. by @benvanik in #20945
- [NFC] Fix test issues on Windows. by @lialan in #20957
- Remove IREE_DISABLE_THREAD_SAFETY_ANALYSIS by @bjacob in #20954
- Integrate llvm-project@587d6fcbb685e3a57 by @hanhanW in #20948
- [CPU][NFC] Trim IRs from tile_and_fuse and illegal_configuration tests. by @hanhanW in #20967
- Fix 'failed to legalize' in padding materialization by @jtuyls in #20969
- Use adaptor instead of storeOp to fix 'failed to legalize' by @jtuyls in #20971
- [Codegen][GPU][NFC] Remove dead MMA interface method by @krzysz00 in #20960
- [GPU] Add overflow flag to index addition in prefetching pass by @nirvedhmeshram in #20975
- Run windows_x64_msvc on postsubmit and opt-in on presubmit (retry). by @ScottTodd in #20958
- [DumpExecutableBenchmarks] Use MapVector to simplify code by @rkayaith in #20976
- Allow importing an external stream into HIP. by @AWoloszyn in #20972
- [Codegen][GPU] Refactor the way use_direct_load is propagated. by @lialan in #20926
- [CodeGen][NFC] Delete empty file that was accidentally added. by @hanhanW in #20983
- [Encoding] Teach specialize encodings to handle pad encodings. by @MaheshRavishankar in #20845
- [Integrate] Mirror and prioritize the old ConvertVectorStore pattern. by @hanhanW in #20981
- [NFC] Refresh the interface names for Encoding dialect and data-tiling specifics. by @hanhanW in #20985
- [Util] Improve Util::FoldDimOp folder to handle memref.assume_alignment ops. by @hanhanW in #20984
- Assign optimal affinities to allocations during ScheduleAllocation. by @benvanik in #20965
- Prefetch shared memory in presence of scf.if by @nirvedhmeshram in #20904
- Bump dawidd6/action-download-artifact from 9 to 10 in the github-actions group by @dependabot in #20978
- Integrate llvm-project@7797824297e17d4c02fbb1cb904c7919f21af47e by @nirvedhmeshram in #20987
- [CodeGen] Drop the workaround for memref.assume_alignment chain. by @hanhanW in #20973
- [Codegen][GPU][NFC] More MMA dead code removal by @krzysz00 in #20980
- [Codegen] Use attributes to define default tuning specs by @qedawkins in #20979
- Release our buffer reference regardless of buffer_view success. by @AWoloszyn in #20988
- Disable flaky tensorcore_vectorization test by @qedawkins in #21002
- [docs] Update sharktuner documentation by @Muzammiluddin-Syed-ECE in #20704
- [CPU][NFC] Trim more redundant IRs from lit tests. by @hanhanW in #21003
- [Stream] Adding AffinityTopologyAttrInterface and HAL implementation. by @ziereis in #20885
- [Codegen][GPU] Lower gpu.subgroup_reduce to DPP intrinsics on AMD GPUs by @Muzammiluddin-Syed-ECE in #20468
- [Codegen][NFC] Create tiling utilities file by @AaronStGeorge in #20961
- [Codegen] Create scf.forall -> scf.for pass by @AaronStGeorge in #20962
- [ROCM] Fix tuning module string parameter by @qedawkins in #21008
- [Dispatch Creation] Merge bubbling of expand and extract by @IanWood1 in #20989
- Integrate llvm-20250604 by @nirvedhmeshram in #21010
- Remove restriction on IGEMM lowering for dilated convolutions by @yzhang93 in #21011
- Add note about GPU time synchronization by @erieaton-amd in #20997
- Revert "[Codegen][ROCDL] Drop nominal support for dynamic shared mem … by @pravg-amd in #21020
- [LLVMGPU] Fix linking error when one of the variants has no modules. by @MaheshRavishankar in #21027
- Folding util.assume.int values that have a single possible value. by @benvanik in #21025
- [NFC][DispatchCreation] Add better extract of expand test by @IanWood1 in #21013
- Properly order multiple emplaced dispatch results. by @benvanik in #21026
- Fix LHS addressing in medium pingpong f16 by @bjacob in #21017
- [GPU] When Prefetching do not duplicate read stage ops in write stage by @nirvedhmeshram in #21031
- Integrates/llvm 20250606 by @nirvedhmeshram in #21030
- Adding IREE::HAL::AnnotateTargetDevicesPass. by @benvanik in #21022
- [ROCm][Vulkan] Add known targets for Radeon R9070 and 9060XT by @kuhar in #21035
- Adding AMDGPU HAL driver skeleton. by @benvanik in #20990
- [Codegen] split-k on argmax to ensure ukernel support by @bangtianliu in #20906
Commit history: v3.4.0...v3.5.0