Release v3.5.0 · iree-org/iree

Notable changes

Compiler

Added support for AMD Radeon 9060XT and Radeon AI PRO R9700 GPUs #21035.
Defined the new gfx950 target within AMDGPU support, including several new MFMA (Matrix Fused Multiply-Add) instructions #20623.
Introduced CombineLayoutTransformation to consolidate transpose, reshape, and slice into a single iree_linalg_ext.map_scatter operation #20655.
Added support for medium-sized expanded-shape FP8 in the pingpong strategy #20735, and removed dynamic M bounds checks for pingpong strategies in AMDGPU support #20738.
Enhanced CombineLayoutTransformation to support folding tensor.pad operations into map_scatter operations #20797.
Constrained GPUAllocPrivateMemoryForDPSOps pass to pure tensor semantics, addressing recent implementation issues #20939.
Enhanced the SpecializeEncodings pass to handle pad-based encodings, allowing non-serializable encodings to be converted into serializable forms #20845.
Prevented hoisting of set_encoding and unset_encoding operations that use padding encodings #20733.
Updated dispatch creation to use patterns that bubble up expand_shape operations across collapse_shape operations #20648.
Optimized the attention operation by removing unit dimensions from the mask operand #20796.
Added limits on when padding encoding is applied during dispatch creation #20732.
Added the F8E8M0FNU type as a valid HAL element type #20783. Expanded the range of valid HAL element types to include the scaled MFMA types f8E8M0FNU, f6E3M2FN, and f4E2M1FN.
Linalg Extensions Dialect improvements (#20688, #20728, #20747, #20776, #19719, #20827, #20863, #20568, #20916).
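
For readers unfamiliar with the scaled MFMA element types named above, the following is a minimal sketch of how their bit patterns map to values. It assumes the layouts implied by the type names (as in the OCP Microscaling formats): f4E2M1FN and f6E3M2FN are sign/exponent/mantissa codes with no Inf or NaN encodings, and f8E8M0FNU is an unsigned, exponent-only scale with bias 127 where 0xFF encodes NaN. The function names are hypothetical helpers for illustration, not part of IREE, and the sketch says nothing about how IREE packs these values in memory.

```python
import math


def decode_minifloat(bits: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode a sign/exponent/mantissa code that has no Inf or NaN encodings."""
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    if exp == 0:
        # Subnormal: no implicit leading one, exponent fixed at 1 - bias.
        return sign * math.ldexp(man, 1 - bias - man_bits)
    # Normal: implicit leading one in front of the mantissa bits.
    return sign * math.ldexp((1 << man_bits) + man, exp - bias - man_bits)


def decode_f4e2m1fn(bits: int) -> float:
    """4-bit float: 1 sign, 2 exponent, 1 mantissa bit, bias 1."""
    return decode_minifloat(bits & 0xF, exp_bits=2, man_bits=1, bias=1)


def decode_f6e3m2fn(bits: int) -> float:
    """6-bit float: 1 sign, 3 exponent, 2 mantissa bits, bias 3."""
    return decode_minifloat(bits & 0x3F, exp_bits=3, man_bits=2, bias=3)


def decode_f8e8m0fnu(bits: int) -> float:
    """8-bit unsigned power-of-two scale: value is 2**(bits - 127); 0xFF is NaN."""
    bits &= 0xFF
    if bits == 0xFF:
        return math.nan
    return math.ldexp(1.0, bits - 127)


if __name__ == "__main__":
    element = decode_f4e2m1fn(0b0101)  # exponent field 2, mantissa 1 -> 3.0
    scale = decode_f8e8m0fnu(130)      # 2**(130 - 127) -> 8.0
    print(element * scale)             # 3.0 * 8.0 = 24.0
```

In a scaled format, a block of narrow elements shares one f8E8M0FNU scale, so each element's effective value is the decoded element multiplied by the decoded scale, as in the last lines of the sketch.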

Runtime

Introduced hal.executable.export condition regions, enhancing dispatch decision-making at each site based on device capabilities and workload parameters #20739.
Added user-defined IREE_ALLOCATOR_SYSTEM support #20727, providing the ability for external override of the allocator control function.
Added mimalloc v3 as an optional system allocator #20730; setting -DIREE_ALLOCATOR_SYSTEM=mimalloc statically links mimalloc into iree::base.
Enabled the import of external streams into HIP for scenarios requiring close integration with external applications #20972.
Heterogeneous device support is under development, which will allow compiled programs to allocate buffers across compatible devices and synchronize operations via semaphores. This initial phase will support CPU-only configurations, aiming for seamless integration #20851.
The IREE PJRT plugin now supports memory-related APIs and logging control, and has been updated to the latest API version #20911.
Implemented an experimental #hal.device.optimal<...> affinity attribute for runtime-resolvable device affinities, initially focused on allocation-related operations #20879.
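
The hal.executable.export condition regions noted above make a per-site decision from device capabilities and workload parameters. As a rough conceptual model only, the sketch below guards each candidate export with a predicate over a device description and a workload, and picks the first candidate whose predicate holds. Every name here (Device, Export, select_export, the capability fields, the export names) is hypothetical, and the in-order, first-match fallback is an assumption made for illustration; none of it is IREE's actual API or IR.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Device:
    """Hypothetical device description; the fields are illustrative only."""
    subgroup_size: int
    workgroup_memory_bytes: int


@dataclass(frozen=True)
class Export:
    """A candidate entry point, optionally guarded by a condition predicate."""
    name: str
    condition: Optional[Callable[[Device, Sequence[int]], bool]] = None


def select_export(exports: Sequence[Export], device: Device,
                  workload: Sequence[int]) -> Export:
    """Return the first export whose condition accepts this device and workload."""
    for export in exports:
        # An export without a condition always applies (assumed fallback behavior).
        if export.condition is None or export.condition(device, workload):
            return export
    raise LookupError("no export is applicable to this device and workload")


# Hypothetical candidates: a specialized variant with a guard, then a fallback.
EXPORTS = [
    Export(
        "matmul_large_tile",
        lambda d, w: d.workgroup_memory_bytes >= 65536 and w[0] >= 128,
    ),
    Export("matmul_generic"),
]

if __name__ == "__main__":
    device = Device(subgroup_size=64, workgroup_memory_bytes=65536)
    print(select_export(EXPORTS, device, (256, 256)).name)
    print(select_export(EXPORTS, device, (16, 16)).name)
```

Keeping the guard next to each candidate means the same compiled program can carry several specialized variants and still run on a device or workload that none of the specialized guards accept.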

New Contributors

@javidcf made their first contribution in #19781
@TheCBaH made their first contribution in #20719
@RSchwan made their first contribution in #20761
@AaronStGeorge made their first contribution in #20795
@erieaton-amd made their first contribution in #20997

Full changelog

List of changes

Move ASM to the end of languages list in CMakeLists.txt by @javidcf in #19781
Add MathToROCDL patterns in ConvertToROCDLPass. by @benvanik in #20684
[Codegen] Add pass to bufferize dispatch.tensor.load/store ops by @Max191 in #20627
Raise an error in demotion passes if illegal extern funcs are present. by @benvanik in #20679
Properly handle unaligned refs in VM ABI marshaling. by @benvanik in #20671
Integrate llvm/llvm-project@7b70fc7 by @IanWood1 in #20674
[Codegen][GPU] Add placeholder op for buffer casts on tensors by @qedawkins in #20589
Call IREE_TRACE_APP_ENTER/EXIT in compiler tool main functions. by @benvanik in #20686
Revert "Call IREE_TRACE_APP_ENTER/EXIT in compiler tool main functions." by @benvanik in #20691
Revert "Update workflows to run on macOS 15 (#20675)" by @marbre in #20690
Add flow.tensor.bitcast for torch view as complex/real. by @benvanik in #20689
[Dispatch Creation] Fix infinite reshape loop by @IanWood1 in #20162
Fold iree_tensor_ext.dispatch.workload.ordinal on constants. by @benvanik in #20687
Add relative error for buffer comparison by @nirvedhmeshram in #19464
[Encoding] Allow PadEncodingAttribute to support dynamic padding. by @MaheshRavishankar in #20662
[Codegen] Clean up TilingInterfaceUtils. NFC. by @kuhar in #20661
[iree-benchmark] Ensure destructors run before IREE_TRACE_APP_EXIT by @rkayaith in #20694
[DispatchCreation] Set padding encodings on intermediate tensors. by @MaheshRavishankar in #20634
[GPU] Cross lane reduction rather than serial by @pashu123 in #20680
[Codegen] Drop read_only from LoadFromMemrefOp. by @hanhanW in #20693
[Im2col] Remain batch dimension untiled during decomposition when it is contiguous and innermost by @yzhang93 in #20633
Runtime float type conversion helpers: Fix handling of denormals. by @bjacob in #20676
[NFC] Converting the VM dialect to use tablegen passes. by @benvanik in #20698
[LLVMGPU] Vector distribute config to handle dyn dims by @pashu123 in #20603
[VectorDistribution] Improve vector.broadcast distribution by @Groverkss in #20652
[Codegen][GPU] Improve intrinsic based Attention heuristics by @Groverkss in #20695
Adding pass documentation for IREE dialects and pipelines. by @benvanik in #20705
[AMDGPU] Support mask optimization for multiple users by @nirvedhmeshram in #20697
[Im2col] Fix bug when there is no batch dimension by @yzhang93 in #20711
Runtime float conversion helpers: add FP6, FP4 and E8M0 types. by @bjacob in #20707
[Codegen][DerivedConfig] Add support to set outermost tile size as vector size by @yzhang93 in #20692
Fixing missing allocator arg on the vulkan dynamic symbol table. by @benvanik in #20712
[Encoding] Add convertType interface to generalize type conversion by @jtuyls in #20700
[Codegen][GPU] Keep range and divisibility annotations on push constants by @krzysz00 in #19348
[Codegen][AMDGPU] Add pingpong to default gfx942 tuning by @qedawkins in #20678
[Encoding] Drop resolver interface implementation for SpecializedEncodingAttr by @hanhanW in #20718
Bump version to 3.5.0 after 3.4.0 release. by @ScottTodd in #20721
[LinalgExt] Implement tiling interface for map_scatter by @Max191 in #20688
[NFC][Encoding] Move materializeEncodingValueFn to type converter by @jtuyls in #20720
Adding user-defined IREE_ALLOCATOR_SYSTEM support. by @benvanik in #20727
Added missing dependencies for the bazel build by @TheCBaH in #20719
Print opt flags to more accurately reproduce (.linked -> .optimized) by @newling in #20716
[NFC] Use ShapedType::isDynamicShape when possible. by @hanhanW in #20731
[Codegen][Common] Add transform op to check for lowering configs when matching by @qedawkins in #20724
[DispatchCreation] Avoid hoisting set encoding operations with padding encodings by @MaheshRavishankar in #20733
[Codegen] Add pass to combine layout transformations by @Max191 in #20655
[NFC] Move pack/unpack e2e tests to linalg/. by @hanhanW in #20728
Adding mimalloc v3 as an optional system allocator. by @benvanik in #20730
[Docs] Make op/attr/type summary styles consistent by @qedawkins in #20726
Integrate llvm/llvm-project@15f7c6e by @IanWood1 in #20725
[Codegen] Fix invalid use of iterators in PropagateReshapesByExpansion by @rkayaith in #20740
[AMDGPU] Drop dynamic M bounds checks for pingpong by @qedawkins in #20738
[DispatchCreation] Use patterns to bubble up expand shape across collapse shapes. by @MaheshRavishankar in #20648
Continue trying executable loaders when a loader reports NOT_FOUND. by @benvanik in #20745
Adding a hal.executable.export condition region. by @benvanik in #20739
[LinalgExt][NFC] Move transformation method declarations to Transforms.h by @hanhanW in #20747
Pingpong: add medium-sized expanded-shape FP8 by @bjacob in #20735
Add regression test for #20740 / #20736 by @rkayaith in #20750
[Encoding] Use struct directive for encodingAttr assembly format by @jtuyls in #20746
[hip] Add flag for disabling caching for async allocations. by @AWoloszyn in #20753
e2e matmul test improvements: faster diagnostics, finer control with environment variables by @bjacob in #20755
[Flow] Improve DumpDispatchGraph pass for programs at model level. by @hanhanW in #20756
[GPU] Enable vector distribute pipeline for Matvecs by default by @pashu123 in #20706
Adding quality and benchmark config docs by @geomin12 in #20759
Integrate LLVM to llvm/llvm-project@8404b29 by @MaheshRavishankar in #20757
Fix iree-codegen-llvmcpu-configuration-pipeline registration by @RSchwan in #20761
Metal HAL: remove shadowed variable by @ziereis in #20760
[AMDGPU] Define gfx950 target and its MFMAs by @krzysz00 in #20623
Emit a warning when one of the iree-input-demote-* passes is used. by @benvanik in #20784
Workaround for stack overflow in stream refine usage. by @benvanik in #20749
[NFC][Codegen] Move EncodingNop LayoutAttrInterface to external model by @jtuyls in #20778
[HAL] Add F8E8M0FNU by @tgymnich in #20783
[Codegen][NFC] Move bufferization test out from LLVMCPU/test. by @hanhanW in #20789
[DispatchCreation] White list ops that can be cloned. by @MaheshRavishankar in #20791
[CPU][NFC] Lit tests cleanup and improvements. by @hanhanW in #20790
[NFC][Codegen] Move getEncodingInfo to PackedLayoutAttrInterface by @jtuyls in #20780
[NFC] Move LayoutAttrInterface to Encoding by @jtuyls in #20782
[LinalgExt] Remove region from LinalgExt::GatherOp by @Groverkss in #20776
[NFC][Encoding] Move convertType to LayoutAttrInterface by @jtuyls in #20794
[Codegen][LLVMGPU] Optionally linearize the number of workgroups specified by @MaheshRavishankar in #20787
[GPU] Enable vector distribute on reduction operations by default by @pashu123 in #20751
[Flow] Set known dimensions on concat output by @AaronStGeorge in #20795
[DispatchCreation] Remove unit dim from attn mask by @IanWood1 in #20796
[ROCm] Set ABI version control variable correctly by @krzysz00 in #20800
Removing invalid folder for vm.add + vm.sub ops. by @benvanik in #20808
[Encoding] Add getOffsetSizesStrides interface for load/store materialization by @jtuyls in #20741
[NFC] Refactor duplicated getEncodingInfo logic by @jtuyls in #20820
[GPU] Increase the VAE benchmark threshold by @pashu123 in #20809
[PJRT] Fix tensor element type for signed integers by @PragmaTwice in #19496
[Codegen] split-k on argmax op by @bangtianliu in #20717
[GlobalOptimization] Do not hoist fill-like operations by @Groverkss in #19719
[NFC] Cleaning up flow canonicalize pass. by @benvanik in #20826
[DispatchCreation] Remove CollapseReductionDimensionsPass by @IanWood1 in #20829
[DispatchCreation] Set limits on when padding encoding is applied. by @MaheshRavishankar in #20732
[iree-test-suite] Update the sharktank models benchmark time by @pashu123 in #20830
[LinalgExt] Add a canonicalization pattern to drop unused results from sort op by @Muzammiluddin-Syed-ECE in #20827
Update hanhanW for CODEOWNERS based on recent activities. by @hanhanW in #20840
Sink cast-like flow ops across flow.tensor.transfer/barrier. by @benvanik in #20839
[Codegen][GPU] Add support for allocating private memory for unused DPS results by @Muzammiluddin-Syed-ECE in #20793
[Codegen] Make ReconcileTranslationInfo work with multiple exports by @qedawkins in #20801
Cleaning up iree_hal_module_debug_sink_t destroy. by @benvanik in #20841
[Codegen][ROCDL] Drop nominal support for dynamic shared mem by @qedawkins in #20805
Integrate llvm-project@faf5d747f174cc by @krzysz00 in #20828
[Codegen][NFC] Refresh remove_single_iteration_loop.mlir test. by @hanhanW in #20842
[TensorExt] Drop space from count_from_slice printer by @qedawkins in #20850
[LinalgExt] Clone iree_linalg_ext.gather (5/5) by @IanWood1 in #20563
Fix logic for yieldReplacements in tileDispatchUsingForall by @pashu123 in #20844
[Dispatch Creation] Handle linalg.fill in collapse dimensions by @IanWood1 in #20863
[VectorExt] Fix transfer_gather printer by @IanWood1 in #20860
[Codegen] Fix dominance issue in collapse shape fusion by @jtuyls in #20864
[VectorExt] Vectorize iree_linalg_ext.gather by @IanWood1 in #20807
[Dispatch Creation] Clone iree_linalg_ext.gather for attn by @IanWood1 in #20866
[LinalgExt] Add map_scatter e2e tests for CPU and VMVX backends. by @hanhanW in #20861
[Codegen][GPU] Support padding in CombineLayoutTransformation by @Max191 in #20797
Fix padding to nop encoding specialization by @jtuyls in #20837
[CodeGen] Fix a MemoryEffectsOpInterface bug in FuseConsumerOp. by @hanhanW in #20869
[GPU] Vector distribution support for multiple stores by @pashu123 in #20816
[AMDGPU] Rewrite some gpu.shuffle xor to ds_swizzle, per upstream by @krzysz00 in #20868
[Codegen] Support multiple forall ops in ReconcileTranslationInfo by @Max191 in #20848
[Codegen] Add ukernel support for argmax on BF16 and enable optional max value return by @bangtianliu in #20768
[VectorExt] Fix illegal transfer_read during gather vectorization by @IanWood1 in #20876
Align iree_hal_sync_device_t allocation to 16 bytes. by @FantasqueX in #20773
[LinalgExt] Canonicalize gather to an extract_slice by @IanWood1 in #20878
[NFC][Codegen] Rename early bufferization op operands by @Max191 in #20874
[LinalgExt] Fold unit dims for iree_linalg_ext.gather by @Groverkss in #20877
[Preprocessing] Add SinkReshapesPass in MakeSingleDispatchPassPipeline by @yzhang93 in #20882
[Codegen] Add patterns to fold reshapes into load_from/store_to_memref by @Max191 in #20881
[Flow] Dump affinity info in DumpDispatchGraph pass. by @hanhanW in #20888
[Dispatch Creation] Fix GatherFusionPattern crash by @IanWood1 in #20887
Temporary automatic reference counting(ish) pass for inserting async deallocations. by @benvanik in #20765
Adding tryLookupResourceUsageAffinity. by @benvanik in #20891
Adding support for #hal.device.optimal<...> through to runtime. by @benvanik in #20879
Fix Link error when IREECompiler.lib hits 4GiB by @amd-justchen in #20892
[Codegen][NFC] Make namespace usage follow IREE::[Encoding|Codegen]. by @hanhanW in #20894
Integrate llvm-project@7a8090c037255b54895d61df2eb141fee48d6d83 by @Groverkss in #20873
Add support for dynamic unit trip scf.for to scf.if by @nirvedhmeshram in #20880
Adding --iree-rocm-container-type= flag. by @benvanik in #20902
[NFC] Rename load_from/store_to_memref to load_from/store_to_buffer by @Max191 in #20897
[Codegen] Add pass for specializing executable variants by @qedawkins in #20771
Lower linalg.copy to direct global load by @lialan in #20568
[PJRT] Support PJRT_Memory related APIs in IREE PJRT plugin by @PragmaTwice in #20911
Removing canonicalization from CloneToConsumersPass. by @benvanik in #20917
Speeding up two hotspots in large programs. by @benvanik in #20909
Ignoring single-user ops in CloneToConsumers. by @benvanik in #20921
Cleaning up some solver/affinity logging/comments. by @benvanik in #20922
[Integrate] Make IREE compatible with the new memref.assume_alignment semantic change. by @hanhanW in #20913
[NFC] Simplify constant checks with isZeroInteger and isOneInteger utils. by @hanhanW in #20915
Refresh the uses of memref.assume_alignment in lit tests. by @hanhanW in #20925
[PJRT] Enable CUDA build for PJRT plugin in pkgci by @PragmaTwice in #20927
[LinalgExt] Add gather reshape propagation by @IanWood1 in #20916
Adding affinity solver max iterations flag and upping default. by @benvanik in #20923
Integrate llvm-project@d45031ce5281 by @hanhanW in #20924
[PJRT] Add simple level control to the logger in PJRT plugin by @PragmaTwice in #20932
[CPU][RISCV][NFC] Trim IRs from lowering strategy selection tests. by @hanhanW in #20933
[Integrate] Drop two reverts for getBackwardSlice changes. by @hanhanW in #20934
[GPU] Fix reduction kernel config for vectordistribute by @pashu123 in #20903
[LinalgExt][NFC] Check for tensor semantics directly by @Muzammiluddin-Syed-ECE in #20936
Revert "[Integrate] Drop two reverts for getBackwardSlice changes." by @hanhanW in #20941
Register ub dialect and gpu passes by @rkayaith in #20938
[Codegen] Propagate relayout ops before combining by @Max191 in #20901
[Codegen] Enable reshape into buffer folding in BlockDynamicDimensions by @Max191 in #20898
Restrict GPUAllocPrivateMemoryForDPSOps to pure tensor semantics by @bjacob in #20939
[PJRT] Update PJRT API version to 0.68 by @PragmaTwice in #20930
[NFC][LinalgExt] Remove duplicate logic from ReshapeFusion by @IanWood1 in #20940
Adds a limit to the solver update-on-initialize recursion depth. by @benvanik in #20944
Integrate llvm-project@28eb66b79413 by @hanhanW in #20942
[NFC][PJRT] Add a notice for log level setting to PJRT README by @PragmaTwice in #20951
Remove reference to linalg op tests in iree-test-suites. by @ScottTodd in #20956
Precomputing pinned value affinities during analysis. by @benvanik in #20945
[NFC] Fix test issues on Windows. by @lialan in #20957
Remove IREE_DISABLE_THREAD_SAFETY_ANALYSIS by @bjacob in #20954
Integrate llvm-project@587d6fcbb685e3a57 by @hanhanW in #20948
[CPU][NFC] Trim IRs from tile_and_fuse and illegal_configuration tests. by @hanhanW in #20967
Fix 'failed to legalize' in padding materialization by @jtuyls in #20969
Use adaptor instead of storeOp to fix 'failed to legalize' by @jtuyls in #20971
[Codegen][GPU][NFC] Remove dead MMA interface method by @krzysz00 in #20960
[GPU] Add overflow flag to index addition in prefetching pass by @nirvedhmeshram in #20975
Run windows_x64_msvc on postsubmit and opt-in on presubmit (retry). by @ScottTodd in #20958
[DumpExecutableBenchmarks] Use MapVector to simplify code by @rkayaith in #20976
Allow importing an external stream into HIP. by @AWoloszyn in #20972
[Codegen][GPU] Refactor the way use_direct_load is propagated. by @lialan in #20926
[CodeGen][NFC] Delete empty file that was accidentally added. by @hanhanW in #20983
[Encoding] Teach specialize encodings to handle pad encodings. by @MaheshRavishankar in #20845
[Integrate] Mirror and prioritize the old ConvertVectorStore pattern. by @hanhanW in #20981
[NFC] Refresh the interface names for Encoding dialect and data-tiling specifics. by @hanhanW in #20985
[Util] Improve Util::FoldDimOp folder to handle memref.assume_alignment ops. by @hanhanW in #20984
Assign optimal affinities to allocations during ScheduleAllocation. by @benvanik in #20965
Prefetch shared memory in presence of scf.if by @nirvedhmeshram in #20904
Bump dawidd6/action-download-artifact from 9 to 10 in the github-actions group by @dependabot in #20978
Integrate llvm-project@7797824297e17d4c02fbb1cb904c7919f21af47e by @nirvedhmeshram in #20987
[CodeGen] Drop the workaround for memref.assume_alignment chain. by @hanhanW in #20973
[Codegen][GPU][NFC] More MMA dead code removal by @krzysz00 in #20980
[Codegen] Use attributes to define default tuning specs by @qedawkins in #20979
Release our buffer reference regardless of buffer_view success. by @AWoloszyn in #20988
Disable flaky tensorcore_vectorization test by @qedawkins in #21002
[docs] Update sharktuner documentation by @Muzammiluddin-Syed-ECE in #20704
[CPU][NFC] Trim more redundant IRs from lit tests. by @hanhanW in #21003
[Stream] Adding AffinityTopologyAttrInterface and HAL implementation. by @ziereis in #20885
[Codegen][GPU] Lower gpu.subgroup_reduce to DPP intrinsics on AMD GPUs by @Muzammiluddin-Syed-ECE in #20468
[Codegen][NFC] Create tiling utilities file by @AaronStGeorge in #20961
[Codegen] Create scf.forall -> scf.for pass by @AaronStGeorge in #20962
[ROCM] Fix tuning module string parameter by @qedawkins in #21008
[Dispatch Creation] Merge bubbling of expand and extract by @IanWood1 in #20989
Integrate llvm-20250604 by @nirvedhmeshram in #21010
Remove restriction on IGEMM lowering for dilated convolutions by @yzhang93 in #21011
Add note about GPU time synchronization by @erieaton-amd in #20997
Revert "[Codegen][ROCDL] Drop nominal support for dynamic shared mem … by @pravg-amd in #21020
[LLVMGPU] Fix linking error when one of the variants has no modules. by @MaheshRavishankar in #21027
Folding util.assume.int values that have a single possible value. by @benvanik in #21025
[NFC][DispatchCreation] Add better extract of expand test by @IanWood1 in #21013
Properly order multiple emplaced dispatch results. by @benvanik in #21026
Fix LHS addressing in medium pingpong f16 by @bjacob in #21017
[GPU] When Prefetching do not duplicate read stage ops in write stage by @nirvedhmeshram in #21031
Integrates/llvm 20250606 by @nirvedhmeshram in #21030
Adding IREE::HAL::AnnotateTargetDevicesPass. by @benvanik in #21022
[ROCm][Vulkan] Add known targets for Radeon R9700 and 9060XT by @kuhar in #21035
Adding AMDGPU HAL driver skeleton. by @benvanik in #20990
[Codegen] split-k on argmax to ensure ukernel support by @bangtianliu in #20906

Commit history: v3.4.0...v3.5.0
