feat(optimizer): Rewrite bucketed semi-join to inner join by kaikalur · Pull Request #27510 · prestodb/presto

kaikalur · 2026-04-04T22:46:52Z

Summary

When both sides of a semi-join are backed by tables bucketed on the semi-join key, rewrite the SemiJoinNode to a colocated INNER JOIN with a DISTINCT on the build side. This avoids unnecessary data shuffles since both sides are already co-partitioned by the join key.

The rewrite runs early (before other semi-join/join optimizers) so the resulting JoinNode participates in downstream join ordering.

Transformation

SemiJoin(source, filteringSource, key, semiJoinOutput)
-> Project(semiJoinOutput := TRUE)
    -> InnerJoin(source, Distinct(filteringSource), key)
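
To make the equivalence concrete, here is a minimal Java sketch over plain lists of non-null keys. It assumes the semi-join output is consumed only as a filter (rows where semiJoinOutput is TRUE are kept). The class and method names are hypothetical and unrelated to Presto's plan node classes. It illustrates why the DISTINCT on the build side matters: without it, duplicate build keys would multiply source rows.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of the PR.
class SemiJoinRewriteSketch {
    // Semi-join used as a filter: keep each source key that occurs in filteringSource.
    static List<Long> semiJoinFilter(List<Long> source, List<Long> filteringSource) {
        Set<Long> lookup = new HashSet<>(filteringSource);
        List<Long> out = new ArrayList<>();
        for (Long key : source) {
            if (lookup.contains(key)) {
                out.add(key);
            }
        }
        return out;
    }

    // Inner join on the key: emits one row per matching (source, build) pair.
    static List<Long> innerJoin(List<Long> source, Iterable<Long> build) {
        List<Long> out = new ArrayList<>();
        for (Long s : source) {
            for (Long b : build) {
                if (s.equals(b)) {
                    out.add(s);
                }
            }
        }
        return out;
    }

    // The rewritten shape: InnerJoin(source, Distinct(filteringSource)).
    static List<Long> rewritten(List<Long> source, List<Long> filteringSource) {
        return innerJoin(source, new LinkedHashSet<>(filteringSource));
    }
}
```

With source keys 1, 2, 2, 3 and filtering keys 2, 2, 3, 3, 4, both `semiJoinFilter` and `rewritten` keep the same three source rows, while `innerJoin` without the DISTINCT step emits six.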

Changes

FeaturesConfig / SystemSessionProperties / TestFeaturesConfig: Add config property optimizer.rewrite-bucketed-semi-join-to-inner-join and the matching session property rewrite_bucketed_semi_join_to_inner_join (default: false)
RewriteBucketedSemiJoinToInnerJoin: New iterative optimizer rule
PlanOptimizers: Register rule before LeftJoinNullFilterToSemiJoin
TestRewriteBucketedSemiJoinToInnerJoin: 8 test cases with mock bucketed connector
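
The gating described above can be sketched roughly as follows. This is a simplified stand-in that assumes a plain map of session properties; it is not Presto's actual FeaturesConfig or SystemSessionProperties machinery, and the class name and constants are hypothetical. Only the property name and the default of false come from this PR.

```java
import java.util.Map;

// Hypothetical illustration of "config default, overridable per session".
class FlagGatingSketch {
    // Stand-in for the FeaturesConfig default (false per the PR description).
    static final boolean CONFIG_DEFAULT = false;
    static final String SESSION_PROPERTY = "rewrite_bucketed_semi_join_to_inner_join";

    // A value set on the session, if present, overrides the config default.
    static boolean isRewriteEnabled(Map<String, String> sessionProperties) {
        String value = sessionProperties.get(SESSION_PROPERTY);
        return value == null ? CONFIG_DEFAULT : Boolean.parseBoolean(value);
    }
}
```

Because the default is false, the rule stays inactive unless a session explicitly opts in.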

Test Plan

Unit tests via TestRewriteBucketedSemiJoinToInnerJoin with mock bucketed connector
8 test cases covering positive rewrite, session property gating, non-bucketed tables, different bucket keys, non-table-scan sources, and filter/project traversal
TestFeaturesConfig updated for the new config property

Release Notes

== RELEASE NOTES ==
General Changes
* Add a new optimizer rule that rewrites semi-joins to colocated inner joins when both sides are bucketed on the join key, controlled by session property `rewrite_bucketed_semi_join_to_inner_join` (default: disabled).

sourcery-ai · 2026-04-04T22:47:00Z

Reviewer's Guide

Implements a new iterative optimizer rule that rewrites eligible bucketed semi-joins into inner joins with a DISTINCT build side, guarded by a new session property and wired into the optimizer pipeline with comprehensive unit tests using a mock bucketed TPCH connector.

Sequence diagram for RewriteBucketedSemiJoinToInnerJoin rule application

sequenceDiagram
    participant Session
    participant PlanOptimizers
    participant IterativeOptimizer
    participant RewriteRule as RewriteBucketedSemiJoinToInnerJoin
    participant Context
    participant Metadata

    Session->>PlanOptimizers: create with FeaturesConfig
    PlanOptimizers->>IterativeOptimizer: register rule RewriteBucketedSemiJoinToInnerJoin

    loop optimization_iterations
        IterativeOptimizer->>RewriteRule: isEnabled(session)
        RewriteRule->>Session: isRewriteBucketedSemiJoinToInnerJoinEnabled
        Session-->>RewriteRule: boolean enabled
        alt rule disabled
            RewriteRule-->>IterativeOptimizer: disabled
        else rule enabled
            IterativeOptimizer->>RewriteRule: apply(SemiJoinNode, captures, context)
            RewriteRule->>Context: getLookup.resolve(source)
            RewriteRule->>Context: getLookup.resolve(filteringSource)

            par resolve_source_table_scan
                RewriteRule->>Context: findTableScanAndResolveVariable(source)
                Context-->>RewriteRule: Optional TableScanInfo
            and resolve_filter_table_scan
                RewriteRule->>Context: findTableScanAndResolveVariable(filteringSource)
                Context-->>RewriteRule: Optional TableScanInfo
            end

            alt missing_table_scan
                RewriteRule-->>IterativeOptimizer: Result.empty
            else found_table_scans
                RewriteRule->>Metadata: getTableMetadata(source.table)
                Metadata-->>RewriteRule: properties including bucketed_by
                RewriteRule->>Metadata: getColumnMetadata(source.table, source.columnHandle)
                Metadata-->>RewriteRule: source column name

                RewriteRule->>Metadata: getTableMetadata(filtering.table)
                Metadata-->>RewriteRule: properties including bucketed_by
                RewriteRule->>Metadata: getColumnMetadata(filtering.table, filtering.columnHandle)
                Metadata-->>RewriteRule: filtering column name

                alt both_sides_bucketed_by_join_key
                    RewriteRule->>Context: isOutputDistinct(filteringSource)
                    Context-->>RewriteRule: boolean isDistinct
                    alt not_distinct
                        RewriteRule->>Context: create AggregationNode DISTINCT on filteringSource
                    else already_distinct
                        RewriteRule->>RewriteRule: reuse existing filteringSource
                    end

                    RewriteRule->>Context: create InnerJoinNode(source, distinctFilteringSource)
                    RewriteRule->>Context: create ProjectNode(InnerJoinNode, semiJoinOutput := TRUE)
                    RewriteRule-->>IterativeOptimizer: Result.ofPlanNode(ProjectNode)
                else not_bucketed
                    RewriteRule-->>IterativeOptimizer: Result.empty
                end
            end
        end
    end

Updated class diagram for bucketed semi-join rewrite and configuration

classDiagram
    class FeaturesConfig {
        - boolean addExchangeBelowPartialAggregationOverGroupId
        - boolean addDistinctBelowSemiJoinBuild
        - boolean rewriteBucketedSemiJoinToInnerJoin
        - boolean mergeMaxByMinByAggregationsEnabled
        + FeaturesConfig setRewriteBucketedSemiJoinToInnerJoin(boolean rewriteBucketedSemiJoinToInnerJoin)
        + boolean isRewriteBucketedSemiJoinToInnerJoin()
    }

    class SystemSessionProperties {
        <<final>>
        + String REWRITE_BUCKETED_SEMI_JOIN_TO_INNER_JOIN
        + booleanProperty(String name, String description, boolean defaultValue, boolean hidden)
        + boolean isRewriteBucketedSemiJoinToInnerJoinEnabled(Session session)
    }

    class PlanOptimizers {
        + PlanOptimizers(Metadata metadata, RuleStats ruleStats, StatsCalculator statsCalculator, CostCalculator estimatedExchangesCostCalculator)
    }

    class IterativeOptimizer {
        + IterativeOptimizer(Metadata metadata, RuleStats ruleStats, StatsCalculator statsCalculator, CostCalculator estimatedExchangesCostCalculator, Set~Rule~ rules)
        + Result optimize(Session session, PlanNode plan)
    }

    class RewriteBucketedSemiJoinToInnerJoin {
        - Metadata metadata
        - String BUCKETED_BY_PROPERTY
        + RewriteBucketedSemiJoinToInnerJoin(Metadata metadata)
        + Pattern~SemiJoinNode~ getPattern()
        + boolean isEnabled(Session session)
        + Result apply(SemiJoinNode node, Captures captures, Context context)
        - Optional~TableScanInfo~ findTableScanAndResolveVariable(PlanNode node, VariableReferenceExpression variable, Context context)
        - boolean isBucketedByColumn(TableScanInfo info, Session session)
        - boolean isOutputDistinct(PlanNode node, VariableReferenceExpression output, Context context)
    }

    class TableScanInfo {
        - TableScanNode tableScan
        - ColumnHandle columnHandle
        + TableScanInfo(TableScanNode tableScan, ColumnHandle columnHandle)
    }

    class SemiJoinNode {
        + PlanNode getSource()
        + PlanNode getFilteringSource()
        + VariableReferenceExpression getSourceJoinVariable()
        + VariableReferenceExpression getFilteringSourceJoinVariable()
        + VariableReferenceExpression getSemiJoinOutput()
    }

    class JoinNode {
        + JoinType joinType
        + List~EquiJoinClause~ criteria
    }

    class AggregationNode {
        + Step step
        + List~VariableReferenceExpression~ groupingKeys
        + static boolean isDistinct(AggregationNode node)
        + static GroupingSetDescriptor singleGroupingSet(List~VariableReferenceExpression~ keys)
    }

    class ProjectNode {
        + Assignments assignments
    }

    class Metadata {
        + TableMetadata getTableMetadata(Session session, TableHandle table)
        + ColumnMetadata getColumnMetadata(Session session, TableHandle table, ColumnHandle columnHandle)
    }

    class Session
    class RuleStats
    class StatsCalculator
    class CostCalculator
    class Rule
    class Pattern
    class Captures
    class Context {
        + Session getSession()
        + IdAllocator getIdAllocator()
        + Lookup getLookup()
    }
    class Lookup {
        + PlanNode resolve(PlanNode node)
    }

    RewriteBucketedSemiJoinToInnerJoin ..> Metadata : uses
    RewriteBucketedSemiJoinToInnerJoin ..> TableScanInfo : creates
    RewriteBucketedSemiJoinToInnerJoin ..> SemiJoinNode : transforms
    RewriteBucketedSemiJoinToInnerJoin ..> JoinNode : creates
    RewriteBucketedSemiJoinToInnerJoin ..> AggregationNode : creates
    RewriteBucketedSemiJoinToInnerJoin ..> ProjectNode : creates
    RewriteBucketedSemiJoinToInnerJoin ..> Context : uses
    RewriteBucketedSemiJoinToInnerJoin ..> Pattern : returns
    RewriteBucketedSemiJoinToInnerJoin ..|> Rule

    TableScanInfo --> TableScanNode
    SystemSessionProperties ..> FeaturesConfig : reads defaults
    SystemSessionProperties ..> Session : reads system property

    PlanOptimizers ..> IterativeOptimizer : composes
    IterativeOptimizer ..> RewriteBucketedSemiJoinToInnerJoin : contains rule set
    IterativeOptimizer ..> Session : uses
    IterativeOptimizer ..> PlanNode : rewrites

Flow diagram for rewriting bucketed semi-join to inner join

flowchart TD
    SemiJoinNode[SemiJoinNode input]
    ResolveSource[Resolve source plan node]
    ResolveFiltering[Resolve filteringSource plan node]
    FindSourceScan[Find TableScan and join column for source]
    FindFilterScan[Find TableScan and join column for filteringSource]
    CheckSourceBucket[Check source is bucketed by join key]
    CheckFilterBucket[Check filteringSource is bucketed by join key]
    CheckDistinct[Check if filteringSource output is already DISTINCT]
    BuildDistinct[Wrap filteringSource in AggregationNode DISTINCT if needed]
    BuildJoin[Build InnerJoinNode with DISTINCT filteringSource]
    BuildProject[Build ProjectNode adding semiJoinOutput TRUE]
    OutputPlan[Rewritten plan Project -> InnerJoin -> DISTINCT filteringSource]

    SemiJoinNode --> ResolveSource
    SemiJoinNode --> ResolveFiltering
    ResolveSource --> FindSourceScan
    ResolveFiltering --> FindFilterScan

    FindSourceScan --> CheckSourceBucket
    FindFilterScan --> CheckFilterBucket

    CheckSourceBucket -->|not bucketed or scan not found| NoRewrite[Return original SemiJoinNode]
    CheckFilterBucket -->|not bucketed or scan not found| NoRewrite

    CheckSourceBucket -->|bucketed| CheckFilterBucket
    CheckFilterBucket -->|bucketed| CheckDistinct

    CheckDistinct -->|already DISTINCT| BuildJoin
    CheckDistinct -->|not DISTINCT| BuildDistinct --> BuildJoin

    BuildJoin --> BuildProject --> OutputPlan
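
The decision flow in the diagram can be condensed into a small Java sketch. The `Side` and `Outcome` types are hypothetical summaries invented for illustration, and the sketch assumes each table is bucketed on at most one column; the real rule works on plan nodes and connector metadata.

```java
import java.util.Optional;

// Hypothetical illustration of the rule's decision flow; not the PR's code.
class RewriteDecisionSketch {
    enum Outcome { NO_REWRITE, JOIN_WITH_ADDED_DISTINCT, JOIN_REUSING_DISTINCT }

    // Summary of one side of the semi-join.
    static final class Side {
        final Optional<String> scanJoinColumn; // empty when no TableScan was found
        final Optional<String> bucketedBy;     // empty when the table is not bucketed
        final boolean alreadyDistinct;         // output already DISTINCT on the join key

        Side(Optional<String> scanJoinColumn, Optional<String> bucketedBy, boolean alreadyDistinct) {
            this.scanJoinColumn = scanJoinColumn;
            this.bucketedBy = bucketedBy;
            this.alreadyDistinct = alreadyDistinct;
        }
    }

    static boolean bucketedOnJoinKey(Side side) {
        // Requires a resolved scan column that equals the bucketing column.
        return side.scanJoinColumn.isPresent() && side.scanJoinColumn.equals(side.bucketedBy);
    }

    static Outcome decide(boolean enabled, Side source, Side filtering) {
        if (!enabled) {
            return Outcome.NO_REWRITE;
        }
        if (!bucketedOnJoinKey(source) || !bucketedOnJoinKey(filtering)) {
            return Outcome.NO_REWRITE;
        }
        return filtering.alreadyDistinct
                ? Outcome.JOIN_REUSING_DISTINCT
                : Outcome.JOIN_WITH_ADDED_DISTINCT;
    }
}
```

Every early exit corresponds to a "Return original SemiJoinNode" edge in the flowchart; only the last branch decides whether a DISTINCT aggregation is added.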

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Add a configurable session feature flag to enable rewriting bucketed semi-joins to inner joins. | (1) Introduce rewriteBucketedSemiJoinToInnerJoin boolean field with config binding and getter in FeaturesConfig. (2) Expose REWRITE_BUCKETED_SEMI_JOIN_TO_INNER_JOIN system property, including default, description, and accessor in SystemSessionProperties. (3) Extend TestFeaturesConfig defaults and explicit property mappings to cover the new property. | `presto-main-base/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java`, `presto-main-base/src/main/java/com/facebook/presto/SystemSessionProperties.java`, `presto-main-base/src/test/java/com/facebook/presto/sql/analyzer/TestFeaturesConfig.java` |
| Register the new rule into the logical optimizer sequence so it runs early in join optimization. | (1) Add a new IterativeOptimizer instance that contains RewriteBucketedSemiJoinToInnerJoin to PlanOptimizers, positioned before LeftJoinNullFilterToSemiJoin-related rules. | `presto-main-base/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java` |
| Implement RewriteBucketedSemiJoinToInnerJoin rule to transform eligible SemiJoinNodes into Project+InnerJoin+Distinct plans when both inputs are bucketed on the join key. | (1) Define Rule implementation that is gated by the new session property. (2) Resolve SemiJoin source and filteringSource through Project/Filter to underlying TableScanNodes and their ColumnHandles. (3) Fetch table metadata and inspect bucketed_by property to check that both sides are bucketed on the corresponding join column. (4) Detect whether the filtering side is already DISTINCT on the join key; if not, wrap it in an AggregationNode with single grouping key. (5) Build a new JoinNode(INNER) between the original source and a distinctified filteringSource, then wrap it in a ProjectNode that sets the semi-join output variable to TRUE and passes through source columns. (6) Add helper logic (TableScanInfo, isOutputDistinct, findTableScanAndResolveVariable) to support metadata/bucketing resolution and distinct detection through Filter/Project chains. | `presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteBucketedSemiJoinToInnerJoin.java` |
| Add unit tests for the new rule using a bucketed TPCH mock connector that annotates table metadata with bucketed_by properties. | (1) Create TestRewriteBucketedSemiJoinToInnerJoin with RuleTester wired to a BucketedMockConnectorFactory that extends TpchConnectorFactory. (2) Define BucketedMockMetadata overriding getTableMetadata to inject bucketed_by for orders(orderkey), lineitem(orderkey), and nation(regionkey) while leaving customer unbucketed. (3) Add tests covering: successful rewrite for both-sides-bucketed tables, structural demo of the transformed plan, disabled session property, non-bucketed source, non-bucketed filtering source, mismatched bucket key, non-TableScan source, and cases where the rule fires through Filter and Project. | `presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestRewriteBucketedSemiJoinToInnerJoin.java` |
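
As a rough illustration of the bucketed_by check described in the table, the following Java sketch inspects a table-properties map. It assumes the property value is a list of column names and that a side qualifies only when it is bucketed on exactly the join column; both are assumptions made for this sketch, and the method signature differs from the rule's `isBucketedByColumn(TableScanInfo, Session)`.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration; the real rule reads properties via Metadata.getTableMetadata.
class BucketedByCheckSketch {
    static final String BUCKETED_BY_PROPERTY = "bucketed_by";

    static boolean isBucketedByColumn(Map<String, Object> tableProperties, String joinColumnName) {
        Object value = tableProperties.get(BUCKETED_BY_PROPERTY);
        if (!(value instanceof List)) {
            return false; // property missing or of an unexpected type: treat as not bucketed
        }
        List<?> bucketColumns = (List<?>) value;
        // Assumption: require bucketing on exactly the join column.
        return bucketColumns.size() == 1 && joinColumnName.equals(bucketColumns.get(0));
    }
}
```

With the mock metadata from the tests, orders bucketed by orderkey would pass for a join on orderkey, nation bucketed by regionkey would fail for a join on nationkey, and customer (no bucketed_by) would always fail.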

sourcery-ai

Hey - I've found 2 issues, and left some high-level feedback:

When constructing the new JoinNode you drop existing semi-join metadata (e.g., dynamic filter id, join distribution/hints, additional join criteria), which could regress behavior; consider threading through any applicable properties from the original SemiJoinNode instead of using empty optionals/defaults.
The helper findTableScanAndResolveVariable only traverses Project and Filter before giving up, so the rule will silently not apply when the semi-join sides are wrapped in other common nodes (e.g., Limit, TopN, EnforceSingleRow); consider extending this traversal to handle the additional wrappers you expect to see around bucketed table scans.
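
The traversal behavior described in the second point can be sketched as follows. The node classes are hypothetical, heavily simplified stand-ins for Presto's plan nodes, and the sketch omits the variable-to-column resolution that the real helper also performs. It only shows that Project and Filter are looked through while any other wrapper stops the search.

```java
import java.util.Optional;

// Hypothetical illustration; not Presto's PlanNode hierarchy.
class ScanTraversalSketch {
    static abstract class Node {
        final Node child;
        Node(Node child) { this.child = child; }
    }
    static final class TableScan extends Node {
        final String table;
        TableScan(String table) { super(null); this.table = table; }
    }
    static final class Filter extends Node { Filter(Node child) { super(child); } }
    static final class Project extends Node { Project(Node child) { super(child); } }
    static final class Limit extends Node { Limit(Node child) { super(child); } }

    // Walk down through Project and Filter only; give up on anything else.
    static Optional<TableScan> findTableScan(Node node) {
        Node current = node;
        while (current instanceof Filter || current instanceof Project) {
            current = current.child;
        }
        return current instanceof TableScan
                ? Optional.of((TableScan) current)
                : Optional.empty();
    }
}
```

Under this scheme, a Filter over a Project over a scan resolves to the scan, whereas a Limit over the same scan yields an empty result and the rule would not fire.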

## Individual Comments

### Comment 1
<location path="presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestRewriteBucketedSemiJoinToInnerJoin.java" line_range="271" />
<code_context>
+    }
+
+    @Test
+    public void testDoesNotFireForNonTableScanSource()
+    {
+        // source is ValuesNode, not a TableScan
</code_context>
<issue_to_address>
**suggestion (testing):** Add a symmetric negative test where the filtering side (or both sides) is a non-TableScan to cover `findTableScanAndResolveVariable` for that branch.

The source-side branch is covered by `testDoesNotFireForNonTableScanSource`, but the equivalent branch where the filtering side is a non-`TableScan` (and the source is a bucketed `TableScan`) isn’t tested. Please add a semi-join test with a non-`TableScan` filtering source (e.g., `ValuesNode`) that asserts `.doesNotFire()` to cover that path in `findTableScanAndResolveVariable`.
</issue_to_address>

### Comment 2
<location path="presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestRewriteBucketedSemiJoinToInnerJoin.java" line_range="130" />
<code_context>
+    }
+
+    @Test
+    public void testResultDemoShowsRewrite()
+    {
+        // Demonstrates: SemiJoin(bucketed source, bucketed filteringSource)
</code_context>
<issue_to_address>
**suggestion (testing):** Tighten the plan assertions to verify the semi-join output column is preserved and set to TRUE by the Project node.

The current positive tests (`testRewriteBucketedSemiJoinToInnerJoin` and `testResultDemoShowsRewrite`) only verify the `Project` → `InnerJoin` → `Aggregation` shape, but not that the semi-join output symbol is preserved and set to TRUE in the `ProjectNode`. Since this rule relies on `semiJoinOutput := TRUE` to preserve semantics, a regression in that assignment (wrong symbol/expression or missing output) may go unnoticed. Please extend one of these tests to assert that the project outputs include the semi-join output symbol and that it is mapped to a TRUE literal (or at least that the parent plan expects and uses that symbol), using the appropriate `PlanMatchPattern` helpers.

Suggested implementation:

```java
    @Test
    public void testResultDemoShowsRewrite()
    {
        // Demonstrates: SemiJoin(bucketed source, bucketed filteringSource)
        //   → Project(semiJoinOutput := TRUE) → InnerJoin → Distinct(filteringSource)
        // and verifies that the semi-join output symbol is preserved and set to TRUE.
        tester().assertThat(new RewriteBucketedSemiJoinToInnerJoin(tester().getMetadata()))
                .setSystemProperty(REWRITE_BUCKETED_SEMI_JOIN_TO_INNER_JOIN, "true")
                .on(p -> {
                    VariableReferenceExpression sourceKey = p.variable("sourceKey", BIGINT);
                    VariableReferenceExpression filterKey = p.variable("filterKey", BIGINT);
                    VariableReferenceExpression output = p.variable("output", BOOLEAN);

                    // Build the original plan with a SemiJoin that produces `output`
                    return p.semiJoin(
                            sourceKey,
                            filterKey,
                            output,
                            "source",
                            "filter",
                            Optional.empty(),
                            Optional.empty(),
                            Optional.empty());
                })
                .matches(
                        // After rewrite we expect:
                        // Project(output := TRUE, ...) → InnerJoin → Aggregation(DISTINCT filteringSource)
                        project(
                                // The project must explicitly map the semi-join output symbol to TRUE
                                ImmutableMap.of("output", expression("true")),
                                join(
                                        INNER,
                                        ImmutableList.of(equiJoinClause("sourceKey", "filterKey")),
                                        Optional.empty(),
                                        anyTree(),
                                        aggregation(
                                                singleGroupingSet("filterKey"),
                                                ImmutableMap.of(),
                                                ImmutableMap.of(),
                                                Optional.empty(),
                                                AggregationNode.Step.SINGLE,
                                                anyTree()))));

```

I only saw the beginning of `testResultDemoShowsRewrite`, so the following may need adjustment to fit the existing code:

1. Ensure the `return p.semiJoin(...)` arguments (source/filter handles and optionals) match how `PlanBuilder.semiJoin` is used elsewhere in this test class or codebase.
2. The `matches(...)` tree is reconstructed from the surrounding snippet; if the existing expected pattern uses different helper methods (e.g., `aggregation(...)` wrapping `distinct()` or different symbol names), align the join/aggregation pattern accordingly.
3. Confirm you have static imports for the pattern helpers used above:
   * `project`, `join`, `aggregation`, `singleGroupingSet`, `anyTree`, `equiJoinClause`, and `expression` from `PlanMatchPattern`.
4. If the test previously wrapped the `semiJoin` in a `project` node in the input plan, keep that structure and only tighten the **expected** `project(...)` in `matches(...)` to assert that `"output"` is mapped to `expression("true")` and that `"output"` is used by the parent node as appropriate.
</issue_to_address>

kaikalur requested review from a team, feilong-liu and jaystarshot as code owners April 4, 2026 22:46

prestodb-ci added the from:Meta (PR from Meta) label Apr 4, 2026

sourcery-ai bot reviewed Apr 4, 2026

kaikalur force-pushed the rewrite-bucketed-semi-join-to-inner-join branch 4 times, most recently from 3b07998 to 661b173 Compare April 5, 2026 02:16

kaikalur force-pushed the rewrite-bucketed-semi-join-to-inner-join branch from 661b173 to 210667f Compare April 5, 2026 03:22

kaikalur wants to merge 1 commit into prestodb:master from kaikalur:rewrite-bucketed-semi-join-to-inner-join
