feat(optimizer): Rewrite bucketed semi-join to inner join #27510

Open
kaikalur wants to merge 1 commit into prestodb:master from kaikalur:rewrite-bucketed-semi-join-to-inner-join

@kaikalur kaikalur commented Apr 4, 2026

Summary

When both sides of a semi-join are backed by tables bucketed on the semi-join key, rewrite the SemiJoinNode to a colocated INNER JOIN with a DISTINCT on the build side. This avoids unnecessary data shuffles since both sides are already co-partitioned by the join key.

The rewrite runs early (before other semi-join/join optimizers) so the resulting JoinNode participates in downstream join ordering.

Transformation

SemiJoin(source, filteringSource, key, semiJoinOutput)
-> Project(semiJoinOutput := TRUE)
    -> InnerJoin(source, Distinct(filteringSource), key)
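Why the DISTINCT on the build side matters can be seen on plain key lists. The following is a minimal sketch, not Presto code: the class and method names are hypothetical, keys are modeled as `Long` values, and it assumes the semi-join output is only used to keep matching source rows (an inner join drops the non-matching ones).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration on in-memory lists; none of these names exist in Presto.
final class SemiJoinEquivalenceSketch
{
    private SemiJoinEquivalenceSketch() {}

    // Semi-join used as a filter: keep each source key that occurs in the filtering side.
    static List<Long> semiJoinFilter(List<Long> source, List<Long> filteringSource)
    {
        Set<Long> lookup = new LinkedHashSet<>(filteringSource);
        List<Long> result = new ArrayList<>();
        for (Long key : source) {
            if (key != null && lookup.contains(key)) {
                result.add(key);
            }
        }
        return result;
    }

    // Nested-loop inner join on the key; distinctBuild models Distinct(filteringSource).
    static List<Long> innerJoin(List<Long> source, List<Long> filteringSource, boolean distinctBuild)
    {
        List<Long> build = distinctBuild
                ? new ArrayList<>(new LinkedHashSet<>(filteringSource))
                : filteringSource;
        List<Long> result = new ArrayList<>();
        for (Long probe : source) {
            for (Long candidate : build) {
                if (probe != null && probe.equals(candidate)) {
                    result.add(probe);
                }
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<Long> source = java.util.Arrays.asList(1L, 2L, 2L, 3L);
        List<Long> filtering = java.util.Arrays.asList(2L, 2L, 3L, 4L);
        System.out.println(semiJoinFilter(source, filtering));   // [2, 2, 3]
        System.out.println(innerJoin(source, filtering, true));  // [2, 2, 3]
        System.out.println(innerJoin(source, filtering, false)); // [2, 2, 2, 2, 3]
    }
}
```

Without the DISTINCT, duplicate keys on the filtering side multiply source rows, which a semi-join never does.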

Changes

  • FeaturesConfig / SystemSessionProperties / TestFeaturesConfig: Add config property optimizer.rewrite-bucketed-semi-join-to-inner-join and the matching session property rewrite_bucketed_semi_join_to_inner_join (default: false)
  • RewriteBucketedSemiJoinToInnerJoin: New iterative optimizer rule
  • PlanOptimizers: Register rule before LeftJoinNullFilterToSemiJoin
  • TestRewriteBucketedSemiJoinToInnerJoin: 8 test cases with mock bucketed connector

Test Plan

  • Unit tests via TestRewriteBucketedSemiJoinToInnerJoin with mock bucketed connector
  • 8 test cases covering positive rewrite, session property gating, non-bucketed tables, different bucket keys, non-table-scan sources, and filter/project traversal
  • TestFeaturesConfig updated for the new session property

Release Notes

== RELEASE NOTES ==
General Changes
* Add a new optimizer rule that rewrites semi-joins to colocated inner joins when both sides are bucketed on the join key, controlled by session property `rewrite_bucketed_semi_join_to_inner_join` (default: disabled).

@kaikalur kaikalur requested review from a team, feilong-liu and jaystarshot as code owners April 4, 2026 22:46
@prestodb-ci prestodb-ci added the from:Meta PR from Meta label Apr 4, 2026

sourcery-ai bot commented Apr 4, 2026

Reviewer's Guide

Implements a new iterative optimizer rule that rewrites eligible bucketed semi-joins into inner joins with a DISTINCT build side, guarded by a new session property and wired into the optimizer pipeline with comprehensive unit tests using a mock bucketed TPCH connector.

Sequence diagram for RewriteBucketedSemiJoinToInnerJoin rule application

sequenceDiagram
    participant Session
    participant PlanOptimizers
    participant IterativeOptimizer
    participant RewriteRule as RewriteBucketedSemiJoinToInnerJoin
    participant Context
    participant Metadata

    Session->>PlanOptimizers: create with FeaturesConfig
    PlanOptimizers->>IterativeOptimizer: register rule RewriteBucketedSemiJoinToInnerJoin

    loop optimization_iterations
        IterativeOptimizer->>RewriteRule: isEnabled(session)
        RewriteRule->>Session: isRewriteBucketedSemiJoinToInnerJoinEnabled
        Session-->>RewriteRule: boolean enabled
        alt rule disabled
            RewriteRule-->>IterativeOptimizer: disabled
        else rule enabled
            IterativeOptimizer->>RewriteRule: apply(SemiJoinNode, captures, context)
            RewriteRule->>Context: getLookup.resolve(source)
            RewriteRule->>Context: getLookup.resolve(filteringSource)

            par resolve_source_table_scan
                RewriteRule->>Context: findTableScanAndResolveVariable(source)
                Context-->>RewriteRule: Optional TableScanInfo
            and resolve_filter_table_scan
                RewriteRule->>Context: findTableScanAndResolveVariable(filteringSource)
                Context-->>RewriteRule: Optional TableScanInfo
            end

            alt missing_table_scan
                RewriteRule-->>IterativeOptimizer: Result.empty
            else found_table_scans
                RewriteRule->>Metadata: getTableMetadata(source.table)
                Metadata-->>RewriteRule: properties including bucketed_by
                RewriteRule->>Metadata: getColumnMetadata(source.table, source.columnHandle)
                Metadata-->>RewriteRule: source column name

                RewriteRule->>Metadata: getTableMetadata(filtering.table)
                Metadata-->>RewriteRule: properties including bucketed_by
                RewriteRule->>Metadata: getColumnMetadata(filtering.table, filtering.columnHandle)
                Metadata-->>RewriteRule: filtering column name

                alt both_sides_bucketed_by_join_key
                    RewriteRule->>Context: isOutputDistinct(filteringSource)
                    Context-->>RewriteRule: boolean isDistinct
                    alt not_distinct
                        RewriteRule->>Context: create AggregationNode DISTINCT on filteringSource
                    else already_distinct
                        RewriteRule->>RewriteRule: reuse existing filteringSource
                    end

                    RewriteRule->>Context: create InnerJoinNode(source, distinctFilteringSource)
                    RewriteRule->>Context: create ProjectNode(InnerJoinNode, semiJoinOutput := TRUE)
                    RewriteRule-->>IterativeOptimizer: Result.ofPlanNode(ProjectNode)
                else not_bucketed
                    RewriteRule-->>IterativeOptimizer: Result.empty
                end
            end
        end
    end
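The table-scan lookup step in the diagram can be pictured with a small stand-alone model. This is a sketch under stated assumptions, not the PR's implementation: `Node`, `Scan`, `Unary`, and `Values` are hypothetical stand-ins for plan nodes, and the variable-to-column resolution that `findTableScanAndResolveVariable` also performs is left out.

```java
import java.util.Optional;

// Hypothetical stand-ins for plan nodes; these are not Presto's classes.
final class TableScanLookupSketch
{
    private TableScanLookupSketch() {}

    interface Node {}

    static final class Scan implements Node
    {
        final String table;

        Scan(String table)
        {
            this.table = table;
        }
    }

    // A single-source node such as Project, Filter, or Limit.
    static final class Unary implements Node
    {
        final String kind;
        final Node source;

        Unary(String kind, Node source)
        {
            this.kind = kind;
            this.source = source;
        }
    }

    static final class Values implements Node {}

    // Walk down through Project and Filter only; any other node ends the search.
    static Optional<Scan> findTableScan(Node node)
    {
        Node current = node;
        while (current instanceof Unary) {
            Unary unary = (Unary) current;
            boolean transparent = unary.kind.equals("Project") || unary.kind.equals("Filter");
            if (!transparent) {
                return Optional.empty();
            }
            current = unary.source;
        }
        if (current instanceof Scan) {
            return Optional.of((Scan) current);
        }
        return Optional.empty();
    }

    public static void main(String[] args)
    {
        Node orders = new Scan("orders");
        System.out.println(findTableScan(new Unary("Project", new Unary("Filter", orders))).isPresent()); // true
        System.out.println(findTableScan(new Unary("Limit", orders)).isPresent()); // false
        System.out.println(findTableScan(new Values()).isPresent()); // false
    }
}
```

An empty result corresponds to the `missing_table_scan` branch, where the rule returns `Result.empty` and the plan is left unchanged.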

Updated class diagram for bucketed semi-join rewrite and configuration

classDiagram
    class FeaturesConfig {
        - boolean addExchangeBelowPartialAggregationOverGroupId
        - boolean addDistinctBelowSemiJoinBuild
        - boolean rewriteBucketedSemiJoinToInnerJoin
        - boolean mergeMaxByMinByAggregationsEnabled
        + FeaturesConfig setRewriteBucketedSemiJoinToInnerJoin(boolean rewriteBucketedSemiJoinToInnerJoin)
        + boolean isRewriteBucketedSemiJoinToInnerJoin()
    }

    class SystemSessionProperties {
        <<final>>
        + String REWRITE_BUCKETED_SEMI_JOIN_TO_INNER_JOIN
        + booleanProperty(String name, String description, boolean defaultValue, boolean hidden)
        + boolean isRewriteBucketedSemiJoinToInnerJoinEnabled(Session session)
    }

    class PlanOptimizers {
        + PlanOptimizers(Metadata metadata, RuleStats ruleStats, StatsCalculator statsCalculator, CostCalculator estimatedExchangesCostCalculator)
    }

    class IterativeOptimizer {
        + IterativeOptimizer(Metadata metadata, RuleStats ruleStats, StatsCalculator statsCalculator, CostCalculator estimatedExchangesCostCalculator, Set~Rule~ rules)
        + Result optimize(Session session, PlanNode plan)
    }

    class RewriteBucketedSemiJoinToInnerJoin {
        - Metadata metadata
        - String BUCKETED_BY_PROPERTY
        + RewriteBucketedSemiJoinToInnerJoin(Metadata metadata)
        + Pattern~SemiJoinNode~ getPattern()
        + boolean isEnabled(Session session)
        + Result apply(SemiJoinNode node, Captures captures, Context context)
        - Optional~TableScanInfo~ findTableScanAndResolveVariable(PlanNode node, VariableReferenceExpression variable, Context context)
        - boolean isBucketedByColumn(TableScanInfo info, Session session)
        - boolean isOutputDistinct(PlanNode node, VariableReferenceExpression output, Context context)
    }

    class TableScanInfo {
        - TableScanNode tableScan
        - ColumnHandle columnHandle
        + TableScanInfo(TableScanNode tableScan, ColumnHandle columnHandle)
    }

    class SemiJoinNode {
        + PlanNode getSource()
        + PlanNode getFilteringSource()
        + VariableReferenceExpression getSourceJoinVariable()
        + VariableReferenceExpression getFilteringSourceJoinVariable()
        + VariableReferenceExpression getSemiJoinOutput()
    }

    class JoinNode {
        + JoinType joinType
        + List~EquiJoinClause~ criteria
    }

    class AggregationNode {
        + Step step
        + List~VariableReferenceExpression~ groupingKeys
        + static boolean isDistinct(AggregationNode node)
        + static GroupingSetDescriptor singleGroupingSet(List~VariableReferenceExpression~ keys)
    }

    class ProjectNode {
        + Assignments assignments
    }

    class Metadata {
        + TableMetadata getTableMetadata(Session session, TableHandle table)
        + ColumnMetadata getColumnMetadata(Session session, TableHandle table, ColumnHandle columnHandle)
    }

    class Session
    class RuleStats
    class StatsCalculator
    class CostCalculator
    class Rule
    class Pattern
    class Captures
    class Context {
        + Session getSession()
        + IdAllocator getIdAllocator()
        + Lookup getLookup()
    }
    class Lookup {
        + PlanNode resolve(PlanNode node)
    }

    RewriteBucketedSemiJoinToInnerJoin ..> Metadata : uses
    RewriteBucketedSemiJoinToInnerJoin ..> TableScanInfo : creates
    RewriteBucketedSemiJoinToInnerJoin ..> SemiJoinNode : transforms
    RewriteBucketedSemiJoinToInnerJoin ..> JoinNode : creates
    RewriteBucketedSemiJoinToInnerJoin ..> AggregationNode : creates
    RewriteBucketedSemiJoinToInnerJoin ..> ProjectNode : creates
    RewriteBucketedSemiJoinToInnerJoin ..> Context : uses
    RewriteBucketedSemiJoinToInnerJoin ..> Pattern : returns
    RewriteBucketedSemiJoinToInnerJoin ..|> Rule

    TableScanInfo --> TableScanNode
    SystemSessionProperties ..> FeaturesConfig : reads defaults
    SystemSessionProperties ..> Session : reads system property

    PlanOptimizers ..> IterativeOptimizer : composes
    IterativeOptimizer ..> RewriteBucketedSemiJoinToInnerJoin : contains rule set
    IterativeOptimizer ..> Session : uses
    IterativeOptimizer ..> PlanNode : rewrites

Flow diagram for rewriting bucketed semi-join to inner join

flowchart TD
    SemiJoinNode[SemiJoinNode input]
    ResolveSource[Resolve source plan node]
    ResolveFiltering[Resolve filteringSource plan node]
    FindSourceScan[Find TableScan and join column for source]
    FindFilterScan[Find TableScan and join column for filteringSource]
    CheckSourceBucket[Check source is bucketed by join key]
    CheckFilterBucket[Check filteringSource is bucketed by join key]
    CheckDistinct[Check if filteringSource output is already DISTINCT]
    BuildDistinct[Wrap filteringSource in AggregationNode DISTINCT if needed]
    BuildJoin[Build InnerJoinNode with DISTINCT filteringSource]
    BuildProject[Build ProjectNode adding semiJoinOutput TRUE]
    OutputPlan[Rewritten plan Project -> InnerJoin -> DISTINCT filteringSource]

    SemiJoinNode --> ResolveSource
    SemiJoinNode --> ResolveFiltering
    ResolveSource --> FindSourceScan
    ResolveFiltering --> FindFilterScan

    FindSourceScan --> CheckSourceBucket
    FindFilterScan --> CheckFilterBucket

    CheckSourceBucket -->|not bucketed or scan not found| NoRewrite[Return original SemiJoinNode]
    CheckFilterBucket -->|not bucketed or scan not found| NoRewrite

    CheckSourceBucket -->|bucketed| CheckFilterBucket
    CheckFilterBucket -->|bucketed| CheckDistinct

    CheckDistinct -->|already DISTINCT| BuildJoin
    CheckDistinct -->|not DISTINCT| BuildDistinct --> BuildJoin

    BuildJoin --> BuildProject --> OutputPlan
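The last three boxes of the flow (optional DISTINCT, inner join, project) can be sketched as a tiny tree builder. All types here are hypothetical and only render themselves as text. The sketch also assumes "already DISTINCT" means the filtering side's top node is a distinct node, whereas the PR describes detection through Filter/Project chains.

```java
// Hypothetical plan model that only renders itself as text; not Presto's plan classes.
final class DistinctBuildSideSketch
{
    private DistinctBuildSideSketch() {}

    interface Node
    {
        String render();
    }

    static final class Scan implements Node
    {
        final String table;

        Scan(String table)
        {
            this.table = table;
        }

        public String render()
        {
            return "Scan(" + table + ")";
        }
    }

    static final class Distinct implements Node
    {
        final Node source;

        Distinct(Node source)
        {
            this.source = source;
        }

        public String render()
        {
            return "Distinct(" + source.render() + ")";
        }
    }

    static final class InnerJoin implements Node
    {
        final Node left;
        final Node right;

        InnerJoin(Node left, Node right)
        {
            this.left = left;
            this.right = right;
        }

        public String render()
        {
            return "InnerJoin(" + left.render() + ", " + right.render() + ")";
        }
    }

    static final class Project implements Node
    {
        final Node source;

        Project(Node source)
        {
            this.source = source;
        }

        public String render()
        {
            return "Project[semiJoinOutput := TRUE](" + source.render() + ")";
        }
    }

    // Wrap the filtering side in Distinct only when it is not already distinct.
    static Node rewrite(Node source, Node filteringSource)
    {
        Node build = filteringSource instanceof Distinct ? filteringSource : new Distinct(filteringSource);
        return new Project(new InnerJoin(source, build));
    }

    public static void main(String[] args)
    {
        System.out.println(rewrite(new Scan("orders"), new Scan("lineitem")).render());
    }
}
```

Reusing an existing distinct node avoids stacking a second aggregation on the build side, which matches the `already_distinct` branch in the sequence diagram.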

File-Level Changes

Change: Add a configurable session feature flag to enable rewriting bucketed semi-joins to inner joins.
Details:
  • Introduce rewriteBucketedSemiJoinToInnerJoin boolean field with config binding and getter in FeaturesConfig
  • Expose REWRITE_BUCKETED_SEMI_JOIN_TO_INNER_JOIN system property, including default, description, and accessor in SystemSessionProperties
  • Extend TestFeaturesConfig defaults and explicit property mappings to cover the new property
Files:
  • presto-main-base/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java
  • presto-main-base/src/main/java/com/facebook/presto/SystemSessionProperties.java
  • presto-main-base/src/test/java/com/facebook/presto/sql/analyzer/TestFeaturesConfig.java

Change: Register the new rule into the logical optimizer sequence so it runs early in join optimization.
Details:
  • Add a new IterativeOptimizer instance that contains RewriteBucketedSemiJoinToInnerJoin to PlanOptimizers, positioned before LeftJoinNullFilterToSemiJoin-related rules
Files:
  • presto-main-base/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java

Change: Implement RewriteBucketedSemiJoinToInnerJoin rule to transform eligible SemiJoinNodes into Project+InnerJoin+Distinct plans when both inputs are bucketed on the join key.
Details:
  • Define Rule implementation that is gated by the new session property
  • Resolve SemiJoin source and filteringSource through Project/Filter to underlying TableScanNodes and their ColumnHandles
  • Fetch table metadata and inspect bucketed_by property to check that both sides are bucketed on the corresponding join column
  • Detect whether the filtering side is already DISTINCT on the join key; if not, wrap it in an AggregationNode with single grouping key
  • Build a new JoinNode(INNER) between the original source and a distinctified filteringSource, then wrap it in a ProjectNode that sets the semi-join output variable to TRUE and passes through source columns
  • Add helper logic (TableScanInfo, isOutputDistinct, findTableScanAndResolveVariable) to support metadata/bucketing resolution and distinct detection through Filter/Project chains
Files:
  • presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteBucketedSemiJoinToInnerJoin.java

Change: Add unit tests for the new rule using a bucketed TPCH mock connector that annotates table metadata with bucketed_by properties.
Details:
  • Create TestRewriteBucketedSemiJoinToInnerJoin with RuleTester wired to a BucketedMockConnectorFactory that extends TpchConnectorFactory
  • Define BucketedMockMetadata overriding getTableMetadata to inject bucketed_by for orders(orderkey), lineitem(orderkey), and nation(regionkey) while leaving customer unbucketed
  • Add tests covering: successful rewrite for both-sides-bucketed tables, structural demo of the transformed plan, disabled session property, non-bucketed source, non-bucketed filtering source, mismatched bucket key, non-TableScan source, and cases where the rule fires through Filter and Project
Files:
  • presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestRewriteBucketedSemiJoinToInnerJoin.java
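How the mock metadata and the `bucketed_by` check fit together might look roughly like the following. This is a sketch under stated assumptions rather than the test code from the PR: table properties are modeled as a plain map, the class and method names are hypothetical, and the check requires `bucketed_by` to be exactly the single join column (the real rule may be more or less strict, and bucket counts are ignored here).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the mock metadata; not the PR's BucketedMockMetadata class.
final class BucketedMockMetadataSketch
{
    static final String BUCKETED_BY_PROPERTY = "bucketed_by";

    private static final Map<String, List<String>> BUCKET_COLUMNS = new HashMap<>();

    static {
        BUCKET_COLUMNS.put("orders", Collections.singletonList("orderkey"));
        BUCKET_COLUMNS.put("lineitem", Collections.singletonList("orderkey"));
        BUCKET_COLUMNS.put("nation", Collections.singletonList("regionkey"));
        // customer is intentionally absent: it stays unbucketed.
    }

    private BucketedMockMetadataSketch() {}

    // Table properties with bucketed_by injected only for the bucketed tables.
    static Map<String, Object> tableProperties(String table)
    {
        Map<String, Object> properties = new HashMap<>();
        List<String> bucketColumns = BUCKET_COLUMNS.get(table);
        if (bucketColumns != null) {
            properties.put(BUCKETED_BY_PROPERTY, bucketColumns);
        }
        return properties;
    }

    // Assumption: the table counts as bucketed on the key only if bucketed_by is exactly that column.
    static boolean isBucketedByColumn(String table, String column)
    {
        Object value = tableProperties(table).get(BUCKETED_BY_PROPERTY);
        if (!(value instanceof List)) {
            return false;
        }
        List<?> columns = (List<?>) value;
        return columns.size() == 1 && column.equals(columns.get(0));
    }

    static boolean bothSidesBucketed(String sourceTable, String sourceColumn, String filteringTable, String filteringColumn)
    {
        return isBucketedByColumn(sourceTable, sourceColumn) && isBucketedByColumn(filteringTable, filteringColumn);
    }

    public static void main(String[] args)
    {
        System.out.println(bothSidesBucketed("orders", "orderkey", "lineitem", "orderkey")); // true
        System.out.println(bothSidesBucketed("customer", "custkey", "orders", "orderkey"));  // false
        System.out.println(bothSidesBucketed("orders", "orderkey", "nation", "nationkey"));  // false
    }
}
```

The three calls in `main` mirror the positive, non-bucketed-source, and mismatched-bucket-key cases listed above.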

@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 2 issues, and left some high-level feedback:

  • When constructing the new JoinNode you drop existing semi-join metadata (e.g., dynamic filter id, join distribution/hints, additional join criteria), which could regress behavior; consider threading through any applicable properties from the original SemiJoinNode instead of using empty optionals/defaults.
  • The helper findTableScanAndResolveVariable only traverses Project and Filter before giving up, so the rule will silently not apply when the semi-join sides are wrapped in other common nodes (e.g., Limit, TopN, EnforceSingleRow); consider extending this traversal to handle the additional wrappers you expect to see around bucketed table scans.
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestRewriteBucketedSemiJoinToInnerJoin.java" line_range="271" />
<code_context>
+    }
+
+    @Test
+    public void testDoesNotFireForNonTableScanSource()
+    {
+        // source is ValuesNode, not a TableScan
</code_context>
<issue_to_address>
**suggestion (testing):** Add a symmetric negative test where the filtering side (or both sides) is a non-TableScan to cover `findTableScanAndResolveVariable` for that branch.

The source-side branch is covered by `testDoesNotFireForNonTableScanSource`, but the equivalent branch where the filtering side is a non-`TableScan` (and the source is a bucketed `TableScan`) isn’t tested. Please add a semi-join test with a non-`TableScan` filtering source (e.g., `ValuesNode`) that asserts `.doesNotFire()` to cover that path in `findTableScanAndResolveVariable`.
</issue_to_address>

### Comment 2
<location path="presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestRewriteBucketedSemiJoinToInnerJoin.java" line_range="130" />
<code_context>
+    }
+
+    @Test
+    public void testResultDemoShowsRewrite()
+    {
+        // Demonstrates: SemiJoin(bucketed source, bucketed filteringSource)
</code_context>
<issue_to_address>
**suggestion (testing):** Tighten the plan assertions to verify the semi-join output column is preserved and set to TRUE by the Project node.

The current positive tests (`testRewriteBucketedSemiJoinToInnerJoin` and `testResultDemoShowsRewrite`) only verify the `Project` → `InnerJoin` → `Aggregation` shape, but not that the semi-join output symbol is preserved and set to TRUE in the `ProjectNode`. Since this rule relies on `semiJoinOutput := TRUE` to preserve semantics, a regression in that assignment (wrong symbol/expression or missing output) may go unnoticed. Please extend one of these tests to assert that the project outputs include the semi-join output symbol and that it is mapped to a TRUE literal (or at least that the parent plan expects and uses that symbol), using the appropriate `PlanMatchPattern` helpers.

Suggested implementation:

```java
    @Test
    public void testResultDemoShowsRewrite()
    {
        // Demonstrates: SemiJoin(bucketed source, bucketed filteringSource)
        //   → Project(semiJoinOutput := TRUE) → InnerJoin → Distinct(filteringSource)
        // and verifies that the semi-join output symbol is preserved and set to TRUE.
        tester().assertThat(new RewriteBucketedSemiJoinToInnerJoin(tester().getMetadata()))
                .setSystemProperty(REWRITE_BUCKETED_SEMI_JOIN_TO_INNER_JOIN, "true")
                .on(p -> {
                    VariableReferenceExpression sourceKey = p.variable("sourceKey", BIGINT);
                    VariableReferenceExpression filterKey = p.variable("filterKey", BIGINT);
                    VariableReferenceExpression output = p.variable("output", BOOLEAN);

                    // Build the original plan with a SemiJoin that produces `output`
                    return p.semiJoin(
                            sourceKey,
                            filterKey,
                            output,
                            "source",
                            "filter",
                            Optional.empty(),
                            Optional.empty(),
                            Optional.empty());
                })
                .matches(
                        // After rewrite we expect:
                        // Project(output := TRUE, ...) → InnerJoin → Aggregation(DISTINCT filteringSource)
                        project(
                                // The project must explicitly map the semi-join output symbol to TRUE
                                ImmutableMap.of("output", expression("true")),
                                join(
                                        INNER,
                                        ImmutableList.of(equiJoinClause("sourceKey", "filterKey")),
                                        Optional.empty(),
                                        anyTree(),
                                        aggregation(
                                                singleGroupingSet("filterKey"),
                                                ImmutableMap.of(),
                                                ImmutableMap.of(),
                                                Optional.empty(),
                                                AggregationNode.Step.SINGLE,
                                                anyTree()))));

```

I only saw the beginning of `testResultDemoShowsRewrite`, so the following may need adjustment to fit the existing code:

1. Ensure the `return p.semiJoin(...)` arguments (source/filter handles and optionals) match how `PlanBuilder.semiJoin` is used elsewhere in this test class or codebase.
2. The `matches(...)` tree is reconstructed from the surrounding snippet; if the existing expected pattern uses different helper methods (e.g., `aggregation(...)` wrapping `distinct()` or different symbol names), align the join/aggregation pattern accordingly.
3. Confirm you have static imports for the pattern helpers used above:
   * `project`, `join`, `aggregation`, `singleGroupingSet`, `anyTree`, `equiJoinClause`, and `expression` from `PlanMatchPattern`.
4. If the test previously wrapped the `semiJoin` in a `project` node in the input plan, keep that structure and only tighten the **expected** `project(...)` in `matches(...)` to assert that `"output"` is mapped to `expression("true")` and that `"output"` is used by the parent node as appropriate.
</issue_to_address>

@kaikalur kaikalur force-pushed the rewrite-bucketed-semi-join-to-inner-join branch 4 times, most recently from 3b07998 to 661b173 (April 5, 2026 02:16)
When both sides of a semi-join are backed by tables bucketed on the
semi-join key, rewrite the SemiJoinNode to a colocated INNER JOIN with
a DISTINCT on the build side. This avoids unnecessary data shuffles
since both sides are already co-partitioned by the join key.

The rewrite runs early (before other semi-join/join optimizers) so the
resulting JoinNode participates in downstream join ordering.

Transformation:
  SemiJoin(source, filteringSource, key, semiJoinOutput)
  → Project(semiJoinOutput := TRUE)
      → InnerJoin(source, Distinct(filteringSource), key)

Changes:
- Add session property optimizer.rewrite-bucketed-semi-join-to-inner-join
  (FeaturesConfig, SystemSessionProperties, TestFeaturesConfig)
- Add RewriteBucketedSemiJoinToInnerJoin optimizer rule
- Register rule in PlanOptimizers before LeftJoinNullFilterToSemiJoin
- Add 8 test cases with mock bucketed connector infrastructure
@kaikalur kaikalur force-pushed the rewrite-bucketed-semi-join-to-inner-join branch from 661b173 to 210667f (April 5, 2026 03:22)