
fix(plugin-iceberg): Support hyphenated struct field names in nested ROW type columns#27470

Open
mehradpk wants to merge 3 commits into prestodb:master from mehradpk:struct-hyphen-support

Conversation

@mehradpk
Contributor

@mehradpk mehradpk commented Mar 31, 2026

Description

Enable Iceberg tables to support quoted STRUCT field names containing hyphens. Currently, creating tables with ROW("aws-region" VARCHAR) succeeds, but INSERT and SELECT operations fail because nested field names are not properly hex-encoded for Parquet storage.

This change ensures STRUCT field names follow the same Avro-compatible hex-encoding path that top-level VARCHAR columns already use.
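For illustration, the Avro-compatible hex-encoding convention referenced above can be sketched as follows; the class name here is hypothetical and the method is a simplified model, not the actual Presto internals:

```java
// Hypothetical sketch of the Avro-compatible name mangling described above.
// Any character outside letters, digits, and underscore becomes "_x" plus its
// uppercase hex code, so "aws-region" ('-' is 0x2D) maps to "aws_x2Dregion".
public class AvroNameSketch
{
    public static String hexEncodeSpecialChars(String name)
    {
        // Worst case every character expands to "_xHH"
        StringBuilder result = new StringBuilder(name.length() * 3);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                result.append(c);
            }
            else {
                result.append("_x").append(String.format("%02X", (int) c));
            }
        }
        return result.toString();
    }
}
```

Under this convention a quoted field such as `"aws-region"` is stored in Parquet as `aws_x2Dregion`, which is why a lookup using the raw name fails without the encoding step.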

Test Plan

Tested locally.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== RELEASE NOTES ==

General Changes
* Add support for hyphenated struct field names in nested ROW type columns

Summary by Sourcery

Support nested ROW/STRUCT fields with special characters in names for Iceberg tables backed by Parquet.

Bug Fixes:

  • Fix failures when reading or writing Iceberg tables that use hyphenated or otherwise non-standard struct field names in nested ROW columns.

Enhancements:

  • Hex-encode special characters when resolving Parquet column names, aligning nested struct field handling with existing Avro-compatible encoding.
  • Preserve original case of struct field names when mapping Presto row fields to Parquet columns to avoid mismatches.
  • Normalize nested struct field names to Parquet-compatible identifiers when building primitive type maps for Iceberg.

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Mar 31, 2026
@sourcery-ai
Contributor

sourcery-ai bot commented Mar 31, 2026

Reviewer's Guide

Supports Iceberg nested ROW/STRUCT fields whose names contain special characters (e.g., hyphens) by aligning Parquet column lookup, Iceberg struct name delimiting, and primitive type map building with the existing Avro-compatible hex-encoding convention.

Sequence diagram for Parquet column lookup with hex-encoded nested field names

sequenceDiagram
    participant PrestoEngine
    participant ColumnIOConverter
    participant ParquetTypeUtils
    participant GroupColumnIO

    PrestoEngine->>ColumnIOConverter: constructField(type, columnIO)
    ColumnIOConverter->>GroupColumnIO: groupColumnIO = (GroupColumnIO) columnIO
    loop For each RowType field
        ColumnIOConverter->>ParquetTypeUtils: lookupColumnByName(groupColumnIO, fieldName)
        activate ParquetTypeUtils
        ParquetTypeUtils->>GroupColumnIO: getChild(fieldName)
        alt Exact match found
            GroupColumnIO-->>ParquetTypeUtils: ColumnIO
            ParquetTypeUtils-->>ColumnIOConverter: ColumnIO
        else Exact match not found
            ParquetTypeUtils->>ParquetTypeUtils: hexEncodeSpecialChars(fieldName)
            ParquetTypeUtils->>GroupColumnIO: getChild(hexEncodedName)
            alt Hex-encoded name found by key
                GroupColumnIO-->>ParquetTypeUtils: ColumnIO
                ParquetTypeUtils-->>ColumnIOConverter: ColumnIO
            else Hex-encoded name not found by key
                ParquetTypeUtils->>GroupColumnIO: iterate children, equalsIgnoreCase(hexEncodedName)
                alt Case-insensitive match found
                    GroupColumnIO-->>ParquetTypeUtils: ColumnIO
                    ParquetTypeUtils-->>ColumnIOConverter: ColumnIO
                else No match
                    ParquetTypeUtils-->>ColumnIOConverter: null
                end
            end
        end
        deactivate ParquetTypeUtils
    end
    ColumnIOConverter-->>PrestoEngine: Optional<Field> for each struct field

Class diagram for updated Iceberg and Parquet type handling

classDiagram
    class ParquetTypeUtils {
        +ColumnIO lookupColumnByName(GroupColumnIO groupColumnIO, String columnName)
        -String hexEncodeSpecialChars(String name)
    }

    class ColumnIOConverter {
        +Optional~Field~ constructField(Type type, ColumnIO columnIO)
    }

    class TypeConverter {
        +String ORC_ICEBERG_ID_KEY
        +String ORC_ICEBERG_REQUIRED_KEY
        -Pattern UNQUOTED_IDENTIFIER
        +Type toPrestoType(org.apache.iceberg.types.Type type, TypeManager typeManager)
        -boolean needsDelimiting(String name)
        +org.apache.iceberg.types.Type toIcebergType(Type type, String columnName, TypeManager typeManager)
    }

    class PrimitiveTypeMapBuilder {
        -Map~List~String~~, PrimitiveType~ primitiveTypes
        +void visitType(Type type, String name, List~String~ parent)
        -void visitRowType(RowType type, String name, List~String~ parent)
        -String makeCompatibleName(String name)
    }

    ParquetTypeUtils <.. ColumnIOConverter : uses
    ParquetTypeUtils <.. PrimitiveTypeMapBuilder : uses
    TypeConverter <.. PrimitiveTypeMapBuilder : compatible names

    class RowType {
        +List~Field~ getFields()
    }

    class RowType_Field {
        +Optional~String~ getName()
        +Type getType()
        +Field(Optional~String~ name, Type type, boolean delimited)
    }

    TypeConverter ..> RowType : creates
    RowType *-- RowType_Field

Flow diagram for hex encoding of struct field names

flowchart TD
    A[Start: struct field name] --> B[Iterate characters of name]
    B --> C{More characters?}
    C -->|No| D[Return result String]
    C -->|Yes| E[Read next character c]
    E --> F{Is letter, digit, or underscore?}
    F -->|Yes| G[Append c to result]
    G --> B
    F -->|No| H[Append '_x' + uppercase hex of c]
    H --> B

File-Level Changes

Parquet column lookup now falls back to a hex-encoded variant of the requested column name when a direct match is not found, matching Parquet’s special-character encoding behavior.
  • Extend lookupColumnByName to compute a hex-encoded version of the requested column name and attempt lookup with it if the plain name fails
  • Add hexEncodeSpecialChars helper that preserves alphanumerics and underscore while encoding other characters as _xHH with uppercase hex digits
  • Include a case-insensitive search over children using the hex-encoded name when the direct child lookup misses
presto-parquet/src/main/java/com/facebook/presto/parquet/ParquetTypeUtils.java
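A minimal model of that three-step lookup (exact name, hex-encoded name, then case-insensitive scan), using a plain Map in place of the real GroupColumnIO API; names and structure here are illustrative only:

```java
import java.util.Map;

// Simplified model of the fallback chain described above: exact key first,
// then the hex-encoded key, then a case-insensitive scan over children.
// Not the real Parquet ColumnIO API.
public class LookupSketch
{
    public static String lookup(Map<String, String> children, String name)
    {
        if (children.containsKey(name)) {
            return children.get(name);
        }
        String encoded = hexEncode(name);
        if (children.containsKey(encoded)) {
            return children.get(encoded);
        }
        for (Map.Entry<String, String> child : children.entrySet()) {
            if (child.getKey().equalsIgnoreCase(encoded)) {
                return child.getValue();
            }
        }
        return null;
    }

    private static String hexEncode(String name)
    {
        StringBuilder result = new StringBuilder(name.length() * 3);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                result.append(c);
            }
            else {
                result.append("_x").append(String.format("%02X", (int) c));
            }
        }
        return result.toString();
    }
}
```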
Iceberg STRUCT fields are now marked as delimited identifiers when their names contain characters that require quoting, allowing safe use of hyphens and other special characters.
  • Introduce UNQUOTED_IDENTIFIER pattern for standard unquoted SQL identifiers
  • Update Iceberg STRUCT-to-RowType conversion to pass a needsDelimiting flag based on whether the field name matches the unquoted-identifier pattern
  • Add needsDelimiting helper encapsulating the identifier check
presto-iceberg/src/main/java/com/facebook/presto/iceberg/TypeConverter.java
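The identifier check can be sketched like this; the exact pattern used in the PR may differ, and this version simply models standard unquoted SQL identifiers:

```java
import java.util.regex.Pattern;

// Sketch of the needsDelimiting check: a field name needs quoting when it is
// not a plain unquoted SQL identifier (a letter or underscore followed by
// letters, digits, or underscores). The pattern here is an assumption.
public class DelimitingSketch
{
    private static final Pattern UNQUOTED_IDENTIFIER = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    public static boolean needsDelimiting(String name)
    {
        return !UNQUOTED_IDENTIFIER.matcher(name).matches();
    }
}
```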
PrimitiveTypeMapBuilder now uses Avro-compatible encoded names for nested ROW field components so Parquet column paths align with how names are persisted.
  • Change visitRowType to call makeCompatibleName on each RowType.Field name before descending into child types, instead of using the raw field name
presto-iceberg/src/main/java/com/facebook/presto/iceberg/util/PrimitiveTypeMapBuilder.java
ColumnIO construction for ROW types now uses the original case-preserving field names when resolving Parquet columns, instead of lowercasing them.
  • Remove lowercasing of NamedTypeSignature field names when looking up corresponding Parquet ColumnIO entries in constructField for struct types
presto-parquet/src/main/java/org/apache/parquet/io/ColumnIOConverter.java


@aaneja aaneja marked this pull request as ready for review April 7, 2026 06:47
@aaneja aaneja requested review from a team, ZacBlanco, hantangwangd and shangxinli as code owners April 7, 2026 06:47
@prestodb-ci prestodb-ci requested review from a team, Mariamalmesfer and NivinCS and removed request for a team April 7, 2026 06:47
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high-level feedback:

  • The new hexEncodeSpecialChars iterates over UTF-16 chars, which will mis-handle non-BMP characters; consider iterating over code points so surrogate pairs are encoded correctly and consistently with Parquet/Avro expectations.
  • lookupColumnByName now performs multiple linear scans over groupColumnIO children (plain name, hex-encoded name, then case-insensitive), which could be refactored into a single pass or a small helper to avoid repeated iteration and reduce complexity.
  • The hex-encoding logic in ParquetTypeUtils.hexEncodeSpecialChars appears conceptually similar to PrimitiveTypeMapBuilder.makeCompatibleName; consider consolidating these into a single shared encoder to avoid divergence in name-mangling behavior.
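The first comment above can be illustrated with a hypothetical code-point-based encoder; this is a sketch of the suggested direction, not code from the PR:

```java
// Iterates Unicode code points instead of UTF-16 chars, so a character
// outside the BMP is encoded as one "_x..." unit rather than as two broken
// surrogate halves. Sketch only; hex width follows String.format("%02X").
public class CodePointEncodeSketch
{
    public static String encode(String name)
    {
        StringBuilder result = new StringBuilder(name.length() * 3);
        int i = 0;
        while (i < name.length()) {
            int codePoint = name.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint) || codePoint == '_') {
                result.appendCodePoint(codePoint);
            }
            else {
                result.append("_x").append(String.format("%02X", codePoint));
            }
            // Advance by 2 for surrogate pairs, 1 otherwise
            i += Character.charCount(codePoint);
        }
        return result.toString();
    }
}
```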
Individual Comments

Comment 1
<location path="presto-parquet/src/main/java/com/facebook/presto/parquet/ParquetTypeUtils.java" line_range="247-256" />
<code_context>
         return null;
     }

+    private static String hexEncodeSpecialChars(String name)
+    {
+        StringBuilder result = new StringBuilder();
+        for (int i = 0; i < name.length(); i++) {
+            char c = name.charAt(i);
+            if (Character.isLetterOrDigit(c) || c == '_') {
+                result.append(c);
+            }
+            else {
+                // Use uppercase hex digits to match Parquet's encoding
+                result.append('_').append('x').append(String.format("%02X", (int) c));
+            }
+        }
</code_context>
<issue_to_address>
**suggestion (performance):** Consider avoiding `String.format` in the hot loop for hex encoding.

`String.format` allocates per call and is relatively slow, which can be costly in this hot path. Consider replacing it with a small manual hex encoder (e.g., static `char[] HEX_DIGITS` and appending two chars) and pre-sizing the `StringBuilder` (e.g., `new StringBuilder(name.length() * 3)`) to reduce allocations while preserving behavior.

Suggested implementation:

```java
    private static String hexEncodeSpecialChars(String name)
    {
        // Pre-size assuming worst case: every character becomes "_xHH"
        StringBuilder result = new StringBuilder(name.length() * 3);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                result.append(c);
            }
            else {
                // Use uppercase hex digits to match Parquet's encoding
                if (c <= 0xFF) {
                    // Hot path: two-digit hex via lookup table, no allocation
                    result.append('_')
                            .append('x')
                            .append(HEX_DIGITS[c >>> 4])
                            .append(HEX_DIGITS[c & 0x0F]);
                }
                else {
                    // Rare path: chars above 0xFF keep String.format's wider hex output
                    result.append("_x").append(String.format("%X", (int) c));
                }
            }
        }
        return result.toString();
    }

```

To fully implement the optimization and avoid per-call array allocation, add a class-level constant in `ParquetTypeUtils` near the other `private static final` fields:

```java
private static final char[] HEX_DIGITS = "0123456789ABCDEF".toCharArray();
```

Ensure this field is in scope for `hexEncodeSpecialChars`. No other call sites need changes, as the method signature and behavior remain the same.
</issue_to_address>


Contributor

@aaneja aaneja left a comment


Please add tests for the newly enabled scenario

@mehradpk mehradpk force-pushed the struct-hyphen-support branch from 7282162 to 7af9277 Compare April 13, 2026 10:56
@mehradpk mehradpk changed the title fix(iceberg): support hyphenated struct field names in nested ROW type columns fix(plugin-iceberg): support hyphenated struct field names in nested ROW type columns Apr 13, 2026
@mehradpk mehradpk requested a review from aaneja April 13, 2026 11:00
boolean structHasParameters = false;
for (int i = 0; i < fields.size(); i++) {
NamedTypeSignature namedTypeSignature = fields.get(i).getNamedTypeSignature();
String name = namedTypeSignature.getName().get().toLowerCase(Locale.ENGLISH);
Contributor


Isn't removing the toLowerCase(Locale.ENGLISH) breaking the established contract ? I would expect some tests would break ?

Contributor Author


Removing toLowerCase(Locale.ENGLISH) does not break any tests because lookupColumnByName already has a two-step fallback mechanism, it first tries exact match, and if that fails it falls back to equalsIgnoreCase which handles case differences. So functionally it works either way.

But, I think adding back toLowerCase maintains the established contract matching Hive's convention of lowercasing column names, and together with makeCompatibleName guarantees an exact match in lookupColumnByName rather than falling through to the equalsIgnoreCase fallback.

- String name = namedTypeSignature.getName().get().toLowerCase(Locale.ENGLISH);
- Optional<Field> field = constructField(parameters.get(i), lookupColumnByName(groupColumnIO, name));
+ String name = namedTypeSignature.getName().get();
+ Optional<Field> field = constructField(parameters.get(i), lookupColumnByName(groupColumnIO, makeCompatibleName(name)));
Contributor


Can you add a test using a pre-canned Parquet file with a nested type whose inner fields include aws-region, aws_x2Dregion, and other such corner cases?

Contributor Author


Yes, added the test class TestParquetTypeUtils.

@steveburnett
Contributor

  • Please add a release note - or NO RELEASE NOTE - following the Release Notes Guidelines to pass the failing but not required CI check.

  • Please edit the PR title to follow semantic commit style to pass the failing and required CI check. See the failure in the test for advice. If you can't edit the PR title, let us know and we can help.

@mehradpk mehradpk force-pushed the struct-hyphen-support branch 2 times, most recently from c4677d3 to 973e0a1 Compare April 14, 2026 07:55
@mehradpk mehradpk changed the title fix(plugin-iceberg): support hyphenated struct field names in nested ROW type columns Fix(plugin-iceberg): Support hyphenated struct field names in nested ROW type columns Apr 14, 2026
@mehradpk mehradpk changed the title Fix(plugin-iceberg): Support hyphenated struct field names in nested ROW type columns fix(plugin-iceberg): Support hyphenated struct field names in nested ROW type columns Apr 14, 2026
@mehradpk mehradpk requested a review from aaneja April 14, 2026 08:12
Contributor

@aaneja aaneja left a comment


Can you also add an end-to-end test that writes to an Iceberg table with the struct and reads it back?

public void testReadPreCannedParquetWithHyphenatedFields()
throws IOException
{
String parquetFilePath = "src/test/resources/hyphenated-fields/hyphenated_struct_fields.parquet";
Contributor


Better to use the class loader mechanism to read the resource path, see this for an example

Contributor Author


updated.

(org.apache.parquet.io.GroupColumnIO) messageColumnIO.getChild(1);

// WITH makeCompatibleName() - fields should be found
assertNotNull(lookupColumnByName(applicationColumnIO, makeCompatibleName("aws-region")));
Contributor


Instead of just checking for existence, let's verify that the name of the field picked up matches what we expect -

Suggested change
assertNotNull(lookupColumnByName(applicationColumnIO, makeCompatibleName("aws-region")));
assertEquals(requireNonNull(lookupColumnByName(applicationColumnIO, makeCompatibleName("aws-region"))).getName(), "aws_x2Dregion");

Contributor Author


Yes, that's a good point. Modified the test case.

@mehradpk mehradpk force-pushed the struct-hyphen-support branch from 973e0a1 to 31e073e Compare April 16, 2026 23:47
@mehradpk mehradpk force-pushed the struct-hyphen-support branch from 31e073e to 5d88eee Compare April 16, 2026 23:56
@mehradpk
Contributor Author

Can you also add an end-to-end test that writes to an Iceberg table with the struct and reads it back?

Added an end-to-end test in TestIcebergTypes that writes struct data to an Iceberg table and reads it back.

@mehradpk mehradpk requested a review from aaneja April 16, 2026 23:58
