fix(plugin-iceberg): Support hyphenated struct field names in nested ROW type columns #27470

mehradpk wants to merge 3 commits into prestodb:master
Conversation
Reviewer's Guide

Supports Iceberg nested ROW/STRUCT fields whose names contain special characters (e.g., hyphens) by aligning Parquet column lookup, Iceberg struct name delimiting, and primitive type map building with the existing Avro-compatible hex-encoding convention.

Sequence diagram for Parquet column lookup with hex-encoded nested field names

```mermaid
sequenceDiagram
    participant PrestoEngine
    participant ColumnIOConverter
    participant ParquetTypeUtils
    participant GroupColumnIO
    PrestoEngine->>ColumnIOConverter: constructField(type, columnIO)
    ColumnIOConverter->>GroupColumnIO: groupColumnIO = (GroupColumnIO) columnIO
    loop For each RowType field
        ColumnIOConverter->>ParquetTypeUtils: lookupColumnByName(groupColumnIO, fieldName)
        activate ParquetTypeUtils
        ParquetTypeUtils->>GroupColumnIO: getChild(fieldName)
        alt Exact match found
            GroupColumnIO-->>ParquetTypeUtils: ColumnIO
            ParquetTypeUtils-->>ColumnIOConverter: ColumnIO
        else Exact match not found
            ParquetTypeUtils->>ParquetTypeUtils: hexEncodeSpecialChars(fieldName)
            ParquetTypeUtils->>GroupColumnIO: getChild(hexEncodedName)
            alt Hex-encoded name found by key
                GroupColumnIO-->>ParquetTypeUtils: ColumnIO
                ParquetTypeUtils-->>ColumnIOConverter: ColumnIO
            else Hex-encoded name not found by key
                ParquetTypeUtils->>GroupColumnIO: iterate children, equalsIgnoreCase(hexEncodedName)
                alt Case-insensitive match found
                    GroupColumnIO-->>ParquetTypeUtils: ColumnIO
                    ParquetTypeUtils-->>ColumnIOConverter: ColumnIO
                else No match
                    ParquetTypeUtils-->>ColumnIOConverter: null
                end
            end
        end
        deactivate ParquetTypeUtils
    end
    ColumnIOConverter-->>PrestoEngine: Optional<Field> for each struct field
```
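The three-step fallback in the diagram can be sketched as a small self-contained demo. Here a `Map` of Parquet child name to column stands in for `GroupColumnIO`'s children, and the encoder mirrors the hex-encoding convention described in this PR; this is an illustrative sketch, not the actual Presto source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for the lookup in the sequence diagram: exact match,
// then hex-encoded match, then a case-insensitive scan over all children.
public class LookupDemo
{
    static String hexEncodeSpecialChars(String name)
    {
        StringBuilder result = new StringBuilder(name.length() * 3);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                result.append(c);
            }
            else {
                // Uppercase hex to match the Avro-compatible convention
                result.append('_').append('x').append(String.format("%02X", (int) c));
            }
        }
        return result.toString();
    }

    static String lookupColumnByName(Map<String, String> children, String columnName)
    {
        // 1. Exact match on the requested name
        if (children.containsKey(columnName)) {
            return children.get(columnName);
        }
        // 2. Exact match on the hex-encoded name
        String hexEncoded = hexEncodeSpecialChars(columnName);
        if (children.containsKey(hexEncoded)) {
            return children.get(hexEncoded);
        }
        // 3. Case-insensitive scan over all children
        for (Map.Entry<String, String> entry : children.entrySet()) {
            if (entry.getKey().equalsIgnoreCase(hexEncoded)) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args)
    {
        Map<String, String> children = new LinkedHashMap<>();
        children.put("plain", "col0");
        children.put("aws_x2Dregion", "col1");
        children.put("AWS_X2DZONE", "col2");

        System.out.println(lookupColumnByName(children, "plain"));      // exact match -> col0
        System.out.println(lookupColumnByName(children, "aws-region")); // hex-encoded match -> col1
        System.out.println(lookupColumnByName(children, "aws-zone"));   // case-insensitive match -> col2
        System.out.println(lookupColumnByName(children, "missing"));    // no match -> null
    }
}
```

The real method returns `ColumnIO` objects from parquet-mr rather than strings, but the ordering of the three probes is the point being illustrated.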
Class diagram for updated Iceberg and Parquet type handling

```mermaid
classDiagram
    class ParquetTypeUtils {
        +ColumnIO lookupColumnByName(GroupColumnIO groupColumnIO, String columnName)
        -String hexEncodeSpecialChars(String name)
    }
    class ColumnIOConverter {
        +Optional~Field~ constructField(Type type, ColumnIO columnIO)
    }
    class TypeConverter {
        +String ORC_ICEBERG_ID_KEY
        +String ORC_ICEBERG_REQUIRED_KEY
        -Pattern UNQUOTED_IDENTIFIER
        +Type toPrestoType(org.apache.iceberg.types.Type type, TypeManager typeManager)
        -boolean needsDelimiting(String name)
        +org.apache.iceberg.types.Type toIcebergType(Type type, String columnName, TypeManager typeManager)
    }
    class PrimitiveTypeMapBuilder {
        -Map~List~String~~, PrimitiveType~ primitiveTypes
        +void visitType(Type type, String name, List~String~ parent)
        -void visitRowType(RowType type, String name, List~String~ parent)
        -String makeCompatibleName(String name)
    }
    ParquetTypeUtils <.. ColumnIOConverter : uses
    ParquetTypeUtils <.. PrimitiveTypeMapBuilder : uses
    TypeConverter <.. PrimitiveTypeMapBuilder : compatible names
    class RowType {
        +List~Field~ getFields()
    }
    class RowType_Field {
        +Optional~String~ getName()
        +Type getType()
        +Field(Optional~String~ name, Type type, boolean delimited)
    }
    TypeConverter ..> RowType : creates
    RowType *-- RowType_Field
```
Flow diagram for hex encoding of struct field names

```mermaid
flowchart TD
    A[Start: struct field name] --> B[Iterate characters of name]
    B --> C{More characters?}
    C -->|No| D[Return result String]
    C -->|Yes| E[Read next character c]
    E --> F{Is letter, digit, or underscore?}
    F -->|Yes| G[Append c to result]
    G --> B
    F -->|No| H[Append '_x' + uppercase hex of c]
    H --> B
```
Hey - I've found 1 issue, and left some high level feedback:

- The new `hexEncodeSpecialChars` iterates over UTF-16 `char`s, which will mis-handle non-BMP characters; consider iterating over code points so surrogate pairs are encoded correctly and consistently with Parquet/Avro expectations.
- `lookupColumnByName` now performs multiple linear scans over `groupColumnIO` children (plain name, hex-encoded name, then case-insensitive), which could be refactored into a single pass or a small helper to avoid repeated iteration and reduce complexity.
- The hex-encoding logic in `ParquetTypeUtils.hexEncodeSpecialChars` appears conceptually similar to `PrimitiveTypeMapBuilder.makeCompatibleName`; consider consolidating these into a single shared encoder to avoid divergence in name-mangling behavior.
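The code-point concern can be illustrated with a hypothetical code-point-aware variant of the encoder. This is a sketch of the suggested direction, not code from the PR: iterating code points encodes a non-BMP character (such as an emoji) as one `_xHHHHH` token instead of two surrogate-half encodings.

```java
// Hypothetical code-point-based encoder sketching the reviewer's first
// suggestion. Character.isLetterOrDigit(int) accepts a code point, and
// appendCodePoint writes surrogate pairs back correctly.
public class CodePointEncodeDemo
{
    static String hexEncodeSpecialChars(String name)
    {
        StringBuilder result = new StringBuilder(name.length() * 3);
        name.codePoints().forEach(cp -> {
            if (Character.isLetterOrDigit(cp) || cp == '_') {
                result.appendCodePoint(cp);
            }
            else {
                // %02X keeps the two-digit minimum for ASCII while printing
                // full width for larger code points
                result.append('_').append('x').append(String.format("%02X", cp));
            }
        });
        return result.toString();
    }

    public static void main(String[] args)
    {
        System.out.println(hexEncodeSpecialChars("aws-region"));     // aws_x2Dregion
        System.out.println(hexEncodeSpecialChars("a\uD83D\uDE00b")); // a_x1F600b
    }
}
```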
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="presto-parquet/src/main/java/com/facebook/presto/parquet/ParquetTypeUtils.java" line_range="247-256" />
<code_context>
return null;
}
+ private static String hexEncodeSpecialChars(String name)
+ {
+ StringBuilder result = new StringBuilder();
+ for (int i = 0; i < name.length(); i++) {
+ char c = name.charAt(i);
+ if (Character.isLetterOrDigit(c) || c == '_') {
+ result.append(c);
+ }
+ else {
+ // Use uppercase hex digits to match Parquet's encoding
+ result.append('_').append('x').append(String.format("%02X", (int) c));
+ }
+ }
</code_context>
<issue_to_address>
**suggestion (performance):** Consider avoiding `String.format` in the hot loop for hex encoding.
`String.format` allocates per call and is relatively slow, which can be costly in this hot path. Consider replacing it with a small manual hex encoder (e.g., static `char[] HEX_DIGITS` and appending two chars) and pre-sizing the `StringBuilder` (e.g., `new StringBuilder(name.length() * 3)`) to reduce allocations while preserving behavior.
Suggested implementation:
```java
private static String hexEncodeSpecialChars(String name)
{
// Pre-size assuming worst case: every character becomes "_xHH"
StringBuilder result = new StringBuilder(name.length() * 3);
for (int i = 0; i < name.length(); i++) {
char c = name.charAt(i);
if (Character.isLetterOrDigit(c) || c == '_') {
result.append(c);
}
else {
// Use uppercase hex digits to match Parquet's encoding
int v = c; // do not mask with 0xFF: chars above U+00FF would otherwise collide
if (v < 0x100) {
    result.append('_')
            .append('x')
            .append(HEX_DIGITS[v >>> 4])
            .append(HEX_DIGITS[v & 0x0F]);
}
else {
    // rare wider chars: keep the original String.format("%02X", ...) width
    result.append('_')
            .append('x')
            .append(Integer.toHexString(v).toUpperCase(Locale.ENGLISH));
}
}
    }
    return result.toString();
}
```
To fully implement the optimization and avoid per-call array allocation, add a class-level constant in `ParquetTypeUtils` near the other `private static final` fields:
```java
private static final char[] HEX_DIGITS = "0123456789ABCDEF".toCharArray();
```
Ensure this field is in scope for `hexEncodeSpecialChars`. No other call sites need changes, as the method signature and behavior remain the same.
</issue_to_address>
aaneja left a comment
Please add tests for the newly enabled scenario
Force-pushed 7282162 to 7af9277
```java
boolean structHasParameters = false;
for (int i = 0; i < fields.size(); i++) {
    NamedTypeSignature namedTypeSignature = fields.get(i).getNamedTypeSignature();
    String name = namedTypeSignature.getName().get().toLowerCase(Locale.ENGLISH);
```
Isn't removing the toLowerCase(Locale.ENGLISH) breaking the established contract? I would expect some tests to break.
Removing toLowerCase(Locale.ENGLISH) does not break any tests because lookupColumnByName already has a two-step fallback mechanism: it first tries an exact match, and if that fails it falls back to equalsIgnoreCase, which handles case differences. So functionally it works either way.
That said, I think adding back toLowerCase maintains the established contract of matching Hive's convention of lowercasing column names, and together with makeCompatibleName it guarantees an exact match in lookupColumnByName rather than falling through to the equalsIgnoreCase fallback.
```diff
-String name = namedTypeSignature.getName().get().toLowerCase(Locale.ENGLISH);
-Optional<Field> field = constructField(parameters.get(i), lookupColumnByName(groupColumnIO, name));
+String name = namedTypeSignature.getName().get();
+Optional<Field> field = constructField(parameters.get(i), lookupColumnByName(groupColumnIO, makeCompatibleName(name)));
```
Can you add a test using a pre-canned parquet file with a nested type with inner fields like aws-region, aws_x2Dregion, and other such corner cases?
Yes, added the test class TestParquetTypeUtils.
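The aws-region / aws_x2Dregion corner case is worth spelling out: both names map to the same encoded Parquet name, since the literal name passes through the encoder unchanged. A minimal check, using an encoder that follows the convention described in this PR (illustrative, not the Presto source):

```java
// Shows why the "aws-region" vs. "aws_x2Dregion" corner case matters:
// both field names encode to the same Parquet name, so a file containing
// both would be ambiguous to a hex-encoding-based lookup.
public class CollisionDemo
{
    static String hexEncodeSpecialChars(String name)
    {
        StringBuilder result = new StringBuilder(name.length() * 3);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                result.append(c); // '_' passes through, so "_x2D" survives as-is
            }
            else {
                result.append('_').append('x').append(String.format("%02X", (int) c));
            }
        }
        return result.toString();
    }

    public static void main(String[] args)
    {
        String encodedHyphen = hexEncodeSpecialChars("aws-region");
        String encodedLiteral = hexEncodeSpecialChars("aws_x2Dregion");
        System.out.println(encodedHyphen);                        // aws_x2Dregion
        System.out.println(encodedHyphen.equals(encodedLiteral)); // true
    }
}
```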
Force-pushed c4677d3 to 973e0a1
aaneja left a comment
Can you also add an end-to-end test that writes to an Iceberg table with the struct and reads it back?
```java
public void testReadPreCannedParquetWithHyphenatedFields()
        throws IOException
{
    String parquetFilePath = "src/test/resources/hyphenated-fields/hyphenated_struct_fields.parquet";
```
Better to use the class loader mechanism to read the resource path, see this for an example
```java
(org.apache.parquet.io.GroupColumnIO) messageColumnIO.getChild(1);

// WITH makeCompatibleName() - fields should be found
assertNotNull(lookupColumnByName(applicationColumnIO, makeCompatibleName("aws-region")));
```
Instead of just checking for existence, let's verify that the name of the field picked up matches what we expect:
```diff
-assertNotNull(lookupColumnByName(applicationColumnIO, makeCompatibleName("aws-region")));
+assertEquals(requireNonNull(lookupColumnByName(applicationColumnIO, makeCompatibleName("aws-region"))).getName(), "aws_x2Dregion");
```
Yes, that's a good point. Modified the test case.
…e columns fix(iceberg): support hyphenated struct field names in nested ROW type columns
Force-pushed 973e0a1 to 31e073e
Force-pushed 31e073e to 5d88eee
Added end-to-end test.
Description
Enable Iceberg tables to support quoted STRUCT field names containing hyphens. Currently, creating tables with ROW("aws-region" VARCHAR) succeeds, but INSERT and SELECT operations fail because nested field names are not properly hex-encoded for Parquet storage.
This change ensures STRUCT field names follow the same Avro-compatible hex-encoding path that top-level VARCHAR columns already use.
Test Plan
Tested locally.
Contributor checklist
Release Notes
Summary by Sourcery
Support nested ROW/STRUCT fields with special characters in names for Iceberg tables backed by Parquet.