fix(plugin-iceberg): Support hyphenated struct field names in nested ROW type columns #27470

mehradpk wants to merge 3 commits into prestodb:master
Conversation
Reviewer's Guide

Supports Iceberg nested ROW/STRUCT fields whose names contain special characters (e.g., hyphens) by aligning Parquet column lookup, Iceberg struct name delimiting, and primitive type map building with the existing Avro-compatible hex-encoding convention.

Sequence diagram for Parquet column lookup with hex-encoded nested field names

```mermaid
sequenceDiagram
    participant PrestoEngine
    participant ColumnIOConverter
    participant ParquetTypeUtils
    participant GroupColumnIO
    PrestoEngine->>ColumnIOConverter: constructField(type, columnIO)
    ColumnIOConverter->>GroupColumnIO: groupColumnIO = (GroupColumnIO) columnIO
    loop For each RowType field
        ColumnIOConverter->>ParquetTypeUtils: lookupColumnByName(groupColumnIO, fieldName)
        activate ParquetTypeUtils
        ParquetTypeUtils->>GroupColumnIO: getChild(fieldName)
        alt Exact match found
            GroupColumnIO-->>ParquetTypeUtils: ColumnIO
            ParquetTypeUtils-->>ColumnIOConverter: ColumnIO
        else Exact match not found
            ParquetTypeUtils->>ParquetTypeUtils: hexEncodeSpecialChars(fieldName)
            ParquetTypeUtils->>GroupColumnIO: getChild(hexEncodedName)
            alt Hex-encoded name found by key
                GroupColumnIO-->>ParquetTypeUtils: ColumnIO
                ParquetTypeUtils-->>ColumnIOConverter: ColumnIO
            else Hex-encoded name not found by key
                ParquetTypeUtils->>GroupColumnIO: iterate children, equalsIgnoreCase(hexEncodedName)
                alt Case-insensitive match found
                    GroupColumnIO-->>ParquetTypeUtils: ColumnIO
                    ParquetTypeUtils-->>ColumnIOConverter: ColumnIO
                else No match
                    ParquetTypeUtils-->>ColumnIOConverter: null
                end
            end
        end
        deactivate ParquetTypeUtils
    end
    ColumnIOConverter-->>PrestoEngine: Optional<Field> for each struct field
```
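The three-step fallback in the diagram can be sketched as a small self-contained demo. Here a `Map` of Parquet child name to column stands in for `GroupColumnIO`'s children, and the encoder mirrors the hex-encoding convention described in this PR; this is an illustrative sketch, not the actual Presto source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for the lookup in the sequence diagram: exact match,
// then hex-encoded match, then a case-insensitive scan over all children.
public class LookupDemo
{
    static String hexEncodeSpecialChars(String name)
    {
        StringBuilder result = new StringBuilder(name.length() * 3);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                result.append(c);
            }
            else {
                // Uppercase hex to match the Avro-compatible convention
                result.append('_').append('x').append(String.format("%02X", (int) c));
            }
        }
        return result.toString();
    }

    static String lookupColumnByName(Map<String, String> children, String columnName)
    {
        // 1. Exact match on the requested name
        if (children.containsKey(columnName)) {
            return children.get(columnName);
        }
        // 2. Exact match on the hex-encoded name
        String hexEncoded = hexEncodeSpecialChars(columnName);
        if (children.containsKey(hexEncoded)) {
            return children.get(hexEncoded);
        }
        // 3. Case-insensitive scan over all children
        for (Map.Entry<String, String> entry : children.entrySet()) {
            if (entry.getKey().equalsIgnoreCase(hexEncoded)) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args)
    {
        Map<String, String> children = new LinkedHashMap<>();
        children.put("plain", "col0");
        children.put("aws_x2Dregion", "col1");
        children.put("AWS_X2DZONE", "col2");

        System.out.println(lookupColumnByName(children, "plain"));      // exact match -> col0
        System.out.println(lookupColumnByName(children, "aws-region")); // hex-encoded match -> col1
        System.out.println(lookupColumnByName(children, "aws-zone"));   // case-insensitive match -> col2
        System.out.println(lookupColumnByName(children, "missing"));    // no match -> null
    }
}
```

The real method returns `ColumnIO` objects from parquet-mr rather than strings, but the ordering of the three probes is the point being illustrated.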
Class diagram for updated Iceberg and Parquet type handling

```mermaid
classDiagram
    class ParquetTypeUtils {
        +ColumnIO lookupColumnByName(GroupColumnIO groupColumnIO, String columnName)
        -String hexEncodeSpecialChars(String name)
    }
    class ColumnIOConverter {
        +Optional~Field~ constructField(Type type, ColumnIO columnIO)
    }
    class TypeConverter {
        +String ORC_ICEBERG_ID_KEY
        +String ORC_ICEBERG_REQUIRED_KEY
        -Pattern UNQUOTED_IDENTIFIER
        +Type toPrestoType(org.apache.iceberg.types.Type type, TypeManager typeManager)
        -boolean needsDelimiting(String name)
        +org.apache.iceberg.types.Type toIcebergType(Type type, String columnName, TypeManager typeManager)
    }
    class PrimitiveTypeMapBuilder {
        -Map~List~String~~, PrimitiveType~ primitiveTypes
        +void visitType(Type type, String name, List~String~ parent)
        -void visitRowType(RowType type, String name, List~String~ parent)
        -String makeCompatibleName(String name)
    }
    ParquetTypeUtils <.. ColumnIOConverter : uses
    ParquetTypeUtils <.. PrimitiveTypeMapBuilder : uses
    TypeConverter <.. PrimitiveTypeMapBuilder : compatible names
    class RowType {
        +List~Field~ getFields()
    }
    class RowType_Field {
        +Optional~String~ getName()
        +Type getType()
        +Field(Optional~String~ name, Type type, boolean delimited)
    }
    TypeConverter ..> RowType : creates
    RowType *-- RowType_Field
```
Flow diagram for hex encoding of struct field names

```mermaid
flowchart TD
    A[Start: struct field name] --> B[Iterate characters of name]
    B --> C{More characters?}
    C -->|No| D[Return result String]
    C -->|Yes| E[Read next character c]
    E --> F{Is letter, digit, or underscore?}
    F -->|Yes| G[Append c to result]
    G --> B
    F -->|No| H[Append '_x' + uppercase hex of c]
    H --> B
```
Hey - I've found 1 issue, and left some high level feedback:

- The new `hexEncodeSpecialChars` iterates over UTF-16 `char`s, which will mis-handle non-BMP characters; consider iterating over code points so surrogate pairs are encoded correctly and consistently with Parquet/Avro expectations.
- `lookupColumnByName` now performs multiple linear scans over `groupColumnIO` children (plain name, hex-encoded name, then case-insensitive), which could be refactored into a single pass or a small helper to avoid repeated iteration and reduce complexity.
- The hex-encoding logic in `ParquetTypeUtils.hexEncodeSpecialChars` appears conceptually similar to `PrimitiveTypeMapBuilder.makeCompatibleName`; consider consolidating these into a single shared encoder to avoid divergence in name-mangling behavior.
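The code-point concern can be illustrated with a hypothetical code-point-aware variant of the encoder. This is a sketch of the suggested direction, not code from the PR: iterating code points encodes a non-BMP character (such as an emoji) as one `_xHHHHH` token instead of two surrogate-half encodings.

```java
// Hypothetical code-point-based encoder sketching the reviewer's first
// suggestion. Character.isLetterOrDigit(int) accepts a code point, and
// appendCodePoint writes surrogate pairs back correctly.
public class CodePointEncodeDemo
{
    static String hexEncodeSpecialChars(String name)
    {
        StringBuilder result = new StringBuilder(name.length() * 3);
        name.codePoints().forEach(cp -> {
            if (Character.isLetterOrDigit(cp) || cp == '_') {
                result.appendCodePoint(cp);
            }
            else {
                // %02X keeps the two-digit minimum for ASCII while printing
                // full width for larger code points
                result.append('_').append('x').append(String.format("%02X", cp));
            }
        });
        return result.toString();
    }

    public static void main(String[] args)
    {
        System.out.println(hexEncodeSpecialChars("aws-region"));     // aws_x2Dregion
        System.out.println(hexEncodeSpecialChars("a\uD83D\uDE00b")); // a_x1F600b
    }
}
```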
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="presto-parquet/src/main/java/com/facebook/presto/parquet/ParquetTypeUtils.java" line_range="247-256" />
<code_context>
return null;
}
+ private static String hexEncodeSpecialChars(String name)
+ {
+ StringBuilder result = new StringBuilder();
+ for (int i = 0; i < name.length(); i++) {
+ char c = name.charAt(i);
+ if (Character.isLetterOrDigit(c) || c == '_') {
+ result.append(c);
+ }
+ else {
+ // Use uppercase hex digits to match Parquet's encoding
+ result.append('_').append('x').append(String.format("%02X", (int) c));
+ }
+ }
</code_context>
<issue_to_address>
**suggestion (performance):** Consider avoiding `String.format` in the hot loop for hex encoding.
`String.format` allocates per call and is relatively slow, which can be costly in this hot path. Consider replacing it with a small manual hex encoder (e.g., static `char[] HEX_DIGITS` and appending two chars) and pre-sizing the `StringBuilder` (e.g., `new StringBuilder(name.length() * 3)`) to reduce allocations while preserving behavior.
Suggested implementation:
```java
private static String hexEncodeSpecialChars(String name)
{
// Pre-size assuming worst case: every character becomes "_xHH"
StringBuilder result = new StringBuilder(name.length() * 3);
for (int i = 0; i < name.length(); i++) {
char c = name.charAt(i);
if (Character.isLetterOrDigit(c) || c == '_') {
result.append(c);
}
else {
// Use uppercase hex digits to match Parquet's encoding
int v = c; // do not mask with 0xFF: chars above U+00FF would otherwise collide
if (v < 0x100) {
    result.append('_')
            .append('x')
            .append(HEX_DIGITS[v >>> 4])
            .append(HEX_DIGITS[v & 0x0F]);
}
else {
    // rare wider chars: keep the original String.format("%02X", ...) width
    result.append('_')
            .append('x')
            .append(Integer.toHexString(v).toUpperCase(Locale.ENGLISH));
}
}
    }
    return result.toString();
}
```
To fully implement the optimization and avoid per-call array allocation, add a class-level constant in `ParquetTypeUtils` near the other `private static final` fields:
```java
private static final char[] HEX_DIGITS = "0123456789ABCDEF".toCharArray();
```
Ensure this field is in scope for `hexEncodeSpecialChars`. No other call sites need changes, as the method signature and behavior remain the same.
</issue_to_address>
aaneja left a comment
Please add tests for the newly enabled scenario
Force-pushed 7282162 to 7af9277
```java
boolean structHasParameters = false;
for (int i = 0; i < fields.size(); i++) {
    NamedTypeSignature namedTypeSignature = fields.get(i).getNamedTypeSignature();
    String name = namedTypeSignature.getName().get().toLowerCase(Locale.ENGLISH);
```
Isn't removing the toLowerCase(Locale.ENGLISH) breaking the established contract? I would expect some tests to break.
Removing toLowerCase(Locale.ENGLISH) does not break any tests because lookupColumnByName already has a two-step fallback mechanism: it first tries an exact match, and if that fails it falls back to equalsIgnoreCase, which handles case differences. So functionally it works either way.
That said, I think adding back toLowerCase maintains the established contract of matching Hive's convention of lowercasing column names, and together with makeCompatibleName it guarantees an exact match in lookupColumnByName rather than falling through to the equalsIgnoreCase fallback.
```diff
-String name = namedTypeSignature.getName().get().toLowerCase(Locale.ENGLISH);
-Optional<Field> field = constructField(parameters.get(i), lookupColumnByName(groupColumnIO, name));
+String name = namedTypeSignature.getName().get();
+Optional<Field> field = constructField(parameters.get(i), lookupColumnByName(groupColumnIO, makeCompatibleName(name)));
```
Can you add a test using a pre-canned parquet file with a nested type with inner fields like aws-region, aws_x2Dregion, and other such corner cases?
Yes, added the test class TestParquetTypeUtils.
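The aws-region / aws_x2Dregion corner case is worth spelling out: both names map to the same encoded Parquet name, since the literal name passes through the encoder unchanged. A minimal check, using an encoder that follows the convention described in this PR (illustrative, not the Presto source):

```java
// Shows why the "aws-region" vs. "aws_x2Dregion" corner case matters:
// both field names encode to the same Parquet name, so a file containing
// both would be ambiguous to a hex-encoding-based lookup.
public class CollisionDemo
{
    static String hexEncodeSpecialChars(String name)
    {
        StringBuilder result = new StringBuilder(name.length() * 3);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                result.append(c); // '_' passes through, so "_x2D" survives as-is
            }
            else {
                result.append('_').append('x').append(String.format("%02X", (int) c));
            }
        }
        return result.toString();
    }

    public static void main(String[] args)
    {
        String encodedHyphen = hexEncodeSpecialChars("aws-region");
        String encodedLiteral = hexEncodeSpecialChars("aws_x2Dregion");
        System.out.println(encodedHyphen);                        // aws_x2Dregion
        System.out.println(encodedHyphen.equals(encodedLiteral)); // true
    }
}
```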
Force-pushed c4677d3 to 973e0a1
aaneja left a comment
Can you also add an end-to-end test that writes to an Iceberg table with the struct and reads it back?
```java
public void testReadPreCannedParquetWithHyphenatedFields()
        throws IOException
{
    String parquetFilePath = "src/test/resources/hyphenated-fields/hyphenated_struct_fields.parquet";
```
Better to use the class loader mechanism to read the resource path, see this for an example
```java
(org.apache.parquet.io.GroupColumnIO) messageColumnIO.getChild(1);

// WITH makeCompatibleName() - fields should be found
assertNotNull(lookupColumnByName(applicationColumnIO, makeCompatibleName("aws-region")));
```
Instead of just checking for existence, let's verify that the name of the field picked up matches what we expect:
```diff
-assertNotNull(lookupColumnByName(applicationColumnIO, makeCompatibleName("aws-region")));
+assertEquals(requireNonNull(lookupColumnByName(applicationColumnIO, makeCompatibleName("aws-region"))).getName(), "aws_x2Dregion");
```
Yes, that's a good point. Modified the test case.
…e columns fix(iceberg): support hyphenated struct field names in nested ROW type columns
Force-pushed 973e0a1 to 31e073e
Force-pushed 31e073e to 5d88eee
Added end-to-end test.
Description
Enable Iceberg tables to support quoted STRUCT field names containing hyphens. Currently, creating tables with ROW("aws-region" VARCHAR) succeeds, but INSERT and SELECT operations fail because nested field names are not properly hex-encoded for Parquet storage.
This change ensures STRUCT field names follow the same Avro-compatible hex-encoding path that top-level VARCHAR columns already use.
Test Plan
Tested locally.
Contributor checklist
Release Notes
Summary by Sourcery
Support nested ROW/STRUCT fields with special characters in names for Iceberg tables backed by Parquet.