Current State
The presto_unnest() function in R/presto_unnest.R currently has several limitations:
-
Single column limitation: Line 92-94 enforces that only one column can be unnested at a time:
} else if (length(valid_col_names) > 1) {
stop("Currently only one column can be unnested at a time", call. = FALSE)
}
-
Manual alias specification required: For ARRAY(ROW(...)) types, users must manually specify field names in the SQL alias, even though this information is available via presto_type().
-
No tidyr-like options: Missing names_sep and names_repair parameters that would make column naming more flexible and consistent with tidyr::unnest().
-
No partial projection: Cannot extract only specific fields from ARRAY(ROW(...)) types.
Detailed Improvements
1. Support Multiple Column Unnesting
Current Code Location: R/presto_unnest.R lines 67-119
Implementation Steps:
- Remove the single-column check at lines 92-94
- Modify
presto_unnest.tbl_presto() to handle multiple columns:
- Accept multiple columns via
tidyselect::eval_select() (already supports this at line 74)
- Store multiple column names in the
lazy_unnest_query structure
- Update
op_vars.lazy_unnest_query() (lines 130-139) to handle multiple unnested columns
- Update
sql_render.unnest_query() (lines 214-258) to generate multiple CROSS JOIN UNNEST clauses
SQL Generation:
- For single column:
CROSS JOIN UNNEST(col) AS t(elem)
- For multiple columns:
CROSS JOIN UNNEST(col1) AS t1(elem1), UNNEST(col2) AS t2(elem2)
- Handle cases where arrays have different lengths (Presto behavior: Cartesian product)
Example Usage:
tbl(con, "test") %>%
presto_unnest(c(arr1, arr2), values_to = c("val1", "val2"))
Edge Cases:
- Arrays with different lengths (should produce Cartesian product)
- NULL arrays (should produce no rows for that array)
- Empty arrays (should produce no rows for that array)
- Mix of NULL and non-NULL arrays
2. Automatic ROW Field Detection
Current Code Location: R/presto_unnest.R lines 214-258 (SQL rendering)
Implementation Steps:
- In
presto_unnest.tbl_presto(), when a column is detected as ARRAY(ROW(...)):
- Call
presto_type() on the column to get type information
- Parse the ROW type signature to extract field names (see
R/dbColumnType.R lines 85-132 for type parsing logic)
- Use
init.presto.field.from.json() from R/presto.field.R lines 76-189 to extract ROW field structure
- Automatically generate alias names from ROW field names:
- If
ARRAY(ROW(id INTEGER, name VARCHAR)), automatically use AS t(id, name)
- Allow override via
values_to parameter if provided
- Update
sql_render.unnest_query() to use detected field names
Code References:
- Type detection:
R/presto_type.R lines 41-64
- ROW field extraction:
R/presto.field.R lines 130-170
- Type string formatting:
R/dbColumnType.R lines 85-132
Example Usage:
# Current (manual):
tbl(con, "test") %>%
presto_unnest(users, values_to = c("user_id", "user_name"))
# After (automatic):
tbl(con, "test") %>%
presto_unnest(users) # Automatically extracts id and name from ARRAY(ROW(id BIGINT, name VARCHAR))
SQL Example:
-- Before (manual):
SELECT * FROM test
CROSS JOIN UNNEST(users) AS t(user_id, user_name)
-- After (automatic detection):
SELECT * FROM test
CROSS JOIN UNNEST(users) AS t(id, name) -- Field names from ROW type
3. Add tidyr-like Options
Implementation Steps:
- Add
names_sep parameter (default NULL):
- When unnesting multiple columns, use separator to create column names
- Example:
names_sep = "_" with columns arr1, arr2 produces arr1_elem, arr2_elem
- Add
names_repair parameter (default "check_unique"):
- Options:
"check_unique", "unique", "universal", "minimal"
- Use
vctrs::vec_as_names() for name repair (similar to tidyr)
- Apply name repair after generating column names from
values_to or automatic detection
Example Usage:
tbl(con, "test") %>%
presto_unnest(
c(arr1, arr2),
values_to = c("val1", "val2"),
names_sep = "_",
names_repair = "unique"
)
Edge Cases:
- Duplicate column names after unnesting
- Column names that conflict with existing table columns
- Special characters in ROW field names
4. Support Partial Projections
Implementation Steps:
- Add
fields parameter (default NULL for all fields):
- Accept character vector of field names to extract
- Example:
fields = c("id", "name") extracts only those fields from ARRAY(ROW(id, name, email, ...))
- In SQL generation, only include specified fields in the UNNEST alias
- Validate that specified fields exist in the ROW type
Example Usage:
# Extract only id and name from ARRAY(ROW(id, name, email, created_at))
tbl(con, "test") %>%
presto_unnest(users, fields = c("id", "name"))
SQL Example:
-- Full projection:
SELECT * FROM test
CROSS JOIN UNNEST(users) AS t(id, name, email, created_at)
-- Partial projection (fields = c("id", "name")):
SELECT * FROM test
CROSS JOIN UNNEST(users) AS t(id, name)
Files to Modify
Primary Files
R/presto_unnest.R (lines 12-258):
- Update
presto_unnest.tbl_presto() (lines 67-119) to handle multiple columns and ROW field detection
- Update
op_vars.lazy_unnest_query() (lines 130-139) for multiple columns
- Update
sql_render.unnest_query() (lines 214-258) for SQL generation
- Add helper function to detect and parse ROW types using
presto_type()
Supporting Files
R/presto_type.R: May need helper to extract ROW field names from type signature
R/dbColumnType.R: Reference format_presto_type_string() (lines 85-132) for ROW parsing logic
Test Files
tests/testthat/test-presto_unnest.R: Add comprehensive tests for:
- Multiple column unnesting (various combinations)
- Automatic ROW field detection
names_sep and names_repair behavior
- Partial projections
- Edge cases (NULL arrays, empty arrays, different lengths)
- Integration with existing tests (group_by, arrange, window functions, CTEs)
Implementation Checklist
Testing Requirements
Unit Tests
- Multiple column unnesting with same-length arrays
- Multiple column unnesting with different-length arrays (Cartesian product)
- Automatic ROW field detection for
ARRAY(ROW(...))
names_sep parameter behavior
names_repair parameter with various options
- Partial projections with
fields parameter
- Error handling: invalid field names, non-ROW types with
fields
Integration Tests
- Multiple unnest + group_by + summarize
- Multiple unnest + arrange
- Multiple unnest + window functions
- Multiple unnest + CTEs
- Backward compatibility: existing single-column usage still works
Edge Case Tests
- NULL arrays
- Empty arrays
- Mix of NULL and non-NULL arrays
- Arrays with zero elements
- Very large arrays
- Nested complex types (ARRAY(ROW(...)) with nested ROWs)
Acceptance Criteria
- ✅ Can unnest multiple columns in a single call
- ✅ Automatically detects and uses ROW field names for
ARRAY(ROW(...)) types
- ✅
names_sep parameter works as expected
- ✅
names_repair parameter works with all supported options
- ✅ Partial projections work with
fields parameter
- ✅ All existing tests pass (backward compatibility)
- ✅ New comprehensive test suite passes
- ✅ Documentation updated with examples
- ✅ Function signature matches tidyr::unnest() where applicable
Related
Part of RPresto Improvement Plan - Feature 1
Current State
The
presto_unnest()function inR/presto_unnest.Rcurrently has several limitations:Single column limitation: Line 92-94 enforces that only one column can be unnested at a time:
Manual alias specification required: For
ARRAY(ROW(...))types, users must manually specify field names in the SQL alias, even though this information is available viapresto_type().No tidyr-like options: Missing
names_sepandnames_repairparameters that would make column naming more flexible and consistent withtidyr::unnest().No partial projection: Cannot extract only specific fields from
ARRAY(ROW(...))types.Detailed Improvements
1. Support Multiple Column Unnesting
Current Code Location:
R/presto_unnest.Rlines 67-119Implementation Steps:
presto_unnest.tbl_presto()to handle multiple columns:tidyselect::eval_select()(already supports this at line 74)lazy_unnest_querystructureop_vars.lazy_unnest_query()(lines 130-139) to handle multiple unnested columnssql_render.unnest_query()(lines 214-258) to generate multipleCROSS JOIN UNNESTclausesSQL Generation:
CROSS JOIN UNNEST(col) AS t(elem)CROSS JOIN UNNEST(col1) AS t1(elem1), UNNEST(col2) AS t2(elem2)Example Usage:
Edge Cases:
2. Automatic ROW Field Detection
Current Code Location:
R/presto_unnest.Rlines 214-258 (SQL rendering)Implementation Steps:
presto_unnest.tbl_presto(), when a column is detected asARRAY(ROW(...)):presto_type()on the column to get type informationR/dbColumnType.Rlines 85-132 for type parsing logic)init.presto.field.from.json()fromR/presto.field.Rlines 76-189 to extract ROW field structureARRAY(ROW(id INTEGER, name VARCHAR)), automatically useAS t(id, name)values_toparameter if providedsql_render.unnest_query()to use detected field namesCode References:
R/presto_type.Rlines 41-64R/presto.field.Rlines 130-170R/dbColumnType.Rlines 85-132Example Usage:
SQL Example:
3. Add tidyr-like Options
Implementation Steps:
names_sepparameter (defaultNULL):names_sep = "_"with columnsarr1,arr2producesarr1_elem,arr2_elemnames_repairparameter (default"check_unique"):"check_unique","unique","universal","minimal"vctrs::vec_as_names()for name repair (similar to tidyr)values_toor automatic detectionExample Usage:
Edge Cases:
4. Support Partial Projections
Implementation Steps:
fieldsparameter (defaultNULLfor all fields):fields = c("id", "name")extracts only those fields fromARRAY(ROW(id, name, email, ...))Example Usage:
SQL Example:
Files to Modify
Primary Files
R/presto_unnest.R(lines 12-258):presto_unnest.tbl_presto()(lines 67-119) to handle multiple columns and ROW field detectionop_vars.lazy_unnest_query()(lines 130-139) for multiple columnssql_render.unnest_query()(lines 214-258) for SQL generationpresto_type()Supporting Files
R/presto_type.R: May need helper to extract ROW field names from type signatureR/dbColumnType.R: Referenceformat_presto_type_string()(lines 85-132) for ROW parsing logicTest Files
tests/testthat/test-presto_unnest.R: Add comprehensive tests for:names_sepandnames_repairbehaviorImplementation Checklist
presto_unnest.tbl_presto()op_vars.lazy_unnest_query()for multiple columnssql_render.unnest_query()for multiple UNNEST clausespresto_type()names_sepparameter and logicnames_repairparameter and logic (usingvctrs::vec_as_names())fieldsparameter for partial projectionsfieldsparameter (check field existence)Testing Requirements
Unit Tests
ARRAY(ROW(...))names_sepparameter behaviornames_repairparameter with various optionsfieldsparameterfieldsIntegration Tests
Edge Case Tests
Acceptance Criteria
ARRAY(ROW(...))typesnames_sepparameter works as expectednames_repairparameter works with all supported optionsfieldsparameterRelated
Part of RPresto Improvement Plan - Feature 1