Skip to content

Enhance presto_unnest() Functionality #325

@jarodmeng

Description

@jarodmeng

Current State

The presto_unnest() function in R/presto_unnest.R currently has several limitations:

  1. Single column limitation: Line 92-94 enforces that only one column can be unnested at a time:

    } else if (length(valid_col_names) > 1) {
      stop("Currently only one column can be unnested at a time", call. = FALSE)
    }
    
  2. Manual alias specification required: For ARRAY(ROW(...)) types, users must manually specify field names in the SQL alias, even though this information is available via presto_type().

  3. No tidyr-like options: Missing names_sep and names_repair parameters that would make column naming more flexible and consistent with tidyr::unnest().

  4. No partial projection: Cannot extract only specific fields from ARRAY(ROW(...)) types.

Detailed Improvements

1. Support Multiple Column Unnesting

Current Code Location: R/presto_unnest.R lines 67-119

Implementation Steps:

  1. Remove the single-column check at lines 92-94
  2. Modify presto_unnest.tbl_presto() to handle multiple columns:
    • Accept multiple columns via tidyselect::eval_select() (already supports this at line 74)
    • Store multiple column names in the lazy_unnest_query structure
    • Update op_vars.lazy_unnest_query() (lines 130-139) to handle multiple unnested columns
  3. Update sql_render.unnest_query() (lines 214-258) to generate multiple CROSS JOIN UNNEST clauses

SQL Generation:

  • For single column: CROSS JOIN UNNEST(col) AS t(elem)
  • For multiple columns: CROSS JOIN UNNEST(col1) AS t1(elem1), UNNEST(col2) AS t2(elem2)
  • Handle cases where arrays have different lengths (Presto behavior: Cartesian product)

Example Usage:

tbl(con, "test") %>%
  presto_unnest(c(arr1, arr2), values_to = c("val1", "val2"))

Edge Cases:

  • Arrays with different lengths (should produce Cartesian product)
  • NULL arrays (should produce no rows for that array)
  • Empty arrays (should produce no rows for that array)
  • Mix of NULL and non-NULL arrays

2. Automatic ROW Field Detection

Current Code Location: R/presto_unnest.R lines 214-258 (SQL rendering)

Implementation Steps:

  1. In presto_unnest.tbl_presto(), when a column is detected as ARRAY(ROW(...)):
    • Call presto_type() on the column to get type information
    • Parse the ROW type signature to extract field names (see R/dbColumnType.R lines 85-132 for type parsing logic)
    • Use init.presto.field.from.json() from R/presto.field.R lines 76-189 to extract ROW field structure
  2. Automatically generate alias names from ROW field names:
    • If ARRAY(ROW(id INTEGER, name VARCHAR)), automatically use AS t(id, name)
    • Allow override via values_to parameter if provided
  3. Update sql_render.unnest_query() to use detected field names

Code References:

  • Type detection: R/presto_type.R lines 41-64
  • ROW field extraction: R/presto.field.R lines 130-170
  • Type string formatting: R/dbColumnType.R lines 85-132

Example Usage:

# Current (manual):
tbl(con, "test") %>%
  presto_unnest(users, values_to = c("user_id", "user_name"))

# After (automatic):
tbl(con, "test") %>%
  presto_unnest(users)  # Automatically extracts id and name from ARRAY(ROW(id BIGINT, name VARCHAR))

SQL Example:

-- Before (manual):
SELECT * FROM test
CROSS JOIN UNNEST(users) AS t(user_id, user_name)

-- After (automatic detection):
SELECT * FROM test
CROSS JOIN UNNEST(users) AS t(id, name)  -- Field names from ROW type

3. Add tidyr-like Options

Implementation Steps:

  1. Add names_sep parameter (default NULL):
    • When unnesting multiple columns, use separator to create column names
    • Example: names_sep = "_" with columns arr1, arr2 produces arr1_elem, arr2_elem
  2. Add names_repair parameter (default "check_unique"):
    • Options: "check_unique", "unique", "universal", "minimal"
    • Use vctrs::vec_as_names() for name repair (similar to tidyr)
  3. Apply name repair after generating column names from values_to or automatic detection

Example Usage:

tbl(con, "test") %>%
  presto_unnest(
    c(arr1, arr2),
    values_to = c("val1", "val2"),
    names_sep = "_",
    names_repair = "unique"
  )

Edge Cases:

  • Duplicate column names after unnesting
  • Column names that conflict with existing table columns
  • Special characters in ROW field names

4. Support Partial Projections

Implementation Steps:

  1. Add fields parameter (default NULL for all fields):
    • Accept character vector of field names to extract
    • Example: fields = c("id", "name") extracts only those fields from ARRAY(ROW(id, name, email, ...))
  2. In SQL generation, only include specified fields in the UNNEST alias
  3. Validate that specified fields exist in the ROW type

Example Usage:

# Extract only id and name from ARRAY(ROW(id, name, email, created_at))
tbl(con, "test") %>%
  presto_unnest(users, fields = c("id", "name"))

SQL Example:

-- Full projection:
SELECT * FROM test
CROSS JOIN UNNEST(users) AS t(id, name, email, created_at)

-- Partial projection (fields = c("id", "name")):
SELECT * FROM test
CROSS JOIN UNNEST(users) AS t(id, name)

Files to Modify

Primary Files

  • R/presto_unnest.R (lines 12-258):
    • Update presto_unnest.tbl_presto() (lines 67-119) to handle multiple columns and ROW field detection
    • Update op_vars.lazy_unnest_query() (lines 130-139) for multiple columns
    • Update sql_render.unnest_query() (lines 214-258) for SQL generation
    • Add helper function to detect and parse ROW types using presto_type()

Supporting Files

  • R/presto_type.R: May need helper to extract ROW field names from type signature
  • R/dbColumnType.R: Reference format_presto_type_string() (lines 85-132) for ROW parsing logic

Test Files

  • tests/testthat/test-presto_unnest.R: Add comprehensive tests for:
    • Multiple column unnesting (various combinations)
    • Automatic ROW field detection
    • names_sep and names_repair behavior
    • Partial projections
    • Edge cases (NULL arrays, empty arrays, different lengths)
    • Integration with existing tests (group_by, arrange, window functions, CTEs)

Implementation Checklist

  • Remove single-column limitation check
  • Implement multiple column support in presto_unnest.tbl_presto()
  • Update op_vars.lazy_unnest_query() for multiple columns
  • Update sql_render.unnest_query() for multiple UNNEST clauses
  • Add ROW type detection using presto_type()
  • Implement automatic field name extraction from ROW types
  • Add names_sep parameter and logic
  • Add names_repair parameter and logic (using vctrs::vec_as_names())
  • Add fields parameter for partial projections
  • Add validation for fields parameter (check field existence)
  • Update function documentation with new parameters
  • Add examples to function documentation
  • Write comprehensive tests for all new features
  • Test edge cases (NULL, empty arrays, different lengths)
  • Ensure backward compatibility with existing code

Testing Requirements

Unit Tests

  1. Multiple column unnesting with same-length arrays
  2. Multiple column unnesting with different-length arrays (Cartesian product)
  3. Automatic ROW field detection for ARRAY(ROW(...))
  4. names_sep parameter behavior
  5. names_repair parameter with various options
  6. Partial projections with fields parameter
  7. Error handling: invalid field names, non-ROW types with fields

Integration Tests

  1. Multiple unnest + group_by + summarize
  2. Multiple unnest + arrange
  3. Multiple unnest + window functions
  4. Multiple unnest + CTEs
  5. Backward compatibility: existing single-column usage still works

Edge Case Tests

  1. NULL arrays
  2. Empty arrays
  3. Mix of NULL and non-NULL arrays
  4. Arrays with zero elements
  5. Very large arrays
  6. Nested complex types (ARRAY(ROW(...)) with nested ROWs)

Acceptance Criteria

  1. ✅ Can unnest multiple columns in a single call
  2. ✅ Automatically detects and uses ROW field names for ARRAY(ROW(...)) types
  3. names_sep parameter works as expected
  4. names_repair parameter works with all supported options
  5. ✅ Partial projections work with fields parameter
  6. ✅ All existing tests pass (backward compatibility)
  7. ✅ New comprehensive test suite passes
  8. ✅ Documentation updated with examples
  9. ✅ Function signature matches tidyr::unnest() where applicable

Related

Part of RPresto Improvement Plan - Feature 1

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions