Member

@aykut-bozkurt aykut-bozkurt commented Mar 19, 2025

  • Adds the UDF parquet.list(<uri_with_pattern>), where uri_with_pattern may contain * to match words of arbitrary length or ** to match arbitrarily nested directories.
  • Supports COPY FROM <uri_with_pattern>.
COPY test1 TO '/tmp/parent/child/test1.parquet';
COPY test2 TO '/tmp/parent/child/test2.parquet';

COPY test3 FROM '/tmp/parent/**/*.parquet' WITH (format 'parquet');
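
A minimal sketch of how the two wildcards could be interpreted, for illustration only. It is hand-rolled and is not the matcher used in this PR; `glob_matches`, `path_matches`, and `segment_matches` are hypothetical names, and it assumes that `*` never crosses a `/` and that `**` matches zero or more whole directories.

```rust
/// Match a single path segment, where `*` stands for any run of characters.
fn segment_matches(pattern: &[u8], text: &[u8]) -> bool {
    match pattern.split_first() {
        // Pattern exhausted: the text must be exhausted too.
        None => text.is_empty(),
        // `*` may consume any number of bytes, including zero.
        Some((&b'*', rest)) => (0..=text.len()).any(|i| segment_matches(rest, &text[i..])),
        // Literal byte: it must match the next byte of the text.
        Some((c, rest)) => text.first() == Some(c) && segment_matches(rest, &text[1..]),
    }
}

/// Match a list of path segments, where a `**` segment spans zero or more directories.
fn path_matches(pattern: &[&str], path: &[&str]) -> bool {
    match pattern.split_first() {
        None => path.is_empty(),
        // Assumption: `**` may swallow any number of leading segments, including none.
        Some((seg, rest)) if *seg == "**" => {
            (0..=path.len()).any(|i| path_matches(rest, &path[i..]))
        }
        Some((seg, rest)) => match path.split_first() {
            Some((head, tail)) => {
                segment_matches(seg.as_bytes(), head.as_bytes()) && path_matches(rest, tail)
            }
            None => false,
        },
    }
}

/// Hypothetical entry point: split both strings on `/` and compare segment by segment.
fn glob_matches(pattern: &str, path: &str) -> bool {
    let pattern_segments: Vec<&str> = pattern.split('/').collect();
    let path_segments: Vec<&str> = path.split('/').collect();
    path_matches(&pattern_segments, &path_segments)
}
```

Under these assumptions, '/tmp/parent/**/*.parquet' matches both files written by the COPY TO statements above, while '/tmp/*.parquet' does not match a file inside a subdirectory.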

Warning: the list operation is not supported for http(s) object stores. It is available for all other stores.

Closes #112.

codecov bot commented Mar 19, 2025

Codecov Report

❌ Patch coverage is 98.31461% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.30%. Comparing base (7060d99) to head (87f3d04).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/arrow_parquet/parquet_reader.rs | 96.47% | 5 Missing ⚠️ |
| src/arrow_parquet/uri_utils.rs | 89.47% | 2 Missing ⚠️ |
| src/parquet_udfs/list.rs | 96.49% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #120      +/-   ##
==========================================
+ Coverage   91.03%   91.30%   +0.27%     
==========================================
  Files          95       97       +2     
  Lines       10759    11204     +445     
==========================================
+ Hits         9794    10230     +436     
- Misses        965      974       +9     

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/glob-read branch 2 times, most recently from 1f74247 to b5afb04 Compare April 17, 2025 16:34
@aykut-bozkurt aykut-bozkurt marked this pull request as ready for review April 17, 2025 18:32
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/glob-read branch 2 times, most recently from c32dba4 to ed971ec Compare May 5, 2025 15:05
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/glob-read branch 3 times, most recently from 592efe8 to 5c006b4 Compare June 11, 2025 22:39
@aykut-bozkurt aykut-bozkurt requested a review from marcoslot June 11, 2025 22:42
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/glob-read branch 4 times, most recently from 12f6342 to 84e7040 Compare August 10, 2025 20:59
Collaborator

@pgguru pgguru left a comment

What happens if the files have different structures, for example non-castable column types or extra columns?

I see the check that the schema matches the tuple type, so I assume it would start loading files until it hit one that couldn't be coerced, then abort. Is it worth adding a test with differing structures to validate the behavior here?

}

pub(crate) fn is_pattern(&self) -> bool {
self.path.to_string().contains('*') || self.path.to_string().contains("**")
Collaborator

Pedantically, anything that contains ** also contains *. Is it possible to have escaped * chars that aren't pattern chars, or other glob syntaxes (like % or ? or something)?
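
The redundancy can be checked with a small sketch. The two free functions below are hypothetical stand-ins for the method under review (they take a `&str` rather than `&self`); the second drops the `"**"` test, which can never change the result because any string containing `**` also contains `*`.

```rust
/// Stand-in for the check as written in the diff (hypothetical free function).
fn is_pattern_original(path: &str) -> bool {
    path.contains('*') || path.contains("**")
}

/// Simplified form: a single `*` test covers `**` as well.
fn is_pattern_simplified(path: &str) -> bool {
    path.contains('*')
}
```

Neither form distinguishes an escaped or literal `*` from a wildcard, so that part of the question remains open in this sketch.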

Member Author

Added one test with s3://.../** in addition to s3://.../**/*.parquet. I did not want to add support for % or ? initially (it is a bit complicated with the Rust pattern library).

Member Author

Hmm, what happens if * is allowed in a file name or URI key? Let me check.

Member Author

@aykut-bozkurt aykut-bozkurt Sep 23, 2025

The flow is as follows:

  1. Try reading the file from the URI.
  2. If the URI does not exist:
    2.1 If the URI contains a pattern (only '*' for now), list the URI and read the matching files.
    2.2 Otherwise, panic with the original error.
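
The flow above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `resolve_uris` is a hypothetical helper, the exact read and the listing are passed in as closures so the sketch stays independent of any object store, and it returns the original error to the caller instead of panicking.

```rust
use std::io;

/// Hypothetical helper: decide which URIs to read for a user-supplied URI.
/// `read_exact` tries the URI verbatim; `list_matching` expands it as a pattern.
fn resolve_uris<R, L>(uri: &str, read_exact: R, list_matching: L) -> io::Result<Vec<String>>
where
    R: Fn(&str) -> io::Result<()>,
    L: Fn(&str) -> io::Result<Vec<String>>,
{
    match read_exact(uri) {
        // Step 1: the URI exists as-is, even if its name contains a literal `*`.
        Ok(()) => Ok(vec![uri.to_string()]),
        // Step 2.1: not found, but the URI looks like a pattern, so list it.
        Err(err) if err.kind() == io::ErrorKind::NotFound && uri.contains('*') => {
            list_matching(uri)
        }
        // Step 2.2: surface the original error unchanged.
        Err(err) => Err(err),
    }
}
```

Trying the exact URI first means an object whose key really contains `*` is still read directly, and listing only happens as a fallback.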

@aykut-bozkurt
Member Author

Added tests with schema mismatches (more columns, or a column type mismatch). These are checked before reading any file.
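
A rough sketch of what "checked before reading any file" could look like. Everything here is hypothetical (`Column`, `col`, `validate_schemas`), and it uses strict equality on column names and types as a simplification, whereas the real check may allow coercible types.

```rust
/// Hypothetical, simplified column description.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    type_name: String,
}

/// Convenience constructor for the sketch.
fn col(name: &str, type_name: &str) -> Column {
    Column { name: name.to_string(), type_name: type_name.to_string() }
}

/// Validate every listed file's schema against the target table up front,
/// so that no rows are loaded if any file is incompatible.
fn validate_schemas(target: &[Column], files: &[(String, Vec<Column>)]) -> Result<(), String> {
    for (uri, schema) in files {
        if schema.len() != target.len() {
            return Err(format!(
                "{uri}: expected {} columns, found {}",
                target.len(),
                schema.len()
            ));
        }
        for (expected, actual) in target.iter().zip(schema) {
            // Simplification: exact match only, no type coercion.
            if expected != actual {
                return Err(format!("{uri}: column {expected:?} does not match {actual:?}"));
            }
        }
    }
    Ok(())
}
```

Validating all schemas first avoids a partially completed load when a later file turns out to be incompatible.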

Collaborator

@pgguru pgguru left a comment

Looks good with changes, thanks!

@aykut-bozkurt aykut-bozkurt merged commit 247675b into main Sep 24, 2025
14 checks passed
@aykut-bozkurt aykut-bozkurt deleted the aykut/glob-read branch September 24, 2025 23:43
Development

Successfully merging this pull request may close these issues.

Support glob patterns

3 participants