-
Notifications
You must be signed in to change notification settings - Fork 32
Support copying from glob patterns #120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #120 +/- ##
==========================================
+ Coverage 91.03% 91.30% +0.27%
==========================================
Files 95 97 +2
Lines 10759 11204 +445
==========================================
+ Hits 9794 10230 +436
- Misses 965 974 +9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
fa24da5 to
6a136e1
Compare
1f74247 to
b5afb04
Compare
c32dba4 to
ed971ec
Compare
ed971ec to
07d6f11
Compare
592efe8 to
5c006b4
Compare
12f6342 to
84e7040
Compare
pgguru
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what happens if the files have different structures and they're non-castable/other columns?
I see the check for the schema matching the tuple type, so I assume it'd presumably start loading things until it hit a file that couldn't be coerced and abort at that point. Is it worth adding a test of different structures to validate behavior here?
src/arrow_parquet/uri_utils.rs
Outdated
| } | ||
|
|
||
| pub(crate) fn is_pattern(&self) -> bool { | ||
| self.path.to_string().contains('*') || self.path.to_string().contains("**") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pedantically, anything that contains ** also contains *. Is it possible to have escaped * chars that aren't pattern chars, or other glob syntaxes (like % or ? or something)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added one test with s3://.../** in addition to s3://.../**/*.parquet. I did not want to add support for % or ? initially. (a bit complicated with rust pattern library)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm what happens if * is allowed as file or uri key. let me check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flow is like below:
- try reading file from uri
- if the uri does not exist
2.1 if uri contains pattern (only '*' for now), read after listing the uri
2.2 otherwise panic with original error
84e7040 to
8cf5526
Compare
|
added tests with schema mismatch (more columns, or column type mismatch) Those are checked before reading any file. |
8cf5526 to
a8e09f7
Compare
e4a832b to
572a819
Compare
572a819 to
87f3d04
Compare
pgguru
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good with changes, thanks!
parquet.list(<uri_with_pattern>)whereuri_with_patternmight contain*for words of arbitrary length or**for arbitrarily nested directories.COPY FROM <uri_with_pattern>.Warning: list operation is not supported for http(s) object stores. (available for all other stores)
Closes #112.