
ingest - implement and make use of connector classes #1519


Merged
fvankrieken merged 17 commits into main from fvk-ingest-connectors
Apr 17, 2025

@fvankrieken fvankrieken commented Mar 12, 2025

Closes #1515

There are multiple changes here, big and small. I tried to be quite detailed in the commit descriptions, so refer to those as you review. Definitely review commit by commit; I think the history is relatively sane.

All the connector-related ingest changes are in one commit (the 9th). If you're unsure how the functionality in a specific commit is relevant, I recommend checking that one as a reference.

Main takeaways

  • rework connectors a bit, generally
    • connectors can expose different APIs: versioned vs. non-versioned, etc.
    • move towards kwargs instead of passing in a dict object
    • registry "subregistries", so that you can declare which API is needed at various points in the codebase
  • implement connectors for storage in ingest. For now, this is separate from edm.recipes, which is more tightly coupled to staying backwards compatible with library
  • use connectors in ingest
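
As a rough illustration of the first takeaway, here is a minimal sketch of a versioned and a non-versioned API expressed as separate protocols, with options passed as keyword arguments rather than one dict. Every name in it (`NonVersionedPull`, `VersionedPull`, `LocalFolderConnector`, the `filename` option) is hypothetical and is not taken from dcpy.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class NonVersionedPull(Protocol):
    """Hypothetical API for sources that have no concept of versions."""

    def pull(self, key: str, destination_path: Path, **kwargs) -> Path: ...


@runtime_checkable
class VersionedPull(Protocol):
    """Hypothetical API for sources that store several versions per key."""

    def pull_versioned(
        self, key: str, version: str, destination_path: Path, **kwargs
    ) -> Path: ...


class LocalFolderConnector:
    """Toy non-versioned connector that copies files out of a root folder."""

    def __init__(self, root: Path):
        self.root = root

    def pull(self, key: str, destination_path: Path, **kwargs) -> Path:
        destination_path.mkdir(parents=True, exist_ok=True)
        # Optional settings arrive as keyword arguments, not as one dict.
        target = destination_path / kwargs.get("filename", key)
        shutil.copy(self.root / key, target)
        return target


def demo_pull() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "source"
        root.mkdir()
        (root / "data.csv").write_text("a,b\n1,2\n")
        conn = LocalFolderConnector(root)
        pulled = conn.pull(
            key="data.csv", destination_path=Path(tmp) / "out", filename="raw.csv"
        )
        # The connector satisfies the non-versioned API but not the versioned one.
        return (
            pulled.name,
            isinstance(conn, NonVersionedPull),
            isinstance(conn, VersionedPull),
        )
```

With protocols like these, a caller can state which API it needs and reject connectors that only offer the other one.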

My biggest open questions

  • Resolve edm.recipes vs ingest_datastore -> should this new connector just go in recipes? Should it become one with the current recipes connector?
    • arg against -> nice to have clean break. An external user stays further away from any code that has a concept of library. Also don't have to worry about refactoring name (ingest_datastore is generic, better to expose slightly to users, whereas edm.recipes is very much OUR datastore)
    • arg for -> might be easier to take advantage of the new connector if we just make one and keep it backwards compatible. Less room for confusion moving forward than with two parallel connectors, one pushing and one pulling.
  • handling raw filename in ingest (see comment)

@fvankrieken fvankrieken force-pushed the fvk-ingest-connectors branch 8 times, most recently from 5660ddf to 0d10fd6 Compare March 14, 2025 23:33
@fvankrieken fvankrieken changed the base branch from main to fvk-ingest-s3opt March 14, 2025 23:33
Base automatically changed from fvk-ingest-s3opt to main March 18, 2025 01:58
@fvankrieken fvankrieken force-pushed the fvk-ingest-connectors branch 12 times, most recently from 535a463 to 39ec8d0 Compare March 21, 2025 14:01
@fvankrieken fvankrieken force-pushed the fvk-ingest-connectors branch 5 times, most recently from 2d3b04a to daa5c27 Compare April 2, 2025 21:22

codecov bot commented Apr 2, 2025

Codecov Report

Attention: Patch coverage is 91.47465% with 37 lines in your changes missing coverage. Please review.

Project coverage is 71.19%. Comparing base (a71b5bf) to head (0224d21).
Report is 3 commits behind head on main.

Files with missing lines                          Patch %   Lines
dcpy/connectors/registry.py                       83.52%    12 Missing and 2 partials ⚠️
dcpy/connectors/edm/publishing.py                 79.54%    8 Missing and 1 partial ⚠️
dcpy/connectors/filesystem.py                     91.17%    1 Missing and 2 partials ⚠️
dcpy/connectors/ingest_datastore.py               93.75%    2 Missing and 1 partial ⚠️
dcpy/lifecycle/data_loader.py                     0.00%     3 Missing ⚠️
dcpy/models/lifecycle/ingest.py                   94.73%    3 Missing ⚠️
dcpy/connectors/edm/recipes.py                    88.88%    1 Missing ⚠️
dcpy/connectors/esri/arcgis_feature_service.py    95.45%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1519      +/-   ##
==========================================
+ Coverage   70.69%   71.19%   +0.50%     
==========================================
  Files         125      129       +4     
  Lines        6504     6690     +186     
  Branches      742      732      -10     
==========================================
+ Hits         4598     4763     +165     
- Misses       1750     1775      +25     
+ Partials      156      152       -4     


If connectors expose different APIs, then at certain points we need to
be able to declare what we expect a connector to be able to do. This may
be a little too strict right now (no API is available on the default
registry until you get a subregistry of it), so it might need some
reworking. I think its utility becomes clearer once it is used in ingest.
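
One way to picture the subregistry idea is the sketch below. It is an illustration under assumptions, not the implementation in dcpy/connectors/registry.py: the `Registry`, `Pull`, `Push`, `WebSource`, and `Bucket` names are all hypothetical. The default registry accepts any object, so nothing can be assumed about its connectors; a subregistry keeps only the connectors that satisfy the requested API.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Pull(Protocol):
    def pull(self, key: str, **kwargs) -> str: ...


@runtime_checkable
class Push(Protocol):
    def push(self, key: str, **kwargs) -> str: ...


class Registry:
    """Hypothetical registry that only admits connectors implementing `api`."""

    def __init__(self, api: type = object):
        self.api = api
        self._connectors = {}

    def register(self, name: str, connector: object) -> None:
        if not isinstance(connector, self.api):
            raise TypeError(f"{name!r} does not implement {self.api.__name__}")
        self._connectors[name] = connector

    def names(self) -> list:
        return sorted(self._connectors)

    def __getitem__(self, name: str):
        return self._connectors[name]

    def subregistry(self, api: type) -> "Registry":
        # Keep only the connectors that satisfy the requested API.
        sub = Registry(api)
        for name, connector in self._connectors.items():
            if isinstance(connector, api):
                sub.register(name, connector)
        return sub


class WebSource:
    def pull(self, key: str, **kwargs) -> str:
        return f"pulled {key}"


class Bucket:
    def pull(self, key: str, **kwargs) -> str:
        return f"pulled {key}"

    def push(self, key: str, **kwargs) -> str:
        return f"pushed {key}"


default_registry = Registry()  # api is `object`: no methods can be assumed
default_registry.register("web", WebSource())
default_registry.register("bucket", Bucket())

pullers = default_registry.subregistry(Pull)
pushers = default_registry.subregistry(Push)
```

Code that needs to push would take `pushers` rather than the default registry, so a pull-only connector cannot reach it by accident.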
Implement connector objects and tests for the various ingest sources
Two points here:
- An "ingest_datastore" connector that's decoupled from any specific
  storage. The idea is that a user could set up storage for ingest with
  something like what's roughly outlined in #1555.
- "storage" connectors underneath. I really needed a connector that
  was both non-versioned and had some concept of listing things
  under a prefix. This may be scrapped in the future. Alex and I
  have talked about how maybe all connectors should move towards
  "paths" (sort of RESTful), which might make this redundant. But
  for now, I needed it.
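
To make the second point concrete, here is a minimal sketch of a non-versioned storage connector that can list keys under a prefix, backed by a local folder. The `LocalStorage` class, its method names, and the `datasets/<name>/<version>/` key layout are assumptions for illustration only, not the connectors added in this PR.

```python
import shutil
import tempfile
from pathlib import Path


class LocalStorage:
    """Hypothetical non-versioned storage connector backed by a local folder."""

    def __init__(self, root: Path):
        self.root = root

    def push(self, key: str, filepath: Path, **kwargs) -> str:
        target = self.root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(filepath, target)
        return key

    def exists(self, key: str) -> bool:
        return (self.root / key).is_file()

    def list_keys(self, prefix: str = "") -> list:
        # Keys are POSIX-style paths relative to the storage root.
        keys = (
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )
        return sorted(k for k in keys if k.startswith(prefix))


def demo_listing() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        source = tmp_path / "upload.csv"
        source.write_text("a,b\n1,2\n")
        storage = LocalStorage(tmp_path / "store")
        for key in (
            "datasets/foo/2025-01-01/foo.csv",
            "datasets/foo/2025-02-01/foo.csv",
            "datasets/bar/2025-01-01/bar.csv",
        ):
            storage.push(key=key, filepath=source)
        # Versioning can then be layered on top by listing under a prefix.
        return storage.list_keys(prefix="datasets/foo/")
```

A datastore connector built on top of this could derive the available versions of a dataset from the listed keys, without the storage layer knowing anything about versions.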
Reduce boilerplate. This could have gone either direction, but there
were cases where "pull_versioned" needed a specific implementation, so
this felt more generalized.
@fvankrieken fvankrieken force-pushed the fvk-ingest-connectors branch 3 times, most recently from d7e8026 to 5c69e2a Compare April 15, 2025 21:26
@fvankrieken fvankrieken force-pushed the fvk-ingest-connectors branch from 5c69e2a to 7c1ad6c Compare April 15, 2025 21:41
@sf-dcp sf-dcp self-requested a review April 16, 2025 14:59
def validate_against_existing_versions(
    ds: recipes.Dataset, filepath: Path
) -> ArchiveAction:
def validate_against_existing_versions(ds: recipes.Dataset, filepath: Path) -> bool:
Contributor

Can we rename this function to something like is_new_version, and also update the docstring where it says (line 16) that the config file will be updated? From my understanding, the config file will no longer record the date checked.

Contributor Author

Oh wow, the docstring is way inaccurate. Yeah, will update.

Maybe compare_to_existing_version? Or is_equivalent_to_existing_version? I lean toward the former.

Contributor Author

I guess the scope is more "check if it's new (against the existing versions in recipes), then if it isn't validate the data"... which sort of leads me back to the current name. But to that point, I think the scope of this function is a little confusing.

It also has a dependency on recipes which needs to be removed. Let me add a commit to try to clean this up and we can see if we think it belongs or if it's a little too weird to try to fit into this PR
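
For reference, the two-step scope described in this thread ("check if it's new, then if it isn't, validate the data") could be sketched without any dependency on recipes roughly as follows. This is a hypothetical illustration, not the code that was merged: `is_new_version`, `file_checksum`, and the idea of validating by comparing SHA-256 checksums are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(filepath: Path) -> str:
    """SHA-256 of the file contents, used here as a stand-in for data validation."""
    return hashlib.sha256(filepath.read_bytes()).hexdigest()


def is_new_version(existing_checksums: dict, version: str, filepath: Path) -> bool:
    """Return True if `version` has not been archived yet.

    `existing_checksums` maps already-archived versions to checksums
    (a hypothetical structure). If the version already exists, the new file
    must match it; a mismatch raises instead of silently overwriting.
    """
    if version not in existing_checksums:
        return True
    if existing_checksums[version] != file_checksum(filepath):
        raise ValueError(f"Version {version!r} already exists with different data")
    return False
```

Passing the existing versions in as plain data keeps the check independent of where those versions are stored.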

@fvankrieken fvankrieken force-pushed the fvk-ingest-connectors branch 4 times, most recently from 9656b3e to bf482ac Compare April 17, 2025 16:12
@sf-dcp sf-dcp left a comment

Reviewed at high level -- LGTM!

@fvankrieken fvankrieken force-pushed the fvk-ingest-connectors branch 2 times, most recently from 91bd355 to 25d385f Compare April 17, 2025 17:22
@fvankrieken fvankrieken force-pushed the fvk-ingest-connectors branch from 25d385f to 25ada63 Compare April 17, 2025 17:25
@fvankrieken fvankrieken merged commit 6add0c5 into main Apr 17, 2025
23 checks passed
@github-project-automation github-project-automation bot moved this from New to Done in Data Engineering Apr 17, 2025
@fvankrieken fvankrieken deleted the fvk-ingest-connectors branch April 17, 2025 18:52
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Ingest: implement connector classes for extracting and archiving
4 participants