Switch Google Cloud Storage file source from fs-gcsfs to gcsfs (fsspec)#21590
Switch Google Cloud Storage file source from fs-gcsfs to gcsfs (fsspec)#21590mvdbeek merged 5 commits intogalaxyproject:devfrom
Conversation
fc76d37 to
d5d0959
Compare
and add gcsfs to the test dependency group in pyproject.toml, so that the unit tests run also with the regular unit tests. Replace the pyfilesystem2-based fs-gcsfs with the fsspec-based gcsfs library. The gcsfs library has native support for anonymous access via token='anon', which properly handles credential refresh without errors. This fixes the "Anonymous credentials cannot be refreshed" error that was occurring with fs-gcsfs when listing files from public GCS buckets. Fixes galaxyproject#21565
d5d0959 to
d4de755
Compare
|
Applied both suggestions, thank you @nsoranzo! |
|
I can try to test later -- looks like this dropped all the virtual directory workarounds, did you see in the docs for fsspec gcsfs that that should work better? |
|
No, i haven't checked this. Claude said this: Problem: The fs-gcsfs library was using AnonymousCredentials from google.auth, which raises InvalidOperation when refresh() is called. Recent changes in the Google Cloud libraries caused credential refresh to be attempted even for anonymous access. Solution: Switched to gcsfs (fsspec-based) which has native anonymous access support via token='anon'. This properly handles the anonymous credential case without attempting refresh. Files Changed Now extends FsspecFilesSource instead of PyFilesystem2FilesSource lib/galaxy/dependencies/init.py - Renamed check_fs_gcsfs to check_gcsfs, removed check_google_cloud_storage packages/files/setup.cfg - Updated test dependency from fs-gcsfs to gcsfs test/unit/files/test_gcsfs.py - Updated import from fs_gcsfs.GCSFS to gcsfs.GCSFileSystem test/unit/files/gcsfs_file_sources_conf.yml - Simplified config (removed unused oauth parameters) The gcsfs library is a well-maintained fsspec implementation that's consistent with the existing s3fs file source pattern already in use in Galaxy. |
and fs-gcsfs is unmaintained, last release in 2022. |
|
Yeah, sounds fine to swap to something more maintained, we just need to make sure this works with buckets that use virtual directories. I'll give it a shot later today with the motrpac buckets. |
|
I can test this later tonight. |
Some fsspec implementations (notably gcsfs) include the directory being listed as one of the entries in ls() results. This caused duplicate entries to appear when navigating into folders.
Reverts field renames that would break existing configurations: - service_account_credentials → service_account_json - access_token → token
Allows scoping the file source to a subdirectory within the bucket, matching the behavior of the previous fs-gcsfs implementation.
|
Good news is that this doesn't have the same virtual directory listing issue, so that's good to go. The config name changes were breaking changes -- we need to either document that, allow for an 'old_value' mapping, or just keep using the old config names. I pushed a commit that does the latter because I'm guessing claude did the renames without thinking about existing configs. The other thing is that this dropped root path support, which I added back. Feel free to drop any of this if the changes were intentional, we just need to make sure to provide migration notes if so. |
|
Awesome, thanks for checking. Looks good to me, feel free to merge. |
dannon
left a comment
There was a problem hiding this comment.
@ksuderman Want to sanity check my changes and see that this works in your implementation, too? Then it's good to go. I'll go ahead and approve from my end.
|
Ok, I can't actually test this due to #21262. In fact it seems to be much worse since Jan 12th. The web handler pod has been running for 90 minutes and it is still loading tools... It looks like #21554 was the culprit, or at least that was the first smoke-test to fail, but I have no idea why that would break anything. |
|
This PR was merged without a "kind/" label, please correct. |
I don't think that's the issue. xref #21262 (comment) ? |
#21554 was just the PR that caused the smoke test to start failing. I haven't had an opportunity to investigate the root cause yet. |
Replace the pyfilesystem2-based fs-gcsfs with the fsspec-based gcsfs library. The gcsfs library has native support for anonymous access via token='anon', which properly handles credential refresh without errors.
This fixes the "Anonymous credentials cannot be refreshed" error that was occurring with fs-gcsfs when listing files from public GCS buckets.
Fixes #21565
I don't have a GCP account handy, @dannon or @ksuderman would you be able to test this ?
How to test the changes?
(Select all options that apply)
License