feat: Create Spark 4 Client #3188
Conversation
@flyrain & @gh-yzou - This is the same thing I did in #3026 before I accidentally blew away my repo. I'd love to have a conversation about whether you would be fine merging this in or whether you want to do the module separation first to clean this up. I know there is a user who wants this, and having a first version will help us be responsive to our users.
Pull request overview
This PR introduces support for Spark 4.0 by creating a new client that mirrors the existing Spark 3.5 implementation. The changes primarily involve copying and adapting existing code to work with Spark 4.0's interfaces and dependencies, with appropriate configuration changes to support Scala 2.13 exclusively for this version.
Key Changes:
- Created a complete Spark 4.0 client implementation in plugins/spark/v4.0/
- Added version-specific Scala configuration to support different Scala versions per Spark version
- Updated build configurations to handle ANTLR version conflicts between Spark/Delta dependencies
- Included comprehensive test suites (unit, integration, and regression tests)
- Added documentation and Docker setup for testing
Reviewed changes
Copilot reviewed 47 out of 47 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| settings.gradle.kts | Added version-specific Scala version configuration |
| runtime/spark-tests/build.gradle.kts | Forced ANTLR 4.9.3 for Spark 3.5/Delta 3.3 compatibility |
| plugins/spark/v4.0/spark/* | Complete Spark 4.0 client implementation (catalog, REST, utilities) |
| plugins/spark/v4.0/integration/* | Integration tests for Spark 4.0 client |
| plugins/spark/v4.0/regtests/* | Regression test framework and scripts |
| plugins/spark/v3.5/integration/build.gradle.kts | Forced ANTLR 4.9.3 for consistency |
| plugins/spark/v3.5/README.md | New comprehensive documentation for Spark 3.5 |
| plugins/spark/v4.0/README.md | New comprehensive documentation for Spark 4.0 |
| plugins/spark/README.md | Updated to reference version-specific READMEs |
| plugins/spark/spark-scala.properties | Added Spark 4.0 with Scala 2.13 only |
| pluginlibs.versions.toml | Added Spark 4.0.1 version |
| CHANGELOG.md | Added entry for Spark 4.0 client |
Thanks a lot for working on it, Adam! This is a solid direction. Iceberg followed a very similar approach when introducing support for a new Spark version: copy the existing Spark version client and adapt it accordingly. It keeps the implementation straightforward and reduces the risk of regressions, and we can delete the old versions once users have migrated to the new ones. One thing we might consider is preserving the original git commit history when copying the modules. That can make future debugging and backporting easier, especially when comparing behavior across Spark versions.
```kotlin
configurations.all {
  if (name != "checkstyle") {
    resolutionStrategy {
      force("org.antlr:antlr4-runtime:4.9.3") // Spark 3.5 and Delta 3.3 require ANTLR 4.9.3
    }
  }
}
```
Why do we have to force this version? What pulls in a different one, and why?
So, we have an issue where our integration tests rely on Spark 3.5. Spark 3.5 requires ANTLR 4.9.3, while Spark 4.0 relies on ANTLR 4.13.1.
This was one way to handle this, but there actually might be a cleaner way by putting this in pluginlibs.versions.toml. I'll try that.
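Roughly what I have in mind is declaring the pinned version once in the catalog and referencing it from the build scripts. This is only a sketch; the antlrSpark35 catalog alias below is a placeholder, nothing with that name exists yet:

```kotlin
// pluginlibs.versions.toml (hypothetical entry):
//   [versions]
//   antlrSpark35 = "4.9.3"

configurations.all {
  if (name != "checkstyle") {
    resolutionStrategy {
      // Pin the ANTLR runtime that Spark 3.5 / Delta 3.3 expect, resolving the
      // number from the shared version catalog instead of hard-coding it per build file.
      force("org.antlr:antlr4-runtime:${pluginlibs.versions.antlrSpark35.get()}")
    }
  }
}
```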
If Spark dependencies clash with our main server (I assume this is the case), then leveraging the Apprunner is probably the cleanest solution. I'm pretty sure it was created for exactly this purpose (right, @snazy?). With Apprunner the server build is independent of the test environment, and the tests run the server in isolation. The tests can then use anything they need on the class path without clashing with server deps.
gh-yzou left a comment
Hi @adam-christian-software, thanks a lot for working on this! I took a quick look; I assume there is no actual source code change, and most of the change is related to dependencies and READMEs, right?
plugins/spark/spark-scala.properties (Outdated)

```diff
-sparkVersions=3.5
+sparkVersions=3.5,4.0

 scalaVersions=2.12,2.13
```
Can we have this be scalaVersions.3.5 to be consistent?
What Scala versions does Spark 4 support?
Spark 4 only ships with Scala 2.13; there is no 2.12 support anymore.
Exactly. That's why we have to introduce this. I'll update to be consistent.
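Something like this is what I'm picturing for the version-keyed layout; the exact key names are still up for discussion, so treat this as a sketch of the scalaVersions.3.5 suggestion above rather than the final file:

```properties
# plugins/spark/spark-scala.properties (hypothetical layout)
sparkVersions=3.5,4.0
# Scala versions supported per Spark version
scalaVersions.3.5=2.12,2.13
scalaVersions.4.0=2.13
```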
```kotlin
if (scalaVersion == "2.12") {
  pluginlibs.versions.scala212.get()
} else {
  pluginlibs.versions.scala213.get()
}
```
scala213 is not going to be a valid version for 4.0 anymore, right? I don't think we need this anymore.
I guess you meant scala212 here, @gh-yzou .
Yes, sorry, a typo. I meant scala212 is not needed here anymore.
Ah, you are right. Let me update.
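For the 4.0 build this should collapse to something like the following; scalaLibraryVersion is just a stand-in name for whatever the original expression is assigned to:

```kotlin
// In the Spark 4.0 build the 2.12 branch is dead code, since Spark 4.0 only supports Scala 2.13.
val scalaLibraryVersion = pluginlibs.versions.scala213.get()
```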
plugins/spark/v3.5/README.md (Outdated)

```
[SupportsNamespaces](https://github.com/apache/spark/blob/v3.5.6/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsNamespaces.java),
...
Right now, the plugin only provides support for Spark 3.5, Scala version 2.12 and 2.13,
and depends on iceberg-spark-runtime 1.9.1.
```
It seems we updated the Iceberg version but didn't update the doc here; the compatible version is now 1.10.0 for all Spark clients.
Sorry, I just merged upstream in and it looks like the Iceberg version is the same now.
```java
import java.util.List;
import java.util.Map;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;
```
PolarisCatalog.java, PolarisRESTCatalog, and all classes under rest are actually related to Polaris itself rather than Spark, and shouldn't change along with the Spark version, similar to the rest component in iceberg-core. How about we do a small refactor first to move those parts out as a client REST component, and then let Spark depend on it?
I agree with that approach. The question is when. I would be amenable to doing this after this PR while I'm doing the consolidation because I believe that there is a fair amount of duplication that we can already solve.
What do you think? Do you want this now or are you OK with it going in later?
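Either way, the end state I'm picturing is roughly the wiring below; the :polaris-spark-client-rest module name is purely hypothetical, nothing with that name exists yet:

```kotlin
// plugins/spark/v4.0/spark/build.gradle.kts (sketch)
dependencies {
  // Shared Polaris REST client code (PolarisCatalog, PolarisRESTCatalog, rest/*)
  // extracted into a Spark-version-independent module that both 3.5 and 4.0 depend on.
  implementation(project(":polaris-spark-client-rest"))
}
```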
```java
@Override
public Table loadTable(Identifier identifier) throws NoSuchTableException {
  try {
    GenericTable genericTable =
```
Rahil recently merged the support for Hudi in; can we also incorporate that part?
Personally, I would like to do it post-merge of this PR since we are going to have to integrate the common APIs, but I would be fine delaying this PR by doing another cut of the current client.
What would you like to do?
```
@@ -0,0 +1,83 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
```
I don't see the getting-started for Spark 4.0. Do we plan to reuse the existing getting-started? Shall we move the getting-started out to a common place?
Personally, I was thinking it makes sense to move the getting-started out of the plugin and into the getting-started area where we keep the other docker-compose files.
What do you think?
settings.gradle.kts (Outdated)
```kotlin
// Check if there's a version-specific scalaVersions property, otherwise use the default
val scalaVersionsKey = "scalaVersions.${sparkVersion}"
val scalaVersionsStr =
  if (sparkScalaVersions.containsKey(scalaVersionsKey)) {
```
If we update it so that every Spark version is version-specific, I don't think we need the if here anymore.
I'll update.
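Once every Spark version has its own scalaVersions.<version> entry, the lookup should collapse to something like the sketch below. This assumes sparkScalaVersions is the Properties object loaded from spark-scala.properties, as in the snippet above; the fail-fast error message is just a suggestion:

```kotlin
// Every Spark version declares scalaVersions.<sparkVersion>, so no default fallback is needed.
val scalaVersionsKey = "scalaVersions.${sparkVersion}"
require(sparkScalaVersions.containsKey(scalaVersionsKey)) {
  "Missing '$scalaVersionsKey' in spark-scala.properties"
}
val scalaVersionsStr = sparkScalaVersions[scalaVersionsKey].toString()
```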
Resolved conflict in plugins/spark/README.md by combining both the directory listing for Spark versions and the detailed documentation from upstream.
Context
This is the same change as #3026 but I accidentally destroyed my repo.
This change fixes #3021 by creating the Spark 4 client. Most of the changes simply copy what already exists, adapted to match the proper interfaces.
I will carve out the overlapping business logic into a separate module later.
Test
Moved over the same testing framework that exists in the 3.5 plugin. In addition, I tested E2E with Parquet.
Checklist
- CHANGELOG.md (if needed)
- site/content/in-dev/unreleased (if needed). I need the updates in "test: Add Some Spark Client Tests and Update Documentation on Generic Tables" #3000 first.