
Conversation

@adam-christian-software
Contributor

Context

This is the same change as #3026 but I accidentally destroyed my repo.

This change fixes #3021 by creating the Spark 4 client. Most of the changes simply copy what already exists and adapt it to match the proper interfaces.

I will carve out the overlapping business logic into a separate module later.

Test

Moved the same testing framework over from the 3.5 plugin. In addition, I tested E2E with Parquet; a sketch of that kind of smoke test is shown below.
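
(For reference, a smoke test along those lines can be driven from a small Spark job. Below is a minimal sketch in Kotlin; it assumes the Spark 4.0 client is configured the same way as the documented 3.5 plugin, and the catalog name, endpoint, warehouse, and class name are illustrative rather than confirmed for this PR.)

import org.apache.spark.sql.SparkSession

fun main() {
  // Assumes the Polaris Spark client jar is on the classpath and a Polaris server
  // is reachable at the URI below; option names mirror the 3.5 plugin's setup.
  val spark = SparkSession.builder()
    .appName("polaris-spark4-parquet-smoke")
    .config("spark.sql.catalog.polaris", "org.apache.polaris.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.uri", "http://localhost:8181/api/catalog")
    .config("spark.sql.catalog.polaris.warehouse", "quickstart_catalog")
    .getOrCreate()

  // Create a namespace and a Parquet-backed table, write two rows, and read them back.
  spark.sql("CREATE NAMESPACE IF NOT EXISTS polaris.smoke")
  spark.sql("CREATE TABLE IF NOT EXISTS polaris.smoke.t (id INT, name STRING) USING parquet")
  spark.sql("INSERT INTO polaris.smoke.t VALUES (1, 'a'), (2, 'b')")
  spark.sql("SELECT * FROM polaris.smoke.t").show()
  spark.stop()
}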

Checklist

  • 🛡️ Don't disclose security issues! (contact [email protected])
  • 🔗 Clearly explained why the changes are needed, or linked related issues: Fixes #
  • 🧪 Added/updated tests with good coverage, or manually tested (and explained how)
  • 💡 Added comments for complex logic
  • 🧾 Updated CHANGELOG.md (if needed)
  • [BLOCKED] 📚 Updated documentation in site/content/in-dev/unreleased (if needed). I need the updates in "test: Add Some Spark Client Tests and Update Documentation on Generic Tables" (#3000) first.

@adam-christian-software
Contributor Author

@flyrain & @gh-yzou - This is the same thing I did here (#3026) before I accidentally blew away my repo. I'd love to have a conversation about whether you would be fine merging this in, or whether you want to do the module separation first to clean this up. I know there is a user who wants this, and having a first version will help us be responsive to our users.

Copilot AI left a comment

Pull request overview

This PR introduces support for Spark 4.0 by creating a new client that mirrors the existing Spark 3.5 implementation. The changes primarily involve copying and adapting existing code to work with Spark 4.0's interfaces and dependencies, with appropriate configuration changes to support Scala 2.13 exclusively for this version.

Key Changes:

  • Created a complete Spark 4.0 client implementation in plugins/spark/v4.0/
  • Added version-specific Scala configuration to support different Scala versions per Spark version
  • Updated build configurations to handle ANTLR version conflicts between Spark/Delta dependencies
  • Included comprehensive test suites (unit, integration, and regression tests)
  • Added documentation and Docker setup for testing

Reviewed changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 1 comment.

File | Description
settings.gradle.kts | Added version-specific Scala version configuration
runtime/spark-tests/build.gradle.kts | Forced ANTLR 4.9.3 for Spark 3.5/Delta 3.3 compatibility
plugins/spark/v4.0/spark/* | Complete Spark 4.0 client implementation (catalog, REST, utilities)
plugins/spark/v4.0/integration/* | Integration tests for Spark 4.0 client
plugins/spark/v4.0/regtests/* | Regression test framework and scripts
plugins/spark/v3.5/integration/build.gradle.kts | Forced ANTLR 4.9.3 for consistency
plugins/spark/v3.5/README.md | New comprehensive documentation for Spark 3.5
plugins/spark/v4.0/README.md | New comprehensive documentation for Spark 4.0
plugins/spark/README.md | Updated to reference version-specific READMEs
plugins/spark/spark-scala.properties | Added Spark 4.0 with Scala 2.13 only
pluginlibs.versions.toml | Added Spark 4.0.1 version
CHANGELOG.md | Added entry for Spark 4.0 client


@flyrain
Contributor

flyrain commented Dec 1, 2025

Thanks a lot for working on it, Adam! This is a solid direction. Iceberg followed a very similar approach when introducing new Spark version support, copying the existing Spark version client and adapting it accordingly. It keeps the implementation straightforward and reduces the risk of regressions. And we could delete the old versions once users have migrated to the new ones.

One thing we might consider is preserving the original git commit history when copying the modules. That can make future debugging and backporting work easier, especially when comparing behaviors across Spark versions.

configurations.all {
  if (name != "checkstyle") {
    resolutionStrategy {
      force("org.antlr:antlr4-runtime:4.9.3") // Spark 3.5 and Delta 3.3 require ANTLR 4.9.3

Contributor

Why do we have to force this version? Who tries to make it different and why?

Contributor Author

So, we have an issue where our integration tests rely on Spark 3.5. Spark 3.5 requires ANTLR 4.9.3, while Spark 4.0 relies on ANTLR 4.13.1.

This was one way to handle this, but there actually might be a cleaner way by putting this in pluginlibs.versions.toml. I'll try that.
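
(As a rough illustration of that alternative: a minimal sketch, assuming a hypothetical antlr-runtime-spark35 entry in pluginlibs.versions.toml; the alias name is made up.)

// build.gradle.kts (sketch): pin ANTLR from the version catalog instead of hard-coding it,
// assuming pluginlibs.versions.toml gains a line like: antlr-runtime-spark35 = "4.9.3"
configurations.all {
  if (name != "checkstyle") {
    resolutionStrategy {
      // Spark 3.5 and Delta 3.3 expect ANTLR 4.9.3, while Spark 4.0 pulls in 4.13.1,
      // so the test configurations pin the runtime explicitly.
      force("org.antlr:antlr4-runtime:${pluginlibs.versions.antlr.runtime.spark35.get()}")
    }
  }
}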

Contributor

If Spark dependencies clash with our main server (I assume this is the case), then leveraging the Apprunner is probably the cleanest solution. I'm sure it was created exactly for this purpose (right, @snazy?). With Apprunner, the server build is independent of the test environment and the tests run the server in isolation. The tests can then use anything they need on the classpath without clashing with server deps.

Contributor

@gh-yzou left a comment

Hi @adam-christian-software, thanks a lot for working on this! I took a quick look; I assume there is no actual src code change, and most of the change is related to dependencies and READMEs, right?

sparkVersions=3.5
sparkVersions=3.5,4.0

scalaVersions=2.12,2.13

Contributor

Can we have this be scalaVersions.3.5 to be consistent?

Contributor

What Scala versions does Spark 4 support?

Contributor

Spark 4 only ships with Scala 2.13; there is no 2.12 support any more.

Contributor Author

@adam-christian-software Dec 5, 2025

Exactly. That's why we have to introduce this. I'll update to be consistent.

if (scalaVersion == "2.12") {
  pluginlibs.versions.scala212.get()
} else {
  pluginlibs.versions.scala213.get()

Contributor

scala213 is not going to be a valid version for 4.0 anymore, right? I don't think we need this anymore.

Contributor

I guess you meant scala212 here, @gh-yzou.

Contributor

Yes, sorry, a typo; I meant scala212 is not needed here anymore.

Contributor Author

Ah, you are right. Let me update.

[SupportsNamespaces](https://github.com/apache/spark/blob/v3.5.6/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsNamespaces.java),

Right now, the plugin only provides support for Spark 3.5, Scala version 2.12 and 2.13,
and depends on iceberg-spark-runtime 1.9.1.

Contributor

It seems we updated the Iceberg version but didn't update the doc here; the compatible version now is 1.10.0 for all Spark clients.

Contributor Author

Sorry, I just merged upstream in and it looks like the Iceberg version is the same now.

import java.util.List;
import java.util.Map;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;

Contributor

PolarisCatalog.java, PolarisRESTCatalog, and all classes under rest are actually related to Polaris itself rather than Spark, and shouldn't change along with the Spark version, similar to the REST component in iceberg-core. How about we do a small refactor first to move those parts out as a client REST component, and then let Spark depend on it?

Contributor Author

I agree with that approach. The question is when. I would be amenable to doing this after this PR while I'm doing the consolidation because I believe that there is a fair amount of duplication that we can already solve.

What do you think? Do you want this now or are you OK with it going in later?
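
(For reference, the shape of that refactor might look roughly like this: a minimal sketch of the Spark module's build file, where the shared module name polaris-spark-rest-client is hypothetical.)

// plugins/spark/v4.0/spark/build.gradle.kts (sketch)
dependencies {
  // Version-specific Spark catalog code stays in this module, while PolarisCatalog,
  // PolarisRESTCatalog, and the rest/ package would move to a shared module that
  // both the 3.5 and 4.0 clients depend on.
  implementation(project(":polaris-spark-rest-client"))
}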

@Override
public Table loadTable(Identifier identifier) throws NoSuchTableException {
  try {
    GenericTable genericTable =

Contributor

Rahil recently merged the support for Hudi; can we also incorporate that part here?

Contributor Author

Personally, I would like to do it post-merge of this PR since we are going to have to integrate the common APIs, but I would be fine delaying this PR by doing another cut of the current client.

What would you like to do?

@@ -0,0 +1,83 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one

Contributor

I don't see the getting-started for Spark 4.0. Do we plan to reuse the existing getting-started? Shall we move the getting-started out to be common?

Contributor Author

Personally, I was thinking it made sense to move the getting-started out of the plugin and into the getting-started area where we keep the other docker-compose files.

What do you think?

// Check if there's a version-specific scalaVersions property, otherwise use the default
val scalaVersionsKey = "scalaVersions.${sparkVersion}"
val scalaVersionsStr =
  if (sparkScalaVersions.containsKey(scalaVersionsKey)) {

Contributor

If we update it so every Spark version has its own entry, I don't think we need the if here anymore.

Contributor Author

I'll update.
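
(For reference, once every Spark version has its own scalaVersions.<version> entry, the lookup could reduce to something like the following: a minimal sketch reusing the names from the snippet above; the property values in the comment are illustrative.)

// settings.gradle.kts (sketch): each Spark version declares its own Scala versions,
// e.g. scalaVersions.3.5=2.12,2.13 and scalaVersions.4.0=2.13, so no default fallback is needed.
val scalaVersionsKey = "scalaVersions.${sparkVersion}"
val scalaVersionsStr =
  sparkScalaVersions[scalaVersionsKey]?.toString()
    ?: error("Missing $scalaVersionsKey in spark-scala.properties")
val scalaVersions = scalaVersionsStr.split(",").map { it.trim() }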

Resolved conflict in plugins/spark/README.md by combining both the directory
listing for Spark versions and the detailed documentation from upstream.

Development

Successfully merging this pull request may close these issues.

Create Spark 4 client

4 participants