Add license and notice to spark client jar and push jar to maven #1830


Merged: 5 commits, Jun 7, 2025
@@ -133,6 +133,11 @@ constructor(private val softwareComponentFactory: SoftwareComponentFactory) : Pl

suppressPomMetadataWarningsFor("testFixturesApiElements")
suppressPomMetadataWarningsFor("testFixturesRuntimeElements")

if (project.tasks.findByName("createPolarisSparkJar") != null) {
  // if the project builds the Spark client jar, also publish that jar to Maven
  artifact(project.tasks.named("createPolarisSparkJar").get())
}
}

if (
8 changes: 4 additions & 4 deletions plugins/spark/README.md
@@ -30,8 +30,8 @@ and depends on iceberg-spark-runtime 1.9.0.

# Build Plugin Jar
A task `createPolarisSparkJar` is added to build a jar for the Polaris Spark plugin. The jar is named:
-`polaris-iceberg-<icebergVersion>-spark-runtime-<sparkVersion>_<scalaVersion>-<polarisVersion>.jar`. For example:
-`polaris-iceberg-1.9.0-spark-runtime-3.5_2.12-0.10.0-beta-incubating-SNAPSHOT.jar`.
+`polaris-spark-<sparkVersion>_<scalaVersion>-<polarisVersion>-bundle.jar`. For example:
+`polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar`.

- `./gradlew :polaris-spark-3.5_2.12:createPolarisSparkJar` -- build jar for Spark 3.5 with Scala version 2.12.
- `./gradlew :polaris-spark-3.5_2.13:createPolarisSparkJar` -- build jar for Spark 3.5 with Scala version 2.13.
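As a quick illustration of the naming scheme above, the components can be pulled apart with plain POSIX shell string handling (the jar name is the README's own example; nothing here touches the build):

```shell
# Decompose the new bundle jar name into its documented components.
JAR="polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar"
base=${JAR%-bundle.jar}        # strip the "bundle" classifier and extension
rest=${base#polaris-spark-}    # -> 3.5_2.12-0.11.0-beta-incubating-SNAPSHOT
spark_scala=${rest%%-*}        # <sparkVersion>_<scalaVersion> -> 3.5_2.12
polaris_version=${rest#*-}     # <polarisVersion> -> 0.11.0-beta-incubating-SNAPSHOT
echo "spark/scala: ${spark_scala}, polaris: ${polaris_version}"
```

Note that, per the review discussion below, the Iceberg version is intentionally not part of this name.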
@@ -67,12 +67,12 @@ bin/spark-shell \
```

Assume the path to the built Spark client jar is
-`/polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-iceberg-1.9.0-spark-runtime-3.5_2.12-0.10.0-beta-incubating-SNAPSHOT.jar`
+`/polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar`
and the name of the catalog is `polaris`. The CLI command will look like the following:

```shell
bin/spark-shell \
-  --jars /polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-iceberg-1.9.0-spark-runtime-3.5_2.12-0.10.0-beta-incubating-SNAPSHOT.jar \
+  --jars /polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar \
--packages org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
@@ -266,7 +266,7 @@
"from pyspark.sql import SparkSession\n",
"\n",
"spark = (SparkSession.builder\n",
-" .config(\"spark.jars\", \"../polaris_libs/polaris-iceberg-1.9.0-spark-runtime-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT.jar\")\n",
+" .config(\"spark.jars\", \"../polaris_libs/polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar\")\n",
" .config(\"spark.jars.packages\", \"org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.2.1\")\n",
" .config(\"spark.sql.catalog.spark_catalog\", \"org.apache.spark.sql.delta.catalog.DeltaCatalog\")\n",
" .config('spark.sql.iceberg.vectorization.enabled', 'false')\n",
2 changes: 1 addition & 1 deletion plugins/spark/v3.5/regtests/run.sh
@@ -72,7 +72,7 @@ for SCALA_VERSION in "${SCALA_VERSIONS[@]}"; do
echo "RUN REGRESSION TEST FOR SPARK_MAJOR_VERSION=${SPARK_MAJOR_VERSION}, SPARK_VERSION=${SPARK_VERSION}, SCALA_VERSION=${SCALA_VERSION}"
# find the project jar
SPARK_DIR=${SPARK_ROOT_DIR}/spark
-JAR_PATH=$(find ${SPARK_DIR} -name "polaris-iceberg-*.*-spark-runtime-${SPARK_MAJOR_VERSION}_${SCALA_VERSION}-*.jar" -print -quit)
+JAR_PATH=$(find ${SPARK_DIR} -name "polaris-spark-${SPARK_MAJOR_VERSION}_${SCALA_VERSION}-*.*-bundle.jar" -print -quit)
echo "find jar ${JAR_PATH}"

SPARK_EXISTS="TRUE"
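The updated glob can be sanity-checked without running a build; a small sketch using a dummy jar in a temp directory (the jar name is the README's example and stands in for real build output):

```shell
# Verify the run.sh find pattern matches the new bundle jar name.
SPARK_MAJOR_VERSION=3.5
SCALA_VERSION=2.12
TMP_SPARK_DIR=$(mktemp -d)
# Dummy file standing in for the real build artifact.
touch "${TMP_SPARK_DIR}/polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar"
JAR_PATH=$(find "${TMP_SPARK_DIR}" -name "polaris-spark-${SPARK_MAJOR_VERSION}_${SCALA_VERSION}-*.*-bundle.jar" -print -quit)
echo "find jar ${JAR_PATH}"
```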
325 changes: 325 additions & 0 deletions plugins/spark/v3.5/spark/LICENSE

Large diffs are not rendered by default.

16 changes: 16 additions & 0 deletions plugins/spark/v3.5/spark/NOTICE
@@ -0,0 +1,16 @@
Apache Polaris (incubating)
Copyright 2025 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

The initial code for the Polaris project was donated
to the ASF by Snowflake Inc. (https://www.snowflake.com/) copyright 2024.

--------------------------------------------------------------------------------

This project includes code from Project Nessie, developed at Dremio,
with the following copyright notice:

| Nessie
| Copyright 2015-2025 Dremio Corporation
11 changes: 7 additions & 4 deletions plugins/spark/v3.5/spark/build.gradle.kts
@@ -151,13 +151,16 @@ tasks.register("checkNoDisallowedImports") {
tasks.named("check") { dependsOn("checkNoDisallowedImports") }

 tasks.register<ShadowJar>("createPolarisSparkJar") {
-  archiveClassifier = null
-  archiveBaseName =
-    "polaris-iceberg-${icebergVersion}-spark-runtime-${sparkMajorVersion}_${scalaVersion}"
+  archiveClassifier = "bundle"
flyrain (Contributor) commented Jun 7, 2025:

Thinking about it a bit more, I think it'd be nice to have the Iceberg version in the package name. Otherwise debugging would be tricky: users couldn't easily figure out the Iceberg lib version.

Contributor:

Wondering if we could apply the Iceberg version as we did for the Scala version.

gh-yzou (author) commented Jun 7, 2025:

The jar name now follows `<project_name>-<polaris_version>.jar`, and the Scala version is embedded in the project name, e.g. `spark-3.5_2.12`. I don't think it is a good idea to put the Iceberg version in the project name, since the Iceberg version may get updated more frequently.

I don't think we can touch the Polaris version either, so the classifier is the only place left; however, it seems weird to have the Iceberg version come after the Polaris version. What we can do is publish the Iceberg client compatibility on the webpage, given that in the long term we may not need the Iceberg version in the package name at all.

Contributor:

Users can open up the jar file to figure out the embedded Iceberg version, but my concern is that a lot of users don't know that or couldn't do it easily. I think it's still valuable to surface the Iceberg version as long as it's embedded, unless we figure out a way not to embed it.

> I don't think it is a good idea to have the iceberg version in the project name, since the iceberg version might get updated more frequently.

We need to release a new version of this plugin every time we update the embedded Iceberg version.

Contributor:

> it seems weird that the iceberg version is after the polaris version.

True. The only option is to play with the project name.

Contributor:

It's not a blocker to me for this PR. We could figure it out later.

gh-yzou (author):

> We need to release a new version of this plugin every time we update the embedded Iceberg version.

I don't think we need to release a new version of this plugin every time we update the embedded Iceberg version; we only need to release a new version each time we do a release, and whatever Iceberg version ships with that release is the version, right? However, if we have it in the project name, then the project name has to be updated every time the Iceberg version is updated, which could be very frequent, and I don't think that is a good idea.

I tried a couple of ways to retain the original jar name and haven't found a good one yet; we can continue investigating. However, I hope that in the long term we don't have to ship the Iceberg runtime in the bundle, and I think we should definitely have a doc/webpage somewhere documenting the compatible Iceberg version, regardless of whether it is in the name.

Contributor:

> only need to release a new version every time we do a release

+1

> if we have it in the project name, then the project name has to be updated every time the iceberg version is updated

I don't think we need to do that.

> i hope in the long term we don't have to ship the iceberg runtime in the bundle, and we should have a doc/webpage somewhere documenting the compatible iceberg version regardless of whether it is in the name.

Thanks for the investigation! Let's document it first.

gh-yzou (author):

> I don't think we need to do that.

Sorry, I meant every time we update the Iceberg dependency version in Polaris, for example from 1.7 -> 1.8 -> 1.9; I don't mean every time a new Iceberg version becomes available.

Actually, after some thought, I don't think we should include the Iceberg version in the package name. The Polaris client is a client we provide for Spark to communicate with Polaris, and the Spark/Scala version is used to indicate Scala compatibility when choosing which library to use. For Iceberg, even though we ship an Iceberg runtime along with our package today, we don't really guarantee any compatibility with the Iceberg client; in other words, users cannot use their own iceberg-runtime jar with the Polaris jar we provide. The Iceberg client is more like a dependency we ship with, which I think makes more sense to document correctly instead of adding it to the package name, which could actually confuse users in other ways as well.

   isZip64 = true

-  // pack both the source code and dependencies
+  // include the LICENSE and NOTICE files for the shadow Jar
+  from(projectDir) {
+    include("LICENSE")
+    include("NOTICE")
+  }
+
+  // pack both the source code and dependencies
   from(sourceSets.main.get().output)
   configurations = listOf(project.configurations.runtimeClasspath.get())
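To confirm that a built bundle jar actually carries LICENSE and NOTICE at its root, listing the jar entries is enough. A sketch using a stand-in archive built with Python's stdlib `zipfile` CLI, since a jar is just a zip file (a real jar would sit under the project's `build/<scala>/libs` directory):

```shell
# Build a stand-in "jar" with LICENSE and NOTICE at its root, then list it.
WORK=$(mktemp -d)
cd "${WORK}"
printf 'Apache License 2.0 ...\n' > LICENSE               # placeholder content
printf 'Apache Polaris (incubating)\n' > NOTICE           # placeholder content
python3 -m zipfile -c bundle.jar LICENSE NOTICE           # create the archive
python3 -m zipfile -l bundle.jar                          # list its entries
```

On a real build, point the same `-l` listing at the bundle jar produced by `createPolarisSparkJar`.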

2 changes: 1 addition & 1 deletion site/content/in-dev/unreleased/polaris-spark-client.md
@@ -71,7 +71,7 @@ bin/spark-shell \
--conf spark.sql.catalog.<spark-catalog-name>.scope='PRINCIPAL_ROLE:ALL' \
--conf spark.sql.catalog.<spark-catalog-name>.token-refresh-enabled=true
```
-Assume the released Polaris Spark client you want to use is `org.apache.polaris:polaris-iceberg-1.8.1-spark-runtime-3.5_2.12:1.0.0`,
+Assume the released Polaris Spark client you want to use is `org.apache.polaris:polaris-spark-3.5_2.12:1.0.0`,
flyrain (Contributor) commented Jun 7, 2025:

Are we sure Spark can download the bundle jar with the following config? Can we test it out? I'm not sure how to test it against a local Maven repo, though; one way is to merge this into the nightly Maven repo first, then test it out.

--packages org.apache.polaris:polaris-spark-3.5_2.12:1.0.0

gh-yzou (author):

I haven't tested it out yet. Once it is in the nightly Maven repo, I can test it out and update the doc if needed.

replace the `polaris-spark-client-package` field with the release.

The `spark-catalog-name` is the catalog name you will use with Spark, and `polaris-catalog-name` is the catalog name used