[SPARK-56374][BUILD] Align SBT assembly shade rules with Maven #55307

Open
yadavay-amzn wants to merge 1 commit into apache:master from yadavay-amzn:fix/SPARK-56374-sbt-shade-alignment

yadavay-amzn commented Apr 11, 2026

What changes were proposed in this pull request?

Add the missing org.apache.arrow shade rule to SparkConnectClient (jvm) and SparkConnectJdbc assembly settings in project/SparkBuild.scala.

This is a small incremental step toward full SBT/Maven assembly parity (SPARK-56374).

Why are the changes needed?

Maven's sql/connect/client/jvm/pom.xml relocates Arrow classes to org/sparkproject/org/apache/arrow/ inside the client assembly JAR. SBT was missing this rule, leaving 998 Arrow classes unshaded at org/apache/arrow/ inside the assembly JAR.

Since Arrow classes are bundled inside the client assembly JAR (not on the external classpath), they should be shaded like the other bundled dependencies (grpc, netty, protobuf, etc.) to avoid classpath conflicts when the client JAR is used alongside other libraries that depend on a different Arrow version.

Comparison of Connect client JVM assembly JARs (before this fix):

| Dependency | Maven | SBT |
|---|---|---|
| grpc | org/sparkproject/io/grpc/ | org/sparkproject/connect/client/io/grpc/ |
| netty | org/sparkproject/io/netty/ | org/sparkproject/connect/client/io/netty/ |
| guava | org/sparkproject/connect/guava/ | org/sparkproject/connect/client/com/google/common/ |
| arrow | org/sparkproject/org/apache/arrow/ (998 classes) | org/apache/arrow/ (NOT SHADED) |

After this fix, Arrow is shaded to org/sparkproject/connect/client/org/apache/arrow/.
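
As a rough illustration of what a relocation rule does to entry paths inside the JAR, here is a minimal Python sketch. The `relocate` helper and the sample entry are hypothetical and are not part of the Spark build; real shading also rewrites bytecode references, which this sketch does not model. The two target prefixes are the ones quoted in this description.

```python
def relocate(entry: str, source_prefix: str, target_prefix: str) -> str:
    """Return the entry path after relocation, or the entry unchanged if the rule does not match."""
    if entry.startswith(source_prefix):
        return target_prefix + entry[len(source_prefix):]
    return entry


ARROW_PREFIX = "org/apache/arrow/"
# Prefix used by the SBT client assembly after this fix (from the description above).
SBT_TARGET = "org/sparkproject/connect/client/org/apache/arrow/"
# Prefix used by Maven's client pom.xml (from the comparison table above).
MAVEN_TARGET = "org/sparkproject/org/apache/arrow/"

# Sample entry chosen for illustration only.
sample = "org/apache/arrow/vector/IntVector.class"
print(relocate(sample, ARROW_PREFIX, SBT_TARGET))
print(relocate(sample, ARROW_PREFIX, MAVEN_TARGET))
```

Note that the two printed paths differ: this change shades Arrow in the SBT build, but the SBT prefix still contains `connect/client/`, as it does for grpc and netty in the table.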

Note: The Connect server assembly already matches Maven (identical class count of 5739 and identical shaded namespaces). No server changes are needed.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Build verification:

build/sbt connect-client-jvm/assembly   # ✅ success
build/sbt connect-client-jdbc/assembly  # ✅ success

JAR content verification:

  • jar tf confirms Arrow classes now appear under org/sparkproject/connect/client/org/apache/arrow/ with zero unshaded org/apache/arrow/ entries
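
The same check can be scripted instead of reading `jar tf` output by eye. The following is a minimal sketch using only Python's standard `zipfile` module; the function names are hypothetical and the JAR path is left to the caller, since the assembly output location is not stated here.

```python
import zipfile

SHADED_PREFIX = "org/sparkproject/connect/client/org/apache/arrow/"
UNSHADED_PREFIX = "org/apache/arrow/"


def count_classes(jar_path: str, prefix: str) -> int:
    """Count .class entries in the JAR whose path starts with the given prefix."""
    with zipfile.ZipFile(jar_path) as jar:
        return sum(
            1
            for name in jar.namelist()
            if name.startswith(prefix) and name.endswith(".class")
        )


def check_arrow_shaded(jar_path: str) -> bool:
    """True if Arrow classes exist under the shaded prefix and none remain unshaded."""
    shaded = count_classes(jar_path, SHADED_PREFIX)
    unshaded = count_classes(jar_path, UNSHADED_PREFIX)
    return shaded > 0 and unshaded == 0
```

Calling `check_arrow_shaded` with the path of the built client assembly JAR should return `True` when the expectation in the bullet above holds.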

Runtime tests:

  • connect-client-jvm/testOnly org.apache.spark.sql.connect.client.arrow.*: 113/113 passed (Arrow serialization/deserialization)
  • connect-client-jvm/testOnly org.apache.spark.sql.connect.client.SparkConnectClientSuite: 58/58 passed
  • Full connect-client-jvm/test: 1141/1166 passed (1 pre-existing failure in JavaEncoderSuite, confirmed on master; 24 E2E suites require a running server)

Was this patch authored or co-authored using generative AI tooling?

Yes

yadavay-amzn (Author) commented

I tried closing and re-opening #55306 to trigger the automated check workflows, but that didn't work.

I opened this fresh PR, but still no checks were triggered.
Requesting help from committers. cc: @sarutak

Tested the builds locally and they passed.

sarutak (Member) commented Apr 11, 2026

Hi @yadavay-amzn, your GitHub Actions (GA) still seem to be disabled.
[Screenshot: yadavay-amzn-ga-disabled]

Could you confirm it?

sarutak (Member) commented Apr 13, 2026

Hi @yadavay-amzn, I have a few concerns about this PR.

  1. When all modules are built with SBT, Guava and Arrow are never relocated in any module. So the issue described in the PR description (referencing unshaded com.google.common.* failing at runtime) should not occur in a pure SBT build — all modules consistently reference the original namespaces.

  2. The described problem could occur if some modules are built with Maven (where the parent pom's shade plugin relocates Guava globally) and Spark Connect is built with SBT. However, mixing Maven-built and SBT-built JARs is not a normal workflow.

  3. Since SBT does not relocate Guava or Arrow in other modules, applying these relocation rules only to the Connect assembly JARs would cause the relocated references (e.g., org.sparkproject.guava.*) to point to classes that don't exist on the classpath. This could actually introduce runtime issues rather than fix them.

  4. The "How was this patch tested?" section verifies that the assembly builds succeed and that bytecode was rewritten, but this is a compile-time / packaging-time check. The concern here is about runtime behavior — whether the relocated references can actually be resolved. A runtime test (e.g., bin/spark-shell --remote local with the SBT-built distribution) would be more appropriate to validate this change.

cc: @LuciferYang who raised SPARK-56374

LuciferYang (Contributor) commented Apr 14, 2026

@yadavay-amzn Thank you for submitting this PR. I'd like to clarify the intent behind it:

Why I’m proposing this work

Right now, build/sbt package / build/sbt assembly and mvn package produce different JARs from the same Spark source code. Most differences come from how shading is handled:

  • Maven uses maven-shade-plugin to relocate dependencies into org.sparkproject.*. SBT often uses different prefixes or doesn’t relocate at all.
  • Some Maven modules include only a small set of dependencies in their final JAR. SBT typically bundles the full transitive closure, leading to larger JARs and potential class conflicts.
  • common/network-yarn uses an antrun step in Maven to rename native Netty libraries into the shaded namespace. SBT has no equivalent logic.
  • SBT’s CopyDependencies doesn’t always pick the shaded JARs when assembling the distribution.

The end result is that an SBT-built Spark distribution cannot replace a Maven-built one at runtime. Downstream code that relies on shaded relocated classes will fail. The release process currently uses Maven builds as the source of truth, which means we are unable to use a more efficient method for version building and release.

What “done” looks like

My target end state:

  1. Byte-equivalent JARs. For every module that Maven shades, sbt assembly produces a JAR with identical class layout, relocation prefixes, and included dependencies. A jar tf diff should show only trivial differences like timestamps.
  2. Runtime compatibility. A distribution built with dev/make-distribution.sh --sbt must pass the full PySpark test suite, Spark Connect client/server flows, and the YARN external shuffle service.
  3. Build-time validation. We may need to add a new SBT task that fails the build directly if the shaded JAR is missing expected relocated classes or still contains unshaded packages.
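
To make the first and third points more concrete, here is a minimal Python sketch of a layout comparison and an unshaded-package check. It is not an SBT task; all function names are hypothetical, and the allowed prefixes in the usage note are an assumption rather than the project's actual policy. It compares entry names only, so it approximates the `jar tf` diff and does not establish byte equivalence.

```python
import zipfile


def class_entries(jar_path: str) -> set:
    """Return the set of .class entry paths contained in the JAR."""
    with zipfile.ZipFile(jar_path) as jar:
        return {name for name in jar.namelist() if name.endswith(".class")}


def layout_diff(maven_jar: str, sbt_jar: str):
    """Return (entries only in the Maven JAR, entries only in the SBT JAR), both sorted."""
    maven = class_entries(maven_jar)
    sbt = class_entries(sbt_jar)
    return sorted(maven - sbt), sorted(sbt - maven)


def unshaded_entries(jar_path: str, allowed_prefixes) -> list:
    """Return class entries that fall outside every allowed package prefix."""
    allowed = tuple(allowed_prefixes)
    return sorted(
        name for name in class_entries(jar_path) if not name.startswith(allowed)
    )
```

A build-time check could fail when `layout_diff` returns non-empty lists, or when `unshaded_entries(jar, ("org/apache/spark/", "org/sparkproject/"))` is non-empty; those two prefixes are illustrative only and would need to be derived from the poms.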

Explicit non-goals:

  • Replacing maven-shade-plugin or changing Maven behavior. Maven remains the source of truth.
  • Refactoring the overall SBT build structure. Work is limited to assembly and shading settings and downstream consumers like CopyDependencies.
  • Modifying runtime code in core, sql, connect, or network-*. No bytecode changes outside build definitions.

Where the work lives

  • Only file modified: project/SparkBuild.scala
  • Maven source of truth:
    • Root pom.xml (inherited shade rules)
    • core/pom.xml
    • sql/core/pom.xml (note combine.self="override")
    • sql/connect/server/pom.xml
    • sql/connect/common/pom.xml (note combine.self="override", SPARK-54177)
    • sql/connect/client/jvm/pom.xml
    • sql/connect/client/jdbc/pom.xml
    • common/network-yarn/pom.xml
    • streaming/pom.xml
    • connector/protobuf/pom.xml (already partially handled in SBT — verify parity)
    • connector/kafka-0-10-assembly/pom.xml
    • connector/kinesis-asl-assembly/pom.xml

Read all these poms first. They are the specification. The modules covered above are the scope we need to align.

How to test

What I can think of now is that tests for PySpark and Connect should use the shaded jars to verify normal runtime behavior. The YARN shuffle service needs to start properly and load Netty native libraries correctly. And for Spark Connect JDBC, the client should run normally with no class-not-found errors.

This is quite a challenging task, and I’m really glad you’re interested in it. Feel free to ping me anytime if there are updates. Thanks ~

…nt assembly

Add org.apache.arrow relocation to SparkConnectClient (jvm) and
SparkConnectJdbc assembly shade rules in SparkBuild.scala, matching
Maven's client pom.xml.

Maven shades Arrow classes to org/sparkproject/org/apache/arrow/ in
the Connect client JAR. SBT was missing this rule, leaving 998 Arrow
classes unshaded at org/apache/arrow/ inside the assembly JAR.

This is a small incremental step toward full SBT/Maven assembly
parity (SPARK-56374). The Connect server assembly already matches
Maven (identical class count and shaded namespaces). Other gaps
(extra unshaded transitive dependencies in SBT client JARs,
network-yarn native library renaming, CopyDependencies) are left
for follow-up work.

yadavay-amzn force-pushed the fix/SPARK-56374-sbt-shade-alignment branch from 2254ce1 to 8d37bec on April 14, 2026, 20:32

yadavay-amzn (Author) commented

Thank you @sarutak for the thorough review and @LuciferYang for the detailed clarification of the scope and goals.

I want to acknowledge that my initial PR misunderstood the scope of SPARK-56374. The original Guava/thirdparty shade rules I added to the Connect server were wrong — as @sarutak correctly pointed out, Guava is not bundled inside the server assembly JAR, so shading references to it would create dangling pointers at runtime. I have reverted those changes.

After doing a deeper investigation comparing Maven and SBT assembly JARs side by side, I found:

  1. Connect server: Already identical between Maven and SBT (5739 classes, same shaded namespaces). No fix needed — @sarutak was right.
  2. Connect client/jvm: Arrow classes (998) are bundled inside the assembly JAR but not shaded in SBT, while Maven shades them to org/sparkproject/org/apache/arrow/. This is the one real shade rule gap I could confirm.
  3. Connect client/jvm: SBT bundles ~9000 extra unshaded transitive classes (jackson, json4s, jline, kryo, etc.) that Maven excludes. This is an assemblyExcludedJars gap, separate from shade rules.

I have updated this PR to only add the Arrow shade rule for the two client modules (jvm and jdbc). I also ran runtime tests as @sarutak suggested — 113 Arrow serialization/deserialization tests and 58 SparkConnectClientSuite tests all pass.

I understand from @LuciferYang's comment that the full scope of SPARK-56374 is much larger — achieving byte-equivalent JARs across all shaded modules, including core, network-yarn (native Netty renaming), streaming, CopyDependencies, and more. This PR only addresses a small piece of that.

I'd be very interested in continuing to work on the remaining alignment if you'd be open to providing guidance. I'm happy to tackle it incrementally, module by module, following the pom.xml specifications @LuciferYang listed. Please let me know if that would be helpful or if you'd prefer to approach it differently.

yadavay-amzn (Author) commented

@sarutak BTW, it looks like GitHub Actions is disabled at the account level for me.
I've created a support ticket to fix it.
[Screenshot 2026-04-14 at 1 50 10 PM]

sarutak (Member) commented Apr 15, 2026

Regarding the broader direction — while I understand the goal of achieving SBT/Maven parity, I'm not sure this work should be prioritized right now.

The differences between SBT and Maven artifacts are covered by the Maven-based daily CI builds, and beyond that, no significant issues caused by the SBT/Maven shading gap have been reported so far. On the other hand, modifying the SBT build carries a risk of breaking something that is currently working fine. Given the balance between that risk and the benefit gained, I think the priority of this effort is relatively low at this point.

WDYT @LuciferYang ?
