[SPARK-56374][BUILD] Align SBT assembly shade rules with Maven#55307
[SPARK-56374][BUILD] Align SBT assembly shade rules with Maven#55307yadavay-amzn wants to merge 1 commit intoapache:masterfrom
Conversation
|
Hi @yadavay-amzn, your GA seems still disabled. Could you confirm it? |
|
Hi @yadavay-amzn I have a few concerns about this PR.
cc: @LuciferYang who raised SPARK-56374 |
|
@yadavay-amzn Thank you for submitting the pr,I’d like to clarify the intent behind this PR.: Why I’m proposing this workRight now,
The end result is that an SBT-built Spark distribution cannot replace a Maven-built one at runtime. Downstream code that relies on shaded relocated classes will fail. The release process currently uses Maven builds as the source of truth, which means we are unable to use a more efficient method for version building and release. What “done” looks likeMy target end state:
Explicit non-goals:
Where the work lives
Read all these poms first. They are the specification. The modules covered above are the scope we need to align. How to testWhat I can think of now is that tests for PySpark and Connect should use the shaded jars to verify normal runtime behavior. The YARN shuffle service needs to start properly and load Netty native libraries correctly. And for Spark Connect JDBC, the client should run normally with no class-not-found errors. This is quite a challenging task, and I’m really glad you’re interested in it. Feel free to ping me anytime if there are updates. Thanks ~ |
…nt assembly Add org.apache.arrow relocation to SparkConnectClient (jvm) and SparkConnectJdbc assembly shade rules in SparkBuild.scala, matching Maven's client pom.xml. Maven shades Arrow classes to org/sparkproject/org/apache/arrow/ in the Connect client JAR. SBT was missing this rule, leaving 998 Arrow classes unshaded at org/apache/arrow/ inside the assembly JAR. This is a small incremental step toward full SBT/Maven assembly parity (SPARK-56374). The Connect server assembly already matches Maven (identical class count and shaded namespaces). Other gaps (extra unshaded transitive dependencies in SBT client JARs, network-yarn native library renaming, CopyDependencies) are left for follow-up work.
2254ce1 to
8d37bec
Compare
|
Thank you @sarutak for the thorough review and @LuciferYang for the detailed clarification of the scope and goals. I want to acknowledge that my initial PR misunderstood the scope of SPARK-56374. The original Guava/thirdparty shade rules I added to the Connect server were wrong — as @sarutak correctly pointed out, Guava is not bundled inside the server assembly JAR, so shading references to it would create dangling pointers at runtime. I have reverted those changes. After doing a deeper investigation comparing Maven and SBT assembly JARs side by side, I found:
I have updated this PR to only add the Arrow shade rule for the two client modules (jvm and jdbc). I also ran runtime tests as @sarutak suggested — 113 Arrow serialization/deserialization tests and 58 SparkConnectClientSuite tests all pass. I understand from @LuciferYang's comment that the full scope of SPARK-56374 is much larger — achieving byte-equivalent JARs across all shaded modules, including I'd be very interested in continuing to work on the remaining alignment if you'd be open to providing guidance. I'm happy to tackle it incrementally, module by module, following the pom.xml specifications @LuciferYang listed. Please let me know if that would be helpful or if you'd prefer to approach it differently. |
|
@sarutak BTW, looks like Github Actions are disabled at the account level for me |
|
Regarding the broader direction — while I understand the goal of achieving SBT/Maven parity, I'm not sure this work should be prioritized right now. The differences between SBT and Maven artifacts are covered by the Maven-based daily CI builds, and beyond that, no significant issues caused by the SBT/Maven shading gap have been reported so far. On the other hand, modifying the SBT build carries a risk of breaking something that is currently working fine. Given the balance between that risk and the benefit gained, I think the priority of this effort is relatively low at this point. WDYT @LuciferYang ? |


What changes were proposed in this pull request?
Add the missing
org.apache.arrowshade rule toSparkConnectClient(jvm) andSparkConnectJdbcassembly settings inproject/SparkBuild.scala.This is a small incremental step toward full SBT/Maven assembly parity (SPARK-56374).
Why are the changes needed?
Maven's
sql/connect/client/jvm/pom.xmlrelocates Arrow classes toorg/sparkproject/org/apache/arrow/inside the client assembly JAR. SBT was missing this rule, leaving 998 Arrow classes unshaded atorg/apache/arrow/inside the assembly JAR.Since Arrow classes are bundled inside the client assembly JAR (not on the external classpath), they should be shaded like the other bundled dependencies (grpc, netty, protobuf, etc.) to avoid classpath conflicts when the client JAR is used alongside other libraries that depend on a different Arrow version.
Comparison of Connect client JVM assembly JARs (before this fix):
org/sparkproject/io/grpc/org/sparkproject/connect/client/io/grpc/✅org/sparkproject/io/netty/org/sparkproject/connect/client/io/netty/✅org/sparkproject/connect/guava/org/sparkproject/connect/client/com/google/common/✅org/sparkproject/org/apache/arrow/(998 classes)org/apache/arrow/— NOT SHADED ❌After this fix, Arrow is shaded to
org/sparkproject/connect/client/org/apache/arrow/.Note: The Connect server assembly already matches Maven (identical class count of 5739 and identical shaded namespaces). No server changes are needed.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Build verification:
JAR content verification:
jar tfconfirms Arrow classes now appear underorg/sparkproject/connect/client/org/apache/arrow/with zero unshadedorg/apache/arrow/entriesRuntime tests:
connect-client-jvm/testOnly org.apache.spark.sql.connect.client.arrow.*— 113/113 passed (Arrow serialization/deserialization)connect-client-jvm/testOnly org.apache.spark.sql.connect.client.SparkConnectClientSuite— 58/58 passedconnect-client-jvm/test— 1141/1166 passed (1 pre-existing failure inJavaEncoderSuiteconfirmed on master, 24 E2E suites require running server)Was this patch authored or co-authored using generative AI tooling?
Yes