
Support update for Delta clustered tables [databricks]#13822

Merged
jihoonson merged 1 commit into NVIDIA:release/25.12 from jihoonson:update-clustered-tables
Nov 21, 2025

Conversation

@jihoonson
Collaborator

@jihoonson jihoonson commented Nov 20, 2025

Fixes #13547

Description

This PR adds support for updating clustered tables for Delta.

I ran some tests on my workstation to compare the performance of the update operation between the CPU and the GPU. The selectivity of the match condition in the query was about 10%.

CREATE TABLE store_sales_clone SHALLOW CLONE delta.`/path/to/sf=100/delta_clustered/store_sales`;

UPDATE store_sales_clone SET
    ss_wholesale_cost = ss_wholesale_cost * 2,
    ss_list_price = ss_list_price * 2,
    ss_sales_price = ss_sales_price * 2,
    ss_ext_sales_price = ss_ext_sales_price * 2,
    ss_ext_wholesale_cost = ss_ext_wholesale_cost * 2,
    ss_ext_list_price = ss_ext_list_price * 2,
    ss_ext_discount_amt = ss_ext_discount_amt * 2
WHERE ss_store_sk <= 20;
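As a rough sanity check on the selectivity figure, selectivity is just matched rows over total rows; a minimal sketch with hypothetical counts (store_sales at SF=100 is roughly 288M rows, and the exact match count for `ss_store_sk <= 20` is not given in the PR):

```python
# Hypothetical counts for illustration only; the PR states ~10% selectivity
# but does not report the underlying row counts.
total_rows = 288_000_000    # approx. store_sales size at SF=100 (assumption)
matched_rows = 28_800_000   # assumed rows matching ss_store_sk <= 20

selectivity = matched_rows / total_rows
print(f"selectivity = {selectivity:.0%}")
```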

The query speedup was:

Means = 46820.0, 35492.666666666664
Time diff = 11327.333333333336
Speedup = 1.3191457390259023
T-Test (test statistic, p value, df) = 7.280031124055662, 0.0018917865950277067, 4.0
T-Test Confidence Interval = 7007.335430528114, 15647.331236138558
ALERT: significant change has been detected (p-value < 0.05)
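For clarity, the reported speedup is simply the ratio of the mean wall-clock times (CPU over GPU), and the time diff is their difference; `df = 4` suggests three runs per configuration. A small sketch of the arithmetic using the means reported above:

```python
# The two means below are the values from the benchmark output above;
# the first is assumed to be the CPU mean, the second the GPU mean.
cpu_mean_ms = 46820.0
gpu_mean_ms = 35492.666666666664

speedup = cpu_mean_ms / gpu_mean_ms        # ratio of mean runtimes, ~1.32x
time_diff_ms = cpu_mean_ms - gpu_mean_ms   # absolute time saved per run

print(f"speedup = {speedup:.4f}, time diff = {time_diff_ms:.1f} ms")
```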

GPU configs:

export SPARK_CONF=("--master" "local[16]"
                   "--conf" "spark.driver.maxResultSize=2GB"
                   "--conf" "spark.driver.memory=16G"
                   "--conf" "spark.sql.files.maxPartitionBytes=2gb"
                   "--conf" "spark.plugins=com.nvidia.spark.SQLPlugin"
                   "--conf" "spark.rapids.memory.host.spillStorageSize=16G"
                   "--conf" "spark.rapids.memory.pinnedPool.size=8g"
                   "--conf" "spark.rapids.sql.concurrentGpuTasks=6"
                   "--conf" "spark.sql.adaptive.coalescePartitions.minPartitionSize=32mb"
                   "--conf" "spark.sql.adaptive.advisoryPartitionSizeInBytes=160mb"
                   "--conf" "spark.shuffle.manager=com.nvidia.spark.rapids.spark356.RapidsShuffleManager"
                   "--conf" "spark.rapids.shuffle.multiThreaded.writer.threads=64"
                   "--conf" "spark.rapids.shuffle.multiThreaded.reader.threads=64"
                   "--conf" "spark.rapids.sql.multiThreadedRead.numThreads=64"
                   "--packages" "io.delta:delta-spark_2.12:3.3.1"
                   "--conf" "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
                   "--conf" "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
                   "--conf" "spark.driver.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR"
                   "--conf" "spark.executor.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR")

CPU configs:

export SPARK_CONF=("--master" "local[32]"
                   "--conf" "spark.rapids.sql.enabled=false"
                   "--conf" "spark.driver.memory=16G"
                   "--conf" "spark.scheduler.minRegisteredResourcesRatio=1.0"
                   "--conf" "spark.driver.extraClassPath=$NDS_LISTENER_JAR"
                   "--conf" "spark.executor.extraClassPath=$NDS_LISTENER_JAR"
                   "--packages" "io.delta:delta-spark_2.12:3.3.1"
                   "--conf" "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
                   "--conf" "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog")
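For reference, a minimal sketch of how a bash array like `SPARK_CONF` is typically consumed when launching Spark; the `spark-sql` binary and the `.sql` file name are placeholders, not taken from the PR:

```shell
# Minimal sketch (assumes bash): expanding "${SPARK_CONF[@]}" passes each
# element as its own argument, keeping "--conf" and "key=value" pairs intact.
SPARK_CONF=("--master" "local[16]"
            "--conf" "spark.driver.memory=16G")
cmd=(spark-sql "${SPARK_CONF[@]}" -f update_store_sales.sql)
printf '%s\n' "${cmd[@]}"
```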

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

@jihoonson
Collaborator Author

build

Signed-off-by: Jihoon Son <ghoonson@gmail.com>
jihoonson force-pushed the update-clustered-tables branch from e2e75d7 to 46a917d on November 20, 2025 at 01:44
@jihoonson
Collaborator Author

build

@greptile-apps
Contributor

greptile-apps bot commented Nov 20, 2025

Greptile Summary

  • Removes the liquid clustering fallback check from UpdateCommandMetaBase to enable GPU acceleration for UPDATE operations on Delta clustered tables
  • Updates test_delta_update_sql_liquid_clustering to verify GPU execution instead of expecting CPU fallback, consistent with earlier clustered table write support

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are minimal and well-aligned with the existing codebase pattern from PR #13417 (Add write support for Delta clustered tables [databricks]), which added write support for clustered tables. The implementation in GpuUpdateCommandBase already handles the update logic correctly, and the test confirms that GPU execution matches CPU results. No additional logic changes are needed beyond removing the fallback check.
  • No files require special attention

Important Files Changed

Filename: delta-lake/common/src/main/delta-33x-40x/scala/com/nvidia/spark/rapids/delta/common/UpdateCommandMetaBase.scala
Overview: Removes the clustered table check to enable GPU acceleration for UPDATE operations on Delta tables with liquid clustering

Sequence Diagram

sequenceDiagram
    participant User
    participant SparkSQL
    participant UpdateCommandMetaBase
    participant GpuUpdateCommandBase
    participant DeltaLog
    participant GPU

    User->>SparkSQL: "UPDATE clustered_table SET e = e+1 WHERE a > 0"
    SparkSQL->>UpdateCommandMetaBase: tagSelfForGpu()
    UpdateCommandMetaBase->>UpdateCommandMetaBase: "Check if Delta write enabled"
    UpdateCommandMetaBase->>UpdateCommandMetaBase: "Check deletion vectors"
    Note over UpdateCommandMetaBase: Clustered table check removed
    UpdateCommandMetaBase->>SparkSQL: "Tagged for GPU execution"
    SparkSQL->>GpuUpdateCommandBase: "run()"
    GpuUpdateCommandBase->>DeltaLog: "withNewTransaction()"
    GpuUpdateCommandBase->>GpuUpdateCommandBase: "performUpdate()"
    GpuUpdateCommandBase->>GPU: "rewriteFiles()"
    GPU->>GpuUpdateCommandBase: "Updated files"
    GpuUpdateCommandBase->>DeltaLog: "commitIfNeeded()"
    DeltaLog->>GpuUpdateCommandBase: "Commit version"
    GpuUpdateCommandBase->>User: "Return affected rows"

Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 1 comment


@jihoonson jihoonson merged commit 7242e32 into NVIDIA:release/25.12 Nov 21, 2025
63 of 64 checks passed
@sameerz sameerz added the feature request New feature or request label Nov 22, 2025