Skip to content

[AutoSparkUT]Add Dataset, DataFrameFunctions and ColumnExpression suites#14157

Merged
GaryShen2008 merged 3 commits intoNVIDIA:mainfrom
GaryShen2008:add-dataset-suite
Jan 16, 2026
Merged

[AutoSparkUT]Add Dataset, DataFrameFunctions and ColumnExpression suites#14157
GaryShen2008 merged 3 commits intoNVIDIA:mainfrom
GaryShen2008:add-dataset-suite

Conversation

@GaryShen2008
Copy link
Copy Markdown
Collaborator

Enable 3 Spark UT suites:

  • RapidsDatasetSuite
  • RapidsDataFrameFunctionsSuite
  • RapidsColumnExpressionSuite

Summary:

  • 4 failed cases in RapidsColumnExpressionSuite, filed 3 issues
  • 2 failed cases in RapidsDataFrameFunctionsSuite, filed 2 issues, 1 case replaced by testRapids
  • 2 failed cases in RapidsDatasetSuite with 1 issue, 4 cases replaced by testRapids

Signed-off-by: Gary Shen <gashen@nvidia.com>
RapidsDataFrameFunctionsSuite and RapidsColumnExpressionSuite

Signed-off-by: Gary Shen <gashen@nvidia.com>
Copilot AI review requested due to automatic review settings January 15, 2026 09:29
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request enables three Spark unit test suites for GPU execution: RapidsColumnExpressionSuite, RapidsDataFrameFunctionsSuite, and RapidsDatasetSuite. It adds test suite configuration entries and implements custom test cases for scenarios where GPU behavior differs from CPU (primarily ordering differences).

Changes:

  • Added configuration entries in RapidsTestSettings.scala to enable three test suites with appropriate exclusions for known issues and adjusted tests
  • Created RapidsColumnExpressionSuite.scala that extends Spark's ColumnExpressionSuite for GPU execution
  • Created RapidsDataFrameFunctionsSuite.scala with a custom testRapids implementation for array_intersect that handles non-deterministic ordering
  • Created RapidsDatasetSuite.scala with four custom testRapids implementations that sort results to handle non-deterministic ordering

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
RapidsTestSettings.scala Added suite configurations with test exclusions for known issues and adjusted tests
RapidsColumnExpressionSuite.scala New test suite extending Spark's ColumnExpressionSuite for GPU validation
RapidsDataFrameFunctionsSuite.scala New test suite with custom array_intersect test handling ordering differences
RapidsDatasetSuite.scala New test suite with four custom tests that sort results for consistent ordering

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Jan 15, 2026

Greptile Summary

This PR enables three Spark unit test suites for GPU execution: RapidsColumnExpressionSuite (145 tests), RapidsDataFrameFunctionsSuite (105 tests), and RapidsDatasetSuite (161 tests).

Key Changes:

  • Added three new test suite files that extend original Spark test suites with RapidsSQLTestsTrait for GPU execution
  • Updated RapidsTestSettings.scala to enable the suites with appropriate exclusions for known issues (8 failed tests with 6 filed issues)
  • Implemented custom testRapids versions for 5 tests that require result sorting due to non-deterministic ordering in GPU execution
  • All test files use proper Spark 330 shim configuration and follow established patterns

Test Results:

  • RapidsColumnExpressionSuite: 4 failed cases with 3 filed issues
  • RapidsDataFrameFunctionsSuite: 2 failed cases with 2 filed issues, 1 replaced by testRapids
  • RapidsDatasetSuite: 2 failed cases with 1 issue, 4 replaced by testRapids

All failures are properly documented with GitHub issue links and appropriate exclusion reasons (KNOWN_ISSUE or ADJUST_UT).

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The PR follows established patterns for enabling Spark unit tests, properly handles known failures with documented issues, and includes appropriate testRapids replacements for non-deterministic ordering cases
  • No files require special attention

Important Files Changed

Filename Overview
tests/src/test/spark330/scala/org/apache/spark/sql/rapids/suites/RapidsColumnExpressionSuite.scala New test suite extending Spark's ColumnExpressionSuite to run 145 column expression tests on GPU with proper shim configuration
tests/src/test/spark330/scala/org/apache/spark/sql/rapids/suites/RapidsDataFrameFunctionsSuite.scala New test suite extending Spark's DataFrameFunctionsSuite with custom testRapids implementation for array_intersect that handles non-deterministic ordering
tests/src/test/spark330/scala/org/apache/spark/sql/rapids/suites/RapidsDatasetSuite.scala New test suite extending Spark's DatasetSuite with 4 testRapids implementations that sort results to handle non-deterministic ordering in GPU execution
tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala Configuration updates enabling 3 new test suites with appropriate exclusions for known issues and adjusted tests

Sequence Diagram

sequenceDiagram
    participant Test as Test Execution
    participant Suite as RapidsTestSuite
    participant Trait as RapidsSQLTestsTrait
    participant GPU as GPU Executor
    participant Settings as RapidsTestSettings

    Test->>Settings: Load test configuration
    Settings->>Settings: enableSuite[RapidsColumnExpressionSuite]
    Settings->>Settings: enableSuite[RapidsDataFrameFunctionsSuite]
    Settings->>Settings: enableSuite[RapidsDatasetSuite]
    Settings->>Settings: Apply exclusions for known issues

    Test->>Suite: Execute test suite
    Suite->>Trait: Inherit from RapidsSQLTestsTrait
    
    alt Inherited Test
        Suite->>Trait: Run original Spark test
        Trait->>GPU: Execute on GPU via checkAnswer override
        GPU->>Trait: Return GPU results
    else testRapids Override
        Suite->>Suite: Run custom testRapids implementation
        Note over Suite: Sort results for deterministic ordering
        Suite->>Trait: Compare results via checkAnswer
        Trait->>GPU: Execute on GPU
        GPU->>Trait: Return GPU results
    end

    Trait->>Trait: Compare expected vs actual results
    Trait->>Test: Return test result (pass/fail)
Loading

Signed-off-by: Gary Shen <gashen@nvidia.com>
@GaryShen2008 GaryShen2008 changed the title Add Dataset, DataFrameFunctions and ColumnExpression suites [AutoSparkUT]Add Dataset, DataFrameFunctions and ColumnExpression suites Jan 15, 2026
@GaryShen2008 GaryShen2008 added the test Only impacts tests label Jan 15, 2026
@GaryShen2008
Copy link
Copy Markdown
Collaborator Author

build

@GaryShen2008 GaryShen2008 merged commit 1cb64bc into NVIDIA:main Jan 16, 2026
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test Only impacts tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants