[AutoSparkUT] Read row group containing both dictionary and plain encoded pages - Missing GPU verification test #13739

@wjxiz1992

Description

Describe the bug

The test "Read row group containing both dictionary and plain encoded pages" from ParquetEncodingSuite uses CPU's VectorizedParquetRecordReader directly to verify Parquet file encoding handling. When this test runs with GPU-written Parquet files, the CPU reader fails because it cannot properly parse the GPU's encoding format.

The GPU needs an equivalent verification test that validates the same scenario (mixed dictionary/plain encoding pages) using GPU-compatible reading methods.

Error Message:

java.lang.ArrayIndexOutOfBoundsException: Index 768 out of bounds for length 768
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getDictId(OnHeapColumnVector.java:336)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:429)

Root Cause:

  • GPU writes Parquet files with its own encoding strategy (e.g., PLAIN encoding)
  • CPU's VectorizedParquetRecordReader expects specific internal encoding structures
  • CPU reader's dictionary ID lookup fails when trying to read GPU-written files
  • Failure occurs at index 768, the first value past the dictionary-encoded page (the transition point from dictionary to plain encoding)
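The shape of the failure can be illustrated with a minimal pure-Scala sketch. The `dictIds` array here is a stand-in for `OnHeapColumnVector`'s internal dictionary-id buffer, which only covers the dictionary-encoded portion of the column; the real reader is more involved, but the out-of-bounds access is the same:

```scala
// Sketch: the CPU reader's dictionary-id buffer covers only the
// dictionary-encoded rows (first 768 here). Indexing past that
// boundary throws the reported ArrayIndexOutOfBoundsException.
object DictIdSketch {
  def main(args: Array[String]): Unit = {
    val dictEncodedRows = 768                  // rows covered by the dictionary page
    val dictIds = new Array[Int](dictEncodedRows)

    // Mirrors the shape of OnHeapColumnVector.getDictId (not the real code).
    def getDictId(rowId: Int): Int = dictIds(rowId)

    assert(getDictId(767) == 0)                // last dictionary-encoded row: fine
    try {
      getDictId(768)                           // first plain-encoded row: out of bounds
      sys.error("expected ArrayIndexOutOfBoundsException")
    } catch {
      case e: ArrayIndexOutOfBoundsException =>
        // On JDK 9+ the message reads "Index 768 out of bounds for length 768"
        println(s"Reproduced: ${e.getMessage}")
    }
  }
}
```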

Steps/Code to reproduce bug

Maven Test Execution
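A typical single-suite invocation (hypothetical: `-DwildcardSuites` is the scalatest-maven-plugin suite selector; any `-pl`/`-P` module and profile flags depend on the local build layout):

```shell
# Hypothetical command -- suite selection via scalatest-maven-plugin's
# wildcardSuites property; add -pl/-P flags as your build requires.
mvn test -DwildcardSuites=RapidsParquetEncodingSuite
```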

Result:

[ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0
[ERROR] Failures: 
[ERROR]   RapidsParquetEncodingSuite.Read row group containing both dictionary and plain encoded pages:143
java.lang.ArrayIndexOutOfBoundsException: Index 768 out of bounds for length 768

Spark-shell Reproduction

The reproduction uses the exact code from the original unit test (lines 117-143 of ParquetEncodingSuite.scala).

Reproduction command:

$SPARK_HOME/bin/spark-shell \
  --master local[2] \
  --conf spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation,org.apache.spark.sql.catalyst.optimizer.ConstantFolding \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.sql.queryExecutionListeners=org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback \
  --conf spark.rapids.sql.explain=ALL \
  --conf spark.rapids.sql.test.isFoldableNonLitAllowed=true \
  --conf spark.rapids.sql.csv.read.decimal.enabled=true \
  --conf spark.rapids.sql.format.avro.enabled=true \
  --conf spark.rapids.sql.format.avro.read.enabled=true \
  --conf spark.rapids.sql.format.hive.text.write.enabled=true \
  --conf spark.rapids.sql.format.json.enabled=true \
  --conf spark.rapids.sql.format.json.read.enabled=true \
  --conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
  --conf spark.rapids.sql.python.gpu.enabled=true \
  --conf spark.rapids.sql.rowBasedUDF.enabled=true \
  --conf spark.rapids.sql.window.collectList.enabled=true \
  --conf spark.rapids.sql.window.collectSet.enabled=true \
  --conf spark.rapids.sql.window.range.byte.enabled=true \
  --conf spark.rapids.sql.window.range.short.enabled=true \
  --conf spark.rapids.sql.expression.Ascii=true \
  --conf spark.rapids.sql.expression.Conv=true \
  --conf spark.rapids.sql.expression.GetJsonObject=true \
  --conf spark.rapids.sql.expression.JsonToStructs=true \
  --conf spark.rapids.sql.expression.StructsToJson=true \
  --conf spark.rapids.sql.exec.CollectLimitExec=true \
  --conf spark.rapids.sql.exec.FlatMapCoGroupsInPandasExec=true \
  --conf spark.rapids.sql.exec.WindowInPandasExec=true \
  --conf spark.rapids.sql.hasExtendedYearValues=false \
  --conf spark.unsafe.exceptionOnMemoryLeak=true \
  --conf spark.sql.session.timeZone=America/Los_Angeles \
  --jars $RAPIDS_JAR \
  -i reproduce-gpu-issue-parquet-mixed-encoding.scala

Key code (from original test):

// Configure Parquet encoding
spark.conf.set(ParquetOutputFormat.DICTIONARY_PAGE_SIZE, "2048")
spark.conf.set(ParquetOutputFormat.PAGE_SIZE, "4096")

// Create data: 512 unique values × 3 copies = 1536 rows
// This creates mixed encoding: first page dictionary, rest plain
val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
data.toDF("f").coalesce(1).write.mode("overwrite").parquet(tempDir.getCanonicalPath)

// Try to read back using the CPU VectorizedParquetRecordReader
// (offHeapEnabled, batchSize, and file are defined in the surrounding suite)
val reader = new VectorizedParquetRecordReader(offHeapEnabled, batchSize)
reader.initialize(file, null /* set columns to null to project all columns */)
val column = reader.resultBatch().column(0)
assert(reader.nextBatch())

// Verify data (this is where it fails)
(0 until 512).foreach { i =>
  assert(column.getUTF8String(3 * i).toString == i.toString)
  assert(column.getUTF8String(3 * i + 1).toString == i.toString)
  assert(column.getUTF8String(3 * i + 2).toString == i.toString)
}

Reproduction Result:

✓ GPU successfully writes Parquet file (1536 rows)
✓ CPU reader successfully reads first 768 values (dictionary-encoded page)
✗ CPU reader FAILS at index 768, the 769th value (transition to plain-encoded page)
✗ Error: ArrayIndexOutOfBoundsException - Index 768 out of bounds for length 768

Expected behavior

GPU should have its own test that validates correct handling of Parquet files with mixed encoding pages (dictionary + plain) within the same row group, using GPU-compatible reading methods (e.g., via SQL path).

The test should verify:

  • All data is read correctly (all 1536 values)
  • GPU Parquet reader is actually used (check execution plan)
  • Both dictionary-encoded and plain-encoded pages are handled correctly
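A sketch of such a verification, reading the file back through the DataFrame API instead of driving `VectorizedParquetRecordReader` directly, so the plugin can substitute its own Parquet scan. This assumes a Spark session with the RAPIDS plugin enabled and reuses `tempDir` and the write step from the original test; the plan-capture helper at the end is an assumption based on the `ExecutionPlanCaptureCallback` listener configured in the repro command:

```scala
// Hypothetical GPU-side verification via the SQL path.
val df = spark.read.parquet(tempDir.getCanonicalPath)
val result = df.collect().map(_.getString(0)).sorted

// All 1536 values survive the dictionary->plain transition.
assert(result.length == 1536)
val expected = (0 until 512).flatMap(i => Seq.fill(3)(i.toString)).sorted
assert(result.sameElements(expected))

// Confirm the GPU scan was actually used (exact helper API is an
// assumption; the listener is configured in the repro command above):
// ExecutionPlanCaptureCallback.assertContains(df, "GpuFileSourceScanExec")
```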

Environment details

  • Spark Version: 3.3.0
  • RAPIDS Version: 25.12.0-SNAPSHOT
  • CUDA Version: 12.x
  • GPU: NVIDIA GPU
  • OS: Linux
  • Java: OpenJDK 11

Original Spark Test Location:

  • File: spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
  • Lines: 117-143
  • Test name: "Read row group containing both dictionary and plain encoded pages"

Test Configuration:

  • Dictionary page size: 2048 bytes
  • Page size: 4096 bytes
  • Data: 512 unique values × 3 copies = 1536 rows
  • This configuration creates ~3 pages: first page dictionary-encoded, remaining pages plain-encoded
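The fallback point can be estimated with plain arithmetic. Assuming each BYTE_ARRAY dictionary entry costs a 4-byte length prefix plus its UTF-8 bytes (the standard Parquet PLAIN layout for a dictionary page), the dictionary outgrows the 2048-byte limit well before all 512 unique values are seen, which is what forces the writer to fall back to plain encoding for later pages (the exact row at which the reader fails also depends on data-page boundaries):

```scala
// Estimate when the Parquet dictionary exceeds DICTIONARY_PAGE_SIZE (2048 B).
// Assumption: each entry is a 4-byte length prefix + UTF-8 bytes of i.toString.
object DictGrowth {
  def main(args: Array[String]): Unit = {
    val limit = 2048
    var bytes = 0
    var overflowAt = -1
    for (i <- 0 until 512 if overflowAt < 0) {
      bytes += 4 + i.toString.length
      if (bytes > limit) overflowAt = i
    }
    // Prints: dictionary exceeds 2048 bytes at unique value #308
    println(s"dictionary exceeds $limit bytes at unique value #$overflowAt")
  }
}
```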

Metadata

Labels: bug (Something isn't working)
