[AutoSparkUT] Read row group containing both dictionary and plain encoded pages - Missing GPU verification test #13739
Description
Describe the bug
The test "Read row group containing both dictionary and plain encoded pages" from ParquetEncodingSuite uses CPU's VectorizedParquetRecordReader directly to verify Parquet file encoding handling. When this test runs with GPU-written Parquet files, the CPU reader fails because it cannot properly parse the GPU's encoding format.
The GPU needs an equivalent verification test that validates the same scenario (mixed dictionary/plain encoding pages) using GPU-compatible reading methods.
Error Message:
java.lang.ArrayIndexOutOfBoundsException: Index 768 out of bounds for length 768
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getDictId(OnHeapColumnVector.java:336)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:429)
Root Cause:
- GPU writes Parquet files with its own encoding strategy (e.g., PLAIN encoding)
- CPU's VectorizedParquetRecordReader expects specific internal encoding structures
- The CPU reader's dictionary ID lookup fails when trying to read GPU-written files
- Failure occurs at row 768 (transition point from dictionary to plain encoding)
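The exception itself is the generic out-of-bounds error you get when a dictionary-ID buffer sized for the first page is indexed past its end. A minimal illustration (plain Scala, not Spark code; the buffer name is hypothetical):

```scala
// The CPU reader materialized dictionary IDs for 768 rows (the first,
// dictionary-encoded page); asking for row index 768 steps past that buffer.
val dictIds = new Array[Int](768)
val outOfBounds =
  try { dictIds(768); false }
  catch { case _: ArrayIndexOutOfBoundsException => true }
println(outOfBounds) // true
```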
Steps/Code to reproduce bug
Maven Test Execution
Result:
[ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0
[ERROR] Failures:
[ERROR] RapidsParquetEncodingSuite.Read row group containing both dictionary and plain encoded pages:143
java.lang.ArrayIndexOutOfBoundsException: Index 768 out of bounds for length 768
Spark-shell Reproduction
The reproduction uses the exact code from the original unit test (lines 117-143 of ParquetEncodingSuite.scala).
Reproduction command:
$SPARK_HOME/bin/spark-shell \
--master local[2] \
--conf spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation,org.apache.spark.sql.catalyst.optimizer.ConstantFolding \
--conf spark.rapids.sql.enabled=true \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.sql.queryExecutionListeners=org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback \
--conf spark.rapids.sql.explain=ALL \
--conf spark.rapids.sql.test.isFoldableNonLitAllowed=true \
--conf spark.rapids.sql.csv.read.decimal.enabled=true \
--conf spark.rapids.sql.format.avro.enabled=true \
--conf spark.rapids.sql.format.avro.read.enabled=true \
--conf spark.rapids.sql.format.hive.text.write.enabled=true \
--conf spark.rapids.sql.format.json.enabled=true \
--conf spark.rapids.sql.format.json.read.enabled=true \
--conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
--conf spark.rapids.sql.python.gpu.enabled=true \
--conf spark.rapids.sql.rowBasedUDF.enabled=true \
--conf spark.rapids.sql.window.collectList.enabled=true \
--conf spark.rapids.sql.window.collectSet.enabled=true \
--conf spark.rapids.sql.window.range.byte.enabled=true \
--conf spark.rapids.sql.window.range.short.enabled=true \
--conf spark.rapids.sql.expression.Ascii=true \
--conf spark.rapids.sql.expression.Conv=true \
--conf spark.rapids.sql.expression.GetJsonObject=true \
--conf spark.rapids.sql.expression.JsonToStructs=true \
--conf spark.rapids.sql.expression.StructsToJson=true \
--conf spark.rapids.sql.exec.CollectLimitExec=true \
--conf spark.rapids.sql.exec.FlatMapCoGroupsInPandasExec=true \
--conf spark.rapids.sql.exec.WindowInPandasExec=true \
--conf spark.rapids.sql.hasExtendedYearValues=false \
--conf spark.unsafe.exceptionOnMemoryLeak=true \
--conf spark.sql.session.timeZone=America/Los_Angeles \
--jars $RAPIDS_JAR \
-i reproduce-gpu-issue-parquet-mixed-encoding.scala
Key code (from original test):
// Configure Parquet encoding
spark.conf.set(ParquetOutputFormat.DICTIONARY_PAGE_SIZE, "2048")
spark.conf.set(ParquetOutputFormat.PAGE_SIZE, "4096")
// Create data: 512 unique values × 3 copies = 1536 rows
// This creates mixed encoding: first page dictionary, rest plain
val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
data.toDF("f").coalesce(1).write.mode("overwrite").parquet(tempDir.getCanonicalPath)
// Try to read using CPU VectorizedParquetRecordReader
// (offHeapEnabled, batchSize, and file come from the surrounding test harness)
val reader = new VectorizedParquetRecordReader(offHeapEnabled, batchSize)
reader.initialize(file, null /* set columns to null to project all columns */)
val column = reader.resultBatch().column(0)
assert(reader.nextBatch())
// Verify data (this is where it fails)
(0 until 512).foreach { i =>
assert(column.getUTF8String(3 * i).toString == i.toString)
assert(column.getUTF8String(3 * i + 1).toString == i.toString)
assert(column.getUTF8String(3 * i + 2).toString == i.toString)
}
Reproduction Result:
✓ GPU successfully writes Parquet file (1536 rows)
✓ CPU reader successfully reads first 768 values (dictionary-encoded page)
✗ CPU reader FAILS at value 769 (transition to plain-encoded page)
✗ Error: ArrayIndexOutOfBoundsException - Index 768 out of bounds for length 768
Expected behavior
GPU should have its own test that validates correct handling of Parquet files with mixed encoding pages (dictionary + plain) within the same row group, using GPU-compatible reading methods (e.g., via SQL path).
The test should verify:
- All data is read correctly (all 1536 values)
- GPU Parquet reader is actually used (check execution plan)
- Both dictionary-encoded and plain-encoded pages are handled correctly
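A hedged sketch of what such a GPU-side test could look like. The SQL read path replaces the direct VectorizedParquetRecordReader call; helper names like withSQLConf, withTempPath, and the plan-capture assertion are assumptions about the surrounding test harness, not confirmed APIs of this suite:

```scala
// Sketch only: validate mixed dictionary/plain pages through the SQL read
// path so the GPU Parquet reader is exercised, instead of the CPU's
// VectorizedParquetRecordReader.
import org.apache.parquet.hadoop.ParquetOutputFormat

test("Read row group containing both dictionary and plain encoded pages (GPU)") {
  withSQLConf(
      ParquetOutputFormat.DICTIONARY_PAGE_SIZE -> "2048",
      ParquetOutputFormat.PAGE_SIZE -> "4096") {
    withTempPath { dir =>
      // 512 unique values x 3 copies = 1536 rows; forces a dictionary page
      // followed by plain-encoded pages in the same row group.
      val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
      data.toDF("f").coalesce(1).write.mode("overwrite").parquet(dir.getCanonicalPath)

      val df = spark.read.parquet(dir.getCanonicalPath)
      val result = df.collect().map(_.getString(0))

      // All 1536 values round-trip, regardless of which page encoded them.
      assert(result.length == 1536)
      assert(result.sorted.sameElements(data.sorted))
      // Also confirm the GPU scan actually ran (assumed helper, e.g. via
      // ExecutionPlanCaptureCallback or an executedPlan string check).
      // assert(df.queryExecution.executedPlan.toString.contains("GpuFileSourceScan"))
    }
  }
}
```

Reading through `spark.read.parquet` keeps the verification on the plugin's own code path, which is the point of the replacement test.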
Environment details
- Spark Version: 3.3.0
- RAPIDS Version: 25.12.0-SNAPSHOT
- CUDA Version: 12.x
- GPU: NVIDIA GPU
- OS: Linux
- Java: OpenJDK 11
Original Spark Test Location:
- File: spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
- Lines: 117-143
- Test name: "Read row group containing both dictionary and plain encoded pages"
Test Configuration:
- Dictionary page size: 2048 bytes
- Page size: 4096 bytes
- Data: 512 unique values × 3 copies = 1536 rows
- This configuration creates ~3 pages: first page dictionary-encoded, remaining pages plain-encoded
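The row layout described above can be checked without Spark; this reproduces only the data generation from the original test:

```scala
// 512 unique string values, each repeated 3 times consecutively = 1536 rows.
val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
assert(data.length == 1536)
assert(data.slice(0, 3) == Seq("0", "0", "0"))          // copies are adjacent
assert(data.slice(1533, 1536) == Seq("511", "511", "511"))
println(data.distinct.length) // 512
```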