[AutoSparkUT] Read row group containing both dictionary and plain encoded pages - Missing GPU verification test #13739
Description
Describe the bug
The test "Read row group containing both dictionary and plain encoded pages" from ParquetEncodingSuite uses CPU's VectorizedParquetRecordReader directly to verify Parquet file encoding handling. When this test runs with GPU-written Parquet files, the CPU reader fails because it cannot properly parse the GPU's encoding format.
The GPU needs an equivalent verification test that validates the same scenario (mixed dictionary/plain encoding pages) using GPU-compatible reading methods.
Error Message:
java.lang.ArrayIndexOutOfBoundsException: Index 768 out of bounds for length 768
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getDictId(OnHeapColumnVector.java:336)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:429)
Root Cause:
- GPU writes Parquet files with its own encoding strategy (e.g., PLAIN encoding)
- CPU's VectorizedParquetRecordReader expects specific internal encoding structures
- The CPU reader's dictionary ID lookup fails when trying to read GPU-written files
- Failure occurs at row 768 (transition point from dictionary to plain encoding)
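The exception itself is the generic out-of-bounds error you get when a dictionary-ID buffer sized for the first page is indexed past its end. A minimal illustration (plain Scala, not Spark code; the buffer name is hypothetical):

```scala
// The CPU reader materialized dictionary IDs for 768 rows (the first,
// dictionary-encoded page); asking for row index 768 steps past that buffer.
val dictIds = new Array[Int](768)
val outOfBounds =
  try { dictIds(768); false }
  catch { case _: ArrayIndexOutOfBoundsException => true }
println(outOfBounds) // true
```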
Steps/Code to reproduce bug
Maven Test Execution
Result:
[ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0
[ERROR] Failures:
[ERROR] RapidsParquetEncodingSuite.Read row group containing both dictionary and plain encoded pages:143
java.lang.ArrayIndexOutOfBoundsException: Index 768 out of bounds for length 768
Spark-shell Reproduction
The reproduction uses the exact code from the original unit test (lines 117-143 of ParquetEncodingSuite.scala).
Reproduction command:
$SPARK_HOME/bin/spark-shell \
--master local[2] \
--conf spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation,org.apache.spark.sql.catalyst.optimizer.ConstantFolding \
--conf spark.rapids.sql.enabled=true \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.sql.queryExecutionListeners=org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback \
--conf spark.rapids.sql.explain=ALL \
--conf spark.rapids.sql.test.isFoldableNonLitAllowed=true \
--conf spark.rapids.sql.csv.read.decimal.enabled=true \
--conf spark.rapids.sql.format.avro.enabled=true \
--conf spark.rapids.sql.format.avro.read.enabled=true \
--conf spark.rapids.sql.format.hive.text.write.enabled=true \
--conf spark.rapids.sql.format.json.enabled=true \
--conf spark.rapids.sql.format.json.read.enabled=true \
--conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
--conf spark.rapids.sql.python.gpu.enabled=true \
--conf spark.rapids.sql.rowBasedUDF.enabled=true \
--conf spark.rapids.sql.window.collectList.enabled=true \
--conf spark.rapids.sql.window.collectSet.enabled=true \
--conf spark.rapids.sql.window.range.byte.enabled=true \
--conf spark.rapids.sql.window.range.short.enabled=true \
--conf spark.rapids.sql.expression.Ascii=true \
--conf spark.rapids.sql.expression.Conv=true \
--conf spark.rapids.sql.expression.GetJsonObject=true \
--conf spark.rapids.sql.expression.JsonToStructs=true \
--conf spark.rapids.sql.expression.StructsToJson=true \
--conf spark.rapids.sql.exec.CollectLimitExec=true \
--conf spark.rapids.sql.exec.FlatMapCoGroupsInPandasExec=true \
--conf spark.rapids.sql.exec.WindowInPandasExec=true \
--conf spark.rapids.sql.hasExtendedYearValues=false \
--conf spark.unsafe.exceptionOnMemoryLeak=true \
--conf spark.sql.session.timeZone=America/Los_Angeles \
--jars $RAPIDS_JAR \
-i reproduce-gpu-issue-parquet-mixed-encoding.scala
Key code (from original test):
// Configure Parquet encoding
spark.conf.set(ParquetOutputFormat.DICTIONARY_PAGE_SIZE, "2048")
spark.conf.set(ParquetOutputFormat.PAGE_SIZE, "4096")
// Create data: 512 unique values × 3 copies = 1536 rows
// This creates mixed encoding: first page dictionary, rest plain
val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
data.toDF("f").coalesce(1).write.mode("overwrite").parquet(tempDir.getCanonicalPath)
// Try to read using CPU VectorizedParquetRecordReader
// (offHeapEnabled, batchSize, and file come from the surrounding test harness)
val reader = new VectorizedParquetRecordReader(offHeapEnabled, batchSize)
reader.initialize(file, null /* set columns to null to project all columns */)
val column = reader.resultBatch().column(0)
assert(reader.nextBatch())
// Verify data (this is where it fails)
(0 until 512).foreach { i =>
assert(column.getUTF8String(3 * i).toString == i.toString)
assert(column.getUTF8String(3 * i + 1).toString == i.toString)
assert(column.getUTF8String(3 * i + 2).toString == i.toString)
}
Reproduction Result:
✓ GPU successfully writes Parquet file (1536 rows)
✓ CPU reader successfully reads first 768 values (dictionary-encoded page)
✗ CPU reader FAILS at value 769 (transition to plain-encoded page)
✗ Error: ArrayIndexOutOfBoundsException - Index 768 out of bounds for length 768
Expected behavior
GPU should have its own test that validates correct handling of Parquet files with mixed encoding pages (dictionary + plain) within the same row group, using GPU-compatible reading methods (e.g., via SQL path).
The test should verify:
- All data is read correctly (all 1536 values)
- GPU Parquet reader is actually used (check execution plan)
- Both dictionary-encoded and plain-encoded pages are handled correctly
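A hedged sketch of what such a GPU-side test could look like. The SQL read path replaces the direct VectorizedParquetRecordReader call; helper names like withSQLConf, withTempPath, and the plan-capture assertion are assumptions about the surrounding test harness, not confirmed APIs of this suite:

```scala
// Sketch only: validate mixed dictionary/plain pages through the SQL read
// path so the GPU Parquet reader is exercised, instead of the CPU's
// VectorizedParquetRecordReader.
import org.apache.parquet.hadoop.ParquetOutputFormat

test("Read row group containing both dictionary and plain encoded pages (GPU)") {
  withSQLConf(
      ParquetOutputFormat.DICTIONARY_PAGE_SIZE -> "2048",
      ParquetOutputFormat.PAGE_SIZE -> "4096") {
    withTempPath { dir =>
      // 512 unique values x 3 copies = 1536 rows; forces a dictionary page
      // followed by plain-encoded pages in the same row group.
      val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
      data.toDF("f").coalesce(1).write.mode("overwrite").parquet(dir.getCanonicalPath)

      val df = spark.read.parquet(dir.getCanonicalPath)
      val result = df.collect().map(_.getString(0))

      // All 1536 values round-trip, regardless of which page encoded them.
      assert(result.length == 1536)
      assert(result.sorted.sameElements(data.sorted))
      // Also confirm the GPU scan actually ran (assumed helper, e.g. via
      // ExecutionPlanCaptureCallback or an executedPlan string check).
      // assert(df.queryExecution.executedPlan.toString.contains("GpuFileSourceScan"))
    }
  }
}
```

Reading through `spark.read.parquet` keeps the verification on the plugin's own code path, which is the point of the replacement test.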
Environment details
- Spark Version: 3.3.0
- RAPIDS Version: 25.12.0-SNAPSHOT
- CUDA Version: 12.x
- GPU: NVIDIA GPU
- OS: Linux
- Java: OpenJDK 11
Original Spark Test Location:
- File: spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
- Lines: 117-143
- Test name: "Read row group containing both dictionary and plain encoded pages"
Test Configuration:
- Dictionary page size: 2048 bytes
- Page size: 4096 bytes
- Data: 512 unique values × 3 copies = 1536 rows
- This configuration creates ~3 pages: first page dictionary-encoded, remaining pages plain-encoded
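The row layout described above can be checked without Spark; this reproduces only the data generation from the original test:

```scala
// 512 unique string values, each repeated 3 times consecutively = 1536 rows.
val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
assert(data.length == 1536)
assert(data.slice(0, 3) == Seq("0", "0", "0"))          // copies are adjacent
assert(data.slice(1533, 1536) == Seq("511", "511", "511"))
println(data.distinct.length) // 512
```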