Feature Request / Improvement
Today we lean heavily on PyArrow to read the Parquet files, but this has some big disadvantages:
- PyArrow does not treat field-IDs as first-class citizens. Therefore, we first have to get the physical schema (from the Parquet files) and then prune that schema based on field-IDs.
- We have to post-process the buffers to apply schema evolution. For example, if a table has promoted an integer to a long, Iceberg does not rewrite the data files with the new type. Instead, when we see an integer at read-time, we promote the buffer to a long. Ideally we want to push this down to the reader right away.
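
To make the two points above concrete, here is a minimal pure-Python sketch of the resolution a reader has to do per data file: match columns by field-ID rather than by name, and decide where a type promotion is needed. The names `Field`, `ALLOWED_PROMOTIONS`, and `plan_projection` are hypothetical and only for illustration; this is not the PyArrow, PyIceberg, or Iceberg-Rust API, and the promotion table only lists the two widenings used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical, simplified column description (not a real PyIceberg type)."""
    field_id: int
    name: str
    type: str  # e.g. "int", "long", "string"


# Illustrative subset of type promotions: (type in file, type in table).
ALLOWED_PROMOTIONS = {("int", "long"), ("float", "double")}


def plan_projection(file_schema, table_schema):
    """Return (table column, file column, action) for each requested field.

    Columns are matched by field-ID, so renames are handled and file
    columns that are not requested are pruned.
    """
    by_id = {f.field_id: f for f in file_schema}
    plan = []
    for want in table_schema:
        have = by_id.get(want.field_id)
        if have is None:
            # The data file was written before this column existed.
            plan.append((want.name, None, "missing"))
        elif have.type == want.type:
            plan.append((want.name, have.name, "read"))
        elif (have.type, want.type) in ALLOWED_PROMOTIONS:
            # e.g. the file stores an int, the table now declares a long.
            plan.append((want.name, have.name, "promote"))
        else:
            raise ValueError(
                f"cannot read {have.type} as {want.type} for field-ID {want.field_id}"
            )
    return plan


# Physical schema of an old data file.
file_schema = [
    Field(1, "id", "int"),
    Field(2, "name", "string"),
    Field(3, "tmp", "string"),
]
# Current table schema: field 1 was renamed and promoted, field 3 is not requested.
table_schema = [
    Field(1, "user_id", "long"),
    Field(2, "name", "string"),
]

for step in plan_projection(file_schema, table_schema):
    print(step)
```

Today this kind of plan is effectively computed and applied in Python after PyArrow has already read the buffers; the request is to have the reader apply it directly.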
If we could push this down into Iceberg-Rust, and return references to Arrow buffers back to PyIceberg, that would be great. We can start simple by still applying the merge-on-read deletes in PyIceberg, and then move that over to Iceberg-Rust step by step.