Delegate reading DataFiles to Iceberg-Rust #2396

### Feature Request / Improvement

Today we lean heavily on PyArrow to read the Parquet files, but this has some significant disadvantages:

- PyArrow does not treat field IDs as first-class citizens. We therefore have to fetch the physical schema (from the Parquet files) first and then [prune the schema](https://github.com/apache/iceberg-python/blob/3eecdadc000047ec30749fc5d6ce1f2f072a30b2/pyiceberg/io/pyarrow.py#L1516) based on field IDs.
- We have to post-process the buffers to apply schema evolution. For example, if a table has promoted an integer column to a long, Iceberg does not rewrite the data files with the new type. Instead, when we see an integer at read time, we [promote the buffer to a long](https://github.com/apache/iceberg-python/blob/3eecdadc000047ec30749fc5d6ce1f2f072a30b2/pyiceberg/io/pyarrow.py#L1554-L1560). Ideally, we want to push this down to the reader right away.
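To make the two points above concrete, here is a minimal, dependency-free sketch of what a field-ID-aware reader would have to do. It is a toy model, not PyIceberg or Iceberg-Rust code: `Field`, `ALLOWED_PROMOTIONS`, and `project` are hypothetical names, columns are plain Python lists rather than Arrow buffers, and only the `int` to `long` promotion is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical schema field: an ID, a name, and a type name."""
    field_id: int
    name: str
    type: str


# Promotions accepted by this toy model, as (file type, table type) pairs.
ALLOWED_PROMOTIONS = {("int", "long")}


def project(file_schema, table_schema, columns):
    """Resolve table columns against file columns by field ID.

    `columns` maps the *file's* column names to lists of values.
    The result is keyed by the *table's* current column names.
    """
    by_id = {f.field_id: f for f in file_schema}
    num_rows = len(next(iter(columns.values()), []))
    out = {}
    for target in table_schema:
        source = by_id.get(target.field_id)
        if source is None:
            # Column was added after this file was written: fill with nulls.
            out[target.name] = [None] * num_rows
            continue
        if source.type != target.type and (source.type, target.type) not in ALLOWED_PROMOTIONS:
            raise TypeError(f"cannot promote {source.type} to {target.type}")
        # Python ints are unbounded, so the int -> long promotion is a no-op
        # on the values here; with Arrow buffers this is where the cast happens.
        out[target.name] = list(columns[source.name])
    return out


# The file was written before a rename, a promotion, and an added column.
file_schema = [Field(1, "id", "int"), Field(2, "name", "string")]
table_schema = [
    Field(1, "id", "long"),
    Field(2, "full_name", "string"),
    Field(3, "age", "int"),
]
result = project(file_schema, table_schema, {"id": [1, 2], "name": ["a", "b"]})
```

Because columns are matched on field ID rather than name, the renamed column is still found, and the promotion and null-filling happen in one pass instead of as a separate post-processing step.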

If we could push this down into Iceberg-Rust and have it return references to Arrow buffers to PyIceberg, the reader itself would handle field-ID projection and type promotion. We can start simple by still applying the merge-on-read deletes in PyIceberg, and then move that over to Iceberg-Rust step by step.
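As a rough illustration of the first step, the sketch below shows positional merge-on-read deletes being applied on the Python side to rows that some reader has already produced. It assumes positional deletes identify rows by their 0-based position in the data file; `apply_positional_deletes` and its parameters are hypothetical names, and plain lists stand in for Arrow record batches.

```python
def apply_positional_deletes(rows, deleted_positions, first_row_position=0):
    """Drop rows whose position in the data file is marked as deleted.

    `first_row_position` is the file position of rows[0], so the function
    also works for a batch taken from the middle of a file.
    """
    deleted = set(deleted_positions)
    return [
        row
        for offset, row in enumerate(rows)
        if first_row_position + offset not in deleted
    ]


# Rows as returned by a (hypothetical) delegated reader, then filtered in Python.
rows = ["a", "b", "c", "d"]
kept = apply_positional_deletes(rows, deleted_positions=[1, 3])
second_batch = apply_positional_deletes(["e", "f"], [5], first_row_position=4)
```

Keeping this step in PyIceberg at first means the delegated reader only needs to return projected, promoted data; the filtering can move into Iceberg-Rust later without changing what callers see.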
