Delegate reading DataFiles to Iceberg-Rust #2396

@Fokko

Description

Feature Request / Improvement

Today we lean heavily on PyArrow to read the Parquet files, but this has some significant disadvantages:

  • PyArrow does not treat field IDs as first-class citizens. We therefore have to fetch the physical schema from the Parquet file first, and then prune it based on field IDs.
  • We have to post-process the buffers to apply schema evolution. For example, if a table has promoted an integer column to a long, Iceberg does not rewrite the data files with the new column type. Instead, when we see an integer at read time, we promote the buffer to a long. Ideally we would push this down into the reader right away.
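
To make the two steps above concrete, here is a minimal, library-independent sketch of what the read path has to do today: match columns by field ID rather than by name, then widen `int` buffers to `long`. It uses only the Python standard library; `Field`, `project_by_field_id`, and the example schemas are hypothetical names for illustration and are not PyIceberg or PyArrow APIs.

```python
from array import array
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field: ID, name, and primitive type."""
    field_id: int
    name: str
    type: str  # "int" (32-bit) or "long" (64-bit) in this sketch


def project_by_field_id(file_schema, file_columns, table_schema):
    """Resolve the table schema against a file's physical schema by field ID.

    Returns a dict keyed by the *current* column name. Columns that the file
    does not contain (added after the file was written) map to None.
    """
    by_id = {f.field_id: (f, col) for f, col in zip(file_schema, file_columns)}
    result = {}
    for wanted in table_schema:
        if wanted.field_id not in by_id:
            result[wanted.name] = None
            continue
        found, column = by_id[wanted.field_id]
        if found.type == "int" and wanted.type == "long":
            # Schema evolution: widen the 32-bit buffer to 64-bit at read time.
            column = array("q", list(column))
        elif found.type != wanted.type:
            raise TypeError(f"cannot promote {found.type} to {wanted.type}")
        result[wanted.name] = column
    return result


# Physical schema of an older data file: field 1 was an int called "id".
file_schema = [Field(1, "id", "int"), Field(2, "qty", "int")]
file_columns = [array("i", [1, 2, 3]), array("i", [10, 20, 30])]

# Current table schema: field 1 was renamed and promoted, field 3 was added,
# and field 2 is not part of the projection.
table_schema = [Field(1, "user_id", "long"), Field(3, "score", "long")]

projected = project_by_field_id(file_schema, file_columns, table_schema)
```

Matching on `field_id` is what makes the rename from `id` to `user_id` work without touching the file. A reader that understands field IDs natively could do the projection and the promotion in a single pass instead of as a post-processing step.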

If we could push this down into Iceberg-Rust and return references to Arrow buffers to PyIceberg, that would be great. We can start simple by still applying the merge-on-read deletes in PyIceberg, and then move that over to Iceberg-Rust step by step.
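
As a rough illustration of the part that would stay in PyIceberg at first, the sketch below applies positional deletes to a batch handed back by the reader. It assumes a batch is a plain dict of column name to list and that delete positions are 0-based row ordinals within the data file; `apply_positional_deletes` is a hypothetical helper, not an existing PyIceberg function, and equality deletes are not covered.

```python
def apply_positional_deletes(columns, first_row_position, deleted_positions):
    """Drop rows whose position in the data file is marked as deleted.

    columns: dict of column name -> list of values (one batch).
    first_row_position: file-level ordinal of the batch's first row.
    deleted_positions: set of 0-based file-level row ordinals to remove.
    """
    num_rows = len(next(iter(columns.values()))) if columns else 0
    keep = [
        i for i in range(num_rows)
        if first_row_position + i not in deleted_positions
    ]
    return {name: [values[i] for i in keep] for name, values in columns.items()}


# Second batch of a file: rows 3..6. Positions 4 and 6 were deleted.
batch = {"user_id": [103, 104, 105, 106], "score": [0.1, 0.2, 0.3, 0.4]}
filtered = apply_positional_deletes(batch, first_row_position=3, deleted_positions={4, 6})
```

Keeping this step on the Python side only requires the reader to expose each batch's starting row position, so the delete handling can move into Iceberg-Rust later without changing what callers see.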
