Add PuffinWriter for writing deletion vectors by moomindani · Pull Request #3474 · apache/iceberg-python

moomindani · 2026-06-09T23:07:54Z

Part of #2261. Continues #2822.

Rationale for this change

This adds a PuffinWriter for writing Puffin files containing deletion-vector-v1 blobs — the first building block for deletion-vector write support in PyIceberg (tracking issue #2261).

It revives #2822 by @rambleraptor (with @glesperance's Spark interop test), which was auto-closed by the stale bot rather than on merit. The original work — including all review feedback already addressed there (@ebyhr, @geruh) — is preserved commit-for-commit.

On top of that, this PR adds unit tests for two agreed review items that were not yet asserted by any test:

- the blob fields value [2147483645] (Java MetadataColumns.ROW_POSITION, INT_MAX - 2), required for Java/Spark interoperability; and
- the deletion-vector blob framing at the byte level (length prefix, DV magic, CRC-32 over magic + vector), which the PuffinFile reader skips, so the round-trip tests did not previously exercise it.
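
For readers unfamiliar with the framing described above, here is a minimal sketch of how a deletion-vector blob could be wrapped and checked. It is not the code from this PR. The magic bytes, the big-endian 4-byte length and CRC fields, and the helper names `frame_dv_blob` / `unframe_dv_blob` are assumptions based on my reading of the Puffin spec, and the vector is treated as opaque, already-serialized bytes.

```python
import struct
import zlib

# Assumption: magic sequence and field id as I understand them from the Puffin
# spec and the PR description, not copied from the PR's diff.
DV_MAGIC = bytes([0xD1, 0xD3, 0x39, 0x64])
ROW_POSITION_FIELD_ID = 2**31 - 1 - 2  # INT_MAX - 2


def frame_dv_blob(vector: bytes) -> bytes:
    """Wrap an already-serialized vector in a length prefix, magic and CRC-32."""
    body = DV_MAGIC + vector
    # The length prefix covers magic + vector, not the prefix or the checksum.
    length_prefix = struct.pack(">I", len(body))
    # The CRC-32 is computed over magic + vector.
    checksum = struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF)
    return length_prefix + body + checksum


def unframe_dv_blob(blob: bytes) -> bytes:
    """Validate the framing and return only the serialized vector."""
    (length,) = struct.unpack(">I", blob[:4])
    body = blob[4 : 4 + length]
    (stored_crc,) = struct.unpack(">I", blob[4 + length : 8 + length])
    if body[:4] != DV_MAGIC:
        raise ValueError("unexpected deletion-vector magic")
    if zlib.crc32(body) & 0xFFFFFFFF != stored_crc:
        raise ValueError("CRC-32 mismatch")
    return body[4:]
```

A byte-level test along these lines checks each framing field directly instead of relying on a reader that skips them.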

As in the original PR, this is intentionally scoped to the writer + tests so we can agree on the write semantics before wiring it into the delete/manifest writers and the merge-on-read path. Per the original review discussion, the writer expects the caller to provide one merged deletion vector per data file.

Are these changes tested?

Yes:

- Unit tests for round-trip write/read, the single-blob (1:1) behavior, the DV field id, byte-level blob framing, and empty files (tests/table/test_puffin.py).
- A Spark interoperability test confirming PyIceberg can read Spark-written Puffin DVs (tests/integration/test_puffin_spark_interop.py, by @glesperance).

Are there any user-facing changes?

No. PuffinWriter is a new internal building block and is not yet wired into any public write path.


PuffinFile reads only the serialized vector, skipping a blob's length prefix, deletion-vector magic and CRC-32, so the round-trip tests never exercise that framing.

Add coverage for review items agreed on the original PR (apache#2822) that were not yet asserted by any test:

- Assert the blob `fields` is [2147483645] (Java MetadataColumns.ROW_POSITION, INT_MAX - 2), required for Java/Spark interoperability (raised by @ebyhr).
- Assert the deletion-vector blob framing at the byte level: the length prefix, the deletion-vector magic, and the CRC-32 over magic + vector.

ebyhr · 2026-06-09T23:36:03Z

+        self._blobs = []
+        self._blob_payloads = []
+
+        # 1. Create bitmaps from positions


nit: I would avoid using number prefixes. When we want to add a new operation, we need to adjust the subsequent numbers.

ebyhr · 2026-06-09T23:39:30Z

+        # Calculate the cardinality from the bitmaps
+        cardinality = sum(len(bm) for bm in bitmaps.values())


nit: A comment for a simple single line seems excessive. It's evident when we read the code.

ebyhr · 2026-06-10T00:00:21Z

+@pytest.mark.integration
+def test_read_spark_written_puffin_dv(spark: SparkSession, session_catalog: RestCatalog) -> None:
+    """Verify pyiceberg can read Puffin DVs written by Spark."""
+    identifier = "default.spark_puffin_format_test"


This PR introduces support for write operations, so we're interested in verifying that Spark can read Puffin files written by PyIceberg. There are no requested changes for now. I suppose this PR is a preparatory change, and we'll need another PR to use it during the write operations.

ebyhr · 2026-06-10T00:20:36Z

+class PuffinWriter:
+    _blobs: list[PuffinBlobMetadata]
+    _blob_payloads: list[bytes]
+    _created_by: str | None


Could you please set the default value of the _created_by field to `PyIceberg version {version}`? You can obtain the version with importlib.metadata.version.
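
A minimal sketch of what this suggestion might look like, assuming the distribution is installed under the name `pyiceberg`. The helper name `default_created_by` and the "unknown" fallback are hypothetical and not taken from the PR.

```python
from importlib.metadata import PackageNotFoundError, version


def default_created_by(package: str = "pyiceberg") -> str:
    """Build a created-by string from the installed package version."""
    try:
        return f"PyIceberg version {version(package)}"
    except PackageNotFoundError:
        # Assumption: fall back to a placeholder when the package metadata
        # is unavailable (for example, when running from a source checkout).
        return "PyIceberg version unknown"
```

A constructor could then use `created_by or default_created_by()` so that callers can still override the value.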

ebyhr · 2026-06-10T00:24:52Z

@@ -0,0 +1,93 @@
+# Licensed to the Apache Software Foundation (ASF) under one


This test passes without the changes made in this PR. Could you please extract it into a separate PR?

- Default created-by footer property to 'PyIceberg version {version}'
- Move the Spark interop reader test to a separate PR
- Remove numbered and self-evident comments
- Name the row position field id constant
- Validate positions in set_blob (non-negative, non-empty)
- Simplify blob framing and finish() assembly
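
The position validation mentioned in this commit message could be sketched roughly as follows. The helper name `validate_positions` and the error messages are hypothetical. The actual set_blob implementation is not shown on this page and may differ.

```python
from collections.abc import Iterable


def validate_positions(positions: Iterable[int]) -> list[int]:
    """Reject empty input and negative row positions before building bitmaps."""
    checked = list(positions)
    if not checked:
        raise ValueError("positions must not be empty")
    for position in checked:
        if position < 0:
            raise ValueError(f"positions must be non-negative, got {position}")
    return checked
```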

ebyhr · 2026-06-10T04:20:54Z

+
+
+class PuffinWriter:
+    """Writes a Puffin file containing a single deletion-vector-v1 blob."""


This comment looks misleading. This writer doesn't write a file in my understanding.

ebyhr · 2026-06-10T04:22:38Z

+    _blob_payloads: list[bytes]
+    _created_by: str
+
+    def __init__(self, created_by: str | None = None) -> None:


What about accepting an OutputFile or something, and writing the content to it? I think this is a better approach than returning bytes. Iceberg Java PuffinWriter also accepts an output file object.
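
To illustrate the shape of this suggestion, here is a hedged sketch of a writer that appends to a caller-supplied binary stream instead of returning bytes. The class `StreamBlobWriter` is hypothetical and deliberately omits the Puffin header and footer. In PyIceberg the stream would presumably come from an OutputFile, and `io.BytesIO` stands in for it here.

```python
import io
from typing import BinaryIO


class StreamBlobWriter:
    """Hypothetical sketch: append blobs to a stream and record their ranges."""

    def __init__(self, output: BinaryIO) -> None:
        self._output = output
        self._offset = 0
        # (offset, length) per blob, the kind of data a footer would need.
        self.blob_ranges: list[tuple[int, int]] = []

    def add_blob(self, payload: bytes) -> None:
        self._output.write(payload)
        self.blob_ranges.append((self._offset, len(payload)))
        self._offset += len(payload)


# Usage with an in-memory stream standing in for a real output file.
buffer = io.BytesIO()
writer = StreamBlobWriter(buffer)
writer.add_blob(b"abc")
writer.add_blob(b"de")
```

Writing as blobs arrive avoids holding the whole file in memory, at the cost of the writer owning the offset bookkeeping.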

rambleraptor and others added 7 commits June 9, 2026 14:57

deletion vector write

755793c

test fix

9b10a4f

lint fixes

c90ad38

test: Add Spark interop test for Puffin DV reader

842d6a5

Verify pyiceberg's PuffinFile reader can parse deletion vectors written by Spark. Uses coalesce(1) to force Spark to create DVs instead of COW.

PR comments

9524618

lint

e23a67d

moomindani mentioned this pull request Jun 9, 2026

Write Deletion Vectors #2822

Closed

ebyhr reviewed Jun 10, 2026

moomindani wants to merge 8 commits into apache:main from moomindani:moomindani/dv-write-revival


4 participants
