docs(specs): add Staged Insert Specification by dimitri-yatsenko · Pull Request #177 · datajoint/datajoint-docs

dimitri-yatsenko · 2026-05-21T14:42:02Z

Summary

A normative spec at src/reference/specs/staged-insert.md defining the staged-insert contract for the <object@> codec. The principle that scopes the spec: staged insert applies only to codecs whose content format has an incremental-write API. Among the built-in codecs, only <object@> qualifies — <blob@>, <hash@>, <npy@>, and <attach@> all require a materialized value at insert time and have no pathway through which staged insert would help.

What changed

File	Change
`src/reference/specs/staged-insert.md` (new)	~154-line minimal spec
`mkdocs.yaml`	Nav entry under `Reference → Specifications → Data Operations`
`src/reference/specs/data-manipulation.md`	§2.9 cross-link to the new spec; example snippet corrected

Spec contents

Overview & principle — incremental-write APIs only
Scope — <object@> in; all others out with one-line rationale per codec
Lifecycle — 4 phases (setup → drafting → finalization → unwinding), with diagram
Path construction (normative) — schema-addressed canonical path
Metadata contract — {path, store, size, ext, is_dir, item_count, timestamp} matching ObjectCodec.encode (builtin_codecs/object.py:166-174)
Atomicity model — at-most-once with cleanup; explicit boundaries for block exceptions, duplicate-PK on insert, storage errors, BaseException
Concurrency — same-PK serialization rule, transaction interaction
Configuration — references object-store-configuration.md
Example — Zarr written incrementally into an <object@> directory from a streaming source
Future work — staged insert for other codecs once their formats grow incremental-write APIs (<zarr@>, <hdf5@>, parquet candidates); hash-addressed staged; multi-row; resumable
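
The lifecycle and at-most-once-with-cleanup items above can be pictured with a toy context manager. This is a minimal sketch, not the datajoint-python implementation: `staged_insert_sketch`, `rows`, and the local directory standing in for the object store are all hypothetical, and it handles only `Exception`, whereas the spec states its own boundary for `BaseException`.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

# Toy model only: `rows` stands in for the table, `root` for the object store.
rows = []


@contextmanager
def staged_insert_sketch(root: Path, key: str):
    target = root / key                    # setup: reserve the canonical path
    target.mkdir(parents=True)
    try:
        yield target                       # drafting: the caller writes files here
    except Exception:
        shutil.rmtree(target, ignore_errors=True)  # unwinding: drop partial content
        raise
    # finalization: compute metadata from what was written, then insert the row
    files = [p for p in target.rglob("*") if p.is_file()]
    rows.append({
        "path": key,
        "item_count": len(files),
        "size": sum(p.stat().st_size for p in files),
    })


root = Path(tempfile.mkdtemp())

# Clean exit: content stays and one row is recorded.
with staged_insert_sketch(root, "session_1") as d:
    (d / "data.bin").write_bytes(b"\x00" * 16)

# Exception inside the block: no row, no leftover directory.
try:
    with staged_insert_sketch(root, "session_2") as d:
        (d / "partial.bin").write_bytes(b"\x01")
        raise RuntimeError("acquisition failed")
except RuntimeError:
    pass
```

After both blocks run, `rows` holds a single entry for `session_1` and the `session_2` directory is gone, which is the "no row and no artifact" outcome the atomicity item describes.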

What this PR does not do

No code changes in datajoint-python. The spec documents the as-shipped <object@> gate at staged_insert.py:100-101.
No generalized codec protocol and no Conformance section. Both are deferred until a second codec actually needs them.

Test plan

mkdocs serve renders the spec under Reference → Specifications → Data Operations
Cross-links resolve (codec-api.md, data-manipulation.md, type-system.md, object-store-configuration.md, the how-to, garbage-collection)
§2.9 in data-manipulation.md reads correctly
Spec and how-to (src/how-to/staged-insert.md, merged in docs: dedicated how-to page for staged insert #175) agree on scope and idioms

dimitri-yatsenko · 2026-05-21T14:51:42Z

Planned edits before merge

While discussing the staged-insert spec, we surfaced that <zarr@> (from dj-zarr-codecs) is a real codec — a sibling of <object@> under SchemaCodec, not built on top of it. Today's pattern of "use <object@> to host a Zarr store" is a workaround that works because <object@> is the generic schema-addressed-directory codec; once staged_insert1 accepts any SchemaCodec (per this spec), <zarr@> becomes the type-correct choice. The spec should reflect that.

Two edits are planned. They will be applied as a final commit on this branch immediately before merge, after the matching implementation PR in datajoint-python lands — so the spec ships in lockstep with the as-shipped code, not as an aspirational document.

Edit 1 — Add a "Plugin codecs" sub-section to the Codec compatibility matrix

After the built-in compatibility matrix (currently the last row is "Other custom codec"), add:

Plugin codecs (examples)

Third-party packages can register additional SchemaCodec or HashAddressedCodec subclasses that inherit staged-insert support automatically:

Codec	Plugin package	Lifecycle	Notes
`<zarr@>`	`dj-zarr-codecs`	Schema-addressed	Canonical example of a third-party SchemaCodec; stores shape, dtype, and provenance in the column metadata

(Other plugins — dj-photon-codecs, dj-figpack-codecs — can be added here as they adopt the protocol.)

Edit 2 — Replace the `<object@>` Zarr example with a `<zarr@>` example

Currently the Examples section leads with <object@> writing Zarr, which is the workaround. After implementation, the canonical example becomes:

<zarr@> — Zarr array (via dj-zarr-codecs plugin)
import zarr

with ImagingSession.staged_insert1 as staged:
    staged.rec['subject_id'] = 1
    staged.rec['session_id'] = 1
    z = zarr.open(staged.store('frames', '.zarr'), mode='w',
                  shape=(1000, 512, 512), chunks=(1, 512, 512), dtype='uint16')
    for i in range(1000):
        z[i] = acquire_frame()
    staged.rec['n_frames'] = 1000
<object@> — Generic multi-file directory

For directory layouts without a format-aware codec (custom binary formats, mixed files, ad-hoc collections):
with Dataset.staged_insert1 as staged:
    staged.rec['dataset_id'] = 1
    fs = staged.store('artifact')        # fsspec.FSMap
    fs['data.bin']      = signal.tobytes()
    fs['metadata.json'] = json.dumps({'session': '2026-05-21'}).encode()
Use <object@> only as a fallback. Prefer a format-aware codec (<zarr@>, <npy@>, or a custom SchemaCodec subclass) when one exists — you get richer column metadata (shape, dtype, etc.) and a typed fetch result.

Merge sequencing

This PR is reviewable now as a design document. The two edits above are deltas against this PR's tip that I'll commit just before merge. Concretely:

Reviewers land any spec-shape feedback on the current diff.
Implementation PR in datajoint-python lands (introduces the generalized gate so <npy@>, <blob@>, <attach@>, <zarr@>, etc. work with staged_insert1).
I push Edits 1 and 2 here as one final commit referencing the merged implementation PR.
This PR merges.

If you'd prefer them applied now (and accept that the spec runs ahead of the code for a few weeks), say so — happy to flip the order.

dimitri-yatsenko · 2026-05-21T14:58:05Z

Revised planned edits — superseding my previous comment

@dimitri-yatsenko corrected my framing: <zarr@> is built around insert1(numpy_array), not staged insert. The codec's encode() synchronously serializes the array to Zarr format. Staged insert isn't part of the <zarr@> idiom — and importantly, can't be, because <zarr@> requires a fully-formed numpy or zarr array as the encode input. There's no way to stream chunks through <zarr@>'s normal encode path.

That means my earlier "Edit 2: replace <object@> Zarr example with <zarr@>" was misframed. The <object@>-hosts-Zarr example in the spec is not a workaround — it's the correct pattern for the only case staged insert exists to serve: arrays too large to materialize. <zarr@> is for the materializable case and uses ordinary insert1.

The real divide is array size, not codec preference. Revised plan:

Revised Edit 1 — Add a "When to use staged insert" callout near the top of the spec

Add this paragraph after the Overview, before §Scope:

Staged insert is for content too large to materialize in process memory. For arrays that fit in memory, ordinary insert1 with a typed codec is both simpler and more idiomatic — pass a numpy array to <zarr@> or <npy@> and the codec handles serialization. Reach for staged_insert1 only when you can't hold the full value in memory (multi-GB Zarr stores being streamed from an instrument, HDF5 files written incrementally, blobs piped from a producer).
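
To make "too large to materialize" concrete, here is a back-of-the-envelope sketch using the array shape from the `<zarr@>` example in the earlier comment. The shape, chunking, and dtype come from that example; whether such an array "fits" depends on the process memory budget, which is not specified here.

```python
import math

shape = (1000, 512, 512)     # frames x height x width, from the earlier example
chunks = (1, 512, 512)       # one frame per chunk
itemsize = 2                 # bytes per uint16 element

full_bytes = math.prod(shape) * itemsize    # whole array held in memory
chunk_bytes = math.prod(chunks) * itemsize  # one chunk held in memory

print(full_bytes / 2**20, chunk_bytes / 2**20)  # 500.0 0.5 (MiB)
```

Writing chunk by chunk keeps the resident footprint near the chunk size rather than the full array size; the gap only grows as the frame count increases.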

Revised Edit 2 — Add `<zarr@>` to the compatibility matrix with the right framing

Under the existing "Plugin codecs" sub-section idea, the row becomes:

Codec	Plugin package	Staged insert?	Notes
`<zarr@>`	`dj-zarr-codecs`	Not typical	Use ordinary `insert1` with a numpy or zarr array; the codec serializes to Zarr internally. Staged insert is rarely the right tool here — `<zarr@>`'s encode requires a materialized array. For streaming Zarr writes that don't fit in memory, use `<object@>` with `staged.store(field, '.zarr')` and open Zarr directly.

This is more honest than putting <zarr@> in the "supported" column — it technically inherits the protocol from SchemaCodec, but it's not the right tool for the staged use case.

Dropped: the previous "replace the `<object@>` Zarr example" edit

The current <object@> Zarr example in the spec stays. It correctly shows the streaming pattern that <zarr@> can't serve. I'll just add a small note pointing readers to <zarr@> + insert1 for the in-memory case:

<object@> — Streaming Zarr / HDF5 / multi-file directories

Use <object@> when the data is built up incrementally and doesn't fit in memory. For Zarr arrays that do fit in memory, use <zarr@> with ordinary insert1 instead — it's simpler and yields a typed fetch result.
# [existing example unchanged]

Merge sequencing unchanged

These edits still land as one final commit just before merge, after the implementation PR in datajoint-python ships the generalized gate.

dimitri-yatsenko · 2026-05-21T14:59:57Z

Test & validate `dj-zarr-codecs` against the spec

To turn the spec from a normative document into a verifiable contract, I'll add a Conformance section listing tests every plugin codec must pass, and then implement them in dj-zarr-codecs once the datajoint-python implementation PR lands. Listing both halves below.

Part 1 — Add a Conformance section to this spec (planned third edit, ships with the others before merge)

A new section after §Codec compatibility matrix:

Conformance tests

Every codec that participates in the staged-write protocol MUST pass the following tests. They're stated here as a contract; reference implementations live in datajoint-python's integration suite for the built-in codecs, and tests/conformance.py (TBD) provides reusable fixtures third-party packages can import.

Required for any participating codec

Test	Asserts
`test_codec_admitted_by_staged_insert_gate`	A table whose field uses this codec accepts `with table.staged_insert1` without raising.
`test_staged_write_lands_at_canonical_path`	After a clean exit, the written content exists at the path the codec returned (schema-addressed canonical, or hash-addressed canonical after the rename).
`test_staged_insert_metadata_matches_encode`	The metadata dict assigned to `staged.rec[field]` on finalization is structurally equal to what the same codec's `encode()` would produce for equivalent content.
`test_staged_insert_fetch_roundtrip`	After staged insert, fetching the field returns a value indistinguishable from what an ordinary `insert1` of the same content would have produced.
`test_staged_cleanup_on_exception`	Raising inside the `with` block leaves no row inserted and no canonical artifact (and for hash-addressed codecs, no staging artifact).
`test_staged_primary_key_required`	Calling `staged.open()` or `staged.store()` before all primary key attributes are set on `staged.rec` raises `DataJointError`.

Additional for hash-addressed codecs

Test	Asserts
`test_staged_dedup_hit`	Two staged inserts of the same content to different primary keys produce one canonical hash-addressed object; both rows reference it.
`test_staged_concurrent_canonical_collision`	A staging-to-canonical rename whose destination is concurrently created falls through to the dedup branch without error.
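
The dedup behavior these two tests describe can be sketched on a local filesystem. This is a hedged illustration, not the planned implementation: `finalize_hash_addressed` is a hypothetical name, the `_hash/{schema}/{token}` layout is simplified, and real object stores do not offer the same rename semantics as `os.replace`.

```python
import base64
import hashlib
import os
import tempfile
from pathlib import Path


def finalize_hash_addressed(root: Path, schema: str, staging_file: Path) -> Path:
    """Hash staged content, then move it to its canonical path or dedup."""
    data = staging_file.read_bytes()
    token = (
        base64.b32encode(hashlib.md5(data).digest())
        .decode("ascii").rstrip("=").lower()
    )
    canonical = root / "_hash" / schema / token
    canonical.parent.mkdir(parents=True, exist_ok=True)
    if canonical.exists():
        staging_file.unlink()                # dedup hit: keep the existing object
    else:
        os.replace(staging_file, canonical)  # rename staging -> canonical
    return canonical


root = Path(tempfile.mkdtemp())
staging = root / "_staging"
staging.mkdir()
(staging / "a.tmp").write_bytes(b"same bytes")
(staging / "b.tmp").write_bytes(b"same bytes")

first = finalize_hash_addressed(root, "lab", staging / "a.tmp")
second = finalize_hash_addressed(root, "lab", staging / "b.tmp")
```

Both calls return the same canonical path and the staging directory ends up empty. The check-then-rename pair is not atomic; in this sketch a concurrent writer of identical content would simply be overwritten by an identical file, which is the benign outcome the collision test is meant to pin down.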

Part 2 — Implement the conformance tests in `dj-zarr-codecs`

Once the datajoint-python implementation PR merges, open a PR against dj-zarr-codecs that:

Bumps the datajoint-python pin in pyproject.toml (pixi.toml) from rev = "f4b02583251c" to the merged implementation commit.

Adds tests/test_staged_insert.py implementing the six SchemaCodec conformance tests above against <zarr@>:

class TestZarrStagedConformance:
    def test_codec_admitted_by_staged_insert_gate(self, schema): ...
    def test_staged_write_lands_at_canonical_path(self, schema): ...
    def test_staged_insert_metadata_matches_encode(self, schema): ...
    def test_staged_insert_fetch_roundtrip(self, schema): ...
    def test_staged_cleanup_on_exception(self, schema): ...
    def test_staged_primary_key_required(self, schema): ...

Each test exercises a small <zarr@> table, comparing staged-insert behavior against ordinary insert1 for the same array.

Adds <zarr@>-specific tests beyond the generic conformance contract:

Test	Asserts
`test_staged_zarr_shape_dtype_recorded`	Staged-inserted `<zarr@>` metadata column contains `shape`, `dtype`, `store`, and `provenance` matching what `<zarr@>`'s `encode()` would have produced.
`test_staged_zarr_chunked_write_roundtrip`	Open Zarr via `zarr.open(staged.store(field, '.zarr'))`, write in chunks larger than memory budget, fetch, assert chunk-by-chunk equality. Demonstrates the streaming case `<zarr@>` was previously not designed for.
`test_zarr_insert1_still_works`	Regression guard: the existing `test_numpy_array_roundtrip` / `test_zarray_roundtrip` tests still pass after the gate generalization. (The `<zarr@>` `insert1` path is the idiomatic one for in-memory arrays — must not regress.)

Confirms the codec compatibility matrix claim by running the conformance suite against both schema-addressed (<zarr@>) and hash-addressed (a small <blob@>-style sanity test) codecs to prove the design isn't <zarr@>-specific.

Sequencing

Step	Repo	Status
1. Spec review	datajoint-docs #177	this PR
2. Implementation	datajoint-python	not yet opened — references the spec as design
3. Spec edits (Zarr framing, conformance section) commit	datajoint-docs #177	held until step 2 merges
4. dj-zarr-codecs conformance tests	dj-zarr-codecs	held until step 3 merges
5. Merge sequence: step 2 → 3 → 4

Steps 2 and 4 land sequentially because step 4 needs the implementation to pass. Step 3 (the spec) lands between them because the spec describes what step 2 shipped and what step 4 validates.

dimitri-yatsenko · 2026-05-21T15:11:45Z

Deferred: `<photon@>` (from `dj-photon-codecs`)

Examined dj-photon-codecs/codecs.py. <photon@> is a SchemaCodec subclass like <zarr@> and <npy@>, so the generalized gate admits it. The only wrinkle is that <photon@> transforms data inside encode() (Anscombe stabilization + Blosc/Zstd + Zarr attrs), and the staged path bypasses encode() — so on the staged path, the caller is responsible for applying the transform per chunk before writing.

Decision: not adding a <photon@> subsection or matrix row to this spec yet. Sequencing:

This PR's <zarr@> framing lands (already in aa0f66d).
datajoint-python implementation PR lands the generalized gate.
dj-zarr-codecs conformance tests pass (the validation step from #177-comment-4509522796).
Then revisit <photon@> with empirical evidence about how transforming codecs fit the protocol — whether the "caller pre-transforms" pattern is enough (likely), whether the codec needs to expose its transform as a public helper, or whether more protocol surface is warranted.

The expectation, based on the protocol mechanics, is that <photon@> works the same way as <zarr@> once <zarr@> works — same FSMap-driven Zarr write, different caller-side preprocessing. We'll confirm rather than assume.
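
As a rough picture of the "caller pre-transforms per chunk" hypothesis: the sketch below applies the textbook Anscombe transform, 2·sqrt(x + 3/8), to each chunk before writing it into a mapping. Everything here is an assumption for illustration. The plain `dict` stands in for the mapping returned by `staged.store(...)`, `write_transformed_chunk` is a hypothetical helper, and the actual `<photon@>` transform and its Zarr/Blosc encoding may differ.

```python
import math
import struct


def anscombe(counts):
    """Textbook Anscombe transform; the codec's own variant may differ."""
    return [2.0 * math.sqrt(c + 0.375) for c in counts]


def write_transformed_chunk(store, index, counts):
    """Transform one chunk on the caller side, then write it to the store."""
    values = anscombe(counts)
    # Pack as little-endian float64 purely for this sketch.
    store[f"chunk_{index}"] = struct.pack(f"<{len(values)}d", *values)


store = {}  # stand-in for the mutable mapping from staged.store(...)
for i, frame in enumerate([[0, 1, 4], [9, 16, 25]]):
    write_transformed_chunk(store, i, frame)
```

The point is only the ordering: because the staged path bypasses `encode()`, the transform has to happen in the caller's loop before each write.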

MilagrosMarin · 2026-05-21T15:20:09Z

Read this carefully against the dj-python source — and caught up on aa0f66d and the four follow-up comments above. The sequencing plan (spec → impl PR → dj-zarr-codecs conformance suite → revisit <photon@>) is clean, and the conformance contract you've drafted (test_staged_insert_metadata_matches_encode, test_staged_dedup_hit, etc.) will catch most of the structural-equality concerns I'd otherwise flag. A few suggestions where the spec text — independent of the impl/conformance steps — could use sharpening:

On the hash algorithm (Open Decision #3). The decision says SHA-256 is spec'd "to match today's <blob@>/<hash@> behavior", but hash_registry.compute_hash (hash_registry.py:51-67) uses MD5 + base32 producing a 26-char lowercase token:

md5_digest = hashlib.md5(data).digest()
return base64.b32encode(md5_digest).decode("ascii").rstrip("=").lower()

Two ways to reconcile:

Re-state Open Decision #3 as a migration to SHA-256 (which invalidates existing hash paths; worth calling out as breaking)
Or align the spec to MD5+base32+26-char (and adjust the {hash[:2]}/{hash[2:4]} path template, which only makes intuitive sense for a longer hex string)
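
A quick check of the token shapes involved, reusing the two lines quoted above from `hash_registry.compute_hash`. The wrapper name `md5_base32_token` and the sample input are mine, for illustration only.

```python
import base64
import hashlib


def md5_base32_token(data: bytes) -> str:
    """MD5 digest (16 bytes) encoded as unpadded, lowercase base32."""
    md5_digest = hashlib.md5(data).digest()
    return base64.b32encode(md5_digest).decode("ascii").rstrip("=").lower()


token = md5_base32_token(b"example content")
sha_hex = hashlib.sha256(b"example content").hexdigest()

print(len(token), len(sha_hex))  # 26 64
```

A two-character prefix also means different things in the two encodings: two hex characters cover 8 bits (256 buckets), while two base32 characters cover 10 bits (1024 buckets), so a `{hash[:2]}/{hash[2:4]}` template fans out differently depending on which token it is applied to.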

On the hash-addressed canonical path. Spec line 162 gives _hash/{hash[:2]}/{hash[2:4]}/{hash}, but hash_registry.build_hash_path today produces _hash/{schema_name}/{fold_path}/{content_hash} with configurable subfolding from the store spec. {schema} is load-bearing today for isolation; subfolding is per-store-tunable. Both seem worth preserving in the normative shape, or explicitly dropping with rationale.
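
For comparison, a sketch of the schema-scoped shape with optional subfolding. This is not the code of `hash_registry.build_hash_path`; `hash_path_sketch` is a hypothetical helper, and it assumes subfolding is given as a sequence of widths consumed from the start of the hash.

```python
def hash_path_sketch(schema: str, content_hash: str, subfolding=None) -> str:
    """Illustrative only: build _hash/{schema}/[folds/]{content_hash}."""
    parts = ["_hash", schema]
    pos = 0
    for width in subfolding or ():
        # Assumption: each fold is the next `width` characters of the hash.
        parts.append(content_hash[pos:pos + width])
        pos += width
    parts.append(content_hash)
    return "/".join(parts)


flat = hash_path_sketch("lab", "abcdefgh")
folded = hash_path_sketch("lab", "abcdefgh", subfolding=(2, 2))
print(flat)    # _hash/lab/abcdefgh
print(folded)  # _hash/lab/ab/cd/abcdefgh
```

With subfolding left to the store spec, the fixed `{hash[:2]}/{hash[2:4]}` template becomes one possible configuration rather than the normative shape.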

On the <object@> normative metadata. Spec line 174 has {path, size, hash: None, ext, is_dir, timestamp, item_count?, mime_type?}. Today:

ObjectCodec.encode returns {path, store, size, ext, is_dir, item_count, timestamp} (builtin_codecs/object.py:166-174) — has store, no hash, no mime_type
staged_insert._compute_metadata returns {path, size, hash: None, ext, is_dir, timestamp, item_count} or with mime_type for files

So three shapes today (encode, staged, spec). The conformance test will catch the impl-side divergence, but the spec itself should pick whichever is normative and signal that the others converge to it.
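
The divergence is easy to see as key sets. The sets below are transcribed from the three shapes listed above (optional spec keys included); nothing here is read from the source tree.

```python
encode_keys = {"path", "store", "size", "ext", "is_dir", "item_count", "timestamp"}
staged_keys = {"path", "size", "hash", "ext", "is_dir", "timestamp", "item_count"}
spec_keys = {"path", "size", "hash", "ext", "is_dir", "timestamp",
             "item_count", "mime_type"}

print(sorted(encode_keys - staged_keys))  # ['store']
print(sorted(staged_keys - encode_keys))  # ['hash']
print(sorted(spec_keys - encode_keys))    # ['hash', 'mime_type']
```

A conformance check of the `test_staged_insert_metadata_matches_encode` kind would reduce to asserting that the staged key set equals whichever of these is declared normative.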

On <blob@> / <attach@> "structurally equal to encode()". Today BlobCodec.encode and AttachCodec.encode return bytes; the metadata dict ({hash, path, store, size}) comes from the chained <hash@> codec, not from those codecs directly. The spec implies their encode() returns dicts directly — which is what the impl PR will do, but a sentence like "the implementation refactors BlobCodec.encode/AttachCodec.encode to return metadata dicts directly" would tell plugin authors what's authoritative.

Small related: HashCodec's own metadata shape is documented three different ways in source (class docstring → {hash, store, size}; encode docstring → {hash, path, schema, store, size}; this spec → {hash, path, store, size}). Worth picking one as part of the impl PR.

On forward-looking pieces. HashAddressedCodec, _build_staging_path, _enrich_staged_metadata, and the three protocol methods on Codec don't exist in source yet (grep -rn "class HashAddressedCodec" src/datajoint/ is empty). Spec describes them present-tense — totally fine for a normative spec, but a small banner near the top ("Implementation status: spec only; reference impl lands in datajoint-python#TBD") would tell readers what's authoritative-today vs. authoritative-after-impl.

On staging vs canonical path consistency. Spec gives staging as _staging/{schema}/{table}/{field}_{token}{ext} (with schema/table), and canonical as _hash/{hash[:2]}/{hash[2:4]}/{hash} (no schema). If {schema} stays in canonical (suggestion 2 above), they align naturally; if it goes, the staging path is more granular than necessary.

On the cross-link to ../../how-to/staged-insert.md. That target lives on #175 (approved but unmerged). The link will resolve once #175 lands; sequencing #175 → #177 (or batching them) avoids a temporary broken link if linkcheck runs in between.

On the aa0f66d <zarr@> framing. Like it. The "two paths — insert1 for in-memory, staged_insert1 for content that can't materialize" framing is clear, and demoting <object@> to "generic multi-file fallback" is the right call. The _enrich_staged_metadata line ("reads the just-written Zarr array's metadata via zarr.open(store, mode='r')") is concrete enough that a dj-zarr-codecs plugin author can implement it from the spec alone.

On the conformance contract (comment #3). The six required + two hash-addressed tests are well-scoped. One addition worth considering: a test_staged_handle_rejects_non_participating_codecs that asserts the default staged_handle on Codec raises DataJointError with the codec name in the message — that's the gate the spec relies on to make non-participation explicit.
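
A sketch of what that rejection test might look like. The classes are hypothetical stand-ins: `StagedInsertError` plays the role of `DataJointError`, and the `staged_handle` signature is assumed, since the protocol methods do not exist in source yet.

```python
class StagedInsertError(Exception):
    """Stand-in for DataJointError in this sketch."""


class CodecSketch:
    name = "codec"

    def staged_handle(self, path):
        # Default: a codec does not participate unless it overrides this.
        raise StagedInsertError(f"<{self.name}@> does not support staged insert")


class ObjectCodecSketch(CodecSketch):
    name = "object"

    def staged_handle(self, path):
        return {"path": path}  # participating codec returns a handle


class BlobCodecSketch(CodecSketch):
    name = "blob"              # inherits the rejecting default


def test_staged_handle_rejects_non_participating_codecs():
    try:
        BlobCodecSketch().staged_handle("some/path")
    except StagedInsertError as err:
        assert "blob" in str(err)
    else:
        raise AssertionError("expected the default staged_handle to raise")
```

Asserting on the codec name in the message keeps the failure actionable for a user who declared the wrong column type.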

On <photon@> deferral (comment #4). Reasonable. The "caller pre-transforms per chunk" pattern is the right starting hypothesis, and validating against dj-zarr-codecs first before generalizing the protocol to transforming codecs is the right risk-mitigation order.

None of this is showstopper — the spec's structure and the sequencing plan are both right. Mostly nudges around making "matches today" claims actually match today (or signaling that they're aspirational), and a couple of small forward-looking framing improvements.

dimitri-yatsenko · 2026-05-21T15:25:21Z

Thank you @MilagrosMarin — every claim you pulled from source was correct. Pushed ec3a0dd with corrections.

Addressed in this commit

#	Your point	Resolution
1	Hash algorithm: SHA-256 ≠ today's MD5+base32	Spec corrected to MD5 + base32 → 26-char lowercase, matching `hash_registry.compute_hash` (`hash_registry.py:51-67`). Not a migration.
2	Hash canonical path shape	Spec corrected to `_hash/{schema}/{content_hash}` (flat) or `_hash/{schema}/{fold_*}/{content_hash}` (subfolded), matching `hash_registry.build_hash_path`. `{schema}` is now load-bearing in the normative shape; `{fold_*}` is derived from `subfolding` in the store spec.
3	`<object@>` metadata had three shapes (encode / staged / spec)	Pinned the normative shape to `ObjectCodec.encode`'s actual output `{path, store, size, ext, is_dir, item_count, timestamp}` (`builtin_codecs/object.py:166-174`). Removed `hash: None` and `mime_type?` from the spec. Noted that the impl PR refactors `StagedInsert._compute_metadata` to converge on this.
4	`<blob@>` / `<attach@>` `encode()` returns bytes, not dict	Added a paragraph clarifying that today's encode returns bytes, the dict shape comes from the chained `<hash@>` codec, and the impl PR refactors `BlobCodec.encode`/`AttachCodec.encode` to return the dict shape directly.
5	`HashCodec` shape documented three ways in source	Noted in the same paragraph as #4 that consolidation to `{hash, path, store, size}` happens as part of the impl PR.
6	Forward-looking pieces (`HashAddressedCodec`, `_build_staging_path`, `_enrich_staged_metadata`, the three protocol methods)	Added an "Implementation status" admonition at the top of the spec, explicitly calling out which pieces are forward-looking vs as-shipped, with source line numbers as anchors.
7	Staging path has `{schema}/{table}`, canonical hash path didn't	Resolved naturally by fix #2 — `{schema}` is now in the canonical hash path too, so the two paths align on the schema dimension. (Staging keeps the `{table}` segment for GC traceability — flagged in the new path table.)

Tracked for the final pre-merge commit on this branch

These ship alongside the Zarr framing (already in aa0f66d) when the implementation PR lands:

#	Your point	Plan
8	Cross-link to `../../how-to/staged-insert.md` (target on #175, unmerged)	Sequence #175 → #177. If #175 doesn't land first, I'll downgrade the link to a non-clickable reference in the final commit.
10	Add `test_staged_handle_rejects_non_participating_codecs` to the conformance suite	Added to my notes for the conformance section (planned final pre-merge commit). Tests that the default `staged_handle` on `Codec` raises `DataJointError` with the codec name in the message — closes the explicit-non-participation loop.

No action

#9 (aa0f66d Zarr framing): appreciated, no change needed.
#11 (<photon@> deferral): confirmed reasoning; deferred until dj-zarr-codecs conformance lands.

Re-review whenever you have time. If you'd rather see the conformance section and the rejection-test now (rather than at final pre-merge), say the word and I'll fold them into this PR.

MilagrosMarin · 2026-05-21T15:33:04Z

Thanks @dimitri-yatsenko — verified ec3a0dd against master line-by-line, all seven fixes land cleanly:

✅ Hash algo / canonical-path / {schema} segment / subfolding-as-store-config — all now point to hash_registry.compute_hash and hash_registry.build_hash_path with the right shape and the right rationale.
✅ <object@> shape pinned to ObjectCodec.encode's actual output; the convergence note explicitly enumerates the three places that diverge today and which is normative.
✅ <blob@> / <attach@> clarification correctly explains today's bytes return + chained <hash@>, and surfaces the HashCodec triple-documentation issue as part of the same impl-PR refactor.
✅ The "Implementation status" admonition is exactly the right framing — three source anchors at the top is enough for a reader to verify the as-shipped state without spelunking.

On your question — defer the conformance section + rejection test to the final pre-merge commit. The spec is reviewable as a design doc now; the conformance section becomes meaningful only once the impl PR is concrete enough that the test names anchor to real assertions. Folding it in now risks drift between conformance and what ships.

The PR reads well in its current state.

Defines the staged-insert contract as a normative spec so the implementation has a single source of truth and third-party codec authors have a documented protocol to implement. Covers:
- Lifecycle (setup → drafting → finalization → unwinding)
- The codec-side staged-write protocol (staged_handle / finalize_staged / cleanup_staged on the Codec base class)
- Two concrete lifecycle variants: schema-addressed (handle at canonical path, finalize computes metadata) and hash-addressed (handle at _staging path, finalize hashes content and renames to canonical _hash/ path with dedup)
- Path-construction shapes for both addressing schemes
- Per-codec metadata contracts (testable invariants matching each codec's encode() output)
- Atomicity model (at-most-once with cleanup; not transactional)
- Concurrency behavior (per-PK, hash dedup, transaction interaction, BaseException leakage)
- Codec compatibility matrix (the four built-in object-store codecs in, in-table and reference codecs explicitly out)
- Worked examples for <object@>, <npy@>, <blob@>, <attach@>
- Future-work scope notes for filepath staging, multi-row variants, and resumable inserts

Implementation is deferred to a follow-up PR in datajoint-python; this spec is the design that PR will reference.

Nav: add under Reference → Specifications → Data Operations alongside data-manipulation.md and autopopulate.md.

Adds <zarr@> (from dj-zarr-codecs) as a first-class supported codec in the staged-insert spec:
- New "Concrete protocol behavior" subsection describing both usage paths: ordinary insert1 (canonical for in-memory arrays) and staged_insert1 (for arrays too large to materialize, via direct FSMap-driven Zarr writes).
- New row in the Codec compatibility matrix.
- New Examples entry showing both paths side-by-side; demoted the generic <object@> example to a multi-file/directory fallback.

…ert spec

Corrections grounded in datajoint-python master:
- Hash algorithm: spec said sha256/hex; corrected to MD5+base32 → 26-char lowercase token, matching hash_registry.compute_hash (hash_registry.py:51-67).
- Hash-addressed canonical path: spec said `_hash/{h[:2]}/{h[2:4]}/{h}`; corrected to `_hash/{schema}/{content_hash}` (flat) or `_hash/{schema}/{fold_*}/{content_hash}` (subfolded), matching hash_registry.build_hash_path. The {schema} segment is load-bearing for isolation; subfolding is per-store-tunable.
- <object@> normative metadata shape: pinned to ObjectCodec.encode's actual output `{path, store, size, ext, is_dir, item_count, timestamp}` (builtin_codecs/object.py:166-174). Noted the two-place convergence work the impl PR will do (StagedInsert._compute_metadata refactor; earlier draft of this spec).
- <blob@>/<attach@> shape: clarified that today's BlobCodec.encode and AttachCodec.encode return raw bytes, and the dict shape comes from the chained <hash@> codec; the impl PR refactors them to return dicts directly. Also noted that HashCodec's three-way documented inconsistency will be consolidated as part of the same refactor.
- Implementation-status banner: added at top of spec to signal which pieces are forward-looking vs as-shipped, with source line numbers as anchors.

Items still in flight (planned for final pre-merge commit on this branch):
- Conformance test section (incl. new test_staged_handle_rejects_non_participating_codecs per Milagros' suggestion)
- Cross-link sequencing vs PR #175 (how-to)
- aa0f66d Zarr framing edits (already in)

@Schema

Every example in §Examples now includes the @Schema class declaration with definition string, matching the house style in codec-api.md. Readers can copy a complete, self-contained snippet rather than mentally fill in the table schema. int32 used throughout per the core-types-in-docs convention. Covers <zarr@> (both ordinary and staged paths), <object@>, <npy@>, <blob@>, <attach@>.

Staged insert applies only to codecs whose content format has an incremental-write API. Of the built-in codecs, only <object@> qualifies:
- <blob@>, <hash@>: atomic byte sequences from a materialized Python object
- <npy@>: np.save takes a materialized array
- <attach@>: file is already on disk; ordinary insert1 suffices
- <filepath@>: reference, not copy

Rewrite the spec around this principle:
- Drop the codec-protocol generalization (staged_handle / finalize_staged / cleanup_staged on Codec base class); defer until a second codec actually needs it.
- Drop the hash-addressed lifecycle and all <blob@>/<attach@>/<hash@> staged paths; no pathway exists.
- Drop the <zarr@> staged example; the proper dj-zarr-codecs API is insert1(array), and a staged <zarr@> path is future work. The Zarr example shown is honest about column type: declared <object@>, written through <object@>'s FSMap with zarr.open.
- Drop the "Implementation status" admonition; no forward-looking pieces remain, and the spec now documents shipped behavior.
- Trim 421 → 154 lines.

Future work section captures the deferred surface: staged insert for other codecs (incremental-API candidates: <zarr@>, <hdf5@>, parquet), hash-addressed staged, multi-row, resumable.

Also fix data-manipulation.md §2.9: drop stale "codec protocol" / "codec compatibility matrix" cross-link phrasing (those sections no longer exist), and fix the snippet that incorrectly assigned the zarr handle to staged.rec. The framework computes the metadata dict; the caller does not assign anything to the staged field.

dimitri-yatsenko · 2026-06-07T22:00:43Z

Scope change: minimal spec (commit `d20f6a8`)

Reconsidered the spec scope. The earlier revision (ea29078 … fb2b228) specified staged insert for all four built-in codecs plus a generalized three-method codec protocol. That implicitly assumed every codec has a useful incremental-write pathway — but walking through each, only <object@> does:

<blob@>, <hash@> — BlobCodec.encode takes a materialized Python object and emits bytes. No streaming serialization API. A "staged" insert for <blob@> doesn't reduce memory pressure; the caller still has to materialize the object first.
<npy@> — np.save takes a materialized numpy array. No chunked-write entry point.
<attach@> — the file already exists on disk; ordinary insert1 is sufficient.
<filepath@> — registers a reference, not a copy. Different lifecycle entirely.
<object@> — multi-file directory, write files into an FSMap one at a time. The only codec where staged insert genuinely buys something insert1 can't provide.

The rewrite (d20f6a8) collapses the spec to <object@> only:

Drops the codec-protocol generalization (staged_handle / finalize_staged / cleanup_staged on the Codec base class). Defer until a second codec needs it.
Drops the hash-addressed lifecycle (no participating codec).
Drops the <zarr@> example that wrote through the <object@> interface (column-type-vs-machinery mismatch was misleading). The example now declares <object@> honestly and uses <object@>'s FSMap with zarr.open — same effect, accurate framing. The proper dj-zarr-codecs API for <zarr@> columns remains insert1(array); a staged <zarr@> pathway is future work.
Drops <blob@>, <npy@>, <attach@> from the supported list with a one-line rationale per codec in the Scope section.
421 → 154 lines.

The Conformance section also goes away — it was scoped to the multi-codec protocol that no longer exists. When a second codec adopts staged insert and we introduce the generalized protocol, conformance comes back with it.

Side effect: the spec now describes shipped behavior, so the "Implementation status" admonition is gone. The how-to (#175, merged) already scopes to <object@> only, so spec and how-to are consistent.

Also fixed two bugs in data-manipulation.md §2.9 that were on this branch: the cross-link mentioned a "codec protocol" and "codec compatibility matrix" that no longer exist, and the snippet incorrectly assigned the zarr handle to staged.rec["raw_data"] (the framework computes the metadata dict; the caller does not assign).
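For readers who want to see why the caller should not assign the object field, here is a rough, self-contained sketch of the division of labor. It is an illustration under stated assumptions, not the datajoint-python implementation: `StagedSketch`, `staged_insert_sketch`, and the in-memory `rows` list are all hypothetical, and the metadata dict carries only a subset of the keys listed in the spec.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


class StagedSketch:
    """Hypothetical stand-in for the staged handle; not the DataJoint class."""

    def __init__(self, root: Path):
        self.rec = {}        # caller sets key and scalar attributes only
        self._root = root
        self._targets = {}   # field name -> directory being drafted

    def store(self, field: str, ext: str) -> Path:
        """Return a directory the caller writes into incrementally."""
        target = self._root / f"{field}{ext}"
        target.mkdir(parents=True)
        self._targets[field] = target
        return target

    def _finalize(self) -> None:
        """Compute metadata from what was written; the caller never assigns it."""
        for field, target in self._targets.items():
            files = [p for p in target.rglob("*") if p.is_file()]
            self.rec[field] = {
                "path": target.name,
                "size": sum(p.stat().st_size for p in files),
                "is_dir": True,
                "item_count": len(files),
            }


@contextmanager
def staged_insert_sketch(root: Path, table: list):
    staged = StagedSketch(root)
    try:
        yield staged                 # drafting happens in the with-block
        staged._finalize()           # finalization: framework fills metadata
        table.append(staged.rec)     # stand-in for the row insert
    except BaseException:
        # Unwinding: remove anything drafted, then re-raise.
        for target in staged._targets.values():
            shutil.rmtree(target, ignore_errors=True)
        raise


rows = []
with tempfile.TemporaryDirectory() as tmp:
    with staged_insert_sketch(Path(tmp) / "ok", rows) as staged:
        staged.rec["subject_id"] = 123
        out = staged.store("raw_data", ".zarr")
        (out / "0.0").write_bytes(b"\x00" * 16)
        (out / "0.1").write_bytes(b"\x00" * 16)
        # Note: no assignment to staged.rec["raw_data"] here.

    try:
        with staged_insert_sketch(Path(tmp) / "fail", rows) as staged:
            staged.rec["subject_id"] = 124
            failed_dir = staged.store("raw_data", ".zarr")
            (failed_dir / "0.0").write_bytes(b"x")
            raise RuntimeError("acquisition failed")
    except RuntimeError:
        pass
    cleaned_up = not failed_dir.exists()
```

The first block produces one row whose `raw_data` entry was computed at finalization. The second block raises mid-draft, so nothing is inserted and the drafted directory is removed.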

Re-review whenever you have time.

dimitri-yatsenko requested a review from MilagrosMarin May 21, 2026 14:42

dimitri-yatsenko mentioned this pull request May 21, 2026

docs: dedicated how-to page for staged insert #175

Merged

5 tasks

dimitri-yatsenko requested review from mweitzel and ttngu207 May 21, 2026 14:43

dimitri-yatsenko requested review from kushalbakshi and removed request for kushalbakshi May 21, 2026 16:12

dimitri-yatsenko added 4 commits May 21, 2026 11:55

dimitri-yatsenko force-pushed the docs/spec-staged-insert branch from 21b781a to fb2b228 Compare May 21, 2026 16:56

dimitri-yatsenko marked this pull request as draft May 21, 2026 18:47

MilagrosMarin mentioned this pull request Jun 5, 2026

docs: env-var configuration of stores (DJ_STORES, DJ_IGNORE_CONFIG_FILE) + Storage Adapter API spec (2.2.4) #172

Merged

4 tasks

dimitri-yatsenko mentioned this pull request Jun 7, 2026

fix(staged_insert): converge metadata shape with ObjectCodec.encode datajoint/datajoint-python#1465

Open

4 tasks

dimitri-yatsenko marked this pull request as ready for review June 7, 2026 22:23

dimitri-yatsenko wants to merge 5 commits into main from docs/spec-staged-insert

2 participants
