docs(specs): add Staged Insert Specification #177
Planned edits before merge

While discussing the staged-insert spec, we surfaced that

Two edits are planned. They will be applied as a final commit on this branch immediately before merge, after the matching implementation PR in

Edit 1: Add a "Plugin codecs" sub-section to the Codec compatibility matrix

After the built-in compatibility matrix (currently the last row is "Other custom codec"), add:
(Other plugins — Edit 2 — Replace the
Revised planned edits, superseding my previous comment

@dimitri-yatsenko corrected my framing:

That means my earlier "Edit 2: replace

The real divide is array size, not codec preference. Revised plan:

Revised Edit 1: Add a "When to use staged insert" callout near the top of the spec

Add this paragraph after the Overview, before §Scope:
Revised Edit 2 — Add
| Codec | Plugin package | Staged insert? | Notes |
|---|---|---|---|
| `<zarr@>` | `dj-zarr-codecs` | Not typical | Use ordinary `insert1` with a numpy or zarr array; the codec serializes to Zarr internally. Staged insert is rarely the right tool here: `<zarr@>`'s `encode` requires a materialized array. For streaming Zarr writes that don't fit in memory, use `<object@>` with `staged.store(field, '.zarr')` and open Zarr directly. |
This is more honest than putting <zarr@> in the "supported" column — it technically inherits the protocol from SchemaCodec, but it's not the right tool for the staged use case.
Dropped: the previous "replace the <object@> Zarr example" edit
The current <object@> Zarr example in the spec stays. It correctly shows the streaming pattern that <zarr@> can't serve. I'll just add a small note pointing readers to <zarr@> + insert1 for the in-memory case:
`<object@>`: Streaming Zarr / HDF5 / multi-file directories

Use `<object@>` when the data is built up incrementally and doesn't fit in memory. For Zarr arrays that do fit in memory, use `<zarr@>` with ordinary `insert1` instead; it's simpler and yields a typed fetch result.

`# [existing example unchanged]`
Merge sequencing unchanged
These edits still land as one final commit just before merge, after the implementation PR in datajoint-python ships the generalized gate.
Test & validate
| Test | Asserts |
|---|---|
| `test_codec_admitted_by_staged_insert_gate` | A table whose field uses this codec accepts `with table.staged_insert1` without raising. |
| `test_staged_write_lands_at_canonical_path` | After a clean exit, the written content exists at the path the codec returned (schema-addressed canonical, or hash-addressed canonical after the rename). |
| `test_staged_insert_metadata_matches_encode` | The metadata dict assigned to `staged.rec[field]` on finalization is structurally equal to what the same codec's `encode()` would produce for equivalent content. |
| `test_staged_insert_fetch_roundtrip` | After staged insert, fetching the field returns a value indistinguishable from what an ordinary `insert1` of the same content would have produced. |
| `test_staged_cleanup_on_exception` | Raising inside the `with` block leaves no row inserted and no canonical artifact (and for hash-addressed codecs, no staging artifact). |
| `test_staged_primary_key_required` | Calling `staged.open()` or `staged.store()` before all primary key attributes are set on `staged.rec` raises `DataJointError`. |
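For readers who want to see the cleanup, metadata, and primary-key rows in miniature, here is a toy stand-in on a local directory. It is only a sketch: `ToyStaged`, `toy_staged_insert1`, `describe`, the `key=value` path layout, the `"local"` store name, and the use of `ValueError` in place of `DataJointError` are all assumptions made for illustration, not the DataJoint implementation.

```python
import shutil
import tempfile
from contextlib import contextmanager
from datetime import datetime, timezone
from pathlib import Path


class ToyStaged:
    """Hypothetical stand-in for the object yielded by a staged insert."""

    def __init__(self, root, primary_key):
        self.root = Path(root)
        self.primary_key = tuple(primary_key)
        self.rec = {}
        self.targets = {}  # field name -> path handed out by store()

    def store(self, field, ext=""):
        # The path is derived from the primary key, so every key must be set first.
        missing = [k for k in self.primary_key if k not in self.rec]
        if missing:
            raise ValueError(f"primary key attributes not set: {missing}")
        key = "/".join(f"{k}={self.rec[k]}" for k in self.primary_key)
        path = self.root / key / f"{field}{ext}"
        path.parent.mkdir(parents=True, exist_ok=True)
        self.targets[field] = path
        return path


def describe(path, root):
    """Build a dict with the metadata keys named in this thread (values are assumed)."""
    files = [p for p in path.rglob("*") if p.is_file()] if path.is_dir() else [path]
    return {
        "path": path.relative_to(root).as_posix(),
        "store": "local",  # assumed store name
        "size": sum(p.stat().st_size for p in files),
        "ext": path.suffix,
        "is_dir": path.is_dir(),
        "item_count": len(files) if path.is_dir() else None,  # assumed convention
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


@contextmanager
def toy_staged_insert1(rows, root, primary_key):
    staged = ToyStaged(root, primary_key)
    try:
        yield staged  # drafting: caller sets keys and writes to store() paths
        # Finalization: compute metadata for each staged field, then "insert" the row.
        for field, path in staged.targets.items():
            staged.rec[field] = describe(path, staged.root)
        rows.append(dict(staged.rec))
    except BaseException:
        # Unwinding: remove anything written, insert nothing, re-raise.
        for path in staged.targets.values():
            if path.is_dir():
                shutil.rmtree(path, ignore_errors=True)
            elif path.exists():
                path.unlink()
        raise


rows = []
root = tempfile.mkdtemp()
with toy_staged_insert1(rows, root, ["subject_id"]) as staged:
    staged.rec["subject_id"] = 1
    staged.store("raw", ".bin").write_bytes(b"abc")
```

The caller never assigns to `staged.rec["raw"]`; the sketch fills it in on clean exit, which mirrors the "metadata matches encode" row only in spirit.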
Additional for hash-addressed codecs
| Test | Asserts |
|---|---|
| `test_staged_dedup_hit` | Two staged inserts of the same content to different primary keys produce one canonical hash-addressed object; both rows reference it. |
| `test_staged_concurrent_canonical_collision` | A staging-to-canonical rename whose destination is concurrently created falls through to the dedup branch without error. |
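The two hash-addressed rows above can be pictured on a local filesystem. This is a sketch under stated assumptions: `stage_bytes` and `finalize` are hypothetical helpers, the `_staging/` layout is invented for the example, SHA-256 hex is used purely as a placeholder token (the token format is discussed separately in this thread), and `os.link` stands in for whatever conditional-create primitive a real object store would offer.

```python
import hashlib
import os
import tempfile
import uuid
from pathlib import Path


def stage_bytes(root: Path, data: bytes) -> Path:
    """Write data under a hypothetical _staging/ prefix and return its path."""
    staging = root / "_staging" / uuid.uuid4().hex
    staging.parent.mkdir(parents=True, exist_ok=True)
    staging.write_bytes(data)
    return staging


def finalize(root: Path, schema: str, staging: Path) -> Path:
    """Publish a finished staging file at a content-addressed path, deduplicating."""
    token = hashlib.sha256(staging.read_bytes()).hexdigest()  # placeholder token
    canonical = root / "_hash" / schema / token
    canonical.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Hard-linking fails if the destination already exists, so a concurrent
        # creator of the same content is detected instead of overwritten.
        os.link(staging, canonical)
    except FileExistsError:
        pass  # dedup branch: identical content is already published
    staging.unlink()  # the staging artifact is removed in both branches
    return canonical


root = Path(tempfile.mkdtemp())
first = finalize(root, "lab", stage_bytes(root, b"same content"))
second = finalize(root, "lab", stage_bytes(root, b"same content"))
```

Both calls return the same path, and the second one takes the dedup branch without raising.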
Part 2 — Implement the conformance tests in dj-zarr-codecs
Once the datajoint-python implementation PR merges, open a PR against dj-zarr-codecs that:
1. Bumps the `datajoint-python` pin in `pyproject.toml` (`pixi.toml`) from `rev = "f4b02583251c"` to the merged implementation commit.
2. Adds `tests/test_staged_insert.py` implementing the six SchemaCodec conformance tests above against `<zarr@>`:

   ```python
   class TestZarrStagedConformance:
       def test_codec_admitted_by_staged_insert_gate(self, schema): ...
       def test_staged_write_lands_at_canonical_path(self, schema): ...
       def test_staged_insert_metadata_matches_encode(self, schema): ...
       def test_staged_insert_fetch_roundtrip(self, schema): ...
       def test_staged_cleanup_on_exception(self, schema): ...
       def test_staged_primary_key_required(self, schema): ...
   ```

   Each test exercises a small `<zarr@>` table, comparing staged-insert behavior against ordinary `insert1` for the same array.
3. Adds `<zarr@>`-specific tests beyond the generic conformance contract:

   | Test | Asserts |
   |---|---|
   | `test_staged_zarr_shape_dtype_recorded` | Staged-inserted `<zarr@>` metadata column contains `shape`, `dtype`, `store`, and `provenance` matching what `<zarr@>`'s `encode()` would have produced. |
   | `test_staged_zarr_chunked_write_roundtrip` | Open Zarr via `zarr.open(staged.store(field, '.zarr'))`, write in chunks larger than memory budget, fetch, assert chunk-by-chunk equality. Demonstrates the streaming case `<zarr@>` was previously not designed for. |
   | `test_zarr_insert1_still_works` | Regression guard: the existing `test_numpy_array_roundtrip` / `test_zarray_roundtrip` tests still pass after the gate generalization. (The `<zarr@>` `insert1` path is the idiomatic one for in-memory arrays; it must not regress.) |
4. Confirms the codec compatibility matrix claim by running the conformance suite against both schema-addressed (`<zarr@>`) and hash-addressed (a small `<blob@>`-style sanity test) codecs to prove the design isn't `<zarr@>`-specific.
Sequencing
| Step | Repo | Status |
|---|---|---|
| 1. Spec review | datajoint-docs #177 | this PR |
| 2. Implementation | datajoint-python | not yet opened — references the spec as design |
| 3. Spec edits (Zarr framing, conformance section) commit | datajoint-docs #177 | held until step 2 merges |
| 4. dj-zarr-codecs conformance tests | dj-zarr-codecs | held until step 3 merges |
| 5. Merge sequence: step 2 → 3 → 4 |
Steps 2 and 4 land sequentially because step 4 needs the implementation to pass. Step 3 (the spec) lands between them because the spec describes what step 2 shipped and what step 4 validates.
Deferred:
Read this carefully against the dj-python source, and caught up on

On the hash algorithm (Open Decision #3). The decision says SHA-256 is spec'd "to match today's

```python
md5_digest = hashlib.md5(data).digest()
return base64.b32encode(md5_digest).decode("ascii").rstrip("=").lower()
```

Two ways to reconcile:
On the hash-addressed canonical path. Spec line 162 gives On the
So three shapes today (encode, staged, spec). The conformance test will catch the impl-side divergence, but the spec itself should pick whichever is normative and signal that the others converge to it. On Small related:

On forward-looking pieces.

On staging vs canonical path consistency. Spec gives staging as On the cross-link to On the

On the conformance contract (comment #3). The six required + two hash-addressed tests are well-scoped. One addition worth considering: a On

None of this is a showstopper: the spec's structure and the sequencing plan are both right. Mostly nudges around making "matches today" claims actually match today (or signaling that they're aspirational), and a couple of small forward-looking framing improvements.
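To make the hash-algorithm point concrete, the two lines quoted from source can be wrapped into a standalone function and compared against a SHA-256 hex digest. The function name `compute_token` is a placeholder chosen for this sketch; only the two-line body comes from the quote.

```python
import base64
import hashlib


def compute_token(data: bytes) -> str:
    """MD5 digest, base32-encoded, padding stripped, lowercased."""
    md5_digest = hashlib.md5(data).digest()  # 16 bytes
    return base64.b32encode(md5_digest).decode("ascii").rstrip("=").lower()


token = compute_token(b"example content")
sha256_hex = hashlib.sha256(b"example content").hexdigest()

# 16 bytes are 128 bits; base32 carries 5 bits per character, so 26 characters
# remain once the "=" padding is stripped. A SHA-256 hex digest is 64 characters.
print(len(token), len(sha256_hex))
```

The length difference alone is enough to tell the two formats apart in stored paths.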
Thank you @MilagrosMarin: every claim you pulled from source was correct. Pushed ec3a0dd with corrections.

Addressed in this commit

Tracked for the final pre-merge commit on this branch

These ship alongside the Zarr framing (already in

No action

Re-review whenever you have time. If you'd rather see the conformance section and the rejection-test now (rather than at final pre-merge), say the word and I'll fold them into this PR.
Thanks @dimitri-yatsenko, verified ✅ Hash algo / canonical-path /

On your question: defer the conformance section + rejection test to the final pre-merge commit. The spec is reviewable as a design doc now; the conformance section becomes meaningful only once the impl PR is concrete enough that the test names anchor to real assertions. Folding it in now risks drift between conformance and what ships.

The PR reads well in its current state.
Defines the staged-insert contract as a normative spec so the implementation has a single source of truth and third-party codec authors have a documented protocol to implement.

Covers:
- Lifecycle (setup → drafting → finalization → unwinding)
- The codec-side staged-write protocol (staged_handle / finalize_staged / cleanup_staged on the Codec base class)
- Two concrete lifecycle variants: schema-addressed (handle at canonical path, finalize computes metadata) and hash-addressed (handle at _staging path, finalize hashes content and renames to canonical _hash/ path with dedup)
- Path-construction shapes for both addressing schemes
- Per-codec metadata contracts (testable invariants matching each codec's encode() output)
- Atomicity model (at-most-once with cleanup; not transactional)
- Concurrency behavior (per-PK, hash dedup, transaction interaction, BaseException leakage)
- Codec compatibility matrix (the four built-in object-store codecs in, in-table and reference codecs explicitly out)
- Worked examples for <object@>, <npy@>, <blob@>, <attach@>
- Future-work scope notes for filepath staging, multi-row variants, and resumable inserts

Implementation is deferred to a follow-up PR in datajoint-python; this spec is the design that PR will reference.

Nav: add under Reference → Specifications → Data Operations alongside data-manipulation.md and autopopulate.md.
Adds <zarr@> (from dj-zarr-codecs) as a first-class supported codec in the staged-insert spec:
- New "Concrete protocol behavior" subsection describing both usage paths: ordinary insert1 (canonical for in-memory arrays) and staged_insert1 (for arrays too large to materialize, via direct FSMap-driven Zarr writes).
- New row in the Codec compatibility matrix.
- New Examples entry showing both paths side-by-side; demoted the generic <object@> example to a multi-file/directory fallback.
…ert spec
Corrections grounded in datajoint-python master:
- Hash algorithm: spec said sha256/hex; corrected to MD5+base32 → 26-char
lowercase token, matching hash_registry.compute_hash (hash_registry.py:51-67).
- Hash-addressed canonical path: spec said `_hash/{h[:2]}/{h[2:4]}/{h}`;
corrected to `_hash/{schema}/{content_hash}` (flat) or
`_hash/{schema}/{fold_*}/{content_hash}` (subfolded), matching
hash_registry.build_hash_path. The {schema} segment is load-bearing for
isolation; subfolding is per-store-tunable.
- <object@> normative metadata shape: pinned to ObjectCodec.encode's actual
output `{path, store, size, ext, is_dir, item_count, timestamp}`
(builtin_codecs/object.py:166-174). Noted the two-place convergence work
the impl PR will do (StagedInsert._compute_metadata refactor; earlier
draft of this spec).
- <blob@>/<attach@> shape: clarified that today's BlobCodec.encode and
AttachCodec.encode return raw bytes, and the dict shape comes from the
chained <hash@> codec — the impl PR refactors them to return dicts
directly. Also noted that HashCodec's three-way documented inconsistency
will be consolidated as part of the same refactor.
- Implementation-status banner: added at top of spec to signal which pieces
are forward-looking vs as-shipped, with source line numbers as anchors.
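A small sketch of the two corrected path shapes listed above, flat and subfolded. Everything beyond the two shape strings is an assumption made for illustration: the function name `build_hash_path_sketch`, the `subfolding` parameter, and the idea that fold segments are consecutive prefixes of the token are not taken from `hash_registry.build_hash_path`.

```python
from pathlib import PurePosixPath


def build_hash_path_sketch(schema, content_hash, subfolding=()):
    """Return _hash/{schema}/[folds.../]{content_hash} as a POSIX-style string.

    `subfolding` is a tuple of segment widths, e.g. (2, 2). Assumed convention:
    each fold is the next slice of the token, taken left to right.
    """
    folds = []
    pos = 0
    for width in subfolding:
        folds.append(content_hash[pos:pos + width])
        pos += width
    return str(PurePosixPath("_hash", schema, *folds, content_hash))


flat = build_hash_path_sketch("lab", "abcdefgh")
folded = build_hash_path_sketch("lab", "abcdefgh", subfolding=(2, 2))
```

Keeping `{schema}` as the first segment after `_hash` is what gives each schema its own isolated namespace in both shapes.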
Items still in flight (planned for final pre-merge commit on this branch):
- Conformance test section (incl. new test_staged_handle_rejects_non_participating_codecs
per Milagros' suggestion)
- Cross-link sequencing vs PR #175 (how-to)
- aa0f66d Zarr framing edits (already in)
Every example in §Examples now includes the @Schema class declaration with definition string, matching the house style in codec-api.md. Readers can copy a complete, self-contained snippet rather than mentally fill in the table schema. int32 used throughout per the core-types-in-docs convention. Covers <zarr@> (both ordinary and staged paths), <object@>, <npy@>, <blob@>, <attach@>.
21b781a to fb2b228
Staged insert applies only to codecs whose content format has an incremental-write API. Of the built-in codecs, only <object@> qualifies:
- <blob@>, <hash@>: atomic byte sequences from a materialized Python object
- <npy@>: np.save takes a materialized array
- <attach@>: file is already on disk; ordinary insert1 suffices
- <filepath@>: reference, not copy

Rewrite the spec around this principle:
- Drop the codec-protocol generalization (staged_handle / finalize_staged / cleanup_staged on Codec base class); defer until a second codec actually needs it.
- Drop the hash-addressed lifecycle and all <blob@>/<attach@>/<hash@> staged paths; no pathway exists.
- Drop the <zarr@> staged example: the proper dj-zarr-codecs API is insert1(array); a staged <zarr@> path is future work. The Zarr example shown is honest about column type: declared <object@>, written through <object@>'s FSMap with zarr.open.
- Drop the "Implementation status" admonition: no forward-looking pieces remain; the spec now documents shipped behavior.
- Trim 421 → 154 lines.

Future work section captures the deferred surface: staged insert for other codecs (incremental-API candidates: <zarr@>, <hdf5@>, parquet), hash-addressed staged, multi-row, resumable.

Also fix data-manipulation.md §2.9: drop stale "codec protocol" / "codec compatibility matrix" cross-link phrasing (those sections no longer exist), and fix the snippet that incorrectly assigned the zarr handle to staged.rec. The framework computes the metadata dict; the caller does not assign anything to the staged field.
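The scoping principle in this commit message can be read as a simple admission check. The sketch below is illustrative only: `check_staged_gate`, the two lookup tables, and the choice of `TypeError` are hypothetical, and the per-codec reasons are copied from the list above rather than from the shipped gate.

```python
# Codecs whose content format has an incremental-write API (per the list above).
STAGED_CODECS = {"object"}

# Reasons quoted from the commit message for codecs that stay out.
NOT_STAGED = {
    "blob": "atomic byte sequence from a materialized Python object",
    "hash": "atomic byte sequence from a materialized Python object",
    "npy": "np.save takes a materialized array",
    "attach": "file is already on disk; ordinary insert1 suffices",
    "filepath": "reference, not copy",
}


def check_staged_gate(codec_name: str) -> None:
    """Raise if the named codec should not be used with staged insert."""
    if codec_name in STAGED_CODECS:
        return
    reason = NOT_STAGED.get(codec_name, "no incremental-write path is defined")
    raise TypeError(f"<{codec_name}@> cannot be used with staged insert: {reason}")


check_staged_gate("object")  # admitted, returns None
```

Carrying the reason in the error message gives users the same one-line rationale the spec gives readers.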
Scope change: minimal spec (commit
Summary
A normative spec at `src/reference/specs/staged-insert.md` defining the staged-insert contract for the `<object@>` codec. The principle that scopes the spec: staged insert applies only to codecs whose content format has an incremental-write API. Among the built-in codecs, only `<object@>` qualifies; `<blob@>`, `<hash@>`, `<npy@>`, and `<attach@>` all require a materialized value at insert time and have no pathway through which staged insert would help.

What changed
- `src/reference/specs/staged-insert.md` (new)
- `mkdocs.yaml`: Reference → Specifications → Data Operations
- `src/reference/specs/data-manipulation.md`

Spec contents
- `<object@>` in; all others out with one-line rationale per codec
- `{path, store, size, ext, is_dir, item_count, timestamp}` matching `ObjectCodec.encode` (`builtin_codecs/object.py:166-174`)
- `BaseException`
- `<object@>` directory from a streaming source
- `<zarr@>`, `<hdf5@>`, parquet candidates); hash-addressed staged; multi-row; resumable

What this PR does not do
datajoint-python. The spec documents the as-shipped `<object@>` gate at `staged_insert.py:100-101`.

Test plan
- `mkdocs serve` renders the spec under `Reference → Specifications → Data Operations`
- `codec-api.md`, `data-manipulation.md`, `type-system.md`, `object-store-configuration.md`, the how-to, garbage-collection)
- `data-manipulation.md` reads correctly
- `src/how-to/staged-insert.md`, merged in docs: dedicated how-to page for staged insert #175) agree on scope and idioms