feat: rewrite S3 metadata backend as op journal#383
Conversation
The previous backend stored each namespace as one JSON object, read-modify-written under an S3 lock object. Queries served local state that was never refreshed, so replicas never observed each other's writes and cross-replica invalidation did not work; every write cost four sequential round trips, serialized fleet-wide by a racy and redundant lock. Replace it with an append-only journal: each Apply group-commits a batch of ops as one immutable segment PUT, replicas rebuild state by replaying segments in a canonical (LastModified, key) order refreshed by a per-namespace sync tick, and a leaderless CAS-elected compaction folds aged segments into a rollup snapshot. No locks anywhere; queries stay local; writes cost one unconditional PUT; staleness is bounded by sync-interval. Legacy state objects seed the first rollup lazily. See docs/metadatadb-s3.md for the design and its invariants. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3d2059e982
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Doc comments on exported items and one-line invariant markers stay; multi-line unexported function headers restating docs/metadatadb-s3.md, struct-field narration, and test commentary go. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d3506a8467
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
The two 16-arm switches and the flat wireOp struct collapse into a prototype registry keyed by reflected type name, with the "op" discriminator spliced into each op's own JSON encoding and a two-pass unmarshal (discriminator, then concrete type) on decode. Splicing rather than round-tripping through a map preserves int64 precision, which also removes the IntValue workaround: int64 fields now decode directly into the concrete op's typed fields. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Runs the soak workload across several backends sharing one bucket, with real background ticks and aggressive compaction thresholds so elections, probes, and rebases fire mid-run, then verifies monotonic invariants on every replica and full cross-replica state convergence (including list element order, which locks canonical replay order). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The SOAK_TEST guard lived inside the harness, after the tests had already started the MinIO container and constructed backends. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4fd765f62b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
- Apply no longer returns the caller's cancellation once a write is accepted by the group-commit writer: the write still lands, and the misreported failure would invite a retry that double-applies non-idempotent ops. - Flush ticks run under the caller's context (merged with the backend lifetime), so an expired deadline aborts the tick promptly instead of occupying the namespace loop. - All decodes into any-typed destinations use json.Number: plain unmarshalling coerced int64 values above 2^53 to float64 across rollup and segment round-trips, silently corrupting counters on replicas that rebuild from them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5d56e48f47
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
A struct map key marshals in field-declaration order locally but re-marshals from a map with sorted keys after a segment round-trip, so replayed entries landed under a different state key than local ones — lookups and deletes with the original key then missed. marshalKey now canonicalizes composite keys via a decode/re-marshal. A JSON null legacy object or rollup state decoded "successfully" into a nil map, panicking on the first write into it; unmarshalState now rejects null, routing it through the corrupt-legacy handling. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The memory backend now queries through QueryStateInto, leaving the helper with a single caller. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0de815c002
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // Immediate first tick so a restarted replica converges without waiting | ||
| // a full interval. | ||
| if n.b.initialTick { | ||
| background() |
There was a problem hiding this comment.
Honor Flush deadlines during the initial tick
When Flush is the first operation on a namespace and its context has a deadline, namespace() starts runLoop, but the loop always runs this initial background() tick with n.b.ctx before it can receive the flushCh request. The caller can time out while waiting to send the flush request, yet the S3 LIST/GET started here continues under the backend lifetime context and keeps the namespace loop occupied, so retries and later periodic syncs queue behind work the Flush caller already abandoned; the remaining path is this initial tick rather than the fixed flushCh tick path.
Useful? React with 👍 / 👎.
The previous backend stored each namespace as one JSON object,
read-modify-written under an S3 lock object. Queries served local
state that was never refreshed, so replicas never observed each
other's writes and cross-replica invalidation did not work; every
write cost four sequential round trips, serialized fleet-wide by a
racy and redundant lock.
Replace it with an append-only journal: each Apply group-commits a
batch of ops as one immutable segment PUT, replicas rebuild state by
replaying segments in a canonical (LastModified, key) order refreshed
by a per-namespace sync tick, and a leaderless CAS-elected compaction
folds aged segments into a rollup snapshot. No locks anywhere; queries
stay local; writes cost one unconditional PUT; staleness is bounded by
sync-interval. Legacy state objects seed the first rollup lazily.
See docs/metadatadb-s3.md for the design and its invariants.
Co-Authored-By: Claude Fable 5 noreply@anthropic.com