Skip to content

feat: rewrite S3 metadata backend as op journal#383

Open
alecthomas wants to merge 12 commits into
mainfrom
aat/metadatadb-s3-journal
Open

feat: rewrite S3 metadata backend as op journal#383
alecthomas wants to merge 12 commits into
mainfrom
aat/metadatadb-s3-journal

Conversation

@alecthomas

Copy link
Copy Markdown
Collaborator

The previous backend stored each namespace as one JSON object,
read-modify-written under an S3 lock object. Queries served local
state that was never refreshed, so replicas never observed each
other's writes and cross-replica invalidation did not work; every
write cost four sequential round trips, serialized fleet-wide by a
racy and redundant lock.

Replace it with an append-only journal: each Apply group-commits a
batch of ops as one immutable segment PUT, replicas rebuild state by
replaying segments in a canonical (LastModified, key) order refreshed
by a per-namespace sync tick, and a leaderless CAS-elected compaction
folds aged segments into a rollup snapshot. No locks anywhere; queries
stay local; writes cost one unconditional PUT; staleness is bounded by
sync-interval. Legacy state objects seed the first rollup lazily.

See docs/metadatadb-s3.md for the design and its invariants.

Co-Authored-By: Claude Fable 5 noreply@anthropic.com

The previous backend stored each namespace as one JSON object,
read-modify-written under an S3 lock object. Queries served local
state that was never refreshed, so replicas never observed each
other's writes and cross-replica invalidation did not work; every
write cost four sequential round trips, serialized fleet-wide by a
racy and redundant lock.

Replace it with an append-only journal: each Apply group-commits a
batch of ops as one immutable segment PUT, replicas rebuild state by
replaying segments in a canonical (LastModified, key) order refreshed
by a per-namespace sync tick, and a leaderless CAS-elected compaction
folds aged segments into a rollup snapshot. No locks anywhere; queries
stay local; writes cost one unconditional PUT; staleness is bounded by
sync-interval. Legacy state objects seed the first rollup lazily.

See docs/metadatadb-s3.md for the design and its invariants.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@alecthomas alecthomas requested a review from a team as a code owner July 4, 2026 10:51
@alecthomas alecthomas requested review from stuartwdouglas and removed request for a team July 4, 2026 10:51

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d2059e982

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread internal/metadatadb/s3/writer.go Outdated
Doc comments on exported items and one-line invariant markers stay;
multi-line unexported function headers restating docs/metadatadb-s3.md,
struct-field narration, and test commentary go.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3506a8467

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread internal/metadatadb/s3/tick.go Outdated
alecthomas and others added 5 commits July 4, 2026 21:08
The two 16-arm switches and the flat wireOp struct collapse into a
prototype registry keyed by reflected type name, with the "op"
discriminator spliced into each op's own JSON encoding and a two-pass
unmarshal (discriminator, then concrete type) on decode. Splicing
rather than round-tripping through a map preserves int64 precision,
which also removes the IntValue workaround: int64 fields now decode
directly into the concrete op's typed fields.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Runs the soak workload across several backends sharing one bucket,
with real background ticks and aggressive compaction thresholds so
elections, probes, and rebases fire mid-run, then verifies monotonic
invariants on every replica and full cross-replica state convergence
(including list element order, which locks canonical replay order).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The SOAK_TEST guard lived inside the harness, after the tests had
already started the MinIO container and constructed backends.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4fd765f62b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread internal/metadatadb/s3/tick.go Outdated
Comment thread internal/metadatadb/s3/wire.go Outdated
- Apply no longer returns the caller's cancellation once a write is
  accepted by the group-commit writer: the write still lands, and the
  misreported failure would invite a retry that double-applies
  non-idempotent ops.
- Flush ticks run under the caller's context (merged with the backend
  lifetime), so an expired deadline aborts the tick promptly instead
  of occupying the namespace loop.
- All decodes into any-typed destinations use json.Number: plain
  unmarshalling coerced int64 values above 2^53 to float64 across
  rollup and segment round-trips, silently corrupting counters on
  replicas that rebuild from them.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d56e48f47

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread internal/metadatadb/s3/wire.go
Comment thread internal/metadatadb/s3/wire.go Outdated
alecthomas and others added 4 commits July 4, 2026 22:10
A struct map key marshals in field-declaration order locally but
re-marshals from a map with sorted keys after a segment round-trip,
so replayed entries landed under a different state key than local
ones — lookups and deletes with the original key then missed.
marshalKey now canonicalizes composite keys via a decode/re-marshal.

A JSON null legacy object or rollup state decoded "successfully" into
a nil map, panicking on the first write into it; unmarshalState now
rejects null, routing it through the corrupt-legacy handling.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The memory backend now queries through QueryStateInto, leaving the
helper with a single caller.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0de815c002

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// Immediate first tick so a restarted replica converges without waiting
// a full interval.
if n.b.initialTick {
background()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor Flush deadlines during the initial tick

When Flush is the first operation on a namespace and its context has a deadline, namespace() starts runLoop, but the loop always runs this initial background() tick with n.b.ctx before it can receive the flushCh request. The caller can time out while waiting to send the flush request, yet the S3 LIST/GET started here continues under the backend lifetime context and keeps the namespace loop occupied, so retries and later periodic syncs queue behind work the Flush caller already abandoned; the remaining path is this initial tick rather than the fixed flushCh tick path.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant