fix(storage): protect head root from state pruning#396

Merged: MegaRedHand merged 1 commit into main from fix/prune-protects-head-state on May 28, 2026

@MegaRedHand
Collaborator

Summary

  • Add the fork-choice head root to prune_old_data's protected_roots so the pruner never evicts the state currently in use by fork choice.

Problem

prune_old_data only protected the latest finalized and justified block roots:

let protected_roots = [self.latest_finalized().root, self.latest_justified().root];

prune_old_states then keeps the top STATES_TO_KEEP = 3_000 entries by slot (descending) and deletes the rest of the States table, ignoring fork-choice membership.

When finalization stalls and a competing branch keeps producing blocks on top of an unfinalized region, that branch's high-slot headers fill the retention window even though it isn't the fork-choice head. The actual head can fall outside the top 3000 by slot, and its state row is deleted. The very next call into:

pub fn head_state(&self) -> State {
    self.get_state(&self.head())
        .expect("head state is always available")
}

panics, taking down the blockchain actor (the container stays up, P2P keeps running, but every gossip message logs err=Actor stopped).
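
The eviction mechanism can be illustrated with a minimal sketch of the slot-rank check. This is not the project's code: `rank_by_slot` and `inside_window` are hypothetical helpers, and slots are modeled as plain `u64` values rather than rows of the `States` table.

```rust
/// Position of `head_slot` when all `slots` are ordered by slot, descending
/// (0 means the head has the highest slot). Hypothetical helper.
fn rank_by_slot(slots: &[u64], head_slot: u64) -> usize {
    // Every state with a strictly higher slot ranks ahead of the head,
    // whether or not it belongs to the fork-choice head's branch.
    slots.iter().filter(|&&s| s > head_slot).count()
}

/// True if a pruner that keeps only the top `window` entries by slot
/// would retain the head's state without any extra protection.
fn inside_window(slots: &[u64], head_slot: u64, window: usize) -> bool {
    rank_by_slot(slots, head_slot) < window
}
```

For example, with a head at slot 10, a competing branch at slots 11 through 14, and a window of 3, `rank_by_slot` returns 4 and `inside_window` returns `false`: the head's state becomes a pruning candidate even though fork choice still depends on it.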

Observed

In a stalled 8-node devnet, finalization froze around slot 2866 while a minority fork from 2 surviving aggregators kept proposing one empty block per slot, advancing past slot 15000. On the next gossip block (slot 15010, proposer 2), 6 of 8 nodes pruned 2666 states and panicked simultaneously at store.rs:1314 with the message "head state is always available". Logs on every affected node:

2026-05-28T13:15:07.133  INFO Pruned old states and blocks pruned_states=2666 pruned_blocks=0
2026-05-28T13:15:07.802  thread 'tokio-rt-worker' panicked at crates/storage/src/store.rs:1314:14:
                          head state is always available
2026-05-28T13:15:07.802  ERROR spawned_concurrency::tasks::actor: Panic in message handler

Fix

Add self.head() to protected_roots. The head is now retained regardless of its slot position relative to other branches' tips.

Test plan

  • Existing prune_old_states_* tests still pass.
  • Add regression test: head root at a low slot, retention window otherwise filled by a higher-slot branch; assert head's state is preserved after prune_old_data.
  • Re-run the long-lived devnet scenario that stalled finalization and verify nodes no longer panic when fork-choice head lags behind competing branch tips.

## Problem

`prune_old_data` only protects the latest finalized and justified block
roots:

```rust
let protected_roots = [self.latest_finalized().root, self.latest_justified().root];
```

`prune_old_states` then keeps the top `STATES_TO_KEEP = 3_000` entries
sorted by **slot** (descending) and deletes the rest of the `States`
table, ignoring fork-choice membership.

When finalization stalls and a competing branch keeps producing blocks
on top of an unfinalized region, that branch's high-slot headers can
fill the retention window even though it isn't the fork-choice head.
The actual head can fall outside the top 3000 by slot, and its state
row is deleted. The very next call into:

```rust
pub fn head_state(&self) -> State {
    self.get_state(&self.head())
        .expect("head state is always available")
}
```

panics, taking down the blockchain actor (the container stays up, P2P
keeps running, but every gossip message logs `err=Actor stopped`).

Observed in a stalled 8-node devnet: finalization froze with head at
slot ~2891 while a minority fork advanced past slot 15_000. On
receiving the next gossip block, six nodes pruned 2666 states and
panicked simultaneously at `store.rs:1314`.

## Fix

Add `self.head()` to `protected_roots` so the pruner never deletes
the state currently in use by fork choice, regardless of its slot
position relative to other branches' tips.
@github-actions

🤖 Kimi Code Review

This is a critical bug fix for chain availability. The change correctly prevents pruning of the current LMD GHOST head, which is essential for maintaining the canonical chain tip.

Analysis

File: crates/storage/src/store.rs
Lines: 777-780

The Issue

The original code only protected finalized and justified checkpoint roots. In Ethereum consensus (Casper FFG + LMD GHOST), the canonical head is typically ahead of the justified checkpoint. Without protecting the head:

  1. State Regeneration Failure: If prune_old_states removes states between the justified checkpoint and the head, the node cannot regenerate the head state to process new blocks.
  2. Fork Choice Breakage: If prune_old_blocks removes blocks in the current head's ancestry, the fork choice rule cannot execute properly, potentially causing chain stalls or crashes during reorgs.
  3. Infinite Loops: As noted in the existing comment (line 773), this could trigger "infinite re-processing loops" when the node tries to fetch pruned data required for processing pending attestations or blocks.

Verification

The fix is correct because:

  • The protected_roots array is passed to prune_old_states and prune_old_blocks, which presumably retain these roots and their ancestors while pruning older data.
  • Since head() is always a descendant of the justified checkpoint (in valid consensus states), protecting it ensures the entire chain segment from finalized → justified → head remains available.
  • The &mut self borrow ensures atomicity—head, justified, and finalized values are sampled consistently without race conditions.

Suggestions

1. Add explanatory comment (line 777)

let protected_roots = [
    self.latest_finalized().root,
    self.latest_justified().root,
    self.head(), // Protect current head to preserve chain tip for fork choice
];

2. Verify ancestor completeness
Ensure prune_old_states and prune_old_blocks handle the case where protected roots might share ancestry (e.g., when head == justified) to avoid redundant retention logic, though this is harmless for correctness.

3. Database consistency check
Consider asserting that self.head() exists in the block store before pruning. If the head root is somehow corrupted or missing, pruning should abort:

debug_assert!(self.blocks.contains_key(&self.head()), "Head block must exist");

Conclusion

Approve. This fixes a potential consensus-critical bug where the node could delete its own chain tip. The change is minimal, safe, and follows the existing pattern.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Codex Code Review

No correctness, security, or performance regressions stood out in the diff. Protecting self.head() in crates/storage/src/store.rs:776 is the right fix for the deferred-pruning path.

One test gap remains:

  1. crates/storage/src/store.rs:1693 through :1709 does not actually cover the regression this PR fixes. The test sets head_root to the newest inserted root, which is already inside the retention window, so it would pass even without the new self.head() protection at :777-780. Add a case where head() is outside retention and distinct from both justified/finalized, then assert its block and state both survive prune_old_data().

Aside from that missing regression coverage, the patch looks good.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions

🤖 Claude Code Review

Review: fix(storage): protect head root from state pruning

Overview

The PR fixes a production panic (head state is always available) that occurred when finalization stalled and a minority fork accumulated enough high-slot blocks to push the actual fork-choice head below the bottom of the STATES_TO_KEEP = 3_000 retention window. Adding self.head() to protected_roots is the correct minimal fix.


Correctness

The fix is sound. The protected_roots set acts as an escape hatch in prune_old_states (line 913) and prune_old_blocks (line 961): entries are deleted only if they fall outside the top-N window and are not in the set. Adding the head root ensures the head's state row and block row survive regardless of slot ranking. The HashSet deduplication handles the case where head == finalized.root or head == justified.root without any extra work.
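
The deletion rule and the deduplication described above can be sketched as follows. This is an illustration under stated assumptions, not the code at lines 913 or 961: `protected_set` and `should_delete` are hypothetical names, and a root is assumed to be a 32-byte array.

```rust
use std::collections::HashSet;

// Assumption: a block root is a 32-byte hash.
type Root = [u8; 32];

/// Collects the three protected roots. Because this is a set, a head that
/// equals the finalized or justified root is stored only once.
fn protected_set(finalized: Root, justified: Root, head: Root) -> HashSet<Root> {
    HashSet::from([finalized, justified, head])
}

/// An entry is deleted only if it ranks outside the top `window` entries
/// by slot AND its root is not protected.
fn should_delete(rank: usize, window: usize, root: &Root, protected: &HashSet<Root>) -> bool {
    rank >= window && !protected.contains(root)
}
```

When `head == justified`, `protected_set` yields two entries instead of three, and `should_delete` behaves the same either way, which is why no special-casing is needed.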

The fix also correctly protects the head's block entry via prune_old_blocks, which is good—head_state() reads the state, but other call sites (e.g., store.rs:103 in the blockchain crate) read the head state via get_state(&store.head()).expect("head state exists"), and safe_target_slot() at line 1334-1337 reads the block header with another .expect(). Block protection was already implicit through the larger BLOCKS_TO_KEEP window, but having an explicit guard is safer.


Issues

1. Stale doc comments (minor)

prune_old_states doc at line 882 and prune_old_blocks doc at line 931 both still say "(finalized, justified)":

// line 882
/// Keeps the most recent `STATES_TO_KEEP` states (by slot), plus any
/// states whose roots appear in `protected_roots` (finalized, justified).

These should now read "(finalized, justified, head)" to match what prune_old_data actually passes.

2. Regression test is missing (notable)

The PR's own test plan includes an unchecked box for the critical regression case:

Add regression test: head root at a low slot, retention window otherwise filled by a higher-slot branch; assert head's state is preserved after prune_old_data.

The existing fallback_pruning_removes_old_states_and_blocks test (line 1660) sets head_root = root(total_states - 1), which is the highest-slot entry and always inside the retention window. It does not exercise the bug scenario at all. The comment at line 1704 also needs updating—it says "2 are protected (finalized + justified)"—which is currently still true observationally for that test, but becomes misleading now that the head is a third protected root (which just happens to be in-window in this test, so the pruned count doesn't change).

A minimal regression test would look like:

  • Insert STATES_TO_KEEP + N states for a competing branch (slots 1 … 3000+N).
  • Insert 1 additional state for the head at slot 0 (or any slot < the retention window floor).
  • Call update_checkpoints to set the head root to the low-slot entry.
  • Call prune_old_data().
  • Assert the head's state entry still exists.
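
The steps above can be sketched against a toy in-memory model. Everything here is an assumption made for illustration: `ToyStore`, its fields, and `run_scenario` are hypothetical, roots are plain integers, a state is reduced to its slot, the head is set directly rather than through `update_checkpoints`, and `head_state` returns an `Option` instead of panicking. A real test would use the project's own store and test helpers.

```rust
use std::collections::{HashMap, HashSet};

const STATES_TO_KEEP: usize = 3_000;

// Assumption: roots are plain integers and a state is just its slot.
type Root = u64;

struct ToyStore {
    states: HashMap<Root, u64>, // root -> slot
    finalized: Root,
    justified: Root,
    head: Root,
    protect_head: bool, // toggles the fix so both outcomes can be compared
}

impl ToyStore {
    /// Keeps the top `STATES_TO_KEEP` states by slot plus the protected roots.
    /// Returns how many states were deleted.
    fn prune_old_data(&mut self) -> usize {
        let mut protected = HashSet::from([self.finalized, self.justified]);
        if self.protect_head {
            protected.insert(self.head);
        }
        let mut ranked: Vec<(Root, u64)> =
            self.states.iter().map(|(r, s)| (*r, *s)).collect();
        ranked.sort_by(|a, b| b.1.cmp(&a.1)); // highest slot first
        let mut pruned = 0;
        for (root, _slot) in ranked.into_iter().skip(STATES_TO_KEEP) {
            if !protected.contains(&root) {
                self.states.remove(&root);
                pruned += 1;
            }
        }
        pruned
    }

    /// Non-panicking stand-in for `head_state()`.
    fn head_state(&self) -> Option<u64> {
        self.states.get(&self.head).copied()
    }
}

/// Returns (number of pruned states, head state if it survived).
fn run_scenario(protect_head: bool) -> (usize, Option<u64>) {
    let n: u64 = 5;
    let head_root: Root = u64::MAX; // distinct from every branch root
    let mut states = HashMap::new();
    // Competing branch: slots 1..=3000+N, using the slot as the root.
    for slot in 1..=(STATES_TO_KEEP as u64 + n) {
        states.insert(slot, slot);
    }
    // Head state at slot 0, below the retention window floor.
    states.insert(head_root, 0);
    let mut store = ToyStore {
        states,
        finalized: 1,
        justified: 2,
        head: head_root,
        protect_head,
    };
    let pruned = store.prune_old_data();
    (pruned, store.head_state())
}
```

In this model the head root is deliberately distinct from the finalized and justified roots, so the test only passes when the head itself is protected: `run_scenario(false)` loses the head state, while `run_scenario(true)` keeps it.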

3. safe_target may be vulnerable to the same pattern (out of scope, but worth tracking)

safe_target_slot() at line 1334 calls get_block_header(&self.safe_target()).expect("safe target exists"). The safe target is updated by update_safe_target on each fork choice run and is normally near the head, but in the same stall scenario described in the PR the safe target could also drift outside the block retention window over a long enough stall. The safe target's block is not explicitly added to protected_roots. This PR is not the right place to fix it, but it is worth a follow-up issue.


Summary

The one-line fix is correct and properly addresses the panic. The main gap before merging is the missing regression test, which the PR itself acknowledges. The doc comment updates are low-stakes but should be made to keep the inline documentation accurate.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Contributor

greptile-apps Bot commented May 28, 2026

Greptile Summary

Adds self.head() to protected_roots inside prune_old_data so the fork-choice head block root is never evicted during state/block pruning, preventing the head state is always available panic that occurred when finalization stalled and a minority fork's higher-slot headers crowded the head out of the 3,000-entry retention window.

  • prune_old_data now passes three roots to prune_old_states and prune_old_blocks: finalized, justified, and head — protecting the head regardless of its slot position relative to competing branches.
  • The fix is correct and minimal; the doc comments on prune_old_states/prune_old_blocks and an inline test comment still describe protected_roots as "(finalized, justified)" only and need updating.
  • A regression test covering the exact failure scenario (head at low slot, competing branch fills the retention window) is noted in the PR checklist but not yet included.

Confidence Score: 4/5

Safe to merge — the one-line fix correctly addresses the panic, and the only gaps are stale doc comments and an unwritten regression test.

The change is small and targeted. The core logic is sound: adding the head root to protected_roots prevents it from being evicted when a minority fork's higher-slot headers dominate the retention window. The existing periodic-pruning tests still pass because the head in those tests sits inside the retention window anyway, so they don't validate the exact failure scenario. The missing regression test and stale doc comments are the only open items.

The doc comment updates and regression test are both in crates/storage/src/store.rs.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/storage/src/store.rs | Adds self.head() to protected_roots in prune_old_data so the fork-choice head state is never evicted; fix is correct but the targeted regression test (head at low slot, competing branch fills window) is not yet added, and two doc comments still describe protected_roots as "(finalized, justified)" only. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[prune_old_data called] --> B[Build protected_roots]
    B --> C[latest_finalized root]
    B --> D[latest_justified root]
    B --> E[self.head - NEW]
    C & D & E --> F[prune_old_states\nkeep top STATES_TO_KEEP by slot\n+ protected_roots]
    C & D & E --> G[prune_old_blocks\nkeep top BLOCKS_TO_KEEP by slot\n+ protected_roots]
    F --> H{head slot inside\ntop 3000 window?}
    H -- Yes\nnormal case --> I[head state safe\nvia retention window]
    H -- No\nstalled finalization --> J[head state safe\nvia protected_roots]
    J --> K[head_state succeeds\nno panic]
    I --> K

Comments Outside Diff (4)

  1. crates/storage/src/store.rs, lines 879-882

    P2 The doc comments on prune_old_states and prune_old_blocks still describe protected_roots as "(finalized, justified)" only. Since the head root is now also passed in, the parenthetical is stale and could mislead future readers.

  2. crates/storage/src/store.rs, lines 930-931

    P2 Same stale parenthetical in prune_old_blocks — the head root is now a third protected root alongside finalized and justified.

  3. crates/storage/src/store.rs, lines 1704-1706

    P2 The inline count comment is now slightly misleading: protected_roots now carries three entries (finalized, justified, head), but the head in this test sits at slot 3004—inside the top-3000 window—so it never becomes a pruning candidate and the arithmetic stays correct. Clarifying this avoids confusion for the next person reading the test.

  4. crates/storage/src/store.rs, lines 1722-1759

    P2 Regression test for the exact panic scenario is missing

    The PR description marks this as an outstanding checkbox, but neither of the existing periodic-pruning tests covers the actual failure: a head at a low slot while a competing branch's higher-slot headers fill the retention window. Both fallback_pruning_removes_old_states_and_blocks and fallback_pruning_no_op_within_retention place the head at the top of the slot range (slot 3004 / slot 2999), so the head was never in the pruning-candidate set even before this fix — those tests would pass unchanged with or without self.head() in protected_roots. A test that inserts STATES_TO_KEEP + N headers on a high-slot branch, sets the head to a low-slot root, calls prune_old_data, and then asserts head_state() succeeds (and the head's state row still exists) would directly validate the fix.

Prompt To Fix All With AI
Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 4
crates/storage/src/store.rs:879-882
The doc comments on `prune_old_states` and `prune_old_blocks` still describe `protected_roots` as "(finalized, justified)" only. Since the head root is now also passed in, the parenthetical is stale and could mislead future readers.

```suggestion
    /// Prune old states beyond the retention window.
    ///
    /// Keeps the most recent `STATES_TO_KEEP` states (by slot), plus any
    /// states whose roots appear in `protected_roots` (finalized, justified, head).
```

### Issue 2 of 4
crates/storage/src/store.rs:930-931
Same stale parenthetical in `prune_old_blocks` — the head root is now a third protected root alongside finalized and justified.

```suggestion
    /// Keeps the most recent `BLOCKS_TO_KEEP` blocks (by slot), plus any
    /// blocks whose roots appear in `protected_roots` (finalized, justified, head).
```

### Issue 3 of 4
crates/storage/src/store.rs:1704-1706
The inline count comment is now slightly misleading: `protected_roots` now carries three entries (finalized, justified, head), but the head in this test sits at slot 3004—inside the top-3000 window—so it never becomes a pruning candidate and the arithmetic stays correct. Clarifying this avoids confusion for the next person reading the test.

```suggestion
        // 3005 headers total. Top 3000 by slot are kept in the retention window,
        // leaving 5 candidates. 2 of those candidates are protected (finalized=slot0,
        // justified=slot1); head=slot3004 is also protected but already inside the
        // retention window, so it is not a candidate and doesn't change the count.
        // 5 - 2 = 3 pruned → 3005 - 3 = 3002 states remaining.
```
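
The count in the suggested comment can be checked with a small helper. This is only a sketch of the arithmetic: `remaining_after_prune` is a hypothetical function, not part of the store, and it assumes every protected root counted in `protected_outside_window` actually falls outside the retention window.

```rust
/// Number of entries left after keeping the top `window` entries by slot
/// plus `protected_outside_window` protected entries below that window.
fn remaining_after_prune(total: usize, window: usize, protected_outside_window: usize) -> usize {
    // Entries ranked below the window are the only pruning candidates.
    let candidates = total.saturating_sub(window);
    // Protected candidates are spared; the rest are deleted.
    let pruned = candidates.saturating_sub(protected_outside_window);
    total - pruned
}
```

With 3005 entries, a window of 3000, and 2 protected candidates (finalized and justified), the helper gives 3005 - (5 - 2) = 3002. The head at slot 3004 is inside the window, so it does not count toward `protected_outside_window`.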

### Issue 4 of 4
crates/storage/src/store.rs:1722-1759
**Regression test for the exact panic scenario is missing**

The PR description marks this as an outstanding checkbox, but neither of the existing periodic-pruning tests covers the actual failure: a head at a *low* slot while a competing branch's higher-slot headers fill the retention window. Both `fallback_pruning_removes_old_states_and_blocks` and `fallback_pruning_no_op_within_retention` place the head at the *top* of the slot range (slot 3004 / slot 2999), so the head was never in the pruning-candidate set even before this fix — those tests would pass unchanged with or without `self.head()` in `protected_roots`. A test that inserts `STATES_TO_KEEP + N` headers on a high-slot branch, sets the head to a low-slot root, calls `prune_old_data`, and then asserts `head_state()` succeeds (and the head's state row still exists) would directly validate the fix.

Reviews (1): Last reviewed commit: "fix(storage): protect head root from sta..."

@MegaRedHand MegaRedHand merged commit 172dd92 into main May 28, 2026
7 checks passed
@MegaRedHand MegaRedHand deleted the fix/prune-protects-head-state branch May 28, 2026 15:34