feat(p2p): use BlocksByRange for long-range sync by dicethedev · Pull Request #351 · lambdaclass/ethlambda

dicethedev · 2026-05-08T05:05:58Z

🗒️ Description / Motivation

This PR closes #347 by wiring the BlocksByRange protocol added in #348 into the status-response sync path.

Previously, when a peer's Status response revealed it was ahead of our local head, we had no mechanism to backfill the gap. Now, when the gap exceeds a configurable threshold (LONG_RANGE_SYNC_THRESHOLD = 2 slots), we request the missing range using BlocksByRange instead of relying on gossip or individual BlocksByRoot fetches.

For small gaps (1–2 slots), we defer to the existing FetchBlock path since roots are typically already available from gossip and BlocksByRoot is more precise for that case.

What Changed

lib.rs

Added LONG_RANGE_SYNC_THRESHOLD: u64 = 2 constant

req_resp/handlers.rs

Updated handle_status_response to branch on gap size:
- gap > LONG_RANGE_SYNC_THRESHOLD → request_blocks_by_range_from_peer
- gap ≤ threshold → defers to gossip / FetchBlock

Correctness / Behavior Guarantees

request_blocks_by_range_from_peer already batches internally at MAX_REQUEST_BLOCKS (1024), so nodes thousands of slots behind are handled correctly across multiple requests with no additional changes
handle_blocks_by_range_response (added in feat(p2p): add inbound BlocksByRange req/resp support #348) already forwards each block to the blockchain layer — the response path is complete
BlocksByRoot behavior for individual missing blocks (FetchBlock, retry/backoff logic) is unchanged

Tests Added / Run

No new tests required. The range response handling and canonical block selection are covered by the test added in #348 (blocks_by_range_returns_canonical_blocks_in_requested_order).

Related Issues / PRs

Closes Use BlocksByRange for long range syncing #347
Depends on / builds on feat(p2p): add inbound BlocksByRange req/resp support #348
Related to feat: add BlocksByRange req/resp protocol leanEthereum/leanSpec#691

✅ Verification Checklist

Ran make fmt — clean
Ran make lint (clippy with -D warnings) — clean
Ran cargo test --workspace --release — all passing

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

greptile-apps · 2026-05-08T05:09:50Z

Greptile Summary

This PR wires the previously-added BlocksByRange protocol into the status-response sync path: when a peer's Status reveals it is more than LONG_RANGE_SYNC_THRESHOLD (2) slots ahead, the node now requests the missing range in batches of up to 1024 blocks instead of relying solely on gossip.

handle_status_response now computes the slot gap and dispatches request_blocks_by_range_from_peer, which loops ⌈gap / 1024⌉ times sending batched requests; the gap is taken directly from the peer's untrusted Status message with no upper bound.
canonical_blocks_by_range serves inbound requests by walking the canonical chain backwards from store.head() until reaching start_slot; traversal cost is O(head_slot − start_slot), not O(count), making it vulnerable to cheap deep-history requests.
handle_blocks_by_range_response forwards received blocks to the blockchain layer with no slot-range validation and no retry path wired into the OutboundFailure event handler.

Confidence Score: 3/5

The codec, messages, and module wiring changes are safe, but handlers.rs has three defects on the changed sync path that should be addressed before merging.

The gap passed to the batch-request loop is taken directly from a peer's unauthenticated Status message with no ceiling, so a malicious peer can send an astronomically large head slot and pin the node in a near-infinite async loop. Separately, BlocksByRange request IDs are never registered in request_id_map, so every outbound-failure event silently drops the failure with no retry, leaving the node permanently behind if any batch times out. Additionally, the inbound handler walks the canonical chain from head to the requested start_slot, making its cost proportional to chain depth rather than the requested block count — a cheap way to induce heavy work on the responding node.

crates/net/p2p/src/req_resp/handlers.rs warrants close attention across handle_status_response, request_blocks_by_range_from_peer, canonical_blocks_by_range, and the OutboundFailure branch of handle_req_resp_message.

Important Files Changed

Filename	Overview
crates/net/p2p/src/req_resp/handlers.rs	Core file with the most significant changes: adds handle_status_response gap-based branching, canonical_blocks_by_range chain traversal, and BlocksByRange request/response handlers — has uncapped gap loop, no OutboundFailure recovery, and O(head_slot) traversal vulnerability
crates/net/p2p/src/req_resp/codec.rs	Cleanly refactors decode_blocks_by_root_response into shared decode_blocks_response helper and wires BlocksByRange into the codec; no issues found
crates/net/p2p/src/req_resp/messages.rs	Adds BlocksByRangeRequest struct, BlocksByRange response payload variant, MAX_REQUEST_BLOCKS constant, and removes the dead_code attribute from error_message; straightforward and correct
crates/net/p2p/src/lib.rs	Adds LONG_RANGE_SYNC_THRESHOLD constant and registers BlocksByRange protocol with ProtocolSupport::Full in the swarm builder; no issues found
crates/net/p2p/src/req_resp/mod.rs	Re-exports new BlocksByRange symbols; trivial change with no issues

Sequence Diagram

sequenceDiagram
    participant Local as Local Node
    participant Peer as Remote Peer
    participant Blockchain as Blockchain Layer

    Local->>Peer: Status Request
    Peer-->>Local: Status Response (head_slot, finalized)
    
    Local->>Local: "gap = peer.head_slot - our_head_slot"

    alt "gap <= LONG_RANGE_SYNC_THRESHOLD (2)"
        Local->>Local: rely on gossip / FetchBlock (BlocksByRoot)
    else "gap > LONG_RANGE_SYNC_THRESHOLD"
        loop ceil(gap / MAX_REQUEST_BLOCKS) batches
            Local->>Peer: "BlocksByRange Request (start_slot, count<=1024, step=1)"
            Peer-->>Local: BlocksByRange Response [blocks...]
            Local->>Blockchain: new_block() for each block
        end
    end

    note over Local,Peer: Inbound path (serving requests)
    Peer->>Local: BlocksByRange Request
    Local->>Local: canonical_blocks_by_range() walk chain from head to start_slot
    Local-->>Peer: BlocksByRange Response [canonical blocks]

Comments Outside Diff (1)

crates/net/p2p/src/req_resp/handlers.rs, line 93-105 (link)

BlocksByRange outbound failures are silently discarded with no recovery

request_id_map is only populated for BlocksByRoot requests (see fetch_block_from_peer). When an OutboundFailure fires for a BlocksByRange request, the if let Some(root) = server.request_id_map.remove(&request_id) branch is never taken, so the failure is logged but nothing else happens. If any batch in a long-range sync fails (network error, peer disconnect, timeout), the sync silently stops with no retry or fallback. The node remains stuck behind its peers with no automatic recovery until the next Status message happens to arrive.

Prompt To Fix With AI

This is a comment left during a code review.
Path: crates/net/p2p/src/req_resp/handlers.rs
Line: 93-105

Comment:
**BlocksByRange outbound failures are silently discarded with no recovery**

`request_id_map` is only populated for `BlocksByRoot` requests (see `fetch_block_from_peer`). When an `OutboundFailure` fires for a `BlocksByRange` request, the `if let Some(root) = server.request_id_map.remove(&request_id)` branch is never taken, so the failure is logged but nothing else happens. If any batch in a long-range sync fails (network error, peer disconnect, timeout), the sync silently stops with no retry or fallback. The node remains stuck behind its peers with no automatic recovery until the next Status message happens to arrive.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 4
crates/net/p2p/src/req_resp/handlers.rs:141-151
**Uncapped gap triggers a near-infinite request loop**

`gap` is directly used as the `count` passed to `request_blocks_by_range_from_peer`, which loops `⌈gap / MAX_REQUEST_BLOCKS⌉` times sending requests. A malicious peer sending `Status { head.slot = u64::MAX }` would cause the loop to iterate ~1.8 × 10¹⁶ times — effectively hanging the node until the swarm channel closes (whose capacity determines how long that takes). Even a "legitimate" peer claiming to be 10 million slots ahead would immediately queue ~9,766 requests. There is no upper bound on how many batches are dispatched in a single call.

### Issue 2 of 4
crates/net/p2p/src/req_resp/handlers.rs:93-105
**BlocksByRange outbound failures are silently discarded with no recovery**

`request_id_map` is only populated for `BlocksByRoot` requests (see `fetch_block_from_peer`). When an `OutboundFailure` fires for a `BlocksByRange` request, the `if let Some(root) = server.request_id_map.remove(&request_id)` branch is never taken, so the failure is logged but nothing else happens. If any batch in a long-range sync fails (network error, peer disconnect, timeout), the sync silently stops with no retry or fallback. The node remains stuck behind its peers with no automatic recovery until the next Status message happens to arrive.

### Issue 3 of 4
crates/net/p2p/src/req_resp/handlers.rs:221-267
**O(head\_slot) chain traversal for old `start_slot` requests**

`canonical_blocks_by_range` always starts walking from `store.head()` and traverses backwards one header at a time until `header.slot < start_slot`. If a peer requests `start_slot = 0, count = 1024` on a chain whose head is at slot 1,000,000, the loop performs 1,000,000 `store.get_block_header` calls before collecting any of the 1024 requested blocks. Since `count` is bounded by `MAX_REQUEST_BLOCKS` but `start_slot` is not validated against the local chain, this becomes an unbounded-work request handler. A peer can exploit this to perform a cheap DoS by repeatedly requesting from `start_slot = 0`.

### Issue 4 of 4
crates/net/p2p/src/req_resp/handlers.rs:328-342
`handle_blocks_by_range_response` does not verify that the returned blocks' slots fall within the requested range. A misbehaving peer can inject arbitrary blocks (from a different range or a different chain) and they will be forwarded unconditionally to the blockchain layer. At minimum, the slot of each block should be cross-checked against the requested `[start_slot, start_slot + count)` window, which is available from the request context.

```suggestion
    if let Some(ref blockchain) = server.blockchain {
        for block in blocks {
            let block_root = block.message.hash_tree_root();
            let slot = block.message.slot;
            // TODO: validate block.message.slot is within the originally requested range.
            let _ = blockchain.new_block(block).inspect_err(|err| {
                error!(
                    %peer,
                    %slot,
                    block_root = %ethlambda_types::ShortRoot(&block_root.0),
                    %err,
                    "Failed to forward range-fetched block to blockchain"
                )
            });
        }
    }
```

_{Reviews (1): Last reviewed commit: "fix(clippy): use is_multiple_of for slot..." | Re-trigger Greptile}

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

dicethedev · 2026-05-12T21:27:10Z

@MegaRedHand do we still need this PR now that #355 is planned? Since #355 adds a slot/block-number index for efficient BlocksByRange lookups, I’m wondering if this PR should wait for that, or if it’s still useful to keep this implementation for now and let #355 optimize the lookup path later.

MegaRedHand · 2026-05-13T13:55:49Z

Let's keep this PR and #355 separate. It's easier to review this way

MegaRedHand

Left some comments.

- Remove redundant MAX_SLOT_LOOKBACK guard - Always trigger long-range sync on status response - Deduplicate range requests across peers

dicethedev · 2026-05-15T03:28:59Z

Left some comments.

You can review again.

MegaRedHand

LGTM. We'll run some tests before merging

MegaRedHand · 2026-05-19T22:41:39Z

@dicethedev the CI is failing. The step field was removed in #365

…te request builder

MegaRedHand

Left some more comments, but they can be addressed in another PR.

Also, there's an issue with decode_payload (#384), which makes response decoding fail for BlocksByRange. We need to fix that before merging this. Feel free to look at that if you have time.

dicethedev · 2026-05-20T18:22:26Z

Also, there's an issue with decode_payload (#384), which makes response decoding fail for BlocksByRange. We need to fix that before merging this. Feel free to look at that if you have time.

@MegaRedHand Thanks! So regarding the decode_payload issue (#384), would you prefer I fix that on this branch as part of the current PR, or should I open a separate PR for it?

MegaRedHand · 2026-05-20T18:29:32Z

A separate PR would be best. We need to add tests for it anyway, so this PR isn't actually needed for testing.

MegaRedHand

Blocked until #384 is fixed

…nd_requests enum

dicethedev · 2026-06-01T13:35:24Z

@MegaRedHand still waiting for reviews

MegaRedHand · 2026-06-01T14:34:00Z

PR looks good, but it requires #384 to be fixed first

dicethedev · 2026-06-01T14:54:05Z

PR looks good, but it requires #384 to be fixed first

I will push a fix soon for #384

…403) When `--checkpoint-sync-url` is provided, the node currently always downloads a fresh finalized state from the peer, even if a recent state is already on disk. Skip the network round-trip when the persisted state is fresh enough to resume from. This PR introduces a state data freshness threshold - `MAX_RESUMABLE_DB_STATE_AGE = 450` slots (~30 min at 4s/slot) - picked conservatively considering the per-block backfill cost. I couldn't identify in the specs a formal weak-subjectivity period or a method of calculating it, so this is a judgement call; happy to take any suggestions on the value or a better approach for it. ## What Changed - `crates/storage/src/store.rs` — added `Store::from_db_state(backend, expected_genesis_time)`, a no-write constructor that wraps an already-initialized backend. Returns `None` if the backend is empty or its persisted `genesis_time` doesn't match. Added the `MAX_RESUMABLE_DB_STATE_AGE = 450` constant (~30 min at 4s/slot). - `crates/storage/src/lib.rs` — re-exported `MAX_RESUMABLE_DB_STATE_AGE`. - `bin/ethlambda/src/main.rs` — in the checkpoint-sync branch of `fetch_initial_state`, try `Store::from_db_state` first. If the persisted finalized slot is within `MAX_RESUMABLE_DB_STATE_AGE` of wall-clock, return the resumed store and skip checkpoint sync. Otherwise warn and fall through to the existing sync path. ## Correctness / Behavior Guarantees - New short-circuit fires only when: checkpoint URL provided AND DB populated AND persisted `genesis_time` matches AND `current_slot - latest_finalized.slot <= MAX_RESUMABLE_DB_STATE_AGE`. Everything else remains unchanged. - 30 min threshold is conservative for the current `BlocksByRoot`-only backfill cost; can be increased once `BlocksByRange` long-range sync (#351) is added. ## Tests Added / Run Three unit tests in `crates/storage/src/store.rs` covering the `from_db_state` contract: - `from_db_state_returns_none_on_empty_backend` - `from_db_state_returns_some_on_matching_genesis_time` - `from_db_state_returns_none_on_genesis_time_mismatch` ## Related Issues / PRs - Closes #121 - Related to #351 (BlocksByRange long-range sync — once landed, `MAX_RESUMABLE_DB_STATE_AGE` can probably be raised toward `STATES_TO_KEEP`) ## ✅ Verification Checklist - [x] Ran `make fmt` — clean - [x] Ran `make lint` (clippy with `-D warnings`) — clean - [x] Ran `cargo test --workspace --release` — all passing - [x] Local devnet test --------- Co-authored-by: Pablo Deymonnaz <pdeymon@fi.uba.ar>

## 🗒️ Description / Motivation - Fixes req-resp payload decoding so multi-chunk responses are handled correctly. - `decode_payload` previously read the entire remaining stream, causing BlocksByRange / multi-block responses to fail when the first chunk decoder consumed bytes belonging to following chunks. - This solves snappy decode failures for responses containing multiple payload chunks. ## What Changed - Updated `crates/net/p2p/src/req_resp/encoding.rs`. - Reworked `decode_payload` to read one varint-prefixed snappy payload at a time. - Added async helpers for reading the varint prefix and one snappy frame. - Added tests covering adjacent payloads, large snappy-framed payloads, and empty payload boundaries. ## Correctness / Behavior Guarantees - Each response chunk is decoded independently. - The next response code / payload remains unread for the outer response loop. - Existing maximum payload and compressed payload size checks are preserved. - Response size metrics now report the compressed size for the current payload only. ## Tests Added / Run - Added req-resp encoding tests. - Ran: - `cargo fmt --check` - `cargo test -p ethlambda-p2p` ## Related Issues / PRs - Closes #384 - Related to #351 ## ✅ Verification Checklist - [ x] Ran `make fmt` — clean - [ x] Ran `make lint` (clippy with `-D warnings`) — clean - [ x] Ran `cargo test --workspace --release` — all passing Co-authored-by: Tomás Grüner <47506558+MegaRedHand@users.noreply.github.com>

MegaRedHand

Retry functionality is not working, currently. Let's remove it before merging this PR.

We can add the retry functionality in another PR. I left some comments regarding that.

MegaRedHand · 2026-06-03T18:53:36Z

+                    server
+                        .pending_range_requests
+                        .remove(&(start_slot, end_slot));
+                    send_after(
+                        Duration::from_millis(500),
+                        ctx.clone(),
+                        p2p_protocol::RetryRangeSync {
+                            peer_id: peer,
+                            start_slot,
+                            end_slot: total_end_slot, // retry the full remaining range
+                        },
+                    );


Let's make this similar to handle_fetch_failure. In fact, I think we can remove this match altogether, something like before:

if let Some(pending_request) = server.outbound_requests.remove(&request_id) { handle_fetch_failure(server, pending_request, peer, ctx).await; }

And then have the match inside handle_fetch_failure. Also, we should add retry tracking to the range requests too.

MegaRedHand · 2026-06-03T18:54:40Z

+        // safety check: if already synced, skip retry
+        let still_needed = !self
+            .pending_range_requests
+            .contains(&(start_slot, end_slot));
+
+        if still_needed {
+            tracing::trace!(%peer, start_slot, end_slot, "Skipping retry, range already resolved");
+            return;
+        }


We remove the entry from pending_range_requests right before sending the RetryRangeSync message. That would make this always land on this skip on error.

MegaRedHand · 2026-06-03T19:15:31Z

+    loop {
+        let covered = server
+            .pending_range_requests
+            .iter()
+            .find(|&&(s, e)| s <= effective_start && effective_start <= e)
+            .copied();
+        match covered {
+            Some((_, covered_end)) => effective_start = covered_end + 1,
+            None => break,
+        }
+        if effective_start > total_end_slot {
+            info!(
+                %peer,
+                start_slot,
+                total_end_slot,
+                "BlocksByRange fully covered by in-flight requests, skipping"
+            );
+            return true;
+        }
+    }


We should look for a better way to handle deduplication. I'm thinking we can simplify this by having a single BlocksByRange request in progress, then we can just merge all seen ranges.

When we receive the first new range, we initialize our total range to the start and end of that range. Since the start can't possibly change unless we advance our head, we can add to this range by bumping the end slot whenever we find a newer range from peer status responses. Then, whenever we receive a response for a BlocksByRange message, we'd increase the start slot to the end of the received range and request the next chunk. Once we reach the end slot, we can stop requesting blocks, and represent that by setting our "queue" to None.

For peer tracking, we can have a list of peers with the latest slot we saw in their status message. We can then request ranges from peers that we know have them, while dropping peers with an old head.

// We can also use std::range::Range<u64> instead range_sync_state: Option<RangeSyncState> struct RangeSyncState { current_range: Range<u64>, peer_set: HashMap<PeerId, u64>, }

dicethedev and others added 5 commits May 7, 2026 01:17

feat(p2p): add inbound BlocksByRange req/resp support

576b191

Update crates/net/p2p/src/req_resp/handlers.rs

cba8385

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Merge branch 'main' into feat/blocks-by-range-inbound-support

0130146

feat(p2p): use BlocksByRange for long-range sync

815814e

fix(clippy): use is_multiple_of for slot step check

9fc8d73

greptile-apps Bot reviewed May 8, 2026

View reviewed changes

Comment thread crates/net/p2p/src/req_resp/handlers.rs Outdated

Comment thread crates/net/p2p/src/req_resp/handlers.rs Outdated

Comment thread crates/net/p2p/src/req_resp/handlers.rs Outdated

dicethedev and others added 5 commits May 8, 2026 06:10

Update crates/net/p2p/src/req_resp/handlers.rs

9028a29

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

fix(p2p): cap BlocksByRange sync count to prevent unbounded request loop

71bc472

fix(p2p): bound canonical_blocks_by_range traversal depth to prevent DoS

300a1af

Merge branch 'main' into feat/blocks-by-range-long-range-sync

659456e

Fix BlocksByRange response routing

1e97021

MegaRedHand reviewed May 14, 2026

View reviewed changes

Comment thread crates/net/p2p/src/req_resp/handlers.rs Outdated

Comment thread crates/net/p2p/src/req_resp/handlers.rs Outdated

Comment thread crates/net/p2p/src/req_resp/handlers.rs Outdated

fix(p2p): address BlocksByRange review feedback

41bcebf

- Remove redundant MAX_SLOT_LOOKBACK guard - Always trigger long-range sync on status response - Deduplicate range requests across peers

MegaRedHand approved these changes May 19, 2026

View reviewed changes

Merge branch 'main' into feat/blocks-by-range-long-range-sync

f201a38

fix(p2p): remove legacy step field from BlocksByRangeRequest and upda…

50e85ac

…te request builder

dicethedev requested a review from MegaRedHand May 19, 2026 23:52

MegaRedHand reviewed May 20, 2026

View reviewed changes

pablodeymo added 2 commits May 22, 2026 18:30

Merge branch 'main' into feat/blocks-by-range-long-range-sync

3cac9a6

Merge branch 'main' into feat/blocks-by-range-long-range-sync

f7c9d34

MegaRedHand requested changes May 27, 2026

View reviewed changes

refactor(p2p): merge request_id_map and range_request_ids into outbou…

c430e79

…nd_requests enum

dicethedev requested a review from MegaRedHand May 28, 2026 15:55

conache mentioned this pull request Jun 1, 2026

feat(node): skip checkpoint sync when recent state data is in the db #403

Merged

4 tasks

dicethedev mentioned this pull request Jun 1, 2026

Handle req-resp responses with multiple chunks #407

Merged

Merge branch 'main' into feat/blocks-by-range-long-range-sync

641dc51

MegaRedHand reviewed Jun 3, 2026

View reviewed changes

Conversation

dicethedev commented May 8, 2026

🗒️ Description / Motivation

What Changed

Correctness / Behavior Guarantees

Tests Added / Run

Related Issues / PRs

✅ Verification Checklist

Uh oh!

greptile-apps Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dicethedev commented May 12, 2026

Uh oh!

MegaRedHand commented May 13, 2026

Uh oh!

MegaRedHand left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dicethedev commented May 15, 2026

Uh oh!

MegaRedHand left a comment

Choose a reason for hiding this comment

Uh oh!

MegaRedHand commented May 19, 2026

Uh oh!

MegaRedHand left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dicethedev commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MegaRedHand commented May 20, 2026

Uh oh!

MegaRedHand left a comment

Choose a reason for hiding this comment

Uh oh!

dicethedev commented Jun 1, 2026

Uh oh!

MegaRedHand commented Jun 1, 2026

Uh oh!

dicethedev commented Jun 1, 2026

Uh oh!

MegaRedHand left a comment

Choose a reason for hiding this comment

Uh oh!

MegaRedHand Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

MegaRedHand Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

MegaRedHand Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps Bot commented May 8, 2026 •

edited

Loading

dicethedev commented May 20, 2026 •

edited

Loading