
Statement-store: defer statement protocol connections during major sync#11487

Open
DenzelPenzel wants to merge 38 commits into master from denzelpenzel/statement-store-loss-during-major-sync

Conversation

Contributor

@DenzelPenzel DenzelPenzel commented Mar 24, 2026

Description

Implement #11411

Reconnect all statement protocol peers when major sync ends. Removing and re-adding peers to the reserved set closes and reopens the notification substreams on both sides, triggering a fresh bidirectional initial sync. This causes peers to re-deliver any statements that were dropped during the sync window.

  • On the was_major_syncing → false transition, call reconnect_statement_peers, which removes and then re-adds all connected peers to the statement protocol reserved set
    • Both sides receive NotificationStreamClosed then NotificationStreamOpened, restarting the 100ms initial sync burst
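The remove-then-re-add cycle described above can be modeled with a minimal sketch. This is not the actual Substrate API — `StatementHandler`, the `u32` peer ids, and the `reconnect_events` counter are simplified stand-ins for the real `NetworkService` reserved-set calls — but it captures the state transition that triggers the reconnect:

```rust
// Simplified model of the reconnect-on-sync-end logic: when `was_major_syncing`
// flips from true to false, every connected statement peer is removed from and
// re-added to the reserved set, which in the real protocol closes and reopens
// the notification substreams and restarts the initial sync.
use std::collections::HashSet;

#[derive(Default)]
struct StatementHandler {
    was_major_syncing: bool,
    connected_peers: HashSet<u32>, // peer ids, simplified to integers
    reserved_set: HashSet<u32>,
    reconnect_events: usize, // counts forced close+reopen cycles
}

impl StatementHandler {
    /// Called on every sync-status update.
    fn on_sync_status(&mut self, is_major_syncing: bool) {
        if self.was_major_syncing && !is_major_syncing {
            self.reconnect_statement_peers();
        }
        self.was_major_syncing = is_major_syncing;
    }

    /// Remove then re-add all connected peers to the reserved set, forcing a
    /// substream close + reopen and therefore a fresh bidirectional sync.
    fn reconnect_statement_peers(&mut self) {
        for peer in self.connected_peers.iter() {
            self.reserved_set.remove(peer);
        }
        for peer in self.connected_peers.iter() {
            self.reserved_set.insert(*peer);
            self.reconnect_events += 1;
        }
    }
}

fn main() {
    let mut h = StatementHandler::default();
    h.connected_peers.extend([1, 2]);
    h.reserved_set.extend([1, 2]);
    h.on_sync_status(true);  // major sync starts
    h.on_sync_status(false); // major sync ends -> reconnect all peers
    assert_eq!(h.reconnect_events, 2);
    assert!(h.reserved_set.contains(&1) && h.reserved_set.contains(&2));
}
```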

Changes

  • reconnect_statement_peers: fires when major sync ends, reconnects all statement peers to trigger fresh initial sync
  • New zombienet integration test: statement_store_peer_disconnect_during_major_sync — verifies that statements submitted before a late-joining node enters major sync are fully recovered after sync completes, confirmed via a log assertion that reconnect_statement_peers fired

@DenzelPenzel DenzelPenzel requested a review from alexggh March 24, 2026 22:01
Comment thread substrate/client/network/statement/src/lib.rs
@DenzelPenzel DenzelPenzel marked this pull request as draft March 25, 2026 10:14
@DenzelPenzel DenzelPenzel marked this pull request as ready for review March 26, 2026 12:10
@DenzelPenzel DenzelPenzel added the T0-node This PR/Issue is related to the topic “node”. label Mar 27, 2026
@DenzelPenzel
Contributor Author

/cmd prdoc --audience node_dev --bump patch

Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn left a comment


Looks good (modulo my poor knowledge of Substrate networking internals)

Comment thread substrate/client/network/statement/src/lib.rs Outdated
@DenzelPenzel DenzelPenzel changed the title statement-store: defer statement protocol connections during major sync Statement-store: defer statement protocol connections during major sync Mar 31, 2026
Comment thread substrate/client/network/statement/src/lib.rs Outdated
Comment thread substrate/client/network/statement/src/lib.rs
Comment thread substrate/client/network/statement/src/lib.rs
Comment thread cumulus/zombienet/zombienet-sdk/tests/zombie_ci/statement_store.rs Outdated
@DenzelPenzel DenzelPenzel requested a review from alexggh April 1, 2026 11:26
…ent-store-loss-during-major-sync

# Conflicts:
#	cumulus/zombienet/zombienet-sdk/tests/zombie_ci/statement_store.rs
#	cumulus/zombienet/zombienet-sdk/tests/zombie_ci/statement_store/bench.rs
@DenzelPenzel DenzelPenzel force-pushed the denzelpenzel/statement-store-loss-during-major-sync branch from 555918e to 2934407 Compare April 2, 2026 10:26
@DenzelPenzel
Contributor Author

/cmd fmt

Contributor

@AndreiEres AndreiEres left a comment


Overall OK, but I feel uneasy that we lose the peer set on error in reconnect_statement_peers. And the zombienet test doesn't catch whether the fix works.

Comment thread substrate/client/network/statement/src/lib.rs Outdated
Comment thread substrate/client/network/statement/src/lib.rs Outdated
Comment thread cumulus/zombienet/zombienet-sdk/tests/zombie_ci/statement_store/integration.rs Outdated
Comment thread substrate/client/network/statement/src/lib.rs Outdated
@DenzelPenzel DenzelPenzel force-pushed the denzelpenzel/statement-store-loss-during-major-sync branch from 0c0b061 to 66157b3 Compare April 13, 2026 12:40
… github.com:paritytech/polkadot-sdk into denzelpenzel/statement-store-loss-during-major-sync

# Conflicts:
#	cumulus/zombienet/zombienet-sdk/tests/zombie_ci/statement_store/integration.rs
@DenzelPenzel DenzelPenzel force-pushed the denzelpenzel/statement-store-loss-during-major-sync branch from 66157b3 to b076543 Compare April 13, 2026 12:51
@DenzelPenzel
Contributor Author

/cmd fmt

@DenzelPenzel DenzelPenzel requested a review from AndreiEres April 13, 2026 18:33
Contributor

@AndreiEres AndreiEres left a comment


LGTM

Comment thread prdoc/pr_11487.prdoc Outdated
Comment thread .github/zombienet-tests/zombienet_cumulus_tests.yml Outdated
Comment thread cumulus/zombienet/zombienet-sdk/tests/zombie_ci/statement_store/integration.rs Outdated
/// Scenario:
/// 1. Spawn charlie only and let the relay chain advance ~10 blocks
/// 2. Submit multiple statements to charlie
/// 3. Add dave as a late joiner — dave will enter major sync because the chain has already
Contributor


Do you have anything to prove 3 happens?

Contributor Author


Yes, this condition confirms it:

dave_logs.lines().any(|l| l.contains("Major sync complete, reconnecting")),

Contributor


But this only confirms that a major sync happened, not that you received statements while you were in major sync.

Comment thread substrate/client/network/statement/src/lib.rs Outdated
Member

@bkchr bkchr left a comment


The current implementation is trying to achieve the goal in a very hacky way. We should just not add the peers to the reserved set as long as we are in major sync mode.

SyncEvent::PeerConnected(remote) => {
let addr = iter::once(multiaddr::Protocol::P2p(remote.into()))
.collect::<multiaddr::Multiaddr>();
let result = self.network.add_peers_to_reserved_set(
Member


You should not do this when you are in major sync mode. There is no need to connect to the other side, as long as you are doing the major sync.

When the major sync is then ready, you can start adding them to the reserved set and the substream should be opened.
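A hedged sketch of that alternative, using a simplified model rather than the real `SyncEvent`/`NetworkService` types: peers that connect during major sync are buffered instead of being added to the reserved set, and the buffer is flushed when sync ends. The names `DeferredPeers`, `pending`, and the `u32` peer ids are illustrative assumptions, not the actual implementation:

```rust
// Model of "do not add peers to the reserved set while in major sync":
// connections during sync are buffered in `pending` and only moved into the
// reserved set once the major sync has finished.
use std::collections::HashSet;

#[derive(Default)]
struct DeferredPeers {
    major_syncing: bool,
    pending: HashSet<u32>,      // peers seen during major sync, not yet reserved
    reserved_set: HashSet<u32>, // peers with an open statement substream
}

impl DeferredPeers {
    fn on_peer_connected(&mut self, peer: u32) {
        if self.major_syncing {
            // Defer: no statement substream is needed while catching up.
            self.pending.insert(peer);
        } else {
            self.reserved_set.insert(peer);
        }
    }

    fn on_peer_disconnected(&mut self, peer: u32) {
        self.pending.remove(&peer);
        self.reserved_set.remove(&peer);
    }

    fn on_sync_status(&mut self, is_major_syncing: bool) {
        if self.major_syncing && !is_major_syncing {
            // Major sync finished: reserve everything we buffered, which in
            // the real protocol opens the substreams and runs initial sync.
            for peer in self.pending.drain() {
                self.reserved_set.insert(peer);
            }
        }
        self.major_syncing = is_major_syncing;
    }
}

fn main() {
    let mut d = DeferredPeers::default();
    d.on_sync_status(true);  // major sync begins
    d.on_peer_connected(7);  // buffered, not reserved yet
    assert!(d.reserved_set.is_empty());
    d.on_sync_status(false); // sync ends: buffered peers are reserved
    assert!(d.reserved_set.contains(&7));
}
```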

Contributor Author

@DenzelPenzel DenzelPenzel Apr 15, 2026


Agreed that not adding new peers to the reserved set during major sync is the right approach for late joiners: we can buffer them and add them cleanly when sync ends, avoiding any substream opening during the sync window.

However, there is a gap for already-connected peers (nodes whose substream was open before major sync started): we don't have a way to re-trigger their initial sync without forcing a close and reopen. So even with the reserved-set approach, we still need something equivalent to the remove+add for already-connected peers to recover missed statements, wdyt? @bkchr

Member


However, there is a gap for already-connected peers (nodes whose substream was open before major sync started)

Major sync should start immediately, the moment the first peer is connected.

Contributor


Looking at this:

if target > best_block && target - best_block > MAJOR_SYNC_BLOCKS.into() {

There still seems to be a gap: if we fall behind 5 blocks, we get into major sync and then we will drop statements here, from nodes that are already connected:

https://github.com/paritytech/polkadot-sdk/pull/11487/changes#diff-1b228aee91454a7222489dbb0382d7a7161a19f17048cb2de2d7845f5cac533aL925.

So maybe the mitigation for this situation is to also accept statements while we are in major sync; I think that should be good enough.

Or, if we want to be bulletproof, we could note that we dropped statements because they arrived during a major sync, and have a timer (every minute, say) that removes some random peer from the reserved set if we dropped statements (basically forcing a disconnect) and then re-adds it the next minute. I would only disconnect a few peers, because we don't want to re-sync with all of them.
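The timer-based mitigation suggested above could look roughly like this minimal model. All names (`DropRecovery`, `tick`) are hypothetical, the "random" peer choice is simplified to an arbitrary set element, and a real implementation would go through the reserved-set API and track per-peer state:

```rust
// Sketch of the proposed mitigation: if statements were dropped during a major
// sync, force-reconnect one peer per timer interval (remove it from the
// reserved set on one tick, re-add it on the next) instead of reconnecting
// every peer at once.
use std::collections::HashSet;

#[derive(Default)]
struct DropRecovery {
    dropped_during_sync: bool,  // set when a statement is dropped mid-sync
    reserved_set: HashSet<u32>, // peers currently reserved
    disconnected: Option<u32>,  // peer removed last tick, to re-add this tick
}

impl DropRecovery {
    /// Called once per timer interval (the review suggests every minute).
    fn tick(&mut self) {
        // Re-add the peer removed last tick, completing its forced reconnect.
        if let Some(p) = self.disconnected.take() {
            self.reserved_set.insert(p);
            // One peer has re-delivered its statements; stop forcing reconnects.
            self.dropped_during_sync = false;
        }
        // If statements were dropped, remove one peer to force a disconnect.
        if self.dropped_during_sync {
            let victim = self.reserved_set.iter().next().copied();
            if let Some(p) = victim {
                self.reserved_set.remove(&p);
                self.disconnected = Some(p);
            }
        }
    }
}

fn main() {
    let mut r = DropRecovery::default();
    r.reserved_set.extend([1, 2, 3]);
    r.dropped_during_sync = true;
    r.tick(); // one peer is removed from the reserved set
    assert_eq!(r.reserved_set.len(), 2);
    r.tick(); // that peer is re-added; recovery is considered done
    assert_eq!(r.reserved_set.len(), 3);
    assert!(!r.dropped_during_sync);
}
```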

@DenzelPenzel DenzelPenzel force-pushed the denzelpenzel/statement-store-loss-during-major-sync branch from f6f1aa8 to 894595d Compare April 16, 2026 11:42
@DenzelPenzel DenzelPenzel force-pushed the denzelpenzel/statement-store-loss-during-major-sync branch from 894595d to 1803dab Compare April 16, 2026 11:44
@DenzelPenzel
Contributor Author

/cmd fmt


Labels

A4-backport-stable2603 Pull request must be backported to the stable2603 release branch T0-node This PR/Issue is related to the topic “node”.


6 participants