fix: include sharded channels in PubSub.conn() so cluster reconnect routes to the slot owner#3807
Open
igorkofman wants to merge 1 commit intoredis:masterfrom
Open
Conversation
…routing PubSub.conn() builds the channel list passed to newConn from c.channels only. For a PubSub with only sharded subscriptions (the normal ClusterClient.SSubscribe case), that list is empty on reconnect, so ClusterClient.pubSub falls back to nodes.Random(). The SSUBSCRIBE that resubscribe() then writes to the wrong node fails silently — _subscribe is write-only and isBadConn ignores MOVED to a different address — so the PubSub looks healthy but never receives another message. Append c.schannels to the list, after c.channels. The argument is only used for slot resolution in the cluster client; single-node, ring, and sentinel clients ignore it for routing. Adds a regression test that SSubscribes, kills the PubSub conn on every node (CLIENT KILL TYPE pubsub), triggers reconnect, and asserts delivery, repeated 8 times so a lucky random-node hit can't mask the bug.
|
Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset. In case there are security findings, they will be communicated to you as a comment inside the PR. Hope you’ll enjoy using Jit. Questions? Comments? Want to learn more? Get in touch with us. |
Member
|
Hello @igorkofman , there are couple of initiatives around pub sub at the moment. I do think the biggest one right now is #3717, maybe check it and align the changes to it if needed? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #3806.
Problem
PubSub.conn()builds the channel list passed tonewConnfromc.channelsonly. For aPubSubwith only sharded subscriptions (thenormal
ClusterClient.SSubscribecase), that list is empty on reconnect.ClusterClient.pubSub'snewConnclosure falls back tonodes.Random()when
len(channels) == 0, so the reconnect lands on a random node. TheSSUBSCRIBEwritten byresubscribe()to the wrong node fails silently(
_subscribeis write-only andisBadConnignoresMOVEDto a differentaddress), leaving a
PubSubthat looks healthy —Pingsucceeds,Receivekeeps running — but never receives another message until theprocess restarts.
Fix
Append
c.schannelsto the list, afterc.channels. Thechannelsargument is only used for slot resolution in the cluster client; the
single-node, ring, and sentinel clients ignore it for routing, so the
change is scoped to cluster behavior. Ordering schannels after channels
keeps behavior identical for
PubSubs that have any regularsubscriptions.
Confidence
osscluster_test.gothatSSubscribes,kills the PubSub conn on every cluster node (
CLIENT KILL TYPE pubsub),triggers reconnect, and asserts message delivery — repeated 8 times so a
lucky random-node hit can't mask the bug. Without the fix the chance of
a false pass on the 3+3 node test cluster is
(1/6)^8 ≈ 6e-7.Redis Cluster; the PubSub silently dropped to ~10% of expected message
delivery after a transient connection blip and stayed degraded until a
process restart. Confirmed the fix restores delivery after
CLIENT KILL.go build,go vet,gofmtclean on the patched files.Note
Medium Risk
Touches cluster PubSub reconnect/routing behavior; a bug here can silently drop messages after reconnects, though the change is small and covered by a targeted regression test.
Overview
Fixes a Redis Cluster sharded PubSub reconnect bug where
PubSub.conn()could reconnect to a random node when onlySSUBSCRIBEchannels were in use, causing resubscribe to land on the wrong shard and silently stop message delivery.PubSub.conn()now includes sharded channels (c.schannels) when building the channel list passed tonewConnso slot-owner resolution works on reconnect. Adds a regression test that repeatedly kills pubsub connections across all cluster nodes and verifiesSPUBLISHcontinues to reach subscribers after each reconnect.Reviewed by Cursor Bugbot for commit 7271dec. Bugbot is set up for automated code reviews on this repo. Configure here.