fix: include sharded channels in PubSub.conn() so cluster reconnect routes to the slot owner #3807

Open
igorkofman wants to merge 1 commit into redis:master from igorkofman:fix/cluster-ssubscribe-reconnect-slot-routing

igorkofman commented May 7, 2026

Fixes #3806.

Problem

PubSub.conn() builds the channel list passed to newConn from
c.channels only. For a PubSub with only sharded subscriptions (the
normal ClusterClient.SSubscribe case), that list is empty on reconnect.
ClusterClient.pubSub's newConn closure falls back to nodes.Random()
when len(channels) == 0, so the reconnect lands on a random node. The
SSUBSCRIBE written by resubscribe() to the wrong node fails silently
(_subscribe is write-only and isBadConn ignores MOVED to a different
address), leaving a PubSub that looks healthy — Ping succeeds,
Receive keeps running — but never receives another message until the
process restarts.
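
The routing decision described above can be modeled in a few lines. This is a minimal sketch, not the go-redis source: `node`, `pickNode`, `slotOwner`, and `random` are hypothetical names, and routing by the first channel in the list is an assumption inferred from the description in this PR.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a cluster node; only the address matters here.
type node struct{ addr string }

// pickNode models the decision described in the PR: an empty channel list
// falls back to a random node, otherwise the owner of the first channel's
// slot is used (assumption). Both lookups are injected so the sketch is
// deterministic and needs no cluster.
func pickNode(channels []string, slotOwner func(string) node, random func() node) node {
	if len(channels) == 0 {
		return random()
	}
	return slotOwner(channels[0])
}

func main() {
	owner := func(string) node { return node{addr: "10.0.0.1:6379"} }
	random := func() node { return node{addr: "10.0.0.5:6379"} }

	// Sharded-only PubSub before the fix: the list built from regular
	// channels is empty, so the reconnect goes to the random node.
	fmt.Println(pickNode(nil, owner, random).addr)

	// With the sharded channel in the list, the slot owner is chosen.
	fmt.Println(pickNode([]string{"orders"}, owner, random).addr)
}
```

In this model the first call returns the random node even though a sharded subscription exists, which is the reconnect path the PR describes.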

Fix

Append c.schannels to the list, after c.channels. The channels
argument is only used for slot resolution in the cluster client; the
single-node, ring, and sentinel clients ignore it for routing, so the
change is scoped to cluster behavior. Ordering schannels after channels
keeps behavior identical for PubSubs that have any regular
subscriptions.
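
A minimal sketch of the list construction after the fix, assuming subscriptions are tracked as string-keyed sets (the `map[string]struct{}` type and the `routingChannels` helper are assumptions for illustration, not the patched code):

```go
package main

import "fmt"

// routingChannels builds the list handed to the connection factory for slot
// resolution: regular channels first, then sharded channels. Map iteration
// order is unspecified in Go, so only the grouping (regular before sharded)
// is guaranteed, not the order within each group.
func routingChannels(channels, schannels map[string]struct{}) []string {
	out := make([]string, 0, len(channels)+len(schannels))
	for ch := range channels {
		out = append(out, ch)
	}
	for ch := range schannels {
		out = append(out, ch)
	}
	return out
}

func main() {
	sharded := map[string]struct{}{"orders": {}}

	// Sharded-only PubSub: the list is no longer empty on reconnect.
	fmt.Println(routingChannels(nil, sharded))

	// Mixed PubSub: the regular channel still comes first.
	regular := map[string]struct{}{"news": {}}
	fmt.Println(routingChannels(regular, sharded))
}
```

Because regular channels always precede sharded ones, a consumer that routes on the first element sees no change whenever at least one regular subscription exists.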

Confidence

  • Added a regression test in osscluster_test.go that SSubscribes,
    kills the PubSub conn on every cluster node (CLIENT KILL TYPE pubsub),
    triggers reconnect, and asserts message delivery — repeated 8 times so a
    lucky random-node hit can't mask the bug. Without the fix the chance of
    a false pass on the 3+3 node test cluster is (1/6)^8 ≈ 6e-7.
  • Reproduced the bug in production on a 25-shard cluster behind a managed
    Redis Cluster; the PubSub silently dropped to ~10% of expected message
    delivery after a transient connection blip and stayed degraded until a
    process restart. Confirmed the fix restores delivery after CLIENT KILL.
  • go build, go vet, gofmt clean on the patched files.
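
The false-pass estimate in the first bullet can be checked with a short calculation. This sketch assumes each reconnect picks one of the six nodes uniformly and independently, and that exactly one pick per round lets the test pass; `falsePassProbability` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"math"
)

// falsePassProbability returns the chance that an unfixed client passes all
// rounds by luck, assuming one "correct" node out of `nodes` is chosen
// uniformly and independently in each round.
func falsePassProbability(nodes, rounds int) float64 {
	return math.Pow(1/float64(nodes), float64(rounds))
}

func main() {
	// 3 masters + 3 replicas = 6 nodes, 8 kill/reconnect rounds.
	fmt.Printf("%.2e\n", falsePassProbability(6, 8))
}
```

Since 6^8 = 1,679,616, the result is roughly 5.95e-7, consistent with the ≈ 6e-7 figure quoted above.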

Note

Medium Risk
Touches cluster PubSub reconnect/routing behavior; a bug here can silently drop messages after reconnects, though the change is small and covered by a targeted regression test.

Overview
Fixes a Redis Cluster sharded PubSub reconnect bug where PubSub.conn() could reconnect to a random node when only SSUBSCRIBE channels were in use, causing resubscribe to land on the wrong shard and silently stop message delivery.

PubSub.conn() now includes sharded channels (c.schannels) when building the channel list passed to newConn so slot-owner resolution works on reconnect. Adds a regression test that repeatedly kills pubsub connections across all cluster nodes and verifies SPUBLISH continues to reach subscribers after each reconnect.

Reviewed by Cursor Bugbot for commit 7271dec.

…routing

PubSub.conn() builds the channel list passed to newConn from c.channels
only. For a PubSub with only sharded subscriptions (the normal
ClusterClient.SSubscribe case), that list is empty on reconnect, so
ClusterClient.pubSub falls back to nodes.Random(). The SSUBSCRIBE that
resubscribe() then writes to the wrong node fails silently — _subscribe
is write-only and isBadConn ignores MOVED to a different address — so the
PubSub looks healthy but never receives another message.

Append c.schannels to the list, after c.channels. The argument is only
used for slot resolution in the cluster client; single-node, ring, and
sentinel clients ignore it for routing.

Adds a regression test that SSubscribes, kills the PubSub conn on every
node (CLIENT KILL TYPE pubsub), triggers reconnect, and asserts delivery,
repeated 8 times so a lucky random-node hit can't mask the bug.

igorkofman marked this pull request as ready for review May 7, 2026 13:29

ndyakov (Member) commented May 8, 2026

Hello @igorkofman, there are a couple of initiatives around Pub/Sub at the moment. I think the biggest one right now is #3717; maybe check it and align the changes with it if needed?

Development

Successfully merging this pull request may close these issues.

ClusterClient.SSubscribe silently re-subscribes to a random node after reconnect — PubSub.conn() ignores c.schannels
