Skip to content
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions pkg/scheduler/cache/event_handlers.go
Original file line number Diff line number Diff line change
Expand Up @@ -621,6 +621,11 @@ func (sc *SchedulerCache) SyncNode(nodeName string) error {
}

if !sc.nodeCanAddCache(node) {
if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
klog.Errorf("Failed to remove node <%s> from cache after label change: %s", nodeName, deleteErr.Error())
return deleteErr
}
Comment on lines +624 to +627
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RemoveNode() function returns an error if the node doesn't exist in the cache (line 541). This causes SyncNode() to fail when nodeCanAddCache() returns false for a node that was never added to cache. This breaks the existing test case "Node not added to cache" which expects wantErr: false. The RemoveNode() call should only propagate errors that aren't "node does not exist" errors, or RemoveNode() should be modified to not error when the node is not found.

Copilot uses AI. Check for mistakes.
Comment on lines +624 to +627
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description states that "RemoveNode() is safe to call even if the node is not in the cache (it will simply be a no-op)", but the implementation at line 540-541 returns an error if the node doesn't exist in cache. This is a discrepancy between the PR description and the actual code behavior.

Copilot uses AI. Check for mistakes.
klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)
Comment on lines +624 to +628
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The call to sc.RemoveNode(nodeName) returns an error if the node is not present in the cache (see lines 540-542). Since SyncNode is triggered for all node updates, and nodeCanAddCache returns false for any node that does not match the configured selector, this implementation will cause SyncNode to return an error for every non-matching node in the cluster. This results in continuous error logging and unnecessary retries in the nodeQueue. You should check if the node exists in the cache before attempting to remove it.

Suggested change
if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
klog.Errorf("Failed to remove node <%s> from cache after label change: %s", nodeName, deleteErr.Error())
return deleteErr
}
klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)
sc.Mutex.Lock()
_, ok := sc.Nodes[nodeName]
sc.Mutex.Unlock()
if ok {
if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
klog.Errorf("Failed to remove node <%s> from cache after label change: %s", nodeName, deleteErr.Error())
return deleteErr
}
klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)
}

return nil
}
nodeCopy := node.DeepCopy()
Expand Down
23 changes: 23 additions & 0 deletions pkg/scheduler/cache/event_handlers_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1573,3 +1573,26 @@ func TestSchedulerCache_SyncHyperNode(t *testing.T) {
})
}
}

func TestSchedulerCache_SyncNode_RemoveOnLabelChange(t *testing.T) {
// Simulate: node was in cache with matching label, then label is removed
n1WithLabel := util.BuildNode("n1", nil, map[string]string{"train": "true"})
n1WithoutLabel := util.BuildNode("n1", nil, map[string]string{})

sc := NewDefaultMockSchedulerCache("volcano")
sc.nodeSelectorLabels = map[string]sets.Empty{
"train:true": {},
}

// Step 1: Add node with matching label to cache
sc.nodeInformer.Informer().GetIndexer().Add(n1WithLabel)
err := sc.SyncNode("n1")
assert.NoError(t, err)
assert.Contains(t, sc.Nodes, "n1", "node should be in cache after sync with matching label")

// Step 2: Update node to remove the label, then sync again
sc.nodeInformer.Informer().GetIndexer().Update(n1WithoutLabel)
err = sc.SyncNode("n1")
assert.NoError(t, err)
assert.NotContains(t, sc.Nodes, "n1", "node should be removed from cache after label removal")
}
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test only covers the case where a node is removed from cache after label removal. It should also test the case where a node never matches the selector in the first place, to ensure no regression in existing behavior. This edge case is currently broken by the fix.

Suggested change
}
}
func TestSchedulerCache_SyncNode_NeverMatchesSelector(t *testing.T) {
// Simulate: node never has the matching label, so it should never be added to cache
n2 := util.BuildNode("n2", nil, map[string]string{})
sc := NewDefaultMockSchedulerCache("volcano")
sc.nodeSelectorLabels = map[string]sets.Empty{
"train:true": {},
}
// Add node without matching label to informer and sync
sc.nodeInformer.Informer().GetIndexer().Add(n2)
err := sc.SyncNode("n2")
assert.NoError(t, err)
assert.NotContains(t, sc.Nodes, "n2", "node without matching label should not be added to cache")
}

Copilot uses AI. Check for mistakes.
Loading