Agent-Scheduler: concurrent map read and map write in handleSkipPredicatePlugin function #5146
Open
Labels
kind/bug (Categorizes issue or PR as related to a bug.), priority/high
Description
The test cluster has about 1500 nodes. I created 1,000 Pods and then started the agent-scheduler. After scheduling some of the Pods, a "concurrent map read and map write" error occurred, causing the agent-scheduler to crash.
fatal error: concurrent map read and map write
goroutine 57493 [running]:
internal/runtime/maps.fatal({0x103cb860f?, 0x10227b2cc?})
/Users/chenjiabin/.g/versions/1.25.3/src/runtime/panic.go:1046 +0x20
k8s.io/apimachinery/pkg/util/sets.Set[...].Has(...)
/Users/chenjiabin/Mi/dev/volcano/vendor/k8s.io/apimachinery/pkg/util/sets/set.go:78
volcano.sh/volcano/pkg/scheduler/plugins/predicates.handleSkipPredicatePlugin(...)
/Users/chenjiabin/Mi/dev/volcano/pkg/scheduler/plugins/predicates/predicates.go:782
volcano.sh/volcano/pkg/scheduler/plugins/predicates.(*PredicatesPlugin).Predicate(0x1402aab2480, 0x141399c3260, 0x141865ecd00, 0x140ed3f8fa0)
/Users/chenjiabin/Mi/dev/volcano/pkg/scheduler/plugins/predicates/predicates.go:642 +0x888
volcano.sh/volcano/pkg/agentscheduler/plugins/predicates.(*predicatesPlugin).OnPluginInit.func2(0x106ef93e8?, 0x1400f5788a0?)
/Users/chenjiabin/Mi/dev/volcano/pkg/agentscheduler/plugins/predicates/predicates.go:73 +0x94
volcano.sh/volcano/pkg/agentscheduler/framework.(*Framework).PredicateFn(0x1401af49380, 0x141399c3260, 0x141865ecd00)
/Users/chenjiabin/Mi/dev/volcano/pkg/agentscheduler/framework/plugins.go:94 +0xf8
volcano.sh/volcano/pkg/agentscheduler/actions/allocate.(*Action).predicateForAllocateAction(0x140fcd19c20?, 0x141399c3260, 0x141865ecd00)
/Users/chenjiabin/Mi/dev/volcano/pkg/agentscheduler/actions/allocate/allocate.go:274 +0x2c
volcano.sh/volcano/pkg/agentscheduler/actions/allocate.(*Action).predicate(0x140003ee4c8, 0x141399c3260, 0x141865ecd00)
/Users/chenjiabin/Mi/dev/volcano/pkg/agentscheduler/actions/allocate/allocate.go:270 +0x1b0
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0x140daa44cf0?)
/Users/chenjiabin/Mi/dev/volcano/pkg/scheduler/util/predicate_helper.go:110 +0x584
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
/Users/chenjiabin/Mi/dev/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:90 +0x108
created by k8s.io/client-go/util/workqueue.ParallelizeUntil in goroutine 294
/Users/chenjiabin/Mi/dev/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:76 +0x1b4
Steps to reproduce the issue
- We used OpenKruise (https://openkruise.io/kruiseagents/introduction) to generate the workload, creating 1,000 Pods:
apiVersion: agents.kruise.io/v1alpha1
kind: SandboxSet
metadata:
  name: bench
  namespace: default
spec:
  replicas: 1000
  persistentContents:
  - ip
  template:
    metadata:
      annotations:
        foo: bar
    # Final Pod Spec
    spec:
      schedulerName: agent-scheduler
      containers:
      - name: nginx
        image: nginx:alpine
      volumes:
      - name: agent-runtime-volume
        emptyDir: {}
- Start the agent-scheduler with the following configuration:
--kubeconfig=/Users/chenjiabin/.kube/config
--logtostderr
--scheduler-conf=agent-scheduler.conf
-v=1
--scheduler-name=agent-scheduler
--enable-healthz=true
--enable-metrics=true
--leader-elect=false
--kube-api-qps=2000
--kube-api-burst=2000
--node-worker-threads=20
--scheduler-worker-count=4
- The agent-scheduler then panics after scheduling some of the Pods.
Describe the results you received and expected
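Received: the agent-scheduler crashes with the fatal error shown above. Expected: all Pods are scheduled without a panic.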
What version of Volcano are you using?
master
Any other relevant information
- When I set scheduler-worker-count to 1 and ran the test several times, I did not encounter this issue.
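The trace shows `sets.Set.Has` being called from `ParallelizeUntil` worker goroutines, and the crash only appears with more than one scheduler worker, which suggests a set shared across goroutines without synchronization. The trace does not show which goroutine performs the write, so the following is only a minimal sketch of one common remedy, not the actual Volcano code: wrapping the shared set in a `sync.RWMutex`. The names `guardedSet`, `newGuardedSet`, and the `plugin-N` keys are hypothetical and used purely for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// guardedSet is a hypothetical string set that is safe for concurrent use.
// It stands in for a plain map-backed set shared between worker goroutines.
type guardedSet struct {
	mu    sync.RWMutex
	items map[string]struct{}
}

func newGuardedSet() *guardedSet {
	return &guardedSet{items: make(map[string]struct{})}
}

// Has takes the read lock, so many readers can proceed in parallel.
func (s *guardedSet) Has(key string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.items[key]
	return ok
}

// Insert takes the write lock, excluding all readers and other writers.
func (s *guardedSet) Insert(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[key] = struct{}{}
}

// Len reports the number of entries under the read lock.
func (s *guardedSet) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.items)
}

func main() {
	skipped := newGuardedSet()

	// Four goroutines mimic several scheduler workers that read and
	// write the same set at the same time.
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				key := fmt.Sprintf("plugin-%d", i%16)
				if !skipped.Has(key) {
					skipped.Insert(key)
				}
			}
		}()
	}
	wg.Wait()

	fmt.Println("entries:", skipped.Len()) // prints "entries: 16"
}
```

With an unguarded `map[string]struct{}` in place of `guardedSet`, the same loop can abort with the "concurrent map read and map write" fatal error. Running the scheduler tests with Go's `-race` flag is one way to locate the unsynchronized writer.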