Description
The test cluster has about 1,500 nodes. I created 1,000 Pods and then started the agent-scheduler. After it had scheduled some of the Pods, the agent-scheduler crashed with a "concurrent map read and map write" fatal error.
fatal error: concurrent map read and map write
goroutine 57493 [running]:
internal/runtime/maps.fatal({0x103cb860f?, 0x10227b2cc?})
/Users/chenjiabin/.g/versions/1.25.3/src/runtime/panic.go:1046 +0x20
k8s.io/apimachinery/pkg/util/sets.Set[...].Has(...)
/Users/chenjiabin/Mi/dev/volcano/vendor/k8s.io/apimachinery/pkg/util/sets/set.go:78
volcano.sh/volcano/pkg/scheduler/plugins/predicates.handleSkipPredicatePlugin(...)
/Users/chenjiabin/Mi/dev/volcano/pkg/scheduler/plugins/predicates/predicates.go:782
volcano.sh/volcano/pkg/scheduler/plugins/predicates.(*PredicatesPlugin).Predicate(0x1402aab2480, 0x141399c3260, 0x141865ecd00, 0x140ed3f8fa0)
/Users/chenjiabin/Mi/dev/volcano/pkg/scheduler/plugins/predicates/predicates.go:642 +0x888
volcano.sh/volcano/pkg/agentscheduler/plugins/predicates.(*predicatesPlugin).OnPluginInit.func2(0x106ef93e8?, 0x1400f5788a0?)
/Users/chenjiabin/Mi/dev/volcano/pkg/agentscheduler/plugins/predicates/predicates.go:73 +0x94
volcano.sh/volcano/pkg/agentscheduler/framework.(*Framework).PredicateFn(0x1401af49380, 0x141399c3260, 0x141865ecd00)
/Users/chenjiabin/Mi/dev/volcano/pkg/agentscheduler/framework/plugins.go:94 +0xf8
volcano.sh/volcano/pkg/agentscheduler/actions/allocate.(*Action).predicateForAllocateAction(0x140fcd19c20?, 0x141399c3260, 0x141865ecd00)
/Users/chenjiabin/Mi/dev/volcano/pkg/agentscheduler/actions/allocate/allocate.go:274 +0x2c
volcano.sh/volcano/pkg/agentscheduler/actions/allocate.(*Action).predicate(0x140003ee4c8, 0x141399c3260, 0x141865ecd00)
/Users/chenjiabin/Mi/dev/volcano/pkg/agentscheduler/actions/allocate/allocate.go:270 +0x1b0
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0x140daa44cf0?)
/Users/chenjiabin/Mi/dev/volcano/pkg/scheduler/util/predicate_helper.go:110 +0x584
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
/Users/chenjiabin/Mi/dev/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:90 +0x108
created by k8s.io/client-go/util/workqueue.ParallelizeUntil in goroutine 294
/Users/chenjiabin/Mi/dev/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:76 +0x1b4
Steps to reproduce the issue
- We used OpenKruise (https://openkruise.io/kruiseagents/introduction) to generate the workload, creating 1,000 Pods with the following SandboxSet:
apiVersion: agents.kruise.io/v1alpha1
kind: SandboxSet
metadata:
  name: bench
  namespace: default
spec:
  replicas: 1000
  persistentContents:
  - ip
  template:
    metadata:
      annotations:
        foo: bar
    # Final Pod Spec
    spec:
      schedulerName: agent-scheduler
      containers:
      - name: nginx
        image: nginx:alpine
      volumes:
      - name: agent-runtime-volume
        emptyDir: {}
- Start the agent-scheduler with the following configuration:
--kubeconfig=/Users/chenjiabin/.kube/config
--logtostderr
--scheduler-conf=agent-scheduler.conf
-v=1
--scheduler-name=agent-scheduler
--enable-healthz=true
--enable-metrics=true
--leader-elect=false
--kube-api-qps=2000
--kube-api-burst=2000
--node-worker-threads=20
--scheduler-worker-count=4
- The agent-scheduler then panics after scheduling some of the Pods.
Describe the results you received and expected
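Received: the agent-scheduler crashes with "fatal error: concurrent map read and map write" after scheduling some of the Pods.
Expected: the agent-scheduler schedules all Pods without crashing.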
What version of Volcano are you using?
master
Any other relevant information
- When I set --scheduler-worker-count to 1, I ran the test several times and did not encounter this issue.
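
The stack trace ends in sets.Set.Has called from handleSkipPredicatePlugin, and the crash does not reproduce with a single scheduler worker. That suggests, though I have not confirmed it in the source, that a set is read by one worker while another worker writes to it. Below is a minimal sketch of one possible mitigation: guarding a shared set with a sync.RWMutex. The names safeSet, newSafeSet, and runWorkers are hypothetical and are not part of Volcano or apimachinery.

```go
package main

import (
	"fmt"
	"sync"
)

// safeSet is a hypothetical string set that is safe for concurrent use.
// It stands in for a set shared between scheduler workers.
type safeSet struct {
	mu    sync.RWMutex
	items map[string]struct{}
}

func newSafeSet() *safeSet {
	return &safeSet{items: make(map[string]struct{})}
}

// Insert takes the write lock, so it never overlaps with a reader.
func (s *safeSet) Insert(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[key] = struct{}{}
}

// Has takes the read lock; many readers may hold it at the same time.
func (s *safeSet) Has(key string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.items[key]
	return ok
}

// Len returns the number of stored keys under the read lock.
func (s *safeSet) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.items)
}

// runWorkers starts several goroutines that write to and read from the
// same set, loosely mimicking multiple scheduler workers.
func runWorkers(s *safeSet, workers, keysPerWorker int) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < keysPerWorker; i++ {
				key := fmt.Sprintf("plugin-%d-%d", id, i)
				s.Insert(key)
				_ = s.Has(key)
			}
		}(w)
	}
	wg.Wait()
}

func main() {
	s := newSafeSet()
	runWorkers(s, 4, 1000)
	fmt.Println(s.Len()) // prints 4000: each worker inserts 1000 distinct keys
}
```

With a plain map in place of safeSet, the same workload can abort with the fatal error reported above. Giving each worker its own copy of the set would be an alternative that avoids locking, if the data does not need to be shared.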