Agent-Scheduler: concurrent map read and map write in handleSkipPredicatePlugin function #5146

@JBinin

Description

The test cluster has about 1,500 nodes. I created 1,000 Pods and then started the agent-scheduler. After it had scheduled some of the Pods, a "concurrent map read and map write" fatal error occurred and the agent-scheduler crashed.

fatal error: concurrent map read and map write

goroutine 57493 [running]:
internal/runtime/maps.fatal({0x103cb860f?, 0x10227b2cc?})
        /Users/chenjiabin/.g/versions/1.25.3/src/runtime/panic.go:1046 +0x20
k8s.io/apimachinery/pkg/util/sets.Set[...].Has(...)
        /Users/chenjiabin/Mi/dev/volcano/vendor/k8s.io/apimachinery/pkg/util/sets/set.go:78
volcano.sh/volcano/pkg/scheduler/plugins/predicates.handleSkipPredicatePlugin(...)
        /Users/chenjiabin/Mi/dev/volcano/pkg/scheduler/plugins/predicates/predicates.go:782
volcano.sh/volcano/pkg/scheduler/plugins/predicates.(*PredicatesPlugin).Predicate(0x1402aab2480, 0x141399c3260, 0x141865ecd00, 0x140ed3f8fa0)
        /Users/chenjiabin/Mi/dev/volcano/pkg/scheduler/plugins/predicates/predicates.go:642 +0x888
volcano.sh/volcano/pkg/agentscheduler/plugins/predicates.(*predicatesPlugin).OnPluginInit.func2(0x106ef93e8?, 0x1400f5788a0?)
        /Users/chenjiabin/Mi/dev/volcano/pkg/agentscheduler/plugins/predicates/predicates.go:73 +0x94
volcano.sh/volcano/pkg/agentscheduler/framework.(*Framework).PredicateFn(0x1401af49380, 0x141399c3260, 0x141865ecd00)
        /Users/chenjiabin/Mi/dev/volcano/pkg/agentscheduler/framework/plugins.go:94 +0xf8
volcano.sh/volcano/pkg/agentscheduler/actions/allocate.(*Action).predicateForAllocateAction(0x140fcd19c20?, 0x141399c3260, 0x141865ecd00)
        /Users/chenjiabin/Mi/dev/volcano/pkg/agentscheduler/actions/allocate/allocate.go:274 +0x2c
volcano.sh/volcano/pkg/agentscheduler/actions/allocate.(*Action).predicate(0x140003ee4c8, 0x141399c3260, 0x141865ecd00)
        /Users/chenjiabin/Mi/dev/volcano/pkg/agentscheduler/actions/allocate/allocate.go:270 +0x1b0
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0x140daa44cf0?)
        /Users/chenjiabin/Mi/dev/volcano/pkg/scheduler/util/predicate_helper.go:110 +0x584
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
        /Users/chenjiabin/Mi/dev/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:90 +0x108
created by k8s.io/client-go/util/workqueue.ParallelizeUntil in goroutine 294
        /Users/chenjiabin/Mi/dev/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:76 +0x1b4
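
The trace shows only the reading side: `sets.Set[...].Has`, called from `handleSkipPredicatePlugin` inside a goroutine started by `workqueue.ParallelizeUntil`. `sets.Set` is a plain Go map underneath, so any unsynchronized write to the same set from another goroutine makes the runtime abort like this. The writer is not visible in the trace. The following is a minimal sketch of one common way to guard such a set with a `sync.RWMutex`; `SafeSet`, `hammer`, and the key names are hypothetical and are not taken from the Volcano code base.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeSet is a hypothetical string set whose map is guarded by an RWMutex.
// It is an illustration only, not Volcano's or apimachinery's type.
type SafeSet struct {
	mu    sync.RWMutex
	items map[string]struct{}
}

func NewSafeSet() *SafeSet {
	return &SafeSet{items: make(map[string]struct{})}
}

// Insert takes the write lock, so it never overlaps with a reader.
func (s *SafeSet) Insert(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[key] = struct{}{}
}

// Has takes the read lock, so many readers can run in parallel.
func (s *SafeSet) Has(key string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.items[key]
	return ok
}

// Len returns the number of stored keys under the read lock.
func (s *SafeSet) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.items)
}

// hammer starts `workers` goroutines that each insert `perWorker` distinct
// keys and read them back, then returns the final size of the set.
func hammer(workers, perWorker int) int {
	s := NewSafeSet()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				key := fmt.Sprintf("plugin-%d-%d", w, i)
				s.Insert(key)
				_ = s.Has(key)
			}
		}(w)
	}
	wg.Wait()
	return s.Len()
}

func main() {
	fmt.Println(hammer(4, 1000)) // prints 4000
}
```

Replacing `SafeSet` with a bare `map[string]struct{}` in this sketch would make it fail in the same way as the trace above. Building the real binary with Go's `-race` flag may help locate the writing goroutine, since the race detector reports both sides of a conflicting access.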

Steps to reproduce the issue

  1. We used OpenKruise (https://openkruise.io/kruiseagents/introduction) to generate the workload, creating 1,000 Pods:
apiVersion: agents.kruise.io/v1alpha1
kind: SandboxSet
metadata:
  name: bench
  namespace: default
spec:
  replicas: 1000
  persistentContents:
    - ip
  template:
    metadata:
      annotations:
        foo: bar
    # Final Pod Spec
    spec:
      schedulerName: agent-scheduler
      containers:
        - name: nginx
          image: nginx:alpine
      volumes:
      - name: agent-runtime-volume
        emptyDir: {}
  2. Start the agent-scheduler with the following configuration:
--kubeconfig=/Users/chenjiabin/.kube/config
--logtostderr
--scheduler-conf=agent-scheduler.conf
-v=1
--scheduler-name=agent-scheduler
--enable-healthz=true
--enable-metrics=true
--leader-elect=false
--kube-api-qps=2000
--kube-api-burst=2000
--node-worker-threads=20
--scheduler-worker-count=4
  3. The agent-scheduler then panics after scheduling some of the Pods.

Describe the results you received and expected

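Received: the agent-scheduler crashes with "fatal error: concurrent map read and map write".

Expected: the agent-scheduler schedules all Pods without panicking.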

What version of Volcano are you using?

master

Any other relevant information

  • When I set scheduler-worker-count to 1, I ran the test several times and did not encounter this issue.
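
That a single worker avoids the crash is consistent with the scheduler workers sharing one mutable set, although the issue does not confirm this. As an alternative to locking, each worker can be given its own copy of the set, so that no map is ever touched by two goroutines. The sketch below assumes that design; `runWorkers` and the key names are placeholders chosen for illustration, not names from the Volcano code base.

```go
package main

import (
	"fmt"
	"maps"
	"sync"
)

// runWorkers hands every worker a private clone of `shared`, lets the worker
// mutate and read its clone, and returns the size each worker ended up with.
// The shared map itself is only read, and only on the calling goroutine.
func runWorkers(shared map[string]struct{}, workers int) []int {
	sizes := make([]int, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		// Clone before the goroutine starts, so the copy is taken serially.
		local := maps.Clone(shared)
		wg.Add(1)
		go func(w int, local map[string]struct{}) {
			defer wg.Done()
			// Writes go to the private copy only.
			local[fmt.Sprintf("skip-%d", w)] = struct{}{}
			if _, ok := local["NodeAffinity"]; ok {
				// Each goroutine writes a distinct slice element.
				sizes[w] = len(local)
			}
		}(w, local)
	}
	wg.Wait()
	return sizes
}

func main() {
	// Placeholder keys for illustration.
	shared := map[string]struct{}{"NodeAffinity": {}, "TaintToleration": {}}
	fmt.Println(runWorkers(shared, 4), len(shared)) // prints [3 3 3 3] 2
}
```

Copying trades memory and a small per-worker setup cost for lock-free reads on the hot path. Whether that trade suits the predicates plugin depends on how large the set is and how often it changes.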

Labels

kind/bug, priority/high