Gang Scheduling in Kubernetes
First, let's understand the concept of gang scheduling.
As per Wikipedia:
> In computer science, gang scheduling is a scheduling algorithm for parallel systems that schedules related threads or processes to run simultaneously on different processors.
Gang scheduling is a scheduling technique used in distributed computing systems, particularly in high-performance computing environments. It involves scheduling a group (or "gang") of related tasks or processes to run simultaneously on multiple processors or nodes.
The core idea is that all tasks within the gang must start and execute together.
Real World Example
Now let's look at a real-world example where gang scheduling is used.
One classic example is deep learning workloads. Deep learning frameworks (TensorFlow, PyTorch, etc.) require all the workers to be running throughout the training process.
In this scenario, when you deploy training workloads, all the components should be scheduled and deployed to ensure the training works as expected.
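As an illustration, a distributed TensorFlow training job deployed with the Kubeflow training operator declares both Parameter Server and Worker replicas, and every one of them must come up for training to make progress. This is a minimal sketch assuming the training operator is installed; the image name is a placeholder:

```yaml
# Sketch of a distributed training job using the Kubeflow TFJob CRD.
# All PS and Worker Pods must be running for training to progress.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-training
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: my-registry/dist-mnist:latest  # placeholder image
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: tensorflow
            image: my-registry/dist-mnist:latest  # placeholder image
```

With the default scheduler, nothing guarantees that all six of these Pods are placed together, which is exactly the gap gang scheduling fills.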
Otherwise, it might lead to resource segmentation and deadlocks. Let's understand what these mean.
Resource Segmentation
Resource segmentation occurs when resources are fragmented across multiple tasks or jobs that cannot proceed because they lack the complete set of resources they need.
For example, due to limited resources, Kubernetes schedules only some of the Pods:
- 2 Worker Pods are running.
- 1 Parameter Server Pod is running.
- Remaining Pods are pending.
The running Pods are consuming CPU and memory resources. However, they cannot proceed with training because they need to communicate with all other Pods.
Deadlocks
Deadlocks happen when two or more tasks are waiting indefinitely for resources held by each other, creating a cycle of dependency that halts progress.
For example,
- The running Worker Pods are waiting for all Parameter Server Pods to be active before sending their computed gradients.
- The running Parameter Server Pod is waiting for gradients from all Worker Pods to update the model parameters.
Workers can't proceed without all Parameter Servers. Parameter Servers can't proceed without all Workers.
Since the remaining Pods are pending due to insufficient resources, this cycle causes both sides to wait indefinitely.
Gang scheduling
Gang scheduling avoids resource segmentation and deadlocks.
Gang scheduling is like making sure that all tasks in a group that depend on each other start running together at the same time, or they don't start at all—it's an all-or-nothing approach.
By scheduling all these tasks together, we keep all the resources unified, ensuring that each task has everything it needs to work properly.
This method prevents situations where some tasks start while others are still waiting. Such scenarios can lead to problems like deadlocks, where tasks are stuck waiting for each other indefinitely, or inefficient use of resources because some tasks are running without being able to proceed effectively.
Gang Scheduling in Kubernetes
In Kubernetes, the default scheduler assigns Pods one by one based on available resources and certain scheduling rules. This means that sometimes, in a distributed application, some Pods start running while others are still waiting because there aren't enough resources.
This partial scheduling can be a problem for applications that need all their components to be running at the same time to work properly.
To address this, gang scheduling is implemented in Kubernetes through custom schedulers. These schedulers extend Kubernetes' native scheduling capabilities by allowing groups of Pods to be scheduled as a single unit.
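A Pod opts into a custom scheduler through the `schedulerName` field in its spec. As a minimal sketch (the scheduler name here is hypothetical and must match a scheduler actually running in the cluster):

```yaml
# A Pod that asks to be placed by a custom scheduler
# instead of the default kube-scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: gang-member
spec:
  schedulerName: my-gang-scheduler   # hypothetical custom scheduler
  containers:
  - name: app
    image: nginx
```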
Volcano scheduler
One custom scheduler that supports gang scheduling is the Volcano scheduler.
It is an open-source Kubernetes scheduler designed for high-performance workloads. Volcano has advanced scheduling features like gang scheduling, queuing, and job priorities, ensuring that interdependent Pods are scheduled together.
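For instance, a Volcano Job can declare `minAvailable` so that the gang is only scheduled when all of its Pods fit. The following is a sketch based on Volcano's `batch.volcano.sh/v1alpha1` API; the image names are placeholders:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: training-job
spec:
  schedulerName: volcano
  minAvailable: 6          # gang size: schedule only if all 6 Pods fit
  tasks:
  - name: ps
    replicas: 2
    template:
      spec:
        containers:
        - name: ps
          image: my-registry/training:latest   # placeholder image
  - name: worker
    replicas: 4
    template:
      spec:
        containers:
        - name: worker
          image: my-registry/training:latest   # placeholder image
```

Volcano creates a PodGroup for such a Job behind the scenes, which is what enforces the all-or-nothing behavior.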
Here is an example of the Volcano PodGroup CRD, which is used to implement gang scheduling.
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  generation: 5
  name: test
  namespace: default
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: test
spec:
  minMember: 10
  minResources:
    cpu: "3"
    memory: "2048Mi"
  priorityClassName: high-priority
  queue: default
```

`minMember: 10` requires that all 10 Pods be scheduled together.
Coscheduling plugin
The Kubernetes SIGs maintain a list of scheduling plugins, among which the Coscheduling plugin supports all-or-nothing gang scheduling.
This plugin needs to be enabled in the cluster's scheduler configuration to utilize the PodGroup CRD that supports all-or-nothing scheduling.
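Enabling the plugin might look like the following scheduler configuration. This is a sketch based on the scheduler-plugins project; the profile's scheduler name is an example:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: scheduler-plugins-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: Coscheduling
```

Pods then select this profile via `schedulerName` and are grouped through the `pod-group.scheduling.sigs.k8s.io` label, as shown below.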
Here is an example manifest that shows a PodGroup associated with a ReplicaSet.
```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: nginx
spec:
  scheduleTimeoutSeconds: 10
  minMember: 3
---
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
        pod-group.scheduling.sigs.k8s.io: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 3000m
            memory: 500Mi
          requests:
            cpu: 3000m
            memory: 500Mi
```
In the above manifest, minMember is set to 3; if the cluster can't schedule at least 3 Pods due to resource constraints, all the Pods remain in the Pending state.
Further Learning
The following resources will help you learn more about Volcano and custom scheduling for AI and big data workloads.