Gang Scheduling in Kubernetes

Bibin Wilson

First, let's understand the concept of gang scheduling.

As per Wikipedia:

In computer science, gang scheduling is a scheduling algorithm for parallel systems that schedules related threads or processes to run simultaneously on different processors. 

Gang scheduling is a scheduling technique used in distributed computing systems, particularly in high-performance computing environments. It involves scheduling a group (or "gang") of related tasks or processes to run simultaneously on multiple processors or nodes.

The core idea is that all tasks within the gang must start and execute together.

Real World Example

Now let's look at a real-world example where gang scheduling is used.

One classic example is in deep learning workloads. Deep learning frameworks (TensorFlow, PyTorch, etc.) require all the workers to be running during the training process.

In this scenario, when you deploy training workloads, all the components should be scheduled and deployed together to ensure the training works as expected.

Otherwise, it might lead to resource segmentation and deadlocks. Let's understand what these terms mean.

Resource Segmentation

Resource segmentation occurs when resources are fragmented across multiple tasks or jobs that cannot proceed because they lack the complete set of resources they need.

For example, due to limited resources, Kubernetes schedules only some of the Pods:

  1. 2 Worker Pods are running.
  2. 1 Parameter Server Pod is running.
  3. Remaining Pods are pending.

The running Pods are consuming CPU and memory resources. However, they cannot proceed with training because they need to communicate with all other Pods.

Deadlocks

Deadlocks happen when two or more tasks are waiting indefinitely for resources held by each other, creating a cycle of dependency that halts progress.

For example,

  1. Running Worker Pods are waiting for all Parameter Server Pods to be active to send their computed gradients.
  2. The Running Parameter Server Pod is waiting for data from all Worker Pods to update the model parameters.

Workers can't proceed without all Parameter Servers. Parameter Servers can't proceed without all Workers.

Since the remaining Pods are pending due to insufficient resources, this cycle causes both sides to wait indefinitely.

Gang Scheduling

Gang scheduling avoids both resource segmentation and deadlocks.

Gang scheduling is like making sure that all tasks in a group that depend on each other start running together at the same time, or they don't start at all—it's an all-or-nothing approach.

By scheduling all these tasks together, we keep all the resources unified, ensuring that each task has everything it needs to work properly.

This method prevents situations where some tasks start while others are still waiting. Such scenarios can lead to problems like deadlocks, where tasks are stuck waiting for each other indefinitely, or inefficient use of resources because some tasks are running without being able to proceed effectively.

Gang Scheduling in Kubernetes

In Kubernetes, the default scheduler assigns Pods one by one based on available resources and certain scheduling rules. This means that sometimes, in a distributed application, some Pods start running while others are still waiting because there aren't enough resources.

This partial scheduling can be a problem for applications that need all their components to be running at the same time to work properly.

To address this, gang scheduling is implemented in Kubernetes through custom schedulers. These schedulers extend Kubernetes' native scheduling capabilities by allowing groups of Pods to be scheduled as a single unit.

Volcano scheduler

One widely used custom scheduler that supports gang scheduling is the Volcano scheduler.

It is an open-source Kubernetes scheduler designed for high-performance workloads. Volcano has advanced scheduling features like gang scheduling, queuing, and job priorities, ensuring that interdependent Pods are scheduled together.

Here is an example Volcano PodGroup custom resource, which is used to implement gang scheduling.

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: test
  namespace: default
  ownerReferences:        # set automatically when the PodGroup is created by a Volcano Job
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: test
spec:
  minMember: 10
  minResources:
    cpu: "3"
    memory: "2048Mi"
  priorityClassName: high-priority
  queue: default

minMember: 10 requires that at least 10 Pods of this group can be scheduled together; if fewer than 10 fit, none of them are scheduled.
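In practice, you rarely create the PodGroup by hand: Volcano creates one automatically for each Volcano Job, deriving minMember from the Job's minAvailable field. Here is a minimal sketch of such a Job; the task names, replica counts, and nginx image are illustrative placeholders, not part of any real training setup.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test
spec:
  schedulerName: volcano   # hand the Job's Pods to the Volcano scheduler
  minAvailable: 10         # gang size; becomes minMember on the generated PodGroup
  queue: default
  tasks:
  - name: ps               # illustrative parameter-server task
    replicas: 2
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: ps
          image: nginx     # placeholder image for this sketch
  - name: worker           # illustrative worker task
    replicas: 8
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: worker
          image: nginx
```

With this Job, either all 10 Pods (2 parameter servers and 8 workers) are scheduled together, or they all wait, which is exactly the all-or-nothing behavior described above.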

Coscheduling plugin

The Kubernetes SIGs maintain a list of scheduling plugins, among which the Coscheduling plugin supports all-or-nothing gang scheduling.

This plugin needs to be enabled in the cluster's scheduler configuration to utilize the PodGroup CRD that supports all-or-nothing scheduling.
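As a rough sketch, enabling the plugin looks like the following KubeSchedulerConfiguration. The profile name here follows the scheduler-plugins project's examples and is an assumption; adjust it (and the API version) to match your deployment.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
- schedulerName: scheduler-plugins-scheduler   # Pods opt in via spec.schedulerName
  plugins:
    multiPoint:
      enabled:
      - name: Coscheduling   # registers the plugin at all of its extension points
```

Pods that should be gang-scheduled then reference this scheduler by name and carry the PodGroup label, as shown in the manifest below.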

Here is an example manifest that shows a PodGroup associated with a ReplicaSet.

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: nginx
spec:
  scheduleTimeoutSeconds: 10
  minMember: 3
---
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
        pod-group.scheduling.sigs.k8s.io: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 3000m
            memory: 500Mi
          requests:
            cpu: 3000m
            memory: 500Mi

In the above manifest, minMember is set to 3; if the cluster can't schedule at least 3 of the 6 Pods due to resource constraints, all of them stay in the Pending state.

Further Learning

Following are some resources that will help you learn more about Volcano and custom scheduling for AI and big data workloads.

1. Cloud native batch scheduling with Volcano

2. Cloud Native Batch System for AI, BigData and HPC