Optimizing Workloads with Kubernetes Descheduler

Bibin Wilson

The default Kubernetes scheduler is quite complex and handles many advanced scheduling decisions.

However, once it assigns a pod to a node, it doesn't re-evaluate the pod's placement if a better node becomes available later.

For example, if you add a NodeAffinity soft requirement with weights to prefer a specific type of GPU node, but no GPU nodes are available, the pod might be scheduled on a regular node. If a GPU node becomes available later, the default scheduler won’t automatically move the pod to that node.
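To make this concrete, here is a minimal sketch of such a soft node affinity preference. The label key and value (node-type: gpu), the pod name, and the image are placeholder assumptions rather than values from any particular cluster:

apiVersion: v1
kind: Pod
metadata:
  name: inference-pod                  # placeholder name
spec:
  affinity:
    nodeAffinity:
      # "Preferred" is a soft requirement: the scheduler tries to honor it,
      # but still places the pod on a non-matching node if no GPU node exists.
      # "IgnoredDuringExecution" means the placement is never re-evaluated
      # once the pod is running.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: node-type             # placeholder label key
            operator: In
            values:
            - gpu                      # placeholder label value
  containers:
  - name: app
    image: nginx                       # placeholder image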

This behavior is not necessarily a bad thing. The design is likely intended for performance, stability, and stateful application considerations. The scheduler makes its best decision based on the current state of the cluster.

This behavior can also lead to some challenges, particularly in large clusters with many workloads.

  • For instance, pods may remain on less optimal nodes even when better options become available (suboptimal pod placements)
  • If new constraints or policies are introduced, existing pods might not automatically adjust to meet these updated requirements.
  • Additionally, as new nodes or zones are added, they may remain underutilized, resulting in existing nodes or zones carrying an uneven load.

Kubernetes Descheduler

The issues mentioned earlier can be addressed by the Kubernetes Descheduler.

The Descheduler evaluates the current state of the cluster against the desired policies and constraints.

When it detects violations or suboptimal pod placements, it takes corrective action by evicting the pods. Once a pod is evicted, the default scheduler is able to make a fresh, more informed decision on where to place the pod, now considering the current state of the cluster and the availability of optimal nodes.

GitHub repo: https://github.com/kubernetes-sigs/descheduler

Real-World Example of the Descheduler

Now, let's look at a real-world example of a deployment using pod topology constraints, the default scheduling behavior, pod imbalance, and how the Descheduler can help in effectively balancing the pods.

Topology Spread Constraints

Let's say you have multiple zones (e.g., Zone A, Zone B, and Zone C), each containing a number of nodes. Topology spread constraints are configured to ensure pods are evenly distributed across these zones.

These constraints define a max skew, which is the difference in the number of pods between any two zones. For instance, with a max skew of 1, the difference between the number of pods in any two zones cannot exceed 1.

The Kubernetes scheduler places pods sequentially, balancing them across the zones. For example, the first pod goes to Zone A, the second to Zone B, and the third to Zone C.

After scheduling six pods, each zone would have two pods, maintaining an even load distribution.
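As a sketch of what such a constraint looks like in a manifest, here is a Deployment that spreads six replicas across zones with a max skew of 1. The name, labels, and image are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                                      # placeholder name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                               # pod-count difference between any two zones stays at most 1
        topologyKey: topology.kubernetes.io/zone # spread across zones
        whenUnsatisfiable: DoNotSchedule         # hard constraint, but only enforced at scheduling time
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: app
        image: nginx                             # placeholder image

Note that the constraint is only enforced when a pod is scheduled; already-running pods are never moved, which is exactly the gap the Descheduler fills.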

What Happens When a New Zone or Node is Added?

Now, let’s say a new zone (Zone D) is added, but no new pods are scheduled there yet. This means Zone D has no pods, while Zones A, B, and C each still hold two.

As a result, an imbalance occurs—some zones are fully occupied (overpopulated), while the new zone remains empty.

The Kubernetes scheduler doesn't automatically redistribute pods when new zones or nodes are introduced. It simply schedules new pods in the available zones.

How the Descheduler Helps

This is where the Descheduler comes into play.

It monitors the cluster and detects when pod distribution violates topology constraints, such as the imbalance caused by the new zone. The Descheduler evicts pods from the overpopulated zones (A, B, or C) and returns them to the Kubernetes scheduler.

Rebalancing the Pods

Once a pod is evicted, the Kubernetes scheduler reschedules it in a zone that satisfies the topology constraint.

In this case, it would likely place the evicted pod in Zone D to restore balance. This process continues until the topology spread constraints are met, ensuring an even distribution of pods across all zones.

Here is an example DeschedulerPolicy configuration for rebalancing pods that violate Topology Spread Constraints across a Kubernetes cluster:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemovePodsViolatingTopologySpreadConstraint"
      args:
        constraints:
          - DoNotSchedule
          - ScheduleAnyway
    plugins:
      balance:
        enabled:
          - "RemovePodsViolatingTopologySpreadConstraint"

Here is another example policy to remove pods that have exceeded a certain threshold of restarts, specifically targeting pods in a CrashLoopBackOff state.

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemovePodsHavingTooManyRestarts"
      args:
        podRestartThreshold: 100
        includingInitContainers: true
        states:
        - CrashLoopBackOff
    plugins:
      deschedule:
        enabled:
          - "RemovePodsHavingTooManyRestarts"

You can find more example policies in the Descheduler GitHub repo linked above.

Service Continuity During Pod Evictions

The question you might have is: "Evicting pods could cause service interruptions. Isn’t that a bad thing?"

The descheduler provides several mechanisms to mitigate service disruptions while still optimizing workloads and resource distribution.

For example,

  1. The descheduler evicts pods gradually rather than in one mass eviction, so services remain available throughout the rebalancing.
  2. By using Pod Disruption Budgets (PDBs), you can guarantee that a minimum number of pods remain available during voluntary disruptions like descheduling, preventing downtime (see the sketch after this list).
  3. The descheduler supports configurable thresholds, allowing limits on how many pods can be evicted at a time and ensuring a balance between performance optimization and stability.
  4. With multiple scheduling profiles, different rules and priorities can be applied, ensuring that only the right pods are evicted while maintaining critical services.
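For example, a Pod Disruption Budget like the sketch below (the name and labels are placeholders) ensures that the descheduler's evictions, which go through the standard eviction API, never take the workload below two running pods:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                     # placeholder name
spec:
  minAvailable: 2                   # keep at least 2 pods running during voluntary disruptions
  selector:
    matchLabels:
      app: web                      # placeholder label; must match the workload's pod labels

For the eviction limits mentioned in point 3, the DeschedulerPolicy file also exposes top-level settings such as maxNoOfPodsToEvictPerNode to cap how many pods are evicted in a single run.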

References

  1. Descheduler: Your pods in the right place, all the time (Detailed video with examples)