
MostAllocated k8s Scheduling Strategy Saves Millions of Dollars For ClickHouse

By Bibin Wilson

ClickHouse is an open-source columnar database designed for online analytical processing (OLAP).

ClickHouse Cloud, the serverless version of ClickHouse, is a fully managed cloud service that runs ClickHouse clusters on managed Kubernetes (EKS, GKE, and AKS). The team behind it faced rapidly rising infrastructure costs due to underutilized worker nodes.

Like many companies using Kubernetes, they initially used the default Kubernetes scheduling policy, LeastAllocated, which favors the nodes with the most free resources and therefore spreads pods across nodes.

This method resulted in inefficient resource utilization: many nodes had low CPU and memory usage, yet costs remained high because EC2 instances are billed for as long as they run, regardless of how much of their capacity is used.

You can read about scheduling policies here.

Inefficient Resource Usage (Key Problem)

By analyzing EKS node utilization, ClickHouse engineers discovered that the LeastAllocated scheduler was spreading pods too sparsely across nodes. As a result, the clusters ran more EC2 nodes than the workloads actually needed, which increased infrastructure costs.

Since each node had underutilized CPU and memory, this inefficient use of resources meant paying for idle capacity.

Bin-Packing with MostAllocated Scheduling (Solution)

To address this inefficiency, the team switched to a MostAllocated scheduling policy. Instead of spreading pods across nodes, this approach packs pods more tightly onto fewer nodes.
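To make the difference between the two policies concrete, here is a minimal Python sketch of how each one might score a node. It is a simplification, not the real kube-scheduler code: it looks at CPU only, while the actual NodeResourcesFit plugin combines several weighted resources. The node names and request sizes are made up for illustration.

```python
from dataclasses import dataclass

MAX_NODE_SCORE = 100  # kube-scheduler normalizes node scores to 0..100


@dataclass
class Node:
    name: str               # hypothetical node name
    cpu_capacity_m: int     # allocatable CPU in millicores
    cpu_requested_m: int    # CPU already requested by pods on this node


def least_allocated_score(node: Node, pod_cpu_m: int) -> int:
    """Higher score for nodes that keep more CPU free after placement."""
    requested = node.cpu_requested_m + pod_cpu_m
    if requested > node.cpu_capacity_m:
        return 0
    free = node.cpu_capacity_m - requested
    return free * MAX_NODE_SCORE // node.cpu_capacity_m


def most_allocated_score(node: Node, pod_cpu_m: int) -> int:
    """Higher score for nodes that end up fuller after placement."""
    requested = node.cpu_requested_m + pod_cpu_m
    if requested > node.cpu_capacity_m:
        return 0
    return requested * MAX_NODE_SCORE // node.cpu_capacity_m


nodes = [
    Node("node-a", cpu_capacity_m=4000, cpu_requested_m=3000),  # mostly full
    Node("node-b", cpu_capacity_m=4000, cpu_requested_m=500),   # mostly empty
]
pod_cpu_m = 500

# The scheduler picks the node with the highest score.
spread_choice = max(nodes, key=lambda n: least_allocated_score(n, pod_cpu_m))
packed_choice = max(nodes, key=lambda n: most_allocated_score(n, pod_cpu_m))

print("LeastAllocated picks:", spread_choice.name)
print("MostAllocated picks:", packed_choice.name)
```

With these sample numbers, LeastAllocated sends the pod to the mostly empty node, while MostAllocated sends it to the node that is already busy.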

This bin-packing method increases overall node utilization, allowing underutilized nodes to be reclaimed by the Kubernetes cluster autoscaler, reducing the total number of EC2 instances required.
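The effect on node count can be illustrated with a small simulation. This is a toy model under stated assumptions: identical nodes, a single resource (CPU millicores), and pods placed one at a time. The pod sizes and node counts are hypothetical, not ClickHouse's numbers.

```python
def place_pods(pod_requests, node_capacity, node_count, strategy):
    """Place pods one by one and return the allocated amount per node.

    strategy is "least_allocated" (spread) or "most_allocated" (pack).
    """
    allocated = [0] * node_count
    for request in pod_requests:
        # Only nodes with enough remaining capacity are candidates.
        feasible = [i for i in range(node_count)
                    if allocated[i] + request <= node_capacity]
        if not feasible:
            raise RuntimeError("no node can fit this pod")
        if strategy == "least_allocated":
            target = min(feasible, key=lambda i: allocated[i])
        elif strategy == "most_allocated":
            target = max(feasible, key=lambda i: allocated[i])
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        allocated[target] += request
    return allocated


def nodes_in_use(allocated):
    """Count nodes that host at least one pod."""
    return sum(1 for amount in allocated if amount > 0)


pods = [1000] * 8  # eight hypothetical pods requesting 1000m CPU each

spread = place_pods(pods, node_capacity=4000, node_count=4,
                    strategy="least_allocated")
packed = place_pods(pods, node_capacity=4000, node_count=4,
                    strategy="most_allocated")

print("spread:", spread, "nodes in use:", nodes_in_use(spread))
print("packed:", packed, "nodes in use:", nodes_in_use(packed))
```

In this toy run the spread strategy keeps all four nodes half full, while the packing strategy fills two nodes and leaves two empty. Empty nodes are the ones an autoscaler can remove.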

However, since AWS EKS does not support customization of the default scheduler, ClickHouse had to implement a custom scheduler with the MostAllocated policy. This involved setting up the custom scheduler to prioritize nodes with higher resource utilization, ensuring efficient bin-packing of pods.
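For reference, a scheduler profile with the MostAllocated strategy might look roughly like the sketch below. It builds the configuration as a Python dictionary and prints it as JSON, which kube-scheduler can read because JSON is valid YAML. The scheduler name "bin-packing-scheduler" and the equal cpu/memory weights are assumptions for illustration, not values taken from ClickHouse's setup, and the apiVersion may differ depending on your Kubernetes version.

```python
import json


def build_scheduler_config(scheduler_name: str) -> dict:
    """Build a KubeSchedulerConfiguration that scores nodes with MostAllocated."""
    return {
        "apiVersion": "kubescheduler.config.k8s.io/v1",
        "kind": "KubeSchedulerConfiguration",
        "profiles": [
            {
                # Pods opt in to this profile via spec.schedulerName.
                "schedulerName": scheduler_name,
                "pluginConfig": [
                    {
                        "name": "NodeResourcesFit",
                        "args": {
                            "scoringStrategy": {
                                "type": "MostAllocated",
                                # Assumed weights: CPU and memory count equally.
                                "resources": [
                                    {"name": "cpu", "weight": 1},
                                    {"name": "memory", "weight": 1},
                                ],
                            }
                        },
                    }
                ],
            }
        ],
    }


# "bin-packing-scheduler" is a hypothetical name chosen for this example.
config = build_scheduler_config("bin-packing-scheduler")
print(json.dumps(config, indent=2))
```

The printed document would typically be mounted into the custom scheduler's pod and passed to kube-scheduler with the --config flag.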

Here is how they implemented HA for the custom scheduler.

  1. Deployed three scheduler pods to ensure redundancy.
  2. Only one pod schedules actively; the others stay on standby and take over through leader election if the active pod fails.
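The active/standby behavior can be sketched as a lease that one replica holds and keeps renewing. The Python below is an in-memory illustration of the idea only: a real scheduler coordinates through a Lease object in the Kubernetes API, and the replica names and the 15-second lease duration here are arbitrary values for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None   # replica that currently leads
    renewed_at: float = 0.0        # time of the last successful renewal
    duration_s: float = 15.0       # arbitrary lease duration for this sketch

    def expired(self, now: float) -> bool:
        return self.holder is None or now - self.renewed_at > self.duration_s


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Return True if the candidate is the leader after this attempt."""
    if lease.holder == candidate or lease.expired(now):
        lease.holder = candidate
        lease.renewed_at = now
        return True
    return False


lease = Lease()
replicas = ["scheduler-0", "scheduler-1", "scheduler-2"]  # hypothetical names

# t=0: all three replicas try; only the first one to reach the lease wins.
leaders_t0 = [r for r in replicas if try_acquire_or_renew(lease, r, now=0.0)]

# t=10: the leader renews, so the standbys still cannot take over.
try_acquire_or_renew(lease, "scheduler-0", now=10.0)
standby_t10 = try_acquire_or_renew(lease, "scheduler-1", now=10.0)

# scheduler-0 then fails and stops renewing. By t=30 the lease has expired,
# so the next standby that tries becomes the new leader.
leaders_t30 = [r for r in replicas[1:] if try_acquire_or_renew(lease, r, now=30.0)]

print("leader at t=0:", leaders_t0)
print("standby acquired at t=10:", standby_t10)
print("leader at t=30:", leaders_t30)
```

Only the lease holder schedules pods, so running three replicas adds redundancy without having three schedulers compete for the same pods.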

Dual-Scheduler Approach

ClickHouse ran both the default system scheduler and the custom scheduler side by side (a dual-scheduler setup).

The default scheduler handles system-level components (like CoreDNS, ArgoCD), keeping the cluster's core services stable and unaffected by optimization strategies.

The custom scheduler handles ClickHouse pods, focusing on bin-packing to increase resource utilization and reduce EC2 costs.
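In Kubernetes, a pod chooses its scheduler through the spec.schedulerName field. If the field is omitted, the pod goes to the default scheduler. The sketch below shows how the two kinds of pods might differ. The pod names, image names, and the scheduler name "bin-packing-scheduler" are placeholders, not ClickHouse's actual values.

```python
from typing import Optional


def build_pod(name: str, image: str, scheduler_name: Optional[str] = None) -> dict:
    """Build a minimal Pod manifest, optionally pinned to a named scheduler."""
    spec = {"containers": [{"name": name, "image": image}]}
    if scheduler_name is not None:
        # Without this field Kubernetes uses the scheduler named
        # "default-scheduler".
        spec["schedulerName"] = scheduler_name
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


# System component: no schedulerName, so the default scheduler places it.
system_pod = build_pod("coredns-example", "example.com/coredns:placeholder")

# Database workload: handled by the (hypothetically named) custom scheduler.
clickhouse_pod = build_pod(
    "clickhouse-example",
    "example.com/clickhouse-server:placeholder",
    scheduler_name="bin-packing-scheduler",
)

print(system_pod["spec"].get("schedulerName", "default-scheduler"))
print(clickhouse_pod["spec"]["schedulerName"])
```

Because the choice is made per pod, only the workloads that benefit from bin-packing need to change.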

This approach allows ClickHouse to balance between cost efficiency and system reliability.

Rolling Out the Custom Scheduler

ClickHouse had to roll out the custom scheduler to existing clusters that were already running workloads.

They started by applying the custom scheduler to a few smaller clusters first, testing the impact and ensuring no disruption.

After gaining confidence, they gradually rolled out the custom scheduler to larger clusters. To minimize disruptions, they used a PodDisruptionBudget, ensuring that only a limited number of pods were rescheduled at a time, avoiding interruptions to running services.
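A PodDisruptionBudget for such a rollout might look like the sketch below, again built as a Python dictionary. The name, the label selector, and the maxUnavailable value of 1 are assumptions for illustration. The small helper mirrors how the budget is evaluated for maxUnavailable: evictions are allowed only while enough pods remain healthy.

```python
import json


def build_pdb(name: str, match_labels: dict, max_unavailable: int) -> dict:
    """Build a PodDisruptionBudget that caps voluntary disruptions."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": match_labels},
        },
    }


def disruptions_allowed(healthy: int, desired: int, max_unavailable: int) -> int:
    """How many more pods may be evicted right now under the budget."""
    must_stay_healthy = desired - max_unavailable
    return max(0, healthy - must_stay_healthy)


# Hypothetical name and label; at most one matching pod may be down at a time.
pdb = build_pdb("clickhouse-pdb", {"app": "clickhouse-example"}, max_unavailable=1)
print(json.dumps(pdb, indent=2))

print(disruptions_allowed(healthy=3, desired=3, max_unavailable=1))
print(disruptions_allowed(healthy=2, desired=3, max_unavailable=1))
```

With three replicas and maxUnavailable set to 1, one pod can be evicted. The next eviction is refused until the first pod is healthy again.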

ClickHouse did not restart all pods immediately. Instead, pods were picked up by the new custom scheduler over time as part of their normal lifecycle, such as during restarts or upgrades.

After the rollout, they monitored the clusters using tools like EKS Node Viewer and internal dashboards to assess utilization improvements and measure cost savings.

Results: Increased Efficiency and Cost Savings

The implementation of this custom scheduler resulted in a significant increase in resource utilization, improving node utilization by 20-30%.

This bin-packing allowed for a 10% reduction in the number of EC2 nodes, especially high-cost nodes. Overall, this change led to a 20% reduction in EC2 costs, saving millions of dollars annually.

Key Takeaways for DevOps Engineers

  1. How to use bin-packing strategies like MostAllocated to optimize resource usage and reduce infrastructure costs.
  2. How to deploy and manage a custom scheduler in Kubernetes for specific workload optimization.
  3. How to limit disruptions during pod rescheduling with a PodDisruptionBudget.
  4. Why phased rollouts and monitoring matter when changing production environments.

Sources

  1. Bin-Packing Pods in Managed Kubernetes in AWS, GCP and Azure
  2. Saving Millions of Dollars by Bin-Packing ClickHouse Pods in AWS EKS