When VerticalPodAutoscaler Goes Rogue: How an Autoscaler Took Down Our Cluster

Tanat Lokejaroenlarb

The story of an investigation into strange cluster behaviour that led us to an important missing configuration, one that has since made our platform more stable.

What started as a minor alert about missing metrics quickly escalated into a relentless wave of pod evictions, crippling our cluster’s observability and threatening its stability.

This blog post walks you through the night of February 10th, when Fabián Sellés Rosa, Thibault JAMET and I pulled an all-nighter, chasing down a hidden misconfiguration in our Vertical Pod Autoscaler (VPA) — one that had been lurking unnoticed for months. What followed was a deep dive into VPA’s internals, hours of debugging, and ultimately, the discovery of a single missing configuration that finally saved the day.

Background: VPA and Why It Matters

At Adevinta, we run SCHIP, our internal Kubernetes-based platform: a multi-tenant runtime spanning 30+ Kubernetes clusters across 4 regions, running 90k pods and serving more than 300k requests per second at peak. One of the key components we rely on to keep our fleet solid and autonomous is the Vertical Pod Autoscaler (VPA), which dynamically adjusts resource requests for workloads.

VPA is essential in maintaining a large, diverse fleet of clusters, each varying in size and workload intensity. Without it, we would have to manually set and adjust resource requests for thousands of workloads, leading to inefficiencies, wasted resources, and potential instability. Autoscaling ensures that resources are allocated efficiently, but as we learned, it can take down critical parts of the system when it goes wrong.
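
For readers who haven't used it: VPA is configured per workload through a VerticalPodAutoscaler object. A minimal manifest looks roughly like this (the target is illustrative, borrowed from the ingress example that appears later in this post):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ingress
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-ingress-controller
  updatePolicy:
    updateMode: "Auto"        # allow VPA to evict pods to apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 4Gi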

The Incident: When VPA Started Evicting Everything

10:30 AM — Metrics Disappear, Pager Goes Off

Our first indication that something was wrong came from Prometheus, which started losing metrics and quickly triggered a page.

We checked Prometheus pods and found that they were evicted — by VPA. While it’s normal for VPA to kick in when resource usage spikes, something seemed off.

Evictions weren’t stopping.

We suspected an issue with Prometheus itself, as it’s common for the VPA to trigger a restart while Prometheus is still loading metrics from its volume. In such cases, an existing Prometheus instance can be OOMKilled before the newly started one has a chance to stabilize. To mitigate this, we attempted to recover it by deleting its Persistent Volume Claims (PVCs), but this did not resolve the issue.

11:00 AM — More Evictions Appear

Looking at cluster-wide events, we noticed that evictions weren’t limited to Prometheus. A significant number of system pods were also being evicted repeatedly.

37s Normal EvictedByVPA pod/xxx-b45cd8b9d-bbcrn Pod was evicted by VPA Updater to apply resource recommendation.
35s Normal EvictedByVPA pod/xxxxx Pod was evicted by VPA Updater to apply resource recommendation.
36s Normal EvictedByVPA pod/xxxxxxx Pod was evicted by VPA Updater to apply resource recommendation.
36s Normal EvictedByVPA pod/xxxxx Pod was evicted by VPA Updater to apply resource recommendation.
34s Normal EvictedByVPA pod/xxxxx Pod was evicted by VPA Updater to apply resource recommendation.
35s Normal EvictedByVPA pod/xxxxx Pod was evicted by VPA Updater to apply resource recommendation.
34s Normal EvictedByVPA pod/xxxxx Pod was evicted by VPA Updater to apply resource recommendation.
33s Normal EvictedByVPA pod/prometheus-cluster-metrics-prometheus-0 Pod was evicted by VPA Updater to apply resource recommendation.
34s Normal EvictedByVPA pod/xxxxxxx Pod was evicted by VPA Updater to apply resource recommendation.
32s Normal EvictedByVPA pod/prometheus-xxxx-dev-0 Pod was evicted by VPA Updater to apply resource recommendation.
33s Normal EvictedByVPA pod/prometheus-xxxxx-pro-0 Pod was evicted by VPA Updater to apply resource recommendation.
32s Normal EvictedByVPA pod/prometheusxxxxx-pro-0 Pod was evicted by VPA Updater to apply resource recommendation.
32s Normal EvictedByVPA pod/prometheus-xxxx-pro-0 Pod was evicted by VPA Updater to apply resource recommendation.
Kubernetes eviction events spiked significantly

Since this turned out to be a more global problem, we reviewed the changes we had recently deployed. We were chasing ghosts, reverting PRs that might have contributed to VPA instability, but nothing helped.

It also made sense that a large number of evictions would cause pod churn, which drives up cardinality in Prometheus metrics. This could explain why Prometheus was being OOMKilled: the constant churn of pods was creating excessive new time series, pushing its memory usage over the edge.

This meant that losing metrics wasn't the root cause of the problem, but a consequence of the ongoing eviction loop.

We turned our attention back to VPA.

For context, VPA consists of three components:

  • Recommender — Suggests CPU/memory requests based on observed usage.
  • Admission Controller — Injects recommended resource requests into pods when they start.
  • Updater — Evicts pods that don’t match recommendations so they can be recreated with new resource values.
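
To make the division of labour concrete: the Recommender writes its suggestions into the VPA object's status, roughly like this (values illustrative):

status:
  recommendation:
    containerRecommendations:
      - containerName: nginx-ingress-controller
        lowerBound:
          cpu: 100m
          memory: 300Mi
        target:               # what the Admission Controller injects at pod admission
          cpu: 250m
          memory: 512Mi
        upperBound:
          cpu: "1"
          memory: 1Gi

The Updater then compares running pods against these targets and evicts the ones that drift too far, relying on the Admission Controller to set the new requests when the pods are recreated.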

For the most part, this works well. Until it doesn’t.

11:30 AM — Debugging VPA

At this point, we decided to dive into VPA-specific data:

  • The state of the VPA recommender and its logs
  • VPA configuration & resource updates
  • VerticalPodAutoscalerCheckpoint objects, which store the historical resource-usage data points the Recommender bases its model on.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscalerCheckpoint
metadata:
  name: nginx
spec:
  containerName: nginx-ingress-controller
  vpaObjectName: ingress
status:
  cpuHistogram:
    bucketWeights:
      "0": 39
      "2": 1
      "3": 104
      "4": 339
      "5": 2342
      "6": 3338
      "7": 10000
      "8": 8512
      "9": 2405
      "10": 470
      "11": 260
      "12": 168
      "13": 32
      "14": 59
      "15": 14
      "16": 20
      "17": 13
      "18": 2
    referenceTimestamp: "2025-02-12T00:00:00Z"
    totalWeight: 1007.6418972627253
  memoryHistogram:
    bucketWeights:
      "22": 6317
      "23": 8
      "24": 1307
      "29": 1616
      "30": 26
      "31": 10000
      "34": 2291
      "35": 1
      "36": 4660
      "37": 1
    totalWeight: 7.305844833039835
  totalSamplesCount: 1256603

We suspected the recommender was struggling, perhaps because the collected VerticalPodAutoscalerCheckpoint objects had become corrupted (a wild guess). We deleted several VerticalPodAutoscalerCheckpoint objects and restarted the recommender a few times, but the evictions never stopped.

3:00 PM — The Impact and First Mitigation

By this point, the issue showed no signs of stopping. The evictions continued, primarily affecting pods with associated VPA objects — mostly critical system pods like Prometheus, log routers, ingress controllers, and certificate managers. As a result, many of our service SLOs began to degrade, impacting the stability of our platform.

Ingress SLO degraded
DNS SLO degraded
Logging SLO degraded

Running out of immediate solutions, we decided on two mitigation steps:

  1. Scaling down the VPA Recommender — We hoped that without new recommendations, VPA would stop triggering evictions.
  2. Manually increasing Prometheus resource requests — This was a defensive move to prevent Prometheus from being OOMKilled, ensuring we wouldn’t lose vital observability metrics.
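
For the second step, assuming the Prometheus Operator (which matches the pod naming in the events above), pinning the requests comes down to setting resources on the Prometheus object; a sketch with made-up numbers:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: cluster-metrics       # illustrative name
spec:
  resources:
    requests:
      cpu: "2"
      memory: 16Gi
    limits:
      memory: 16Gi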

8:00 PM — Webhook Failure Alert

While we were optimistic that scaling down the VPA Recommender would bring relief, it quickly became clear that it didn’t work. As we took a moment to reassess, another pager alert arrived — this time indicating webhook failures.

Since we run multiple validating and mutating admission webhooks inside the cluster, we had an alert configured to trigger when their failure rates exceeded a critical threshold.
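
As an illustration rather than our exact rule, an alert in this spirit can be built on the API server's standard admission metrics, here written as a Prometheus Operator PrometheusRule (the metric name is real, the threshold is made up):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: admission-webhook-alerts   # illustrative
spec:
  groups:
    - name: admission-webhooks
      rules:
        - alert: AdmissionWebhookSlow
          expr: |
            histogram_quantile(0.99,
              sum by (name, le) (
                rate(apiserver_admission_webhook_admission_duration_seconds_bucket[5m])
              )
            ) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Admission webhook {{ $labels.name }} p99 latency is above 5s"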

This new alert made it clear that:

  • Scaling down the recommender had no effect — evictions were still happening.
  • The VPA Admission Controller was struggling — indicating a deeper underlying issue.

At this stage, it was evident that we needed to shift our focus. Instead of trying to stop recommendations, we had to understand why the VPA Admission Controller was failing and whether that was the root cause of the eviction loop.

8:30 PM — Debugging the Admission Webhook

At this point, we knew that something deeper was wrong. The evictions weren’t stopping, and we needed to fully understand what was happening under the hood. Since the VPA Updater is responsible for evicting pods, we decided to take a step further.

We pulled the source code of the exact VPA updater version running inside our cluster, ran a local debugger with read-only cluster permissions to prevent unintentional issues, and set breakpoints to track how the eviction logic worked.

Through this debugging session, we discovered a crucial detail: the Updater kept selecting pods for eviction because they were missing the VPA annotations it expected to find on them.

This discovery pointed us toward the right question: who is responsible for adding these annotations?

The Answer: The VPA Admission Controller

That’s when it clicked: the VPA Admission Controller is responsible for injecting these annotations into new or updated pods.

At the same time, this perfectly aligned with the webhook failure alerts we had been receiving.

Uncovering the Webhook Bottleneck

To confirm our theory, we checked the webhook latency metrics, and the results were staggering:

  • The p99 webhook latency exceeded 20 seconds, meaning a significant share of requests were dangerously close to Kubernetes' webhook timeout limit

We tried to mitigate this by raising the webhook timeout to more than a minute, only to learn that:

Kubernetes enforces a hard upper limit of 30 seconds on the admission webhook timeout.
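
The timeout in question lives on the MutatingWebhookConfiguration itself, and the API server rejects values above 30 seconds. The object below follows the default names the VPA admission controller registers, but treat the details as a sketch:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: vpa-webhook-config
webhooks:
  - name: vpa.k8s.io
    timeoutSeconds: 30          # any value above 30 is rejected by the API server
    clientConfig:
      service:
        name: vpa-webhook
        namespace: kube-system
    # rules, sideEffects and admissionReviewVersions omitted for brevity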

Continuing through the Admission Controller's logs, we found multiple entries showing client-side throttling lasting more than 20 seconds, a perfect explanation for the latency.

A lot of K8S throttling was observed

A Dead End: FlowSchemas Didn’t Help

Our first instinct was to increase the API server’s capacity by tweaking FlowSchemas, thinking that the Admission Controller was making too many calls to the API server.

We applied a new FlowSchema object, but it made no difference.

apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  annotations:
    apf.kubernetes.io/autoupdate-spec: "false"
  name: temp-vpa-fix
spec:
  distinguisherMethod:
    type: ByUser
  matchingPrecedence: 100
  priorityLevelConfiguration:
    name: exempt
  rules:
  - nonResourceRules:
    - nonResourceURLs:
      - '*'
      verbs:
      - '*'
    resourceRules:
    - apiGroups:
      - '*'
      clusterScope: true
      namespaces:
      - '*'
      resources:
      - '*'
      verbs:
      - '*'
    subjects:
    - kind: User
      user:
        name: system:serviceaccount:kube-system:vpa-recommender
    - kind: User
      user:
        name: system:serviceaccount:kube-system:vpa-updater
    - kind: User
      user:
        name: system:serviceaccount:kube-system:vpa-admission-controller

A Hidden Parameter in the Code

While stepping through the code in our debugging session, we noticed something subtle but critical: a pair of rate-limiting parameters buried in the Admission Controller:

kube-api-qps: "5" # default
kube-api-burst: "10" # default

These settings weren't well documented in the Helm chart or in the VPA documentation, so we had never considered them before.

Realizing this could be the root cause, we immediately deployed a change to increase the API rate limits:

A Fix PR expanding the client-side rate limit
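
The change itself is small; applied to the admission controller's Deployment it amounts to raising two flags (the flag names come straight from the VPA binary, the values here are illustrative rather than our exact numbers):

# Sketch: container args on the VPA admission controller Deployment
spec:
  template:
    spec:
      containers:
        - name: admission-controller
          args:
            - --kube-api-qps=50     # client-side QPS limit, default 5
            - --kube-api-burst=100  # client-side burst limit, default 10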

The Breakthrough: Instant Recovery

As soon as the change was applied, we saw an instant drop in webhook tail latency.

  • The p99 webhook latency dropped significantly.
  • The missing annotations were finally being injected.
  • Evictions stopped completely.

That was it! We had finally fixed the issue.

Tracing Back the Root Cause

Looking back, we found that an internal change had significantly increased the number of VPA objects inside the cluster.

The VPA version we were running made an excessive number of API calls, amplifying the throttling issue. Without realizing it, the Admission Controller was operating under a rate limit too low to handle the increased load, causing webhook failures and, ultimately, the endless eviction loop.

By uncovering an undocumented setting, debugging the source code, and analyzing the eviction flow, we managed to restore stability to the cluster.

Key Takeaways & Lessons Learned

1. Know Your Components

We initially focused on the VPA Recommender and its Checkpoints, but the evictions were being driven by the VPA Updater reacting to a failing Admission Controller. This experience reinforced the importance of understanding the full lifecycle of how VPA manages workloads.

Each component plays a distinct role — misidentifying the source of a failure can lead to wasted time and ineffective mitigations. Whether it’s VPA, autoscalers, or other core infrastructure, knowing which component does what is essential for efficient debugging.

2. Beware of Fixation on Recent Changes

When something breaks, it’s easy to assume that the most recent changes are to blame. This tunnel vision led us to initially suspect the Prometheus chart upgrade as the root cause.

However, after extensive debugging, we realized that the problem had existed for months — the throttling issue started in July but only became critical when the number of VPA objects in the cluster increased dramatically.

This taught us that while recent changes can be contributing factors, the root cause might be buried deeper in system history. Instead of immediately blaming the last change, we now prioritize:

  • Looking at historical trends in metrics.
  • Validating correlations instead of assuming causation.
  • Systematically eliminating possibilities before jumping to conclusions.

3. Webhook Failures Are Silent Killers

Unlike validating webhooks, VPA’s mutating webhook doesn’t block pod creation (in our configuration) — which means its failures aren’t immediately obvious. Instead, failures cascade over time, leading to problems like persistent evictions, missing annotations, and escalating instability.
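
That fail-open behaviour is controlled by failurePolicy on the webhook configuration. In our setup it is set to Ignore, so the API server simply admits pods without the mutation when the webhook times out or errors:

# Excerpt from the VPA MutatingWebhookConfiguration shown earlier
webhooks:
  - name: vpa.k8s.io
    failurePolicy: Ignore   # fail open: pods start with their original requests
    # failurePolicy: Fail would surface the problem immediately,
    # at the cost of blocking pod creation while the webhook is unhealthy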

A failed webhook doesn’t just impact the pods directly involved — it can create widespread instability across multiple services. This incident showed us the importance of:

  • Monitoring webhook latency and failure rates.
  • Treating webhook failures as first-class incidents even when they don’t seem urgent.
  • Understanding the impact of webhook failures on downstream components.

4. Debugging Open-Source Software is Invaluable

Having access to VPA's source code and running a live debugger was the key to finding the real issue. Without it, we might have continued down the wrong path for much longer.

We often assume open-source tools will “just work,” but when things go wrong, understanding how the software operates is crucial.

Key takeaways:

  • Reading source code can uncover hidden configurations (like kube-api-qps and kube-api-burst).
  • Live debugging lets you trace execution paths and confirm assumptions.
  • Knowing how to navigate open-source projects can turn a guessing game into a structured investigation.

By applying these lessons, we’ve improved our incident response process, fine-tuned our monitoring, and strengthened our understanding of Kubernetes internals. Most importantly, we now question our assumptions earlier and dig deeper when an issue isn’t immediately clear.

Final Thoughts

This incident was frustrating, but it was also a great learning experience. It reinforced how important it is to fully understand the tools we rely on — especially when they’re managing critical infrastructure.

Also, next time your autoscaler is acting up, check if it’s hitting a rate limit you didn’t know existed. It might just save you an entire day of debugging.
