Erik Zilinsky Feb 09, 2025

Vertical Pod Autoscaler (VPA): The Recommender

Hey,

In this post, I want to dive deep into VPA's Recommender component (version 1.3.0), focusing mainly on how it aggregates CPU and memory samples and how the recommendations are calculated. If you're short on time, I provide an overview of VPA in the first section.

Happy reading!

Disclaimer: This post discusses the mechanisms of VPA version 1.3.0. If you are using a later version of VPA, the information in this post may be outdated, as new features or internal workings may have been reworked in newer versions.

Introduction

VPA is used to fine-tune CPU and memory resource requests and limits automatically, reducing slack (slack refers to the unused or excess resources that are allocated but not utilized). It also ensures that pods do not run out of resources. VPA is an independent community project under the Kubernetes GitHub organization that must be explicitly deployed to your Kubernetes cluster.

At the moment, VPA can only control memory and CPU resources. It needs to be enabled on a per-deployment basis in Kubernetes (although it also supports other objects like StatefulSets, Jobs, CronJobs, DaemonSets, and anything with a scale subresource that manages pods) by defining a CRD called VerticalPodAutoscaler.

VPA can react to events such as OOMKilled by increasing memory and to CPU starvation by increasing CPU requests and limits. Only resource requests are calculated based on historical (if available) and current usage metrics, while limits are set proportionally based on the initially defined request-to-limit ratio. If we omit limits, VPA will not set them for us - it will only calculate and set resource requests.

It is recommended to combine VPA with Cluster Autoscaler (CA) or Karpenter. Without them, VPA may increase resource requests, but if there isn't enough available memory or CPU, the pod may fail to deploy.

VPA is great because, without it, people tend to overprovision their workloads to avoid out-of-memory (OOM) errors and CPU throttling. However, this leads to wasted resources, which can become costly at scale. On the other hand, some underprovision their workloads, resulting in OOM errors and CPU throttling. Others may manually rightsize their workloads based on extensive load testing, but resource requests and limits can quickly become outdated as usage patterns change or the application evolves. VPA solves this by automatically rightsizing workloads within Kubernetes, ensuring efficient resource allocation and adapting to changing usage patterns.

VPA is not the ideal choice if your application is sensitive to disruptions or has difficulty handling pod termination (e.g. don't support graceful termination). It may not be suitable if your clients cannot manage retries during disruptions. This is because VPA may evict your pods to apply the recommended resource values - more on that later.

VPA consists of three components (each running as a separate pod):

VPA offers multiple modes, which need to be set in the VerticalPodAutoscaler CRD:

VPA does not account for the individual needs of single pods - over time, all pods within the same workload (e.g. Kubernetes Deployment) will receive the same recommendation. This can lead to significant resource waste if some pods require more or fewer resources than others within the same workload. We can reduce this risk by evenly distributing the load across our pods.

If there are PodDisruptionBudget (PDB) objects in your cluster, consider using them with the new AlwaysAllow option to support your workload. This option can help remove a pod from an OOMKilled CrashLoopBackoff state when the PDB is or would be violated.

With VPA, two new CRDs are introduced. Let's dive into them more deeply in the next two sections:

VerticalPodAutoscaler CRD

Here's a sample VPA object with additional comments explaining some of its fields:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: randomDeployment
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: randomDeployment
  updatePolicy:
    updateMode: 'Auto'
  resourcePolicy:
    containerPolicies:
        // this scaling policy will apply to every container
      - containerName: '*'
        // specifies the minimal amount of resources that will be recommended
        minAllowed:
          memory: 50Mi
        // specifies the maximum amount of resources that will be recommended
        maxAllowed:
          memory: 500Mi
        // type of recommendations that will be computed
        controlledResources:
          - memory
        // which resource values should be controlled
        controlledValues: 'RequestsAndLimits'
Status:
  // calculated by the Recommender
  Recommendation:
    Container Recommendations:
      Container Name:  container1
      Lower Bound:
        Cpu:     25m
        Memory:  52428800
      Target:
        Cpu:     30m
        Memory:  102428800
      Uncapped Target:
        Cpu:     35m
        Memory:  132428800
      Upper Bound:
        Cpu:     35m
        Memory:  132428800

VerticalPodAutoscalerCheckpoint CRD

Handling OOMKilled events

The Recommender monitors pod eviction events. If a pod is evicted due to an OOMKilled event, the Recommender increases the memory in the corresponding VPA object. If the mode is set to auto, the pod is then recreated. It sets the memory request to the last observed maximum memory usage and adds an additional 20% (the constant DefaultOOMBumpUpRatio) to help mitigate future OOM errors.

Handling CPU Starvation

After a while, the VPA will recreate pods experiencing CPU starvation if we have created the corresponding VPA object with updatePolicy.mode set to Auto.

Recommender's loop

Now, let's first discuss the most important component of VPA - the Recommender - starting with two Golang structs that I consider important.

Resource requests for pods are calculated based on both historical (if available) and current resource usage. Historical data is available if the Recommender has been running for a longer period, as it gathers data from the Kubernetes Metrics Server and aggregates it into its memory by default. Another option to collect historical resource usage data is to pass the Prometheus address through a flag to the Recommender.

The calculation of resource requests involves the following steps (detailed breakdown of the VPA algorithm follows):

Update ClusterState with VPA objects (LoadVPAs func)

It loads the existing VPA objects' Status and Spec into the ClusterState, including recently computed values such as Target, LowerBound, UpperBound, and UncappedTarget.

Update ClusterState with Pod and Container Details (LoadPods func)

Basic pod (PodState) and container specifications (ContainerState) are loaded into the ClusterState, including labels attached to the pod, the containers belonging to the pod, current resource requests, container images, etc. Note that CPU and memory usage are collected in the next step.

If the loaded container does not already have an AggregateContainerState structure, one is created during this step. This includes initializing empty decaying histograms for CPU and memory usage. Link to code.

If the Recommender is running with the --memory-saver flag, it only tracks pods and containers that have an associated VPA object.

Update ClusterState with usage metrics (LoadRealTimeMetrics func)

By default, the current CPU and memory usage data is collected from the Kubernetes Metrics Server (KMS) for containers, with CPU measured in millicores and memory in bytes. For each container running in the cluster, usage data for each resource is gathered over a specified duration (indicated by the SnapshotWindow field) from the KMS.

Metrics collection can be restricted to a specific namespace using the --vpa-object-namespace flag. By default, metrics are collected across all namespaces and for all running containers.

At this stage, the Recommender attempts to aggregate each individual CPU and memory usage sample. To better understand this process, let's consider a sample Kubernetes deployment named resource-consumer, running in the default namespace with two pod replicas, each pod contains a single container named resource-consumer:

Pod: resource-consumer-748f7fc9b6-9mg4n
  └─ Container: resource-consumer

Pod: resource-consumer-748f7fc9b6-hsmtb
  └─ Container: resource-consumer

Based on this deployment, we will collect two snapshots from the KMS - one for each container:

First snapshot ContainerMetricsSnapshot[0]:

{
    "ID": {
        "PodID": {
            "Namespace": "default",
            "PodName": "resource-consumer-748f7fc9b6-9mg4n"
        },
        "ContainerName": "resource-consumer"
    },
    "SnapshotTime": "2025-02-01T08:06:44Z",
    "SnapshotWindow": "12393000000",
    "Usage": [
            "cpu": "233",
            "memory": "93356032"
    ]
}

Second snapshot ContainerMetricsSnapshot[1]:

{
    "ID": {
        "PodID": {
            "Namespace": "default",
            "PodName": "resource-consumer-748f7fc9b6-hsmtb"
        },
        "ContainerName": "resource-consumer"
    },
    "SnapshotTime": "2025-02-01T08:06:48Z",
    "SnapshotWindow": "12216000000",
    "Usage": [
            "cpu": "233",
            "memory": "93274112"
    ]
}

As you can see, the CPU usage for both pods is the same, while their memory usage differs slightly. Next, we loop through all the snapshots (in this case, there are just two). This means we have 2 CPU samples and 2 memory samples to aggregate.

Let's take a look at how one CPU sample aggregation works:

{
    "ContainerUsageSample": {
        "MeasureStart": "2025-02-01T08:06:44Z",
        "Usage": 233,
        "Request": 0,
        "Resource": "cpu"
    },
    "Container": {
        "PodID": {
            "Namespace": "default",
            "PodName": "resource-consumer-748f7fc9b6-9mg4n"
        },
        "ContainerName": "resource-consumer"
    }
}

A CPU sample is discarded (i.e., not added to the CPU histogram) if its MeasureStart timestamp is later than the start timestamp of the last CPU sample that was already aggregated into the histogram. This helps prevent the addition of duplicate or out-of-order CPU samples.

CPU Usage is converted into CPU cores, meaning that 233 milicores of Usage are converted to 0.233 CPU cores. Next, we determine the weight of the CPU sample, which is set to 0.1 by default. The next step is to multiply the weight by the value returned from the decaying factor method (decayFactor). The decaying factor is used to reduce the sample's weight ("importance") by half for each halfLife period, meaning that newer samples receive a higher weight than older ones. By default, the halfLife is set to 24 hours. In our example, the decaying factor is 161.793, which is then multiplied by 0.1 to give the final weight of 16.1793. Below is an example of how the decaying factor method's return value increases as the MeasureStart timestamp increases:

func main() {
	sampleTime1, _ := time.Parse(time.RFC3339, "2025-02-01T08:06:44Z") //current timestamp
	sampleTime2, _ := time.Parse(time.RFC3339, "2025-02-01T08:12:44Z") //timestamp shifted +6 mins
	sampleTime3, _ := time.Parse(time.RFC3339, "2025-02-02T08:08:44Z") //timestamp shifted +1 day
	referenceTimestamp, _ := time.Parse(time.RFC3339, "2025-01-25T00:00:00Z")

	HalfLife := time.Hour * 24

	// calculating the decaying factor
	fmt.Println(float64(math.Exp2(float64(sampleTime1.Sub(referenceTimestamp)) / float64(HalfLife))))
	fmt.Println(float64(math.Exp2(float64(sampleTime2.Sub(referenceTimestamp)) / float64(HalfLife))))
	fmt.Println(float64(math.Exp2(float64(sampleTime3.Sub(referenceTimestamp)) / float64(HalfLife))))

	// Outputs:
	// 161.79343499365757
	// 162.26138818303303
	// 323.8985384947324
}

Now that we have an usage value (0.233), a final weight (16.1793 = 161.793 * 0.1), and a timestamp (2025-02-01T08:06:44Z), we are ready to aggregate this CPU sample. The Recommender will then find the Kubernetes deployment's CPU histogram to which the sample should belong. Let's assume this is the CPU histogram before aggregation:

bucketWeight: [1,2,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0....]
totalWeight: 10,
minBucket: 0, // Index of the first non-empty bucket
maxBucket: 3 // Index of the last non-empty bucket
firstBucketSize: 0.01,
ratio: 1.05,
epsilon: 0.0001
numBuckets: 176

Next, we determine the bucket's index using the FindBucket method. CPU histograms use an exponential bucketing scheme, with the smallest bucket size being 0.01 core and a maximum of 1000.0 cores. With the above-mentioned parameters, such as firstBucketSize, ratio and numBuckets, the first bucket starts at 0 with index 0, the second at 0.01, and the third at 0.020499. The start of a bucket with a given index is calculated using the GetBucketStart method:

package main

import (
	"fmt"
	"math"
)

var ratio float64 = 1.05
//var firstBucketSize float64 = 10000000 // 10 MB
var firstBucketSize float64 = 0.01 // 10 milicore
var numBuckets int = 176 // 176 for CPU and memory histograms

func GetBucketStart(bucket int) float64 {
	if bucket < 0 || bucket >= numBuckets {
		panic(fmt.Sprintf("index %d out of range [0..%d]", bucket, numBuckets-1))
	}
	if bucket == 0 {
		return 0.0
	}
	return firstBucketSize * (math.Pow(ratio, float64(bucket)) - 1) / (ratio - 1)
}

func main() {
	fmt.Println(GetBucketStart(0))
	fmt.Println(GetBucketStart(1))
	fmt.Println(GetBucketStart(2))
	fmt.Println(GetBucketStart(3))
}
// Outputs:
// 0
// 0.01
// 0.020499999999999987
// 0.031525

Okay, in our example, with a usage of 0.233 CPU cores, the bucket index would be 15. Once we identify the bucket index, we add the new weight (16.1793) to it. The updated bucketWeight[15] will be 16.1793:

bucketWeight: [1,2,3,4,0,0,0,0,0,0,0,0,0,0,0,16.1793,0....]
totalWeight: 26,1793,
minBucket: 0, // Index of the first non-empty bucket
maxBucket: 15 // Index of the last non-empty bucket
firstBucketSize: 0.01,
ratio: 1.05,
epsilon: 0.0001

Now that we've aggregated one CPU sample for a container, let's look at how a single Memory sample is aggregated in the following:

{
    "ContainerUsageSample": {
        "MeasureStart": "2025-02-01T08:06:44Z",
        "Usage": 93356032,
        "Request": 0,
        "Resource": "memory"
    },
    "Container": {
        "PodID": {
            "Namespace": "default",
            "PodName": "resource-consumer-748f7fc9b6-9mg4n"
        },
        "ContainerName": "resource-consumer"
    }
}

Each new memory sample is compared against the peak memory usage within the current aggregation interval, the end of which is stored in the container's WindowEnd field. VPA uses peak memory rather than the entire distribution (as it does with CPU), because we typically want to provision for (or near) the peak memory usage. The current peak memory is stored in the oomPeak field within the container's ContainerState struct. The aggregation interval can be adjusted using the --memory-aggregation-interval flag, with the default set to one day.

The comparison takes place at this line, and here are the steps involved:

  1. If the new memory sample is larger than the current peak and its timestamp is earlier than the end of the current aggregation interval (WindowEnd), the Recommender follows these steps:
    1. Subtract the old peak memory sample's weight from the Memory Histogram using this method. The method uses the old peak value in bytes, the weight (calculated as 1.0 * decayFactor(WindowEnd)), and the timestamp (which is the container's WindowEnd).
    2. Add the new peak's weight to the Memory histogram's appropriate bucket based on the new peak value. This is done using the AddSample method. The bucket index for the new weight is returned by the FindBucket method. The arguments for the AddSample method are:
      1. New peak memory usage in bytes
      2. Weight, calculated as 1.0 * decayFactor(WindowEnd)
      3. Timestamp, which is the container's WindowEnd
  2. If the new memory sample is smaller than the current peak within the current aggregation interval (WindowEnd), we don't aggregate the new sample. Instead, we simply move on to the next memory sample.

If the aggregation window ends, we set the peak memory to 0, as we're moving to the next interval (link). Of course, we add the current memory sample as the new peak.

Memory histograms use an exponential bucketing scheme (just like CPU histograms), with the smallest bucket size being 10MB and the maximum size being 1TB. The bucket with index 0 starts at 0, the next starts at 10 MB, followed by 20.49 MB, and then 31.5249 MB (calculated by GetBucketStart func).

Now that we've aggregated all the CPU and memory samples, we can move to the next step.

Update the status field of the VPA objects (UpdateVPAs func)

At this step, the status field of all the VPA CRD objects is updated, meaning that recommendations - Target and other values like UpperBound, LowerBound, and UncappedTarget are calculated. Here's the process for calculating these values for one aggregated container state (AggregateContainerState):

  1. Find the CPU and Memory distributions: The AggregateCPUUsage and AggregateMemoryPeaks are retrieved.
  2. Return specific percentiles: The Recommender returns percentiles of the CPU and memory peak distributions:
    • p50 (the median) for the LowerBound
    • p90 for the Target
    • p95 for the UpperBound
  3. Add a safety margin: A safety margin is added to all three values, with a default of +15%.
  4. Check against "min" flags : The values for LowerBound, UpperBound, and Target are compared against the values specified by the flags --pod-recommendation-min-cpu-millicores and --pod-recommendation-min-memory-mb. If the calculated values are lower than the values set by the flags, the flag values are used instead.
  5. Apply confidence multiplier: If calculating the LowerBound or UpperBound, a confidence multiplier is applied to these values (I will explain the confidence multiplier later).

Afterward, UncappedTarget is set to be the same as the Target. The values for Target, UpperBound, and LowerBound are then checked against the limits set in the VPA object (minAllowed and maxAllowed), and they are adjusted to comply with the policies.

Maintaining checkpoints

In this step, the Recommender serializes the aggregated historical usage of containers into checkpoint statuses, which are associated with VPA objects. For example, a CPU and memory histogram for a container with the same name, the same pod labels, and within the same namespace will be stored in a single checkpoint object.

New weights are derived from AggregateCPUUsage and AggregateMemoryPeaks, as illustrated based on the following sample histogram:

bucketWeight: [957.41,8.87,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0....]
totalWeight: 966.28,
minBucket: 0, // Index of the first non-empty bucket
maxBucket: 1 // Index of the last non-empty bucket
firstBucketSize: 0.01,
ratio: 1.05,
epsilon: 0.0001
  1. We obtain the max weight, which corresponds to the bucket at index 0 with a weight of 957.41.
  2. We compute the ratio as:
    ratio := float64(10000) / max
  3. Then, for each bucket, we calculate the new weight by multiplying its weight by the ratio, and rounding the result.

The new bucket weights are calculated as follows, using a ratio of 10.444:

BucketWeights: map[int]uint32 [
	0: 10000,
	1: 93
],
TotalWeight: 966.28

Next, we save the updated buckets with their new weights, along with additional fields like TotalWeight, FirstSampleStart, TotalSamplesCount, etc... into a checkpoint object. If the checkpoint object does not exist, it will be created.

Orphaned checkpoints are also garbage collected at this step, as mentioned earlier in the VerticalPodAutoscalerCheckpoint CRD section.

Garbage collection of obsolete aggregated usage measurements

Obsolete AggregateContainerState objects are garbage collected every hour by default, as defined by the aggregateContainerStateGCInterval constant.

An AggregateContainerState becomes obsolete if, for example, its last sample is older than 8 days (reference), or if the TotalSamplesCount equals 0 and there is no controller attached to the pods (e.g., Deployment).

Confidence multiplier

The most accurate recommendations are achieved by allowing the Recommender to collect 8 days' worth of usage metrics. Without historical data, the Recommender will rely solely on current usage metrics, which can result in a higher UpperBound and a lower LowerBound to avoid potential evictions. However, thanks to the confidence multiplier in the algorithm, as more historical usage data is gathered, both the UpperBound and LowerBound will converge closer to the Target. The LowerBound tends to converge much faster than the UpperBound because upscaling can be done more freely than downscaling.

Prometheus

To obtain stable recommendations immediately, we should connect Prometheus to the Recommender as a history provider instead of relying on checkpoint objects or AggregateContainerState structs, which may not be available for a specific group of pods. Here are the flags I used to successfully fetch historical data from Prometheus into the Recommender's memory:

--storage=prometheus
--prometheus-address=http://prometheus-operated.monitoring.svc.cluster.local:9090
--prometheus-cadvisor-job-name=kubelet
--pod-namespace-label=namespace
--pod-name-label=pod
--metric-for-pod-labels=kube_pod_labels{job="kube-state-metrics"}[8d]
--container-pod-name-label=pod
--container-name-label=container

Summary

To summarize, by default, the Recommender collects usage metrics for all running containers and saves them to its memory every minute. These metrics are aggregated into decaying histograms, which are periodically checkpointed for containers managed by a VPA object. The checkpoints act as persistent storage, ensuring data is retained if the Recommender pod is restarted, killed, or crashes.

If the Recommender has been running for some time and we create a VPA object, we can expect more stable recommendations for our set of pods, especially if they have been running for a couple of days after the Recommender was started in the cluster. A more reliable way to obtain stable recommendations immediately is to connect Prometheus to the Recommender as a history provider, rather than relying on checkpoint objects.

VPA is a powerful tool for automatically rightsizing workloads and reducing resource slack. However, it may not be suitable for everyone. For example, if your workloads cannot tolerate disruptions or have a bursty usage pattern, VPA might not be the best fit for you.

Back