For the complete documentation index and AI-optimized content, see /llms.txt. All pages support markdown format via .md extension or Accept: text/markdown header.

Kubernetes & Autoscaling Best Practices

Autoscaling works best when pods are easy to schedule, quick to start, honest about readiness, and cheap to terminate. These recommendations apply to workloads scaled by Kedify, the Kubernetes Horizontal Pod Autoscaler, the Vertical Pod Autoscaler, or a node autoscaler.

Use horizontal scaling when work can be split across replicas: HTTP traffic, queue consumers, workers, and stateless APIs. Use vertical scaling when a replica needs more CPU or memory to do the same unit of work, or when the workload is not easy to parallelize.

Node autoscaling is the capacity layer underneath both. HPA or KEDA creates more pods; the node autoscaler adds nodes only when pending pods cannot fit on existing nodes and would be schedulable on a new one. Pod resource requests, affinity, topology constraints, and storage requirements all affect that decision.

Avoid letting multiple autoscalers fight over the same signal. If HPA scales on CPU utilization, changing CPU requests with VPA or another vertical scaler changes the denominator used by HPA. For workloads that use both horizontal and vertical scaling, prefer external or business metrics for horizontal scaling, or keep vertical changes bounded and gradual.

Lean pods scale faster and place more reliably.

  • Keep container images small and pin versions. Avoid latest; use immutable tags or digests. See Kubernetes image guidance.
  • Remove nonessential sidecars from autoscaled workloads. Every sidecar adds image pulls, resource requests, probes, shutdown handling, and scheduling constraints.
  • Keep init containers short and deterministic. Pre-bake dependencies into the image instead of downloading large files during startup.
  • Avoid running migrations, large cache builds, or remote dependency checks in the scale-up path.
  • Avoid persistent volume attachment for fast-scale replicas unless the workload really needs it.
  • Keep startup work independent of total cluster size and total data size. If warm-up is required, expose readiness only after the pod can safely serve traffic or consume work.

Kubernetes schedules pods by their resource requests. Node autoscalers also reason from requests, not from actual runtime usage. HPA CPU utilization targets are a percentage of the requested CPU, so they only work when CPU requests are set.

Set CPU and memory requests for every container, including sidecars. Requests that are too low cause overpacking, throttling, eviction pressure, and misleading autoscaling signals. Requests that are too high make pods harder to schedule, increase cost, and can block node consolidation.

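Use QoS classes intentionally: Kubernetes derives the class (Guaranteed, Burstable, or BestEffort) from the requests and limits you set, and the class influences eviction order under node pressure. For latency-sensitive pods, requests should cover normal sustained usage with headroom. For memory, set realistic limits because memory cannot be throttled like CPU; a container that exceeds its memory limit can be OOM-killed. For CPU, avoid arbitrary low limits that throttle healthy traffic while the node still has spare CPU.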
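As a rough illustration of how requests and limits map to QoS classes, the two Pod manifests below contrast a Burstable and a Guaranteed configuration. The pod names and resource values are hypothetical and only meant to show the shape, not recommended sizes.

```yaml
# Burstable: requests are set, the memory limit is above the request,
# and there is no CPU limit, so the pod can use spare node CPU.
apiVersion: v1
kind: Pod
metadata:
  name: api-burstable        # hypothetical name
spec:
  containers:
    - name: api
      image: registry.example.com/api:v1.2.3
      resources:
        requests:
          cpu: 250m
          memory: 512Mi
        limits:
          memory: 1Gi
---
# Guaranteed: every container sets CPU and memory limits equal to its requests.
apiVersion: v1
kind: Pod
metadata:
  name: api-guaranteed       # hypothetical name
spec:
  containers:
    - name: api
      image: registry.example.com/api:v1.2.3
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: 500m
          memory: 1Gi
```

The class Kubernetes assigned is visible in the pod status, for example with kubectl get pod api-burstable -o jsonpath='{.status.qosClass}'.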

CPU and memory are useful, but they are often late signals. For fast horizontal scaling, prefer metrics that represent incoming work:

  • HTTP request concurrency, in-flight requests, or request rate
  • Queue depth, lag, or oldest-message age
  • Work item backlog or pending jobs
  • Domain-specific capacity signals, such as model inference queue length

Keep targets below saturation. Scaling at 90-100% utilization leaves little room for metric lag, pod startup time, and downstream retries. Also set a realistic maximum replica count so autoscaling does not overload databases, brokers, or APIs.
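A minimal sketch of scaling on incoming work with a bounded replica count, assuming KEDA with its Prometheus scaler. The Prometheus address, the http_requests_total metric, the threshold, and the replica bounds are all assumptions to adapt to the workload.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api
spec:
  scaleTargetRef:
    name: api                # Deployment to scale
  minReplicaCount: 2
  # Cap replicas at what databases, brokers, and downstream APIs can absorb.
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        # Assumed in-cluster Prometheus endpoint.
        serverAddress: http://prometheus.monitoring.svc:9090
        # Assumed metric: total request rate across all api pods.
        query: sum(rate(http_requests_total{app="api"}[1m]))
        # Target requests per second per replica; keep it below the rate
        # at which a single replica saturates.
        threshold: "50"
```

If one replica saturates at roughly 80 requests per second, a threshold of 50 leaves room for metric lag and pod startup time.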

For KEDA external metric failures, configure ScaledObject fallback so workloads keep a safe baseline when the primary metric source is unavailable.
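The fallback section of a ScaledObject might look like the fragment below. The numbers are illustrative assumptions, and which triggers and metric types support fallback depends on the KEDA version, so check the documentation for the release you run.

```yaml
# Fragment of a ScaledObject spec; scaleTargetRef and triggers are omitted.
spec:
  fallback:
    # Consecutive failed metric reads before the fallback applies.
    failureThreshold: 3
    # Replica count to hold while the metric source is unavailable.
    replicas: 4
```

Pick a fallback replica count that can carry typical load, not the minimum, since the outage may coincide with real traffic.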

Scale up faster than you scale down. Fast scale-up protects latency and backlog. Slower scale-down reduces pod churn, avoids interrupting in-flight work, and gives the node autoscaler better consolidation signals.

Use HPA behavior to set stabilization windows and scale policies. For KEDA scale-to-zero behavior, tune pollingInterval, cooldownPeriod, activation thresholds, minReplicaCount, and maxReplicaCount for the workload’s cold-start tolerance.
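A hedged sketch of asymmetric scaling with HPA behavior, for a Deployment that is not already managed by a KEDA ScaledObject (KEDA creates its own HPA). The target name, replica bounds, utilization target, and policy values are assumptions, not tuned recommendations.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      # React immediately; allow doubling every 15 seconds.
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    scaleDown:
      # Act on the highest recommendation of the last 5 minutes,
      # then remove at most 10% of replicas per minute.
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```

With KEDA, the same behavior block can be passed through the ScaledObject under advanced.horizontalPodAutoscalerConfig.behavior.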

Keep latency-critical services above zero replicas. Scale to zero is best for workloads where queuing, a waiting page, or a cold-start delay is acceptable.
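For a workload that tolerates cold starts, the KEDA settings named above could be combined as in this sketch. It assumes a Kafka consumer; the workload name, broker address, consumer group, topic, and all thresholds are hypothetical.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer      # hypothetical workload
spec:
  scaleTargetRef:
    name: orders-consumer
  pollingInterval: 15        # seconds between metric checks
  cooldownPeriod: 300        # seconds without activity before scaling to zero
  minReplicaCount: 0         # scale to zero when idle
  maxReplicaCount: 30
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.svc:9092   # assumed broker address
        consumerGroup: orders
        topic: orders
        # Target lag per replica once the workload is active.
        lagThreshold: "100"
        # Lag must exceed this value before scaling from zero to one.
        activationLagThreshold: "5"
```

For a latency-critical service, the same object with minReplicaCount set to 1 or higher keeps warm capacity at the cost of idle resources.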

At large scale, avoid extremely short polling and cooldown settings across many workloads. They can increase metric API load, pod churn, and control-plane pressure. Kubernetes documents additional limits and considerations in its guidance for large clusters.

Use startup, readiness, and liveness probes for different jobs:

  • startupProbe protects slow-starting applications from premature liveness failures.
  • readinessProbe should succeed only when the pod can accept traffic or consume work.
  • livenessProbe should detect a process that needs a restart, not temporary overload.

Keep probe handlers cheap and local. A readiness probe that depends on slow external systems can amplify outages by removing healthy pods during dependency trouble. If the application needs a downstream dependency to serve correctly, make that dependency state explicit and test the behavior under partial failure.

Autoscaling, node drains, rollouts, and manual deletes all terminate pods. During pod termination, Kubernetes sends a stop signal (SIGTERM by default) to each container and waits up to the pod’s terminationGracePeriodSeconds, which defaults to 30 seconds. If the process is still running when the grace period expires, Kubernetes sends SIGKILL.

Configure every workload so it can stop cleanly before that forced kill. Without graceful shutdown, scale-down events can interrupt in-flight HTTP requests, drop long-running jobs, leave queue messages half-processed, or close database connections without cleanup.

The usual pattern is:

  • Handle SIGTERM in the application process.
  • Stop accepting new requests or new queue messages immediately.
  • Make readiness fail while shutdown is in progress.
  • Let in-flight requests, jobs, and acknowledgements finish.
  • Close clients and background workers, then exit before the grace period expires.
  • Set terminationGracePeriodSeconds to the longest expected shutdown path plus a small buffer.

Use a preStop hook only for Kubernetes-specific coordination, such as calling a local drain endpoint or waiting briefly for endpoint updates to propagate. Keep it lightweight and idempotent: the hook's run time counts against the pod termination grace period, and Kubernetes sends the stop signal only after the hook completes.

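```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  # apps/v1 requires a selector that matches the template labels.
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      # Upper bound for the preStop hook plus SIGTERM handling.
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: registry.example.com/api:v1.2.3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 1
          lifecycle:
            preStop:
              # Asks the application to start draining before SIGTERM arrives.
              httpGet:
                path: /shutdown
                port: 8080
```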
In this example, the application should make /ready fail after /shutdown is called or after it receives SIGTERM, stop taking new work, drain active work, and exit within 60 seconds. For queue consumers, stop polling, finish or safely requeue in-flight messages, then exit.
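Where the application has no drain endpoint, a brief wait is another common use of preStop: it gives endpoint updates time to propagate before SIGTERM arrives. The fragment below is a sketch that assumes the image ships a shell with sleep; the 5-second value is illustrative.

```yaml
# Fragment of a Deployment spec; only the fields relevant to shutdown are shown.
spec:
  template:
    spec:
      # The sleep below counts against this budget.
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: registry.example.com/api:v1.2.3
          lifecycle:
            preStop:
              exec:
                # Assumes sh and sleep exist in the image.
                command: ["sh", "-c", "sleep 5"]
```

The application still needs its own SIGTERM handling; the sleep only delays when the signal is delivered.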

Avoid kubectl delete --force --grace-period=0 for normal operations. It bypasses graceful termination and can leave application work in an unknown state.

Use PodDisruptionBudgets for replicated workloads that need to survive node drains, upgrades, and node consolidation. Size the PDB with the replica count in mind; a strict PDB on a low-replica workload can block useful maintenance.
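One way to keep a budget from blocking maintenance at low replica counts is to express it as maxUnavailable instead of minAvailable. This is a sketch with an assumed app: api label. For comparison, minAvailable: 2 on a two-replica workload allows no voluntary evictions at all, so a node drain would wait indefinitely.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  # Allows one voluntary eviction at a time regardless of replica count.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: api
```

With a single replica, this budget lets the only pod be evicted, so pair it with a minimum replica count that matches the availability target.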

Spread replicas across nodes or zones with topology spread constraints. For elastic workloads, prefer soft spreading with ScheduleAnyway unless strict placement is required. Hard anti-affinity and narrow node affinity can block scale-up during spikes if the cluster cannot satisfy the constraint quickly.

Use PriorityClasses sparingly. Priority helps critical workloads schedule and survive node pressure, but broad high-priority use can preempt lower-priority pods and make the cluster less predictable.
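A narrowly scoped PriorityClass might look like the following sketch. The name, value, and description are hypothetical; the point is that the class is opt-in and not the cluster default.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: serving-critical     # hypothetical name
# Higher values schedule first and can preempt lower-priority pods.
value: 100000
# Pods without a priorityClassName do not inherit this class.
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Customer-facing serving workloads only."
```

A workload opts in by setting priorityClassName: serving-critical in its pod spec.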

For Deployments, keep rolling updates compatible with availability targets. maxUnavailable: 0 and a small maxSurge are common for serving workloads, but they require enough spare capacity or node autoscaler headroom.

Use vertical scaling to right-size CPU and memory requests, reduce waste, and keep memory-bound pods stable. Kubernetes VPA and Kedify Vertical Scalers are useful when changing pod resources is a better answer than adding replicas.

Set lower and upper bounds for recommendations. Unbounded vertical growth can make pods impossible to schedule or can starve other workloads. For production, start with recommendations or conservative update policies, then allow automatic updates where disruption is acceptable.
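As a sketch of bounded, recommendation-only vertical scaling with the Kubernetes VPA, the manifest below assumes the VPA components are installed in the cluster. The target name and the bounds are illustrative.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    # "Off" only publishes recommendations; pods are not changed.
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["cpu", "memory"]
        # Floor and ceiling keep recommendations schedulable.
        minAllowed:
          cpu: 100m
          memory: 256Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi
```

Once the recommendations look stable, the update mode can be relaxed for workloads where pod disruption is acceptable.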

If vertical updates recreate pods, combine them with PodDisruptionBudgets and graceful shutdown. If your cluster supports in-place resource resize, still account for Kubernetes version support, resizePolicy, and the fact that some memory changes may require container restart.
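Where in-place resize is available, resizePolicy states per resource whether a change needs a restart. The fragment below is a sketch; whether the feature is enabled depends on the cluster's Kubernetes version and configuration, and the restart choices shown are assumptions about the application.

```yaml
# Fragment of a pod spec.
containers:
  - name: api
    image: registry.example.com/api:v1.2.3
    resizePolicy:
      # CPU changes are applied without restarting the container.
      - resourceName: cpu
        restartPolicy: NotRequired
      # Assumes the runtime sizes its heap at startup, so memory
      # changes only take effect after a restart.
      - resourceName: memory
        restartPolicy: RestartContainer
    resources:
      requests:
        cpu: 250m
        memory: 512Mi
      limits:
        memory: 1Gi
```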

Do not use vertical workload autoscaling for DaemonSet pods when node autoscaling depends on predictable DaemonSet resource requests. Kubernetes node autoscaling guidance calls this out because autoscalers must predict DaemonSet overhead on new nodes.

For workloads that need a small warm footprint but should not fully scale to zero, Kedify Pod Resource Profiles and Pod Resource Autoscaler can shrink resources during idle periods and expand them when demand returns.

This example shows the kind of baseline to aim for. Tune the values to the workload, not the other way around.

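```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the template labels.
  selector:
    matchLabels:
      app: api
  strategy:
    rollingUpdate:
      # Add one new pod before removing an old one; needs spare capacity.
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: api
    spec:
      terminationGracePeriodSeconds: 60
      topologySpreadConstraints:
        # Soft zone spreading: prefer balance, but never block scheduling.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: registry.example.com/api:v1.2.3
          imagePullPolicy: IfNotPresent
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              # Memory limit only; no CPU limit, so spare node CPU stays usable.
              memory: 1Gi
          startupProbe:
            # Allows up to 30 x 2s = 60s of startup before liveness checks begin.
            httpGet:
              path: /startup
              port: 8080
            failureThreshold: 30
            periodSeconds: 2
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 1
          livenessProbe:
            httpGet:
              path: /live
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  # With 3 replicas, this allows one voluntary eviction at a time.
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```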