Kubernetes & Autoscaling Best Practices
Autoscaling works best when pods are easy to schedule, quick to start, honest about readiness, and cheap to terminate. These recommendations apply to workloads scaled by Kedify, the Kubernetes Horizontal Pod Autoscaler, the Vertical Pod Autoscaler, or a node autoscaler.
Start With the Right Scaling Shape
Use horizontal scaling when work can be split across replicas: HTTP traffic, queue consumers, workers, and stateless APIs. Use vertical scaling when a replica needs more CPU or memory to do the same unit of work, or when the workload is not easy to parallelize.
Node autoscaling is the capacity layer underneath both. HPA or KEDA creates more pods; the node autoscaler adds nodes only when pending pods would become schedulable on a node it can provision. Pod resource requests, affinity, topology constraints, and storage requirements all affect that decision.
Avoid letting multiple autoscalers fight over the same signal. If HPA scales on CPU utilization, changing CPU requests with VPA or another vertical scaler changes the denominator used by HPA. For workloads that use both horizontal and vertical scaling, prefer external or business metrics for horizontal scaling, or keep vertical changes bounded and gradual.
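As a rough sketch of scaling horizontally on an external signal rather than CPU, the HPA below targets a per-replica queue backlog, so changes to CPU requests do not alter the horizontal signal. The metric name `queue_messages_ready`, the replica bounds, and the target value are assumptions for illustration, and the cluster is assumed to run an external metrics adapter that serves that metric.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api                    # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: queue_messages_ready   # hypothetical metric from an external metrics adapter
        target:
          type: AverageValue           # total metric value divided by current replica count
          averageValue: "30"           # aim for roughly 30 pending messages per replica
```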
Keep Pods Lean
Lean pods scale faster and place more reliably.
- Keep container images small and pin versions. Avoid latest; use immutable tags or digests. See Kubernetes image guidance.
- Remove nonessential sidecars from autoscaled workloads. Every sidecar adds image pulls, resource requests, probes, shutdown handling, and scheduling constraints.
- Keep init containers short and deterministic. Pre-bake dependencies into the image instead of downloading large files during startup.
- Avoid running migrations, large cache builds, or remote dependency checks in the scale-up path.
- Avoid persistent volume attachment for fast-scale replicas unless the workload really needs it.
- Keep startup work independent of total cluster size and total data size. If warm-up is required, expose readiness only after the pod can safely serve traffic or consume work.
Set Resource Requests Deliberately
Kubernetes uses resource requests for scheduling. Node autoscalers also reason from requests, not from actual runtime usage. HPA CPU utilization also requires CPU requests to be set.
Set CPU and memory requests for every container, including sidecars. Requests that are too low cause overpacking, throttling, eviction pressure, and misleading autoscaling signals. Requests that are too high make pods harder to schedule, increase cost, and can block node consolidation.
Use QoS classes intentionally. For latency-sensitive pods, requests should cover normal sustained usage with headroom. For memory, set realistic limits because memory cannot be throttled like CPU; exceeding the limit can terminate the container. For CPU, avoid arbitrary low limits that throttle healthy traffic while the node still has spare CPU.
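The fragment below is one possible shape for these recommendations: every container, including the sidecar, declares requests; memory has a limit; CPU has none. The `log-shipper` sidecar and all values are assumptions for illustration, not sizing advice.

```yaml
# Fragment of a pod template (spec.template.spec); values are illustrative.
containers:
  - name: api
    image: registry.example.com/api:v1.2.3
    resources:
      requests:
        cpu: 250m        # used by the scheduler and as the HPA utilization denominator
        memory: 512Mi
      limits:
        memory: 1Gi      # exceeding this can terminate the container
        # no CPU limit: the container may use spare node CPU
  - name: log-shipper    # hypothetical sidecar
    image: registry.example.com/log-shipper:v0.4.1   # hypothetical image
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
      limits:
        memory: 128Mi
```

Because requests and limits differ, this pod lands in the Burstable QoS class. Setting CPU and memory requests equal to limits in every container would make it Guaranteed instead.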
Use Demand Metrics, Not Symptoms
CPU and memory are useful, but they are often late signals. For fast horizontal scaling, prefer metrics that represent incoming work:
- HTTP request concurrency, in-flight requests, or request rate
- Queue depth, lag, or oldest-message age
- Work item backlog or pending jobs
- Domain-specific capacity signals, such as model inference queue length
Keep targets below saturation. Scaling at 90-100% utilization leaves little room for metric lag, pod startup time, and downstream retries. Also set a realistic maximum replica count so autoscaling does not overload databases, brokers, or APIs.
For KEDA external metric failures, configure ScaledObject fallback so workloads keep a safe baseline when the primary metric source is unavailable.
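A minimal sketch of a ScaledObject with fallback, assuming a Prometheus trigger. The server address, query, threshold, and replica counts are placeholders chosen for illustration.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api
spec:
  scaleTargetRef:
    name: api                  # Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 20
  fallback:
    failureThreshold: 3        # consecutive failed metric reads before fallback applies
    replicas: 4                # baseline replica count while the metric source is unavailable
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
        query: sum(rate(http_requests_total{app="api"}[2m]))   # assumed query
        threshold: "100"       # target requests per second per replica
```

Pick a fallback replica count that can carry typical load, since the autoscaler cannot see demand while the metric source is down.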
Tune Horizontal Scaling for Stability
Scale up faster than you scale down. Fast scale-up protects latency and backlog. Slower scale-down reduces pod churn, avoids interrupting in-flight work, and gives the node autoscaler better consolidation signals.
Use HPA behavior to set stabilization windows and scale policies. For KEDA scale-to-zero behavior, tune pollingInterval, cooldownPeriod, activation thresholds, minReplicaCount, and maxReplicaCount for the workload’s cold-start tolerance.
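The settings named above might be combined as follows for a queue consumer that tolerates cold starts. This is a sketch under assumptions: the Kafka address, topic, consumer group, and every numeric value are illustrative and should be tuned per workload.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer        # illustrative name
spec:
  scaleTargetRef:
    name: orders-consumer
  pollingInterval: 15          # seconds between metric checks
  cooldownPeriod: 300          # seconds of inactivity before scaling to zero
  minReplicaCount: 0
  maxReplicaCount: 30
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0     # react to spikes immediately
          policies:
            - type: Percent
              value: 100                    # at most double the replicas every 30 seconds
              periodSeconds: 30
        scaleDown:
          stabilizationWindowSeconds: 300   # wait for a sustained drop before removing pods
          policies:
            - type: Pods
              value: 2                      # remove at most 2 pods per minute
              periodSeconds: 60
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.svc:9092   # assumed address
        consumerGroup: orders
        topic: orders
        lagThreshold: "50"             # target lag per replica
        activationLagThreshold: "5"    # lag must exceed this to scale up from zero
```

For a latency-critical service, the same shape applies with minReplicaCount set to 1 or higher so the workload never waits on a cold start.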
Keep latency-critical services above zero replicas. Scale to zero is best for workloads where queuing, a waiting page, or a cold-start delay is acceptable.
At large scale, avoid extremely short polling and cooldown settings across many workloads. They can increase metric API load, pod churn, and control-plane pressure. Kubernetes documents additional limits and considerations in large cluster guidance.
Make Startup and Readiness Honest
Use startup, readiness, and liveness probes for different jobs:
- startupProbe protects slow-starting applications from premature liveness failures.
- readinessProbe should be true only when the pod can accept traffic or consume work.
- livenessProbe should detect a process that needs a restart, not temporary overload.
Keep probe handlers cheap and local. A readiness probe that depends on slow external systems can amplify outages by removing healthy pods during dependency trouble. If the application needs a downstream dependency to serve correctly, make that dependency state explicit and test the behavior under partial failure.
Gracefully Shut Down Pods
Autoscaling, node drains, rollouts, and manual deletes all terminate pods. During pod termination, Kubernetes sends a stop signal to each container and waits for the pod’s terminationGracePeriodSeconds. If the process is still running when the grace period expires, Kubernetes sends SIGKILL.
Configure every workload so it can stop cleanly before that forced kill. Without graceful shutdown, scale-down events can interrupt in-flight HTTP requests, drop long-running jobs, leave queue messages half-processed, or close database connections without cleanup.
The usual pattern is:
- Handle SIGTERM in the application process.
- Stop accepting new requests or new queue messages immediately.
- Make readiness fail while shutdown is in progress.
- Let in-flight requests, jobs, and acknowledgements finish.
- Close clients and background workers, then exit before the grace period expires.
- Set terminationGracePeriodSeconds to the longest expected shutdown path plus a small buffer.
Use a preStop hook only for Kubernetes-specific coordination, such as calling a local drain endpoint or waiting briefly for endpoint updates to propagate. Keep it lightweight and idempotent: the hook runs inside the pod termination grace period and must complete before the stop signal is sent.
Example Configuration:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  # apps/v1 requires a selector that matches the pod template labels.
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      # Total time allowed for the preStop hook plus application shutdown.
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: registry.example.com/api:v1.2.3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 1
          lifecycle:
            # Runs before the stop signal is sent to the container.
            preStop:
              httpGet:
                path: /shutdown
                port: 8080
```

In this example, the application should make /ready fail after /shutdown is called or after it receives SIGTERM, stop taking new work, drain active work, and exit within 60 seconds. For queue consumers, stop polling, finish or safely requeue in-flight messages, then exit.
Avoid kubectl delete --force --grace-period=0 for normal operations. It bypasses graceful termination and can leave application work in an unknown state.
Protect Availability During Rescheduling
Use PodDisruptionBudgets for replicated workloads that need to survive node drains, upgrades, and node consolidation. Size the PDB with the replica count in mind; a strict PDB on a low-replica workload can block useful maintenance.
Spread replicas across nodes or zones with topology spread constraints. For elastic workloads, prefer soft spreading with ScheduleAnyway unless strict placement is required. Hard anti-affinity and narrow node affinity can block scale-up during spikes if the cluster cannot satisfy the constraint quickly.
Use PriorityClasses sparingly. Priority helps critical workloads schedule and survive node pressure, but broad high-priority use can preempt lower-priority pods and make the cluster less predictable.
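A narrowly scoped PriorityClass could look like the sketch below. The name and value are assumptions; what matters is that globalDefault stays false, so only workloads that opt in receive the elevated priority.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: serving-critical                 # illustrative name
value: 100000                            # higher values schedule first and can preempt lower ones
globalDefault: false                     # pods must opt in explicitly
preemptionPolicy: PreemptLowerPriority   # the default; use Never to disable preemption
description: "Reserved for latency-critical serving workloads."
```

A workload opts in by setting priorityClassName: serving-critical in its pod spec.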
For Deployments, keep rolling updates compatible with availability targets. maxUnavailable: 0 and a small maxSurge are common for serving workloads, but they require enough spare capacity or node autoscaler headroom.
Use Vertical Scaling Deliberately
Use vertical scaling to right-size CPU and memory requests, reduce waste, and keep memory-bound pods stable. Kubernetes VPA and Kedify Vertical Scalers are useful when changing pod resources is a better answer than adding replicas.
Set lower and upper bounds for recommendations. Unbounded vertical growth can make pods impossible to schedule or can starve other workloads. For production, start with recommendations or conservative update policies, then allow automatic updates where disruption is acceptable.
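As one way to apply this with the Kubernetes VPA, the sketch below starts in recommendation-only mode and bounds what the recommender may suggest. It assumes the VPA custom resources are installed in the cluster; the bounds are illustrative.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"            # publish recommendations only; do not change running pods
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["cpu", "memory"]
        minAllowed:              # lower bound for recommendations
          cpu: 100m
          memory: 256Mi
        maxAllowed:              # upper bound keeps pods schedulable on existing node sizes
          cpu: "2"
          memory: 4Gi
```

Once the recommendations look stable, updateMode can be moved to a mode that applies changes for workloads where pod disruption is acceptable.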
If vertical updates recreate pods, combine them with PodDisruptionBudgets and graceful shutdown. If your cluster supports in-place resource resize, still account for Kubernetes version support, resizePolicy, and the fact that some memory changes may require container restart.
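Where in-place resize is available, a container can declare how each resource should be resized. The fragment below is a sketch with illustrative values; whether the feature is enabled depends on the Kubernetes version and cluster configuration.

```yaml
# Fragment of a pod template (spec.template.spec); values are illustrative.
containers:
  - name: api
    image: registry.example.com/api:v1.2.3
    resizePolicy:
      - resourceName: cpu
        restartPolicy: NotRequired        # apply CPU changes without restarting the container
      - resourceName: memory
        restartPolicy: RestartContainer   # restart the container when memory changes
    resources:
      requests:
        cpu: 250m
        memory: 512Mi
      limits:
        memory: 1Gi
```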
Do not use vertical workload autoscaling for DaemonSet pods when node autoscaling depends on predictable DaemonSet resource requests. Kubernetes node autoscaling guidance calls this out because autoscalers must predict DaemonSet overhead on new nodes.
For workloads that need a small warm footprint but should not fully scale to zero, Kedify Pod Resource Profiles and Pod Resource Autoscaler can shrink resources during idle periods and expand them when demand returns.
Reference Workload Settings
This example shows the kind of baseline to aim for. Tune the values to the workload, not the other way around.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels.
  selector:
    matchLabels:
      app: api
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: api
    spec:
      terminationGracePeriodSeconds: 60
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: registry.example.com/api:v1.2.3
          imagePullPolicy: IfNotPresent
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              memory: 1Gi
          startupProbe:
            httpGet:
              path: /startup
              port: 8080
            failureThreshold: 30
            periodSeconds: 2
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 1
          livenessProbe:
            httpGet:
              path: /live
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```