
Multi-Cluster Scaling

Kedify supports scaling workloads across a fleet of Kubernetes clusters. This is achieved through two custom resources that extend the standard KEDA resources with multi-cluster capabilities: DistributedScaledObject and DistributedScaledJob.

Prerequisite - Enabling Multi-Cluster controllers


Starting with Kedify Agent v0.5.0, controllers for DistributedScaledObject and DistributedScaledJob are disabled by default. To activate multi-cluster scaling, set DSO_ENABLED and DSJ_ENABLED to "true".

You can enable both on an existing deployment:

```shell
kubectl set env deploy -n keda kedify-agent DSO_ENABLED="true" DSJ_ENABLED="true"
```

Or update your Helm release:

```shell
helm repo add kedifykeda https://kedify.github.io/charts
helm repo update kedifykeda
helm upgrade -i -n keda kedify-agent kedifykeda/kedify-agent --version v0.5.0 --reuse-values \
  --set agent.features.distributedScaledObjectsEnabled=true \
  --set agent.features.distributedScaledJobsEnabled=true
```

There are two main types of clusters involved in multi-cluster scaling:

  1. KEDA Cluster: This cluster runs the Kedify stack and manages the scaling logic. It monitors the metrics and decides when to scale workloads up or down.
  2. Member Clusters: These clusters host the actual workloads that need to be scaled. They expose their kube-apiserver to the KEDA cluster for management.

The member clusters don’t need to run KEDA themselves, as scaling decisions for DistributedScaledObject or DistributedScaledJob are made by the KEDA cluster. This allows for a smaller footprint on member clusters and enables edge scenarios where resources are limited.

In order to connect a member cluster to the KEDA cluster, you need to make the kube-apiserver of the member cluster accessible from the KEDA cluster. This can be done using various methods such as VPN, VPC peering, or exposing the API server via a load balancer with proper security measures.

With the connectivity established, you can use Kedify’s kubectl plugin to register member clusters to the KEDA cluster:

```shell
kubectl kedify mc setup-member <name> --keda-kubeconfig <path> --member-kubeconfig <path>
```

This command uses the provided kubeconfig files to set up the access and permissions the KEDA cluster needs to manage the member cluster. The member-kubeconfig must have sufficient permissions to create RBAC resources, a ServiceAccount, and the keda namespace in the member cluster; these resources are created with the minimal privileges Kedify multi-cluster needs to operate. The keda-kubeconfig must have permission to patch the Secret named kedify-agent-multicluster-kubeconfigs in the keda namespace of the KEDA cluster. To connect multiple member clusters, repeat the command with a different name and kubeconfig files for each one.

If the KEDA cluster should connect to the member cluster at a different address than the one in the member-kubeconfig, pass the --member-api-url <url> flag to override the API server URL.

You can also list and remove registered member clusters using the following commands:

```shell
kubectl kedify mc list-members
kubectl kedify mc delete-member <name>
```

A ScaledObject is a KEDA resource that defines how to scale a specific workload based on certain metrics. The DistributedScaledObject extends this concept to support scaling across multiple clusters. It includes all the fields of a standard ScaledObject, along with additional fields to specify the member clusters and their configurations.

Distributed ScaledObject Architecture

```yaml
apiVersion: keda.kedify.io/v1alpha1
kind: DistributedScaledObject
metadata:
  name: nginx
spec:
  memberClusters: # optional list of member clusters to use; if omitted, all registered member clusters will be used
    - name: member-cluster-1
      weight: 4 # weight determines the proportion of replicas to be allocated to this cluster
    - name: member-cluster-2
      weight: 6
  rebalancingPolicy: # optional parameters for rebalancing replicas across member clusters in case of outage or issues
    gracePeriod: 1m # when a member cluster becomes unreachable, wait for this duration before rebalancing replicas to other clusters
  scaledObjectSpec: # standard ScaledObject spec
    scaleTargetRef:
      kind: Deployment
      name: nginx
    minReplicaCount: 1
    maxReplicaCount: 10
    triggers:
      - type: kubernetes-resource
        metadata:
          resourceKind: ConfigMap
          resourceName: mock-metric
          key: metric-value
          targetValue: "5"
```

In this example, the DistributedScaledObject named nginx is configured to scale a Deployment named nginx across two member clusters. The memberClusters field whitelists the member clusters to be used along with their respective weights, which determine how many replicas should be allocated to each cluster. This section is optional; if omitted, all registered member clusters will be used with equal weights.
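To make the weight arithmetic concrete, here is a small illustrative sketch (not Kedify's actual implementation) of how a total replica count can be split across clusters in proportion to their weights, using the largest-remainder method so the per-cluster counts always sum to the total:

```python
# Illustrative only: one way weight-proportional replica allocation could work.
def allocate(total, weights):
    """Split `total` replicas across clusters proportionally to `weights`
    (a {name: weight} dict), using the largest-remainder method so the
    per-cluster counts always sum to `total`."""
    wsum = sum(weights.values())
    exact = {name: total * w / wsum for name, w in weights.items()}
    base = {name: int(v) for name, v in exact.items()}
    leftover = total - sum(base.values())
    # Hand out remaining replicas to clusters with the largest fractional part.
    for name in sorted(exact, key=lambda n: exact[n] - base[n], reverse=True):
        if leftover == 0:
            break
        base[name] += 1
        leftover -= 1
    return base

# With weights 4:6, a total of 5 replicas splits 2:3.
print(allocate(5, {"member-cluster-1": 4, "member-cluster-2": 6}))
```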

Workloads of type Deployment are expected to exist in the relevant member clusters, in the same namespace as the DistributedScaledObject.

The rebalancingPolicy field allows you to specify how to handle situations where a member cluster becomes unreachable. In this case, after the specified gracePeriod, the replicas that were allocated to the unreachable cluster will be redistributed among the remaining healthy clusters. Once the unreachable cluster becomes healthy again, the replicas will be rebalanced back according to the defined weights.

Status of the DistributedScaledObject provides insights into the scaling state across member clusters:

```yaml
status:
  memberClusterStatuses:
    member-cluster-1:
      currentReplicas: 2
      description: Cluster is healthy
      desiredReplicas: 2
      id: /etc/mc/kubeconfigs/member-cluster-1.kubeconfig+kedify-agent@member-cluster-1
      lastStatusChangeTime: "2025-11-05T16:46:39Z"
      state: Ready
    member-cluster-2:
      currentReplicas: 3
      description: Cluster is healthy
      desiredReplicas: 3
      id: /etc/mc/kubeconfigs/member-cluster-2.kubeconfig+kedify-agent@member-cluster-2
      lastStatusChangeTime: "2025-11-05T15:45:44Z"
      state: Ready
  membersHealthyCount: 2
  membersTotalCount: 2
  selector: kedify-agent-distributedscaledobject=nginx
  totalCurrentReplicas: 5
```

Similar to DistributedScaledObject, the DistributedScaledJob extends KEDA’s ScaledJob concept to support job-based workloads across multiple clusters. It includes all the fields of a standard ScaledJob, along with additional fields to specify the member clusters and their configurations.

Distributed ScaledJob Architecture

The key difference is the workload type: instead of scaling a Deployment, a DistributedScaledJob creates and manages Jobs based on metrics.

Before using DistributedScaledJobs, make sure the raw metrics endpoint in KEDA is enabled by setting the environment variable RAW_METRICS_GRPC_PROTOCOL to enabled. In the values.yaml file:

```yaml
keda:
  env:
    - name: RAW_METRICS_GRPC_PROTOCOL
      value: enabled
```

From the command line, add this argument to your Helm installation command:

```shell
helm install ... \
  --set-json 'keda.env=[{"name":"RAW_METRICS_GRPC_PROTOCOL","value":"enabled"}]'
```

The following example of a DistributedScaledJob splits the execution of job processing from the RabbitMQ task queue between two member clusters in a 2:3 ratio.

```yaml
apiVersion: keda.kedify.io/v1alpha1
kind: DistributedScaledJob
metadata:
  name: processor-job
spec:
  memberClusters: # optional list of member clusters to use; if omitted, all registered member clusters will be used
    - name: member-cluster-1
      weight: 2 # weight determines the proportion of jobs to be allocated to this cluster
    - name: member-cluster-2
      weight: 3
  clusterScheduling:
    strategy: weightedRoundRobin
    failoverPolicy:
      gracePeriod: 1m # wait before re-creating jobs that do not progress from Pending to Running
      hardTaintDuration: 5m # taint a failing cluster for this long before scheduling jobs on it again
      softTaintDuration: 3m # if another failure happens in this window, apply a hard taint
  scaledJobSpec: # standard ScaledJob spec
    failedJobsHistoryLimit: 2 # keep up to 2 failed jobs, delete all older
    successfulJobsHistoryLimit: 2 # keep up to 2 jobs that completed successfully, delete all older
    jobTargetRef:
      template:
        spec:
          containers:
            - name: processor
              image: myapp:latest
              command: ["process"]
          restartPolicy: Never
    pollingInterval: 30
    maxReplicaCount: 20
    scalingStrategy:
      strategy: pendingAware # pending/stuck jobs can be re-created on another cluster
    triggers:
      - type: rabbitmq
        name: rabbit
        metadata:
          queueName: tasks
          host: http://guest:password@localhost:15672/path/vhost
          value: "5"
```

In this example, the DistributedScaledJob named processor-job is configured to scale Jobs across two member clusters.

clusterScheduling.failoverPolicy controls how taints are applied to failing clusters. If omitted, defaults are applied (gracePeriod: 1m, hardTaintDuration: 5m, softTaintDuration: 3m). See FailoverPolicy for details.

The jobTargetRef field contains the standard Kubernetes Job template specification. Jobs are created in the member clusters based on the scaling metrics and cluster weights.

Soft and hard taints for failing clusters: when a call from the KEDA cluster to a member cluster fails, that cluster is not removed from scheduling immediately. Instead, it is soft-tainted first; only after a second failure is the soft taint escalated to a hard taint. This prevents unnecessary scheduling pauses caused by temporary network glitches.

weightedRoundRobin is the default strategy. If spec.clusterScheduling.strategy is omitted, DistributedScaledJob uses weightedRoundRobin. In this mode, jobs are distributed by memberClusters[].weight.

  • hard-tainted clusters are excluded from scheduling for all workloads
  • soft-tainted clusters remain eligible
  • memberClusters[].scheduling.priority is ignored in this mode
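As an illustration of this mode, the following hypothetical sketch (not Kedify's actual scheduler) distributes jobs by weight while excluding hard-tainted clusters:

```python
# Hypothetical sketch of weighted round-robin placement with hard-taint
# exclusion; not Kedify's actual scheduler code.
import itertools

def schedule(jobs, clusters, hard_tainted=()):
    """clusters: list of (name, weight) tuples. Returns the cluster chosen
    for each of `jobs` new Jobs; clusters receive jobs in proportion to
    their weight, and hard-tainted clusters are excluded entirely."""
    eligible = [(name, weight) for name, weight in clusters
                if name not in hard_tainted]
    # Expand each cluster `weight` times and cycle through the expansion.
    ring = itertools.cycle([name for name, weight in eligible
                            for _ in range(weight)])
    return [next(ring) for _ in range(jobs)]

# Weights 2:3 => jobs land on the two clusters in a 2:3 ratio.
print(schedule(5, [("member-cluster-1", 2), ("member-cluster-2", 3)]))
```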

priorityFailover provides primary/failover behavior. The scheduler prefers the highest memberClusters[].scheduling.priority cluster, and only falls back when that cluster is excluded (for example tainted or overloaded).

Short example focused on failover configuration: the scheduler prefers member cluster member-primary for new jobs until a job fails to progress from Pending to Running for longer than spec.clusterScheduling.failoverPolicy.gracePeriod. Once the grace period is exceeded, the cluster is tainted for this affinity tuple (class: team-a, size: 4): jobs with size 4 or higher are excluded from the tainted cluster, while smaller jobs can still be scheduled there.

```yaml
apiVersion: keda.kedify.io/v1alpha1
kind: DistributedScaledJob
metadata:
  name: processor-job-failover
spec:
  clusterScheduling:
    strategy: priorityFailover
    workloadAffinity:
      class: team-a
      size: 4
    failoverPolicy:
      gracePeriod: 1m
      hardTaintDuration: 5m
      softTaintDuration: 3m
  memberClusters:
    - name: member-primary
      scheduling:
        priority: 100
    - name: member-failover
      scheduling:
        priority: 0
  scaledJobSpec:
    scalingStrategy:
      strategy: pendingAware
```

Important differences vs weightedRoundRobin:

  • clusterScheduling.workloadAffinity is required in priorityFailover
  • memberClusters[].scheduling.priority is required for each member cluster in priorityFailover
  • failover taints are evaluated by (order, size, priority):
    • a taint recorded at size N excludes workloads with size >= N
    • workloads with smaller size can still use the same cluster
  • create Job API failures are treated as transient and do not taint the cluster in priorityFailover
  • DistributedScaledJob priorityFailover supports scaledJobSpec.scalingStrategy.strategy: pendingAware (if the strategy is omitted, it defaults to pendingAware)
  • podOverrides.affinity only configures pod placement and does not drive cluster failover selection
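The size-based taint rule can be illustrated with a small sketch (a hypothetical helper, not Kedify code): a taint recorded at size N excludes workloads of the same class with size >= N, while smaller workloads remain schedulable on that cluster:

```python
# Illustrative check of the (class, size) taint rule described above;
# not the actual controller logic.
def excluded(workload_class, workload_size, taints):
    """taints: {cluster: {(class, size), ...}} of recorded failover taints.
    Returns the set of clusters this workload must not be scheduled on."""
    return {
        cluster
        for cluster, entries in taints.items()
        for (cls, size) in entries
        if cls == workload_class and workload_size >= size
    }

taints = {"member-primary": {("team-a", 4)}}
print(excluded("team-a", 4, taints))  # member-primary is excluded
print(excluded("team-a", 2, taints))  # smaller jobs can still run there
```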

spec.clusterScheduling.failoverPolicy controls failover timing, taint duration and duplicate-job handling.

```yaml
spec:
  clusterScheduling:
    failoverPolicy:
      gracePeriod: 1m
      hardTaintDuration: 5m
      softTaintDuration: 3m
      duplicationPolicy: keepAll
```

Fields:

  • gracePeriod (default: 1m): how long a pending job can stay pending before it is treated as stuck.
  • hardTaintDuration (default: 5m): how long a cluster remains hard-tainted after failure.
  • softTaintDuration (default: 3m): soft-taint window for escalation.
    • If another failure happens within this window, taint escalates to hard taint.
    • Set to 0s for immediate hard taint on first failure.
  • duplicationPolicy (default: keepAll): how to resolve duplicate source/failover jobs in pending-aware failover.
    • keepAll
      • Behavior: keep source and failover jobs running when both exist.
      • Provisioning tendency: can temporarily overprovision during failover/recovery windows (availability-first).
      • Underprovisioning risk: lowest among the policies.
    • preferFailover
      • Behavior: prefer the failover job; the source is deleted once a healthy failover replacement is confirmed.
      • Provisioning tendency: short-lived overprovisioning can happen while waiting for replacement health confirmation.
      • Underprovisioning risk: low to medium (mainly during failover transitions if the replacement cannot be confirmed quickly).
    • preferSource
      • Behavior: prefer the original source job; the replacement is deleted when the source should resume ownership.
      • Provisioning tendency: generally avoids prolonged overprovisioning.
      • Underprovisioning risk: medium to high during unstable source-cluster recovery (the source can be preferred before it is fully stable).
    • immediateSourceCleanup
      • Behavior: the source job is deleted immediately after failover scheduling.
      • Provisioning tendency: minimizes overprovisioning.
      • Underprovisioning risk: highest if the replacement is delayed or fails to become healthy.

Rule of thumb:

  • If you optimize for continuity/availability, use keepAll or preferFailover.
  • If you optimize for strict capacity/cost control, use immediateSourceCleanup or preferSource, accepting higher underprovisioning risk.

DistributedScaledJob supports per-cluster overrides of selected pod/container fields. Overrides are defined on each memberClusters[] entry and are applied only when creating new Jobs. Existing Jobs are not modified in place.

Supported overrides:

  • podOverrides.nodeSelector: merged into the pod nodeSelector map (override values replace existing keys)
  • podOverrides.tolerations: merged with deterministic deduplication by identity key (key, operator, effect, value); jobTargetRef tolerations are applied first and podOverrides.tolerations on top, so when identity keys match, the entry from podOverrides.tolerations overrides the one from jobTargetRef (last-writer-wins)
  • podOverrides.affinity: replaces base pod affinity
  • podOverrides.containerOverrides.<containerName>.image: replaces container image
  • podOverrides.containerOverrides.<containerName>.env: merged by env var name with override taking precedence
  • podOverrides.containerOverrides.<containerName>.resources:
    • requests and limits are merged by resource name
    • claims are replaced when explicitly provided (including empty list to clear existing claims)

If containerOverrides references a container name that does not exist in scaledJobSpec.jobTargetRef.template.spec.containers, the override is ignored and a warning event is emitted.
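The merge rules above can be sketched in a few lines (illustrative only; the controller's actual code may differ). Env vars are merged by name and tolerations are deduplicated by their identity key, with the override winning in both cases:

```python
# Sketch of the merge semantics described above; not the actual controller code.
def merge_env(base, override):
    """Merge env var lists by name; entries from `override` win."""
    merged = {e["name"]: e for e in base}
    merged.update({e["name"]: e for e in override})  # override wins
    return list(merged.values())

def merge_tolerations(base, override):
    """Deduplicate by identity key (key, operator, effect, value);
    base (jobTargetRef) first, override (podOverrides) last-writer-wins."""
    key = lambda t: (t.get("key"), t.get("operator"),
                     t.get("effect"), t.get("value"))
    merged = {key(t): t for t in base}
    merged.update({key(t): t for t in override})
    return list(merged.values())

base_env = [{"name": "LOG_LEVEL", "value": "info"},
            {"name": "MODE", "value": "batch"}]
print(merge_env(base_env, [{"name": "LOG_LEVEL", "value": "debug"}]))
```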

Example:

```yaml
apiVersion: keda.kedify.io/v1alpha1
kind: DistributedScaledJob
metadata:
  name: processor-job
spec:
  memberClusters:
    - name: member-cluster-1
      weight: 2
      podOverrides:
        nodeSelector:
          nodepool: gpu
        tolerations:
          - key: "gpu"
            operator: "Exists"
            effect: "NoSchedule"
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: region
                      operator: In
                      values: ["us-east-1"]
        containerOverrides:
          processor:
            image: my-registry.local/processor:v2
            env:
              - name: LOG_LEVEL
                value: debug
            resources:
              requests:
                cpu: 200m
                memory: 256Mi
              limits:
                cpu: "1"
                memory: 1Gi
    - name: member-cluster-2
      weight: 3
  clusterScheduling:
    strategy: weightedRoundRobin
    failoverPolicy:
      gracePeriod: 1m
      hardTaintDuration: 5m
      softTaintDuration: 3m
  scaledJobSpec:
    jobTargetRef:
      template:
        spec:
          containers:
            - name: processor
              image: myapp:latest
          restartPolicy: Never
    triggers:
      - type: rabbitmq
        name: rabbit
        metadata:
          queueName: tasks
          host: http://guest:password@localhost:15672/path/vhost
          value: "5"
```

Status of the DistributedScaledJob provides insights into the job state across member clusters:

```yaml
status:
  desiredJobs: 10
  runningJobs: 8
  pendingJobs: 2
  memberClusterStatuses:
    member-cluster-1:
      description: Cluster is healthy
      id: /etc/mc/kubeconfigs/member-cluster-1.kubeconfig+kedify-agent@member-cluster-1
      runningJobs: 3
      pendingJobs: 1
      stuckJobs: 0
      softTainted: false
      lastStatusChangeTime: "2025-11-05T16:46:39Z"
      state: Ready
      excluded: false
    member-cluster-2:
      description: Cluster is healthy
      id: /etc/mc/kubeconfigs/member-cluster-2.kubeconfig+kedify-agent@member-cluster-2
      runningJobs: 5
      pendingJobs: 1
      stuckJobs: 0
      softTainted: false
      lastStatusChangeTime: "2025-11-05T15:45:44Z"
      state: Ready
      excluded: false
  selector: kedify-agent-distributedscaledjob=processor-job
```

For a walkthrough example on how to set up and use multi-cluster scaling with Kedify, refer to the examples repository.

Scaling strategies are used to compute how many new Jobs to create across clusters. Choose one of: basic, pendingAware, custom, accurate, eager.

Inputs:

  • desiredJobsCount: target number derived from metrics and DSJ min/max bounds
  • runningJobsCount: number of non-terminal Jobs currently present (includes “pending”)
  • pendingJobsCount: subset of running Jobs considered “pending” (not yet progressed)
  • maxReplicaCount: DSJ upper bound on total concurrent non-terminal Jobs

basic

Scale to the gap between desired and running.

  • Formula: scaleTo = desired - running
  • Behavior: simple “catch up” to desired.

pendingAware

Immediately re-create pending Jobs on other clusters while honoring capacity.

  • Idea: replace stuck Jobs.
  • Formula:
    • needed = max(0, desired - running + pending)
    • capacity = max(0, maxReplica - running)
    • scaleTo = min(needed, capacity)
  • Use when pending/stuck Jobs should be replaced/failed over elsewhere quickly.

custom

User-defined scaling using a percentage of running Jobs and an optional queue deduction.

  • Inputs: runningJobPercentage (float), queueLengthDeduction (int)
  • Formula: scaleTo = min(desired - deduction - running * percentage, maxReplica)
  • Notes:
    • If runningJobPercentage fails to parse, the strategy falls back to basic.

accurate

Balance towards desired while staying within capacity; subtract pending from desired unless over max.

  • Formula:
    • If desired + running > maxReplica: scaleTo = maxReplica - running
    • Else: scaleTo = desired - pending
  • Use when pending work should defer new creations and capacity must be respected.

eager

Fill available capacity (excluding pending) up to desired.

  • Formula: scaleTo = min(maxReplica - running - pending, desired)
  • Use when it’s safe to aggressively utilize capacity.

Choosing a strategy:

  • pendingAware (default): prioritize re-creating stuck Jobs elsewhere.
  • basic: simplest gap-based scaling.
  • accurate: conservative, subtracts pending.
  • eager: aggressive, fills capacity quickly.
  • custom: tailor behavior with percentage and deductions.
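The formulas above can be written out as plain functions for side-by-side comparison (illustrative only; the controller may differ in edge-case handling). All take the inputs listed earlier: desired, running, pending, and maxReplica:

```python
# The strategy formulas from the text, as plain functions; illustrative only.
def basic(desired, running, pending, max_replica):
    # Simple catch-up to desired; pending and capacity are ignored.
    return desired - running

def pending_aware(desired, running, pending, max_replica):
    needed = max(0, desired - running + pending)
    capacity = max(0, max_replica - running)
    return min(needed, capacity)

def custom(desired, running, pending, max_replica,
           running_job_percentage=1.0, queue_length_deduction=0):
    return min(desired - queue_length_deduction
               - int(running * running_job_percentage), max_replica)

def accurate(desired, running, pending, max_replica):
    if desired + running > max_replica:
        return max_replica - running
    return desired - pending

def eager(desired, running, pending, max_replica):
    return min(max_replica - running - pending, desired)

# desired=10, running=8 (of which pending=2), maxReplicaCount=20
args = (10, 8, 2, 20)
for fn in (basic, pending_aware, accurate, eager):
    print(fn.__name__, fn(*args))
```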

Both DistributedScaledObject and DistributedScaledJob support pausing via annotation:

```yaml
metadata:
  annotations:
    autoscaling.keda.sh/paused: "true"
```

When this annotation is set:

  • DistributedScaledObject reconciliation skips scaling actions across member clusters
  • DistributedScaledJob reconciliation skips creating new Jobs across member clusters
  • status condition Paused=True is set on the distributed resource

DistributedScaledObject also supports autoscaling.keda.sh/paused-replicas:

```yaml
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "3"
```

Behavior for DistributedScaledObject:

  • autoscaling.keda.sh/paused-replicas pauses scaling and sets a fixed total target replicas value
  • if both annotations are set, autoscaling.keda.sh/paused-replicas takes precedence
  • value must be a non-negative integer

Pause existing resources:

```shell
kubectl annotate distributedscaledobject <name> autoscaling.keda.sh/paused="true"
kubectl annotate distributedscaledjob <name> autoscaling.keda.sh/paused="true"
kubectl annotate distributedscaledobject <name> autoscaling.keda.sh/paused-replicas="3"
```

To resume scaling, remove the annotation:

```shell
kubectl annotate distributedscaledobject <name> autoscaling.keda.sh/paused-
kubectl annotate distributedscaledjob <name> autoscaling.keda.sh/paused-
kubectl annotate distributedscaledobject <name> autoscaling.keda.sh/paused-replicas-
```