
Multi-Cluster Scaling

Kedify supports scaling workloads across a fleet of Kubernetes clusters. This is achieved through two custom resources that extend the standard KEDA resources with multi-cluster capabilities: DistributedScaledObject and DistributedScaledJob.

Prerequisite - Enabling Multi-Cluster controllers


Starting with Kedify Agent v0.5.0, controllers for DistributedScaledObject and DistributedScaledJob are disabled by default. To activate multi-cluster scaling, set DSO_ENABLED and DSJ_ENABLED to "true".

You can enable both on an existing deployment:

```shell
kubectl set env deploy -n keda kedify-agent DSO_ENABLED="true" DSJ_ENABLED="true"
```

Or update your Helm release:

```shell
helm repo add kedifykeda https://kedify.github.io/charts
helm repo update kedifykeda
helm upgrade -i -n keda kedify-agent kedifykeda/kedify-agent --version v0.5.0 --reuse-values \
  --set agent.features.distributedScaledObjectsEnabled=true \
  --set agent.features.distributedScaledJobsEnabled=true
```

There are two main types of clusters involved in multi-cluster scaling:

  1. KEDA Cluster: This cluster runs the Kedify stack and manages the scaling logic. It monitors the metrics and decides when to scale workloads up or down.
  2. Member Clusters: These clusters host the actual workloads that need to be scaled. They expose their kube-apiserver to the KEDA cluster for management.

The member clusters don’t need to run KEDA themselves, as scaling decisions for DistributedScaledObject or DistributedScaledJob are made by the KEDA cluster. This allows for a smaller footprint on member clusters and enables edge scenarios where resources are limited.

In order to connect a member cluster to the KEDA cluster, you need to make the kube-apiserver of the member cluster accessible from the KEDA cluster. This can be done using various methods such as VPN, VPC peering, or exposing the API server via a load balancer with proper security measures.

With the connectivity established, you can use Kedify’s kubectl plugin to register member clusters to the KEDA cluster:

```shell
kubectl kedify mc setup-member <name> --keda-kubeconfig <path> --member-kubeconfig <path>
```

This command uses the provided kubeconfig files to set up the access and permissions the KEDA cluster needs to manage the member cluster. The member-kubeconfig must have sufficient permissions to create RBAC resources, a ServiceAccount, and the keda namespace in the member cluster; these resources are created with the minimal privileges Kedify multi-cluster needs to operate. The keda-kubeconfig must have permission to patch the Secret named kedify-agent-multicluster-kubeconfigs in the keda namespace of the KEDA cluster. To connect multiple member clusters, repeat the command with a different name and kubeconfig files for each one.

If the KEDA cluster should connect to the member cluster at a different address than the one in the member-kubeconfig, pass the --member-api-url <url> flag to override the API server URL.

You can also list and remove registered member clusters using the following commands:

```shell
kubectl kedify mc list-members
kubectl kedify mc delete-member <name>
```

A ScaledObject is a KEDA resource that defines how to scale a specific workload based on certain metrics. The DistributedScaledObject extends this concept to support scaling across multiple clusters. It includes all the fields of a standard ScaledObject, along with additional fields to specify the member clusters and their configurations.

Distributed ScaledObject Architecture

```yaml
apiVersion: keda.kedify.io/v1alpha1
kind: DistributedScaledObject
metadata:
  name: nginx
spec:
  memberClusters: # optional list of member clusters to use; if omitted, all registered member clusters will be used
    - name: member-cluster-1
      weight: 4 # weight determines the proportion of replicas to be allocated to this cluster
    - name: member-cluster-2
      weight: 6
  rebalancingPolicy: # optional parameters for rebalancing replicas across member clusters in case of outage or issues
    gracePeriod: 1m # when a member cluster becomes unreachable, wait for this duration before rebalancing replicas to other clusters
  scaledObjectSpec: # standard ScaledObject spec
    scaleTargetRef:
      kind: Deployment
      name: nginx
    minReplicaCount: 1
    maxReplicaCount: 10
    triggers:
      - type: kubernetes-resource
        metadata:
          resourceKind: ConfigMap
          resourceName: mock-metric
          key: metric-value
          targetValue: "5"
```

In this example, the DistributedScaledObject named nginx is configured to scale a Deployment named nginx across two member clusters. The memberClusters field whitelists the member clusters to be used along with their respective weights, which determine how many replicas should be allocated to each cluster. This section is optional; if omitted, all registered member clusters will be used with equal weights.
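To make the weight arithmetic concrete, here is a small illustrative sketch (not Kedify's actual implementation) of how a total replica count can be split across clusters in proportion to their weights, using the largest-remainder method so the per-cluster counts always sum to the total:

```python
# Illustrative only: one way weight-proportional replica allocation could work.
def allocate(total, weights):
    """Split `total` replicas across clusters proportionally to `weights`
    (a {name: weight} dict), using the largest-remainder method so the
    per-cluster counts always sum to `total`."""
    wsum = sum(weights.values())
    exact = {name: total * w / wsum for name, w in weights.items()}
    base = {name: int(v) for name, v in exact.items()}
    leftover = total - sum(base.values())
    # Hand out remaining replicas to clusters with the largest fractional part.
    for name in sorted(exact, key=lambda n: exact[n] - base[n], reverse=True):
        if leftover == 0:
            break
        base[name] += 1
        leftover -= 1
    return base

# With weights 4:6, a total of 5 replicas splits 2:3.
print(allocate(5, {"member-cluster-1": 4, "member-cluster-2": 6}))
```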

Workloads of type Deployment are expected to exist in the relevant member clusters, in the same namespace as the DistributedScaledObject.

The rebalancingPolicy field allows you to specify how to handle situations where a member cluster becomes unreachable. In this case, after the specified gracePeriod, the replicas that were allocated to the unreachable cluster will be redistributed among the remaining healthy clusters. Once the unreachable cluster becomes healthy again, the replicas will be rebalanced back according to the defined weights.

Status of the DistributedScaledObject provides insights into the scaling state across member clusters:

```yaml
status:
  memberClusterStatuses:
    member-cluster-1:
      currentReplicas: 2
      description: Cluster is healthy
      desiredReplicas: 2
      id: /etc/mc/kubeconfigs/member-cluster-1.kubeconfig+kedify-agent@member-cluster-1
      lastStatusChangeTime: "2025-11-05T16:46:39Z"
      state: Ready
    member-cluster-2:
      currentReplicas: 3
      description: Cluster is healthy
      desiredReplicas: 3
      id: /etc/mc/kubeconfigs/member-cluster-2.kubeconfig+kedify-agent@member-cluster-2
      lastStatusChangeTime: "2025-11-05T15:45:44Z"
      state: Ready
  membersHealthyCount: 2
  membersTotalCount: 2
  selector: kedify-agent-distributedscaledobject=nginx
  totalCurrentReplicas: 5
```

Similar to DistributedScaledObject, the DistributedScaledJob extends KEDA’s ScaledJob concept to support job-based workloads across multiple clusters. It includes all the fields of a standard ScaledJob, along with additional fields to specify the member clusters and their configurations.

Distributed ScaledJob Architecture

The key difference is the workload type: instead of scaling a Deployment, a DistributedScaledJob creates and manages Jobs based on metrics.

Before using DistributedScaledJobs, make sure the raw metrics endpoint in KEDA is enabled by setting the environment variable RAW_METRICS_GRPC_PROTOCOL to enabled. In the values.yaml file:

```yaml
keda:
  env:
    - name: RAW_METRICS_GRPC_PROTOCOL
      value: enabled
```

From the command line, add this argument to your Helm installation command:

```shell
helm install ... \
  --set-json 'keda.env=[{"name":"RAW_METRICS_GRPC_PROTOCOL","value":"enabled"}]'
```

The following example of a DistributedScaledJob splits the execution of job processing from the RabbitMQ task queue between two member clusters in a 2:3 ratio.

```yaml
apiVersion: keda.kedify.io/v1alpha1
kind: DistributedScaledJob
metadata:
  name: processor-job
spec:
  memberClusters: # optional list of member clusters to use; if omitted, all registered member clusters will be used
    - name: member-cluster-1
      weight: 2 # weight determines the proportion of jobs to be allocated to this cluster
    - name: member-cluster-2
      weight: 3
  clusterScheduling:
    strategy: weightedRoundRobin
    failoverPolicy:
      gracePeriod: 1m # wait before re-creating jobs that do not progress from Pending to Running
      hardTaintDuration: 5m # taint a failing cluster for this long before scheduling jobs on it again
      softTaintDuration: 3m # if another failure happens in this window, apply a hard taint
  scaledJobSpec: # standard ScaledJob spec
    failedJobsHistoryLimit: 2 # keep up to 2 failed jobs, delete all older
    successfulJobsHistoryLimit: 2 # keep up to 2 jobs that completed successfully, delete all older
    jobTargetRef:
      template:
        spec:
          containers:
            - name: processor
              image: myapp:latest
              command: ["process"]
          restartPolicy: Never
    pollingInterval: 30
    maxReplicaCount: 20
    scalingStrategy:
      strategy: pendingAware # pending/stuck jobs can be re-created on another cluster
    triggers:
      - type: rabbitmq
        name: rabbit
        metadata:
          queueName: tasks
          host: http://guest:password@localhost:15672/path/vhost
          value: "5"
```

In this example, the DistributedScaledJob named processor-job is configured to scale Jobs across two member clusters.

clusterScheduling.failoverPolicy controls how taints are applied to failing clusters. If omitted, defaults are applied (gracePeriod: 1m, hardTaintDuration: 5m, softTaintDuration: 3m). See FailoverPolicy for details.

The jobTargetRef field contains the standard Kubernetes Job template specification. Jobs are created in the member clusters based on the scaling metrics and cluster weights.

Soft and hard taints for failing clusters: when a call from the KEDA cluster to a member cluster fails, that cluster is not removed from scheduling immediately. Instead, it is soft-tainted first; only after a second failure is the soft taint escalated to a hard taint. This prevents unnecessary scheduling pauses caused by temporary network glitches.

weightedRoundRobin is the default strategy. If spec.clusterScheduling.strategy is omitted, DistributedScaledJob uses weightedRoundRobin. In this mode, jobs are distributed by memberClusters[].weight.

  • hard-tainted clusters are excluded from scheduling for all workloads
  • soft-tainted clusters remain eligible
  • memberClusters[].scheduling.priority is ignored in this mode
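As an illustration of this mode, the following hypothetical sketch (not Kedify's actual scheduler) distributes jobs by weight while excluding hard-tainted clusters:

```python
# Hypothetical sketch of weighted round-robin placement with hard-taint
# exclusion; not Kedify's actual scheduler code.
import itertools

def schedule(jobs, clusters, hard_tainted=()):
    """clusters: list of (name, weight) tuples. Returns the cluster chosen
    for each of `jobs` new Jobs; clusters receive jobs in proportion to
    their weight, and hard-tainted clusters are excluded entirely."""
    eligible = [(name, weight) for name, weight in clusters
                if name not in hard_tainted]
    # Expand each cluster `weight` times and cycle through the expansion.
    ring = itertools.cycle([name for name, weight in eligible
                            for _ in range(weight)])
    return [next(ring) for _ in range(jobs)]

# Weights 2:3 => jobs land on the two clusters in a 2:3 ratio.
print(schedule(5, [("member-cluster-1", 2), ("member-cluster-2", 3)]))
```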

priorityFailover provides primary/failover behavior. The scheduler prefers the highest memberClusters[].scheduling.priority cluster, and only falls back when that cluster is excluded (for example tainted or overloaded).

Short example focused on failover configuration: the scheduler prefers member cluster member-primary for new jobs until a job fails to progress from Pending to Running for longer than spec.clusterScheduling.failoverPolicy.gracePeriod. Once the grace period is exceeded, the cluster is tainted for this affinity tuple (class: team-a, size: 4): jobs with size 4 or higher are excluded from the tainted cluster, while smaller jobs can still be scheduled there.

```yaml
apiVersion: keda.kedify.io/v1alpha1
kind: DistributedScaledJob
metadata:
  name: processor-job-failover
spec:
  clusterScheduling:
    strategy: priorityFailover
    workloadAffinity:
      class: team-a
      size: 4
    failoverPolicy:
      gracePeriod: 1m
      hardTaintDuration: 5m
      softTaintDuration: 3m
  memberClusters:
    - name: member-primary
      scheduling:
        priority: 100
    - name: member-failover
      scheduling:
        priority: 0
  scaledJobSpec:
    scalingStrategy:
      strategy: pendingAware
```

Important differences vs weightedRoundRobin:

  • clusterScheduling.workloadAffinity is required in priorityFailover
  • memberClusters[].scheduling.priority is required for each member cluster in priorityFailover
  • failover taints are evaluated by (order, size, priority):
    • a taint recorded at size N excludes workloads with size >= N
    • workloads with smaller size can still use the same cluster
  • create Job API failures are treated as transient and do not taint the cluster in priorityFailover
  • DistributedScaledJob priorityFailover supports scaledJobSpec.scalingStrategy.strategy: pendingAware (if the strategy is omitted, it defaults to pendingAware)
  • podOverrides.affinity only configures pod placement and does not drive cluster failover selection
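The size-based taint rule can be illustrated with a small sketch (a hypothetical helper, not Kedify code): a taint recorded at size N excludes workloads of the same class with size >= N, while smaller workloads remain schedulable on that cluster:

```python
# Illustrative check of the (class, size) taint rule described above;
# not the actual controller logic.
def excluded(workload_class, workload_size, taints):
    """taints: {cluster: {(class, size), ...}} of recorded failover taints.
    Returns the set of clusters this workload must not be scheduled on."""
    return {
        cluster
        for cluster, entries in taints.items()
        for (cls, size) in entries
        if cls == workload_class and workload_size >= size
    }

taints = {"member-primary": {("team-a", 4)}}
print(excluded("team-a", 4, taints))  # member-primary is excluded
print(excluded("team-a", 2, taints))  # smaller jobs can still run there
```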

spec.clusterScheduling.failoverPolicy controls failover timing, taint duration and duplicate-job handling.

```yaml
spec:
  clusterScheduling:
    failoverPolicy:
      gracePeriod: 1m
      hardTaintDuration: 5m
      softTaintDuration: 3m
      duplicationPolicy: keepAll
```

Fields:

  • gracePeriod (default: 1m): how long a pending job can stay pending before it is treated as stuck.
  • hardTaintDuration (default: 5m): how long a cluster remains hard-tainted after failure.
  • softTaintDuration (default: 3m): soft-taint window for escalation.
    • If another failure happens within this window, taint escalates to hard taint.
    • Set to 0s for immediate hard taint on first failure.
  • duplicationPolicy (default: keepAll): how to resolve duplicate source/failover jobs in pending-aware failover.
    • keepAll
      • Behavior: keep source and failover jobs running when both exist.
      • Provisioning tendency: can temporarily overprovision during failover/recovery windows (availability-first).
      • Underprovisioning risk: lowest among the policies.
    • preferFailover
      • Behavior: prefer the failover job; the source is deleted once a healthy failover replacement is confirmed.
      • Provisioning tendency: short-lived overprovisioning can happen while waiting for replacement health confirmation.
      • Underprovisioning risk: low to medium (mainly during failover transitions if the replacement cannot be confirmed quickly).
    • preferSource
      • Behavior: prefer the original source job; the replacement is deleted when the source should resume ownership.
      • Provisioning tendency: generally avoids prolonged overprovisioning.
      • Underprovisioning risk: medium to high during unstable source-cluster recovery (the source can be preferred before it is fully stable).
    • immediateSourceCleanup
      • Behavior: the source job is deleted immediately after failover scheduling.
      • Provisioning tendency: minimizes overprovisioning.
      • Underprovisioning risk: highest if the replacement is delayed or fails to become healthy.

Rule of thumb:

  • If you optimize for continuity/availability, use keepAll or preferFailover.
  • If you optimize for strict capacity/cost control, use immediateSourceCleanup or preferSource, accepting higher underprovisioning risk.

DistributedScaledJob supports per-cluster overrides of selected pod/container fields. Overrides are defined on each memberClusters[] entry and are applied only when creating new Jobs. Existing Jobs are not modified in place.

Supported overrides:

  • podOverrides.nodeSelector: merged into the pod nodeSelector map (override values replace existing keys)
  • podOverrides.tolerations: merged with deterministic deduplication by identity key (key, operator, effect, value); jobTargetRef tolerations are applied first and podOverrides.tolerations on top, so when identity keys match, the entry from podOverrides.tolerations overrides the one from jobTargetRef (last-writer-wins)
  • podOverrides.affinity: replaces base pod affinity
  • podOverrides.containerOverrides.<containerName>.image: replaces container image
  • podOverrides.containerOverrides.<containerName>.env: merged by env var name with override taking precedence
  • podOverrides.containerOverrides.<containerName>.resources:
    • requests and limits are merged by resource name
    • claims are replaced when explicitly provided (including empty list to clear existing claims)

If containerOverrides references a container name that does not exist in scaledJobSpec.jobTargetRef.template.spec.containers, the override is ignored and a warning event is emitted.
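The merge rules above can be sketched in a few lines (illustrative only; the controller's actual code may differ). Env vars are merged by name and tolerations are deduplicated by their identity key, with the override winning in both cases:

```python
# Sketch of the merge semantics described above; not the actual controller code.
def merge_env(base, override):
    """Merge env var lists by name; entries from `override` win."""
    merged = {e["name"]: e for e in base}
    merged.update({e["name"]: e for e in override})  # override wins
    return list(merged.values())

def merge_tolerations(base, override):
    """Deduplicate by identity key (key, operator, effect, value);
    base (jobTargetRef) first, override (podOverrides) last-writer-wins."""
    key = lambda t: (t.get("key"), t.get("operator"),
                     t.get("effect"), t.get("value"))
    merged = {key(t): t for t in base}
    merged.update({key(t): t for t in override})
    return list(merged.values())

base_env = [{"name": "LOG_LEVEL", "value": "info"},
            {"name": "MODE", "value": "batch"}]
print(merge_env(base_env, [{"name": "LOG_LEVEL", "value": "debug"}]))
```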

Example:

```yaml
apiVersion: keda.kedify.io/v1alpha1
kind: DistributedScaledJob
metadata:
  name: processor-job
spec:
  memberClusters:
    - name: member-cluster-1
      weight: 2
      podOverrides:
        nodeSelector:
          nodepool: gpu
        tolerations:
          - key: "gpu"
            operator: "Exists"
            effect: "NoSchedule"
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: region
                      operator: In
                      values: ["us-east-1"]
        containerOverrides:
          processor:
            image: my-registry.local/processor:v2
            env:
              - name: LOG_LEVEL
                value: debug
            resources:
              requests:
                cpu: 200m
                memory: 256Mi
              limits:
                cpu: "1"
                memory: 1Gi
    - name: member-cluster-2
      weight: 3
  clusterScheduling:
    strategy: weightedRoundRobin
    failoverPolicy:
      gracePeriod: 1m
      hardTaintDuration: 5m
      softTaintDuration: 3m
  scaledJobSpec:
    jobTargetRef:
      template:
        spec:
          containers:
            - name: processor
              image: myapp:latest
          restartPolicy: Never
    triggers:
      - type: rabbitmq
        name: rabbit
        metadata:
          queueName: tasks
          host: http://guest:password@localhost:15672/path/vhost
          value: "5"
```

Status of the DistributedScaledJob provides insights into the job state across member clusters:

```yaml
status:
  desiredJobs: 10
  runningJobs: 8
  pendingJobs: 2
  memberClusterStatuses:
    member-cluster-1:
      description: Cluster is healthy
      id: /etc/mc/kubeconfigs/member-cluster-1.kubeconfig+kedify-agent@member-cluster-1
      runningJobs: 3
      pendingJobs: 1
      stuckJobs: 0
      softTainted: false
      lastStatusChangeTime: "2025-11-05T16:46:39Z"
      state: Ready
      excluded: false
    member-cluster-2:
      description: Cluster is healthy
      id: /etc/mc/kubeconfigs/member-cluster-2.kubeconfig+kedify-agent@member-cluster-2
      runningJobs: 5
      pendingJobs: 1
      stuckJobs: 0
      softTainted: false
      lastStatusChangeTime: "2025-11-05T15:45:44Z"
      state: Ready
      excluded: false
  selector: kedify-agent-distributedscaledjob=processor-job
```

For a walkthrough example on how to set up and use multi-cluster scaling with Kedify, refer to the examples repository.

Scaling strategies are used to compute how many new Jobs to create across clusters. Choose one of: basic, pendingAware, custom, accurate, eager.

Inputs:

  • desiredJobsCount: target number derived from metrics and DSJ min/max bounds
  • runningJobsCount: number of non-terminal Jobs currently present (includes “pending”)
  • pendingJobsCount: subset of running Jobs considered “pending” (not yet progressed)
  • maxReplicaCount: DSJ upper bound on total concurrent non-terminal Jobs

basic

Scale to the gap between desired and running.

  • Formula: scaleTo = desired - running
  • Behavior: simple “catch up” to desired.

pendingAware

Immediately re-create pending Jobs on other clusters while honoring capacity.

  • Idea: replace stuck Jobs.
  • Formula:
    • needed = max(0, desired - running + pending)
    • capacity = max(0, maxReplica - running)
    • scaleTo = min(needed, capacity)
  • Use when pending/stuck Jobs should be replaced/failed over elsewhere quickly.

custom

User-defined scaling using a percentage of running Jobs and an optional queue deduction.

  • Inputs: runningJobPercentage (float), queueLengthDeduction (int)
  • Formula: scaleTo = min(desired - deduction - running * percentage, maxReplica)
  • Notes:
    • If runningJobPercentage fails to parse, the strategy falls back to basic.

accurate

Balance towards desired while staying within capacity; subtract pending from desired unless over max.

  • Formula:
    • If desired + running > maxReplica: scaleTo = maxReplica - running
    • Else: scaleTo = desired - pending
  • Use when pending work should defer new creations and capacity must be respected.

eager

Fill available capacity (excluding pending) up to desired.

  • Formula: scaleTo = min(maxReplica - running - pending, desired)
  • Use when it’s safe to aggressively utilize capacity.

Choosing a strategy:

  • pendingAware (default): prioritize re-creating stuck Jobs elsewhere.
  • basic: simplest gap-based scaling.
  • accurate: conservative, subtracts pending.
  • eager: aggressive, fills capacity quickly.
  • custom: tailor behavior with percentage and deductions.
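The formulas above can be written out as plain functions for side-by-side comparison (illustrative only; the controller may differ in edge-case handling). All take the inputs listed earlier: desired, running, pending, and maxReplica:

```python
# The strategy formulas from the text, as plain functions; illustrative only.
def basic(desired, running, pending, max_replica):
    # Simple catch-up to desired; pending and capacity are ignored.
    return desired - running

def pending_aware(desired, running, pending, max_replica):
    needed = max(0, desired - running + pending)
    capacity = max(0, max_replica - running)
    return min(needed, capacity)

def custom(desired, running, pending, max_replica,
           running_job_percentage=1.0, queue_length_deduction=0):
    return min(desired - queue_length_deduction
               - int(running * running_job_percentage), max_replica)

def accurate(desired, running, pending, max_replica):
    if desired + running > max_replica:
        return max_replica - running
    return desired - pending

def eager(desired, running, pending, max_replica):
    return min(max_replica - running - pending, desired)

# desired=10, running=8 (of which pending=2), maxReplicaCount=20
args = (10, 8, 2, 20)
for fn in (basic, pending_aware, accurate, eager):
    print(fn.__name__, fn(*args))
```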

Both DistributedScaledObject and DistributedScaledJob support pausing via annotation:

```yaml
metadata:
  annotations:
    autoscaling.keda.sh/paused: "true"
```

When this annotation is set:

  • DistributedScaledObject reconciliation skips scaling actions across member clusters
  • DistributedScaledJob reconciliation skips creating new Jobs across member clusters
  • status condition Paused=True is set on the distributed resource

DistributedScaledObject also supports autoscaling.keda.sh/paused-replicas:

```yaml
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "3"
```

Behavior for DistributedScaledObject:

  • autoscaling.keda.sh/paused-replicas pauses scaling and sets a fixed total target replicas value
  • if both annotations are set, autoscaling.keda.sh/paused-replicas takes precedence
  • value must be a non-negative integer

Pause existing resources:

```shell
kubectl annotate distributedscaledobject <name> autoscaling.keda.sh/paused="true"
kubectl annotate distributedscaledjob <name> autoscaling.keda.sh/paused="true"
kubectl annotate distributedscaledobject <name> autoscaling.keda.sh/paused-replicas="3"
```

To resume scaling, remove the annotation:

```shell
kubectl annotate distributedscaledobject <name> autoscaling.keda.sh/paused-
kubectl annotate distributedscaledjob <name> autoscaling.keda.sh/paused-
kubectl annotate distributedscaledobject <name> autoscaling.keda.sh/paused-replicas-
```