Configure Envoy in the Kedify Proxy

At the core of the kedify-http scaler is kedify-proxy, a fleet of Envoy proxies. The fleet is configured through an xDS control plane, which is implemented as part of the http-add-on interceptor component.

Two parts of the Envoy configuration support overriding the defaults. Both are set as values in the kedify-agent Helm chart:

cluster - chart, envoy options
route - chart, envoy options

Retry Configuration on Error

By default, kedify-proxy does not retry failed requests; it returns the error code to the client. With the route configuration, you can enable automatic retries for specific error situations. For example, to retry on 5xx errors, set the following in your kedify-agent values:

agent:
  kedifyProxy:
    globalEnvoyConfigs:
      route:
        retry_policy:
          retry_on: 5xx # any internal or external 5xx error
          num_retries: 5 # retry up to 5 times
          retry_back_off:
            base_interval: 1s # base interval of the jittered exponential backoff, the first retry is delayed by a random time below 1 second
            max_interval: 10s # maximum interval between retries is 10 seconds with exponential backoff

This Envoy config snippet means kedify-proxy will retry requests that fail with 5xx errors up to 5 times. The delay before each retry is randomized (jittered) and grows exponentially from a 1-second base interval, capped at 10 seconds.

Slow Start Configuration

The kedify-proxy Envoy uses the ROUND_ROBIN load balancing strategy by default. This means that all endpoints in the cluster are treated equally, no matter how long they have been up. This can lead to issues if some endpoints are slow to start, because they may receive a high volume of traffic before they are ready. To mitigate this, you can enable slow start for the cluster by setting slow_start_config in the cluster section of your kedify-agent values:

agent:
  kedifyProxy:
    globalEnvoyConfigs:
      cluster:
        lb_policy: ROUND_ROBIN
        round_robin_lb_config:
          slow_start_config:
            slow_start_window: 60s   # slow start applies for 60 seconds, after that the endpoint gets its regular ROUND_ROBIN share
            min_weight_percent:
              value: 1.0             # the new endpoint's weight can drop to as little as 1% of its normal weight
            aggression:
              default_value: 1.0     # pace of traffic increase during the slow start window, lower number means slower in the beginning
              runtime_key: slow_start_aggression

This Envoy config snippet instructs kedify-proxy to use a slow start window of 60 seconds. During the window, a new endpoint starts with a reduced load balancing weight (as low as 1% of its normal weight), which then increases gradually. After the slow start window, the endpoint is treated equally with the other endpoints in the service and receives its fair share of the traffic.
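The ramp-up can be sketched numerically. The snippet below assumes the weight formula given in Envoy's slow start documentation: the endpoint weight is multiplied by max(min_weight_percent, time_factor ^ (1 / aggression)), where time_factor = max(elapsed_seconds, 1) / slow_start_window_seconds. The function name and parameter names are hypothetical and only mirror the config keys above.

```python
def slow_start_weight_factor(elapsed_s, window_s=60.0,
                             min_weight_percent=1.0, aggression=1.0):
    """Fraction of its normal weight that an endpoint gets during slow start."""
    if elapsed_s >= window_s:
        return 1.0  # slow start is over, the endpoint has full weight
    time_factor = max(elapsed_s, 1.0) / window_s
    return max(min_weight_percent / 100.0, time_factor ** (1.0 / aggression))


# Linear ramp with the example values (aggression 1.0, 60s window).
ramp = [round(slow_start_weight_factor(t), 3) for t in (0, 15, 30, 60)]
print(ramp)  # [0.017, 0.25, 0.5, 1.0]

# A lower aggression gives a slower start: half-way through the window
# the endpoint has only a quarter of its normal weight.
print(slow_start_weight_factor(30, aggression=0.5))  # 0.25
```

With a 60-second window, the time-based factor never falls below 1/60, so under this formula the 1% minimum weight only becomes the limiting value for longer windows or lower aggression values.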

Preconnecting

Envoy supports preconnecting to endpoints in the cluster, which can reduce request latency by anticipating a request and establishing a TCP connection before it’s needed.

agent:
  kedifyProxy:
    globalEnvoyConfigs:
      cluster:
        preconnect_policy:
          per_upstream_preconnect_ratio: 1.05  # preconnect 5 upstream connections for every 100 requests
          predictive_preconnect_ratio: 1.05    # preconnect 5 spare connections on the anticipated next endpoint for every 100 requests

Setting per_upstream_preconnect_ratio to 1.05 means that for every 100 requests, kedify-proxy preconnects 5 upstream connections in the cluster instead of waiting for a new request to arrive and only then establishing the connection. Setting predictive_preconnect_ratio to 1.05 means that for every 100 requests, kedify-proxy opens 5 connections ahead of time on the endpoints it predicts will be picked next, based on the Envoy cluster’s internal load balancing configuration.

Active Health Checks

Envoy supports active health checks for endpoints in the cluster. This can help ensure that only healthy endpoints receive traffic. Because this configuration is set globally for each cluster, we recommend only very rudimentary TCP health checks.

agent:
  kedifyProxy:
    globalEnvoyConfigs:
      cluster:
        common_lb_config:
          ignore_new_hosts_until_first_hc: true  # do not route traffic to new hosts until they pass their first health check
          healthy_panic_threshold:
            value: 20                            # percent (0-100); panic if more than 80% of the hosts are unhealthy
        health_checks:
          - timeout: 1s                          # timeout for the health check request
            interval: 2s                         # interval between health checks during endpoint's healthy state
            unhealthy_interval: 10s              # interval between health checks during endpoint's unhealthy state
            unhealthy_threshold: 3               # endpoint is considered unhealthy after 3 consecutive failed health checks
            healthy_threshold: 2                 # endpoint is considered healthy after 2 consecutive successful health checks
            tcp_health_check: {}                 # check if envoy can establish a TCP connection to the endpoint

This Envoy config snippet instructs kedify-proxy to perform active health checks on the endpoints in the cluster. Each health check times out after 1 second. Checks run every 2 seconds while the endpoint is healthy and every 10 seconds while it is unhealthy. The endpoint is considered unhealthy after 3 consecutive failed health checks, and healthy again after 2 consecutive successful health checks. The health check simply tries to establish a TCP connection to the endpoint and then closes it by sending a FIN packet.
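The threshold and interval logic described above can be sketched as a small state machine. This is a simplified illustration, not Envoy's implementation: the class and method names are hypothetical, and details such as jitter, timeouts, and Envoy's special handling of newly added hosts are left out.

```python
class HealthState:
    """Tracks one endpoint's health from consecutive check results."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2,
                 interval_s=2, unhealthy_interval_s=10):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.interval_s = interval_s
        self.unhealthy_interval_s = unhealthy_interval_s
        self.healthy = True
        self.streak = 0  # consecutive results that contradict the current state

    def observe(self, success):
        """Record one check result and return seconds until the next check."""
        if success == self.healthy:
            self.streak = 0  # result agrees with the current state
        else:
            self.streak += 1
            limit = self.unhealthy_threshold if self.healthy else self.healthy_threshold
            if self.streak >= limit:
                self.healthy = not self.healthy
                self.streak = 0
        return self.interval_s if self.healthy else self.unhealthy_interval_s


state = HealthState()
# Three failures in a row flip the endpoint to unhealthy (10s interval),
# two successes in a row bring it back (2s interval).
delays = [state.observe(ok) for ok in (False, False, False, True, True)]
print(delays)         # [2, 2, 10, 10, 2]
print(state.healthy)  # True
```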

The panic threshold is set to 20%, which means that if more than 80% of the endpoints in the cluster are unhealthy, kedify-proxy enters panic mode and sends traffic to all endpoints regardless of their health status. This prevents the situation where nearly all endpoints are unhealthy and the few remaining healthy ones receive all of the traffic, or no traffic is served at all. Change this parameter with care, because it directly relates to the ScaledObject scale-up behavior. If the threshold is too high, even regular rapid scale-up events can trigger panic mode.
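As a worked example of the threshold arithmetic, the sketch below checks whether a cluster would be in panic mode for a given number of healthy endpoints. It is a simplification that assumes a single priority level and a strict "below the threshold" comparison; the function name is hypothetical, and the threshold is expressed as a percentage between 0 and 100.

```python
def in_panic_mode(healthy, total, panic_threshold_percent=20.0):
    """True when the share of healthy endpoints is below the panic threshold."""
    if total == 0:
        return False  # nothing to balance across
    healthy_percent = 100.0 * healthy / total
    return healthy_percent < panic_threshold_percent


print(in_panic_mode(healthy=2, total=10))  # False: exactly 20% healthy
print(in_panic_mode(healthy=1, total=10))  # True: only 10% healthy
```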

Passive Health Checks

Envoy also supports passive health checks through outlier detection. These checks are performed on the actual traffic and can help detect unhealthy endpoints based on their response codes and latency.

agent:
  kedifyProxy:
    globalEnvoyConfigs:
      cluster:
        outlier_detection:
          interval: 5s                                # interval between ejection analysis sweeps
          max_ejection_percent: 50                    # maximum percentage of endpoints that can be ejected from the cluster
          split_external_local_origin_errors: true    # treat external and local origin errors separately
          consecutive_gateway_failure: 3              # number of consecutive gateway errors (502, 503, 504) that eject the endpoint
          enforcing_consecutive_gateway_failure: 100  # percent chance of enforcing the ejection, 100 means each time the threshold is reached

This Envoy config snippet instructs kedify-proxy to perform passive health checks on the endpoints in the cluster. An endpoint is ejected after 3 consecutive requests to it fail with a gateway error (502, 503, or 504), unless 50% or more of the endpoints are already ejected. The ejection analysis sweep runs every 5 seconds. With split_external_local_origin_errors set to true, externally originated errors (error responses returned by the upstream endpoint) and locally originated errors (failures that Envoy detects itself, such as connection timeouts or resets) are counted separately. This is useful if you want to treat error responses from the application differently from connection-level failures.
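The counting behind consecutive_gateway_failure and max_ejection_percent can be illustrated with a short sketch. It is a simplified model with hypothetical names: it ignores ejection duration, the enforcing percentage, and locally originated errors, and it assumes that any response other than 502, 503, or 504 resets the counter.

```python
GATEWAY_ERRORS = {502, 503, 504}


class OutlierTracker:
    """Ejects a host after N consecutive gateway errors, within an ejection cap."""

    def __init__(self, hosts, consecutive_gateway_failure=3, max_ejection_percent=50):
        self.failures = {host: 0 for host in hosts}
        self.ejected = set()
        self.threshold = consecutive_gateway_failure
        self.max_ejection_percent = max_ejection_percent

    def record(self, host, status):
        """Record one response; return True if this response ejects the host."""
        if status not in GATEWAY_ERRORS:
            self.failures[host] = 0  # streak broken
            return False
        self.failures[host] += 1
        if self.failures[host] < self.threshold:
            return False
        self.failures[host] = 0
        ejected_percent = 100.0 * len(self.ejected) / len(self.failures)
        if host in self.ejected or ejected_percent >= self.max_ejection_percent:
            return False  # cap reached, keep the host in rotation
        self.ejected.add(host)
        return True


tracker = OutlierTracker(["a", "b", "c", "d"])
results = [tracker.record("a", 503) for _ in range(3)]
print(results)                  # [False, False, True]
print(sorted(tracker.ejected))  # ['a']
```

In this model, once two of the four hosts are ejected (50%), a third host that hits the failure threshold stays in rotation.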