Skip to content

Configure Envoy in the Kedify Proxy

At the core of kedify-http scaler is kedify-proxy, which forms a fleet of Envoy proxies. The fleet is configured over xDS control plane, implemented as part of the http-add-on interceptor component.

There are two parts of envoy configuration that support override of the default configuration, both are set as values in the kedify-agent helm chart:

  1. cluster - chart, envoy options
  2. route - chart, envoy options

By default, kedify-proxy will not retry requests that fail with any error code and will return the error code to the client. With route configuration, you can enable automatic retries for specific error situations. For example, to retry on 5xx errors, you can set the following in your kedify-agent values:

agent:
kedifyProxy:
globalEnvoyConfigs:
route:
retry_policy:
retry_on: 5xx # any internal or external 5xx error
num_retries: 5 # retry up to 5 times
retry_back_off:
base_interval: 1s # first retry will be after 1 second
max_interval: 10s # maximum interval between retries is 10 seconds with exponential backoff

This envoy config snippet means kedify-proxy will retry requests that fail with 5xx errors up to 5 times, with an exponential backoff starting at 1 second and capping at 10 seconds.

The kedify-proxy envoy uses ROUND_ROBIN load balancing strategy by default. This means that all endpoints in the cluster are treated equally, no matter how long they have been up. This can lead to issues if some endpoints are slow to start, as they may receive high load of traffic before they are ready. To mitigate this, you can enable slow start for the cluster by setting the slow_start_config configuration in the cluster section of your kedify-agent values:

agent:
kedifyProxy:
globalEnvoyConfigs:
cluster:
lb_policy: ROUND_ROBIN
round_robin_lb_config:
slow_start_config:
slow_start_window: 60s # slow start window will take effect for 60 seconds, after that it's ROUND_ROBIN
min_weight_percent:
value: 1.0 # as little as 1% of the traffic can be sent to the new endpoint to warm it up
aggression:
default_value: 1.0 # pace of traffic increase during the slow start window, lower number means slower in the beginning
runtime_key: slow_start_aggression

This envoy config snippet will instruct kedify-proxy to use a slow start window of 60 seconds, during which as little as 1% of the traffic will be sent to the new endpoint and it will gradually increase. After the slow start window, the endpoint will be treated equally with other endpoints in the service and receive its fair share of the traffic.

Envoy supports preconnecting endpoints in the cluster, which can help reduce latency for requests by anticipating a request and establishing a TCP session before it’s needed.

agent:
kedifyProxy:
globalEnvoyConfigs:
cluster:
preconnect_policy:
per_upstream_preconnect_ratio: 1.05 # preconnect 5 upstream connections for every 100 request
predictive_preconnect_ratio: 1.05 # preconnect 5 spare connection on the anticipated next endpoint for 100 requests

Having per_upstream_preconnect_ratio set to 1.05 means that for each 100 requests, kedify-proxy will preconnect 5 upstream connections in the cluster instead of waiting for the new request to arrive and then establishing the connection. Setting predictive_preconnect_ratio to 1.05 means that for each 100 requests, kedify-proxy will try to predict the next 5 connections anticipating what the next endpoint should be based on the envoy cluster’s internal loadbalancing configuration.

Envoy supports active health checks for endpoints in the cluster. This can help ensure that only healthy endpoints receive traffic. Because this configuration is set globally for each cluster, we recommend only very rudimentary TCP health checks.

agent:
kedifyProxy:
globalEnvoyConfigs:
cluster:
common_lb_config:
ignore_new_hosts_until_first_hc: true # apply health checks to new hosts
healthy_panic_threshold:
value: 0.2 # panic if more than 80% of the hosts are unhealthy
health_checks:
- timeout: 1s # timeout for the health check request
interval: 2s # interval between health checks during endpoint's healthy state
unhealthy_interval: 10s # interval between health checks during endpoint's unhealthy state
unhealthy_threshold: 3 # endpoint is considered unhealthy after 3 consecutive failed health checks
healthy_threshold: 2 # endpoint is considered healthy after 2 consecutive successful health checks
tcp_health_check: {} # check if envoy can establish a TCP connection to the endpoint

This envoy config snippet will instruct kedify-proxy to perform active health checks on the endpoints in the cluster. The health check will timeout after 1 second, and will be performed every 2 seconds while the endpoint is healthy, and every 10 seconds while the endpoint is unhealthy. The endpoint will be considered unhealthy after 3 consecutive failed health checks, and healthy again after 2 consecutive successful health checks. The health check will simply try to establish a TCP connection to the endpoint and then send FIN packet.

The panic threshold is set to 20%, which means that if more than 80% of the endpoints in the cluster are unhealthy, kedify-proxy will panic and sending traffic to all endpoints regardless of their health status. This is to prevent the situation where all endpoints are unhealthy and no traffic is being sent. Change this parameter with care, because it directly relates to the ScaledObject scale-up behavior. If the threshold is too high, even regular rapid scale-up events can trigger the panic mode.

Envoy also supports passive health checks through outlier detection. These checks are performed on the actual traffic and can help detect unhealthy endpoints based on their response codes and latency.

agent:
kedifyProxy:
globalEnvoyConfigs:
cluster:
outlier_detection:
interval: 5s # interval between outlier detection checks
max_ejection_percent: 50 # maximum percentage of endpoints that can be ejected from the cluster
split_external_local_origin_errors: true # treat external and local origin errors separately
consecutive_gateway_failure: 3 # number of consecutive requests that must fail with gateway error or timeout to eject the endpoint
enforcing_consecutive_gateway_failure: 100 # enforce the ejection each time the threshold is reached

This envoy config snippet will instruct kedify-proxy to perform passive health checks on the endpoints in the cluster. The outlier detection will be performed every 5 seconds, and unless 50% or more are already ejected, it will eject endpoints that fail 3 consecutive requests with envoy internal error or timeout. The split_external_local_origin_errors parameter is set to true, which means that external and local origin errors will be treated separately. This is useful if you want to treat errors from external services differently from errors from your own services.