Configure Envoy in the Kedify Proxy
At the core of the kedify-http scaler is kedify-proxy, which forms a fleet of Envoy proxies. The fleet is configured over an xDS control plane, implemented as part of the http-add-on interceptor component.
Two parts of the Envoy configuration support overriding the defaults. Both are set as values in the kedify-agent helm chart:
cluster - chart, envoy options
route - chart, envoy options
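Both sections sit under the same key in the kedify-agent values, as the examples on this page show. A minimal skeleton, with empty placeholders where the Envoy overrides would go:

```yaml
agent:
  kedifyProxy:
    globalEnvoyConfigs:
      cluster: {} # Envoy cluster overrides, e.g. lb_policy, health_checks, outlier_detection
      route: {}   # Envoy route overrides, e.g. retry_policy
```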
Retry Configuration on Error
By default, kedify-proxy does not retry failed requests; it returns the error code to the client. With the route configuration, you can enable automatic retries for specific error conditions. For example, to retry on 5xx errors, set the following in your kedify-agent values:
agent:
  kedifyProxy:
    globalEnvoyConfigs:
      route:
        retry_policy:
          retry_on: 5xx # any internal or external 5xx error
          num_retries: 5 # retry up to 5 times
          retry_back_off:
            base_interval: 1s # base interval for the exponential backoff between retries
            max_interval: 10s # maximum interval between retries is 10 seconds
With this Envoy config snippet, kedify-proxy retries requests that fail with 5xx errors up to 5 times, using exponential backoff with a base interval of 1 second and a cap of 10 seconds between retries.
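A narrower policy may be preferable when not every 5xx response is safe to retry. The following is a minimal sketch, assuming the route snippet is handed to Envoy's retry policy unchanged. The gateway-error and connect-failure conditions and the per_try_timeout field are standard Envoy retry settings that do not appear in the example above, and the values are illustrative only:

```yaml
agent:
  kedifyProxy:
    globalEnvoyConfigs:
      route:
        retry_policy:
          # assumption: retry only on 502/503/504 responses and failed connection attempts
          retry_on: gateway-error,connect-failure
          num_retries: 3
          # assumption: give each attempt its own 2 second budget
          per_try_timeout: 2s
```

Retries multiply the load on a struggling backend, so a small num_retries with a per-try timeout is usually a safer starting point than a large retry count.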
Slow Start Configuration
The kedify-proxy Envoy uses the ROUND_ROBIN load balancing strategy by default. This means that all endpoints in the cluster are treated equally, no matter how long they have been up. This can cause problems if some endpoints are slow to start, because they may receive a heavy load of traffic before they are ready.
To mitigate this, you can enable slow start for the cluster by setting slow_start_config in the cluster section of your kedify-agent values:
agent:
  kedifyProxy:
    globalEnvoyConfigs:
      cluster:
        lb_policy: ROUND_ROBIN
        round_robin_lb_config:
          slow_start_config:
            slow_start_window: 60s # slow start is in effect for 60 seconds, after that it's plain ROUND_ROBIN
            min_weight_percent:
              value: 1.0 # a new endpoint starts with as little as 1% of its normal weight to warm it up
            aggression:
              default_value: 1.0 # pace of traffic increase during the slow start window, a lower number means slower in the beginning
              runtime_key: slow_start_aggression
This Envoy config snippet instructs kedify-proxy to use a slow start window of 60 seconds, during which a new endpoint receives as little as 1% of its normal share of the traffic, increasing gradually. After the slow start window, the endpoint is treated equally with the other endpoints in the service and receives its fair share of the traffic.
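The ramp-up can be made concrete with the weight formula described in Envoy's slow start documentation. The following is a sketch under the assumption that kedify-proxy applies Envoy's formula unchanged; t denotes the seconds since the endpoint was added, and min_weight_percent is taken as a fraction (1% = 0.01):

```latex
\begin{aligned}
\mathrm{time\_factor} &= \frac{\max(t,\ 1)}{\mathrm{slow\_start\_window}} \\
\mathrm{new\_weight} &= \mathrm{weight} \cdot \max\left(\mathrm{min\_weight\_percent},\ \mathrm{time\_factor}^{\,1/\mathrm{aggression}}\right)
\end{aligned}
```

With a 60 second window and an aggression of 1.0, an endpoint is weighted at roughly 1.7% of its normal weight after 1 second, 25% after 15 seconds, 50% after 30 seconds, and 100% after 60 seconds. Under this formula the 1% floor only comes into play for aggression values below 1.0, where the curve starts out flatter.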
Preconnecting
Envoy supports preconnecting to endpoints in the cluster, which can help reduce request latency by anticipating a request and establishing a TCP session before it is needed.
agent:
  kedifyProxy:
    globalEnvoyConfigs:
      cluster:
        preconnect_policy:
          per_upstream_preconnect_ratio: 1.05 # preconnect 5 upstream connections for every 100 requests
          predictive_preconnect_ratio: 1.05 # preconnect 5 spare connections on the anticipated next endpoint for every 100 requests
With per_upstream_preconnect_ratio set to 1.05, kedify-proxy keeps roughly 5 spare connections preconnected for every 100 active requests to an endpoint, instead of waiting for a new request to arrive before establishing the connection. With predictive_preconnect_ratio set to 1.05, kedify-proxy anticipates about 5 additional connections for every 100 requests across the cluster and opens them on the endpoints that the Envoy cluster's load balancing configuration is expected to pick next.
Active Health Checks
Envoy supports active health checks for endpoints in the cluster. This can help ensure that only healthy endpoints receive traffic. Because this configuration is set globally for each cluster, we recommend only very rudimentary TCP health checks.
agent:
  kedifyProxy:
    globalEnvoyConfigs:
      cluster:
        common_lb_config:
          ignore_new_hosts_until_first_hc: true # do not send traffic to new hosts until their first health check passes
          healthy_panic_threshold:
            value: 20 # Envoy percentages use a 0-100 scale; panic if more than 80% of the hosts are unhealthy
        health_checks:
          - timeout: 1s # timeout for the health check request
            interval: 2s # interval between health checks during endpoint's healthy state
            unhealthy_interval: 10s # interval between health checks during endpoint's unhealthy state
            unhealthy_threshold: 3 # endpoint is considered unhealthy after 3 consecutive failed health checks
            healthy_threshold: 2 # endpoint is considered healthy after 2 consecutive successful health checks
            tcp_health_check: {} # check if envoy can establish a TCP connection to the endpoint
This Envoy config snippet instructs kedify-proxy to perform active health checks on the endpoints in the cluster. Each health check times out after 1 second and runs every 2 seconds while the endpoint is healthy and every 10 seconds while it is unhealthy. An endpoint is considered unhealthy after 3 consecutive failed health checks and healthy again after 2 consecutive successful ones. The health check simply tries to establish a TCP connection to the endpoint and then closes it with a FIN packet.
The panic threshold is set to 20%, which means that if more than 80% of the endpoints in the cluster are unhealthy, kedify-proxy enters panic mode and sends traffic to all endpoints regardless of their health status. For example, in a cluster of 10 endpoints, panic mode starts once fewer than 2 are healthy. This prevents the situation where all endpoints are unhealthy and no traffic is sent at all. Change this parameter with care, because it directly relates to the ScaledObject scale-up behavior. If the threshold is too high, even regular rapid scale-up events can trigger panic mode, because newly added endpoints do not count as healthy until their first health check passes.
Passive Health Checks
Envoy also supports passive health checks through outlier detection. These checks are performed on the actual traffic and can help detect unhealthy endpoints based on their response codes and latency.
agent:
  kedifyProxy:
    globalEnvoyConfigs:
      cluster:
        outlier_detection:
          interval: 5s # interval between outlier detection checks
          max_ejection_percent: 50 # maximum percentage of endpoints that can be ejected from the cluster
          split_external_local_origin_errors: true # treat external and local origin errors separately
          consecutive_gateway_failure: 3 # number of consecutive gateway errors (502, 503, 504) that eject the endpoint
          enforcing_consecutive_gateway_failure: 100 # enforce the ejection each time the threshold is reached
This Envoy config snippet instructs kedify-proxy to perform passive health checks on the endpoints in the cluster. Outlier detection runs every 5 seconds and, unless 50% or more of the endpoints are already ejected, ejects endpoints that fail 3 consecutive requests with a gateway error (HTTP 502, 503, or 504). The split_external_local_origin_errors parameter is set to true, which means that external and local origin errors are tracked separately. External origin errors are error responses returned by the endpoint itself, while local origin errors are generated by Envoy when it cannot communicate with the endpoint, for example on a connection failure, reset, or timeout. This is useful if you want to eject endpoints on different terms depending on where the error originated.
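With split_external_local_origin_errors enabled, Envoy counts locally originated failures with their own thresholds. The following is a minimal sketch, assuming the cluster snippet is handed to Envoy's outlier detection unchanged. The consecutive_local_origin_failure, enforcing_consecutive_local_origin_failure, and base_ejection_time fields are standard Envoy outlier detection settings that are not used in the example above, and the values are illustrative only:

```yaml
agent:
  kedifyProxy:
    globalEnvoyConfigs:
      cluster:
        outlier_detection:
          split_external_local_origin_errors: true
          # assumption: eject after 5 consecutive connection failures, resets, or timeouts
          consecutive_local_origin_failure: 5
          enforcing_consecutive_local_origin_failure: 100
          # assumption: base duration of an ejection before the endpoint is tried again
          base_ejection_time: 30s
```

Keeping the local origin threshold higher than the gateway failure threshold tolerates brief network blips while still reacting quickly to endpoints that return error responses.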