Circuit breaking is an important pattern that helps improve service resiliency. The pattern prevents additional failures by controlling and managing access to failing services.
The easiest way to explain circuit breakers is with a simple example: a frontend called Hello web and a backend called Greeter service. Say the greeter service starts failing. Instead of calling it continuously, we could detect the failures and either stop or reduce the number of requests made to the service. If we added a database to this example, you can quickly imagine how continuing to call the failing service would put more stress on other parts of the system and potentially make everything even worse.

This scenario is where the circuit breaker comes into play. We define the conditions under which we want the circuit breaker to trip (for example, more than 10 failures within a 5-second window). Once the circuit breaker trips, we stop making calls to the underlying service and instead return the error directly from the circuit breaker. This way, we prevent additional strain and damage to the system.
In Istio, circuit breakers are defined in the destination rule. The circuit breaker tracks the status of each host, and if any host starts to fail, it ejects that host from the load balancing pool. Practically speaking, if we have five instances of our pod running, the circuit breaker will eject any instance that misbehaves, so that calls are no longer made to that host. Ejection is controlled by outlier detection, and it can be configured by setting the following properties:
number of consecutive errors
scan interval
base ejection time
In addition to the outlier detection settings, we can also set the connection pool properties, such as the number of connections and requests per connection made to the service.
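A destination rule combining these two pieces might look like the following sketch. The `greeter-service` host name is an assumption, and the values are chosen to match the description that follows; note that newer Istio releases rename `consecutiveErrors` to `consecutive5xxErrors`:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: greeter-service
spec:
  host: greeter-service   # assumed service name
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1            # at most one TCP connection to the service
      http:
        http1MaxPendingRequests: 1   # at most one queued (pending) request
        maxRequestsPerConnection: 1  # at most one request per connection
    outlierDetection:
      consecutiveErrors: 1    # eject a host after a single 5xx error
      interval: 1s            # how often to scan hosts for ejection
      baseEjectionTime: 3m    # how long an ejected host stays out of the pool
      maxEjectionPercent: 50  # stop ejecting once half the hosts are ejected
```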
The above rule sets the maximum number of pending requests to the service (http1MaxPendingRequests) to a single request. This means that if more than one request is queued for a connection, the circuit breaker will trip. Similarly, the circuit breaker trips if there is more than one request per connection (the maxRequestsPerConnection setting).
Envoy uses outlier detection to determine when pods are unreliable, and it can eject them from the load balancing pool for a period of time (the baseEjectionTime setting). While a pod is ejected from the load balancing pool, no requests can reach it. The second setting in play with ejection is maxEjectionPercent. This setting represents a threshold that, once reached, causes the circuit breaker to load balance across all pods again.
Let's explain this with an example where maxEjectionPercent is set to 50%. As pods fail, the circuit breaker keeps ejecting them from the load balancing pool. With the failing pods ejected from the load balancing pool, only the healthy pods receive traffic.
At some point, even the healthy pods start failing, and once the 50% threshold is reached, the circuit breaker reverts to the original load balancing logic and starts load balancing across all pods again (both healthy and failing ones). This setting exists for severe outages: it's better to start dropping some requests than to exhaust the few healthy pods that are still available.
The decision to eject a pod is controlled by the consecutiveErrors setting. In the above example, if there's more than one error (an HTTP 5xx is considered an error), the pod gets ejected from the load balancing pool. Finally, the interval is the time between checks for whether pods need to be ejected or brought back into the load balancing pool.
Let's deploy the destination rule that configures a simple circuit breaker:
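Assuming the rule above is saved to a file (the file name here is hypothetical), it can be applied like this:

```shell
# Apply the destination rule with the circuit breaker settings
kubectl apply -f greeter-circuit-breaker.yaml
```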
To demonstrate circuit breaking, we will use a load-testing tool called Fortio. With Fortio, we can easily control the number of connections, concurrency, and delays of outgoing HTTP calls. Let's deploy Fortio:
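One way to do that is with the Fortio sample that ships with the Istio release, then make a single request through it (the greeter service name and port are assumptions):

```shell
# Deploy the Fortio sample client that ships with the Istio release
kubectl apply -f samples/httpbin/sample-client/fortio-deploy.yaml

# Grab the Fortio pod name and make a single request to the greeter service
export FORTIO_POD=$(kubectl get pods -l app=fortio -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$FORTIO_POD" -c fortio -- fortio curl http://greeter-service:3000
```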
With the above command, we are making just one call to the greeter service, and it works: we get a response back. Let's try to trip the circuit breaker now.
You can use Fortio to make 20 requests with 2 concurrent connections like this:
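A sketch of that Fortio invocation (again assuming the greeter service name and port):

```shell
# 20 requests (-n 20) over 2 concurrent connections (-c 2),
# sent as fast as possible (-qps 0)
kubectl exec "$FORTIO_POD" -c fortio -- fortio load -c 2 -qps 0 -n 20 \
  -loglevel Warning http://greeter-service:3000
```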
This tells us that 82% of the requests succeeded, and the rest were caught by the circuit breaker. Another way to see the calls trapped by the circuit breaker is to query the Istio proxy stats:
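The stats can be pulled from the Envoy sidecar running next to Fortio; the counter to look for is `upstream_rq_pending_overflow`:

```shell
# Query the Envoy stats from the sidecar and filter for the
# pending-request counters of the greeter-service cluster
kubectl exec "$FORTIO_POD" -c istio-proxy -- pilot-agent request GET stats \
  | grep greeter-service | grep pending
```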
The above stats show that 9 calls have been flagged for circuit breaking (which matches the number of failed requests we saw with Fortio).
Similarly, you can also check the metrics from Prometheus. You can launch Prometheus with:
istioctl dashboard prometheus
From the Prometheus dashboard, use the following query to get a sum of all requests that went to the greeter-service and group them by response code, flags, and source app:
sum(istio_requests_total{destination_app="greeter-service"}) by (response_code, response_flags, source_app)
You should see a result similar to the figure below:
The response flag you are looking for is UO. If you look it up in the Envoy documentation, the code UO stands for upstream overflow (circuit breaking).
Peter Jausovec
Peter Jausovec is a platform advocate at Solo.io. He has more than 15 years of experience in software development and tech, in various roles such as QA (test), software engineering, and leading tech teams. He's been working in the cloud-native space, focusing on Kubernetes and service meshes, and delivering talks and workshops around the world. He has authored and co-authored several books, the latest being Cloud Native: Using Containers, Functions, and Data to Build Next-Generation Applications.