In my previous blog post, I discussed the need for SaaS systems to manage load in a way that’s fair to all their tenants. I went over the big challenge of noisy neighbors and how difficult these problems can be to solve at your company. But solve them you must, for the success of your business. So let’s see how you can build a solution that best addresses these problems with the resources you have.
First, any solution you build needs to know when each service is getting overloaded, which requires a dynamic view of capacity for each service. A simple way to track overload is to look at latency and error rates; when those values get unacceptably high, you declare the service overloaded and throw an alert. Using these higher-level metrics lets you focus on perceived performance, regardless of what’s actually happening inside the service.
For example, if your requests return quickly and successfully, does it matter if a service is running hot, or if CPU is at 95%? Probably not. Also, because these higher-level metrics are easily measured at the entry points and are always available, regardless of the architecture and hardware a service is built on, this technique is easy to implement consistently across all your systems.
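To make this concrete, here’s a rough sketch in Python of what recording latency and errors at a service entry point might look like. The class name, window size, and percentile are my own illustrative choices, not a prescription:

```python
import time
from collections import deque

class EntryPointMetrics:
    """Tracks recent latency and error rate for one service entry point."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, latency_seconds, is_error)

    def record(self, latency_seconds, is_error):
        now = time.time()
        self.samples.append((now, latency_seconds, is_error))
        # Drop samples that have fallen out of the sliding window.
        while self.samples and self.samples[0][0] < now - self.window_seconds:
            self.samples.popleft()

    def p99_latency(self):
        if not self.samples:
            return 0.0
        latencies = sorted(s[1] for s in self.samples)
        return latencies[int(0.99 * (len(latencies) - 1))]

    def error_rate(self):
        if not self.samples:
            return 0.0
        return sum(1 for s in self.samples if s[2]) / len(self.samples)
```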
Many solutions look at a much more complex set of metrics, usually at the hardware level on each instance, which often leads to a lot of complexity without much additional benefit. Of course, looking only at the top-level metrics will not catch problems with individual instances, such as hardware or code issues. But you (hopefully!) already have monitoring tools that can detect those problems and alert on them, so there is no need to rebuild that functionality in your new product.
But how do you determine what values are “unacceptably high” for metrics such as response time and error rate? There are various criteria you can use to identify outliers, but remember that your criteria must be dynamic. Because the code and environment of your systems are constantly changing, you can’t use static values. Whatever strategy you implement to define thresholds for throwing an alert, it must be able to adjust to this quickly changing environment.
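One possible approach (and only one of many) is to compare current values against a slow-moving baseline learned from healthy traffic. The sketch below assumes an exponentially weighted moving average baseline; the multiplier, error-rate ceiling, and smoothing factor are placeholders you would tune for your own systems:

```python
class DynamicThreshold:
    """Flags overload when current metrics far exceed a slow-moving baseline."""

    def __init__(self, latency_multiplier=3.0, error_rate_ceiling=0.05, alpha=0.05):
        self.latency_multiplier = latency_multiplier
        self.error_rate_ceiling = error_rate_ceiling
        self.alpha = alpha            # how quickly the baseline adapts
        self.baseline_latency = None  # learned from observed traffic

    def is_overloaded(self, current_p99, current_error_rate):
        if self.baseline_latency is None:
            self.baseline_latency = current_p99
            return False
        overloaded = (current_p99 > self.latency_multiplier * self.baseline_latency
                      or current_error_rate > self.error_rate_ceiling)
        if not overloaded:
            # Only fold healthy traffic into the baseline, so an overload
            # period doesn't get absorbed into what we consider "normal".
            self.baseline_latency = ((1 - self.alpha) * self.baseline_latency
                                     + self.alpha * current_p99)
        return overloaded
```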
Now that you’ve been alerted that a service is overloaded, what do you do? As I discussed before, the simple way is just to blindly limit requests. But as I also discussed, this approach creates a host of bad outcomes for your customers. Assuming you care about your customers, your goal should be “fairness”: each customer should have its fair share of requests serviced. You need to implement some kind of intelligent load-shedding strategy that is focused on this fairness.
For example, if one customer is overloading the system with many intensive requests, your product should shed load primarily from that customer, without affecting the customers who are harmlessly doing very little. If you simply limit an equal number of requests from each customer, all your customers will share equally in the pain of a problem caused by a single customer, and that is not “fair.” The more requests and resources one customer is using, the more that customer’s requests should be limited, so that well-behaved customers don’t feel the pain. If you can develop an algorithm that implements this strategy, you will have achieved fairness and you will have much more satisfied customers!
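As a rough sketch of such an algorithm, you could track each tenant’s share of recent load and shed only from tenants exceeding their fair share, in proportion to how far they exceed it. The names and formula below are illustrative only; a real implementation would also weight by request cost and decay old counts, not just count recent requests:

```python
import random
from collections import Counter

class FairShedder:
    """Sheds proportionally more load from tenants using more than their fair share."""

    def __init__(self):
        self.recent_requests = Counter()  # tenant_id -> recent request count
        self.total = 0                    # would be windowed/decayed in a real system

    def observe(self, tenant_id):
        self.recent_requests[tenant_id] += 1
        self.total += 1

    def shed_probability(self, tenant_id, overall_shed_fraction):
        """How likely this tenant's next request should be shed."""
        if self.total == 0 or not self.recent_requests:
            return 0.0
        fair_share = 1.0 / len(self.recent_requests)
        actual_share = self.recent_requests[tenant_id] / self.total
        if actual_share <= fair_share:
            return 0.0  # well-behaved tenants are left alone
        # Tenants above their fair share absorb the shedding, scaled by how
        # far above that share they are.
        return min(1.0, overall_shed_fraction * actual_share / fair_share)

    def should_shed(self, tenant_id, overall_shed_fraction):
        return random.random() < self.shed_probability(tenant_id, overall_shed_fraction)
```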
If you want to look beyond tenant fairness, you can also address operation priority. Some types of operations are inherently much more important than others. For example, a request performing a financial transaction is of much higher priority than a background reporting request. So in a more advanced implementation, you can assign priorities to requests based on the type of work they are doing, and add this prioritization to your tenant queue.
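A minimal sketch of that idea might key the queue on an operation’s priority class before its arrival order; the priority labels and values here are purely illustrative:

```python
import heapq
import itertools

# Illustrative priority classes; lower number = more important.
PRIORITY = {"payment": 0, "interactive": 1, "background_report": 2}

class PriorityRequestQueue:
    """Orders queued work so higher-priority operations are serviced first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break within a priority class

    def push(self, request, operation_type):
        priority = PRIORITY.get(operation_type, 1)
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def pop(self):
        priority, _, request = heapq.heappop(self._heap)
        return request
```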
Of course there are some nuances to all this load shedding. If you blindly start rejecting requests all at once, you may create an even worse problem: a retry storm, as those tenants will likely retry the request upon getting an error code. To spread out the retries, hold the requests you plan to reject in a queue and pace your rejections over time. (By “rejection,” I simply mean returning an HTTP 429 error code for typical web services, or something equivalent for other types of services such as database queries. This type of rejection notifies the caller that you can’t handle the request right now, as opposed to reporting an actual error within the system.)
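Here’s one way that pacing might look as a sketch: instead of returning 429s in a burst, queue the rejections and drain them at a steady rate. The rate and the callback shape are assumptions for illustration:

```python
import queue
import threading
import time

class PacedRejector:
    """Returns rejections at a steady rate instead of rejecting everything at once."""

    def __init__(self, rejections_per_second=50):
        self.interval = 1.0 / rejections_per_second
        self.pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def reject_later(self, respond_fn):
        # respond_fn is a callback that sends the 429 (or equivalent) to the caller.
        self.pending.put(respond_fn)

    def _drain(self):
        while True:
            respond_fn = self.pending.get()
            respond_fn()               # e.g., send HTTP 429, ideally with Retry-After
            time.sleep(self.interval)  # pace rejections to spread out retries
```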
Another important implementation consideration is how your load shedding interacts with the underlying system’s attempts to scale under load. You don’t want to arbitrarily shed so much load that the system won’t even try to scale, because sometimes scaling will allow the system to handle the load instead of rejecting it. After all, the reason you are shedding load is to allow the system to scale gracefully. So you should develop a way to shed gradually, re-evaluating as you go.
As you get an alert about overloaded services, you can try to shed slowly and ramp up over time. But as you constantly recheck, you may see that the alert can be canceled, as the service has scaled enough to handle additional load, and so you need to slow down shedding and let the extra capacity handle the additional load. But again, only gradually reduce the shedding, because you don’t know how much the additional capacity can handle, and you will only find out as you allow in more traffic. Thus intelligent load shedding needs to be implemented as an iterative process; a “one-and-done” strategy will not be dynamic enough to reflect and respond to actual tenant behavior.
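That iterative loop might look something like the sketch below: a small controller that nudges the shed fraction up while the service stays overloaded and eases it back down as capacity catches up. The step sizes and polling interval are placeholders, and `is_overloaded` and `set_shed_fraction` are hypothetical hooks into the detection and shedding pieces described above:

```python
import time

def shedding_control_loop(is_overloaded, set_shed_fraction,
                          step_up=0.05, step_down=0.02, poll_seconds=10):
    """Gradually adjusts the fraction of load being shed, re-evaluating each cycle."""
    shed_fraction = 0.0
    while True:
        if is_overloaded():
            # Ramp shedding up slowly to give scaling a chance to catch up.
            shed_fraction = min(1.0, shed_fraction + step_up)
        else:
            # Back off gradually; we only learn how much the new capacity
            # can handle as we let more traffic through.
            shed_fraction = max(0.0, shed_fraction - step_down)
        set_shed_fraction(shed_fraction)
        time.sleep(poll_seconds)
```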
Of course, the critical question is this: how do you know how well your solution is working, or if it is working at all? It’s vital to have reporting so you can understand how your solution is affecting the underlying system. This actually brings up a good question: what are your criteria for determining success? I would suggest the concept of “goodput.” Goodput is defined as the percentage of the total requests received by the system that were handled without errors or degraded performance. In other words, goodput is the success rate of requests being handled by your system.
In a perfect system, goodput is 100% (all requests are handled perfectly), and as a system gets more and more overloaded, goodput declines until it eventually reaches 0% (usually once your system is down). This concept leads to another question: how do you decide if a request is “degraded”? One good way is to just count the outliers, such as the slowest 10%, 5%, or even 1% of the requests. The right cutoff will depend on how variable request latency is on your systems: if there’s very little variation, the slowest 10% might all be genuinely slow; if there’s more of a bell curve or long tail, then only requests at the very end (perhaps 1% or less) might be considered slow. In other circumstances, you could also count requests whose performance does not meet the SLAs defined in your contracts with customers.
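As a simple illustration of measuring goodput per reporting interval, using the “slowest N%” cutoff mentioned above (the 5% here is just an example, not a recommendation):

```python
def goodput(latencies_and_errors, slow_fraction=0.05):
    """Fraction of requests that succeeded and were not in the slowest tail."""
    total = len(latencies_and_errors)
    if total == 0:
        return 1.0
    sorted_latencies = sorted(lat for lat, _ in latencies_and_errors)
    # Treat anything slower than this cutoff as "degraded".
    slow_cutoff = sorted_latencies[int((1 - slow_fraction) * (total - 1))]
    good = sum(1 for lat, is_error in latencies_and_errors
               if not is_error and lat <= slow_cutoff)
    return good / total

# Example: three fast successes, one very slow success, one error -> 60% goodput.
print(goodput([(0.05, False), (0.07, False), (0.06, False), (2.4, False), (0.08, True)]))
```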
Your goal should be that even during overwhelming load, especially load from specific tenants, your load-shedding implementation maximizes the number of successful requests and minimizes the number of tenants impacted by the storm and the resulting shedding. Goodput, measured across requests and tenants, should be higher with your new solution in place than without it. And of course, you should always aim to raise that goodput by regularly improving the implementation and alert configurations.
I know you probably aren’t ready to go off and build this great product just yet (and hopefully one will be available in the market one day!). But I hope this discussion has helped you to think about your problems with heavy load and the never-ending issue of managing noisy neighbors on your system. If you have suggestions on what a solution would look like or comments about my proposed implementation, please let me know!
Just remember: for anyone trying to build their own solution, the big takeaway here should be that it’s going to be challenging, because there is no simple way to deal with a constantly changing environment. Your customers’ load patterns are changing, your product is changing, and your infrastructure is changing. Any solution must be complex and dynamic enough to successfully work with a system that is itself complex and dynamic. In a dynamic environment, static solutions simply won’t suffice.