Boris Livshutz, Startup Advisor


Building Your Own Solution for SaaS Capacity Management

A customer-centric approach to SaaS, Part 4

In my previous blog post, I discussed the need for SaaS systems to manage load in a way that's fair to all of their tenants. I went over the big challenge of noisy neighbors and how difficult these problems can be to solve at your company. But solve them you must, for the success of your business. So let's see how you can build a solution that best addresses these problems with the resources you have.

Step 1 — Problem Detection

First, any solution you build needs to know when each system is getting overloaded, which requires a dynamic view of capacity for each service. A simple way to track overload is to look at latency and error rates; when those values get unacceptably high, you would declare a system overloaded and throw an alert. Using these higher level metrics lets you focus on perceived performance regardless of what’s actually happening inside the service.

For example, if your requests return quickly and successfully, does it matter whether a service is running hot, or whether CPU is at 95%? Probably not. Also, because these higher-level metrics are easily measured at the entry points and are always available, regardless of the architecture and hardware the service is built on, this technique is easy to implement consistently across all your systems.

Many solutions look at a much more complex set of metrics, usually at the hardware level on each instance, which often leads to a lot of complexity without much additional benefit. Of course, looking only at the top-level metrics will not catch problems with individual instances, such as hardware or code issues. But you (hopefully!) already have monitoring tools that can detect those problems and alert on them, so there is no need to rebuild that functionality in your new product.

But how do you determine what values are “unacceptably high” for metrics such as response time and error rate? There are various criteria you can use to identify outliers, but remember that your criteria must be dynamic. Because the code and environment of your systems are constantly changing, you can’t use static values. Whatever strategy you implement to define thresholds for throwing an alert, it must be able to adjust to this quickly changing environment.
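
To make this concrete, here is a minimal sketch of what such a detector might look like. It assumes your metrics pipeline can feed it per-request latencies and error flags for a service; the window size, the error-rate ceiling, and the "N times the rolling median" rule are illustrative assumptions, not a prescription.

```python
# Minimal sketch of dynamic overload detection. All names, window sizes,
# and multipliers are illustrative, not prescriptive.
from collections import deque
from statistics import median

class OverloadDetector:
    def __init__(self, window=1000, latency_factor=3.0, max_error_rate=0.05):
        self.latencies = deque(maxlen=window)   # rolling window of recent latencies (seconds)
        self.latency_factor = latency_factor    # "unacceptable" = N x rolling median
        self.max_error_rate = max_error_rate    # e.g. alert above 5% errors
        self.requests = 0
        self.errors = 0

    def record(self, latency_s, is_error):
        self.latencies.append(latency_s)
        self.requests += 1
        if is_error:
            self.errors += 1

    def is_overloaded(self):
        if len(self.latencies) < 100:           # not enough data yet
            return False
        error_rate = self.errors / self.requests
        # The threshold adapts to the service's own recent behavior
        # rather than relying on a static value.
        dynamic_threshold = self.latency_factor * median(self.latencies)
        recent = list(self.latencies)[-50:]
        return error_rate > self.max_error_rate or median(recent) > dynamic_threshold
```

The point is that the threshold is derived from the service's own recent behavior, so it adjusts as the code, hardware, and load mix change.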

Step 2 — How to Shed Load

Now that you’ve been alerted that a service is overloaded, what do you do? As I discussed before, the simple way is just to blindly limit requests. But as I also discussed, this solution creates a host of bad outcomes for your customers. Assuming you care about your customers, your goal should be “fairness”: each customer should have its fair share of requests serviced. You need to implement some kind of intelligent load-shedding strategy that is focused on this fairness.

For example, if one customer is overloading the system with many intensive requests, your product should shed load primarily from that customer, without affecting the customers who are harmlessly doing very little. If you simply reject an equal number of requests from each customer, all your customers share equally in the pain of a problem caused by a single customer, and that is not “fair.” The more requests and resources one customer is using, the more that customer’s requests should be limited, so that well-behaved customers don’t feel the pain. If you can develop an algorithm that implements this strategy, you will have achieved fairness, and you will have much more satisfied customers!
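
One simple way to express that idea in code is to make the probability of rejecting a tenant’s request grow with that tenant’s share of recent load. This is only a sketch with assumed names and a toy weighting formula; a production version would use sliding windows, resource cost rather than raw request counts, and so on.

```python
# Sketch of fairness-aware shedding: a tenant's rejection probability grows
# with that tenant's share of recent load. Names and formula are illustrative.
import random
from collections import Counter

class FairShedder:
    def __init__(self):
        self.recent_load = Counter()   # requests seen per tenant in the current window

    def record(self, tenant_id):
        self.recent_load[tenant_id] += 1

    def should_shed(self, tenant_id, shed_fraction):
        """shed_fraction: overall fraction of traffic we need to drop (0.0 - 1.0)."""
        total = sum(self.recent_load.values()) or 1
        share = self.recent_load[tenant_id] / total
        # Weight the global shed fraction by the tenant's share of load, so a tenant
        # sending 80% of the traffic absorbs most of the rejections, while tenants
        # with a tiny share are rarely touched.
        reject_probability = min(1.0, shed_fraction * share * len(self.recent_load))
        return random.random() < reject_probability
```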

If you want to look beyond tenant fairness, you can also address operation priority. Some types of operations are inherently much more important than others. For example, a request that is doing a financial transaction is of much higher priority than a background reporting request. So in a more advanced implementation, you can assign priorities to requests based on the type of work they are doing, and add this prioritization to your tenant queue.
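
Building on the sketch above, one way to fold in operation priority is to weight the shed fraction by the request type, so background work is rejected more readily than financial transactions. The priority classes and weights here are invented purely for illustration.

```python
# Illustrative priority classes layered on top of the FairShedder sketch above.
PRIORITY_WEIGHT = {"transaction": 0.2, "interactive": 1.0, "background": 3.0}

def should_shed_with_priority(shedder, tenant_id, shed_fraction, request_kind):
    # Background work is shed roughly 3x more readily, transactions 5x less readily.
    weighted = shed_fraction * PRIORITY_WEIGHT.get(request_kind, 1.0)
    return shedder.should_shed(tenant_id, min(1.0, weighted))
```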

Of course there are some nuances to all this load shedding. If you blindly start rejecting requests all at once, you may create an even worse problem: a retry storm, as those tenants will likely retry the request upon getting an error code. To spread out the retries, hold the requests you plan to reject in a queue and pace your rejections over time. (By “rejection,” I simply mean returning an HTTP 429 error code for typical web services, or something equivalent for other types of services such as database queries. This type of rejection notifies the caller that you can’t handle their request, as opposed to reporting an actual error within the system.)
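
Here is one possible shape for pacing those rejections, sketched with asyncio; the queue, the drain loop, and the rejections-per-second knob are assumptions for illustration rather than a recommended design.

```python
# Sketch: instead of returning 429s to everyone at the same instant, queue the
# requests you intend to reject and release the 429 responses gradually.
import asyncio

class PacedRejector:
    def __init__(self, rejections_per_second=50):
        self.interval = 1.0 / rejections_per_second
        self.queue = asyncio.Queue()

    async def reject_later(self, request):
        future = asyncio.get_running_loop().create_future()
        await self.queue.put(future)
        return await future            # resolves to the 429 response, later

    async def drain(self):
        while True:
            future = await self.queue.get()
            # HTTP 429 tells the caller "slow down" rather than reporting a real
            # error; a Retry-After header spreads the retries out even further.
            future.set_result((429, {"Retry-After": "5"}, "Too Many Requests"))
            await asyncio.sleep(self.interval)
```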

Another important implementation consideration is how your load shedding interacts with the underlying system’s attempts to scale under load. You don’t want to arbitrarily shed so much load that the system won’t even try to scale, because sometimes scaling will allow the system to handle the load instead of rejecting it. After all, the reason you are shedding load is to allow the system to scale gracefully. So you should develop a way to shed gradually, re-evaluating as you go.

When you get an alert about an overloaded service, you can start shedding lightly and ramp up over time. But as you constantly recheck, you may find that the alert can be canceled because the service has scaled enough to handle more traffic, so you should dial the shedding back down and let the extra capacity take over. Again, reduce the shedding only gradually, because you don’t know how much the additional capacity can handle, and you will only find out as you let more traffic in. Intelligent load shedding therefore needs to be implemented as an iterative process; a “one-and-done” strategy will not be dynamic enough to respond to actual tenant behavior.
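
Put together, the shedding controller becomes a loop that periodically re-reads the overload signal and nudges the shed fraction up or down in small steps. The sketch below assumes a `detector` like the one in Step 1 and a `shedder_state` object that simply holds the current shed fraction; the step sizes and check interval are placeholders.

```python
# Sketch of the iterative control loop: additively increase shedding while
# overloaded, back off slowly once pressure eases.
import time

def control_loop(detector, shedder_state, check_interval_s=5,
                 step_up=0.05, step_down=0.02, max_shed=0.9):
    while True:
        if detector.is_overloaded():
            # Still overloaded: shed a little more, but never everything,
            # so the system still sees enough load to justify scaling.
            shedder_state.shed_fraction = min(max_shed,
                                              shedder_state.shed_fraction + step_up)
        else:
            # Pressure has eased (perhaps autoscaling caught up): back off slowly,
            # since we don't yet know how much traffic the new capacity can take.
            shedder_state.shed_fraction = max(0.0,
                                              shedder_state.shed_fraction - step_down)
        time.sleep(check_interval_s)
```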

Step 3 — Measuring Success

Of course, the critical question is this: how do you know how well your solution is working, or whether it is working at all? It’s vital to have reporting so you can understand how your solution is affecting the underlying system. That raises another question: what are your criteria for determining success? I would suggest the concept of “goodput.” Goodput is defined as the percentage of the total requests received by the system that were handled without errors or degraded performance. In other words, goodput is the success rate of requests handled by your system.

In a perfect system, goodput is 100% (all requests are handled perfectly) and as a system gets more and more overloaded, goodput declines until it eventually reaches 0% (usually once your system is down). This concept leads to another question: how do you decide if a request is “degraded?” One good way is to just count the outliers, such as the slowest 10%, 5% or even 1% of the requests. The actual value will depend on how variable the speed of requests is on your systems: if there’s very little variation, the slowest 10% might all be really slow; if there’s more of a bell curve or long tail, then only requests at the very end (perhaps 1% or less) might be considered slow. In other circumstances, you could also look at request performance that does not meet SLAs as defined by your contract with customers.
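
Computing goodput over a window of completed requests can be as simple as the sketch below, which assumes each request record carries its latency and an error flag and uses a fixed SLA latency as the “degraded” cutoff; you could just as easily derive the cutoff from the slowest few percent of a healthy baseline period, as described above.

```python
# Sketch of goodput over a window of completed requests.
def goodput(requests, sla_latency_s=1.0):
    """requests: list of (latency_seconds, had_error) tuples.
    A request counts toward goodput only if it succeeded and met the latency target."""
    if not requests:
        return 1.0
    good = sum(1 for lat, err in requests if not err and lat <= sla_latency_s)
    return good / len(requests)
```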

Your goal should be that even during overwhelming load, especially load from specific tenants, your load-shedding implementation maximizes the number of successful requests and minimizes the number of tenants impacted by the storm and the resulting shedding. Goodput, both overall and per tenant, should be higher with your new solution than without it. And of course, you should always aim to raise that goodput by regularly improving the implementation and alert configurations.

Putting It All Together

I know you probably aren’t ready to go off and build this great product just yet (and hopefully one will be available in the market one day!). But I hope this discussion has helped you to think about your problems with heavy load and the never-ending issue of managing noisy neighbors on your system. If you have suggestions on what a solution would look like or comments about my proposed implementation, please let me know!

Just remember: for anyone trying to build their own solution, the big takeaway here should be that it’s going to be challenging, because there is no simple way to deal with a constantly changing environment. Your customers’ load patterns are changing, your product is changing, and your infrastructure is changing. Any solution must be complex and dynamic enough to work with a system that is itself complex and dynamic. In a dynamic environment, static solutions simply won’t suffice.



Are You Fair to Your SaaS Neighbor?

A customer-centric approach to SaaS, Part 3

In my previous two blog posts, I discussed the reasons your business must have visibility into the tenants on your SaaS applications. Now I’d like to address a very tough and important challenge all SaaS companies face: what to do when your systems get overloaded during busy times.

The easiest thing to do when your system is struggling is to just ignore the tenants: reject all incoming requests and hope for the best. Tenant visibility doesn’t even come into play. The problem with this approach is that you might be ensuring a terrible experience for all your customers when only one of them is overloading your systems. This is known as the “noisy neighbor” problem. Your system has a noisy neighbor problem when one or a few of your customers place such a tremendous load on your system that it adversely affects all customers using the platform.

But it doesn’t have to be this way; a system can be designed such that the usage of one customer doesn’t have to hurt the rest of your customers. In this blog post, I will discuss various approaches that can do just this.

At this point, you may be thinking you don’t need to worry about noisy neighbors. If you already do auto-scaling and use features like AWS Lambda, isn’t that enough to adjust to varying load, regardless of which tenant is causing it? Well, as many companies have learned the hard way, auto-scaling is extremely hard to get right, and most smaller companies have a hard time making it work.

Even if you do master this advanced technology, and successfully auto-scale at the right times, it won’t prevent all overload issues; during a spike in load, your new capacity won’t magically appear instantaneously. First, your system will take time to detect the heavier load and ensure it continues long enough to justify scaling. And even then, after your scaling kicks in, each new instance has to go through several steps to be ready to serve the load:

  • Each new instance first has to be assigned to you by the cloud.
  • The instance is then bootstrapped with its virtual resources.
  • Next, the applications have to start up, caches have to be warmed up, connections must be established, and so on.

Even in the best of times, this process takes at least five minutes before the new capacity is completely ready. By then, a big request storm might have already brought down your system. And spinning up all those new instances at once can overload your system even further.

But what about Lambda…can’t it scale infinitely? Well, there may not be a technical limit on how many Lambda instances can spring up suddenly, but what about the cost? And what about the downstream services they will end up calling? Many budgets have been wiped out in the first month after launching a system with unlimited use of Lambda, because you are charged for every invocation of a function as well as for the time spent executing it. Once the budget folks come at you with their pitchforks, you will put strong limits on when Lambda functions can be executed, and this will bring you back to the original problem of not being able to handle load spikes.

But even if you have the budget and start invoking Lambda functions en masse on each load spike, the usual result will just be congestion further downstream. The functions will probably involve database calls, queue requests, use of physical resources such as a disk, and so on. Yes, the functions will get invoked without delay, but they will eventually end up waiting on the database storm they created.

So just like auto-scaling, Lambda is not a magic bullet and does not alleviate the need for a real solution to the noisy neighbor problem. So let’s explore some other companies’ DIY or open-source solutions.

Many companies use the most basic solution, called rate limiting. They simply place a limit on how many requests a service can allow at any given time, and reject any requests above that limit. While this solution is easy to implement and configure, it is not very effective at solving the problem. Usually, to be most conservative, limits are set so high that they only trigger in the most extreme, DDoS-like situations. If a company wants to be less conservative, instead of setting a very high limit, it leaves it to the operations team to set limits based on actual data.
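
For reference, the basic approach really is this simple; the sketch below counts requests in one-second windows against a single static limit (the limit value is arbitrary here). Its simplicity is exactly why the static limit becomes the weak point, as discussed next.

```python
# Sketch of the basic approach: one static requests-per-second limit,
# everything above it rejected. Window size and limit are illustrative.
import time

class SimpleRateLimiter:
    def __init__(self, max_requests_per_second=1000):
        self.limit = max_requests_per_second
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= 1.0:       # start a new one-second window
            self.window_start, self.count = now, 0
        self.count += 1
        return self.count <= self.limit          # False means "reject this request"
```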

The difficulty is that the optimal limit is a moving target that changes with every software update, infrastructure change, or just a change in load mix. Setting limits in a constantly changing environment becomes cumbersome, time-consuming, and error-prone. Because of all this, most operations teams eventually fall back to the first approach, setting limits to a constant high value, which leads back to the very problem we are trying to solve.

As it turns out, static rate limits simply don’t manage load well. Ultimately, either performance will degrade during load spikes, or you will have to grossly over-provision resources to handle any load storm. Over-provisioning wastes money, because the business is paying for a large amount of unused capacity.

The other problem is fairness. At some point, one of your customers’ usage might explode, making that customer responsible for most of the requests on your system. By stopping all requests that exceed the limit, you are most likely rejecting requests from the customers who are only lightly using the system, while the noisy-neighbor tenant greedily gets most of its requests serviced.

So while the traditional simple solution tries to prevent overload, it is far from ideal, due to higher costs and allowing a bad tenant (noisy neighbor) to hurt everyone. But many companies, especially smaller ones, don’t have the internal expertise or resources in engineering to implement systems more intelligent than this.

Some very large companies have built their own advanced solutions to this problem. There are plenty of articles, talks, and blog posts from companies such as Netflix, Amazon, WeChat, and even Lyft. These companies have enormous resources and have found thoughtful ways to address the shortcomings of the basic solution I mentioned before. While these advanced solutions vary, the common ideas are as follows.

They all try to identify the capacity of each system and use certain metrics (usually request wait time in queue) to decide when to shed load. Then they try to decide which load to shed, based on priority of the calling service and fairness to each user. For example, they would rather kill trivial background tasks than financial transactions, and they don’t want to kill requests from the same user each time. Instead, they try to spread out the pain.

Another important feature is that these solutions, when running on each service, communicate with each other. Making decisions in a coordinated fashion helps to avoid retry and re-login storms. This feature also limits wasted work by not killing a request very deep in a call graph.

These solutions are quite impressive engineering feats, and they are indeed effective ways to deal with the noisy neighbor problem. But if you are reading this blog post, they are probably out of reach for you. They require immense engineering resources, teams on top of teams, and constant updates to account for changes in the underlying applications.

If you don’t have a massive and sophisticated engineering organization, you might want to try an open-source library instead. Libraries such as Kanaloa and Netflix’s concurrency-limits offer a few of the features I’ve discussed above. While they do help, they are hard to maintain: they require customization and configuration, which need to be revisited as your environment and usage patterns change. They are not dynamic enough to be plug and play.

I know this wasn’t a very uplifting blog post for those of you who were hoping for a quick fix. Nevertheless, I hope this content has helped you understand that problems with heavy load and the never-ending noisy neighbors on your system are not easy to solve, and that you should give them more of your attention.

I hope the big takeaway for you is that managing dynamically changing load is intrinsically difficult because you are working in an environment that is never static; your customers’ load patterns are changing, your product is changing, and your infrastructure is changing. To address such a complex problem, any solution must also be highly dynamic!

But all is not lost. In my next blog post I will go over what a comprehensive solution should look like. If you have any ideas or suggestions on how to elegantly solve this problem, please write to me and let me know.



Your (SaaS) Business is Your Tenant!

A customer-centric approach to SaaS, Part 2

In my first blog post, I discussed the importance of being tenant-aware — that is, looking at your SaaS business from the perspective of each tenant. We dove into the technology and methodology of tracking tenants with metrics and quickly solving problems caused by tenants on the platform. In this post, I would like to switch over to focusing on the business aspects of being tenant-aware.

Like any other business owner or executive, you are no doubt hyper focused on your top and bottom lines. For the top line, that means growing your revenue by improving customer usage and satisfaction. For the bottom line, that means optimizing your spending, which can only be done if you can accurately identify and reduce the biggest expenditures. In a SaaS business, this can be done well only if you have accurate usage data on your customers, i.e. your tenants.

In the last blog post, we saw how engineering can provide you with metrics that show how much each tenant is using the various services on your platform. Now let’s see how this knowledge will help you improve your top and bottom lines.

Cost Per Tenant

You are already tracking your costs. I’m sure you know how much the hardware, software, and hosting are costing you. But how much does each customer cost? If you don’t know this, how can you know which customers are responsible for most of your costs? How do you know if a customer is even profitable?

Cost per tenant is one of the most important metrics to have on your business dashboard. When you can see, for any given time period (hourly, daily, monthly, etc.), how much each tenant is costing you in terms of platform cost, you have gained actionable information on which you can make critical cost-reduction decisions.

Prior analyses may have shown that database usage was eating up a big chunk of your profit. But by analyzing cost per tenant, you might now realize that just three small customers are responsible for half of that database cost. Or you might determine that one of your services isn’t inherently expensive to run, but rather that a few tenants are overusing it.
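
Mechanically, this kind of breakdown can start out very simple: take a service’s bill for the period and allocate it to tenants in proportion to whatever usage metric you already collect (requests, CPU-seconds, storage). The function, tenant names, and numbers below are purely illustrative.

```python
# Sketch of allocating a service's cost to tenants in proportion to their usage.
def cost_per_tenant(service_cost, usage_by_tenant):
    """usage_by_tenant: dict of tenant_id -> usage units for the billing period."""
    total_usage = sum(usage_by_tenant.values()) or 1
    return {tenant: service_cost * usage / total_usage
            for tenant, usage in usage_by_tenant.items()}

# Example: $9,000/month of database spend, dominated by three small tenants.
db_costs = cost_per_tenant(9000, {"acme": 2000, "globex": 1500, "initech": 1000,
                                  "bigcorp": 400, "hooli": 100})
```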

Possessing granular cost breakdowns like these lets you avoid a common, costly, and ultimately futile attempt at changing your architecture because budget folks think it’s too expensive. Instead, you can focus your efforts on actions that will improve both the top and bottom lines, as discussed below.

Customer Profitability

Without tenant utilization detail, companies usually just look at customers by revenue. But this often leads to focusing on the wrong customers, as bigger customers may be draining much of your resources. That is, while they might contribute a lot to your top line, they might detract from your bottom line.

A much more useful metric is customer profitability. Are you even making a profit on a customer, and if so, how much? For customers that are highly profitable, you should focus more human resources on interacting with them, encouraging more usage of your platform, ensuring renewals, and so on.

Now what about customers that are not so profitable, or even ones you are losing money on? Those should require a different kind of focus; after investigating why these customers are so costly, you may have a number of options:

  • Flag them to the sales team to raise their price.
  • Move them onto a less expensive environment.
  • Optimize your software so that it handles this customer in a more efficient way.

If you are not already doing so, you might also want to consider the tenant-aware billing model discussed next.

Tenant-Aware Billing Model

Most SaaS companies charge their customers based on a simplistic model, such as per user or per module. But once you know the true cost each customer places on your platform, the billing model can be more sophisticated. For example, you can add additional billing for customers with low profitability.

As shown above, smaller customers sometimes pay less but put a lot of strain on your platform. If you charge those customers more, you can improve your bottom line without raising costs on all your customers, which could risk attrition from your most profitable customers. Regardless of where your platform is hosted, you can use this more advanced cost-based billing model to ensure that each customer pays a fair share of your costs.

Sales Discounts

Related to all this, of course, is the age-old favorite sales tactic of discounting. Finance is always worried about sales reps giving too large a discount to customers and hurting the bottom line. Now with precise data on a customer’s costs, finance can allow discounts for highly profitable customers and not offer them to customers with low profitability. This data-driven approach can align sales and finance to work together to find a balance between customer acquisition and maintaining the bottom line.

Product Usage and Adoption

A less obvious but equally critical area to monitor is how your tenants are using your product. While you already carefully analyze how much activity there is on your platform, understanding how each customer is using your product can be quite powerful.

For example, overall usage of your platform may be normal and even growing, but that simple metric may not reflect the fact that particular key customers are actually lowering their usage of your product. Deeper analysis may also show that customers are not using key or new features. Information on how much each customer is using your product — and more specifically, which features they are or are not focusing on — becomes highly actionable to several key stakeholders in your company.

  • Product Management — By knowing which features each customer is using, PMs can better plan future releases, investigating why certain key customers are not using key features (are they hard to use?) or investing more in feature sets that are gaining traction with large customers. They can also release new features, immediately see which customers are adopting them, and then dig into why.
  • Customer Success — Product utilization is a good proxy for measuring customer satisfaction. Once you know how much each customer is using the product, you have some insight into how satisfied they are. The customer success team can then focus on the low-scoring customers, working to raise satisfaction scores preemptively, before low usage translates into bad review scores and non-renewals.
  • Sales — Sales can monitor utilization and adoption to better gauge where upsell opportunities lie. Sales should always look at this data before they make contact with a customer to know how satisfied the customer is, which features they are using, and how their overall usage is trending. Beyond upsell, this can also improve the renewal rates, which sales reps should be tackling proactively.

Trending

One last topic relating to this discussion is trending. Not only do we want to see what each tenant is doing right now, but we also want to look into the future and make projections. While the operations and finance teams already do this without any tenant visibility by simply looking at overall product and platform usage, they are missing the individual customer trends. Once they have tenant-specific data, they can make this data actionable by projecting the velocity of various customer types.

If a large number of smaller customers are growing quickly, this can be an early warning sign that capacity requirements will grow, perhaps exponentially, as those customers get bigger. Alternatively, if only a handful of very large customers are driving your usage growth, you will be able to increase your capacity more carefully.

Summary

Of course, everything I’ve discussed so far is only a sampling of the ways your business can leverage tenant utilization data to improve both the top and bottom lines, as well as the overall product and the satisfaction of all your customers. Hopefully your business teams can work side by side with the engineering staff to leverage all this new data and make sure it’s actionable within your organization.

Once your organization embraces the value of carefully tracking tenant data, I’m sure you will come up with many more valuable ways to leverage this data and improve the business. As you start doing this, I hope all of you share your more creative and consequential ways of leveraging this data with me!



Is Your SaaS Product “Tenant-Aware”?

A customer-centric approach to SaaS, Part 1

THE PROBLEM

Back in 2010, when I was working at a small startup with a tiny engineering team, I found myself in charge of building a SaaS version of our product, which had previously only been available in an on-premise version. This was no easy task, but after quite a bit of on-the-job learning, homebrew tooling, and many mistakes along the way, we had a successful SaaS offering. To our surprise, however, that was just the beginning of our problems. As we soon learned, our SaaS business was all about the customer, but our platform and tooling were all about the resources. In other words, we had no visibility into the customer (or in SaaS-speak, the tenant). Our platform and tooling were tenant-unaware!

Because our startup was itself an APM vendor, we knew how to monitor our own systems really well; we had rich insights into how our own software was running, but nothing about our tenants. We had great technical and business metrics about the various components of the technology stack and infrastructure, but we had no metrics on our tenants, our customers.

BENEFITS OF IDENTIFYING THE TENANT

Why is it so important to be tenant-aware? From a technical standpoint, you must be tenant-aware so that you can:

  • quickly detect outages that are caused by or affecting specific tenants,
  • determine SLA compliance,
  • enforce tenant isolation, and
  • track overall tenant utilization and performance across your services and tech stacks.

On the business side, tenant-awareness is required if you want to build a customer-centric business model with the ability to:

  • know the operational cost of each customer
  • provide discounts,
  • manage billings, and
  • measure customer satisfaction.

And of course, this is only a partial list of the business data that drives top and bottom line business optimization.

Those tenant-aware benefits are made possible by first being able to collect tenant-specific metrics as you monitor your SaaS cloud and tenant activity, and then making the metrics actionable.

  • To DevOps, actionable means tooling for tenant throttling, tenant migration, and other tenant-management tasks.
  • For the business, actionable means billing based on customer utilization, renewal rate optimization from customer satisfaction detection, and even increased product adoption by looking at feature utilization.

THE SOLUTION

Ok, you are convinced — you must become tenant-aware. But how do you do that? I wish I could say that you just buy one of 20 products out there, install it, and voilà, you now have brilliant dashboards that give you all of the data we just discussed plus control mechanisms for operations and maintenance. Sadly, after decades of growth of SaaS, there is no simple vendor solution currently available.

But wait! We all know that every DevOps team runs dozens of monitoring tools that show the health and performance of systems. When systems degrade, team members quickly try to find the root cause of the problem. However, these efforts work well only when the root cause is a faulty piece of hardware, a configuration error, or buggy software code, all of which DevOps can address. If the root cause instead lies with a tenant, those monitoring tools can’t offer much help. So, unfortunately, that leaves you with the age-old solution: gain visibility into your tenants through your own code and tooling. Let’s look at one possible way of doing this.

IDENTIFYING THE TENANT

A common and relatively simple solution exists, if you are willing to involve developers and change code. Developers must add a unique tenant ID in some environmental context so it is always passed around and visible in every service-to-service and service-to-resource interaction. Once you have changed the code and can identify each tenant, you need to actually capture metrics in production, so you can calculate the two most important metrics for tenant monitoring:

  • Frequency — how often a tenant is utilizing a service or resource
  • Response Time — how much time a service spent responding to a tenant’s request (on average)

Since DevOps already monitors your systems with all sorts of monitoring tools (or uses their own homebrew monitoring implementation), just extend them to capture the tenant ID and break out the data they already report by each tenant. From this data, you will have access to the two metrics mentioned above, and DevOps can graph, analyze, and alert on this in production.
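
As one possible sketch of what that looks like in code, the example below carries the tenant ID in a context variable and records the two metrics per tenant with a decorator. The variable names, the decorator, and the in-memory store are assumptions for illustration; in practice the numbers would be emitted to whatever monitoring system you already run.

```python
# Sketch: propagate a tenant ID through request context and break out the two
# key metrics (frequency and response time) per tenant.
import time
from contextvars import ContextVar
from collections import defaultdict

current_tenant: ContextVar[str] = ContextVar("current_tenant", default="unknown")

# tenant_id -> [request_count, total_response_time_seconds]
tenant_stats = defaultdict(lambda: [0, 0.0])

def track_tenant(service_call):
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return service_call(*args, **kwargs)
        finally:
            stats = tenant_stats[current_tenant.get()]
            stats[0] += 1                                  # frequency
            stats[1] += time.monotonic() - start           # for average response time
    return wrapper

# At the entry point of a request, set the tenant from auth/session data:
#   current_tenant.set(request.tenant_id)
# then any @track_tenant-decorated service call is attributed to that tenant.
```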

All of you engineers out there will probably point out that this seems like an oversimplification and doesn’t tell you everything about the tenant. It is true that there are many more metrics we could come up with and monitor, such as low-level resource utilization, function-level analysis inside each service, and many more. However, these can require a massive amount of engineering and overhead cost while adding only minor utility toward our goals. Most problems that tenants cause can be discovered by focusing on the high-level services with only these two metrics. Not to mention that you probably already have plenty of resource-monitoring tools that track such lower-level metrics.

ROOT CAUSE ANALYSIS

Now that you have your key metrics, how will you use them to be tenant-aware? Let’s turn our attention back to DevOps, whose job it is to keep an eye out for service problems and, when a problem does occur, to quickly find the root cause and fix it. Let’s look at a couple of problems that DevOps can now quickly find with these new tenant metrics.

Noisy Neighbor Problem

This problem occurs when one tenant, potentially even one of your smallest customers, increases their utilization of your services to such an unanticipated degree that it causes performance degradation for some of your other tenants. Looking at both of our new metrics, DevOps can determine whether this is simply due to an increase in tenant requests to one of the services (the first metric, frequency) or whether the issue is something deeper. Perhaps the tenant’s data size has grown and each service request takes much longer to fulfill (the second metric, response time).

Obnoxious Neighbor Problem

This problem arises when the noisy neighbor gets worse and starts to severely impact many (or all) tenants, and potentially even endangers the stability of your services.

MAKING IT ACTIONABLE

OK, so let’s say you’ve managed to get visibility into all this. Now that you notice problems that tenants are causing, what do you do about it? There are technical and business solutions to some of these problems. For now, let’s focus on the technical side.

Going back to the noisy neighbor problem, a tenant that is dangerous to others can be moved to their own environment and isolated. In less extreme cases, tenants can be rebalanced across environments so that no single environment has more heavy tenants than it can support. Ideally, you already have tools to reassign tenants to environments and resources. If so, you can give your operations team dashboards that identify noisy neighbors and link them with the tools they can run to resolve these issues in real time as they come up, limiting the impact on other tenants. We will discuss how to implement and automate these types of actions in a future blog post.

And these are just the technical issues we can solve; we also have to look at the business solutions to tenant problems, in the areas of finance, budget, and pricing. Again, we will explore this in future blogs.

THE FUTURE

I hope this blog post has inspired you to learn more about your tenants and try to be tenant-aware. As you can see, it’s relatively straightforward to get started, and the benefits are immense. And so far we’ve only focused on the technical aspects. Stay tuned for future blogs where we will dive deeper into the business aspects of tenant awareness.

I also didn’t discuss automating remediation of these tenant problems; we’ll dive more into that in the future as well.

Finally, while I claimed that there are no off-the-shelf solutions to this (and just made you go through all those code changes!), there are new solutions being worked on in the industry. We will also explore some of these in future posts. Stay tuned!



Making Single Page Applications (SPA) faster

[Embedded slide deck, 32 slides]


Planning for Failure with Application Performance Management (APM)

If you are running mission-critical applications and do not have a strategy to deal with failure, you are putting your whole organization at risk. You may think that your application cannot fail, but at some point everything fails.  It may not be the software running your application that fails – it could be the hardware, the network, or even a natural disaster in your area that causes your application to go down. In case of such failure, no matter how rare, your customers will still expect the same level of service, not to mention preservation of their data.  Without a failover strategy and a tested backup infrastructure you will be out of service for an unknown period of time, which will lead to angry customers and loss of revenue.

Most of you have some failover strategy in place. I’m sure many of you have spent large amounts of time and money ensuring that your application is resilient to failure, because you understand your app’s importance. But you may still be missing one key component, without which you are still at risk. That component is monitoring, or more specifically, Application Performance Management (APM). While mission-critical applications rely on an APM system to help monitor application performance and health, APM is often forgotten on the failover systems. APM needs to go hand in hand with any failure-testing plan to ensure that your company’s strategy will work in a real emergency.

 

Now that you’ve agreed to incorporate APM into your failure-testing plan, how do you actually make this work? Your application already has an APM solution in place, but now you need to get APM onto the backup system, which may not be running (or may not even exist yet). How you do that depends on whether your application lives in a data center or in the cloud. Let’s explore each of these scenarios.

Companies that run their applications in one or more data centers usually have a backup data center in a different geographic region. If the primary data center fails, the backup data center will handle the entire application load. But will your APM system continue to work as it did on the primary system? If you haven’t installed and configured your APM solution in your backup environment, it won’t. You will need to install agents from your APM vendor on all of the machines in the backup data center, and then configure them to match what you are monitoring on your primary system. That requires bringing up all your backup systems and driving load through them, so you can see that activity in your APM system. Only then can you be sure the APM system is ready in case of failure.

For applications running in the cloud, things are a bit different. In most cases there isn’t a backup system already running. Instead, if a part of the cloud fails, the application must be architected to spin up new nodes in a different region of the cloud that is independent of the failure. As we all learned during Amazon’s recent mega-failure (#cloudfail), companies must be very careful to understand which parts of the cloud are interdependent. In the case of Amazon’s elastic cloud, moving to a different “availability zone” was not enough to prevent failure; only those who could move to a different geographic region were safe.

Assuming the nodes you spin up in a different region use the same machine image as your primary system, your APM system should not require any further installation or configuration. But that’s the easy part. Problems arise when, during a cloud failure, you rapidly start spinning up new nodes in a different region. This is basically a sudden massive burst of new nodes in your system. New nodes coming online create a lot of overhead on the monitoring system, which struggles to register the existence of the new nodes and all of their related data. Companies running very large applications have found that most monitoring systems become unavailable for hours, or even crash, during such sudden bursts. Of course, a failure is exactly when you really want your APM system up and working to monitor the success of your failover plan. This is something to keep in mind when choosing an APM vendor: make sure your vendor has a track record with large cloud applications.

Now you need to actually test all this to make sure everything works as expected. No amount of simulation or architectural review can substitute for a live test run, so be sure you test your failover plan on a regular basis. I’ve seen many companies spend weeks in failover emergency hell because their first attempt at failover didn’t work as expected. But even after you’ve resolved all the breakages in the failover, you are not done. Just because you are pushing load to the backup system, how do you know it is handling that load successfully? Has the failover impacted application performance for your customers? Has the failover impacted reliability and availability? There’s only one way to really know, and that’s from your APM solution’s data. Don’t just fail over; run all your customer load on the backup system, and not just for a night. Run your entire customer load on your backup system for at least a week to experience all the different workload patterns that vary over the day and week. Monitor your system with an APM solution and see whether its performance is the same as before the failover. Carefully compare the data, find any disparities, then fix and retest. Make sure your APM system supports baselines so you can compare against your system’s baseline performance for any time of the day or week.

You may be a bit skeptical at this point. A week seems like overkill, you think. Do I really need to run on the backup system for a week? You’d be surprised how many problems companies have when they run in a failover environment. Alerts are not set up, backups run on the wrong system, configuration files have not been maintained, software is outdated, and so on. Plenty of companies spend more than a week flushing out all these issues, so don’t be in a hurry to declare success and fail back.

By now you should see that testing for failure must include APM to be successful. Make sure you are using the right APM system – one that makes this testing easier, not harder. Your APM system should be easy to install and require little configuration so that changes in your application are automatically discovered and monitored, both on your primary system and on your backup. Failure of any system is never a good thing, but with proper planning and the right tools it should have no impact on your customers and your business.

Boris.



Life After Sharding: Monitoring and Managing a Complex Data Cloud


Slides from Boris Livshutz’ presentation at OSCON 2012.



A Short History of Postgres

The Berkeley Postgres Project

Implementation of the Postgres DBMS began in 1986. The initial concepts for the system were presented in [STON86] and the definition of the initial data model appeared in [ROWE87]. The design of the rule system at that time was described in [STON87a]. The rationale and architecture of the storage manager were detailed in [STON87b].

Postgres has undergone several major releases since then. The first “demoware” system became operational in 1987 and was shown at the 1988 ACM-SIGMOD Conference. We released Version 1, described in [STON90a], to a few external users in June 1989. In response to a critique of the first rule system ([STON89]), the rule system was redesigned ([STON90b]) and Version 2 was released in June 1990 with the new rule system. Version 3 appeared in 1991 and added support for multiple storage managers, an improved query executor, and a rewritten rewrite rule system. For the most part, releases since then have focused on portability and reliability.

Postgres has been used to implement many different research and production applications. These include: a financial data analysis system, a jet engine performance monitoring package, an asteroid tracking database, a medical information database, and several geographic information systems. Postgres has also been used as an educational tool at several universities. Finally, Illustra Information Technologies picked up the code and commercialized it. Postgres became the primary data manager for the Sequoia 2000 scientific computing project in late 1992. Furthermore, the size of the external user community nearly doubled during 1993. It became increasingly obvious that maintenance of the prototype code and support was taking up large amounts of time that should have been devoted to database research. In an effort to reduce this support burden, the project officially ended with Version 4.2.

Postgres95

In 1994, Andrew Yu and Jolly Chen added a SQL language interpreter to Postgres, and the code was subsequently released to the Web to find its own way in the world. Postgres95 was a public-domain, open source descendant of this original Berkeley code.

Postgres95 is a derivative of the last official release of Postgres (version 4.2). The code is now completely ANSI C and the code size has been trimmed by 25%. There are a lot of internal changes that improve performance and code maintainability. Postgres95 v1.0.x runs about 30-50% faster on the Wisconsin Benchmark compared to v4.2. Apart from bug fixes, these are the major enhancements:

  • The query language Postquel has been replaced with SQL (implemented in the server). We do not yet support subqueries (which can be imitated with user defined SQL functions). Aggregates have been re-implemented. We also added support for “GROUP BY”. The libpq interface is still available for C programs.
  • In addition to the monitor program, we provide a new program (psql) which supports GNU readline.
  • We added a new front-end library, libpgtcl, that supports Tcl-based clients. A sample shell, pgtclsh, provides new Tcl commands to interface tcl programs with the Postgres95 backend.
  • The large object interface has been overhauled. We kept Inversion large objects as the only mechanism for storing large objects. (This is not to be confused with the Inversion file system which has been removed.)
  • The instance-level rule system has been removed. Rules are still available as rewrite rules.
  • A short tutorial introducing regular SQL features as well as those unique to Postgres95 is distributed with the source code.
  • GNU make (instead of BSD make) is used for the build. Also, Postgres95 can be compiled with an unpatched gcc (data alignment of doubles has been fixed).

PostgreSQL

By 1996, it became clear that the name “Postgres95” would not stand the test of time. A new name, PostgreSQL, was chosen to reflect the relationship between original Postgres and the more recent versions with SQL capability. At the same time, the version numbering was reset to start at 6.0, putting the numbers back into the sequence originally begun by the Postgres Project.

The emphasis on development for the v1.0.x releases of Postgres95 was on stabilizing the backend code. With the v6.x series of PostgreSQL, the emphasis has shifted from identifying and understanding existing problems in the backend to augmenting features and capabilities, although work continues in all areas.

Major enhancements include:

  • Important backend features, including subselects, defaults, constraints, and triggers, have been implemented.
  • Additional SQL92-compliant language features have been added, including primary keys, quoted identifiers, literal string type coercion, type casting, and binary and hexadecimal integer input.
  • Built-in types have been improved, including new wide-range date/time types and additional geometric type support.
  • Overall backend code speed has been increased by approximately 20%, and backend startup time has decreased by 80%.

