The Myth of Alerting Services in IT

Published on Aug. 04, 2014

When VictorOps is compared to PagerDuty, that comparison is understandable on the surface. Both services allow on-call team scheduling and rotation, both services will notify you of an incident in your IT infrastructure, both allow for escalation. In short, both systems will “tell you that you have a problem”.

The difference, however, is that VictorOps (VO) is a collaboration platform, not just an alerting platform. Our vision is to be “in the fight” and actually help teams resolve problems faster. Simply put, our mission is to shorten Time To Resolution (TTR) so that people get their lives back faster and companies have fewer costly outages.

One of the advantages of having previously built two fairly large-scale companies from the ground up is that we have a good deal of data from real teams. This data becomes even more telling now that our platform has launched. As we compare our own data to the companies using the VO platform in Alpha and now Beta, a picture that was pretty clear becomes even clearer.

THE MYTH: An Alerting Service can significantly reduce Time to Resolution (TTR).

This commonly held belief is actually more false than true. That may seem like a surprising statement from a company that provides an alerting platform as part of its service. But the fact is that alerting platforms don’t contribute as much as you may think to system uptime.

Consideration #1: Nearly every SaaS business already has some sort of alerting mechanism in place, whether they built a tool in-house on top of their monitoring system or they use PagerDuty, OpsGenie or VictorOps. Because of this, the alerting phase of problem resolution tends to be binary. If you have any alerting in place, the gain has been realized. If you have no alerting in place, it has not. Realistically, an alerting platform of any kind is just table stakes. You can argue that one is a minute faster than another, but when you consider that the average incident is around 45 minutes in length, which system you use for alerting is largely irrelevant as long as you are using something.
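To put that one-minute difference in perspective, here is a back-of-the-envelope sketch. It assumes the roughly 45-minute average incident mentioned above; the function name and the one-minute saving are illustrative, not measured values.

```python
def relative_gain(minutes_saved: float, incident_minutes: float) -> float:
    """Return the fraction of an incident's duration removed by a faster alert."""
    if incident_minutes <= 0:
        raise ValueError("incident_minutes must be positive")
    return minutes_saved / incident_minutes


# Illustrative inputs: a 45-minute incident, one tool alerting a minute faster.
gain = relative_gain(minutes_saved=1, incident_minutes=45)
print(f"A one-minute head start is {gain:.1%} of the incident")  # roughly 2%
```

Under these assumptions, the choice between two working alerting tools moves the total by about two percent.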

Consideration #2: If you break down a typical “incident resolution” into phases, you see that only a small portion of the time is spent “being alerted to the problem”. On average, we have seen that at most 10% of the total TTR has anything to do with alerting or escalation of problems. There are incidents where a team member does not respond, but this is generally more about the team member than about the platform's ability to find him or her.


The alerting phase was historically a larger portion of TTR, back when teams actually carried “pagers”, as those systems were quite slow. Now that team members have smartphones, human behavior has changed to be more “always-on” and engaged with that device. Alerting has come along for the ride with more ubiquitous connectivity and overlapping networks of WAN, LAN and SMS data.

Nonetheless, the truth of the matter is that a perfect “zero time” alerting platform that could find people instantly could only reduce average TTR by 10% in the best-case scenario.
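The arithmetic behind that ceiling can be sketched in a few lines. This is a rough illustration only: it treats the roughly 45-minute average incident cited earlier as the average TTR and takes the 10% alerting share as an upper bound, and the function and constant names are hypothetical.

```python
def best_case_ttr(avg_ttr_minutes: float, alerting_share: float) -> float:
    """Return the TTR that remains if the alerting phase took zero time."""
    if not 0 <= alerting_share <= 1:
        raise ValueError("alerting_share must be between 0 and 1")
    return avg_ttr_minutes * (1 - alerting_share)


AVG_TTR_MINUTES = 45   # assumption: average incident length cited in this post
ALERTING_SHARE = 0.10  # assumption: upper bound on alerting's share of TTR

remaining = best_case_ttr(AVG_TTR_MINUTES, ALERTING_SHARE)
saved = AVG_TTR_MINUTES - remaining
print(f"Perfect alerting saves at most {saved:.1f} min; {remaining:.1f} min remain")
```

Even with instantaneous alerting, about nine-tenths of the incident is still spent on everything that happens after the team knows about the problem.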

We have a saying around VO…

We don’t want to provide a new way to admire the problem. We want to build a platform that helps teams solve the problem.

Knowing about the problem is just one piece of the puzzle. In Part 2 of this post, I’ll illustrate the differences between alerting and collaboration.
