Turning Technical Failure into Success


During Boulder Startup Week 2014, I was grateful to have the opportunity to take part in a panel discussing technical failures and to share stories from the trenches in which overwhelming failure was turned into triumph. It was a true pleasure to share the panel with Ned McClain of AppliedTrust.com, Christian Vanek of SurveyGizmo.com, and Jeremy Frazao of Kiva.org. The panel was moderated by Ryan Angilly, founder of Ramen.is. The format consisted of topic introductions from the moderator, followed by interactive questions from the audience of approximately 80 attendees.

We started off discussing what technical failure means, with responses ranging from the literal to the philosophical. Definitions included “a break in information flow, literal blockage of system processing,” “customer downtime, or inability to deliver functionality or product to the end user,” as well as other more nuanced meanings.

From there, we moved into some storytelling, where each member of the panel shared personal experiences of tech failures and the end results of those events as they affected their teams, businesses and customers. The stories ranged from “catastrophic success” type events to malicious DDoS attacks by extortionists, to preventable outages caused by an expired AmEx card.

In the cases of catastrophic success, one was due to a very young company being promoted in the media without sufficient technical resources at hand, while the other was a fully armed and operational web-scale system going into a huge live event after full preparation and testing. In both cases, the post-mortem and system hardening work that followed led to the eventual success of the ventures, by stabilizing their platforms and allowing for steady growth after the learning period and refactoring.

The case of the expired AmEx card was a business and process failure that led to a complete technical breakdown: recurring hosted services were turned off by the provider, resulting in a total system outage. The fix going forward was to change the off-boarding procedure and consolidate financial processes in a centralized role.

The DDoS event was brought on by a malicious third party attempting to extort a service provider. The attack was perpetrated by targeting exposed base IPs of the provider's system, which were unintentionally made public via email headers. The attack was mitigated by implementing layered DNS proxy services to obfuscate the source IPs, thus eliminating the pinning vector used by the denial of service attack.
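For readers curious how an email header can give away a server address, here is a minimal Python sketch of the kind of check that could catch such a leak before mail goes out. It is not what the panelist's team ran; the `ORIGIN_NETWORKS` range, the `leaked_origin_ips` helper and the sample message are all assumptions made up for illustration, and the address range used is one reserved for documentation.

```python
import ipaddress
import re
from email import message_from_string
from typing import List

# Hypothetical origin range that should never appear in outbound mail.
# 203.0.113.0/24 is reserved for documentation; substitute real ranges.
ORIGIN_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]

# Loose IPv4 matcher; candidates are validated with ipaddress below.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def leaked_origin_ips(raw_message: str) -> List[str]:
    """Return "Header: ip" entries for origin IPs found in any header."""
    msg = message_from_string(raw_message)
    leaked = []
    for name, value in msg.items():
        for candidate in IPV4_PATTERN.findall(str(value)):
            try:
                addr = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # looked like an IP but was not a valid one
            if any(addr in net for net in ORIGIN_NETWORKS):
                leaked.append(f"{name}: {candidate}")
    return leaked


# Made-up outbound message whose headers expose the app server address.
SAMPLE = (
    "Received: from app01.example.com (app01.example.com [203.0.113.17])\n"
    "\tby mail.example.com with ESMTP; Tue, 13 May 2014 10:00:00 -0600\n"
    "X-Originating-IP: [203.0.113.17]\n"
    "From: support@example.com\n"
    "To: customer@example.org\n"
    "Subject: Your receipt\n"
    "\n"
    "Thanks for your order.\n"
)

if __name__ == "__main__":
    for entry in leaked_origin_ips(SAMPLE):
        print("possible origin IP leak ->", entry)
```

A check like this only flags the leak. The mitigation described by the panel, putting proxy layers in front of the real addresses, still has to happen at the infrastructure level.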

In every single case, there was an overriding theme of transparency, teamwork and acceptance of imperfection. There were no circular firing squads, no witch hunts and no fronting of excuses to the customers and end users. This honesty and goodwill fostered stronger teams, stronger customer relationships and more reliable systems and processes.

We concluded by discussing ‘the one most important thing’ in preventing failure. We all agreed that not all technical failure can be prevented, and that failure must be embraced so that when it does happen, the response is appropriate and conducive to continued business and relationships, both internally with the team and externally with customers. A practical suggestion from the panel: DevOps. “You need people who can understand, traverse and troubleshoot the entire stack in real time in order to mitigate loss and downtime as it is occurring.”

Another on-point answer: Transparency. “Customers deserve to be informed as to what is going on, not just a standard splat page. Utilize social media and other means to accurately inform the user base of what is going on. Also internally, real-time triage reporting needs to be communicated regularly up the chain to keep all parts of the business current and in the loop.” It was recommended that a designated person, such as a tech manager, be the contact point for such communication, in order to allow the technical team to focus on resolution activity.

Such honest conversation and sharing about the worst things that can happen in a technical operations environment is a rare and valuable experience. The takeaway lessons of transparency, improvement through reflection, and team solidarity are ones that any organization can benefit from, as failure is simply a part of tech, business and life. Success comes from learning from those mistakes, and becoming better for them.

“Fall down seven times. Get up eight”
