I’m a big believer in the old adage, “Those who do not learn from history are doomed to repeat it.” You need to examine your mistakes, figure out why and how you made them, and then use those lessons so you can do better the next time. In addition, I would prefer to learn from other people’s mistakes rather than committing them myself — it’s gain without the pain. So I asked myself the question, “What are the biggest IT fails in history?” And then the more important follow-up, “What can I learn from them?”
I wanted to find the failures that were significant and spectacular. They had to be impactful and memorable. There are plenty of stories about failed IT projects that “just” wasted a lot of money, such as one from the Air Force, and there’s a lot to be learned from them, too. But I wanted the projects that culminated in a momentous, go-out-in-a-blaze-of-glory, end-up-on-the-evening-news kind of failure.
Here’s my short list along with the lessons I took away from each.
1. Ariane 5 Flight 501
On June 4, 1996, Ariane 5 Flight 501 suffered a catastrophic failure, with a total cost of $370M for both the rocket and its payload. The source of the failure was the conversion of a 64-bit floating point number, related to the rocket’s horizontal velocity, into a 16-bit signed integer. The value was too large to fit, and the resulting unhandled overflow error crashed the guidance computers (the primary and its backup, which ran the same code) and caused the rocket to self-destruct just 40 seconds into its flight. The designers of the Ariane 5 reused some of the software from the Ariane 4 (a noble aim), but didn’t account for the significantly different flight characteristics of the more powerful Ariane 5. The result: a big boom.
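To make the failure mode concrete, here is a minimal Python sketch of a range-checked conversion to a 16-bit signed integer. It is an illustration only, not the actual flight software; the names `to_int16_checked` and `ConversionOverflow` are hypothetical.

```python
import math

# Range of a 16-bit signed integer.
INT16_MIN, INT16_MAX = -32768, 32767


class ConversionOverflow(ValueError):
    """Raised when a value cannot be represented as a 16-bit signed integer."""


def to_int16_checked(value: float) -> int:
    """Convert a float to a 16-bit signed integer, rejecting out-of-range input.

    The value is truncated toward zero before the range check, so 32767.9
    is accepted (as 32767) while 32768.0 is rejected.
    """
    if not math.isfinite(value):
        raise ConversionOverflow(f"{value!r} is not a finite number")
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise ConversionOverflow(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated


if __name__ == "__main__":
    # Exercise the boundaries, not just a comfortable mid-range value.
    for sample in (0.0, 32767.9, -32768.0, 32768.0):
        try:
            print(sample, "->", to_int16_checked(sample))
        except ConversionOverflow as error:
            print(sample, "-> rejected:", error)
```

The check is trivial. What matters is that the caller decides what happens when it fails, and that tests feed in values from the new operating range rather than only the old one.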
You can get all of the details in the full failure report. You can also see a friend and colleague use this event in his talk Unindented Code Cannot Possibly Work.
Lessons: Test your assumptions, edge cases, and boundary conditions. Perform integration testing. In fact, just test. A lot. Preferably using automation.
2. Knight Capital Group
On August 1, 2012, the Knight Capital Group caused a major disruption in the stock market when its automated trading software glitched. A rewrite of one of its programs repurposed a flag that had previously activated testing code meant only for a controlled environment (i.e., not production). The rewrite was deployed to only seven of the eight production servers. When the repurposed flag was set, the eighth server, still running the old code, activated that testing code, and chaos ensued. The result: Knight Capital lost $440M in 45 minutes before they shut down the problematic code.
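One way to guard against a partial rollout like this is to check that every server reports the expected build before a repurposed flag is switched on. The sketch below is a simplified illustration, not Knight’s actual tooling; the host names, version strings, and function names are all hypothetical.

```python
def find_stragglers(deployed: dict[str, str], expected: str) -> list[str]:
    """Return the hosts whose deployed version differs from the expected one."""
    return sorted(host for host, version in deployed.items() if version != expected)


def safe_to_enable_flag(deployed: dict[str, str], expected: str) -> bool:
    """Allow the flag only when every host runs the expected version."""
    return not find_stragglers(deployed, expected)


if __name__ == "__main__":
    # Hypothetical fleet: seven servers got the rewrite, one was missed.
    fleet = {f"server-{n}": "v2" for n in range(1, 8)}
    fleet["server-8"] = "v1"

    print("Stragglers:", find_stragglers(fleet, "v2"))
    print("Safe to enable flag:", safe_to_enable_flag(fleet, "v2"))
```

In practice the version map would come from an automated deploy pipeline rather than a hand-built dictionary, which is the point of the first lesson below.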
You can get more details from the SEC’s administrative proceeding.
Lessons: Automate your builds and deploys. Use automated testing to catch unforeseen problems before they reach production. Practice good coding hygiene through refactoring and eliminating stale, unused code.
3. HealthCare.gov
On October 1, 2013, HealthCare.gov launched to help U.S. citizens find and sign up for health care as part of the Affordable Care Act (a.k.a. Obamacare). The website had about 4.7 million visitors on launch day. Out of those 4.7 million visitors, a grand total of 6 people were able to sign up for a health care plan. Not 60,000. Not 600. Just 6. The federal government spent the next several months working around the clock to stabilize the site and improve performance and usability. If that weren’t enough reason for HealthCare.gov to end up on this list, can you think of another website that got that much attention for that long from the President of the United States? So many failures of people, process, and technology led to the disastrous launch that it’s hard to know where to start in listing them all. The result: As of early 2014, CMS estimated the federal government had spent $824 million on HealthCare.gov, more than twice the original estimate. Plus a lot of bad press and a lot of good jokes.
You can read a full account of the lead up, launch, and recovery of HealthCare.gov in the HHS OIG report.
Lessons: So many… So many… But to sum up: be agile. Don’t just do Agile.
4. Equifax
On September 7, 2017, Equifax announced a data breach that affected 143 million Americans. The breach occurred between mid-May and July 2017 and was discovered by Equifax on July 29. The attackers exploited a vulnerability in Apache Struts, an open source software component used in many websites. Here’s the issue: the vulnerability was known as of March 10, 2017, at least two full months before the breach occurred. The result: One of the largest data breaches in history affected roughly half of American consumers, and Equifax lost about 35% of its stock price in the days following the announcement.
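Catching a known vulnerability before attackers do comes down to comparing what is installed against what has been fixed. Here is a minimal Python sketch of that comparison. The component names, version numbers, and advisory data are made up for illustration, and it assumes plain dotted numeric versions with a single “first fixed” version per component, which real advisories often do not follow.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '2.3.31' into (2, 3, 31)."""
    return tuple(int(part) for part in text.split("."))


def find_unpatched(
    inventory: dict[str, str], advisories: dict[str, str]
) -> list[tuple[str, str, str]]:
    """List components whose installed version is older than the first fixed one.

    inventory maps component name -> installed version.
    advisories maps component name -> first version containing the fix.
    Each finding is (component, installed_version, fixed_version).
    """
    findings = []
    for component, installed in sorted(inventory.items()):
        fixed = advisories.get(component)
        if fixed is not None and parse_version(installed) < parse_version(fixed):
            findings.append((component, installed, fixed))
    return findings


if __name__ == "__main__":
    # Hypothetical inventory and advisory feed.
    inventory = {"web-framework": "2.3.31", "logging-lib": "1.4.0", "json-parser": "3.0"}
    advisories = {"web-framework": "2.3.32", "logging-lib": "1.2.9"}

    for component, installed, fixed in find_unpatched(inventory, advisories):
        print(f"{component}: {installed} installed, fixed in {fixed}")
```

The hard part in a real environment is building an accurate inventory in the first place, but once it exists a check like this can run on every new advisory.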
Lessons: Know what software is running in your environment. Keep it up-to-date. Have a way of detecting new vulnerabilities and patch them quickly.
If you think there are other failures that belong on this list, I’d love to hear about them. More importantly, I’d love to hear about the lessons that can be learned from them so we’re not doomed to repeat them.