Now that we are two weeks removed from the “World’s Biggest IT Outage,” we can discuss what went wrong, how to prevent such incidents, and best practices around change management.
First, let me say that if you haven’t already, you will cause an outage at some point in your career. Hopefully, it is not one that makes the nightly news, but it will happen—it’s the IT engineer’s Murphy’s Law. What matters is that you own it, learn from it, and take the right steps to reconcile it afterward.
The Importance of Testing
Testing might seem obvious, especially to seasoned veterans in the industry. However, overconfidence can be a risk when you believe that what you’re doing couldn’t possibly be wrong. Always test your code: conduct local testing, unit testing, integration testing, performance testing, etc.
“But that’s a lot of extra work that takes hours!”
Yes, it is. But if you’re working on critical applications or infrastructure, that meticulous testing is exactly what keeps you from causing an impact.
Building Smoke Tests
Build smoke tests into your pipeline that test all core services. Examples:
Can I produce and consume a Kafka topic?
Can I create a pod?
Can I provision and attach storage?
Whatever services you offer should be tested, even if you think they aren’t impacted by your change, because unknown dependencies may exist.
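Below is a minimal sketch of what such pipeline smoke tests might look like, assuming pytest, the kafka-python package, and the official kubernetes Python client; the broker address, topic, namespace, and container image are placeholders, not a prescription.

```python
# Minimal smoke tests for core platform services; run these from the pipeline
# after every change. Broker, topic, namespace, and image are placeholders.
import uuid

from kafka import KafkaConsumer, KafkaProducer
from kubernetes import client, config

BOOTSTRAP = "kafka.internal:9092"  # placeholder broker address
TOPIC = "platform-smoke-test"      # placeholder topic


def test_kafka_round_trip():
    """Can I produce and consume a message on a Kafka topic?"""
    payload = uuid.uuid4().bytes
    producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)
    producer.send(TOPIC, payload)
    producer.flush()

    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BOOTSTRAP,
        auto_offset_reset="earliest",
        consumer_timeout_ms=10_000,
    )
    assert any(msg.value == payload for msg in consumer)


def test_pod_creation():
    """Can I create (and clean up) a pod?"""
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    api = client.CoreV1Api()
    name = f"smoke-{uuid.uuid4().hex[:8]}"
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1PodSpec(
            containers=[client.V1Container(name="pause", image="registry.k8s.io/pause:3.9")]
        ),
    )
    api.create_namespaced_pod(namespace="smoke-tests", body=pod)
    api.delete_namespaced_pod(name=name, namespace="smoke-tests")
```

Wire the tests in so that a failure blocks the rollout from progressing any further.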
Monitoring Service Health
Use dashboards to monitor service health continuously. This proactive approach can help you catch issues early before they escalate into significant problems.
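One possible way to feed those dashboards, assuming a Prometheus/Grafana stack, is to export simple health probes as metrics; the service names, endpoints, and port below are hypothetical.

```python
# Minimal health exporter: probe core services and expose the results as a
# Prometheus gauge that Grafana can graph. Endpoints and port are hypothetical.
import time

import requests
from prometheus_client import Gauge, start_http_server

SERVICE_HEALTH = Gauge("service_up", "1 if the service health check passed", ["service"])

ENDPOINTS = {
    "kafka-proxy": "http://kafka-proxy.internal/healthz",
    "storage-api": "http://storage-api.internal/healthz",
}


def probe() -> None:
    for name, url in ENDPOINTS.items():
        try:
            ok = requests.get(url, timeout=5).status_code == 200
        except requests.RequestException:
            ok = False
        SERVICE_HEALTH.labels(service=name).set(1 if ok else 0)


if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes this port
    while True:
        probe()
        time.sleep(30)
```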
The Necessity of Multiple Environments
Any IT professional might say, “Yeah, duh,” when it comes to having multiple environments. However, upper management often looks at the bottom line and asks, “Why do we have so many environments?” when they seek to cut costs, and when budgets are drawn up for new environments, the ongoing cost of maintaining the lower environments is often overlooked.
It’s critical to have at least three environments for testing, and if you are a platform or infrastructure team, four environments are ideal:
Testing Environment:
This environment has no customers in it, whether internal or external. Even if internal customers need a “test” environment, platform teams require a space to break things without worrying about impacting anyone.
Development Environment:
This is where application teams test their code. Platform teams should partner with a few application teams when releasing services or updates to ensure no impact on their applications.
Non-Production Environment:
This environment is solely for testing what will be released to production. No manual changes should occur here. Applications should be released with CI/CD, and infrastructure and services should be built with Infrastructure as Code.
Production Environment:
This is where you accept live traffic, and there is a monetary impact if you break it.
What upper management often doesn’t understand is that for a platform team, Development, Non-Production, and Production environments are all technically “Production.” Breaking the Development environment means none of the application teams can work that day. Breaking Non-Production means no one can test for Production releases. The Dev/Non-Prod/Prod nomenclature indicates what application teams should use the environment for. Obviously, the Production environment is the most impactful because if it goes down, there is monetary impact. However, other environments must remain operational for internal teams to work effectively.
Ensuring Like-for-Like Environments
It’s also crucial to have like-for-like environments or as close to them as possible. While this increases costs, not testing in like-for-like environments risks unknown consequences from environment discrepancies. At a minimum, Non-Prod and Prod environments should be like-for-like: same size/type instances, same OS versions, package versions, etc.
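Infrastructure as Code makes like-for-like far easier to enforce, because every environment is built from the same definition with only the environment name changing. Here is a minimal sketch using Pulumi’s Python SDK on AWS; the resource names, AMI ID, and tags are placeholders.

```python
# Minimal Pulumi program: the same instance definition is deployed to every
# environment so Non-Prod stays like-for-like with Prod. Names are placeholders.
import pulumi
import pulumi_aws as aws

config = pulumi.Config()
env = config.require("environment")  # "dev", "nonprod", or "prod"

instance = aws.ec2.Instance(
    f"platform-node-{env}",
    ami="ami-0123456789abcdef0",  # placeholder AMI ID, pinned across environments
    instance_type="t3.large",     # same size/type everywhere
    tags={"environment": env, "managed-by": "pulumi"},
)

pulumi.export("instance_id", instance.id)
```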
Rolling Out Changes
The biggest issue in the “World’s Biggest IT Outage,” in my opinion, was the decision to roll out the change everywhere at once. That should never happen, even if it’s a small change you believe has no impact. Yes, a staged rollout is painful, slow, and mind-numbing, but it can save the company a lot of money and reputational damage if the change does cause an outage. If the “World’s Biggest IT Outage” had only impacted one customer, we probably wouldn’t have heard about it. But because it took down airlines, 911 call routing, hospitals, etc., it was on the news.
Strategies for Rolling Out Changes
Following one of the strategies below will minimize the impact of any outage. Some are easier to adopt than others, depending on your environment, but it is important to choose and set your release strategy before you start taking live traffic or onboarding customers.
Canary Releases: Deploy changes to a small subset of users first, monitor the impact, and then gradually roll out to more users.
Blue-Green Deployments: Maintain two identical production environments (blue and green). Direct traffic to one while the other is updated and tested.
Feature Toggles: Use feature toggles to enable or disable features without deploying new code, allowing you to test new features gradually.
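As a rough illustration of the feature-toggle approach, a percentage-based toggle lets you expose a change to a small, deterministic slice of users first, canary-style, and widen it as you monitor. The flag names and percentages below are made up.

```python
# Minimal percentage-based feature toggle. Flags and rollout percentages are
# examples; real systems usually store these in a config service, not in code.
import hashlib

# Flag -> percentage of users who should get the new behavior (0-100).
FLAGS = {
    "new-checkout-flow": 5,   # canary: start small, raise after monitoring
    "improved-search": 100,   # fully rolled out
}


def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket a user into 0-99 and compare to the rollout %."""
    pct = FLAGS.get(flag, 0)
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < pct


# Usage: the old code path stays deployed; the toggle decides which path runs.
if is_enabled("new-checkout-flow", user_id="customer-1234"):
    pass  # new behavior
else:
    pass  # existing behavior
```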
So You’ve Caused a Major Outage—Now What?
Let's say you followed all of the above and still ended up causing an outage. It has happened to me. I once updated nginx and caused outages to production applications: some customers had misconfigured ingresses, and after the update their traffic stopped flowing. It would have been hard to test for this scenario, because I would have needed a misconfigured ingress of my own to know the update would break it.
But once you have caused an outage, what should you do?
Rollback the Latest Changes (if possible):
Immediately roll back the changes to restore service quickly and minimize the impact.
Roll back ANY changes to the application or environment that happened around the time of the outage, even if you think they are unrelated to what is going on, until it is determined what caused the outage. (A minimal rollback sketch follows these steps.)
Find and Fix the Issue:
If you cannot roll back the change, start a call with the engineers on the team that manages the application/platform/service that is experiencing the outage.
Include everyone, even the junior engineers. They may not be much help with the fix, but it is important for them to “be a fly on the wall” and learn, and they can answer customers’ questions while the other engineers are fixing the issue.
Investigate the root cause of the outage and apply a fix to prevent recurrence.
If you could not roll back and had to find the issue on the call, then once you know what caused it you just need to fix it: your automation, a typo in the code somewhere, a version upgrade with unforeseen consequences. Whatever it is, ensure that it is not only fixed manually but fixed so it won’t happen again. Do it as soon as possible so the broken change is not forgotten and reapplied.
Conduct a Root Cause Analysis (RCA):
Perform an RCA to understand what went wrong and identify preventive measures for the future.
Send this out to all impacted internal customers. It is important to let people know what happened; I have been on the other side of the coin, and it is frustrating not knowing what is happening or why. There is no point in “hiding” the RCA out of shame or to save face. The company succeeds and fails together, and by sharing the RCA, others can learn.
It is also important that the RCA be blameless; do not say “Andrew did X, which caused Y.” Own it as a team. Even if one person wrote and released the code, more than likely there was a code review and the team knew that an update or release was going out.
For external customers, let PR handle what they want to go out.
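For the rollback step above, here is a minimal sketch assuming the change was a Kubernetes Deployment rollout and that kubectl is available; the deployment and namespace names are examples only.

```python
# Minimal rollback sketch: shell out to kubectl to undo the most recent
# Deployment rollout, then wait for it to settle. Names are examples only.
import subprocess


def rollback(deployment: str, namespace: str) -> None:
    """Revert a Deployment to its previous revision and block until it is healthy."""
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )
    subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )


if __name__ == "__main__":
    rollback("ingress-nginx-controller", namespace="ingress-nginx")
```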
By implementing robust testing, maintaining multiple environments, and rolling out changes carefully, you can significantly reduce the risk of causing major outages and ensure a smoother, more reliable IT operation.
Remember, no one is perfect, and causing an outage is a when, not an if. But if you are diligent about following the steps above, you can minimize both the impact and how often outages occur. And if your team or company doesn’t have these steps in place, it’s never too late to introduce them. Be a leader and speak up.