What role does site reliability engineering play in preventing major IT outages and ensuring continuous, resilient service? A significant one.
We experienced a major IT outage some weeks ago that left organizations worldwide reeling. A faulty software update from CrowdStrike affected 8.5 million Microsoft Windows devices, causing flight disruptions, hospital appointment delays, and news broadcast interruptions. Although a quick fix restored many services, fully resetting all impacted computers took time. This raises a critical question: how can we prevent such disruptions in the future?
See also: Data Reliability Engineering: You Can’t Fly Blind in the Clouds
The Ripple Effect of Modern Technology Failures
The recent outage illustrates the complexity of modern technology systems and our growing dependence on them. It shows how a single failure can have widespread consequences, affecting businesses and individuals globally.
While there is no perfect solution, we must prioritize and rigorously measure reliability at every stage. Systemic risk of this kind now spans the entire IT sector and touches everyone who depends on it, so we need to identify potential failure points and prepare for them proactively.
Site Reliability Engineering: Building Resilience Through a Backup Chain
A key step in building resilience is ensuring every component and dependency in our systems has a fallback plan. Backups are not just about data—they’re about the resiliency of the entire solution. Think of it as a backup chain: if one element fails, the user seamlessly falls back to another, whether moving from a mobile app to a web app, switching from a phone system to office support, or accessing help in person. This approach should extend beyond technology to include processes and staffing, ensuring every scenario has a contingency.
This notion of a backup chain represents a shift in how we think about business continuity. It’s not just about storing copies of data but about providing multiple layers of operational redundancy to support business functions in real time. This multi-layered backup strategy needs to include all critical interfaces and touchpoints to ensure minimal impact on user interaction, even when one component goes down.
For example, if an organization’s cloud services become unavailable, its employees should have secure access to alternative infrastructure or platforms, allowing them to continue working. Similarly, if a mobile app experiences a sudden outage, users should be able to access the same functionalities through a web app or desktop interface without experiencing significant disruption. This way, the end-user experiences continuity, even if a system they initially relied on fails.
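The backup-chain idea above can be sketched in a few lines: try each channel in priority order and fall back to the next when one fails. This is a minimal illustration; the channel names and error handling are hypothetical, not a real framework.

```python
from typing import Callable, List


def call_with_fallbacks(channels: List[Callable[[], str]]) -> str:
    """Try each channel in the backup chain in order; return the first success."""
    errors = []
    for channel in channels:
        try:
            return channel()
        except Exception as exc:  # in production, catch specific, expected errors
            errors.append(exc)
    # The chain is exhausted only when every layer of redundancy has failed.
    raise RuntimeError(f"all {len(channels)} channels failed: {errors}")


# Illustrative channels: the mobile backend is down, the web app absorbs the load.
def mobile_app() -> str:
    raise ConnectionError("mobile backend is down")


def web_app() -> str:
    return "served via web app"


result = call_with_fallbacks([mobile_app, web_app])
print(result)  # → served via web app
```

The same pattern extends beyond code: the "channels" could equally be a phone system, office support, or in-person help, with a runbook playing the role of the dispatcher.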
Reliability also means maintaining a clear error budget and measuring dependencies. In the Software Development Life Cycle (SDLC) and release management, strategies like canary releases, blue-green deployments, and rolling updates help ensure zero downtime during upgrades. However, this recent outage reminds us that reliability must be a consideration on both the production and consumer sides.
Producers of new software versions or APIs might have robust deployment strategies, but are the consumers equally prepared? Do they have backup plans to manage potential disruptions, rollback options, or phased updates across different regions? Consumer-side reliability is just as critical as production-side reliability and must be supported with disaster recovery and continuity plans.
Testing and Preparing for Future Outages
Organizations must also invest in regularly stress-testing their backup systems. A plan is only as good as its execution, and testing these plans in real-world scenarios is the only way to ensure they work when most needed. This could include simulating region-wide cloud failures, office network outages, or rolling application crashes to see how well teams respond and how quickly systems recover. A well-rehearsed recovery plan can make the difference between hours of downtime and a seamless transition to alternative systems.
The recent incident should serve as a wake-up call for IT leaders. The question is not whether another major outage will happen but when. Regional, cloud-based, or system-wide failures are inevitable, and we must aim to make these disruptions manageable, recover swiftly, and, ideally, prevent them from escalating.
By adopting a proactive Site Reliability Engineering (SRE) approach, incorporating comprehensive contingency plans, and ensuring reliability across both producer and consumer environments, we can better navigate the complexities of modern technology. Ultimately, this is about more than mitigating risks—it’s about guaranteeing continuous, resilient service in an interconnected world.
Yuri Gubin is the Chief Innovation Officer at DataArt, where he helps clients innovate, drive change, solve complex technology problems, and rebuild their businesses. He is a leading member of the Solution Architect Board and the Enterprise Board, serves on a number of committees, and contributes to the AI and Cloud labs. He is passionate about architecture and about finding solutions to problems both technical and organizational, from solution architecture to IT governance models. He is a Stanford LEAD alumnus and an NACD.DC director.