Nobody anticipated that all the data centers in lower Manhattan would go offline when Superstorm Sandy hit in 2012. But they did, and it was days before electricity, network connectivity, and access to those data centers returned. Some businesses never recovered from that outage; they simply lost too much money and too much business. Others survived, but at considerable cost.
You can’t anticipate a Superstorm Sandy. However, you can develop a disaster recovery (DR) plan designed to ensure that you have access to your data and critical applications when disaster unexpectedly strikes.
Building out a remote cloud infrastructure
At the heart of a disaster recovery plan is a DR site geographically distant from your primary data center. From this site, you could run your key applications and databases if the “local” infrastructure supporting your day-to-day operations were compromised. Azure, AWS, and Google Cloud Platform (GCP) can all support the creation of a DR infrastructure in a remote region.
It doesn’t need to be on the other side of the world; it only needs to be in a region unaffected by whatever calamity has compromised your primary region. If your data center sits in Manhattan, your DR infrastructure might sit in a cloud data center in the Midwest. If a disaster takes your primary region offline, you could spin up that remote infrastructure and run your critical operations from there. When conditions allow you to use your local infrastructure again, you could move those operations back.
Mobilizing quickly and efficiently
Of course, the key to minimizing disruption in the face of a disaster is the ability to spin up those DR services quickly. You need to make sure that your remote infrastructure has up-to-date copies of your critical applications and data. An application such as SQL Server is easy to install in the remote location. It just needs to sit in standby mode until you call the DR infrastructure into service. But the active data in your production SQL Server infrastructure? Ensuring you have an up-to-date copy of your production data is a bit more complicated due to the physical distances separating your production system from the DR infrastructure.
You could use the Availability Groups (AG) feature of SQL Server to replicate production data to the remote DR infrastructure. Given the distances involved, you’d likely use asynchronous rather than synchronous replication (synchronous replication is better suited to high availability configurations that span multiple data centers within a single region, where network latency is low). Asynchronous replication reliably replicates data from the production site to the DR site. However, because of network latency and other factors, the data stored at your DR site is unlikely, at any given moment, to be perfectly synchronized with the data stored in your production system.
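As a rough sketch of what such a configuration looks like, the following T-SQL adds a remote replica in asynchronous-commit mode to an Availability Group. The server names, database name, and endpoint URLs here are hypothetical; the exact options depend on your environment and edition:

```sql
-- Hypothetical example: an Availability Group with a remote DR replica.
-- Server names, database name, and endpoint URLs are illustrative only.
CREATE AVAILABILITY GROUP [AG_DR]
FOR DATABASE [SalesDB]
REPLICA ON
    N'PROD-NODE' WITH (
        ENDPOINT_URL      = N'TCP://prod-node.example.com:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,   -- local primary
        FAILOVER_MODE     = AUTOMATIC),
    N'DR-NODE' WITH (
        ENDPOINT_URL      = N'TCP://dr-node.example.com:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,  -- remote DR copy
        FAILOVER_MODE     = MANUAL);              -- DR failover is a deliberate act
```

Because the DR replica runs in asynchronous-commit mode, failover to it cannot be automatic; setting FAILOVER_MODE to MANUAL reflects the fact that moving to the DR site is a deliberate operational decision.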
Ensuring consistent access
If you need to bring your DR infrastructure online suddenly because your production infrastructure has gone offline, you may discover that several seconds’ worth of transactions are missing from the DR infrastructure; those updates had not yet arrived when the production system went offline. Still, you would have nearly all the asynchronously replicated data in the SQL Server database, and you would be able to continue running your SQL Server-based applications from the infrastructure in the remote cloud data center.
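SQL Server exposes both sides of this scenario: a standard dynamic management view shows how far behind the DR replica is, and a forced failover brings it online when the primary is gone. The AG name below is hypothetical; the DMV and its columns are standard:

```sql
-- On the DR replica: estimate how much data has not yet arrived or been applied.
SELECT DB_NAME(database_id)       AS database_name,
       synchronization_state_desc,
       log_send_queue_size,       -- KB of log not yet sent from the primary
       redo_queue_size            -- KB of log received but not yet redone
FROM sys.dm_hadr_database_replica_states;

-- On the DR replica, after the primary is lost: force failover,
-- accepting the possible loss of transactions that never arrived.
ALTER AVAILABILITY GROUP [AG_DR] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```

The FORCE_FAILOVER_ALLOW_DATA_LOSS option is named the way it is for a reason: it is the command you reach for precisely when those last few seconds of transactions may be unrecoverable.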
Of course, if you’re not using SQL Server or another application that provides its own data replication services (or if the constraints of the AG feature in SQL Server Standard Edition, which limits each Basic Availability Group to a single database, preclude you from using it to support your DR needs), you’ll need to find another way to replicate your critical data to the DR infrastructure. This is where SANless Clustering tools fit in.
SANless Clustering tools provide the same kinds of data replication services described above, but SANless Clustering tools are application agnostic. They will replicate all the data on the identified production storage volume to storage attached to the DR environment (unlike the AG functionality in SQL Server, which only replicates user-named SQL Server databases). In the event of an unforeseen outage of your production environment — no matter what applications and databases are involved — SANless Clustering tools ensure that you can access all the data important to your operations.
Proactively responding to disaster
It’s worth noting that in the face of an imminent disaster, you could proactively move operations to your DR infrastructure. If you pause your transactional systems briefly so that your latest transactions can be written to the distant DR infrastructure, you can then bring up the DR infrastructure before the production infrastructure goes offline and continue operations from it with no loss of transactional data.
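With SQL Server Availability Groups, that proactive move can be sketched as a planned, no-data-loss failover: temporarily switch the DR replica to synchronous-commit mode, wait for it to catch up, then fail over normally. Again, the AG and replica names are hypothetical:

```sql
-- Hypothetical planned failover with no data loss.
-- 1. On the primary: temporarily make the DR replica synchronous.
ALTER AVAILABILITY GROUP [AG_DR]
    MODIFY REPLICA ON N'DR-NODE'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

-- 2. Wait until the DR databases report SYNCHRONIZED.
SELECT DB_NAME(database_id) AS database_name,
       synchronization_state_desc
FROM sys.dm_hadr_database_replica_states;

-- 3. On the DR replica: perform a normal (no-data-loss) failover.
ALTER AVAILABILITY GROUP [AG_DR] FAILOVER;
```

The brief synchronous window is the “pause” described above: while it lasts, every committed transaction is hardened at the DR site before the application sees the commit, so nothing is left behind when you fail over.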
Practice, Practice, Practice
All this works wonderfully on paper. To make it work in the face of disaster, you need to practice moving over to your DR environment when no disaster is looming. Document the steps your IT team needs to take to ensure a smooth transition. Make sure everyone knows what they need to do when spinning down the production environment (assuming there’s time for an orderly shutdown) and spinning up the DR infrastructure. If you rehearse when no emergency threatens, you’ll discover where you’ve missed a step or where some aspect of your disaster recovery plan hasn’t worked as expected, and a practice session is the perfect opportunity to find and fix those gaps. You won’t have the luxury of a do-over when a real disaster suddenly hits.
Your operations can survive calamity. But you need to plan for it, even though you can’t know what it will be or when it will hit, and you need to practice your response so that when the time actually comes, your plan will work effectively.
Dave Bermingham is Director of Customer Success at SIOS Technology. He is recognized within the technology community as a high-availability expert and has been honored to be elected a Microsoft MVP for the past 12 years: 6 years as a Cluster MVP and 6 years as a Cloud and Datacenter Management MVP. Dave holds numerous technical certifications and has more than thirty years of IT experience, including in finance, healthcare, and education.