Disaster Recovery – Do It Yourself or As-a-Service?

I don’t normally do articles off the back of someone else’s podcast episode, but I was listening to W. Curtis Preston and Prasanna Malaiyandi discuss “To DRaaS or Not to DRaaS” on The Backup Wrap-up a little while ago, and thought it was worth diving into a bit on the old weblog. In my day job I talk to many organisations about their Disaster Recovery (DR) requirements. I’ve been doing this a while, and I’ve seen the conversation change somewhat over the last few years. This article will barely skim the surface on what you need to think about, so I recommend you listen to Curtis and Prasanna’s podcast, and while you’re at it, buy Curtis’s book. And, you know, get to know your environment. And what your environment needs.


The Basics

As Curtis says in the podcast, “[b]ackup is hard, recovery is harder, and disaster recovery is an advanced form of recovery”. DR is rarely just a matter of having another copy of your data safely stored away from wherever your primary workloads are hosted. Sure, that’s a good start, but it’s not just about getting that data back, or the servers; it’s the connectivity as well. So you have a bunch of servers running somewhere, and you’ve managed to get some / all / most of your data back. Now what? How do your users connect to that data? How do they authenticate? How long will it take to reconfigure your routing? What if your gateway doesn’t exist any more? A lot of customers think of DR in a similar fashion to the way they treat their backup and recovery approach. Sure, it’s super important that you understand how you can meet your recovery point objectives and your recovery time objectives, but there are some other things you’ll need to worry about too.
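To make the recovery point objective (RPO) and recovery time objective (RTO) side of that a little more concrete, here’s a minimal sketch of the kind of check you could script against your own replication metadata. The targets, timestamps, and function names here are all hypothetical, so substitute whatever your business has actually signed off on.

```python
from datetime import datetime, timedelta, timezone

# Illustrative targets only; use whatever your business has agreed to.
RPO = timedelta(hours=1)   # maximum tolerable data loss
RTO = timedelta(hours=4)   # maximum tolerable time to be back online


def dr_readiness(last_replica_time: datetime, estimated_restore: timedelta) -> list[str]:
    """Return a list of problems; an empty list means the targets look achievable."""
    problems = []
    data_at_risk = datetime.now(timezone.utc) - last_replica_time
    if data_at_risk > RPO:
        problems.append(f"RPO breach: newest copy is {data_at_risk} old (target {RPO})")
    if estimated_restore > RTO:
        problems.append(f"RTO breach: estimated restore is {estimated_restore} (target {RTO})")
    return problems


# Hypothetical values for illustration.
issues = dr_readiness(
    last_replica_time=datetime.now(timezone.utc) - timedelta(minutes=45),
    estimated_restore=timedelta(hours=6),
)
for issue in issues:
    print(issue)
```

Note that a check like this only tells you about the data. It says nothing about whether your users can actually reach and authenticate to the recovered environment, which is the point above.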


What’s A Disaster?

Natural

What kind of disaster are you looking to recover from? In the olden days (let’s say about 15 years ago), I was talking to clients about natural disasters in the main. What happens when the Brisbane River has a one-in-a-hundred-year flood for the second time in five years? Are your data centres above river level? What about your generators? What if there’s a fire? What if you live somewhere near the ocean and there’s a big old tsunami heading your way? My friends across the ditch know all about how not to build data centres on fault lines.

Accidental

Things have evolved to cover operational considerations too. I like to joke with my customers about the backhoe operator cutting through your data centre’s fibre connection, but this is usually the reason why data centres have multiple providers. And there’s always human error to contend with. As data gets more concentrated, and organisations look to do things more efficiently (i.e. some of them are going cheap on infrastructure investments), the risk of messing up a single component can have a bigger impact than it would have previously. You can architect around some of this, for sure, but a lot of businesses are not investing in those areas and just leaving it up to luck.

Bad Hacker, Stop It

More recently, however, the conversation I’ve been having with folks across a variety of industries has been more about some kind of human malfeasance. Whether it’s bad actors (in hoodies and darkened rooms no doubt) looking to infect your environment with malware, or someone not paying attention and unleashing ransomware into your environment, we’re seeing more and more of this kind of activity having a real impact on organisations of all shapes and sizes. So not only do you need to be able to get your data back, you need to know that it’s clean and not going to re-infect your production environment.
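As a small illustration of one piece of that “is it clean?” problem, here’s a hedged sketch that verifies backup artefacts against a SHA-256 manifest recorded at backup time, before anyone restores from them. The paths and manifest format are my own invention, and this is integrity checking rather than malware scanning, but it at least tells you whether the copies you’re about to restore have changed since they were written.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backup artefacts don't blow out memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backups(manifest_path: Path, backup_dir: Path) -> list[str]:
    """Compare current hashes against the manifest written at backup time."""
    manifest = json.loads(manifest_path.read_text())  # e.g. {"filename": "sha256hex", ...}
    failures = []
    for name, expected in manifest.items():
        candidate = backup_dir / name
        if not candidate.exists():
            failures.append(f"missing: {name}")
        elif sha256_of(candidate) != expected:
            failures.append(f"hash mismatch (possible tampering): {name}")
    return failures


# Hypothetical locations for illustration.
problems = verify_backups(Path("/backups/manifest.json"), Path("/backups/daily"))
print("no changes detected since backup" if not problems else problems)
```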

A Real Disaster

And how pragmatic are you going to be when considering your recovery scenarios? When I first started in the industry, the company I worked for talked about their data centres being outside the blast radius (assuming someone wanted to drop a bomb in the middle of the Brisbane CBD). That’s great as far as it goes, but are all of your operational staff going to be around to start the recovery activities? Are any of them going to be interested in trying to recover your SQL environment if they have family and friends who’ve been impacted by some kind of earth-shattering disaster? Probably not so much. In Australia many organisations have started looking at having some workloads in Sydney, and recovery in Melbourne, or Brisbane for that matter. All cities on the Eastern seaboard. Is that enough? What if Oz gets wiped out? Should you have a copy of the data stored in Singapore? Is data sovereignty a problem for you? Will anyone care if things are going that badly for the country?


Can You Do It Better?

It’s not just about being able to recover your workloads either, it’s about how you can reduce the complexity and cost of that recovery. This is key to both the success of the recovery, and the likelihood of it not getting hit with relentless budget cuts. How fast do you want to be able to recover? How much data do you want to recover? To what level? I often talk to people about them shutting down their test and development environments during a DR event. Unless your whole business is built on software development, it doesn’t always make sense to have your user acceptance testing environment up and running in your DR environment.
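One way to express that “don’t recover everything” idea is a tiered recovery plan. The sketch below is a made-up example (the workload names, tiers, and environments are hypothetical, and in practice this inventory would come from your CMDB or infrastructure-as-code) of filtering a workload list so that only the important production tiers get spun up at the DR site.

```python
# Hypothetical workload inventory; in real life this would come from your CMDB or IaC.
workloads = [
    {"name": "erp-db",       "tier": 1, "environment": "production"},
    {"name": "web-frontend", "tier": 2, "environment": "production"},
    {"name": "uat-frontend", "tier": 3, "environment": "uat"},
    {"name": "dev-sandbox",  "tier": 4, "environment": "development"},
]

# During a declared DR event, only recover production workloads at tier 2 or better.
MAX_TIER_IN_DR = 2

to_recover = [
    w for w in workloads
    if w["environment"] == "production" and w["tier"] <= MAX_TIER_IN_DR
]

# Recover the most critical workloads first.
for workload in sorted(to_recover, key=lambda w: w["tier"]):
    print(f"queueing {workload['name']} (tier {workload['tier']}) for recovery")
```

The design point is simply that the decision about what not to recover gets made, and recorded, before the disaster rather than during it.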

The cool thing about public cloud is that there’s a lot more flexibility when it comes to making decisions about how much you want to run and where you want to run it. The cloud isn’t everything though. As Prasanna mentions, if the cloud can’t run all of your workloads (think mainframe and non-x86, for example), it’s pointless to try and use it as a recovery platform. But to misquote Jason Statham’s character Turkish in Snatch – “What do I know about DR?” And if you don’t really know about DR, you should think about getting someone who does involved.


What If You Can’t Do It?

And what if you want to DRaaS your SaaS? Chances are you can’t do much more than rely on your existing backup and recovery processes for your SaaS platforms. There’s not much chance that you can take that Salesforce data and put it somewhere else when Salesforce has a bad day. What you can do, however, is be sure that you back that Salesforce data up, so that when someone accidentally lets the bad guys in you’ve got data you can recover from.


Thoughts

The optimal way to approach DR is an interesting problem to try and solve, and like most things in technology, there’s no real one-size-fits-all approach that you can take. I haven’t even really touched on the pros and cons of as-a-Service offerings versus rolling your own either. Despite my love of DIY (thanks mainly to punk rock, not home improvement), I’m generally more of a fan of getting the experts to do it. They’re generally going to have more experience, and the scale, to do what needs to be done efficiently (if not always economically). That said, it’s never as simple as just offloading the responsibility to a third party. You might have a service provider delivering a service for you, but ultimately you’ll always be accountable for your DR outcomes, even if you’re not directly responsible for the mechanics of it.
