Disaster Recovery – Do It Yourself or As-a-Service?

I don’t normally do articles off the back of someone else’s podcast episode, but I was listening to W. Curtis Preston and Prasanna Malaiyandi discuss “To DRaaS or Not to DRaaS” on The Backup Wrap-up a little while ago, and thought it was worth diving into a bit on the old weblog. In my day job I talk to many organisations about their Disaster Recovery (DR) requirements. I’ve been doing this a while, and I’ve seen the conversation change somewhat over the last few years. This article will barely skim the surface of what you need to think about, so I recommend you listen to Curtis and Prasanna’s podcast, and while you’re at it, buy Curtis’s book. And, you know, get to know your environment. And what your environment needs.


The Basics

As Curtis says in the podcast, “[b]ackup is hard, recovery is harder, and disaster recovery is an advanced form of recovery”. DR is rarely just a matter of having another copy of your data safely stored away from wherever your primary workloads are hosted. Sure, that’s a good start, but it’s not just about getting that data back, or the servers; it’s the connectivity as well. So you have a bunch of servers running somewhere, and you’ve managed to get some / all / most of your data back. Now what? How do your users connect to that data? How do they authenticate? How long will it take to reconfigure your routing? What if your gateway doesn’t exist any more? A lot of customers think of DR in much the same way they think about backup and recovery. Sure, it’s super important that you understand how you can meet your recovery point objectives (RPOs) and your recovery time objectives (RTOs), but there are some other things you’ll need to worry about too.
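
If it helps to make the RPO / RTO part of that concrete, here’s a minimal sketch (in Python, with entirely hypothetical workload names and numbers) of the question you want to be able to answer for every workload before the bad day arrives: how old is my most recent recoverable copy, and can I bring the whole thing back, connectivity and all, fast enough?

```python
from datetime import datetime, timedelta, timezone

# Hypothetical workload catalogue: the objectives you've agreed with the business.
WORKLOADS = {
    "finance-db": {"rpo": timedelta(hours=1), "rto": timedelta(hours=4)},
    "intranet":   {"rpo": timedelta(hours=24), "rto": timedelta(hours=48)},
}

def dr_posture_ok(name, last_good_copy, estimated_restore):
    """True if the newest recoverable copy and the estimated end-to-end
    restore time (data, connectivity, authentication) fit the objectives."""
    objectives = WORKLOADS[name]
    data_loss_window = datetime.now(timezone.utc) - last_good_copy
    return (data_loss_window <= objectives["rpo"]
            and estimated_restore <= objectives["rto"])

# A finance database whose last clean copy is 30 minutes old, and which we
# think we can bring back, end to end, in about three hours.
print(dr_posture_ok("finance-db",
                    datetime.now(timezone.utc) - timedelta(minutes=30),
                    timedelta(hours=3)))  # True
```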


What’s A Disaster?

Natural

What kind of disaster are you looking to recover from? In the olden days (let’s say about 15 years ago), I was mainly talking to clients about natural disasters. What happens when the Brisbane River has a one-in-a-hundred-year flood for the second time in five years? Are your data centres above river level? What about your generators? What if there’s a fire? What if you live somewhere near the ocean and there’s a big old tsunami heading your way? My friends across the ditch know all about how not to build data centres on fault lines.

Accidental

Things have evolved to cover operational considerations too. I like to joke with my customers about the backhoe operator cutting through your data centre’s fibre connection, but this is usually the reason why data centres have multiple providers. And there’s always human error to contend with. As data gets more concentrated, and organisations look to do things more efficiently (i.e. some of them are going cheap on infrastructure investments), messing up a single component can have a bigger impact than it would have previously. You can architect around some of this, for sure, but a lot of businesses are not investing in those areas and are just leaving it up to luck.

Bad Hacker, Stop It

More recently, however, the conversation I’ve been having with folks across a variety of industries has been more about some kind of human malfeasance. Whether it’s bad actors (in hoodies and darkened rooms no doubt) looking to infect your environment with malware, or someone not paying attention and unleashing ransomware into your environment, we’re seeing more and more of this kind of activity having a real impact on organisations of all shapes and sizes. So not only do you need to be able to get your data back, you need to know that it’s clean and not going to re-infect your production environment.
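
Proving that recovered data is “clean” is a bigger topic than one blog post (isolated recovery environments, malware scanning, and so on), but one small piece you can automate is proving that what you restored matches what you captured at backup time. A rough Python sketch, assuming a hypothetical SHA-256 manifest written alongside each backup:

```python
import hashlib
import json
import pathlib

def verify_restore(manifest_path, restore_root):
    """Compare restored files against a SHA-256 manifest captured at backup
    time; return the relative paths that no longer match."""
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    mismatches = []
    for rel_path, expected_digest in manifest.items():
        restored_file = pathlib.Path(restore_root) / rel_path
        digest = hashlib.sha256(restored_file.read_bytes()).hexdigest()
        if digest != expected_digest:
            mismatches.append(rel_path)
    return mismatches

# e.g. verify_restore("backup-manifest.json", "/mnt/restored-fileshare")
# returns an empty list if everything matches the backup-time state.
```

It won’t tell you that the backup itself wasn’t already infected, but it will tell you whether anything has been tampered with since.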

A Real Disaster

And how pragmatic are you going to be when considering your recovery scenarios? When I first started in the industry, the company I worked for talked about their data centres being outside the blast radius (assuming someone wanted to drop a bomb in the middle of the Brisbane CBD). That’s great as far as it goes, but are all of your operational staff going to be around to start the recovery activities? Are any of them going to be interested in trying to recover your SQL environment if they have family and friends who’ve been impacted by some kind of earth-shattering disaster? Probably not so much. In Australia many organisations have started looking at having some workloads in Sydney, and recovery in Melbourne, or Brisbane for that matter. All cities on the eastern seaboard. Is that enough? What if Oz gets wiped out? Should you have a copy of the data stored in Singapore? Is data sovereignty a problem for you? Will anyone care if things are going that badly for the country?


Can You Do It Better?

It’s not just about being able to recover your workloads either; it’s about how you can reduce the complexity and cost of that recovery. This is key to both the success of the recovery and the likelihood of it not getting hit with relentless budget cuts. How fast do you want to be able to recover? How much data do you want to recover? To what level? I often talk to people about shutting down their test and development environments during a DR event. Unless your whole business is built on software development, it doesn’t always make sense to have your user acceptance testing environment up and running in your DR environment.
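
One way to make those calls ahead of time, rather than in the middle of an incident, is to keep a tiered recovery plan that explicitly states what comes back, in what order, and what stays down. A trivial sketch, with made-up workload names and tiers:

```python
# Hypothetical recovery tiers: what actually gets brought up at the DR site,
# and in what order. Test and dev deliberately stay down.
RECOVERY_PLAN = [
    {"workload": "auth-and-networking", "tier": 0, "recover": True},
    {"workload": "erp-production",      "tier": 1, "recover": True},
    {"workload": "file-services",       "tier": 2, "recover": True},
    {"workload": "uat-environment",     "tier": 9, "recover": False},
    {"workload": "dev-sandboxes",       "tier": 9, "recover": False},
]

def runbook_order(plan):
    """Return only the workloads flagged for recovery, lowest tier first."""
    return sorted((w for w in plan if w["recover"]), key=lambda w: w["tier"])

for entry in runbook_order(RECOVERY_PLAN):
    print(f"Tier {entry['tier']}: recover {entry['workload']}")
```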

The cool thing about public cloud is that there’s a lot more flexibility when it comes to making decisions about how much you want to run and where you want to run it. The cloud isn’t everything though. As Prasanna mentions, if the cloud can’t run all of your workloads (think mainframe and non-x86, for example), it’s pointless to try and use it as a recovery platform. But to misquote Jason Statham’s character Turkish in Snatch – “What do I know about DR?” And if you don’t really know about DR, you should think about getting someone involved who does.


What If You Can’t Do It?

And what if you want to DRaaS your SaaS? Chances are you can’t do much more than rely on your existing backup and recovery processes for your SaaS platforms. There’s not much chance that you can take that Salesforce data and put it somewhere else when Salesforce has a bad day. What you can do, however, is make sure you back that Salesforce data up, so that when someone accidentally lets the bad guys in, you’ve got data you can recover from.
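
Even a crude, scheduled export into storage you control is better than nothing. Here’s a minimal Python sketch using the third-party simple_salesforce library; the credentials, object, and fields are purely illustrative, and a proper SaaS backup product would deal with schema, attachments, incrementals, and restore ordering for you:

```python
import json
from datetime import datetime, timezone

from simple_salesforce import Salesforce  # third-party library, assumed available

# Illustrative credentials only; a real job would pull these from a secret store.
sf = Salesforce(username="user@example.com",
                password="********",
                security_token="********")

# Export one simple object; a real org has hundreds of objects, files, and
# relationships to worry about.
records = sf.query_all("SELECT Id, Name, LastModifiedDate FROM Account")["records"]

stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
with open(f"account-export-{stamp}.json", "w") as export_file:
    json.dump(records, export_file, indent=2)

print(f"Exported {len(records)} Account records to local storage")
```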


Thoughts

The optimal way to approach DR is an interesting problem to try and solve, and like most things in technology, there’s no real one-size-fits-all approach that you can take. I haven’t even really touched on the pros and cons of as-a-Service offerings versus rolling your own. Despite my love of DIY (thanks mainly to punk rock, not home improvement), I’m generally more of a fan of getting the experts to do it. They’re generally going to have more experience, and the scale, to do what needs to be done efficiently (if not always economically). That said, it’s never as simple as just offloading the responsibility to a third party. You might have a service provider delivering a service for you, but ultimately you’ll always be accountable for your DR outcomes, even if you’re not directly responsible for the mechanics of it.

Random Short Take #81

Welcome to Random Short Take #81. Last one for the year, because who really wants to read this stuff over the holiday season? Let’s get random.

Take care of yourselves and each other, and I’ll hopefully see you all on the line or in person next year.

MDP – Yeah You Know Me

Data protection is a funny thing. Much like insurance, most folks understand that it’s important, normally dread having to use it, and dislike the fact that it costs money “just in case something goes wrong”. But then they get hit by ransomware, or Judy in Accounting absolutely destroys a critical spreadsheet, and they realise it’s probably not such a bad thing to have this “data protection”.

Books are weird too. Not the idea that we’ll put a whole bunch of information in a file and make it accessible to people. Rather, that sometimes that information is given context and then printed out, sold, read, and stuck on a shelf somewhere for future reference. Indeed, I was a voracious consumer of technical books early in my career, particularly when many vendors were insisting that this was the way to share knowledge with end users. YouTube wasn’t a thing, and access to manuals and reference guides was limited to partners or the vendors themselves. The problem with technical books, however, is that if they cover a specific version of software (or hardware or whatever), they very quickly become outdated in potentially significant ways. As enjoyable as some of those books about Windows NT 4.0 might have been for us all, they quickly became monitor stands when Windows 2000 was released. The more useful books were the ones that shared more of the how, what, when, and why of the topic, rather than digging into specific guidance on how to do an activity with a particular solution. Particularly when that solution was re-written by the vendor between major versions.

Early on in my career I got involved in my employer’s backup and recovery solution. At the time it was all about GFS backup schemes and DDS-2 drives and per-server protection schemes that mostly worked. It was viewed as an unnecessary expense and given to junior staff to look after. There was a feeling, at least with some of the Windows stuff, that if anything went wrong it would likely go wrong in a big way. I generally felt ill at ease when recovery requests would hit the service desk queue. As a result of this, my interest in being able to bring data back from human error, disaster, or other kinds of failure was piqued, and I went out and bought a copy of Unix Backup and Recovery. As a system administrator, it was a great book to have at hand. There was a nice combination of understandable examples and practical application of backup and recovery principles covered throughout that book. I used to joke that it even had a happy ending, and everyone got their data back. As I moved through my career, I maintained an interest in data protection (it seemed, at one stage, to go hand in hand with storage for whatever reason), and I’ve often wondered what people do when they aren’t given the appropriate guidance on how to best do data protection to meet their needs.

All of this is an extremely long-winded way of saying that my friend W. Curtis Preston has released his fourth book, the snappily titled “Modern Data Protection”, and it makes for some excellent reading. If you listen to him talk about why he wrote another book on his podcast, you’ll appreciate that this thing was over 10 years in the making, had an extensive outline developed for it, and really took a lot of effort to get done. As Curtis points out, he goes out of his way not to name vendors or solutions in the book (he works for Druva). Instead, he spends time on the basics (why backup?), what you should back up, how to back up, and even when you should be backing things up.

This one doesn’t just cover off the traditional centralised server / tape library combo so common for many years in enterprise shops. It also goes into more modern on-premises solutions (I think the kids call them hyper-converged) and cloud-native solutions of all different shapes and sizes. He talks about how to protect a wide variety of workloads and solution architectures, drills into the importance of recovery testing, and even covers off the difference between backup and archive. Yes, they are different, and I’m not just saying that because I contributed that particular chapter. There’s talk of traditional data sources, deduplication technologies, and more fashionable stuff like Docker and Kubernetes.

The book comes in at a svelte 350ish pages, and you know that each chapter could almost have been a book on its own (or at least a very long whitepaper). That said, Preston does a great job of sticking to the topic at hand, and breaking down potentially complex scenarios in a concise and simple-to-digest fashion. As I like to say to anyone who’ll listen, this stuff can be hard to get right, and you want to get it right, so it helps if the book you’re using gets it right too.

Should you read this book? Yes. Particularly if you have data or know someone who has data. You may be a seasoned industry veteran or new to the game. It doesn’t matter. You might be a consultant, an architect, or an end user. You might even work at a data protection vendor. There’s something in this for everyone. I was one of the technical editors on this book, fancy myself as knowing a bit about data protection, and I still learnt a lot of stuff. Even if you’re not directly in charge of data protection for your own data or your organisation’s data, this is an extremely useful guide that covers off the things you should be looking at with your existing solution or with a new solution. You can buy it directly from O’Reilly, or from big book sellers. It comes in electronic and physical versions and is well worth checking out. If you don’t believe me, ask Mellor, or Leib – they’ll tell you the same thing.

  • Publisher: O’Reilly
  • ISBN: 9781492094050

Finally, thanks to Preston for getting me involved in this project, for putting up with my English (AU) spelling, and for signing my copy of Unix Backup and Recovery.