Disaster Recovery – Do It Yourself or As-a-Service?

I don’t normally do articles off the back of someone else’s podcast episode, but I was listening to W. Curtis Preston and Prasanna Malaiyandi discuss “To DRaaS or Not to DRaaS” on The Backup Wrap-up a little while ago, and thought it was worth diving into a bit on the old weblog. In my day job I talk to many organisations about their Disaster Recovery (DR) requirements. I’ve been doing this a while, and I’ve seen the conversation change somewhat over the last few years. This article will barely skim the surface on what you need to think about, so I recommend you listen to Curtis and Prasanna’s podcast, and while you’re at it, buy Curtis’s book. And, you know, get to know your environment. And what your environment needs.

 

The Basics

As Curtis says in the podcast, “[b]ackup is hard, recovery is harder, and disaster recovery is an advanced form of recovery”. DR is rarely a matter of having another copy of your data safely stored away from wherever your primary workloads are hosted. Sure, that’s a good start, but it’s not just about getting that data back, or the servers, it’s the connectivity as well. So you have a bunch of servers running somewhere, and you’ve managed to get some / all / most of your data back. Now what? How do your users connect to that data? How do they authenticate? How long will it take to reconfigure your routing? What if your gateway doesn’t exist any more? A lot of customers think of DR in a similar fashion to the way they treat their backup and recovery approach. Sure, it’s super important that you understand how you can meet your recovery point objectives and your recovery time objectives, but there are some other things you’ll need to worry about too.

 

What’s A Disaster?

Natural

What kind of disaster are you looking to recover from? In the olden days (let’s say about 15 years ago), I was talking to clients about natural disasters in the main. What happens when the Brisbane River has a one in a hundred years flood for the second time in 5 years? Are your data centres above river level? What about your generators? What if there’s a fire? What if you live somewhere near the ocean and there’s a big old tsunami heading your way? My friends across the ditch know all about how not to build data centres on fault lines.

Accidental

Things have evolved to cover operational considerations too. I like to joke with my customers about the backhoe operator cutting through your data centre’s fibre connection, but this is usually the reason why data centres have multiple providers. And there’s always human error to contend with. As data gets more concentrated, and organisations look to do things more efficiently (i.e. some of them are going cheap on infrastructure investments), messing up a single component can have a bigger impact than it would have previously. You can architect around some of this, for sure, but a lot of businesses are not investing in those areas and just leaving it up to luck.

Bad Hacker, Stop It

More recently, however, the conversation I’ve been having with folks across a variety of industries has been more about some kind of human malfeasance. Whether it’s bad actors (in hoodies and darkened rooms no doubt) looking to infect your environment with malware, or someone not paying attention and unleashing ransomware into your environment, we’re seeing more and more of this kind of activity having a real impact on organisations of all shapes and sizes. So not only do you need to be able to get your data back, you need to know that it’s clean and not going to re-infect your production environment.

A Real Disaster

And how pragmatic are you going to be when considering your recovery scenarios? When I first started in the industry, the company I worked for talked about their data centres being outside the blast radius (assuming someone wanted to drop a bomb in the middle of the Brisbane CBD). That’s great as far as it goes, but are all of your operational staff going to be around to start the recovery activities? Are any of them going to be interested in trying to recover your SQL environment if they have family and friends who’ve been impacted by some kind of earth-shattering disaster? Probably not so much. In Australia many organisations have started looking at having some workloads in Sydney, and recovery in Melbourne, or Brisbane for that matter. All cities on the Eastern seaboard. Is that enough? What if Oz gets wiped out? Should you have a copy of the data stored in Singapore? Is data sovereignty a problem for you? Will anyone care if things are going that badly for the country?

 

Can You Do It Better?

It’s not just about being able to recover your workloads either, it’s about how you can reduce the complexity and cost of that recovery. This is key to both the success of the recovery, and the likelihood of it not getting hit with relentless budget cuts. How fast do you want to be able to recover? How much data do you want to recover? To what level? I often talk to people about them shutting down their test and development environments during a DR event. Unless your whole business is built on software development, it doesn’t always make sense to have your user acceptance testing environment up and running in your DR environment.

The cool thing about public cloud is that there’s a lot more flexibility when it comes to making decisions about how much you want to run and where you want to run it. The cloud isn’t everything though. As Prasanna mentions, if the cloud can’t run all of your workloads (think mainframe and non-x86, for example), it’s pointless to try and use it as a recovery platform. But to misquote Jason Statham’s character Turkish in Snatch – “What do I know about DR?” And if you don’t really know about DR, you should think about getting someone who does involved.

 

What If You Can’t Do It?

And what if you want to DRaaS your SaaS? Chances are high that you probably can’t do much more than rely on your existing backup and recovery processes for your SaaS platforms. There’s not much chance that you can take that Salesforce data and put it somewhere else when Salesforce has a bad day. What you can do, however, is be sure that you back that Salesforce data up so that when someone accidentally lets the bad guys in you’ve got data you can recover from.

 

Thoughts

The optimal way to approach DR is an interesting problem to try and solve, and like most things in technology, there’s no real one size fits all approach that you can take. I haven’t even really touched on the pros and cons of as-a-Service offerings versus rolling your own either. Despite my love of DIY (thanks mainly to punk rock, not home improvement), I’m generally more of a fan of getting the experts to do it. They’re generally going to have more experience, and the scale, to do what needs to be done efficiently (if not always economically). That said, it’s never as simple as just offloading the responsibility to a third-party. You might have a service provider delivering a service for you, but ultimately you’ll always be accountable for your DR outcomes, even if you’re not directly responsible for the mechanics of it.

VMware Cloud Disaster Recovery – Using A Script VM

This is a quick post covering the steps required to configure a script VM for use in a recovery plan with VMware Cloud Disaster Recovery (VCDR). Why would you want to do this? You might be running a recovery for a Linux VM and you need to run a script to update the DNS settings of the VM once it’s powered on at another site. Or you might have a site-specific application that needs to be installed. Whatever. The point is that VCDR gives you the ability to do that via the Script VM. You can read the documentation on the feature here.

Firstly, you configure the Script VM as part of the Recovery Plan creation process. Specify the name of the VM and the vCenter it’s hosted on.

Under Recovery steps, click on Add Step to add a step to the recovery process.

When you add the step, you’ll want to add an action for the post-recovery phase.

You can then select “Run script on the Script VM”.

At this point you can specify the full path to the script file, keeping in mind that Windows looks different to Linux. You can also set a timeout for the script.
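As a rough illustration of the "Windows looks different to Linux" point, here is a small Python sketch that checks whether a string is a full (absolute) path for a given guest OS. The function name and the example paths are hypothetical and are not taken from the VCDR documentation; this is only a way to sanity-check what you type into the step.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_full_path(script_path: str, guest_os: str) -> bool:
    """Return True if script_path is absolute for the given guest OS family."""
    if guest_os == "windows":
        # Windows needs a drive (or UNC share) plus a root, e.g. C:\...
        return PureWindowsPath(script_path).is_absolute()
    if guest_os == "linux":
        # POSIX paths are absolute when they start with a forward slash.
        return PurePosixPath(script_path).is_absolute()
    raise ValueError(f"unknown guest OS: {guest_os!r}")


# Hypothetical example paths, for illustration only.
print(is_full_path(r"C:\scripts\update-dns.ps1", "windows"))  # True
print(is_full_path("/opt/scripts/update-dns.sh", "linux"))    # True
print(is_full_path("update-dns.sh", "linux"))                 # False
```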

And that’s pretty much it. Remember that you’ll need working DNS, or, failing that, valid IP addresses for things to work.
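If you want to check the DNS side before you rely on it in a recovery, something like the following Python sketch could be run from the Script VM. It is not part of VCDR; the function name and the host names in the usage note are assumptions made up for illustration.

```python
import socket
from typing import Callable, Iterable


def unresolved_hosts(
    hosts: Iterable[str],
    resolver: Callable[[str], object] = socket.gethostbyname,
) -> list[str]:
    """Return the subset of hosts that could not be resolved to an address."""
    failed = []
    for host in hosts:
        try:
            resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

Calling `unresolved_hosts(["vcenter.example.com", "10.0.0.10"])` (both placeholders) returns the names that need attention. An IP address literal passes through without a DNS lookup, which matches the "failing that, valid IP addresses" fallback.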

VMware Cloud Disaster Recovery – Ransomware Recovery Activation

One of the cool features of VMware Cloud Disaster Recovery (VCDR) is the Enhanced Ransomware Recovery capability. This is a quick post to talk through how to turn it on in your VCDR environment, and things you need to consider.

 

Organization Settings

The first step is to enable the ransomware services integration in your VCDR dashboard. You’ll need to be an Organisation owner to do this. Go to Settings, and click on Ransomware Recovery Services.

You’ll then have the option to select where the data analysis is performed.

You’ll also need to tick some boxes to ensure that you understand that an appliance will be deployed in each of your Recovery SDDCs, Windows VMs will get a sensor installed, and some preinstalled sensors may clash with Carbon Black.

Click on Activate and it will take a few moments. If it takes much longer than that, you’ll need to talk to someone in support.

Once the analysis integration is activated, you can then activate NSX Advanced Firewall. Page 245 of the PDF documentation covers this better than I can, but note that NSX Advanced Firewall is a chargeable service (if you don’t already have a subscription attached to your Recovery SDDC). There’s some great documentation here on what you do and don’t have access to if you allow the activation of NSX Advanced Firewall.

Like your favourite TV chef would say, here’s one I’ve prepared earlier.

Recovery Plan Configuration

Once the services integration is done, you can configure Ransomware Recovery on a per Recovery Plan basis.

Start by selecting Activate ransomware recovery. You’ll then need to acknowledge that this is a chargeable feature.

You can also choose whether you want to use integrated analysis (i.e. Carbon Black Cloud), and if you want to manually remove other security sensors when you recover. You can, also, choose to use your own tools if you need to.

And that’s it from a configuration perspective. The actual recovery bit? A story for another time.

VMware Cloud Disaster Recovery – Firewall Ports

I published an article a while ago on getting started with VMware Cloud Disaster Recovery (VCDR). One thing I didn’t cover in any real depth was the connectivity requirements between on-premises and the VCDR service. VMware has worked pretty hard to ensure this is streamlined for users, but it’s still something you need to pay attention to. I was helping a client work through this process for a proof of concept recently and thought I’d cover it off more clearly here. The diagram below highlights the main components you need to look at, being:

  • The Cloud File System (frequently referred to as the SCFS)
  • The VMware Cloud DR SaaS Orchestrator (the Orchestrator); and
  • VMware Cloud DR Auto-support.

It’s important to note that the first two services are assigned IP addresses when you enable the service in the Cloud Service Console, and the Auto-support service has three public IP addresses that you need to be able to communicate with. All of this happens outbound over TCP 443. The Auto-support service is not required, but it is strongly recommended, as it makes troubleshooting issues with the service much easier, and provides VMware with an opportunity to proactively resolve cases. Network connectivity requirements are documented here.

[image courtesy of VMware]

So how do I know my firewall rules are working? The first sign that there might be a problem is that the DRaaS Connector deployment will fail to communicate with the Orchestrator at some point (usually towards the end), and you’ll see a message similar to the following. “ERROR! VMware Cloud DR authentication is not configured. Contact support.”

How can you troubleshoot the issue? Fortunately, we have a tool called the DRaaS Connector Connectivity Check CLI that you can run to check what’s not working. In this instance, we suspected an issue with outbound communication, and ran the following command on the console of the DRaaS Connector to check:

drc network test --scope cloud

This returned a status of “reachable” for the Orchestrator and Auto-support services, but the SCFS was unreachable. Some negotiations with the firewall team, and we were up and running.
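If you want a second opinion from another machine on the same network segment, a plain outbound TCP test on port 443 is easy to sketch in Python. This is not the DRaaS Connector CLI and makes no claim about how that tool works internally; the function names are made up for illustration, and you would pass in the addresses assigned to your own Orchestrator, SCFS, and Auto-support endpoints.

```python
import socket
from typing import Callable, Dict


def tcp_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if an outbound TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


def check_endpoints(
    endpoints: Dict[str, str],
    probe: Callable[[str], bool] = tcp_reachable,
) -> Dict[str, str]:
    """Map each named endpoint to 'reachable' or 'unreachable'."""
    return {
        name: "reachable" if probe(address) else "unreachable"
        for name, address in endpoints.items()
    }
```

For example, `check_endpoints({"orchestrator": "192.0.2.10", "scfs": "192.0.2.11"})` uses placeholder addresses from the documentation range; substitute your own. A successful TCP handshake only tells you the firewall is letting the connection out, not that the service on the other end is healthy.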

Note, also, that VMware supports the use of proxy servers for communicating with Auto-support services, but I don’t believe we support the use of a proxy for Orchestrator and SCFS communications. If you’re worried about VCDR using up all your bandwidth, you can throttle it. Details on how to do that can be found here. We recommend a minimum of 100Mbps, but you can go as low as 20Mbps if required.
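To get a feel for what those bandwidth numbers mean in practice, here is a back-of-the-envelope calculation in Python. The 1 TB figure is a hypothetical amount of changed data, and the sum ignores protocol overhead, compression, and deduplication, so treat the results as a rough guide only.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_gb gigabytes over a link of link_mbps megabits/s.

    Uses decimal units (1 GB = 8000 megabits). efficiency is the assumed usable
    fraction of the link; 1.0 means the whole link, which is optimistic.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    seconds = (data_gb * 8000) / (link_mbps * efficiency)
    return seconds / 3600


# Hypothetical 1 TB (1000 GB) of data at the two link speeds mentioned above.
for mbps in (100, 20):
    print(f"{mbps} Mbps: {transfer_hours(1000, mbps):.1f} hours")
# 100 Mbps: 22.2 hours
# 20 Mbps: 111.1 hours
```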

Datadobi Announces DobiProtect

Datadobi recently announced DobiProtect. I had the opportunity to speak with Michael Jack and Carl D’Halluin about the announcement, and thought I’d share some thoughts here.

 

The Problem

Disaster Recovery

Modern disaster recovery solutions tend more towards business continuity than DR. The challenge with data replication solutions is that it’s a trivial thing to replicate corruption from your primary storage to your DR storage. Backup systems are vulnerable too, and in most instances you need to make some extra effort to ensure you’ve got a replicated catalogue, and that your backup data is not isolated. Invariably, you’ll be looking to restore to like hardware in order to reduce the recovery time. Tape is still a pain to deal with, and invariably you’re also at the mercy of people and processes going wrong.

What Do Customers Need?

To get what you need out of a robust DR system, there are a few criteria that need to be met, including:

  • An easy way to select business-critical data;
  • A simple way to make a golden copy in native format;
  • A bunker site in a DC or cloud;
  • A manual air-gap procedure;
  • A way to restore to anything; and
  • A way to failover if required.

 

Enter DobiProtect

What Does It Do?

The idea is that you have two sites with a manual air-gap between them, usually controlled by a firewall of some type. The first site is where you run your production workload, and there’ll likely be a subset of data that is really quite important to your business. You can use DobiProtect to get that data from your production site to DR (it might even be in a bunker!). In order to get the data from Production to DR, DobiProtect scans the data before it’s pulled across to DR. Note that the data is pulled, not pushed. This is important as it means that there’s no obvious trace of the bunker’s existence in production.

[image courtesy of Datadobi]

If things go bang, you can recover to any NAS or Object.

  • Browse golden copy
  • Select by directory structure, folder, or object patterns
  • Mounts and shares
  • Specific versions

Bonus Use Case

One of the more popular use cases that Datadobi spoke to me about was heterogeneous edge-to-core protection. Data on the edge is usually more vulnerable, and not every organisation has the funding to put robust protection mechanisms in place at every edge site to protect critical data. With the advent of COVID-19, many organisations have been pushing more data to the edge in order for remote workers to have better access to data. The challenge then becomes keeping that data protected in a reliable fashion. DobiProtect can be used to pull data from the core once data has been pulled back from the edge. Because it’s a software only product, your edge storage can be anything that supports object, SMB, or NFS, and the core could be anything else. This provides a lot of flexibility in terms of the expense traditionally associated with DR at edge sites.

[image courtesy of Datadobi]

 

Thoughts and Further Reading

The idea of an air-gapped site in a bunker somewhere is the sort of thing you might associate with a James Bond story. In Australia these aren’t exactly a common thing (bunkers, not James Bond stories), but Europe and the US are riddled with them. As Jack pointed out in our call, “[t]he first rule of bunker club – you don’t talk about the bunker”. Datadobi couldn’t give me a list of customers using this type of solution because none of the customers wanted people to know that they were doing things this way. It seems a bit like security via obscurity, but there’s no point painting a big target on your back or giving clues out for would-be crackers to get into your environment and wreak havoc.

The idea that your RPO is a day, rather than minutes, is also confronting for some folks. But the idea of this solution is that you’ll use it for your absolutely mission critical can’t live without it data, not necessarily your virtual machines that you may be able to recover normally if you’re attacked or the magic black smoke escapes from one of your hosts. If you’ve gone to the trouble of looking into acquiring some rack space in a bunker, limited the people in the know to a handful, and can be bothered messing about with a manual air-gap process, the data you’re looking to protect is clearly pretty important.

Datadobi has a rich heritage in data migration for both file and object storage systems. It makes sense that eventually customer demand would drive them down this route to deliver a migration tool that ostensibly runs all the time as a sort of data protection tool. This isn’t designed to protect everything in your environment, but for the stuff that will ruin your business if it goes away, it’s very likely worth the effort and expense. There are some folks out there actively looking for ways to put you over a barrel, so it’s important to think about what it’s worth to your organisation to avoid that if possible.

Zerto – News From ZertoCON 2019

Zerto recently held their annual user conference (ZertoCON) in Nashville, TN. I had the opportunity to talk to Rob Strechay about some of the key announcements coming out of the event and thought I’d cover them here.

 

Key Announcements

Licensing

You can now acquire Zerto either as a perpetual license or via a subscription. There’s previously been some concept of subscription pricing with Zerto, with customers having rented via managed service providers, but this is the first time it’s being offered directly to customers. Strechay noted that Zerto is “[n]ot trying to move to a subscription-only model”, but they are keen to give customers further flexibility in how they consume the product. Note that the subscription pricing also includes maintenance and support.

7.5 Is Just Around The Corner

If it feels like 7.0 was only just delivered, that’s because it was (in April). But 7.5 is already just around the corner. They’re looking to add a bunch of features, including:

  • Deeper integration with StoreOnce from HPE using Catalyst-based API, leveraging source-side deduplication
  • Qualification of Azure’s Data Box
  • Cloud mobility – in 7.0 they started down the path with Azure. Zerto Cloud Appliances now autoscale within Azure.

Azure Integration

There’s a lot more focus on Azure in 7.5, and Zerto are working on:

  • Managed failback / managed disks in Azure
  • Integration with Azure Active Directory
  • Adding encryption at rest in AWS, and doing some IAM integration
  • Automated driver injection on the fly as you recover into AWS (with Red Hat)

Resource Planner

Building on their previous analytics work, you’ll also be able to (shortly) download Zerto Virtual Manager. This talks to vCenter and can gather data and help customers plan their VMware to VMware (or to Azure / AWS) migrations.

VAIO

Zerto has now completed the initial certification to use VMware’s vSphere APIs for I/O Filtering (VAIO) and they’ll be leveraging these in 7.5. Strechay said they’ll probably have both versions in the product for a little while.

 

Thoughts And Further Reading

I’d spoken with Strechay previously about Zerto’s plans to compete against the “traditional” data protection vendors, and asked him what the customer response has been to Zerto’s ambitions (and execution). He said that, as they’re already off-siting data (as part of the 3-2-1 data protection philosophy), how hard is it to take it to the next level? He said a number of customers were very motivated to use long term retention, and wanted to move on from their existing backup vendors. I’ve waxed lyrical in the past about what I thought some of the key differences were between periodic data protection, disaster recovery, and disaster avoidance. That doesn’t mean that companies like Zerto aren’t doing a pretty decent job of blurring the lines between the types of solution they offer, particularly with the data mobility capabilities built in to their offerings. I think there’s a lot of scope with Zerto to move into spaces that they’ve previously only been peripherally involved in. It makes sense that they’d focus on data mobility and off-site data protection capabilities. There’s a good story developing with their cloud integration, and it seems like they’ll just continue to add features and capabilities to the product. I really like that they’re not afraid to make promises on upcoming releases and have (thus far) been able to deliver on them.

The news about VAIO certification is pretty big, and it might remove some of the pressure that potential customers have faced previously about adopting protection solutions that weren’t entirely blessed by VMware.

I’m looking forward to seeing what Zerto ends up delivering with 7.5, and I’m really enjoying the progress they’re making with both their on-premises and public cloud focused solutions. You can read Zerto’s press release here, and Andrea Mauro published a comprehensive overview here.

Disaster Recovery vs Disaster Avoidance vs Data Protection

This is another one of those rambling posts that I like to write when I’m sitting in an airport lounge somewhere and I’ve got a bit of time to kill. The versus in the title is a bit misleading too, because DR and DA are both forms of data protection. And periodic data protection (PDP) is important too. But what I wanted to write about was some of the differences between DR and DA, in particular.

TL;DR – DR is not DA, and this is not PDP either. But you need to think about all of them at some point.

 

Terminology

I want to be clear about what I mean when I say these terms, because it seems like they can mean a lot of things to different folks.

  • Recovery Point Objective – The Recovery Point Objective (RPO) is the maximum amount of time in which data may have been permanently lost during an incident. You want this to be in minutes and hours, not days or weeks (ideally). RPO 0 is the idea that no data is lost when there’s a failure. A lot of vendors will talk about “Near Zero” RPOs.
  • Recovery Time Objective – The Recovery Time Objective (RTO) is the amount of time the business can be without the service, without incurring significant risks or significant losses. This is, ostensibly, how long it takes you to get back up and running after an event. You don’t really want this to be in days and weeks either.
  • Disaster Recovery – Disaster Recovery is the ability to recover applications after a major event (think flood, fire, DC is now a hole in the ground). This normally involves a failover of workloads from one DC to another in an orchestrated fashion.
  • Disaster Avoidance – Disaster avoidance “is an anticipatory strategy that is in place in order to prevent any such instance of data breach or losses. It is a defensive, proactive approach to keeping data safe” (I’m quoting this from a great blog post on the topic here)
  • Periodic Data Protection – This is the kind of data protection activity we normally associate with “backups”. It is usually a daily activity (or perhaps as frequent as hourly) and the data is normally used for ad-hoc data file recovery requests. Some people use their backup data as an archive. They’re bad people and shouldn’t be trusted. PDP is normally separate to DA or DR solutions.
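To make the RPO definition a little more concrete, here is a small Python sketch that works out the data-loss window for an incident given a set of recovery points, and compares it to an objective. The function names, timestamps, and the 4 hour objective are all hypothetical, chosen only to illustrate the arithmetic.

```python
from datetime import datetime, timedelta
from typing import List


def achieved_rpo(recovery_points: List[datetime], incident: datetime) -> timedelta:
    """Time between the incident and the newest recovery point taken before it."""
    usable = [point for point in recovery_points if point <= incident]
    if not usable:
        raise ValueError("no recovery point exists before the incident")
    return incident - max(usable)


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    """True when the actual loss window (or outage) is within the objective."""
    return actual <= objective


# Hypothetical nightly backups at 01:00, with an incident at 15:30 on day two.
backups = [datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 2, 1, 0)]
loss = achieved_rpo(backups, datetime(2024, 1, 2, 15, 30))
print(loss)                                       # 14:30:00
print(meets_objective(loss, timedelta(hours=4)))  # False
```

The same `meets_objective` comparison works for RTO if you feed it the measured time to restore service instead of the loss window.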

 

DR Isn’t The Full Answer

I’ve had some great conversations with customers recently about adding resilience to their on-premises infrastructure. It seems like an old-fashioned concept, but a number of organisations are only now seeing the benefits of adding infrastructure-level resilience to their platforms. The first conversation usually goes something like this:

Me: So what’s your key application, and what’s your resiliency requirement?

Customer: Oh, it’s definitely Application X (usually built on Oracle or using SAP or similar). It absolutely can’t go down. Ever. We need to have RPO 0 and RTO 0 for this one. Our whole business depends on it.

Me: Okay, it sounds like it’s pretty important. So what about your file server and email?

Customer: Oh, that’s not so important. We can recover those from overnight backups.

Me: But aren’t they used to store data for Application X? Don’t you have workflows that rely on email?

Customer: Oh, yeah, I guess so. But it will be too expensive to protect all of this. Can we change the RPO a bit? I don’t think the CFO will support us doing RPO 0 everywhere.

These requirements tend to change whenever we move from technical discussions to commercial discussions. In an ideal world, Martha in Accounting will have her home directory protected in a highly available fashion such that it can withstand the failure of one or more storage arrays (or data centres). The problem with this is that, if there are 1000 Marthas in the organisation, the cost of protecting that kind of data at scale becomes prohibitive, relative to the perceived value of the data. This is one of the ways I’ve seen “DR” capability added to an environment in the past. Take some older servers and put them in a site removed from the primary site, set up some scripts to copy critical data to that site, and hope nothing ever goes too wrong with the primary site.

There are obviously better ways of doing this, and common solutions may or may not involve block-level storage replication, orchestrated failover tools, and like for like compute at the secondary site (or perhaps you’ve decided to shut down test and development while you’re fixing the problem at the production site).

But what are you trying to protect against? The failure of some compute? Some storage? The network layer? A key application? All of these answers will determine the path you’ll need to go down. Keep in mind also that DR isn’t the only answer. You also need to have business continuity processes in place. A failover of workloads to a secondary site is pointless if operations staff don’t have access to a building to continue doing their work, or if people can’t work when the swipe card access machine is off-line, or if your Internet feed only terminates in one DC, etc.

 

I’m Avoiding The Problem

Disaster Avoidance is what I like to call the really sexy resilience solution. You can have things go terribly wrong with your production workload and potentially still have it functioning like there was no problem. This is where hardware solutions like Pure Storage ActiveCluster or Dell EMC VPLEX can really shine, assuming you’ve partnered them with applications that have the smarts built in to leverage what they have to offer. Because that’s the real key to a successful disaster avoidance design. It’s great to have synchronous replication and cache-consistency across DCs, but if your applications don’t know what to do when a leg goes missing, they’ll fall over. And if you don’t have other protection mechanisms in place, such as periodic data protection, then your synchronous block replication solution will merrily synchronise malware or corrupted data from one site to another in the blink of an eye.
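The point about replication faithfully copying bad data can be shown with a toy model in Python. This is a deliberately simplified sketch, not how any particular array or replication product works; the class name and the data are invented for illustration.

```python
import copy


class ReplicatedVolume:
    """Toy model: every write is mirrored to the replica immediately."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.snapshots = []  # point-in-time copies (periodic data protection)

    def write(self, key, value):
        self.primary[key] = value
        self.replica[key] = value  # synchronous replication: no delay, no checks

    def snapshot(self):
        self.snapshots.append(copy.deepcopy(self.primary))


vol = ReplicatedVolume()
vol.write("ledger.db", "good data")
vol.snapshot()                          # e.g. the nightly backup
vol.write("ledger.db", "ENCRYPTED")     # ransomware, or just a bad write

print(vol.replica["ledger.db"])         # ENCRYPTED
print(vol.snapshots[-1]["ledger.db"])   # good data
```

The replica is only ever as clean as the primary, which is why the point-in-time copy is the thing that saves you here.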

It’s important to understand the failure scenarios you’re protecting against too. If you’ve deployed vSphere Metro Storage Cluster, you’ll be able to run VMs even when your whole array has gone off-line (assuming you’ve set it up properly). But this won’t necessarily prevent an outage if you lose your vSphere cluster, or the whole DC. Your data will still be protected, and you’ll be in good shape in terms of recovering quickly, but there will be an outage. This is where application-level resilience can help with availability. Remember that, even if you’ve got ultra-resilient workload protection across DCs, if your staff only have one connection into the environment, they may be left twiddling their thumbs in the event of a problem.

There’s a level of resiliency associated with this approach, and your infrastructure will certainly be able to survive the failure of a compute node, or even a bunch of disk and some compute (everything will reboot in another location). But you need to be careful not to let people think that this is something it’s not.

 

PDP, Yeah You Know Me

I mentioned problems with malware and data corruption earlier on. This is where periodic data protection solutions (such as those sold by Dell EMC, CommVault, Rubrik, Cohesity, Veeam, etc) can really get you out of a spot of bother. And if you don’t need to recover the whole VM when there’s a problem, these solutions can be a lot quicker at getting data back. The good news is that you can integrate a lot of these products with storage protection solutions and orchestration tools for a belt and braces solution to protection, and it’s not the shitshow of scripts and kludges that it was ten years ago. Hooray!

 

Final Thoughts

There’s a lot more to data protection than I’ve covered here. People like Preston have written books about the topic. And a lot of the decision making is potentially going to be out of your hands in terms of what your organisation can afford to spend (until they lose a lot of data, money (or both), then they’ll maybe change their focus). But if you do have the opportunity to work on some of these types of solutions, at least try to make sure that everyone understands exactly what they can achieve with the technologies at hand. There’s nothing worse than being hauled over the coals because some director thought they could do something amazing with infrastructure-level availability and resiliency only to have the whole thing fall over due to lack of budget. It can be a difficult conversation to have, particularly if your executives are the types of people who like to trust the folks with the fancy logos on their documents. All you can do in that case is try and be clear about what’s possible, and clear about what it will cost in time and money.

In the near future I’ll try to put together a post on various infrastructure failure scenarios and what works and what doesn’t. RPO 0 seems to be what everyone is asking for, but it may not necessarily be what everyone needs. Now please enjoy this Unfinished Business stock image.