Disaster Recovery vs Disaster Avoidance vs Data Protection

This is another one of those rambling posts that I like to write when I’m sitting in an airport lounge somewhere and I’ve got a bit of time to kill. The versus in the title is a bit misleading too, because DR and DA are both forms of data protection. And periodic data protection (PDP) is important too. But what I wanted to write about was some of the differences between DR and DA, in particular.

TL;DR – DR is not DA, and this is not PDP either. But you need to think about all of them at some point.

 

Terminology

I want to be clear about what I mean when I say these terms, because it seems like they can mean a lot of things to different folks.

  • Recovery Point Objective – The Recovery Point Objective (RPO) is the maximum amount of time in which data may have been permanently lost during an incident. You want this to be in minutes and hours, not days or weeks (ideally). RPO 0 is the idea that no data is lost when there’s a failure. A lot of vendors will talk about “Near Zero” RPOs.
  • Recovery Time Objective – The Recovery Time Objective (RTO) is the amount of time the business can be without the service, without incurring significant risks or significant losses. This is, ostensibly, how long it takes you to get back up and running after an event. You don’t really want this to be in days and weeks either.
  • Disaster Recovery – Disaster Recovery is the ability to recover applications after a major event (think flood, fire, DC is now a hole in the ground). This normally involves a failover of workloads from one DC to another in an orchestrated fashion.
  • Disaster Avoidance – Disaster avoidance “is an anticipatory strategy that is in place in order to prevent any such instance of data breach or losses. It is a defensive, proactive approach to keeping data safe” (I’m quoting this from a great blog post on the topic here)
  • Periodic Data Protection – This is the kind of data protection activity we normally associate with “backups”. It is usually a daily activity (or perhaps as frequent as hourly) and the data is normally used for ad-hoc data file recovery requests. Some people use their backup data as an archive. They’re bad people and shouldn’t be trusted. PDP is normally separate to DA or DR solutions.
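To make the RPO and RTO definitions a bit more concrete, here's a minimal sketch of how you might score an incident against them after the fact. The names (`Objectives`, `assess_incident`) and the timestamps in the example are made up for illustration; the only thing taken from the definitions above is that RPO is measured back from the failure to the last recoverable copy, and RTO is measured forward from the failure to the service being back.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time without the service


def assess_incident(last_good_copy, failure_time, service_restored, objectives):
    """Compare what an incident actually cost against the stated objectives."""
    # Everything written after the last recoverable copy is gone.
    data_loss_window = failure_time - last_good_copy
    # The business was without the service for this long.
    downtime = service_restored - failure_time
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


if __name__ == "__main__":
    # Hypothetical incident: hourly copies, failure at 02:45, service back at 08:00.
    result = assess_incident(
        last_good_copy=datetime(2018, 6, 1, 2, 0),
        failure_time=datetime(2018, 6, 1, 2, 45),
        service_restored=datetime(2018, 6, 1, 8, 0),
        objectives=Objectives(rpo=timedelta(hours=1), rto=timedelta(hours=4)),
    )
    for name, value in result.items():
        print(name, value)
```

In this made-up case the RPO is met (45 minutes of data lost against a 1 hour objective) but the RTO is not (over 5 hours of downtime against a 4 hour objective), which is exactly the kind of gap that's worth finding in a test rather than in a real event.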

 

DR Isn’t The Full Answer

I’ve had some great conversations with customers recently about adding resilience to their on-premises infrastructure. It seems like an old-fashioned concept, but a number of organisations are only now seeing the benefits of adding infrastructure-level resilience to their platforms. The first conversation usually goes something like this:

Me: So what’s your key application, and what’s your resiliency requirement?

Customer: Oh, it’s definitely Application X (usually built on Oracle or using SAP or similar). It absolutely can’t go down. Ever. We need to have RPO 0 and RTO 0 for this one. Our whole business depends on it.

Me: Okay, it sounds like it’s pretty important. So what about your file server and email?

Customer: Oh, that’s not so important. We can recover those from overnight backups.

Me: But aren’t they used to store data for Application X? Don’t you have workflows that rely on email?

Customer: Oh, yeah, I guess so. But it will be too expensive to protect all of this. Can we change the RPO a bit? I don’t think the CFO will support us doing RPO 0 everywhere.

These requirements tend to change whenever we move from technical discussions to commercial discussions. In an ideal world, Martha in Accounting will have her home directory protected in a highly available fashion such that it can withstand the failure of one or more storage arrays (or data centres). The problem with this is that, if there are 1000 Marthas in the organisation, the cost of protecting that kind of data at scale becomes prohibitive, relative to the perceived value of the data. This is how I’ve seen “DR” capability added to an environment in the past: take some older servers and put them in a site removed from the primary site, set up some scripts to copy critical data to that site, and hope nothing ever goes too wrong with the primary site.

There are obviously better ways of doing this, and common solutions may or may not involve block-level storage replication, orchestrated failover tools, and like for like compute at the secondary site (or perhaps you’ve decided to shut down test and development while you’re fixing the problem at the production site).

But what are you trying to protect against? The failure of some compute? Some storage? The network layer? A key application? All of these answers will determine the path you’ll need to go down. Keep in mind also that DR isn’t the only answer. You also need to have business continuity processes in place. A failover of workloads to a secondary site is pointless if operations staff don’t have access to a building to continue doing their work, or if people can’t work when the swipe card access machine is off-line, or if your Internet feed only terminates in one DC, etc.

 

I’m Avoiding The Problem

Disaster Avoidance is what I like to call the really sexy resilience solution. You can have things go terribly wrong with your production workload and potentially still have it functioning like there was no problem. This is where hardware solutions like Pure Storage ActiveCluster or Dell EMC VPLEX can really shine, assuming you’ve partnered them with applications that have the smarts built in to leverage what they have to offer. Because that’s the real key to a successful disaster avoidance design. It’s great to have synchronous replication and cache-consistency across DCs, but if your applications don’t know what to do when a leg goes missing, they’ll fall over. And if you don’t have other protection mechanisms in place, such as periodic data protection, then your synchronous block replication solution will merrily synchronise malware or corrupted data from one site to another in the blink of an eye.
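Here's a toy illustration of that last point. It's a sketch only, and `MirroredVolume` is a made-up class rather than anything a storage vendor ships: the "volume" is just a pair of dictionaries that are always written together (the synchronous mirror), plus a list of point-in-time copies standing in for periodic data protection.

```python
class MirroredVolume:
    """Toy model: two sites kept in lockstep, plus periodic point-in-time copies."""

    def __init__(self):
        self.site_a = {}
        self.site_b = {}
        self.snapshots = []

    def write(self, key, value):
        # Synchronous replication: the write lands on both sites together,
        # whether the content is good data or garbage.
        self.site_a[key] = value
        self.site_b[key] = value

    def take_snapshot(self):
        # Periodic protection: keep an independent copy of the current state.
        self.snapshots.append(dict(self.site_a))

    def restore(self, index):
        # Roll both sites back to a chosen point-in-time copy.
        snapshot = self.snapshots[index]
        self.site_a = dict(snapshot)
        self.site_b = dict(snapshot)


if __name__ == "__main__":
    volume = MirroredVolume()
    volume.write("ledger.db", "good data")
    volume.take_snapshot()

    # Malware (or a bad application release) overwrites the file.
    volume.write("ledger.db", "ENCRYPTED")
    print(volume.site_a, volume.site_b)  # both sites now hold the bad copy

    volume.restore(0)
    print(volume.site_a, volume.site_b)  # both sites are back to the snapshot
```

The mirror did its job perfectly in both cases. It's the snapshot that saves you.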

It’s important to understand the failure scenarios you’re protecting against too. If you’ve deployed vSphere Metro Storage Cluster, you’ll be able to run VMs even when your whole array has gone off-line (assuming you’ve set it up properly). But this won’t necessarily prevent an outage if you lose your vSphere cluster, or the whole DC. Your data will still be protected, and you’ll be in good shape in terms of recovering quickly, but there will be an outage. This is where application-level resilience can help with availability. Remember that, even if you’ve got ultra-resilient workload protection across DCs, if your staff only have one connection into the environment, they may be left twiddling their thumbs in the event of a problem.

There’s a level of resiliency associated with this approach, and your infrastructure will certainly be able to survive the failure of a compute node, or even a bunch of disk and some compute (everything will reboot in another location). But you need to be careful not to let people think that this is something it’s not.

 

PDP, Yeah You Know Me

I mentioned problems with malware and data corruption earlier on. This is where periodic data protection solutions (such as those sold by Dell EMC, CommVault, Rubrik, Cohesity, Veeam, etc) can really get you out of a spot of bother. And if you don’t need to recover the whole VM when there’s a problem, these solutions can be a lot quicker at getting data back. The good news is that you can integrate a lot of these products with storage protection solutions and orchestration tools for a belt and braces solution to protection, and it’s not the shitshow of scripts and kludges that it was ten years ago. Hooray!

 

Final Thoughts

There’s a lot more to data protection than I’ve covered here. People like Preston have written books about the topic. And a lot of the decision making is potentially going to be out of your hands in terms of what your organisation can afford to spend (until they lose a lot of data or money (or both), and then they’ll maybe change their focus). But if you do have the opportunity to work on some of these types of solutions, at least try to make sure that everyone understands exactly what they can achieve with the technologies at hand. There’s nothing worse than being hauled over the coals because some director thought they could do something amazing with infrastructure-level availability and resiliency only to have the whole thing fall over due to lack of budget. It can be a difficult conversation to have, particularly if your executives are the types of people who like to trust the folks with the fancy logos on their documents. All you can do in that case is try and be clear about what’s possible, and clear about what it will cost in time and money.

In the near future I’ll try to put together a post on various infrastructure failure scenarios and what works and what doesn’t. RPO 0 seems to be what everyone is asking for, but it may not necessarily be what everyone needs. Now please enjoy this Unfinished Business stock image.

SolarWinds Articles

I’ve been writing some articles over in the SolarWinds Geek Speak community and other areas of the site covering fun things like SNMP, syslog and disaster recovery stuff. You can check them out here.

Syslog – The Blue-Collar Worker Of The Data Center

SNMP – It’s Not a Trap!

This is a Disaster! Knowing When to Call It

Disaster Recovery – How Logging Can Help Ensure You’ll Get There

Disaster Recovery – The Postmortem

What’s New With Zerto?

Zerto held their annual conference (ZertoCON) last week in Boston. I didn’t attend, but I did have time to catch up with Rob Strechay prior to Zerto making some announcements around the company and future direction. I thought I’d cover those here.

 

IT Resilience Platform

The first announcement revolved around the “IT Resilience Platform”. The idea behind the strategy is to bring backup, disaster recovery and cloud mobility solutions together into a single, simple, scalable platform. Strechay says that “this strategy combines continuous availability, workload mobility, and multi-cloud agility to ensure you can withstand any disruption, leverage new technology seamlessly, and move forward with confidence”. They’ve found that Zerto is being used for both unplanned and planned disruptions, and they’ve also been seeing a lot more activity resolving ransomware and security incidents. On the planned side, DC consolidation has been a big part of the activity.

What’s driving this direction? According to Strechay, companies are looking for fewer point solutions. They’re also seeing backup and DR activities converging. Cloud is driving this technology convergence and is changing the way data protection is being delivered.

  • Cloud for backup
  • Cloud for DR
  • Application mobility

“It’s good if it’s done properly”. Zerto tell me they haven’t rushed into this and are not taking the approach lightly. They see IT Resilience as a combination of Backup, DR Replication, and Hybrid Cloud. Strechay told me that Zerto are going to stay software only and will partner on the hardware side where required. So what does it look like conceptually?

[image courtesy of Zerto]

Think of this as a mode of transport. The analytics and control layer is the navigation system, the orchestration and automation layer is the steering wheel, and continuous data protection is the car.

 

Vision for the Future of Backup

Strechay also shared with me Zerto’s vision for the future of backup. In short, “it needs to change”. They really want to move away from the concept of periodic protection to continuous, journal-based protection delivering seconds of RPO at scale to meet customer expectations. How are they going to do this? The key differentiation will be CDP combined with best of breed replication.
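If the difference between periodic and journal-based protection seems a bit abstract, here's a small sketch of the general idea. To be clear, this is my own toy model and not how Zerto implements its journal: every write is recorded with a timestamp, and you recover by replaying the journal up to whatever point in time you choose, rather than being limited to last night's backup. The `Journal` class and the integer timestamps are assumptions for illustration.

```python
class Journal:
    """Toy write journal: replay entries up to a chosen point in time."""

    def __init__(self):
        # (timestamp, key, value) tuples, appended in time order.
        self.entries = []

    def record(self, timestamp, key, value):
        self.entries.append((timestamp, key, value))

    def state_at(self, point_in_time):
        # Rebuild the data as it looked at point_in_time by replaying
        # every write made at or before that moment.
        state = {}
        for timestamp, key, value in self.entries:
            if timestamp > point_in_time:
                break
            state[key] = value
        return state


if __name__ == "__main__":
    journal = Journal()
    journal.record(1, "orders.db", "v1")
    journal.record(5, "orders.db", "v2")
    journal.record(9, "orders.db", "corrupted")

    # Recover to just before the bad write at t=9.
    print(journal.state_at(8))
```

With a once-a-day backup taken at t=0 you'd lose both good versions. With the journal, the recovery point is the write just before things went wrong, which is where the "seconds of RPO" claim comes from.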

 

Zerto 7 Preview

Strechay also shared some high level details of Zerto 7, with key features including:

  • Intelligent index and search
  • Elastic journal
  • Data protection workflows
  • Enhanced architecture
  • LTR targets

There’ll be a new and enhanced user experience – they’re busy revisiting workflows and enhancing a number of them (e.g. reducing clicks, enhanced APIs, etc). They’ll also be looking at features such as prescriptive analytics (what if I added more VMs to this journal?). They’re aiming for a release in Q1 2019.

 

Thoughts

The way we protect data is changing. Companies like Zerto, Rubrik and Cohesity are bringing a new way of thinking to an age old problem. They’re coming at it from slightly different angles as well. This can only be a good thing for the industry. A lot of the technical limitations that we faced previously have been removed in terms of bandwidth and processing power. This provides the opportunity to approach the problem from the business perspective. Rather than saying “we can’t do that”, we have the opportunity to say “we can do that”. That doesn’t mean that scale is a simple thing to manage, but it seems like there are more ways to solve this problem than there have been previously.

I’ve been a fan of Zerto’s approach for some time. I like the idea that a company has shared their new vision for data protection some months out from actually delivering the product. It makes a nice change from companies merely regurgitating highlights from their product release notes (not that that isn’t useful at times). Zerto have a rich history of delivering CDP solutions for virtualised environments, and they’ve made some great inroads with cloud workload protection as well. The idea of moving away from periodic data protection to something continuous is certainly interesting, and obviously fits in well with Zerto’s strengths. It’s possibly not a strategy that will work well in every situation, particularly with smaller environments. But if you’re leveraging replication technologies already, it’s worth looking at how Zerto might be able to deliver a more complete solution for your data protection requirements.

Pure Storage ActiveCluster – Background Information

I’ve been doing a bunch of research into Pure Storage’s ActiveCluster product recently. I was all set to do an article that explains how to set it up and what a vSphere Metro Cluster looks like with it in place, but Cody Hosterman has beaten me to the punch. Given that it’s more his job than mine to write this stuff, and that he works for Pure Storage, I’m okay with that. In any case, I thought it would be worthwhile to jot down some thoughts and notes and share some links to Cody’s work, if for no other reason than it gives me an aggregation point for my thoughts.

 

Introduction

I was lucky enough to be at Pure//Accelerate in 2017 when ActiveCluster was announced and covered it at a high level here. If you’re unfamiliar with ActiveCluster, it’s “a fully symmetric active/active bidirectional replication solution that provides synchronous replication for RPO zero and automatic transparent failover for RTO zero. ActiveCluster spans multiple sites enabling clustered arrays and clustered ESXi hosts to be used to deploy flexible active/active datacenter configurations.” (https://kb.vmware.com/s/article/51656).

[image courtesy of Pure Storage]

 

Components

There are a few bits that are needed to make ActiveCluster work (besides Purity 5.0 on your FlashArray):

  • Replication Network;
  • Pods; and
  • Pure1 Cloud Mediator.

 

Replication Network

The replication network is used for the initial asynchronous transfer of data to stretch a pod, to synchronously transfer data and configuration information between arrays, and to resynchronise a pod. For this network to work, you should note the following criteria apply:

  • The maximum tolerable RTT is 5ms between clustered FlashArrays;
  • 4x 10GbE replication ports per array (two per controller). Two replication ports per controller are required to ensure redundant access from the primary controller to the other array;
  • 4x dedicated replication IP addresses per array;
  • A redundant, switched replication network. Direct connection of FlashArrays for replication is not supported; and
  • Adequate bandwidth between arrays to support bi-directional synchronous writes and bandwidth for resynchronizing. This depends on the write rate of the hosts at both sites.

So, you need to know (and understand) your workload, and you need some reasonable bandwidth between the arrays. This shouldn’t be unexpected, but it’s clearly well suited to a metro deployment.
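The criteria above lend themselves to a simple pre-flight check before you get too far into a design. This is a sketch under stated assumptions: `ReplicationLink` and `preflight` are hypothetical names (this is not a Pure Storage tool or API), the 5ms, four ports, four IPs and switched-network values come from the list above, and the bandwidth rule is my own conservative simplification (the link should exceed the combined peak write rate of both sites, with anything left over being your headroom for resynchronisation).

```python
from dataclasses import dataclass

MAX_RTT_MS = 5.0      # maximum tolerable RTT between clustered arrays
REQUIRED_PORTS = 4    # replication ports per array (two per controller)
REQUIRED_IPS = 4      # dedicated replication IP addresses per array


@dataclass
class ReplicationLink:
    rtt_ms: float
    replication_ports: int
    replication_ips: int
    switched: bool
    link_mbps: float
    peak_write_mbps_site_a: float
    peak_write_mbps_site_b: float


def preflight(link):
    """Return a list of problems; an empty list means the basics are covered."""
    problems = []
    if link.rtt_ms > MAX_RTT_MS:
        problems.append(f"RTT {link.rtt_ms}ms exceeds {MAX_RTT_MS}ms")
    if link.replication_ports < REQUIRED_PORTS:
        problems.append(f"only {link.replication_ports} replication ports per array")
    if link.replication_ips < REQUIRED_IPS:
        problems.append(f"only {link.replication_ips} replication IPs per array")
    if not link.switched:
        problems.append("direct-connected arrays are not supported")
    # Assumption: size the link against the combined peak write rate of both sites.
    needed = link.peak_write_mbps_site_a + link.peak_write_mbps_site_b
    if link.link_mbps <= needed:
        problems.append(f"link {link.link_mbps}Mbps leaves no headroom over {needed}Mbps of writes")
    return problems


if __name__ == "__main__":
    metro = ReplicationLink(2.0, 4, 4, True, 10000, 1500, 1000)
    print(preflight(metro))
```

The numbers you feed in are the hard part. If you don't know the peak write rate at each site, that's the first thing to go and measure.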

 

Pods

A Pod is a replication namespace. Once a pod is created, the pod (and the volumes inside it) can be controlled from either FlashArray. If you create a snapshot, that snapshot is created on both sides. If snapshots exist on the volume before it’s added to the pod, those snapshots will be copied over when you add it in. The pod itself acts as a consistency group.

 

Pure1 Cloud Mediator

The Pure1 Cloud Mediator is used to arbitrate split-brain scenarios. It sits in the cloud and keeps an eye on stuff. Think of it as the Vanilla Ice of the ActiveCluster (before he went off and did moto-x and renovation shows). For “dark” sites, an on-premises mediator (VM) can also be deployed.

 

A Few Other Notes

A few other things to note about the behaviour of ActiveCluster:

  • Data reduction is performed independently between arrays. This is cool because you might have a mix of workloads at each data centre;
  • If the arrays lose connection to the mediator they will continue to serve data and synchronously replicate as long as array to array communication is active; and
  • If both arrays lose communication with each other and with the mediator, this is a dual failure and both the mirrored volumes become unavailable until communication with the other array or the mediator can be re-established. Non-mirrored volumes would not be affected in this instance and would still be accessible.
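The failure behaviour in the notes above can be written down as a small decision function, seen from the point of view of one array. This is a sketch of my reading of that behaviour and not Pure's implementation. The one thing I've assumed beyond the notes is the middle case: when the arrays can't see each other but can still reach the mediator, I've modelled arbitration as a `wins_arbitration` flag, with only the winner keeping the mirrored volumes online. The state names are made up.

```python
def mirrored_volume_state(peer_reachable, mediator_reachable, wins_arbitration=False):
    """State of mirrored (pod) volumes on one array for a given failure."""
    if peer_reachable:
        # Array-to-array link is up: keep serving and replicating
        # synchronously, with or without the mediator.
        return "online-in-sync"
    if mediator_reachable and wins_arbitration:
        # Assumption: the mediator picks one side to carry on alone,
        # which is how split-brain is avoided.
        return "online-alone"
    # Either the mediator picked the other side, or this array can reach
    # neither the peer nor the mediator (the dual failure case).
    return "offline"


def non_mirrored_volume_state():
    # Volumes outside a stretched pod don't depend on the peer or mediator.
    return "online"


if __name__ == "__main__":
    for peer in (True, False):
        for mediator in (True, False):
            state = mirrored_volume_state(peer, mediator, wins_arbitration=True)
            print(f"peer={peer} mediator={mediator} -> {state}")
```

Walking through a table like this with the application owners is a cheap way to find out whether everyone has the same expectations about what happens when a link goes away.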

 

Disaster Avoidance Or Recovery?

Before deploying ActiveCluster, you should think about what kind of goal you’re trying to achieve. Disaster Avoidance assumes that some element of the primary site (Site A) is unavailable due to a disaster. DA uses synchronous replication only and requires a stretched cluster technology (such as VMware vSphere Metro Cluster) to provide active / active workload availability across both sites. Disaster Recovery, on the other hand, assumes that workloads are deployed in an active / passive configuration across sites. There are advantages to each approach, depending on what your recovery point objective (RPO) is, and what your recovery time objective (RTO) is. If you have a very low RPO and RTO requirement, the added expense of deploying a synchronous replication solution (not the Pure bit, but the supporting infrastructure) is worth it. If you have a greater tolerance for a higher RPO and / or RTO, an asynchronous solution (and the less stringent replication network requirements) may be a better fit for you.
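As a rough way of framing that choice, here's a sketch of the decision as code. Treat the thresholds as placeholders: the 5 minute "low RTO" cut-off is something I've invented for the example, `suggest_approach` is a hypothetical helper, and the 5ms figure is the ActiveCluster replication RTT limit mentioned earlier. Your own numbers should come from the business, not from a blog post.

```python
from datetime import timedelta

MAX_SYNC_RTT_MS = 5.0  # RTT limit for synchronous replication between arrays


def suggest_approach(rpo, rto, rtt_ms, low_rto=timedelta(minutes=5)):
    """Very rough first cut at DA versus DR for a single workload."""
    wants_avoidance = rpo == timedelta(0) and rto <= low_rto
    if not wants_avoidance:
        # Some tolerance for data loss or downtime: active / passive with
        # asynchronous replication and orchestrated recovery may be enough.
        return "disaster-recovery"
    if rtt_ms > MAX_SYNC_RTT_MS:
        # The requirement says synchronous, but the sites are too far apart.
        return "revisit-requirements"
    # RPO 0 and a very low RTO, and the link can support it:
    # stretched cluster on synchronous replication.
    return "disaster-avoidance"


if __name__ == "__main__":
    print(suggest_approach(timedelta(0), timedelta(minutes=1), rtt_ms=2.0))
    print(suggest_approach(timedelta(minutes=15), timedelta(hours=4), rtt_ms=2.0))
    print(suggest_approach(timedelta(0), timedelta(minutes=1), rtt_ms=12.0))
```

The third case is the interesting one. When the requirement and the physics disagree, it's the requirement (or the site selection) that has to change.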

You should also think about whether the topology you’re deploying is Uniform or Non-Uniform. A Uniform configuration provides hosts with access across Sites. This requires a bit more investment in terms of stretched FC fabrics (assuming you’re using FC and not iSCSI). This is generally the topology deployed for metro clusters.

You might decide, however, to deploy a Non-Uniform configuration for simpler disaster recovery. In that case, there’s no requirement to have cross-site FC links in place, but your time to recover will be impacted. You’ll also want to look at something like VMware Site Recovery Manager to orchestrate the recovery of workloads at the secondary site.

 

Conclusion

Whilst I think ActiveCluster is a very neat piece of technology, you should be doing a whole lot of thinking about other (possibly very boring) stuff before you take the plunge and decide to deploy vMSC sitting on an ActiveCluster environment. Disaster Avoidance (and Recovery) require a lot of planning and understanding of what’s important to your business before you deploy a solution. In the next little while I hope to be able to report back with some results from testing, and talk a bit about other protection scenarios, including metro clusters with asynchronous protection off to the side.

Zerto Analytics – Seeing Is Understanding

I attended VMworld US in August and had hoped to catch up with Zerto regarding their latest product update (the snappily titled Zerto Virtual Replication 5.5). Unfortunately there were some scheduling issues and we were unable to meet up. I was, however, briefed by them a few weeks later on some of the new features, particularly around the Zerto Analytics capability. This is a short post that focuses primarily on that part of the announcement.

 

Incremental But Important Announcement

If you’re unfamiliar with Zerto, they provide cloud and hypervisor-based workload replication for disaster recovery. They’ve been around since 2010, and the product certainly has its share of fans. The idea behind Zerto Analytics, according to Zerto, is that it “provides real-time and historical analytics on the status and health of multi-site, multi-cloud environments”.

It is deployed on Zerto’s new SaaS platform, is accessible to all Zerto VR customers, and, according to Zerto, “you will be able to quickly visualize your entire infrastructure from a single pane of glass”.

 

The Value

DR is a vital function that a whole bunch of companies don’t understand terribly well. Zerto provide a reasonably comprehensive solution for companies looking to protect their hypervisor-based workloads in multiple locations while leveraging a simple to use interface for recovery, because when it all goes wrong you want it to be easy to come back. The cool thing about Zerto Analytics is that it gives you more than the standard issue status reporting you’ve previously enjoyed. Instead, you can go through historical data to get a better understanding of the replication requirements of your workloads, and the hot and cold times for workloads. I think this is super useful when it comes to (potentially) understanding when planned maintenance needs to occur, and when might be a good time to schedule your test recoveries or data migration activities.
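To show the sort of thing you can do once you have historical data to hand, here's a small sketch that finds the quietest window in a day of hourly change rates. It doesn't use the Zerto Analytics API (I'm assuming you've exported the figures somehow), and `quietest_window` and the sample numbers are invented for the example.

```python
def quietest_window(hourly_change_mb, window_hours):
    """Return (start_hour, total_mb) for the quietest run of consecutive hours.

    The window is allowed to wrap past midnight.
    """
    hours = len(hourly_change_mb)
    best_start, best_total = 0, None
    for start in range(hours):
        total = sum(
            hourly_change_mb[(start + offset) % hours]
            for offset in range(window_hours)
        )
        if best_total is None or total < best_total:
            best_start, best_total = start, total
    return best_start, best_total


if __name__ == "__main__":
    # Hypothetical day: steady 50 MB/hour of change, with a lull from 02:00 to 05:00.
    rates = [50] * 24
    rates[2] = rates[3] = rates[4] = 5
    start, total = quietest_window(rates, window_hours=3)
    print(f"Quietest 3 hour window starts at {start:02d}:00 ({total} MB of change)")
```

That's a trivial example, but it's the same question you'd be asking of the real data: when is the cold time, and is it long enough to fit a test recovery into?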

There’s never a good time for a disaster. That’s why they call them disasters. But the more information you have available at the time of a disaster, the better the chances are of you coming out the other end in good shape. The motto at my daughters’ school is “Scientia est Potestas”. This doesn’t actually mean “Science is Potatoes” but is Latin for “Knowledge is Power”. As with most things in IT (and life), a little bit of extra knowledge (in the form of insight and data) can go a long way. Zerto are keen, with this release, to improve the amount of visibility you have into your environment from a DR perspective. This can only be a good thing, particularly when you can consume it across a decent range of platforms.

DR isn’t just about the technology by any stretch. You need an extensive understanding of what’s happening in your environment, and you need to understand what happens to people when things go bang. But one of the building blocks for success, in my opinion, is providing a solid platform for recovery in the event that something goes pear-shaped. Zerto isn’t for everyone, but I get the impression anecdotally that they’re doing some pretty good stuff around making what can be a bad thing into a more positive experience.

 

Read More

Technical documentation on Zerto Virtual Replication 5.5 can be found here. There’s also a great demo on YouTube that you can see here.

2017 – The New What Next

I’m not terribly good at predicting the future, particularly when it comes to technology trends. I generally prefer to leave that kind of punditry to journalists who don’t mind putting it out there and are happy to be proven wrong on the internet time and again. So why do a post referencing a great Hot Water Music album? Well, one of the PR companies I deal with regularly sent me a few quotes through from companies that I’m generally interested in talking about. And let’s face it, I haven’t had a lot to say in the last little while due to day job commitments and the general malaise I seem to suffer from during the onset of summer in Brisbane (no, I really don’t understand the concept of Christmas sweaters in the same way my friends in the Northern Hemisphere do).

Long intro for a short post? Yes. So I’ll get to the point. Here’s one of the quotes I was sent. “As concerns of downtime grow more acute in companies around the globe – and the funds for secondary data centers shrink – companies will be turning to DRaaS. While it’s been readily available for years, the true apex of adoption will hit in 2017-2018, as prices continue to drop and organizations become more risk-averse. There are exceptional technologies out there that can solve the business continuity problem for very little money in a very short time.” This was from Justin Giardina, CTO of iland. I was fortunate enough to meet Justin at the Nimble Storage Predictive Flash launch event in February this year. Justin is a switched on guy and while I don’t want to give his company too much air time (they compete in places with my employer), I think he’s bang on the money with his assessment of the state of play with DR and market appetite for DR as a Service.

I think there are a few things at play here, and it’s not all about technology (because it rarely is). The CxO’s fascination with cloud has been (rightly or wrongly) fiscally focused, with a lot of my customers thinking that public cloud could really help reduce their operating costs. I don’t want to go too much into the accuracy of that idea, but I know that cost has been front and centre for a number of customers for some time now. Five years ago I was working in a conservative environment where we had two production DCs and a third site dedicated to data protection infrastructure. They’ve since reduced that to one production site and are leveraging outsourced providers for both DR and data protection capabilities. The workload hasn’t changed significantly, nor has the requirement to have the data protected and recoverable.

Rightly or wrongly the argument for appropriate disaster recovery infrastructure seems to be a difficult one to make in organisations, even those that have been exposed to disaster and have (through sheer dumb luck) survived the ordeal. I don’t know why it is so difficult for people to understand that good DR and data protection is worth it. I suppose it is the same as me taking a calculated risk on my insurance every year and paying a lower annual rate and gambling on the fact that I won’t have to make a claim and be exposed to higher premiums.

It’s not just about cost though. I’ve spoken to plenty of people who just don’t know what they’re doing when it comes to DR and data protection. And some of these people have been put in the tough position of having lost some data, or had a heck of a time recovering after a significant equipment failure. In the same way that I have someone come and look at my pool pump when water is coming out of the wrong bit, these companies are keen to get people in who know what they’re doing. If you think about it, it’s a smart move. While it can be hard to admit, sometimes knowing your limitations is actually a good thing.

It’s not that we don’t have the technology, or the facilities (even in BrisVegas) to do DR and data protection pretty well nowadays. In most cases it’s easier and more reliable than it ever was. But, like on-premises email services, it seems to be a service that people are happy to make someone else’s problem. I don’t have an issue with that as a concept, as long as you understand that you’re only outsourcing some technology and processes; you’re not magically doing away with the risk and result when something goes pear-shaped. If you’re a small business without a dedicated team of people to look after your stuff, it makes a lot of sense. Even the bigger players can benefit from making it someone else’s thing to worry about. Just make sure you know what you’re getting into.

Getting back to the original premise of this post, I agree with Justin that we’re at a tipping point regarding DRaaS adoption, and I think 2017 is going to be really interesting in terms of how companies make use of this technology to protect their assets and keep costs under control.