Pure Storage ActiveCluster – Background Information

I’ve been doing a bunch of research into Pure Storage’s ActiveCluster product recently. I was all set to do an article that explains how to set it up and what a vSphere Metro Cluster looks like with it in place, but Cody Hosterman has beaten me to the punch. Given that it’s more his job than mine to write this stuff, and that he works for Pure Storage, I’m okay with that. In any case, I thought it would be worthwhile to jot down some thoughts and notes and share some links to Cody’s work, if for no other reason than it gives me an aggregation point for my thoughts.


Introduction

I was lucky enough to be at Pure//Accelerate in 2017 when ActiveCluster was announced and covered it at a high level here. If you’re unfamiliar with ActiveCluster, it’s “a fully symmetric active/active bidirectional replication solution that provides synchronous replication for RPO zero and automatic transparent failover for RTO zero. ActiveCluster spans multiple sites enabling clustered arrays and clustered ESXi hosts to be used to deploy flexible active/active datacenter configurations.” (https://kb.vmware.com/s/article/51656).

[image courtesy of Pure Storage]


Components

There are a few bits that are needed to make ActiveCluster work (besides Purity 5.0 on your FlashArray):

  • Replication Network;
  • Pods; and
  • Pure1 Cloud Mediator.


Replication Network

The replication network is used for the initial asynchronous transfer of data to stretch a pod, to synchronously transfer data and configuration information between arrays, and to resynchronise a pod. For this network to work, you should note that the following criteria apply:

  • The maximum tolerable RTT is 5ms between clustered FlashArrays;
  • 4x 10GbE replication ports per array (two per controller). Two replication ports per controller are required to ensure redundant access from the primary controller to the other array;
  • 4x dedicated replication IP addresses per array;
  • A redundant, switched replication network. Direct connection of FlashArrays for replication is not supported; and
  • Adequate bandwidth between arrays to support bi-directional synchronous writes and bandwidth for resynchronizing. This depends on the write rate of the hosts at both sites.

So, you need to know (and understand) your workload, and you need some reasonable bandwidth between the arrays. This shouldn’t be unexpected, but it’s clearly well suited to a metro deployment.
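To make that "know your workload" point a bit more concrete, here's a rough back-of-the-envelope check you could start from when sizing a replication link. It's plain Python with entirely hypothetical numbers, and the 30% resynchronisation headroom figure is my assumption, not a Pure recommendation:

```python
# Back-of-the-envelope check of an ActiveCluster replication link.
# All inputs are hypothetical examples; plug in your own measurements.

MAX_RTT_MS = 5.0  # maximum supported round-trip time between the arrays


def link_is_adequate(site_a_write_mbps, site_b_write_mbps,
                     link_bandwidth_mbps, measured_rtt_ms,
                     resync_headroom=0.3):
    """Rough sanity check: the link has to carry the write rate of both
    sites (writes are mirrored in both directions) plus some headroom
    for resynchronising a pod after an outage."""
    required_mbps = (site_a_write_mbps + site_b_write_mbps) * (1 + resync_headroom)
    if measured_rtt_ms > MAX_RTT_MS:
        return False, f"RTT of {measured_rtt_ms}ms exceeds the {MAX_RTT_MS}ms limit"
    if required_mbps > link_bandwidth_mbps:
        return False, f"need roughly {required_mbps:.0f} Mb/s, link provides {link_bandwidth_mbps} Mb/s"
    return True, "link looks adequate for this workload"


# Example: ~400 Mb/s of host writes at each site, a 2 Gb/s link, 2.1ms RTT
print(link_is_adequate(400, 400, 2000, 2.1))
```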


Pods

A pod is a replication namespace. Once a pod has been created and stretched across two arrays, the pod (and the volumes inside it) can be controlled from either FlashArray. If you create a snapshot of a volume in the pod, that snapshot is created on both sides. If snapshots exist on the volume before it’s added to the pod, those snapshots will be copied over when you add it in. The pod itself acts as a consistency group.
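If it helps to see that workflow laid out, here's a toy sketch of the sequence. The class and method names are mine, not the Purity CLI or REST API, so treat it as illustration only:

```python
# Toy model of the pod workflow described above. The names are hypothetical,
# not the Purity CLI or REST API; it's just to illustrate the sequence.

class Pod:
    """A pod is a replication namespace that also acts as a consistency group."""

    def __init__(self, name):
        self.name = name
        self.volumes = []          # volumes (and their snapshots) in the pod
        self.arrays = ["array-a"]  # a pod starts life on a single FlashArray

    def add_volume(self, volume):
        # Any snapshots that already exist on the volume come along with it.
        self.volumes.append(volume)

    def stretch(self, remote_array):
        # The initial transfer is asynchronous; once the pod is in sync it
        # (and its volumes) can be controlled from either array.
        self.arrays.append(remote_array)


pod = Pod("oracle-prod")
pod.add_volume("oracle-data-01")
pod.stretch("array-b")
```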


Pure1 Cloud Mediator

The Pure1 Cloud Mediator is used to arbitrate split-brain scenarios. It sits in the cloud and keeps an eye on stuff. Think of it as the Vanilla Ice of the ActiveCluster (before he went off and did moto-x and renovation shows). For “dark” sites, an on-premises mediator (VM) can also be deployed.
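If you're wondering what that arbitration actually looks like, here's a minimal sketch of the idea: when the arrays can't see each other, each one races to the mediator, and only the winner keeps serving the stretched pod, which is what prevents split-brain. This is hypothetical Python, not anything Pure ships:

```python
# Hypothetical sketch of the mediator's job: if the arrays lose sight of
# each other, each one races to the mediator and only the winner keeps
# serving the stretched pod.

def arbitrate(mediator_state, pod, requesting_array):
    """First array to reach the mediator for a given pod wins; the other
    array stops serving that pod until replication is re-established."""
    winner = mediator_state.setdefault(pod, requesting_array)  # first claim sticks
    return winner == requesting_array


mediator_state = {}  # stands in for the Pure1 Cloud (or on-premises) mediator
print(arbitrate(mediator_state, "pod1", "array-a"))  # True  -> keeps serving pod1
print(arbitrate(mediator_state, "pod1", "array-b"))  # False -> stops serving pod1
```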


A Few Other Notes

A few other things to note about the behaviour of ActiveCluster:

  • Data reduction is performed independently between arrays. This is cool because you might have a mix of workloads at each data centre;
  • If the arrays lose their connection to the mediator, they will continue to serve data and synchronously replicate as long as array-to-array communication is active; and
  • If both arrays lose communication with each other and with the mediator, that’s a dual failure, and the mirrored volumes on both arrays become unavailable until communication with the other array or the mediator can be re-established. Non-mirrored volumes would not be affected in this instance and would still be accessible (see the sketch below).
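To make the failure handling a little more concrete, here are those last two bullets restated as a tiny lookup table. Again, just a sketch, not Pure code:

```python
# The failure behaviour above as a tiny truth table:
# (array-to-array link up?, mediator reachable?) -> what happens to mirrored volumes

BEHAVIOUR = {
    (True,  True):  "serve data and replicate synchronously (normal operation)",
    (True,  False): "keep serving and replicating; losing the mediator alone is tolerated",
    (False, True):  "mediator arbitrates; the winning array keeps the pod online",
    (False, False): "dual failure: mirrored volumes offline until the other array or the "
                    "mediator is reachable again (non-mirrored volumes are unaffected)",
}


def mirrored_volume_state(array_link_ok, mediator_reachable):
    return BEHAVIOUR[(array_link_ok, mediator_reachable)]


print(mirrored_volume_state(False, False))
```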


Disaster Avoidance Or Recovery?

Before deploying ActiveCluster, you should think about what kind of goal you’re trying to achieve. Disaster Avoidance assumes that some element of the primary site (Site A) may become unavailable due to a disaster, and aims to keep workloads running when that happens. DA uses synchronous replication only and requires a stretched cluster technology (such as VMware vSphere Metro Cluster) to provide active/active workload availability across both sites. Disaster Recovery, on the other hand, assumes that workloads are deployed in an active/passive configuration across sites. There are advantages to each approach, depending on what your recovery point objective (RPO) is, and what your recovery time objective (RTO) is. If you have a very low RPO and RTO requirement, the added expense of deploying a synchronous replication solution (not the Pure bit, but the supporting infrastructure) is worth it. If you have a greater tolerance for a higher RPO and/or RTO, an asynchronous solution (and the less stringent replication network requirements) may be a better fit for you.
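If you want to boil that trade-off down to something you can argue about with the business, it looks roughly like this. A toy decision helper, nothing official:

```python
# Toy decision helper for the trade-off described above -- not anything official.

def replication_approach(rpo_minutes, rto_minutes):
    # An RPO and RTO of zero is what synchronous replication plus a
    # stretched cluster (e.g. vMSC on ActiveCluster) buys you, at the
    # cost of the supporting infrastructure (links, fabrics, and so on).
    if rpo_minutes == 0 and rto_minutes == 0:
        return "synchronous replication with a stretched cluster"
    # If the business can tolerate some data loss and downtime, an
    # asynchronous, active/passive approach is usually a better fit.
    return "asynchronous replication in an active/passive configuration"


print(replication_approach(0, 0))
print(replication_approach(60, 240))
```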

You should also think about whether the topology you’re deploying is Uniform or Non-Uniform. A Uniform configuration gives hosts at each site paths to the arrays at both sites. This requires a bit more investment in terms of stretched FC fabrics (assuming you’re using FC and not iSCSI). This is generally the topology deployed for metro clusters.

You might decide, however, to deploy a Non-Uniform configuration for simpler disaster recovery. In that case, there’s no requirement to have cross-site FC links in place, but your time to recover will be impacted. You’ll also want to look at something like VMware Site Recovery Manager to orchestrate the recovery of workloads at the secondary site.


Conclusion

Whilst I think ActiveCluster is a very neat piece of technology, you should be doing a whole lot of thinking about other (possibly very boring) stuff before you take the plunge and decide to deploy vMSC sitting on an ActiveCluster environment. Disaster Avoidance (and Recovery) require a lot of planning and understanding of what’s important to your business before you deploy a solution. In the next little while I hope to be able to report back with some results from testing, and talk a bit about other protection scenarios, including metro clusters with asynchronous protection off to the side.

Zerto Analytics – Seeing Is Understanding

I attended VMworld US in August and had hoped to catch up with Zerto regarding their latest product update (the snappily titled Zerto Virtual Replication 5.5). Unfortunately there were some scheduling issues and we were unable to meet up. I was, however, briefed by them a few weeks later on some of the new features, particularly around the Zerto Analytics capability. This is a short post that focuses primarily on that part of the announcement.


Incremental But Important Announcement

If you’re unfamiliar with Zerto, they provide cloud and hypervisor-based workload replication for disaster recovery. They’ve been around since 2010, and the product certainly has its share of fans. The idea behind Zerto Analytics, according to Zerto, is that it “provides real-time and historical analytics on the status and health of multi-site, multi-cloud environments”.

It is deployed on Zerto’s new SaaS platform, is accessible to all Zerto VR customers, and, according to Zerto, “you will be able to quickly visualize your entire infrastructure from a single pane of glass”.


The Value

DR is a vital function that a whole bunch of companies don’t understand terribly well. Zerto provide a reasonably comprehensive solution for companies looking to protect their hypervisor-based workloads in multiple locations while leveraging a simple-to-use interface for recovery, because when it all goes wrong you want it to be easy to come back. The cool thing about Zerto Analytics is that it gives you more than the standard-issue status reporting you’ve previously enjoyed. Instead, you can go through historical data to get a better understanding of the replication requirements of your workloads, and their hot and cold periods. I think this is super useful when it comes to (potentially) understanding when planned maintenance needs to occur, and when a good time is to schedule in your test recoveries or data migration activities.

There’s never a good time for a disaster. That’s why they call them disasters. But the more information you have available at the time of a disaster, the better the chances are of you coming out the other end in good shape. The motto at my daughters’ school is “Scientia est Potestas”. This doesn’t actually mean “Science is Potatoes” but is Latin for “Knowledge is Power”. As with most things in IT (and life), a little bit of extra knowledge (in the form of insight and data) can go a long way. Zerto are keen, with this release, to improve the amount of visibility you have into your environment from a DR perspective. This can only be a good thing, particularly when you can consume it across a decent range of platforms.

DR isn’t just about the technology by any stretch. You need an extensive understanding of what’s happening in your environment, and you need to understand what happens to people when things go bang. But one of the building blocks for success, in my opinion, is providing a solid platform for recovery in the event that something goes pear-shaped. Zerto isn’t for everyone, but I get the impression anecdotally that they’re doing some pretty good stuff around making what can be a bad thing into a more positive experience.


Read More

Technical documentation on Zerto Virtual Replication 5.5 can be found here. There’s also a great demo on YouTube that you can see here.

2017 – The New What Next

I’m not terribly good at predicting the future, particularly when it comes to technology trends. I generally prefer to leave that kind of punditry to journalists who don’t mind putting it out there and are happy to be proven wrong on the internet time and again. So why do a post referencing a great Hot Water Music album? Well, one of the PR companies I deal with regularly sent me a few quotes through from companies that I’m generally interested in talking about. And let’s face it, I haven’t had a lot to say in the last little while due to day job commitments and the general malaise I seem to suffer from during the onset of summer in Brisbane (no, I really don’t understand the concept of Christmas sweaters in the same way my friends in the Northern Hemisphere do).

Long intro for a short post? Yes. So I’ll get to the point. Here’s one of the quotes I was sent. “As concerns of downtime grow more acute in companies around the globe – and the funds for secondary data centers shrink – companies will be turning to DRaaS. While it’s been readily available for years, the true apex of adoption will hit in 2017-2018, as prices continue to drop and organizations become more risk-averse. There are exceptional technologies out there that can solve the business continuity problem for very little money in a very short time.” This was from Justin Giardina, CTO of iland. I was fortunate enough to meet Justin at the Nimble Storage Predictive Flash launch event in February this year. Justin is a switched on guy and while I don’t want to give his company too much air time (they compete in places with my employer), I think he’s bang on the money with his assessment of the state of play with DR and market appetite for DR as a Service.

I think there are a few things at play here, and it’s not all about technology (because it rarely is). The CxO’s fascination with cloud has been (rightly or wrongly) fiscally focused, with a lot of my customers thinking that public cloud could really help reduce their operating costs. I don’t want to go too much into the accuracy of that idea, but I know that cost has been front and centre for a number of customers for some time now. Five years ago I was working in a conservative environment where we had two production DCs and a third site dedicated to data protection infrastructure. They’ve since reduced that to one production site and are leveraging outsourced providers for both DR and data protection capabilities. The workload hasn’t changed significantly, nor has the requirement to have the data protected and recoverable.

Rightly or wrongly, the argument for appropriate disaster recovery infrastructure seems to be a difficult one to make in organisations, even those that have been exposed to disaster and have (through sheer dumb luck) survived the ordeal. I don’t know why it is so difficult for people to understand that good DR and data protection is worth it. I suppose it’s the same as me taking a calculated risk on my insurance every year: paying a lower annual rate and gambling that I won’t have to make a claim and be exposed to higher premiums.

It’s not just about cost though. I’ve spoken to plenty of people who just don’t know what they’re doing when it comes to DR and data protection. And some of these people have been put in the tough position of having lost some data, or had a heck of a time recovering after a significant equipment failure. In the same way that I have someone come and look at my pool pump when water is coming out of the wrong bit, these companies are keen to get people in who know what they’re doing. If you think about it, it’s a smart move. While it can be hard to admit, sometimes knowing your limitations is actually a good thing.

It’s not that we don’t have the technology, or the facilities (even in BrisVegas), to do DR and data protection pretty well nowadays. In most cases it’s easier and more reliable than it ever was. But, like on-premises email services, it seems to be a service that people are happy to make someone else’s problem. I don’t have an issue with that as a concept, as long as you understand that you’re only outsourcing some technology and processes; you’re not magically doing away with the risk and the result when something goes pear-shaped. If you’re a small business without a dedicated team of people to look after your stuff, it makes a lot of sense. Even the bigger players can benefit from making it someone else’s thing to worry about. Just make sure you know what you’re getting into.

Getting back to the original premise of this post, I agree with Justin that we’re at a tipping point regarding DRaaS adoption, and I think 2017 is going to be really interesting in terms of how companies make use of this technology to protect their assets and keep costs under control.