VMware Cloud Disaster Recovery – Ransomware Recovery Activation

One of the cool features of VMware Cloud Disaster Recovery (VCDR) is the Enhanced Ransomware Recovery capability. This is a quick post to talk through how to turn it on in your VCDR environment, and some things you need to consider.

 

Organization Settings

The first step is to enable the ransomware services integration in your VCDR dashboard. You’ll need to be an Organisation owner to do this. Go to Settings, and click on Ransomware Recovery Services.

You’ll then have the option to select where the data analysis is performed.

You’ll also need to tick some boxes acknowledging that you understand that an appliance will be deployed in each of your Recovery SDDCs, that Windows VMs will get a sensor installed, and that any security sensors already installed on those VMs may clash with Carbon Black.

Click on Activate and it will take a few moments. If it takes much longer than that, you’ll need to talk to someone in support.

Once the analysis integration is activated, you can then activate NSX Advanced Firewall. Page 245 of the PDF documentation covers this better than I can, but note that NSX Advanced Firewall is a chargeable service (if you don’t already have a subscription attached to your Recovery SDDC). There’s some great documentation here on what you do and don’t have access to if you allow the activation of NSX Advanced Firewall.

Like your favourite TV chef would say, here’s one I’ve prepared earlier.

Recovery Plan Configuration

Once the services integration is done, you can configure Ransomware Recovery on a per Recovery Plan basis.

Start by selecting Activate ransomware recovery. You’ll then need to acknowledge that this is a chargeable feature.

You can also choose whether you want to use integrated analysis (i.e. Carbon Black Cloud), and whether you want to manually remove other security sensors when you recover. You can also choose to use your own tools if you need to.

And that’s it from a configuration perspective. The actual recovery bit? A story for another time.

VMware Cloud Disaster Recovery – Firewall Ports

I published an article a while ago on getting started with VMware Cloud Disaster Recovery (VCDR). One thing I didn’t cover in any real depth was the connectivity requirements between on-premises and the VCDR service. VMware has worked pretty hard to ensure this is streamlined for users, but it’s still something you need to pay attention to. I was helping a client work through this process for a proof of concept recently and thought I’d cover it off more clearly here. The diagram below highlights the main components you need to look at, namely:

  • The Cloud File System (frequently referred to as the SCFS);
  • The VMware Cloud DR SaaS Orchestrator (the Orchestrator); and
  • VMware Cloud DR Auto-support.

It’s important to note that the first two services are assigned IP addresses when you enable the service in the Cloud Service Console, and the Auto-support service has three public IP addresses that you need to be able to communicate with. All of this happens outbound over TCP 443. The Auto-support service is not required, but it is strongly recommended, as it makes troubleshooting issues with the service much easier, and provides VMware with an opportunity to proactively resolve cases. Network connectivity requirements are documented here.

[image courtesy of VMware]

So how do I know my firewall rules are working? The first sign that there might be a problem is that the DRaaS Connector deployment will fail to communicate with the Orchestrator at some point (usually towards the end), and you’ll see a message similar to the following. “ERROR! VMware Cloud DR authentication is not configured. Contact support.”

How can you troubleshoot the issue? Fortunately, we have a tool called the DRaaS Connector Connectivity Check CLI that you can run to check what’s not working. In this instance, we suspected an issue with outbound communication, and ran the following command on the console of the DRaaS Connector to check:

drc network test --scope cloud

This returned a status of “reachable” for the Orchestrator and Auto-support services, but the SCFS was unreachable. Some negotiations with the firewall team, and we were up and running.
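If you want a quick way to sanity-check the outbound rules yourself before (or after) deploying the connector, something like the Python sketch below, run from a machine on the same network segment as the DRaaS Connector, will do the trick. The hostnames and IP addresses in it are placeholders rather than real VCDR endpoints (substitute the Orchestrator, SCFS, and Auto-support addresses assigned to your deployment), and the drc tool remains the authoritative check.

```python
# Quick outbound TCP 443 reachability check for the three VCDR endpoints.
# The names and addresses below are placeholders only; use the Orchestrator,
# SCFS, and Auto-support addresses assigned to your own deployment.
import socket

ENDPOINTS = {
    "Orchestrator": "orchestrator.example.vcdr.vmware.com",  # placeholder
    "SCFS": "203.0.113.10",                                   # placeholder
    "Auto-support": "203.0.113.20",                           # placeholder
}

def reachable_on_443(host: str, timeout: float = 5.0) -> bool:
    """Return True if an outbound TCP connection to host:443 succeeds."""
    try:
        with socket.create_connection((host, 443), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, host in ENDPOINTS.items():
        status = "reachable" if reachable_on_443(host) else "UNREACHABLE"
        print(f"{name:<14} {host:<45} {status}")
```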

Note, also, that VMware supports the use of proxy servers for communicating with Auto-support services, but I don’t believe a proxy is supported for Orchestrator and SCFS communications. If you’re worried about VCDR using up all your bandwidth, you can throttle it. Details on how to do that can be found here. The recommended minimum is 100Mbps, but you can go as low as 20Mbps if required.

Datadobi Announces DobiProtect

Datadobi recently announced DobiProtect. I had the opportunity to speak with Michael Jack and Carl D’Halluin about the announcement, and thought I’d share some thoughts here.

 

The Problem

Disaster Recovery

Modern disaster recovery solutions tend more towards business continuity than DR. The challenge with data replication solutions is that it’s a trivial thing to replicate corruption from your primary storage to your DR storage. Backup systems are vulnerable too, and in most instances you need to make some extra effort to ensure you’ve got a replicated catalogue, and that your backup data is not isolated. Invariably, you’ll be looking to restore to like hardware in order to reduce the recovery time. Tape is still a pain to deal with, and you’re also at the mercy of people and processes going wrong.

What Do Customers Need?

To get what you need out of a robust DR system, there are a few criteria that need to be met, including:

  • An easy way to select business-critical data;
  • A simple way to make a golden copy in native format;
  • A bunker site in a DC or cloud;
  • A manual air-gap procedure;
  • A way to restore to anything; and
  • A way to failover if required.

 

Enter DobiProtect

What Does It Do?

The idea is that you have two sites with a manual air-gap between them, usually controlled by a firewall of some type. The first site is where you run your production workload, and there’ll likely be a subset of data that is really quite important to your business. You can use DobiProtect to get that data from your production site to DR (it might even be in a bunker!). To get the data from production to DR, DobiProtect scans the data before it’s pulled across. Note that the data is pulled, not pushed. This is important as it means that there’s no obvious trace of the bunker’s existence in production.

[image courtesy of Datadobi]

If things go bang, you can recover to any NAS or object storage platform.

  • Browse golden copy
  • Select by directory structure, folder, or object patterns
  • Mounts and shares
  • Specific versions

Bonus Use Case

One of the more popular use cases that Datadobi spoke to me about was heterogeneous edge-to-core protection. Data on the edge is usually more vulnerable, and not every organisation has the funding to put robust protection mechanisms in place at every edge site to protect critical data. With the advent of COVID-19, many organisations have been pushing more data to the edge in order for remote workers to have better access to data. The challenge then becomes keeping that data protected in a reliable fashion. DobiProtect can be used at the core to pull that data back from the edge. Because it’s a software-only product, your edge storage can be anything that supports object, SMB, or NFS, and the core could be anything else. This provides a lot of flexibility, and avoids much of the expense traditionally associated with DR at edge sites.

[image courtesy of Datadobi]

 

Thoughts and Further Reading

The idea of an air-gapped site in a bunker somewhere is the sort of thing you might associate with a James Bond story. In Australia these aren’t exactly a common thing (bunkers, not James Bond stories), but Europe and the US are riddled with them. As Jack pointed out in our call, “[t]he first rule of bunker club – you don’t talk about the bunker”. Datadobi couldn’t give me a list of customers using this type of solution because none of the customers wanted people to know that they were doing things this way. It seems a bit like security via obscurity, but there’s no point painting a big target on your back or giving clues out for would-be crackers to get into your environment and wreak havoc.

The idea that your RPO is a day, rather than minutes, is also confronting for some folks. But the idea of this solution is that you’ll use it for your absolutely mission-critical, can’t-live-without-it data, not necessarily the virtual machines that you may be able to recover normally if you’re attacked or the magic black smoke escapes from one of your hosts. If you’ve gone to the trouble of looking into acquiring some rack space in a bunker, limited the people in the know to a handful, and can be bothered messing about with a manual air-gap process, the data you’re looking to protect is clearly pretty important.

Datadobi has a rich heritage in data migration for both file and object storage systems. It makes sense that eventually customer demand would drive them down this route to deliver a migration tool that essentially runs all the time as a sort of data protection tool. This isn’t designed to protect everything in your environment, but for the stuff that will ruin your business if it goes away, it’s very likely worth the effort and expense. There are some folks out there actively looking for ways to put you over a barrel, so it’s important to think about what it’s worth to your organisation to avoid that if possible.

Zerto – News From ZertoCON 2019

Zerto recently held their annual user conference (ZertoCON) in Nashville, TN. I had the opportunity to talk to Rob Strechay about some of the key announcements coming out of the event and thought I’d cover them here.

 

Key Announcements

Licensing

You can now acquire Zerto either as a perpetual license or via a subscription. There’s previously been a form of subscription pricing with Zerto, with customers renting the product via managed service providers, but this is the first time it’s being offered directly to customers. Strechay noted that Zerto is “[n]ot trying to move to a subscription-only model”, but they are keen to give customers further flexibility in how they consume the product. Note that the subscription pricing also includes maintenance and support.

7.5 Is Just Around The Corner

If it feels like 7.0 was only just delivered, that’s because it was (in April). But 7.5 is already just around the corner. They’re looking to add a bunch of features, including:

  • Deeper integration with StoreOnce from HPE using a Catalyst-based API, leveraging source-side deduplication
  • Qualification of Azure’s Data Box
  • Cloud mobility – in 7.0 they started down the path with Azure. Zerto Cloud Appliances now autoscale within Azure.

Azure Integration

There’s a lot more focus on Azure in 7.5, and Zerto are working on:

  • Managed failback / managed disks in Azure
  • Integration with Azure Active Directory
  • Adding encryption at rest in AWS, and doing some IAM integration
  • Automated driver injection on the fly as you recover into AWS (with Red Hat)

Resource Planner

Building on their previous analytics work, you’ll also (shortly) be able to download Zerto Virtual Manager. This talks to vCenter and gathers data to help customers plan their VMware to VMware (or to Azure / AWS) migrations.

VAIO

Zerto has now completed the initial certification to use VMware’s vSphere APIs for I/O Filtering (VAIO) and they’ll be leveraging these in 7.5. Strechay said they’ll probably have both versions in the product for a little while.

 

Thoughts And Further Reading

I’d spoken with Strechay previously about Zerto’s plans to compete against the “traditional” data protection vendors, and asked him what the customer response has been to Zerto’s ambitions (and execution). He said that, as they’re already off-siting data (as part of the 3-2-1 data protection philosophy), how hard is it to take it to the next level? He said a number of customers were very motivated to use long-term retention, and wanted to move on from their existing backup vendors. I’ve waxed lyrical in the past about what I thought some of the key differences were between periodic data protection, disaster recovery, and disaster avoidance. That doesn’t mean that companies like Zerto aren’t doing a pretty decent job of blurring the lines between the types of solution they offer, particularly with the data mobility capabilities built into their offerings. I think there’s a lot of scope with Zerto to move into spaces that they’ve previously only been peripherally involved in. It makes sense that they’d focus on data mobility and off-site data protection capabilities. There’s a good story developing with their cloud integration, and it seems like they’ll just continue to add features and capabilities to the product. I really like that they’re not afraid to make promises on upcoming releases and have (thus far) been able to deliver on them.

The news about VAIO certification is pretty big, and it might remove some of the pressure that potential customers have faced previously about adopting protection solutions that weren’t entirely blessed by VMware.

I’m looking forward to seeing what Zerto ends up delivering with 7.5, and I’m really enjoying the progress they’re making with both their on-premises and public cloud focused solutions. You can read Zerto’s press release here, and Andrea Mauro published a comprehensive overview here.

Disaster Recovery vs Disaster Avoidance vs Data Protection

This is another one of those rambling posts that I like to write when I’m sitting in an airport lounge somewhere and I’ve got a bit of time to kill. The versus in the title is a bit misleading too, because DR and DA are both forms of data protection. And periodic data protection (PDP) is important too. But what I wanted to write about was some of the differences between DR and DA, in particular.

TL;DR – DR is not DA, and neither of them is PDP. But you need to think about all of them at some point.

 

Terminology

I want to be clear about what I mean when I say these terms, because it seems like they can mean a lot of things to different folks.

  • Recovery Point Objective – The Recovery Point Objective (RPO) is the maximum amount of data, measured in time, that may be permanently lost during an incident. You want this to be in minutes and hours, not days or weeks (ideally). RPO 0 is the idea that no data is lost when there’s a failure. A lot of vendors will talk about “Near Zero” RPOs.
  • Recovery Time Objective – The Recovery Time Objective (RTO) is the amount of time the business can be without the service without incurring significant risks or losses. This is, essentially, how long it takes you to get back up and running after an event. You don’t really want this to be in days and weeks either. (There’s a quick worked example of both after this list.)
  • Disaster Recovery – Disaster Recovery is the ability to recover applications after a major event (think flood, fire, DC is now a hole in the ground). This normally involves a failover of workloads from one DC to another in an orchestrated fashion.
  • Disaster Avoidance – Disaster avoidance “is an anticipatory strategy that is in place in order to prevent any such instance of data breach or losses. It is a defensive, proactive approach to keeping data safe” (I’m quoting this from a great blog post on the topic here)
  • Periodic Data Protection – This is the kind of data protection activity we normally associate with “backups”. It is usually a daily activity (or perhaps as frequent as hourly) and the data is normally used for ad-hoc data file recovery requests. Some people use their backup data as an archive. They’re bad people and shouldn’t be trusted. PDP is normally separate to DA or DR solutions.
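
To make the first two terms concrete, here’s a trivial sketch (the timestamps are made up): the RPO you actually achieve is measured backwards from the incident to the last recoverable copy of the data, while the RTO is measured forwards from the incident to the point where the service is back.

```python
# Toy illustration of how RPO and RTO are measured. The timestamps are made up.
from datetime import datetime

last_good_copy   = datetime(2019, 6, 1, 23, 0)   # last replicated / backed-up point in time
incident         = datetime(2019, 6, 2, 9, 30)   # when everything went bang
service_restored = datetime(2019, 6, 2, 14, 0)   # when the business was back up and running

achieved_rpo = incident - last_good_copy      # data lost, measured in time: 10.5 hours
achieved_rto = service_restored - incident    # time the service was down: 4.5 hours

print(f"Achieved RPO: {achieved_rpo}")  # compare against the RPO you promised the business
print(f"Achieved RTO: {achieved_rto}")  # compare against the RTO you promised the business
```

If the numbers that come out of an exercise like this are bigger than the objectives you agreed with the business, either the solution or the objectives need another look.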

 

DR Isn’t The Full Answer

I’ve had some great conversations with customers recently about adding resilience to their on-premises infrastructure. It seems like an old-fashioned concept, but a number of organisations are only now seeing the benefits of adding infrastructure-level resilience to their platforms. The first conversation usually goes something like this:

Me: So what’s your key application, and what’s your resiliency requirement?

Customer: Oh, it’s definitely Application X (usually built on Oracle or using SAP or similar). It absolutely can’t go down. Ever. We need to have RPO 0 and RTO 0 for this one. Our whole business depends on it.

Me: Okay, it sounds like it’s pretty important. So what about your file server and email?

Customer: Oh, that’s not so important. We can recover those from overnight backups.

Me: But aren’t they used to store data for Application X? Don’t you have workflows that rely on email?

Customer: Oh, yeah, I guess so. But it will be too expensive to protect all of this. Can we change the RPO a bit? I don’t think the CFO will support us doing RPO 0 everywhere.

These requirements tend to change whenever we move from technical discussions to commercial discussions. In an ideal world, Martha in Accounting will have her home directory protected in a highly available fashion such that it can withstand the failure of one or more storage arrays (or data centres). The problem with this is that, if there are 1000 Marthas in the organisation, the cost of protecting that kind of data at scale becomes prohibitive, relative to the perceived value of the data. This is one of the ways I’ve seen “DR” capability added to an environment in the past. Take some older servers and put them in a site removed from the primary site, set up some scripts to copy critical data to that site, and hope nothing ever goes too wrong with the primary site.

There are obviously better ways of doing this, and common solutions may or may not involve block-level storage replication, orchestrated failover tools, and like for like compute at the secondary site (or perhaps you’ve decided to shut down test and development while you’re fixing the problem at the production site).

But what are you trying to protect against? The failure of some compute? Some storage? The network layer? A key application? All of these answers will determine the path you’ll need to go down. Keep in mind also that DR isn’t the only answer. You also need to have business continuity processes in place. A failover of workloads to a secondary site is pointless if operations staff don’t have access to a building to continue doing their work, or if people can’t work when the swipe card access machine is offline, or if your Internet feed only terminates in one DC, etc.

 

I’m Avoiding The Problem

Disaster Avoidance is what I like to call the really sexy resilience solution. You can have things go terribly wrong with your production workload and potentially still have it functioning like there was no problem. This is where hardware solutions like Pure Storage ActiveCluster or Dell EMC VPLEX can really shine, assuming you’ve partnered them with applications that have the smarts built in to leverage what they have to offer. Because that’s the real key to a successful disaster avoidance design. It’s great to have synchronous replication and cache-consistency across DCs, but if your applications don’t know what to do when a leg goes missing, they’ll fall over. And if you don’t have other protection mechanisms in place, such as periodic data protection, then your synchronous block replication solution will merrily synchronise malware or corrupted data from one site to another in the blink of an eye.

It’s important to understand the failure scenarios you’re protecting against too. If you’ve deployed vSphere Metro Storage Cluster, you’ll be able to run VMs even when your whole array has gone offline (assuming you’ve set it up properly). But this won’t necessarily prevent an outage if you lose your vSphere cluster, or the whole DC. Your data will still be protected, and you’ll be in good shape in terms of recovering quickly, but there will be an outage. This is where application-level resilience can help with availability. Remember that, even if you’ve got ultra-resilient workload protection across DCs, if your staff only have one connection into the environment, they may be left twiddling their thumbs in the event of a problem.

There’s a level of resiliency associated with this approach, and your infrastructure will certainly be able to survive the failure of a compute node, or even a bunch of disk and some compute (everything will reboot in another location). But you need to be careful not to let people think that this is something it’s not.

 

PDP, Yeah You Know Me

I mentioned problems with malware and data corruption earlier on. This is where periodic data protection solutions (such as those sold by Dell EMC, CommVault, Rubrik, Cohesity, Veeam, etc) can really get you out of a spot of bother. And if you don’t need to recover the whole VM when there’s a problem, these solutions can be a lot quicker at getting data back. The good news is that you can integrate a lot of these products with storage protection solutions and orchestration tools for a belt and braces solution to protection, and it’s not the shitshow of scripts and kludges that it was ten years ago. Hooray!

 

Final Thoughts

There’s a lot more to data protection than I’ve covered here. People like Preston have written books about the topic. And a lot of the decision making is potentially going to be out of your hands in terms of what your organisation can afford to spend (until they lose a lot of data or money, or both, at which point they’ll maybe change their focus). But if you do have the opportunity to work on some of these types of solutions, at least try to make sure that everyone understands exactly what they can achieve with the technologies at hand. There’s nothing worse than being hauled over the coals because some director thought they could do something amazing with infrastructure-level availability and resiliency only to have the whole thing fall over due to lack of budget. It can be a difficult conversation to have, particularly if your executives are the types of people who like to trust the folks with the fancy logos on their documents. All you can do in that case is try and be clear about what’s possible, and clear about what it will cost in time and money.

In the near future I’ll try to put together a post on various infrastructure failure scenarios and what works and what doesn’t. RPO 0 seems to be what everyone is asking for, but it may not necessarily be what everyone needs. Now please enjoy this Unfinished Business stock image.