Datrium Announces CloudShift

I recently had the opportunity to speak to Datrium’s Brian Biles and Craig Nunes about their CloudShift announcement and thought it was worth covering some of the highlights here.

 

DVX Now

Datrium have had a scalable protection tier and focus on performance since their inception.

[image courtesy of Datrium]

The “mobility tier”, in the form of Cloud DVX, has been around for a little while now. It’s simple to consume (via SaaS), yields decent deduplication results, and the Datrium team tells me it also delivers fast RTO. There’s also solid support for moving data between DCs with the DVX platform. This all sounds like the foundation for something happening in the hybrid space, right?

 

And Into The Future

Datrium pointed out that disaster recovery has traditionally been a good way of finding out where a lot of the problems exist in your data centre. There’s nothing like failing a failover to understand where the integration points in your on-premises infrastructure are lacking. Disaster recovery needs to be a seamless, integrated process, but data centres are still built on various silos of technology. People are still using clouds for a variety of reasons, and some clouds do some things better than others. It’s easy to pick and choose what you need to get things done. This has been one of the big advantages of public cloud and a large reason for its success. As a result, however, the silos are moving to the cloud, even as they remain fixed in the DC.

With this in mind, Datrium are looking to develop a solution that delivers on the following theme: “Run. Protect. Any Cloud”. The idea is simple: an orchestrated DR offering that makes failover and failback a painless undertaking. Datrium tell me they’ve been big supporters of VMware’s SRM product, but have observed that an orchestration-only layer comes with its own problems – adapters can have issues from time to time, and managing the solution can be complicated. With CloudShift, Datrium are taking a vertical stack approach, positioning CloudShift as a DR orchestrator delivered as a SaaS offering. Note that it only works with Datrium.

[image courtesy of Datrium]

The idea behind CloudShift is pretty neat. With Cloud DVX you can already back up VMs to AWS using S3 and EC2. CloudShift leverages the data already in AWS to fire up VMs on AWS (using on-demand instances of VMware Cloud on AWS) and provide temporary disaster recovery capability. The good thing about this is that converting your VMware VMs to someone else’s cloud is no longer a problem you need to resolve. You’ll need to have a relationship with AWS in the first place – it won’t be as simple as entering your credit card details and firing up an instance. But it certainly seems a lot simpler than having an existing infrastructure in place, and dealing with the conversion problems inherent in going from vSphere to KVM and other virtualisation platforms.

[image courtesy of Datrium]

Failover and failback are fairly straightforward processes as well, with the following steps required:

  1. Backup to Cloud DVX / S3 – This is ongoing and happens in the background;
  2. Failover required – the CloudShift runbook is initiated;
  3. Restart VM groups on VMC – VMs are rehydrated from data in S3; and
  4. Failback to on-premises – CloudShift reverses the process with deltas using change block tracking.

It’s being pitched as a very simple way to run DR, something that has been notorious for being a stressful activity in the past.

 

Thoughts and Further Reading

CloudShift is targeted for release in the first half of 2019. The economic power of DRaaS in the cloud is very strong. People love the idea that they can access the facility on-demand, rather than having passive infrastructure doing nothing on the off chance that it will be required. There’s obviously some additional cost when you need to use on demand versus reserved resources, but this is still potentially cheaper than standing up and maintaining your own secondary DC presence.

Datrium are focused on keeping inherently complex activities like DR simple. I’ll be curious to see whether they’re successful with this approach. The great thing about a generic orchestration framework like VMware SRM is that you can use a number of different vendors in the data centre and not have a huge problem with interoperability. The downside to this approach is that the broader ecosystem can leave you exposed to problems with individual components in the solution. Datrium are taking a punt that their customers are going to see the advantages of having an integrated approach to leveraging on-demand services. I’m constantly astonished that people don’t get more excited about DRaaS offerings. It’s really cool that you can get this level of protection without having to invest a tonne in running your own passive infrastructure. If you’d like to read more about CloudShift, there’s a blog post that sheds some more light on the solution on Datrium’s site, and you can grab a white paper here too.

Getting Started With The Pure Storage CLI

I used to write a lot about how to manage CLARiiON and VNX storage environments with EMC’s naviseccli tool. I’ve been doing some stuff with Pure Storage FlashArrays in our lab and thought it might be worth covering off some of the basics of their CLI. This will obviously be no replacement for the official administration guide, but I thought it might come in useful as a starting point.

 

Basics

Unlike EMC’s CLI, there’s no executable to install – it’s all on the controllers. If you’re using Windows, PuTTY is still a good choice as an ssh client. Otherwise the macOS ssh client does a reasonable job too. When you first set up your FlashArray, a virtual IP (VIP) was configured. It’s easiest to connect to the VIP, and Purity then directs your session to whichever controller is the current primary controller. Note that you can also connect via the physical IP address if that’s how you want to do things.

The first step is to log in to the array as pureuser, with the password that you’ve definitely changed from the default one.

login as: pureuser
pureuser@10.xxx.xxx.30's password:
Last login: Fri Aug 10 09:36:05 2018 from 10.xxx.xxx.xxx

Mon Aug 13 10:01:52 2018
Welcome pureuser. This is Purity Version 4.10.4 on FlashArray purearray
http://www.purestorage.com/

“purehelp” is the command to run to list available commands.

pureuser@purearray> purehelp
Available commands:
-------------------
pureadmin
purealert
pureapp
purearray
purecert
pureconfig
puredns
puredrive
pureds
purehelp
purehgroup
purehost
purehw
purelog
pureman
puremessage
purenetwork
purepgroup
pureplugin
pureport
puresmis
puresnmp
puresubnet
puresw
purevol
exit
logout

If you want to get some additional help with a command, you can run “command -h” (or --help).

pureuser@purearray> purevol -h
usage: purevol [-h]
               {add,connect,copy,create,destroy,disconnect,eradicate,list,listobj,monitor,recover,remove,rename,setattr,snap,truncate}
               ...

positional arguments:
  {add,connect,copy,create,destroy,disconnect,eradicate,list,listobj,monitor,recover,remove,rename,setattr,snap,truncate}
    add                 add volumes to protection groups
    connect             connect one or more volumes to a host
    copy                copy a volume or snapshot to one or more volumes
    create              create one or more volumes
    destroy             destroy one or more volumes or snapshots
    disconnect          disconnect one or more volumes from a host
    eradicate           eradicate one or more volumes or snapshots
    list                display information about volumes or snapshots
    listobj             list objects associated with one or more volumes
    monitor             display I/O performance information
    recover             recover one or more destroyed volumes or snapshots
    remove              remove volumes from protection groups
    rename              rename a volume or snapshot
    setattr             set volume attributes (increase size)
    snap                take snapshots of one or more volumes
    truncate            truncate one or more volumes (reduce size)

optional arguments:
  -h, --help            show this help message and exit

There’s also a facility to access the man page for commands. Just run “pureman command” to access it.
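For example, to bring up the manual page for the volume commands (substitute whichever command you’re interested in):

pureman purevol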

Want to see how much capacity there is on the array? Run “purearray list --space”.

pureuser@purearray> purearray list --space
Name        Capacity  Parity  Thin Provisioning  Data Reduction  Total Reduction  Volumes  Snapshots  Shared Space  System  Total
purearray  12.45T    100%    86%                2.4 to 1        17.3 to 1        350.66M  3.42G      3.01T         0.00    3.01T

Need to check the software version or general availability of the controllers? Run “purearray list --controller”.

pureuser@purearray> purearray list --controller
Name  Mode       Model   Version  Status
CT0   secondary  FA-450  4.10.4   ready
CT1   primary    FA-450  4.10.4   ready

 

Connecting A Host

To connect a host to an array (assuming you’ve already zoned it to the array), you’d use the following commands.

purehost create hostname
purehost create --wwnlist WWNs hostname
purehost list
purevol connect --host [host] [volume]
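To make that a little more concrete, here’s a rough example of what presenting a volume to a new ESXi host might look like. The host name, WWNs and volume name are made up, and it’s worth double-checking the flag syntax against the administration guide for your version of Purity before running anything in production.

purehost create --wwnlist 21:00:00:24:ff:aa:bb:01,21:00:00:24:ff:aa:bb:02 esx01
purevol create --size 2T esx01-datastore01
purevol connect --host esx01 esx01-datastore01
purehost list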

 

Host Groups

You might need to create a Host Group if you’re running ESXi and want to have multiple hosts accessing the same volumes. Here’re the commands you’ll need. Firstly, create the Host Group.

purehgroup create [hostgroup]

Add the hosts to the Host Group (these hosts should already exist on the array).

purehgroup setattr --hostlist host1,host2,host3 [hostgroup]

You can then assign volumes to the Host Group.

purehgroup connect --vol [volume] [hostgroup]
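Putting that together, provisioning a shared datastore volume to a small ESXi cluster might look something like the below. The host, volume and Host Group names are made up, and the hosts are assumed to already exist on the array.

purehgroup create esx-cluster01
purehgroup setattr --hostlist esx01,esx02,esx03 esx-cluster01
purevol create --size 4T esx-cluster01-ds01
purehgroup connect --vol esx-cluster01-ds01 esx-cluster01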

 

Other Volume Operations

Some other neat (and sometimes destructive) things you can do with volumes are listed below.

To resize a volume, use the following commands.

purevol setattr --size 500G [volume]
purevol truncate --size 20GB [volume]

Note that a snapshot is available for 24 hours to roll back if required. This is good if you’ve shrunk a volume to be smaller than the data on it and have consequently munted the filesystem.

When you destroy a volume it immediately becomes unavailable to hosts, but remains on the array for 24 hours. Note that you’ll need to remove the volume from any hosts connected to it first.

purevol disconnect [volume] --host [hostname]
purevol destroy [volume]
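As an aside, if you realise within that 24 hour window that you’ve destroyed the wrong thing, the help output above suggests you can bring it back with the recover operation – something along these lines (again, check the administration guide for your version first).

purevol recover [volume]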

If you’re running short of capacity, or are just curious about when a deleted volume will disappear, use the following command.

purevol list --pending

If you need the capacity back immediately, the deleted volume can be eradicated with the following command.

purevol eradicate [volume]

 

Further Reading

The Pure CLI is obviously not a new thing, and plenty of bright folks have already done a few articles about how you can use it as part of a provisioning workflow. This one from Chadd Kenney is a little old now but still demonstrates how you can bring it all together to do something pretty useful. You can obviously extend that to do some pretty interesting stuff, and there’s solid parity between the GUI and CLI in the Purity environment.

It seems like a small thing, but the fact that there’s no need to install an executable is a big thing in my book. Array vendors (and infrastructure vendors in general) insisting on installing some shell extension or command environment is a pain in the arse, and should be seen as an act of hostility akin to requiring Java to complete simple administration tasks. The sooner we get everyone working with either HTML5 or simple ssh access the better. In any case, I hope this was a useful introduction to the Purity CLI. Check out the Administration Guide for more information.

Nexsan Announces Assureon Cloud Transfer

Announcement

Nexsan announced Cloud Transfer for their Assureon product a little while ago. I recently had the chance to catch up with Gary Watson (Founder / CTO at Nexsan) and thought it would be worth covering the announcement here.

 

Assureon Refresher

Firstly, though, it might be helpful to look at what Assureon actually is. In short, it’s an on-premises storage archive that offers:

  • Long term archive storage for fixed content files;
  • Dependable file availability, with files being audited every 90 days;
  • Unparalleled file integrity; and
  • A “policy” system for protecting and stubbing files.

Notably, there is always a primary archive and a DR archive included in the price. No half-arsing it here – which is something that really appeals to me. Assureon also doesn’t have a “delete” key as such – files are only removed based on defined Retention Rules. This is great, assuming you set up your policies sensibly in the first place.

 

Assureon Cloud Transfer

Cloud Transfer provides the ability to move data between on-premises and cloud instances. The idea is that it will:

  • Provide reliable and efficient cloud mobility of archived data between cloud server instances and between cloud vendors; and
  • Optimise cloud storage and backup costs by offloading cold data to on-premises archive.

It’s being positioned as useful for clients who have a large unstructured data footprint on public cloud infrastructure and are looking to reduce their costs for storing data up there. There’s currently support for Amazon AWS and Microsoft Azure, with Google support coming in the near future.

[image courtesy of Nexsan]

There’s stub support for those applications that support it. There’s also an optional NFS / SMB interface that can be configured in the cloud as an Assureon archiving target that caches hot files and stubs cold files. This is useful for those non-Windows applications that have a lot of unstructured data that could be moved to an archive.

 

Thoughts and Further Reading

The concept of dedicated archiving hardware and software bundles, particularly ones that live on-premises, might seem a little odd to some folks who spend a lot of time failing fast in the cloud. There are plenty of enterprises, however, that would benefit from the level of rigour that Nexsan have wrapped around the Assureon product. It’s my strong opinion that too many people still don’t understand the difference between backup and recovery and archive data. The idea that you need to take archive data and make it immutable (and available) for a long time has great appeal, particularly for organisations getting slammed with a whole lot of compliance legislation. Vendors have been talking about reducing primary storage use for years, but there seems to have been some pushback from companies not wanting to invest in these solutions. It’s possible that this was also a result of some kludgy implementations that struggled to keep up with the demands of the users. I can’t speak for the performance of the Assureon product, but I like the fact that it’s sold as a pair, and with a lot of the decision-making around protection taken away from the end user. As someone who worked in an organisation that liked to cut corners on this type of thing, it’s nice to see that.

But why would you want to store stuff on-premises? Isn’t everyone moving everything to the cloud? No, they’re not. I don’t imagine that this type of product is being pitched at people running entirely in public cloud. It’s more likely that, if you’re looking at this type of solution, you’re probably running a hybrid setup, and still have a footprint in a colocation facility somewhere. The benefit of this is that you can retain control over where your archived data is placed. Some would say that’s a bit of a pain, and an unnecessary expense, but people familiar with compliance will understand that business is all about a whole lot of wasted expense in order to make people feel good. But I digress. Like most on-premises solutions, the Assureon offering compares well with a public cloud solution on a $/GB basis, assuming you’ve got a lot of sunk costs in place already with your data centre presence.

The immutability story is also a pretty good one when you start to think about organisations that have been hit by ransomware in the last few years. That stuff might roll through your organisation like a hot knife through butter, but it won’t be able to do anything with your archive data – that stuff isn’t going anywhere. Combine that with one of those fancy next generation data protection solutions and you’re in reasonable shape.

In any case, I like what the Assureon product offers, and am looking forward to seeing Nexsan move beyond the Windows-only platform support that it currently has. You can read the Nexsan Assureon Cloud Transfer press release here. David Marshall covered the announcement over at VMblog and ComputerWeekly.com did an article as well.

NetApp Announces NetApp ONTAP AI

As a member of NetApp United, I had the opportunity to sit in on a briefing from Mike McNamara about NetApp’s recently announced AI offering, the snappily named “NetApp ONTAP AI”. I thought I’d provide a brief overview here and share some thoughts.

 

The Announcement

So what is NetApp ONTAP AI? It’s a “proven” architecture delivered via NetApp’s channel partners. It’s comprised of compute, storage and networking. Storage is delivered over NFS. The idea is that you can start small and scale out as required.

Hardware

Software

  • NVIDIA GPU Cloud Deep Learning Stack
  • NetApp ONTAP 9
  • Trident, dynamic storage provisioner

Support

  • Single point of contact support
  • Proven support model

 

[image courtesy of NetApp]

 

Thoughts and Further Reading

I’ve written about NetApp’s Edge to Core to Cloud story before, and this offering certainly builds on the work they’ve done with big data and machine learning solutions. Artificial Intelligence (AI) and Machine Learning (ML) solutions are like big data from five years ago, or public cloud. You can’t go to any industry event, or take a briefing from an infrastructure vendor, without hearing all about how they’re delivering solutions focused on AI. What you do with the gear once you’ve bought one of these spectacularly ugly boxes is up to you, obviously, and I don’t want to get in to whether some of these solutions are really “AI” or not (hint: they’re usually not). While the vendors are gushing breathlessly about how AI will conquer the world, if you tone down the hyperbole a bit, there’re still some fascinating problems being solved with these kinds of solutions.

I don’t think that every business, right now, will benefit from an AI strategy. As much as the vendors would like to have you buy one of everything, these kinds of solutions are very good at doing particular tasks, most of which are probably not in your core remit. That’s not to say that you won’t benefit in the very near future from some of the research and development being done in this area. And it’s for this reason that I think architectures like this one, and those from NetApp’s competitors, are contributing something significant to the ongoing advancement of these fields.

I also like that this is delivered via channel partners. It indicates, at least at first glance, that AI-focused solutions aren’t simply something you can slap a SKU on and sell hundreds of. Partners generally have a better breadth of experience across the various hardware, software and services elements and their respective constraints, and will often be in a better position to spend time understanding the problem at hand rather than treating everything as the same problem with one solution. There’s also less chance that the partner’s sales people will have performance accelerators tied to selling one particular line of products. This can be useful when trying to solve problems that are spread across multiple disciplines and business units.

The folks at NVIDIA have made a lot of noise in the AI / ML marketplace lately, and with good reason. They know how to put together blazingly fast systems. I’ll be interested to see how this architecture goes in the marketplace, and whether customers are primarily from the NetApp side of the fence, from the NVIDIA side, or perhaps both. You can grab a copy of the solution brief here, and there’s an AI white paper you can download from here. The real meat and potatoes though, is the reference architecture document itself, which you can find here.

Rubrik Basics – SLA Domains

I’ve been doing some work with Rubrik in our lab and thought it worth covering some of the basic features that I think are pretty neat. In this edition of Rubrik Basics, I thought I’d quickly cover off Service Level Agreements (SLA) Domains – one of the key tenets of the Rubrik architecture.

 

The Defaults

Rubrik CDM has three default local SLA Domains. Of course, they’re named after precious metals. There’s something about Gold that people seem to understand better than calling things Tier 0, 1 and 2. The defaults are Gold, Silver, and Bronze. The problem, of course, is that people start to ask for Platinum because they’re very important. The good news is you can create SLA Domains and call them whatever you want. I created one called Adamantium. Snick snick.

Note that these policies have the archival policy and the replication policy disabled, don’t have a Snapshot Window configured, and do not set a Take First Full Snapshot time. I recommend you leave the defaults as they are and create some new SLA Domains that align with what you want to deliver in your enterprise.

 

Service Level Agreement

There are two components to the SLA Domain. The first is the Service Level Agreement, which defines a number of things, including the frequency of snapshot creation and their retention. Note that you can’t go below an hour for your snapshot frequency (unless I’ve done something wrong here). You can go berserk with retention though. Keep those “kitchen duty roster.xls” files for 25 years if you like. Modern office life can be gruelling at times.

A nice feature is the ability to configure a Snapshot Window. The idea is that you can enforce time periods where you don’t perform operations on the systems being protected by the SLA Domain. This is handy if you’ve got systems that run batch processing or just need a little time to themselves every day to reflect on their place in the world. Every system needs a little time every now and then.

If you have a number of settings in the SLA, the Rubrik cluster creates snapshots to satisfy the smallest frequency that is specified. If the Hourly rule has the smallest frequency, it works to that. If the Daily rule has the smallest frequency, it works to that, and so on. Snapshot expiration is determined by the rules you put in place combined with their frequency.
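To make that a little more concrete (the numbers here are made up and not a recommendation): an SLA Domain with hourly snapshots kept for 3 days, daily snapshots kept for 30 days, and monthly snapshots kept for 12 months will see the cluster working to the hourly rule, as that’s the smallest frequency specified. As I understand it, the daily and monthly rules then determine which of those snapshots hang around beyond the 3 days, rather than kicking off separate snapshot jobs, and expiry falls out of the combination of rules and retention settings.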

 

Remote Settings

The second page of the Create SLA Domain window is where you can configure the remote settings. I wrote an article on setting up Archival Locations previously – this is where you can take advantage of that. One of the cool things about Rubrik’s retention policy is that you can choose to send a bunch of stuff to an off-site location and keep, say, 30 days of data on Brik. The idea is that you don’t then have to invest in a tonne of Briks, so to speak, to satisfy your organisation’s data protection retention policy.

 

Thoughts

If you’ve had the opportunity to test-drive Rubrik’s offering, you’ll know that everything about it is pretty simple. From deployment to ongoing operation, there aren’t a whole lot of nerd knobs to play with. It nonetheless does the job of protecting the workloads you point it at. A lot of the complexity normally associated with data protection is masked by a fairly simple model that will hopefully make data protection a little more appealing for the average Joe or Josie responsible for infrastructure operations.

Rubrik, and a number of other solution vendors, are talking a lot about service levels and policy-driven data protection. The idea is that you can protect your data based on a service catalogue type offering rather than the old style of periodic protection that was offered with little flexibility (“We backup daily, we keep it 90 days, and sometimes we keep the monthly tape for longer”). This strikes me as an intuitive way to deliver data protection capabilities, provided that your business knows what they want (or need) from the solution. That’s always the key to success – understanding what the business actually needs to stay in business. You can do a lot with modern data protection offerings. Call it SLA-based, talk about service level objectives, make t-shirts with “policy-driven” on them and hand them out to your executives. But unless you understand what’s important for your business to stay in business when there’s a problem, then it won’t really matter which solution you’ve chosen.

Chris Wahl wrote some neat posts (a little while ago) on SLAs and their challenges on the Rubrik blog that you can read here and here.

Dell EMC Announces IDPA DP4400

Dell EMC announced the Integrated Data Protection Appliance (IDPA) at Dell EMC World in May 2017. They recently announced a new addition to the lineup, the IDPA DP4400. I had the opportunity to speak with Steve Reichwein about it and thought I’d share some of my thoughts here.

 

The Announcement

Overview

One of the key differences between this offering and previous IDPA products is the form factor. The DP4400 is a 2RU appliance (based on a PowerEdge server) with the following features:

  • Capacity starts at 24TB, growing in increments of 12TB, up to 96TB useable. The capacity increase is done via licensing, so there’s no additional hardware required (who doesn’t love the golden screwdriver?)
  • Search and reporting is built in to the appliance
  • There are Cloud Tier (ECS, AWS, Azure, Virtustream, etc) and Cloud DR options (S3 at this stage, but that will change in the future)
  • There’s the IDPA System Manager (Data Protection Central), along with Data Domain DD/VE (3.1) and Avamar (7.5.1)

[image courtesy of Dell EMC]

It’s hosted on vSphere 6.5, and the whole stack is referred to as IDPA 2.2. Note that you can’t upgrade the components individually.

 

Hardware Details

Storage Configuration

  • 18x 12TB 3.5″ SAS Drives (12 front, 2 rear, 4 mid-plane)
    • 12TB RAID1 (1+1) – VM Storage
    • 72TB RAID6 (6+2) – DDVE File System Spindle-group 1
    • 72TB RAID6 (6+2) – DDVE File System Spindle-group 2
  • 240GB BOSS Card
    • 240GB RAID1 (1+1 M.2) – ESXi 6.5 Boot Drive
  • 1.6TB NVMe Card
    • 960GB SSD – DDVE cache-tier

System Performance

  • 2x Intel Silver 4114 10-core 2.2GHz
  • Up to 40 vCPU system capacity
  • Memory of 256GB (8x 32GB RDIMMs, 2667MT/s)

Networking-wise, the appliance has 8x 10GbE ports using either SFP+ or Twinax. There’s a management port for initial configuration, along with an iDRAC port that’s disabled by default, but can be configured if required. If you’re using Avamar NDMP accelerator nodes in your environment, you can integrate an existing node with the DP4400. Note that it supports one accelerator node per appliance.

 

Put On Your Pointy Hat

One of the nice things about the appliance (particularly if you’ve ever had to build a data protection environment based on Data Domain and Avamar) is that you can set up everything you need to get started via a simple-to-use installation wizard.

[image courtesy of Dell EMC]

 

Thoughts and Further Reading

I talked to Steve about what he thought the key differentiators were for the DP4400. He talked about:

  • Ecosystem breadth;
  • Network bandwidth; and
  • Guaranteed dedupe ratio (55:1 vs 5:1?)

He also mentioned the capability of a product like Data Protection Central to manage an extremely large ROBO environment. He said these were some of the opportunities where he felt Dell EMC had an edge over the competition.

I can certainly attest to the breadth of ecosystem support being a big advantage for Dell EMC over some of its competitors. Avamar and DD/VE have also demonstrated some pretty decent chops when it comes to bandwidth-constrained environments in need of data protection. I think it’s great that Dell EMC are delivering these kinds of solutions to market. For every shop willing to go with relative newcomers like Cohesity or Rubrik, there are plenty who still want to buy data protection from Dell EMC, IBM or Commvault. Dell EMC are being fairly upfront about what they think this type of appliance will support in terms of workload, and they’ve clearly been keeping an eye on the competition with regards to usability and integration. People who’ve used Avamar in real life have been generally happy with the performance and feature set, and this is going to be a big selling point for people who aren’t fans of NetWorker.

I’m not going to tell you that one vendor is offering a better solution than the others. You shouldn’t be making strategic decisions based on technical specs and marketing brochures in any case. Some environments are going to like this solution because it fits well with their broader strategy of buying from Dell EMC. Some people will like it because it might be a change from their current approach of building their own solutions. And some people might like to buy it because they think Dell EMC’s post-sales support is great. These are all good reasons to look into the DP4400.

Preston did a write-up on the DP4400 that you can read here. The IDPA DP4400 landing page can be found here. There’s also a Wikibon CrowdChat on next generation data protection being held on August 15th (2am on the 16th in Australian time) that will be worth checking out.

Disaster Recovery vs Disaster Avoidance vs Data Protection

This is another one of those rambling posts that I like to write when I’m sitting in an airport lounge somewhere and I’ve got a bit of time to kill. The versus in the title is a bit misleading too, because DR and DA are both forms of data protection. And periodic data protection (PDP) is important too. But what I wanted to write about was some of the differences between DR and DA, in particular.

TL;DR – DR is not DA, and neither of them is PDP. But you need to think about all of them at some point.

 

Terminology

I want to be clear about what I mean when I say these terms, because it seems like they can mean a lot of things to different folks.

  • Recovery Point Objective – The Recovery Point Objective (RPO) is the maximum amount of time in which data may have been permanently lost during an incident. You want this to be in minutes and hours, not days or weeks (ideally). RPO 0 is the idea that no data is lost when there’s a failure. A lot of vendors will talk about “Near Zero” RPOs.
  • Recovery Time Objective – The Recovery Time Objective (RTO) is the amount of time the business can be without the service, without incurring significant risks or significant losses. This is, ostensibly, how long it takes you to get back up and running after an event. You don’t really want this to be in days and weeks either.
  • Disaster Recovery – Disaster Recovery is the ability to recover applications after a major event (think flood, fire, DC is now a hole in the ground). This normally involves a failover of workloads from one DC to another in an orchestrated fashion.
  • Disaster Avoidance – Disaster avoidance “is an anticipatory strategy that is in place in order to prevent any such instance of data breach or losses. It is a defensive, proactive approach to keeping data safe” (I’m quoting this from a great blog post on the topic here)
  • Periodic Data Protection – This is the kind of data protection activity we normally associate with “backups”. It is usually a daily activity (or perhaps as frequent as hourly) and the data is normally used for ad-hoc data file recovery requests. Some people use their backup data as an archive. They’re bad people and shouldn’t be trusted. PDP is normally separate to DA or DR solutions.

 

DR Isn’t The Full Answer

I’ve had some great conversations with customers recently about adding resilience to their on-premises infrastructure. It seems like an old-fashioned concept, but a number of organisations are only now seeing the benefits of adding infrastructure-level resilience to their platforms. The first conversation usually goes something like this:

Me: So what’s your key application, and what’s your resiliency requirement?

Customer: Oh, it’s definitely Application X (usually built on Oracle or using SAP or similar). It absolutely can’t go down. Ever. We need to have RPO 0 and RTO 0 for this one. Our whole business depends on it.

Me: Okay, it sounds like it’s pretty important. So what about your file server and email?

Customer: Oh, that’s not so important. We can recover those from overnight backups.

Me: But aren’t they used to store data for Application X? Don’t you have workflows that rely on email?

Customer: Oh, yeah, I guess so. But it will be too expensive to protect all of this. Can we change the RPO a bit? I don’t think the CFO will support us doing RPO 0 everywhere.

These requirements tend to change whenever we move from technical discussions to commercial discussions. In an ideal world, Martha in Accounting will have her home directory protected in a highly available fashion such that it can withstand the failure of one or more storage arrays (or data centres). The problem with this is that, if there are 1000 Marthas in the organisation, the cost of protecting that kind of data at scale becomes prohibitive, relative to the perceived value of the data. This is one of the ways I’ve seen “DR” capability added to an environment in the past. Take some older servers and put them in a site removed from the primary site, setup some scripts to copy critical data to that site, and hope nothing ever goes too wrong with the primary site.

There are obviously better ways of doing this, and common solutions may or may not involve block-level storage replication, orchestrated failover tools, and like for like compute at the secondary site (or perhaps you’ve decided to shut down test and development while you’re fixing the problem at the production site).

But what are you trying to protect against? The failure of some compute? Some storage? The network layer? A key application? All of these answers will determine the path you’ll need to go down. Keep in mind also that DR isn’t the only answer. You also need to have business continuity processes in place. A failover of workloads to a secondary site is pointless if operations staff don’t have access to a building to continue doing their work, or if people can’t work when the swipe card access machine is off-line, or if your Internet feed only terminates in one DC, etc.

 

I’m Avoiding The Problem

Disaster Avoidance is what I like to call the really sexy resilience solution. You can have things go terribly wrong with your production workload and potentially still have it functioning like there was no problem. This is where hardware solutions like Pure Storage ActiveCluster or Dell EMC VPLEX can really shine, assuming you’ve partnered them with applications that have the smarts built in to leverage what they have to offer. Because that’s the real key to a successful disaster avoidance design. It’s great to have synchronous replication and cache-consistency across DCs, but if your applications don’t know what to do when a leg goes missing, they’ll fall over. And if you don’t have other protection mechanisms in place, such as periodic data protection, then your synchronous block replication solution will merrily synchronise malware or corrupted data from one site to another in the blink of an eye.

It’s important to understand the failure scenarios you’re protecting against too. If you’ve deployed vSphere Metro Storage Cluster, you’ll be able to run VMs even when your whole array has gone off-line (assuming you’ve set it up properly). But this won’t necessarily prevent an outage if you lose your vSphere cluster, or the whole DC. Your data will still be protected, and you’ll be in good shape in terms of recovering quickly, but there will be an outage. This is where application-level resilience can help with availability. Remember that, even if you’ve got ultra-resilient workload protection across DCs, if your staff only have one connection into the environment, they may be left twiddling their thumbs in the event of a problem.

There’s a level of resiliency associated with this approach, and your infrastructure will certainly be able to survive the failure of a compute node, or even a bunch of disk and some compute (everything will reboot in another location). But you need to be careful not to let people think that this is something it’s not.

 

PDP, Yeah You Know Me

I mentioned problems with malware and data corruption earlier on. This is where periodic data protection solutions (such as those sold by Dell EMC, CommVault, Rubrik, Cohesity, Veeam, etc) can really get you out of a spot of bother. And if you don’t need to recover the whole VM when there’s a problem, these solutions can be a lot quicker at getting data back. The good news is that you can integrate a lot of these products with storage protection solutions and orchestration tools for a belt and braces solution to protection, and it’s not the shitshow of scripts and kludges that it was ten years ago. Hooray!

 

Final Thoughts

There’s a lot more to data protection than I’ve covered here. People like Preston have written books about the topic. And a lot of the decision making is potentially going to be out of your hands in terms of what your organisation can afford to spend (until they lose a lot of data, money (or both), then they’ll maybe change their focus). But if you do have the opportunity to work on some of these types of solutions, at least try to make sure that everyone understands exactly what they can achieve with the technologies at hand. There’s nothing worse than being hauled over the coals because some director thought they could do something amazing with infrastructure-level availability and resiliency only to have the whole thing fall over due to lack of budget. It can be a difficult conversation to have, particularly if your executives are the types of people who like to trust the folks with the fancy logos on their documents. All you can do in that case is try and be clear about what’s possible, and clear about what it will cost in time and money.

In the near future I’ll try to put together a post on various infrastructure failure scenarios and what works and what doesn’t. RPO 0 seems to be what everyone is asking for, but it may not necessarily be what everyone needs. Now please enjoy this Unfinished Business stock image.

Rubrik Basics – Cluster Upgrade Process

I’ve been doing some work with Rubrik in our lab and thought it worth covering some of the basic features that I think are pretty neat. In this edition of Rubrik Basics, I thought I’d quickly cover off software upgrades. There are two ways to upgrade the Rubrik software on your Brik – via USB and SFTP. Either way, you’ll need access to the Downloads section of the support site. If you’re a customer, you’ll have this already. If this all sounds too hard, you can raise a ticket with the support team and they’ll tunnel in and do the upgrade for you (assuming you’ve allowed remote tunnel capability).

 

USB

The good thing about using a USB drive is that you can still keep appliances in “dark” sites up to date. Before you begin you’ll need to do two things:

  • Download the compressed upgrade archive and the matching signature file from the customer portal.
  • Format a removable drive with the FAT32 file system.

You’ll need to copy the upgrade file and matching signature file to the removable drive. Plug that into any node in the cluster. Log in to that node as the admin user. Mount the USB drive by typing the following command:

mount --usb_device

Type the following command to begin the upgrade:

upgrade start

The upgrade system scans the file system for upgrade archives. If multiple archives are available, it displays a list of choices. Once you’ve finished, you can unmount the device.

umount --usb_device

 

SFTP

You can also run the upgrade via SFTP. I found the instructions on how to do that here. It’s not too dissimilar to the USB method. You’ll want to use your favourite SFTP client to upload the files to the /upgrade directory. Once you’ve done that, ssh on to the node and you can run a pre-flight check. If everything comes up Milhouse you’ll be good to go for the next step.
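If you’re wondering what the upload part looks like, here’s a rough sketch using the standard OpenSSH sftp client. The IP address and file name are borrowed from the example below, and depending on your version you may also need to upload the matching signature file – check the support site instructions for your release.

sftp admin@10.xxx.yyy.131
sftp> put rubrik-4.1.2-2366.tar.gz /upgrade/
sftp> exit

With the file in place, the pre-flight check looks something like this.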

Using username "admin".

admin@10.xxx.yyy.131's password:

=======================

Welcome to Rubrik CLI

=======================

Type 'help' or '?' to list commands

RVM165Sxxxx55 >> upgrade start --mode prechecks_only
Do you want to use --share rubrik-4.1.2-2366.tar.gz [y/N] [N]: y
Upgrade status: Started pre-checks successfully
RVM165Sxxxx55 >> upgrade status
Current upgrade mode: prechecks_only
Current upgrade pre-checks node: RVM165Sxxxx55
Current upgrade pre-checks tarball name: --share rubrik-4.1.2-2366.tar.gz
Current upgrade pre-checks status: In progress
Current run started at: 2018-07-19 00:48:04.437000 UTC+0000

Current state (3/6): VERIFYING
Current task: Verify authenticity of new software
Current state progress: 0.0%

Finished states (2/6): ACQUIRING, COPYING
Pending states (3/6): UNTARING, DEPLOYING, PRECHECKING

Time taken so far: 18.38 seconds
Overall upgrade progress: 6.0%

To check on progress, run “upgrade status” to, erm, check on the status of the upgrade.

RVM165Sxxxx55 >> upgrade status
Last upgrade mode: prechecks_only
Last upgrade pre-checks node: RVM165Sxxxx55
Last upgrade pre-checks tarball name: --share rubrik-4.1.2-2366.tar.gz
Last upgrade pre-checks status: Completed successfully
Last run ended at: 2018-07-19 00:51:03.129000 UTC+0000
Current state: IDLE

Now you’re ready to do it for real. Run “upgrade start” to start.

RVM165Sxxxx55 >> upgrade start
Do you want to use --share rubrik-4.1.2-2366.tar.gz [y/N] [N]: y
Upgrade status: Started upgrade successfully
RVM165Sxxxx55 >> upgrade status
Current upgrade mode: normal
Current upgrade node: RVM165Sxxxx55
Current upgrade tarball name: --share rubrik-4.1.2-2366.tar.gz
Current upgrade status: In progress
Current run started at: 2018-07-19 00:52:56.882000 UTC+0000

Current state (4/9): UNTARING
Current task: Extract new software
Current state progress: 0.0%

Finished states (3/9): ACQUIRING, COPYING, VERIFYING
Pending states (5/9): DEPLOYING, PRECHECKING, PREPARING, UPGRADING, RESTARTING

Time taken so far: 22.52 seconds
Overall upgrade progress: 3.5%

It’s a pretty quick process, and eventually you’ll see this message.

RVM165Sxxxx55 >> upgrade status
Last upgrade mode: normal
Last upgrade node: RVM165Sxxxx55
Last upgrade tarball name: --share rubrik-4.1.2-2366.tar.gz
Last upgrade status: Completed successfully
Last run ended at: 2018-07-19 01:19:09.719000 UTC+0000

Current state: IDLE
RVM165Sxxxx55 >>

And you’re all done. Note that you only have to upload the data and run the process on one node in the cluster.

Random Short Take #6

Welcome to the sixth edition of the Random Short Take. Here are a few links to a few things that I think might be useful, to someone.

Rubrik CDM 4.1.1. – A Few Notes

Here are a few random notes on things in Rubrik’s Cloud Data Management (CDM) 4.1.1-p4-2319 that I’ve come across in my recent testing in the lab. There’s not enough in each item to warrant a full post, hence the “few notes” format. Note that some of these things have been around for a while; I just wanted to note the specific version of Rubrik CDM I’m working with.

 

Guest OS Credentials

Rubrik uses Guest OS credentials for access to a VM’s operating system. When you add VM workload to your Rubrik environment, you may see the following message in the logs.

Note that it’s a warning, not an error. You can still back up the VM, just not to the level you might have hoped for. If you want to do a direct restore on a Linux guest, you’ll need an account with write access. For Windows, you’ll need something with administrative access. You could achieve this with either local or domain administrator accounts. This isn’t recommended though, and Rubrik suggests “a credential for a domain level account that has a small privilege set that includes administrator access to the relevant guests”. You could use a number of credentials across multiple groups of machines to reduce (to a small extent) the level of exposure, but there are plenty of CISOs and Windows administrators who are not going to like this approach.

So what happens if you don’t provide the credentials? My understanding is that you can still do file system consistent snapshots (provided you have a current version of VMware Tools installed), you just won’t be able to do application-consistent backups. For your reference, here’s the table from Rubrik discussing the various levels of available consistency.

Consistency level: Inconsistent

Description: A backup that consists of copying each file to the backup target without quiescence. File operations are not stopped. The result is inconsistent time stamps across the backup and, potentially, corrupted files.

Rubrik usage: Not provided.

Consistency level: Crash consistent

Description: A point-in-time snapshot but without quiescence.

  • Time stamps are consistent
  • Pending updates for open files are not saved
  • In-flight I/O operations are not completed

The snapshot can be used to restore the virtual machine to the same state that a hard reset would produce.

Rubrik usage: Provided only when:

  • The Guest OS does not have VMware Tools
  • The Guest OS has an out-of-date version of VMware Tools
  • The VM’s Application Consistency was manually set to Crash Consistent in the Rubrik UI

Consistency level: File system consistent

Description: A point-in-time snapshot with quiescence.

  • Time stamps are consistent
  • Pending updates for open files are saved
  • In-flight I/O operations are completed
  • Application-specific operations may not be completed

Rubrik usage: Provided when the guest OS has an up-to-date version of VMware Tools and application consistency is not supported for the guest OS.

Consistency level: Application consistent

Description: A point-in-time snapshot with quiescence and application-awareness.

  • Time stamps are consistent
  • Pending updates for open files are saved
  • In-flight I/O operations are completed
  • Application-specific operations are completed

Rubrik usage: Provided when the guest OS has an up-to-date version of VMware Tools and application consistency is supported for the guest OS.

 

open-vm-tools

If you’re running something like Debian in your vSphere environment you may have chosen to use open-vm-tools rather than VMware’s package. There’s nothing wrong with this (it’s a VMware-supported configuration), but you’ll see that Rubrik currently has a bit of an issue with it.

It will still back up the VM, just not at the consistency level you may be hoping for. It’s on Rubrik’s list of things to fix. And VMware Tools is still a valid (and arguably preferred) option for supported Linux distributions. The point of open-vm-tools is that appliance vendors can distribute the tools with their VMs without violating licensing agreements.
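If you want to double-check what a guest is actually running, something like the following (run from inside a Debian-flavoured guest, and assuming the standard packages are in use) will tell you whether open-vm-tools is installed and which tools version is in play.

dpkg -l | grep open-vm-tools
vmware-toolbox-cmd -v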

 

Download Logs

It seems like a simple thing, but I really like the ability to download logs related to a particular error. In this example, I’ve got some issues with a SQL cluster I’m backing up. I can click on “Download Logs” and grab the info I need related to the SLA Activity. It’s a small thing, but it makes wading through logs to identify issues a little less painful.