Cohesity – Cohesity Cluster Virtual Edition ESXi – A Few Notes

I’ve covered the Cohesity appliance deployment in a howto article previously. I’ve also made use of the VMware-compatible Virtual Edition in our lab to test things like cluster-to-cluster replication and cloud tiering. The benefits of virtual appliances are numerous. They’re generally easy to deploy, don’t need dedicated hardware, can be re-deployed quickly when you break something, and can be a quick and easy way to validate a particular process or idea. They can also be a problem with regard to performance, and are at the mercy of the platform administrator to a point. But aren’t we all? With 6.1, Cohesity have made available a clustered virtual edition (the snappily titled Cohesity Cluster Virtual Edition ESXi). If you have access to the documentation section of the Cohesity support site, there’s a PDF you can download that explains everything. I won’t go into too much detail, but there are a few things to consider before you get started.

 

Specifications

Base Appliance 

Just like the non-clustered virtual edition, there are small and large configurations to choose from. The small configuration supports up to 8TB for the Data disk, while the large configuration supports up to 16TB. The small configuration supports 4 vCPUs and 16GB of memory, while the large configuration supports 8 vCPUs and 32GB of memory.

Disk Configuration

Once you’ve deployed the appliance, you’ll need to add the Metadata disk and Data disk to each VM. The Metadata disk should be between 512GB and 1TB. For the large configuration, you can also apparently configure 2x 512GB disks, but I haven’t tried this. The Data disk needs to be between 512GB and 8TB for the small configuration and up to 16TB for the large configuration (with support for 2x 8TB disks). Cohesity recommends that these are formatted as Thick Provision Lazy Zeroed and deployed in Independent – Persistent mode. Each disk should be attached to its own SCSI controller as well, so you’ll have the system disk on SCSI 0:0, the Metadata disk on SCSI 1:0, and so on.
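If you’re adding these disks to three or more nodes, the vSphere Web Client gets tedious quickly. Scripting it is an option; the following is a minimal pyVmomi sketch, assuming you’ve already connected to vCenter and retrieved the VM object, and the sizes and bus numbers are only examples – adjust them to suit your configuration.

from pyVmomi import vim

def add_disk(vm, size_gb, bus_number):
    # Add a Thick Provision Lazy Zeroed, Independent - Persistent disk
    # on its own paravirtual SCSI controller (one controller per disk,
    # per the recommendation above).
    ctrl_spec = vim.vm.device.VirtualDeviceSpec()
    ctrl_spec.operation = vim.vm.device.VirtualDeviceSpec.Operation.add
    ctrl = vim.vm.device.ParaVirtualSCSIController()
    ctrl.key = -(100 + bus_number)  # temporary negative key for this spec
    ctrl.busNumber = bus_number
    ctrl.sharedBus = vim.vm.device.VirtualSCSIController.Sharing.noSharing
    ctrl_spec.device = ctrl

    disk_spec = vim.vm.device.VirtualDeviceSpec()
    disk_spec.operation = vim.vm.device.VirtualDeviceSpec.Operation.add
    disk_spec.fileOperation = vim.vm.device.VirtualDeviceSpec.FileOperation.create
    disk = vim.vm.device.VirtualDisk()
    backing = vim.vm.device.VirtualDisk.FlatVer2BackingInfo()
    backing.diskMode = "independent_persistent"
    backing.thinProvisioned = False  # thick ...
    backing.eagerlyScrub = False     # ... lazy zeroed
    disk.backing = backing
    disk.capacityInKB = size_gb * 1024 * 1024
    disk.controllerKey = ctrl.key
    disk.unitNumber = 0
    disk_spec.device = disk

    spec = vim.vm.ConfigSpec(deviceChange=[ctrl_spec, disk_spec])
    return vm.ReconfigVM_Task(spec=spec)

# Metadata disk on SCSI 1:0 and Data disk on SCSI 2:0; vm is a
# vim.VirtualMachine object you've already looked up.
add_disk(vm, 512, 1)   # Metadata
add_disk(vm, 3072, 2)  # Data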

I did discover a weird issue when deploying the appliance on a Pure Storage FA-450 array in the lab. vCenter identifies this particular array’s datastore type as “Flash”. For my testing I had a 512GB Metadata disk and a 3TB Data disk configured on the same datastore, with the three nodes living on three different datastores on the FlashArray. This caused errors during cluster configuration, with the wizard complaining that my SSD volumes were too big.

I moved the Data disk (with Storage vMotion) to an all-flash Nimble array (that for some reason was identified by vSphere as “HDD”) and the problem disappeared. Interestingly, I didn’t have this problem with the single-node configuration of 6.0.1 deployed with the same configuration. I raised a ticket with Cohesity support and they got back to me stating that this was expected behaviour in 6.1.0a. They tell me, however, that they’ve modified the behaviour of the configuration routine in an upcoming version so fools like me can run virtualised secondary storage on primary storage.

Erasure Coding

You can configure the appliance for increased resiliency at the Storage Domain level as well. If you go to Platform – Cluster – Storage Domains you can modify the DefaultStorageDomain (and other ones that you may have created). Depending on the size of the cluster you’ve deployed, you can choose the number of failures to tolerate and whether or not you want erasure coding enabled.

You can also decide whether you want EC to be a post-process activity or something that happens inline.
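If you want to get a feel for the capacity trade-off, the arithmetic is straightforward. Here’s a quick back-of-the-envelope sketch using generic erasure coding maths – these are illustrative numbers, not Cohesity-specific defaults.

def usable_fraction_replication(copies):
    # Keeping N full copies leaves 1/N of raw capacity usable.
    # Two copies: 50% usable, tolerating one failure.
    return 1.0 / copies

def usable_fraction_ec(data_stripes, code_stripes):
    # An EC scheme with D data stripes and C code stripes tolerates
    # C failures while keeping D / (D + C) of raw capacity usable.
    return data_stripes / (data_stripes + code_stripes)

print(usable_fraction_replication(2))  # 0.5
print(usable_fraction_ec(4, 2))        # ~0.67, and it tolerates two failures

The point being that erasure coding buys you better usable capacity for a given level of failure tolerance, at the cost of some compute overhead – hence the choice between inline and post-process.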

 

Process

Once you’ve deployed a minimum of three copies of the Clustered VE, you’ll need to manually add Metadata and Data disks to each VM. The specifications for these are listed above. Fire up the VMs and go to the IP address of one of the nodes. You’ll need to log in as the admin user with the appropriate password, and you can then start the cluster configuration.

This bit is pretty much the same as any Cohesity cluster deployment, and you’ll need to specify things like a hostname for the cluster partition. As always, it’s a good idea to ensure your DNS records are up to date. You can get away with using IP addresses but, frankly, people will talk about you behind your back if you do.

At this point you can also decide to enable encryption at the cluster level. If you decide not to enable it, you can do so on a per-Domain basis later.

Click on Create Cluster and you should see something like the following screen.

Once the cluster is created, you can hit the virtual IP you’ve configured, or any one of the attached nodes, to log in to the cluster. Once you log in, you’ll need to agree to the EULA and enter a license key.

 

Thoughts

The availability of virtual appliance versions for storage and data protection solutions isn’t a new idea, but it’s certainly one I’m a big fan of. These things give me an opportunity to test new code releases in a controlled environment before pushing updates into my production environment. It can help with validating different replication topologies quickly, and validating other configuration ideas before putting them into the wild (or in front of customers). Of course, the performance may not be up to scratch for some larger environments, but for smaller deployments and edge or remote office solutions, you’re only limited by the available host resources (which can be substantial in a lot of cases). The addition of a clustered version of the virtual edition for ESXi and Hyper-V is a welcome sight for those of us still deploying on-premises Cohesity solutions (I think the Azure version has been clustered for a few revisions now). It gets around the main issue of resiliency by having multiple copies running, and can also address some of the performance concerns associated with running virtual versions of the appliance. There are a number of reasons why it may not be the right solution for you, and you should work with your Cohesity team to size any solution to fit your environment. But if you’re running Cohesity in your environment already, talk to your account team about how you can leverage the virtual edition. It really is pretty neat. I’ll be looking into the resiliency of the solution in the near future and will hopefully be able to post my findings in the next few weeks.

Rubrik Announces Cloud Data Management 5.0 – Drops In A Shedload Of Enhancements

I recently had the opportunity to hear from Chris Wahl about Rubrik CDM 5.0 (codename Andes) and thought it worthwhile covering here.

 

Announcement Summary

  • Instant recovery for Oracle databases;
  • NAS Direct Archive to protect massive unstructured data sets;
  • Microsoft Office 365 support via Polaris SaaS Platform;
  • SAP-certified protection for SAP HANA;
  • Policy-driven protection for Epic EHR; and
  • Rubrik works with Rubrik Datos IO to protect NoSQL databases.

 

New Features and Enhancements

As you can see from the list above, there’s a bunch of new features and enhancements. I’ll try and break down a few of these in the section below.

Oracle Protection

Rubrik have had some level of capability with Oracle protection for a little while now, but things are starting to hot up with 5.0.

  • Simplified configuration (Oracle Auto Protection and Live Mount, Oracle Granular SLA Policy Assignments, and Oracle Automated Instance and Database Discovery)
  • Orchestration of operational and PiT recoveries
  • Increased control for DBAs

NAS Direct Archive

People have lots of data now. Like, a real lot. I don’t know how many Libraries of Congress exactly, but it can be a lot. Previously, you’d have to buy a bunch of Briks to store this data. Rubrik have recognised that this can be a bit of a problem in terms of footprint. With NAS Direct Archive, you can send the data to an “archive” target of your choice. So now you can protect a big chunk of data that passes through the Rubrik environment to an end target such as object storage, public cloud, or NFS. The idea is to reduce the number of Rubrik devices you need to buy. Which seems a bit weird, but their customers will be pretty happy to spend their money elsewhere.

[image courtesy of Rubrik]

It’s simple to get going, requiring just the tick of a box to configure. The metadata remains protected within the Rubrik cluster, and the good news is that nothing changes from the end user recovery experience.

Elastic App Service (EAS)

Rubrik now provides the ability to ingest DBs across a wider spectrum, allowing you to protect more of the DB-based applications you want, not just SQL and Oracle workloads.

SAP HANA Protection

I’m not really into SAP HANA, but plenty of organisations are. Rubrik now offer an SAP-certified solution which, if you’ve had the misfortune of trying to protect SAP workloads before, is kind of a neat feature.

[image courtesy of Rubrik]

SQL Server Enhancements

There have been some nice enhancements with SQL Server protection, including:

  • A Change Block Tracking (CBT) filter driver to decrease backup windows; and
  • Support for group Volume Shadow Copy Service (VSS) snapshots.

So what about Group Backups? The nice thing about these is that you can protect many databases on the same SQL Server. Rather than process each VSS Snapshot individually, Rubrik will group the databases that belong to the same SLA Domain and process the snapshots as a batch group. There are a few benefits to this approach:

  • It reduces SQL Server overhead and decreases the time required to complete a backup; and
  • In turn, it allows customers to take more frequent backups of their databases, delivering a lower RPO to the business.
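Conceptually, the batching looks something like the sketch below – a simplification of the grouping logic to illustrate the idea, not Rubrik’s actual implementation.

from collections import defaultdict

def plan_vss_snapshots(databases):
    # Group databases by SLA Domain so each group is captured in a
    # single VSS snapshot, rather than one snapshot per database.
    groups = defaultdict(list)
    for db in databases:
        groups[db["sla_domain"]].append(db["name"])
    return groups

dbs = [
    {"name": "Sales", "sla_domain": "Gold"},
    {"name": "HR", "sla_domain": "Gold"},
    {"name": "Staging", "sla_domain": "Bronze"},
]
for sla, names in plan_vss_snapshots(dbs).items():
    print(f"One VSS snapshot for SLA Domain '{sla}': {names}")

Two databases in the Gold SLA Domain means one VSS snapshot instead of two, and the saving scales with the number of databases on the host.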

vSphere Enhancements

Rubrik have done vSphere things since forever, and this release includes a few nice enhancements, including:

  • Live Mount VMDKs from a Snapshot – providing the option to choose to mount specific VMDKs instead of an entire VM; and
  • After selecting the VMDKs, the user can select a specific compatible VM to which the mounted VMDKs will be attached.

Multi-Factor Authentication

The Rubrik Andes 5.0 integration with RSA SecurID will include RSA Authentication Manager 8.2 SP1+ and RSA SecurID Cloud Authentication Service. Note that CDM will not be supporting the older RADIUS protocol. Enabling this is a two-step process:

  • Add the RSA Authentication Manager or RSA Cloud Authentication Service in the Rubrik Dashboard; and
  • Enable RSA and associate a new or existing local Rubrik user or a new or existing LDAP server with the RSA Authentication Manager or RSA Cloud Authentication Service.

You also get the ability to generate API tokens. Note that if you want to interact with the Rubrik CDM CLI (and have MFA enabled) you’ll need these.
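As a rough illustration of what that looks like from a script, here’s a sketch using Python’s requests library. The cluster address and token are placeholders, and you should check the CDM API documentation for the exact endpoints available in your version.

import requests

CLUSTER = "https://rubrik.example.com"   # cluster address (placeholder)
TOKEN = "api-token-generated-in-the-ui"  # placeholder

headers = {"Authorization": f"Bearer {TOKEN}"}

# A simple authenticated smoke test: ask the cluster about itself.
# verify=False only because a lab cluster may use a self-signed cert.
resp = requests.get(f"{CLUSTER}/api/v1/cluster/me",
                    headers=headers, verify=False)
resp.raise_for_status()
print(resp.json())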

Other Bits and Bobs

There are a few other enhancements in this release, including:

  • Windows Bare Metal Recovery;
  • SLA Policy Advanced Configuration;
  • Additional Reporting and Metrics; and
  • Snapshot Retention Enhancements.

 

Thoughts and Further Reading

Wahl introduced the 5.0 briefing by talking about digital transformation as being, at its core, an automation play. The availability of a bunch of SaaS services can lead to fragmentation in your environment, and legacy technology doesn’t deal well with that fragmentation, which makes transformation hard. Rubrik are positioning themselves as a modern company, well-placed to help you with the challenges of protecting what can quickly become a complex and hard to contain infrastructure. It’s easy to sit back and tell people how transformation can change their business for the better, but these kinds of conversations often eschew the high levels of technical debt in the enterprise that the business is doing its best to ignore. I don’t really think that transformation is as simple as some vendors would have us believe, but I do support the idea that Rubrik are working hard to make complex concepts and tasks as simple as possible. They’ve dropped a shedload of features and enhancements in this release, and have managed to do so in a way that you won’t need to install a bunch of new applications to support these features, and you won’t need to do a lot to get up and running either. For me, this is the key advantage that the “next generation” data protection companies have over their more mature competitors. If you haven’t been around for decades, you very likely don’t offer support for every platform and application under the sun. You also likely don’t have customers that have been with you for 20 years that you need to support regardless of the official support status of their applications. This gives the likes of Rubrik the flexibility to deliver features as and when customers require them, while still focussing on keeping the user experience simple.

I particularly like the NAS Direct Archive feature, as it shows that Rubrik aren’t simply in this to push a bunch of tin onto their customers. A big part of transformation is about doing things smarter, not just faster. The folks at Rubrik understand that there are other solutions out there that can deliver large capacity solutions for protecting big chunks of data (i.e. NAS workloads), so they’ve focussed on leveraging other capabilities, rather than trying to force their customers to fill their data centres with Rubrik gear. This is the kind of thinking that potential customers should find comforting. I think it’s also the kind of approach that a few other vendors would do well to adopt.

*Update*

Here are some links to other articles on Andes from other folks I read that you may find useful:

Vembu BDR Suite 4.0 Is Coming

Disclaimer

Vembu are a site sponsor of PenguinPunk.net. They’ve asked me to look at their product and write about it. I’m in the early stages of evaluating the BDR Suite in the lab, but thought I’d pass on some information about their upcoming 4.0 release. As always, if you’re interested in these kind of solutions, I’d encourage you to do your own evaluation and get in touch with the vendor, as everyone’s situation and requirements are different. I can say from experience that the Vembu sales and support staff are very helpful and responsive, and should be able to help you with any queries. I recently did a brief article on getting started with BDR Suite 3.9.1 that you can download from here.

 

New Features

So what’s coming in 4.0?

Hyper-V Cluster Backup

Vembu will support backing up VMs in a Hyper-V cluster; even if VMs configured for backup are moved from one host to another, incremental backups will continue without any interruption.

Shared VHDx Backup

Vembu now supports backup of the shared VHDx of Hyper-V.

CheckSum-based Incrementals

Vembu uses Changed Block Tracking (CBT) for incremental backups. For cases where CBT fails, it will fall back to checksum-based comparison so that incrementals can continue without any interruption.
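In broad strokes, a checksum-based incremental works by hashing fixed-size blocks of the source and comparing the digests against those from the previous run. The sketch below is a conceptual illustration of the technique, not Vembu’s implementation, and the block size is arbitrary.

import hashlib

BLOCK_SIZE = 1024 * 1024  # 1MB blocks (illustrative)

def block_checksums(path):
    # Return a per-block digest list for a disk or file image.
    sums = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            sums.append(hashlib.sha256(block).hexdigest())
    return sums

def changed_blocks(old_sums, new_sums):
    # Indexes of blocks that differ since the last backup; these are
    # the only blocks the incremental needs to transfer.
    return [i for i, digest in enumerate(new_sums)
            if i >= len(old_sums) or old_sums[i] != digest]

It’s more I/O-intensive than CBT (you have to read everything to hash it), which is presumably why it’s a fallback rather than the default.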

Credential Manager

No need to enter credentials every time: Vembu Credential Manager now allows you to manage the credentials of the host and the VMs running on it. This will be particularly handy if you’re doing a lot of application-aware backup job configuration.

 

Thoughts

I had a chance to speak with Vembu about the product’s functionality. There’s a lot to like in terms of breadth of features. I’m interested in seeing how 4.0 goes when it’s released and hope to do a few more articles on the product then. If you’re looking to evaluate the product, this evaluator’s guide is as good a place as any to start. As an aside, Vembu are also offering 10% off their suite this Halloween (until November 2nd) – see here for more details.

For a fuller view of what’s coming in 4.0, you can read Vladan‘s coverage here.

Updated Articles Page

I recently had the opportunity to deploy a Vembu BDR 3.9.1 Update 1 appliance and thought I’d run through the basics of getting started. There’s a new document outlining the process on the articles page.

Cohesity Basics – Excluding VMs Using Tags

I’ve been doing some work with Cohesity in our lab and thought it worth covering some of the basic features that I think are pretty neat. In this edition of Cohesity Basics, I thought I’d quickly cover off how to exclude VMs from protection jobs based on assigned tags. In this example I’m using version 6.0.1b_release-20181014_14074e50 (a “feature release”).

 

Process

The first step is to find the VM in vCenter that you want to exclude from a protection job. Right-click on the VM and select Tags & Custom Attributes. Click on Assign Tag.

In the Assign Tag window, click on the New Tag icon.

Assign a name to the new tag, and add a description if that’s what you’re into.

In this example, I’ve created a tag called “COH-Test”, and put it in the “Backup” category.

Now go to the protection job you’d like to edit.

Click on the Tag icon on the right-hand side. You can then select the tag you created in vCenter. Note that you may need to refresh your vCenter source for this new tag to be reflected.

When you select the tag, you can choose to Auto Protect or Exclude the VM based on the applied tags.

If you drill in to the objects in the protection job, you can see that the VM I wanted to exclude from this job has been excluded based on the assigned tag.
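Under the hood, the effective membership logic is intuitive: exclusion wins. The sketch below is my own simplification to illustrate the behaviour, not Cohesity’s code.

def in_protection_job(vm_tags, auto_protect_tags, exclude_tags):
    # A VM is excluded if it carries any exclude tag; otherwise it is
    # included if it carries any auto-protect tag.
    vm_tags = set(vm_tags)
    if vm_tags & set(exclude_tags):
        return False
    return bool(vm_tags & set(auto_protect_tags))

# The VM tagged COH-Test stays out of the job even if it also carries
# a tag the job auto-protects on.
print(in_protection_job(["Backup-All", "COH-Test"],
                        ["Backup-All"], ["COH-Test"]))  # False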

 

Thoughts

I’ve written enthusiastically about Cohesity’s Auto Protect feature previously. Sometimes, though, you need to exclude VMs from protection jobs. Using tags is a quick and easy way to do this, and it’s something that your virtualisation admin team will be happy to use too.

Hyper-Veeam

Disclaimer: I recently attended VeeamON Forum Sydney 2018. My flights and accommodation were paid for by Veeam. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event. Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.

I recently had the opportunity to attend VeeamON Forum in Sydney courtesy of Veeam. I was lucky enough to see Dave Russell‘s keynote speech, and also fortunate to spend some time chatting with him in the afternoon. Dave was great to talk to and I thought I’d share some of the key points here.

 

Hyper All of the Things

If you scroll down Veeam’s website you’ll see mention of a number of different “hyper” things, including hyper-availability. Veeam are keen to position themselves as an availability company, with their core focus being on making data you need recoverable, at the time when you need it to be recoverable.

Hyper-critical

Russell mentioned that data has become “hyper-critical” to business, with the likes of:

  • GDPR compliance;
  • PII data retention;
  • PCI compliance requirements;
  • Customer data; and
  • Financial records, etc.

Hyper-growth

Russell also spoke about the hyper-growth of data, with all kinds of data (including structured, unstructured, application, and Internet of Things data) growing at a rapid clip.

Hyper-sprawl

This explosive growth of data has also led to the “hyper-sprawl” of data, with your data now potentially living in any or all of the following locations:

  • SaaS-based solutions
  • Private cloud
  • Public cloud

 

Five Stages of Intelligent Data Management

Russell broke down Intelligent Data Management (IDM) into 5 stages.

Backup

A key part of any data management strategy is the ability to back up all workloads and ensure they are always recoverable in the event of outages, attack, loss or theft.

Aggregation

The ability to cope with data sprawl, as well as growth, means you need to ensure protection and access to data across multiple clouds to drive digital services and ensure continuous business operations.

Visibility

It’s not just about protecting vast chunks of data in multiple places though. You also need to look at the requirement to “improve management of data across multi-clouds with clear, unified visibility and control into usage, performance issues and operations”.

Orchestration

Orchestration, ideally, can then be used to “[s]eamlessly move data to the best location across multi-clouds to ensure business continuity, compliance, security and optimal use of resources for business operations”.

Automation

The final piece of the puzzle is automation. According to Veeam, you can get to a point where the “[d]ata becomes self-managing by learning to backup, migrate to ideal locations based on business needs, secure itself during anomalous activity and recover instantaneously”.

 

Thoughts

Data growth is not a new phenomenon by any stretch, and Veeam obviously aren’t the first to notice that protecting all this stuff can be hard. Sprawl is also becoming a real problem in all types of environments. It’s not just about knowing you have some unstructured data that can impact workflows in a key application. It’s about knowing which cloud platform that data might reside in. If you don’t know where it is, it makes it a lot harder to protect, and your risk profile increases as a result. It’s not just the vendors banging on about data growth through IoT either; it’s a very real phenomenon that is creating all kinds of headaches for CxOs and their operations teams. Much like the push into public cloud by “shadow IT” teams, IoT solutions are popping up in all kinds of unexpected places in the enterprise and making it harder to understand exactly where the important data is being kept and how it’s protected.

Veeam are talking a very good game around intelligent data management. I remember a similar approach being adopted by a three-letter storage company about a decade ago. They lost their way a little under the weight of acquisitions, but the foundation principles seem to still hold water today. Dave Russell obviously saw quite a bit at Gartner in his time there prior to Veeam, so it’s no real surprise that he’s pushing them in this direction.

Backup is just the beginning of the data management problem. There’s a lot else that needs to be done in order to get to the “intelligent” part of the equation. My opinion remains that a lot of enterprises are still some ways away from being there. I also really like Veeam’s focus on moving from policy-based through to a behaviour-based approach to data management.

I’ve been aware of Veeam for a number of years now, and have enjoyed watching them grow as a company. They’re working hard to make their way in the enterprise now, but still have a lot to offer the smaller environments. They tell me they’re committed to remaining a software-only solution, which gives them a certain amount of flexibility in terms of where they focus their R & D efforts. There’s a great cloud story there, and the bread and butter capabilities continue to evolve. I’m looking to see what they have coming over the next 12 months. It’s a relatively crowded market now, and it’s only going to get more competitive. I’ll be doing a few more articles in the next month or two focusing on some of Veeam’s key products so stay tuned.

Rubrik Basics – SLA Domains

I’ve been doing some work with Rubrik in our lab and thought it worth covering some of the basic features that I think are pretty neat. In this edition of Rubrik Basics, I thought I’d quickly cover off Service Level Agreement (SLA) Domains – one of the key tenets of the Rubrik architecture.

 

The Defaults

Rubrik CDM has three default local SLA Domains. Of course, they’re named after precious metals. There’s something about Gold that people seem to understand better than calling things Tier 0, 1 and 2. The defaults are Gold, Silver, and Bronze. The problem, of course, is that people start to ask for Platinum because they’re very important. The good news is you can create SLA Domains and call them whatever you want. I created one called Adamantium. Snick snick.

Note that these policies have the archival policy and the replication policy disabled, don’t have a Snapshot Window configured, and do not set a Take First Full Snapshot time. I recommend you leave the defaults as they are and create some new SLA Domains that align with what you want to deliver in your enterprise.

 

Service Level Agreement

There are two components to the SLA Domain. The first is the Service Level Agreement, which defines a number of things, including the frequency of snapshot creation and their retention. Note that you can’t go below an hour for your snapshot frequency (unless I’ve done something wrong here). You can go berserk with retention though. Keep those “kitchen duty roster.xls” files for 25 years if you like. Modern office life can be gruelling at times.

A nice feature is the ability to configure a Snapshot Window. The idea is that you can enforce time periods where you don’t perform operations on the systems being protected by the SLA Domain. This is handy if you’ve got systems that run batch processing or just need a little time to themselves every day to reflect on their place in the world. Every system needs a little time every now and then.

If you have a number of settings in the SLA, the Rubrik cluster creates snapshots to satisfy the smallest frequency that is specified. If the Hourly rule has the smallest frequency, it works to that. If the Daily rule has the smallest frequency, it works to that, and so on. Snapshot expiration is determined by the rules you put in place combined with their frequency.
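As a rough model of that behaviour (my interpretation of the rules described above, not Rubrik’s actual scheduler):

def snapshot_schedule(rules_hours, horizon_hours=24):
    # Given frequency rules (name -> snapshot every N hours), snapshot
    # at the smallest frequency and note which rules each snapshot
    # counts toward; retention is then evaluated per rule.
    interval = min(rules_hours.values())
    schedule = []
    for t in range(0, horizon_hours, interval):
        satisfied = [name for name, every in rules_hours.items()
                     if t % every == 0]
        schedule.append((t, satisfied))
    return schedule

# An SLA with a 4-hourly rule and a daily rule snapshots every 4 hours;
# the midnight snapshot counts toward both rules.
for hour, rules in snapshot_schedule({"4-hourly": 4, "daily": 24}):
    print(f"t+{hour:02d}h -> counts toward {rules}")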

 

Remote Settings

The second page of the Create SLA Domain window is where you can configure the remote settings. I wrote an article on setting up Archival Locations previously – this is where you can take advantage of that. One of the cool things about Rubrik’s retention policy is that you can choose to send a bunch of stuff to an off-site location and keep, say, 30 days of data on Brik. The idea is that you don’t then have to invest in a tonne of Briks, so to speak, to satisfy your organisation’s data protection retention policy.

 

Thoughts

If you’ve had the opportunity to test-drive Rubrik’s offering, you’ll know that everything about it is pretty simple. From deployment to ongoing operation, there aren’t a whole lot of nerd knobs to play with. It nonetheless does the job of protecting the workloads you point it at. A lot of the complexity normally associated with data protection is masked by a fairly simple model that will hopefully make data protection a little more appealing for the average Joe or Josie responsible for infrastructure operations.

Rubrik, and a number of other solution vendors, are talking a lot about service levels and policy-driven data protection. The idea is that you can protect your data based on a service catalogue type offering rather than the old style of periodic protection that was offered with little flexibility (“We backup daily, we keep it 90 days, and sometimes we keep the monthly tape for longer”). This strikes me as an intuitive way to deliver data protection capabilities, provided that your business knows what they want (or need) from the solution. That’s always the key to success – understanding what the business actually needs to stay in business. You can do a lot with modern data protection offerings. Call it SLA-based, talk about service level objectives, make t-shirts with “policy-driven” on them and hand them out to your executives. But unless you understand what’s important for your business to stay in business when there’s a problem, then it won’t really matter which solution you’ve chosen.

Chris Wahl wrote some neat posts (a little while ago) on SLAs and their challenges on the Rubrik blog that you can read here and here.

Disaster Recovery vs Disaster Avoidance vs Data Protection

This is another one of those rambling posts that I like to write when I’m sitting in an airport lounge somewhere and I’ve got a bit of time to kill. The versus in the title is a bit misleading too, because DR and DA are both forms of data protection. And periodic data protection (PDP) is important too. But what I wanted to write about was some of the differences between DR and DA, in particular.

TL;DR – DR is not DA, and this is not PDP either. But you need to think about all of them at some point.

 

Terminology

I want to be clear about what I mean when I say these terms, because it seems like they can mean a lot of things to different folks.

  • Recovery Point Objective – The Recovery Point Objective (RPO) is the maximum amount of data, measured in time, that may be permanently lost during an incident. You want this to be in minutes and hours, not days or weeks (ideally). RPO 0 is the idea that no data is lost when there’s a failure. A lot of vendors will talk about “Near Zero” RPOs.
  • Recovery Time Objective – The Recovery Time Objective (RTO) is the amount of time the business can be without the service, without incurring significant risks or significant losses. This is, ostensibly, how long it takes you to get back up and running after an event. You don’t really want this to be in days and weeks either.
  • Disaster Recovery – Disaster Recovery is the ability to recover applications after a major event (think flood, fire, DC is now a hole in the ground). This normally involves a failover of workloads from one DC to another in an orchestrated fashion.
  • Disaster Avoidance – Disaster avoidance “is an anticipatory strategy that is in place in order to prevent any such instance of data breach or losses. It is a defensive, proactive approach to keeping data safe” (I’m quoting this from a great blog post on the topic here)
  • Periodic Data Protection – This is the kind of data protection activity we normally associate with “backups”. It is usually a daily activity (or perhaps as frequent as hourly) and the data is normally used for ad-hoc data file recovery requests. Some people use their backup data as an archive. They’re bad people and shouldn’t be trusted. PDP is normally separate to DA or DR solutions.
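A quick worked example of how protection frequency drives RPO (simplified arithmetic; real figures will vary with change rates and replication lag):

# Worst case, data written immediately after a protection run is lost
# right up until the next copy lands somewhere safe.
snapshot_interval_min = 60   # hourly snapshots
replication_lag_min = 15     # time to get each snapshot off-site

worst_case_rpo_min = snapshot_interval_min + replication_lag_min
print(f"Worst-case RPO: {worst_case_rpo_min} minutes")  # 75 minutes

# RPO 0 is the synchronous replication case: every write is
# acknowledged at both sites before the application sees success.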

 

DR Isn’t The Full Answer

I’ve had some great conversations with customers recently about adding resilience to their on-premises infrastructure. It seems like an old-fashioned concept, but a number of organisations are only now seeing the benefits of adding infrastructure-level resilience to their platforms. The first conversation usually goes something like this:

Me: So what’s your key application, and what’s your resiliency requirement?

Customer: Oh, it’s definitely Application X (usually built on Oracle or using SAP or similar). It absolutely can’t go down. Ever. We need to have RPO 0 and RTO 0 for this one. Our whole business depends on it.

Me: Okay, it sounds like it’s pretty important. So what about your file server and email?

Customer: Oh, that’s not so important. We can recover those from overnight backups.

Me: But aren’t they used to store data for Application X? Don’t you have workflows that rely on email?

Customer: Oh, yeah, I guess so. But it will be too expensive to protect all of this. Can we change the RPO a bit? I don’t think the CFO will support us doing RPO 0 everywhere.

These requirements tend to change whenever we move from technical discussions to commercial discussions. In an ideal world, Martha in Accounting will have her home directory protected in a highly available fashion such that it can withstand the failure of one or more storage arrays (or data centres). The problem with this is that, if there are 1000 Marthas in the organisation, the cost of protecting that kind of data at scale becomes prohibitive, relative to the perceived value of the data. This is one of the ways I’ve seen “DR” capability added to an environment in the past. Take some older servers and put them in a site removed from the primary site, set up some scripts to copy critical data to that site, and hope nothing ever goes too wrong with the primary site.

There are obviously better ways of doing this, and common solutions may or may not involve block-level storage replication, orchestrated failover tools, and like for like compute at the secondary site (or perhaps you’ve decided to shut down test and development while you’re fixing the problem at the production site).

But what are you trying to protect against? The failure of some compute? Some storage? The network layer? A key application? All of these answers will determine the path you’ll need to go down. Keep in mind also that DR isn’t the only answer. You also need to have business continuity processes in place. A failover of workloads to a secondary site is pointless if operations staff don’t have access to a building to continue doing their work, or if people can’t work when the swipe card access machine is off-line, or if your Internet feed only terminates in one DC, etc.

 

I’m Avoiding The Problem

Disaster Avoidance is what I like to call the really sexy resilience solution. You can have things go terribly wrong with your production workload and potentially still have it functioning like there was no problem. This is where hardware solutions like Pure Storage ActiveCluster or Dell EMC VPLEX can really shine, assuming you’ve partnered them with applications that have the smarts built in to leverage what they have to offer. Because that’s the real key to a successful disaster avoidance design. It’s great to have synchronous replication and cache-consistency across DCs, but if your applications don’t know what to do when a leg goes missing, they’ll fall over. And if you don’t have other protection mechanisms in place, such as periodic data protection, then your synchronous block replication solution will merrily synchronise malware or corrupted data from one site to another in the blink of an eye.

It’s important to understand the failure scenarios you’re protecting against too. If you’ve deployed vSphere Metro Storage Cluster, you’ll be able to run VMs even when your whole array has gone off-line (assuming you’ve set it up properly). But this won’t necessarily prevent an outage if you lose your vSphere cluster, or the whole DC. Your data will still be protected, and you’ll be in good shape in terms of recovering quickly, but there will be an outage. This is where application-level resilience can help with availability. Remember that, even if you’ve got ultra-resilient workload protection across DCs, if your staff only have one connection into the environment, they may be left twiddling their thumbs in the event of a problem.

There’s a level of resiliency associated with this approach, and your infrastructure will certainly be able to survive the failure of a compute node, or even a bunch of disk and some compute (everything will reboot in another location). But you need to be careful not to let people think that this is something it’s not.

 

PDP, Yeah You Know Me

I mentioned problems with malware and data corruption earlier on. This is where periodic data protection solutions (such as those sold by Dell EMC, CommVault, Rubrik, Cohesity, Veeam, etc) can really get you out of a spot of bother. And if you don’t need to recover the whole VM when there’s a problem, these solutions can be a lot quicker at getting data back. The good news is that you can integrate a lot of these products with storage protection solutions and orchestration tools for a belt and braces solution to protection, and it’s not the shitshow of scripts and kludges that it was ten years ago. Hooray!

 

Final Thoughts

There’s a lot more to data protection than I’ve covered here. People like Preston have written books about the topic. And a lot of the decision making is potentially going to be out of your hands in terms of what your organisation can afford to spend (until they lose a lot of data, money, or both; then they’ll maybe change their focus). But if you do have the opportunity to work on some of these types of solutions, at least try to make sure that everyone understands exactly what they can achieve with the technologies at hand. There’s nothing worse than being hauled over the coals because some director thought they could do something amazing with infrastructure-level availability and resiliency only to have the whole thing fall over due to lack of budget. It can be a difficult conversation to have, particularly if your executives are the types of people who like to trust the folks with the fancy logos on their documents. All you can do in that case is try and be clear about what’s possible, and clear about what it will cost in time and money.

In the near future I’ll try to put together a post on various infrastructure failure scenarios and what works and what doesn’t. RPO 0 seems to be what everyone is asking for, but it may not necessarily be what everyone needs. Now please enjoy this Unfinished Business stock image.

Rubrik Basics – Cluster Upgrade Process

I’ve been doing some work with Rubrik in our lab and thought it worth covering some of the basic features that I think are pretty neat. In this edition of Rubrik Basics, I thought I’d quickly cover off software upgrades. There are two ways to upgrade the Rubrik software on your Brik – via USB and SFTP. Either way, you’ll need access to the Downloads section of the support site. If you’re a customer, you’ll have this already. If this all sounds too hard, you can raise a ticket with the support team and they’ll tunnel in and do the upgrade for you (assuming you’ve allowed remote tunnel capability).

 

USB

The good thing about using a USB drive is that you can still keep appliances in “dark” sites up to date. Before you begin you’ll need to do two things:

  • Download the compressed upgrade archive and the matching signature file from the customer portal.
  • Format a removable drive with the FAT32 file system.

You’ll need to copy the upgrade file and matching signature file to the removable drive. Plug that into any node in the cluster. Log in to that node as the admin user. Mount the USB drive by typing the following command:

mount --usb_device

Type the following command to begin the upgrade:

upgrade start

The upgrade system scans the file system for upgrade archives. If multiple archives are available, it displays a list of choices. Once you’ve finished, you can unmount the device.

umount --usb_device

 

SFTP

You can also run the upgrade via SFTP. I found the instructions on how to do that here. It’s not too dissimilar to the USB method. You’ll want to use your favourite SFTP client to upload the files to the /upgrade directory. Once you’ve done that, ssh on to the node and you can run a pre-flight check. If everything comes up Milhouse you’ll be good to go for the next step.
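If you’d rather script the upload step, something like the following paramiko sketch would do it – the node address, password, and filename here are placeholders.

import paramiko

# Connect to any node in the cluster as the admin user.
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("10.0.0.131", username="admin", password="changeme")

# Push the upgrade tarball (and its signature file, if required)
# into the /upgrade directory.
sftp = ssh.open_sftp()
sftp.put("rubrik-4.1.2-2366.tar.gz", "/upgrade/rubrik-4.1.2-2366.tar.gz")
sftp.close()
ssh.close()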

Using username "admin".

admin@10.xxx.yyy.131's password:

=======================

Welcome to Rubrik CLI

=======================

Type 'help' or '?' to list commands

RVM165Sxxxx55 >> upgrade start --mode prechecks_only
Do you want to use --share rubrik-4.1.2-2366.tar.gz [y/N] [N]: y
Upgrade status: Started pre-checks successfully
RVM165Sxxxx55 >> upgrade status
Current upgrade mode: prechecks_only
Current upgrade pre-checks node: RVM165Sxxxx55
Current upgrade pre-checks tarball name: --share rubrik-4.1.2-2366.tar.gz
Current upgrade pre-checks status: In progress
Current run started at: 2018-07-19 00:48:04.437000 UTC+0000

Current state (3/6): VERIFYING
Current task: Verify authenticity of new software
Current state progress: 0.0%

Finished states (2/6): ACQUIRING, COPYING
Pending states (3/6): UNTARING, DEPLOYING, PRECHECKING

Time taken so far: 18.38 seconds
Overall upgrade progress: 6.0%

To check on progress, run “upgrade status” to, erm, check on the status of the upgrade.

RVM165Sxxxx55 >> upgrade status
Last upgrade mode: prechecks_only
Last upgrade pre-checks node: RVM165Sxxxx55
Last upgrade pre-checks tarball name: --share rubrik-4.1.2-2366.tar.gz
Last upgrade pre-checks status: Completed successfully
Last run ended at: 2018-07-19 00:51:03.129000 UTC+0000
Current state: IDLE

Now you’re ready to do it for real. Run “upgrade start” to start.

RVM165Sxxxx55 >> upgrade start
Do you want to use --share rubrik-4.1.2-2366.tar.gz [y/N] [N]: y
Upgrade status: Started upgrade successfully
RVM165Sxxxx55 >> upgrade status
Current upgrade mode: normal
Current upgrade node: RVM165Sxxxx55
Current upgrade tarball name: --share rubrik-4.1.2-2366.tar.gz
Current upgrade status: In progress
Current run started at: 2018-07-19 00:52:56.882000 UTC+0000

Current state (4/9): UNTARING
Current task: Extract new software
Current state progress: 0.0%

Finished states (3/9): ACQUIRING, COPYING, VERIFYING
Pending states (5/9): DEPLOYING, PRECHECKING, PREPARING, UPGRADING, RESTARTING

Time taken so far: 22.52 seconds
Overall upgrade progress: 3.5%

It’s a pretty quick process, and eventually you’ll see this message.

RVM165Sxxxx55 >> upgrade status
Last upgrade mode: normal
Last upgrade node: RVM165Sxxxx55
Last upgrade tarball name: --share rubrik-4.1.2-2366.tar.gz
Last upgrade status: Completed successfully
Last run ended at: 2018-07-19 01:19:09.719000 UTC+0000

Current state: IDLE
RVM165Sxxxx55 >>

And you’re all done. Note that you only have to upload the data and run the process on one node in the cluster.

Rubrik CDM 4.1.1 – A Few Notes

Here are a few random notes on things in Rubrik‘s Cloud Data Management (CDM) 4.1.1-p4-2319 that I’ve come across in my recent testing in the lab. There’s not enough in each item to warrant a full post, hence the “few notes” format. Note that some of these things have been around for a while, I just wanted to note the specific version of Rubrik CDM I’m working with.

 

Guest OS Credentials

Rubrik uses Guest OS credentials for access to a VM’s operating system. When you add VM workload to your Rubrik environment, you may see the following message in the logs.

Note that it’s a warning, not an error. You can still back up the VM, just not to the level you might have hoped for. If you want to do a direct restore on a Linux guest, you’ll need an account with write access. For Windows, you’ll need something with administrative access. You could achieve this with either local or domain administrator accounts. This isn’t recommended though, and Rubrik suggests “a credential for a domain level account that has a small privilege set that includes administrator access to the relevant guests”. You could use a number of credentials across multiple groups of machines to reduce (to a small extent) the level of exposure, but there are plenty of CISOs and Windows administrators who are not going to like this approach.

So what happens if you don’t provide the credentials? My understanding is that you can still do file system consistent snapshots (provided you have a current version of VMware Tools installed), you just won’t be able to do application-consistent backups. For your reference, here’s the table from Rubrik discussing the various levels of available consistency.

Consistency level: Inconsistent
Description: A backup that consists of copying each file to the backup target without quiescence. File operations are not stopped, so the result is inconsistent time stamps across the backup and, potentially, corrupted files.
Rubrik usage: Not provided.

Consistency level: Crash consistent
Description: A point-in-time snapshot taken without quiescence.

  • Time stamps are consistent;
  • Pending updates for open files are not saved; and
  • In-flight I/O operations are not completed.

The snapshot can be used to restore the virtual machine to the same state that a hard reset would produce.
Rubrik usage: Provided only when:

  • The guest OS does not have VMware Tools;
  • The guest OS has an out-of-date version of VMware Tools; or
  • The VM’s Application Consistency was manually set to Crash Consistent in the Rubrik UI.

Consistency level: File system consistent
Description: A point-in-time snapshot taken with quiescence.

  • Time stamps are consistent;
  • Pending updates for open files are saved;
  • In-flight I/O operations are completed; and
  • Application-specific operations may not be completed.

Rubrik usage: Provided when the guest OS has an up-to-date version of VMware Tools and application consistency is not supported for the guest OS.

Consistency level: Application consistent
Description: A point-in-time snapshot taken with quiescence and application-awareness.

  • Time stamps are consistent;
  • Pending updates for open files are saved;
  • In-flight I/O operations are completed; and
  • Application-specific operations are completed.

Rubrik usage: Provided when the guest OS has an up-to-date version of VMware Tools and application consistency is supported for the guest OS.

 

open-vm-tools

If you’re running something like Debian in your vSphere environment you may have chosen to use open-vm-tools rather than VMware’s package. There’s nothing wrong with this (it’s a VMware-supported configuration), but you’ll see that Rubrik currently has a bit of an issue with it.

It will still back up the VM, just not at the consistency level you may be hoping for. It’s on Rubrik’s list of things to fix. And VMware Tools is still a valid (and arguably preferred) option for supported Linux distributions. The point of open-vm-tools is that appliance vendors can distribute the tools with their VMs without violating licensing agreements.

 

Download Logs

It seems like a simple thing, but I really like the ability to download logs related to a particular error. In this example, I’ve got some issues with a SQL cluster I’m backing up. I can click on “Download Logs” and grab the info I need related to the SLA Activity. It’s a small thing, but it makes wading through logs to identify issues a little less painful.