VMware Cloud on AWS – TMCHAM – Part 13 – Delete the SDDC

Following on from my article on host removal, in this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to cover SDDC removal on the VMware-managed VMware Cloud on AWS platform. Don’t worry, I haven’t lost my mind in a post-acquisition world. Rather, this is some of the info you’ll find useful if you’ve been running a trial or a proof of concept (as opposed to a pilot) deployment of VMware Cloud Disaster Recovery (VCDR) and / or VMware Cloud on AWS and want to clean some stuff up when you’re all done.

 

Process

Firstly, if you’re using VCDR and want to deactivate the deployment, the steps to perform are outlined here, and I’ve copied the main bits from that page below.

  1. Remove all DRaaS Connectors from all protected sites. See Remove a DRaaS Connector from a Protected Site.
  2. Delete all recovery SDDCs. See Delete a Recovery SDDC.
  3. Deactivate the recovery region from the Global DR Console. (Do this step last.) See Deactivate a Recovery Region. Usage charges for VMware Cloud DR are not stopped until this step is completed.

Funnily enough, as I was writing this, someone zapped our lab for reasons. So this is what a Region deactivation looks like in the VCDR UI.

Note that it’s important you perform these steps in that order, or you’ll have more cleanup work to do to get everything looking nice and tidy. I have witnessed firsthand someone doing it the other way and it’s not pretty. Note also that if your Recovery SDDC had services such as HCX connected, you should hold off deleting the Recovery SDDC until you’ve cleaned that bit up.

Secondly, if you have other workloads deployed in a VMware Cloud on AWS SDDC and want to remove a PoC SDDC, there are a few steps that you will need to follow.

If you’ve been using HCX to test migrations or network extension, you’ll need to follow these steps to remove it. Note that this should be initiated from the source side, and your HCX deployment should be in good order before you start (site pairings functioning, etc). You might also wish to remove a vCenter Cloud Gateway, and you can find information on that process here.

Finally, there are some AWS activities that you might want to undertake to clean everything up. These include:

  • Removing VIFs attached to your AWS VPC (see the sketch after this list).
  • Deleting the VPC (this will likely be required if your organisation has a policy about how PoC deployments are managed).
  • Tidying up any on-premises routing and firewall rules that may have been put in place for the PoC activity.
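
If you prefer to do the AWS side from the command line, a rough sketch of the VIF and VPC cleanup looks something like the following. I’m assuming the AWS CLI is configured with appropriate credentials, and the IDs shown are placeholders, so check with whoever owns the AWS account before deleting anything.

# List the Direct Connect virtual interfaces and identify the ones used for the PoC
aws directconnect describe-virtual-interfaces

# Delete the PoC virtual interface (placeholder ID)
aws directconnect delete-virtual-interface --virtual-interface-id dxvif-xxxxxxxx

# Delete the connected VPC once its dependencies (subnets, gateways, endpoints) are gone (placeholder ID)
aws ec2 delete-vpc --vpc-id vpc-xxxxxxxx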

And that’s it. There’s not a lot to it, but tidying everything up after a PoC will ensure that you avoid any unexpected costs popping up in the future.

VMware Cloud on AWS – TMCHAM – Part 12 – Host Removal

In this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to cover host removal on the VMware-managed VMware Cloud on AWS platform. This is a fairly brief post, as there’s not a lot to say about the process, but I’ve had enough questions about it that I thought it was worth covering.

 

Background

I’ve written about Elastic DRS (EDRS) in VMware Cloud on AWS previously. It’s something you can’t turn off, and the Baseline policy does a good job of making sure you don’t get in hot water from a storage perspective. That said, there might be occasions where you want to remove a host or two to scale in your cluster manually. This might happen after a cluster conversion, or after a rapid scale-out event where you’ve since removed whatever workloads caused the cluster to scale out in the first place.

 

Process

The process to remove a host is documented here. Note that it is a sequential process, with one host being removed at a time. Depending on the number of hosts in your cluster, you may need to adjust your storage and fault tolerance policies as well. To start the process, go to your cloud console and select the SDDC you want to remove the hosts from. If there’s only one cluster, you can click on Remove Hosts under Actions. If there are multiple clusters in the SDDC, you’ll need to select the cluster you want to remove the host from.

You’ll then be advised that you need to understand what you’re doing (and acknowledge that), and you may be advised to change your default storage policies as well. More info on those policies is here.

Once you kick off the process, the cluster will be evaluated to ensure that removing hosts will not violate the applicable EDRS policies. VMs will be migrated off the host when it’s put into maintenance mode, and billing will be stopped for that host.
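
If you’d rather drive the scale-in via the API (handy if you’re automating cleanup after a scale-out event), the same operation is exposed through the VMC console API. Treat the below as a hedged sketch rather than gospel: the org, SDDC, and cluster IDs are placeholders, and it’s worth confirming the exact endpoint and payload against the API Explorer for your environment before running anything.

# Exchange a CSP API (refresh) token for an access token
CSP_TOKEN=$(curl -s -X POST \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "refresh_token=${REFRESH_TOKEN}" \
  "https://console.cloud.vmware.com/csp/gateway/am/api/auth/api-tokens/authorize" | jq -r '.access_token')

# Remove a single host from the nominated cluster in the SDDC (placeholder IDs)
curl -s -X POST \
  -H "csp-auth-token: ${CSP_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"num_hosts\": 1, \"cluster_id\": \"${CLUSTER_ID}\"}" \
  "https://vmc.vmware.com/vmc/api/orgs/${ORG_ID}/sddcs/${SDDC_ID}/esxs?action=remove"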

And that’s it. Pretty straightforward.

VMware Cloud Disaster Recovery – Using A Script VM

This is a quick post covering the steps required to configure a script VM for use in a recovery plan with VMware Cloud Disaster Recovery (VCDR). Why would you want to do this? You might be running a recovery for a Linux VM and you need to run a script to update the DNS settings of the VM once it’s powered on at another site. Or you might have a site-specific application that needs to be installed. Whatever. The point is that VCDR gives you the ability to do that via the Script VM. You can read the documentation on the feature here.

Firstly, you configure the Script VM as part of the Recovery Plan creation process. Specify the name of the VM and the vCenter it’s hosted on.

Under Recovery steps, click on Add Step to add a step to the recovery process.

When you add the step, you’ll want to add an action for the post-recovery phase.

You can then select “Run script on the Script VM”.

At this point you can specify the full path to the script file, keeping in mind that Windows paths look different to Linux paths. You can also set a timeout for the script.
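
To give you an idea of what such a script might contain, here’s a minimal sketch of a Linux DNS-update script. The path and DNS server addresses are made up for illustration; a Windows equivalent would live somewhere like C:\Scripts\update-dns.ps1 rather than /opt/scripts/update-dns.sh.

#!/bin/bash
# /opt/scripts/update-dns.sh - point the recovered VM at the DR site's DNS servers
# Assumes a NetworkManager-managed connection named eth0; adjust for your build.
nmcli connection modify eth0 ipv4.dns "10.10.10.53 10.10.10.54"
nmcli connection up eth0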

And that’s pretty much it. Remember that you’ll need working DNS, or, failing that, valid IP addresses for things to work.

VMware Cloud on AWS – Check TRIM/UNMAP

This is a really quick follow-up to one of my TMCHAM articles on TRIM/UNMAP on VMware Cloud on AWS. In short, a customer wanted to know whether TRIM/UNMAP had been enabled on one of their clusters, as they’d requested. The good news is it’s easy enough to find out. On your cluster, go to Configure. Under vSAN, you’ll see Services. Expand the Advanced Options section and you’ll see whether TRIM/UNMAP has been enabled for the cluster or not.

VMware Cloud Disaster Recovery – Ransomware Recovery Activation

One of the cool features of VMware Cloud Disaster Recovery (VCDR) is the Enhanced Ransomware Recovery capability. This is a quick post to talk through how to turn it on in your VCDR environment, and things you need to consider.

 

Organization Settings

The first step is to enable the ransomware services integration in your VCDR dashboard. You’ll need to be an Organisation owner to do this. Go to Settings, and click on Ransomware Recovery Services.

You’ll then have the option to select where the data analysis is performed.

You’ll also need to tick some boxes to ensure that you understand that an appliance will be deployed in each of your Recovery SDDCs, Windows VMs will get a sensor installed, and some preinstalled sensors may clash with Carbon Black.

Click on Activate and it will take a few moments. If it takes much longer than that, you’ll need to talk to someone in support.

Once the analysis integration is activated, you can then activate NSX Advanced Firewall. Page 245 of the PDF documentation covers this better than I can, but note that NSX Advanced Firewall is a chargeable service (if you don’t already have a subscription attached to your Recovery SDDC). There’s some great documentation here on what you do and don’t have access to if you allow the activation of NSX Advanced Firewall.

Like your favourite TV chef would say, here’s one I’ve prepared earlier.

Recovery Plan Configuration

Once the services integration is done, you can configure Ransomware Recovery on a per Recovery Plan basis.

Start by selecting Activate ransomware recovery. You’ll then need to acknowledge that this is a chargeable feature.

You can also choose whether you want to use integrated analysis (i.e. Carbon Black Cloud), and whether you want to manually remove other security sensors when you recover. You can also choose to use your own tools if you need to.

And that’s it from a configuration perspective. The actual recovery bit? A story for another time.

VMware Cloud Disaster Recovery – Firewall Ports

I published an article a while ago on getting started with VMware Cloud Disaster Recovery (VCDR). One thing I didn’t cover in any real depth was the connectivity requirements between on-premises and the VCDR service. VMware has worked pretty hard to ensure this is streamlined for users, but it’s still something you need to pay attention to. I was helping a client work through this process for a proof of concept recently and thought I’d cover it off more clearly here. The diagram below highlights the main components you need to look at, being:

  • The Cloud File System (frequently referred to as the SCFS);
  • The VMware Cloud DR SaaS Orchestrator (the Orchestrator); and
  • VMware Cloud DR Auto-support.

It’s important to note that the first two services are assigned IP addresses when you enable the service in the Cloud Service Console, and the Auto-support service has three public IP addresses that you need to be able to communicate with. All of this happens outbound over TCP 443. The Auto-support service is not required, but it is strongly recommended, as it makes troubleshooting issues with the service much easier, and provides VMware with an opportunity to proactively resolve cases. Network connectivity requirements are documented here.

[image courtesy of VMware]

So how do I know my firewall rules are working? The first sign that there might be a problem is that the DRaaS Connector deployment will fail to communicate with the Orchestrator at some point (usually towards the end), and you’ll see a message similar to the following: “ERROR! VMware Cloud DR authentication is not configured. Contact support.”

How can you troubleshoot the issue? Fortunately, we have a tool called the DRaaS Connector Connectivity Check CLI that you can run to check what’s not working. In this instance, we suspected an issue with outbound communication, and ran the following command on the console of the DRaaS Connector to check:

drc network test --scope cloud

This returned a status of “reachable” for the Orchestrator and Auto-support services, but the SCFS was unreachable. Some negotiations with the firewall team, and we were up and running.
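
If you want to sanity-check reachability yourself before (or after) engaging the firewall team, a quick manual test from a machine on the same network segment as the connector does the trick. The addresses below are placeholders for the Orchestrator, SCFS, and Auto-support IPs assigned to your tenancy.

# Substitute the addresses shown for your environment in the Cloud Service Console
for endpoint in 203.0.113.10 203.0.113.20 203.0.113.30; do
  nc -vz -w 5 "${endpoint}" 443   # -z: connect only, -w 5: five second timeout
done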

Note, also, that VMware supports the use of proxy servers for communicating with Auto-support services, but I don’t believe we support the use of a proxy for Orchestrator and SCFS communications. If you’re worried about VCDR using up all your bandwidth, you can throttle it. Details on how to do that can be found here. We recommend a minimum of 100Mbps, but you can go as low as 20Mbps if required.
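
As a rough guide, 20Mbps works out to about 2.5MB per second, or somewhere in the order of 200GB per day in the best case (the recommended 100Mbps gets you a little over 1TB per day), so it’s worth measuring your daily change rate before getting too enthusiastic with the throttle.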

QNAP – Expand Volume With Larger Drives

This is one of those articles I’ve been meaning to post for a while, simply because every time I do it I forget how to do it properly. One way to expand the capacity of your QNAP NAS (non-disruptively) is to replace the drives one at a time with larger capacity drives. It’s recommended that you follow this process, rather than just ripping the drives out one by one and waiting for the RAID Group to expand. It’s a simple enough process to follow, although the QNAP UI has always struck me as a little on the confusing side to navigate, so I took some pictures. Note that this was done on QNAP firmware 5.0.0.1986.

Firstly, go to Storage/Snapshots under Storage in the Control Panel. Click on Manage.

Select the Storage Pool you want to expand, and click on Manage again.

This will give you a drop-down menu. Select Replace Disks One by One.

Now select the disk you want to replace and click on Change.

Once you’ve done this for all of the disks (and it will take some time to rebuild depending on a variety of factors), click on Expand Capacity. It will ask you if you’re sure and hopefully you’ll click OK.

It’ll take a while for the RAID Group to synchronise.

You’ll notice then that, while the Storage Pool has expanded, the Volume is still the original size. Select the Volume and click on Manage.

Now you can click on Resize Volume.

The Wizard will give you information on the Storage Pool capacity and give you the option to set the new capacity of the volume. I usually click on Set to Max.

It will warn you about that. Click on OK because you like to live on the edge.

It will take a little while, but eventually your Volume will have expanded to fill the space.

 

Rubrik Basics – Multi-tenancy – Create An Organization

I covered multi-tenancy with Rubrik some time ago, but things have certainly advanced since then. One of the useful features of Rubrik CDM (and something that’s really required for Envoy to make sense) is the Organizations feature. This is the way in which you can use a combination of LDAP sources, roles, and tenant workloads to deliver a packaged multi-tenancy feature to organisations either within or external to your company. In this article I’ll run through the basics of setting up an Organization. If you’d like to see how it can be applied in a practical sense, it’s worth checking out my post on deploying Rubrik Envoy.

It starts, as these things often do, by clicking on the gear in the Rubrik CDM UI. Select Organizations (located under Access Management).

Click on Create Organization.

You’ll want to give it a name, and think about whether you want to give your tenant the ability to do per-tenant access control.

You’ll want an Org Admin Role to have particular abilities, and you might like to get fancy and add in some additional roles that will have some other capabilities.

At this point you’ll get to select which users you want in your Organization.

Hopefully you’ve added the tenant’s LDAP source to your environment already.

And it’s worth thinking about what users and / or groups you’ll be using from that LDAP source to populate your Organization’s user list.

You’ll also need to consider which role will be assigned to these users (rather than relying on Global Admins to do things for tenants).

You can then assign particular resources, including VMs, vApps, and so forth.

You can also select what SLA Domains the Organization has access to, as well as Archival locations, and replication targets and sources. This becomes important in a multi-tenanted environment as you don’t want folks putting data where they shouldn’t.

At this point you can download the Rubrik Envoy OVA, deploy it, and connect it to your Organization.

And then you’re done. Well, normally you would be, but I didn’t select a whole lot of objects in this example. Click Finish and you’re on your way.

Assuming you’ve assigned your roles correctly, when your tenant logs in, he or she will only be able to see and control resources that belong to that particular Organization.

 

Rubrik Basics – Envoy Deployment

I’ve recently been doing some work with Rubrik Envoy in the lab and thought I’d run through the basics. There’s a new document outlining the process on the articles page.

 

Why Envoy?

This page explains it better than I do, but Envoy is essentially a way for service providers to deliver Rubrik services to customers sitting on networks that are isolated from the Rubrik environment. Why would you need to do this? There are all kinds of reasons why you don’t want to give your tenants direct access to your data protection resources, and most of these revolve around security (even if your Rubrik environment is secured appropriately). As many SPs will also tell you, bringing private networks from a tenant / edge into your core is usually not a great experience either.

At a high level, it looks like this.

In this example, Tenant A sits on a private network, and the Envoy Tenant Network is 10.0.1.10. The Rubrik Routable Network on the Envoy appliance is 192.168.0.201, and the data management interface on the Rubrik cluster is 192.168.0.200. The Envoy appliance talks to tenant hosts over ports 12800 and 12801. The Rubrik cluster communicates with Envoy over ports 7500 and 7501. The only time the tenant network communicates with the Rubrik cluster is when the Envoy / Rubrik UI is used by the tenant. This is accessed over a port specified when the Organization is created (see below), and the Envoy to cluster communication is over port 443.
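
If you’re troubleshooting an Envoy deployment, a quick way to confirm those paths are open is to test the relevant ports from each side. This is just a generic reachability check against the example addresses above (assuming something like nc is available on the box you’re testing from), not a Rubrik-supplied tool.

# From the Rubrik cluster's network towards the Envoy appliance's Rubrik Routable interface
nc -vz -w 5 192.168.0.201 7500
nc -vz -w 5 192.168.0.201 7501

# From the Envoy appliance towards a tenant host running the Rubrik Backup Service
nc -vz -w 5 <tenant-host-ip> 12800
nc -vz -w 5 <tenant-host-ip> 12801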

Other Notes

Envoy isn’t a data mover in its current iteration, but rather a way for SPs to present some self-service capabilities to tenants in a controlled fashion without relying on third-party portals or network translation tools. So if you had a bunch of workloads sitting in a tenant’s environment, you’d be better served deploying Rubrik Air / Edge appliances and then replicating that data into the core. If your tenant has a vCenter environment with a few VMs, you can use the Rubrik Backup Service to back up those VMs, but you couldn’t set up vCenter as a source for the tenant unless you opened up networks between your environments by some other means and added it to your Rubrik cluster. This would be ugly at best.

Note also that the deployment assumes you’re creating an Organization in the Rubrik appliance that will be used to isolate the tenant’s data and access from other tenants in the environment. To get hold of the Envoy OVA appliance and credentials, you need to run through the Organization creation process and connect the Envoy appliance when prompted. You’ll also need to ensure that you’ve configured Roles correctly for your tenant’s environment.

If, for some reason, you need to change or view the IP configuration of the Envoy appliance, it’s important to note that the articles on the Rubrik support site are a little out of step with CentOS 7 (i.e. written for Ubuntu). I don’t know whether this is because I’m using Rubrik Air appliances in the lab, but I suspect it’s simply a shift in the underlying OS. In any case, to get IP information, you need to log in to the console and go to /etc/sysconfig/network-scripts. You’ll find a couple of files (ifcfg-eth0 and ifcfg-eth1) that will tell you whether you’ve made a boo boo with your configuration or not.
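
For reference, those files use the standard CentOS / RHEL ifcfg format, so the Rubrik Routable interface from the example above might look something like this (the interface name and values are illustrative only, and will be whatever you assigned during deployment):

# /etc/sysconfig/network-scripts/ifcfg-eth1 (example only)
DEVICE=eth1
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.0.201
NETMASK=255.255.255.0
GATEWAY=192.168.0.1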

 

Conclusion

I’m the first to admit it took a little while to understand the utility of something like Envoy. Most SPs struggle to deliver self-service capabilities for services that don’t always do network multi-tenancy very well. This is a good step in the direction of solving some of the problems associated with that. It’s also important to understand that, if your tenant has workloads sitting in VMware Cloud Director, for example, they’ll be accessing Rubrik resources in a different fashion. As I mentioned before, if there is a bit to protect on the edge site, it’s likely a better option to deploy a virtualised Rubrik appliance or a smaller cluster and replicate that data. In any case, I’ll update this post if I come across anything else useful.

Rubrik Basics – Rubrik CDM Upgrades With Polaris – Part 2

This is the second part of the super exciting article “Rubrik CDM Upgrades With Polaris”. In the first episode, I connected my Polaris tenancy to a valid Rubrik Support account so it could check for CDM upgrades. In this post, I’ll be covering the actual update process using Polaris. Hold on to your hats.

To get started, login to Polaris, click on the Gear icon, and select CDM Upgrades.

If there’s a new version of CDM available for deployment, you’ll see it listed in the dashboard. In this example, my test Edge cluster has an update available (5.3.1-p3). Happy days!

You’ll need to get this update downloaded to the cluster you want to install it on first. Click on the ellipsis and select Download.

You can then choose to download the release from Rubrik or locally.

Click on the version you want to download and click Next.

You then get the opportunity to confirm the download. Click on Confirm to do this.

It will then let you know that it’s working on it.

Once the update has downloaded, you’ll see “Ready for upgrade” on the dashboard.

Select the cluster you’d like to upgrade and click on Upgrade.

At this point, you’ll get the option to schedule the upgrade, and select whether to roll back if the upgrade fails for some reason.

Confirm the upgrade and you’ll be on your way.

Polaris lets you know that it’s working on it.

You can see the progress in the dashboard.

When it’s done, it’s done.

And that’s it. This highlights the utility of something like Polaris, particularly when you’re managing a large number of clusters and need to keep things in tip-top shape.