This a really quick follow up to one of my TMCHAM articles on TRIM/UNMAP on VMware Cloud on AWS. In short, a customer wanted to know whether TRIM/UNMAP had been enabled on one of their clusters, as they’d requested. The good news is it’s easy enough to find out. On your cluster, go to Configure. Under vSAN, you’ll see Services. Expand the Advanced Options section and you’ll see whether TRIM/UNMAP has been enabled for the cluster or not.
One of the cool features of VMware Cloud Disaster Recovery (VCDR) is the Enhanced Ransomware Recovery capability. This is a quick post to talk through how to turn it on in your VCDR environment, and things you need to consider.
The first step is to enable the ransomware services integration in your VCDR dashboard. You’ll need to be an Organisation owner to do this. Go to Settings, and click on Ransomware Recovery Services.
You’ll then have the option to select where the data analysis is performed.
You’ll also need to tick some boxes to ensure that you understand that an appliance will be deployed in each of your Recovery SDDCs, Windows VMs will get a sensor installed, and some preinstalled sensors may clash with Carbon Black.
Click on Activate and it will take a few moments. If it takes much longer than that, you’ll need to talk to someone in support.
Once the analysis integration is activated, you can then activate NSX Advanced Firewall. Page 245 of the PDF documentation covers this better than I can, but note that NSX Advanced Firewall is a chargeable service (if you don’t already have a subscription attached to your Recovery SDDC). There’s some great documentation here on what you do and don’t have access to if you allow the activation of NSX Advanced Firewall.
Like your favourite TV chef would say, here’s one I’ve prepared earlier.
Recovery Plan Configuration
Once the services integration is done, you can configure Ransomware Recovery on a per Recovery Plan basis.
Start by selecting Activate ransomware recovery. You’ll then need to acknowledge that this is a chargeable feature.
You can also choose whether you want to use integrated analysis (i.e. Carbon Black Cloud), and if you want to manually remove other security sensors when you recover. You can, also, choose to use your own tools if you need to.
And that’s it from a configuration perspective. The actual recovery bit? A story for another time.
I published an article a while ago on getting started with VMware Cloud Disaster Recovery (VCDR). One thing I didn’t cover in any real depth was the connectivity requirements between on-premises and the VCDR service. VMware has worked pretty hard to ensure this is streamlined for users, but it’s still something you need to pay attention to. I was helping a client work through this process for a proof of concept recently and thought I’d cover it off more clearly here. The diagram below highlights the main components you need to look at, being:
- The Cloud File System (frequently referred to as the SCFS)
- The VMware Cloud DR SaaS Orchestrator (the Orchestrator); and
- VMware Cloud DR Auto-support.
It’s important to note that the first two services are assigned IP addresses when you enable the service in the Cloud Service Console, and the Auto-support service has three public IP addresses that you need to be able to communicate with. All of this happens outbound over TCP 443. The Auto-support service is not required, but it is strongly recommended, as it makes troubleshooting issues with the service much easier, and provides VMware with an opportunity to proactively resolve cases. Network connectivity requirements are documented here.
[image courtesy of VMware]
So how do I know my firewall rules are working? The first sign that there might be a problem is that the DRaaS Connector deployment will fail to communicate with the Orchestrator at some point (usually towards the end), and you’ll see a message similar to the following. “ERROR! VMware Cloud DR authentication is not configured. Contact support.”
How can you troubleshoot the issue? Fortunately, we have a tool called the DRaaS Connector Connectivity Check CLI that you can run to check what’s not working. In this instance, we suspected an issue with outbound communication, and ran the following command on the console of the DRaaS Connector to check:
drc network test --scope cloud
This returned a status of “reachable” for the Orchestrator and Auto-support services, but the SCFS was unreachable. Some negotiations with the firewall team, and we were up and running.
Note, also, that VMware supports the use of proxy servers for communicating with Auto-support services, but I don’t believe we support the use of a proxy for Orchestrator and SCFS communications. If you’re worried about VCDR using up all your bandwidth, you can throttle it. Details on how to do that can be found here. We recommend a minimum of 100Mbps, but you can go as low as 20Mbps if required.
VMware recently announced that VMware Cloud on AWS is now available in the AWS Asia-Pacific (Melbourne) Region. I thought I’d share some brief thoughts here along with a video I did with my colleague Satya.
VMware Cloud on AWS is now available to consume in three Availability Zones (apse4-az1, apse4-az2, apse4-az3) in the Melbourne Region. From a host type – you have the option to deploy either I3en.metal or I4i.metal hosts. There is also support for stretched clusters and PCI-DSS compliance if required. The full list of VMware Cloud on AWS Regions and Availability Zones is here.
Why Is This Interesting?
Since the launch of VMware Cloud on AWS, customers have only had one choice when it comes to a Region – Sydney. This announcement gives organisations the ability to deploy architectures that can benefit from both increased availability and resiliency by leveraging multi-regional capabilities.
VMware Cloud on AWS already offers platform availability at a number of levels, including a choice of Availability Zones, Partition Placement groups, and support for stretched clusters across two Availability Zones. There’s also support for VMware High Availability, as well as support for automatically remediating failed hosts.
In addition to the availability options customers can take advantage of, VMware Cloud on AWS also provides support for a number of resilience solutions, including VMware Cloud Disaster Recovery (VCDR) and VMware Site Recovery. Previously, customers in Australia and New Zealand were able to leverage these VMware (or third-party) solutions and deploy them across multiple Availability Zones. Invariably, it would look like the below diagram, with workloads hosted in one Availability Zone, and a second Availability Zone being used as the recovery location for those production workloads.
With the introduction of a second Region in A/NZ, customers can now look to deploy resilience solutions that are more like this diagram:
In this example, they can choose to run production workloads in the Melbourne Region and recover workloads into the Sydney Region if something goes pear-shaped. Note that VCDR is not currently available to deploy in the Melbourne Region, although it’s expected to be made available before the end of 2023.
Why Else Should I Care?
There are a variety of legal, regulatory, and administrative obligations governing the access, use, security and preservation of information within various government and commercial organisations in Victoria. These regulations are both national and state-based, and in the case of the Melbourne Region, provide organisations in Victoria the opportunity to store data in VMware Cloud on AWS that may not otherwise have been possible.
Not all applications and data reside in the same location. Many organisations have a mix of workloads residing on-premises and in the cloud. Some of these applications are latency-sensitive, and the launch of the Melbourne Region provides organisations with the ability to host applications closer to that data, as well as accessing native AWS services with improved responsiveness over applications hosted in the Sydney Region.
If you’re an existing VMware Cloud on AWS customer, head over to https://cloud.vmware.com. Login to the Cloud Services Console. Click on the VMware Cloud on AWS tile. Click on Inventory. Then click on Create SDDC.
Some of the folks in the US and Europe are probably wondering why on earth this is such a big deal for the Australian and New Zealand market. And plenty of folks in this part of the world are probably not that interested either. Not every organisation is going to benefit from or look to take advantage of the Melbourne Region. Many of them will continue to deploy workloads into one or two of the Sydney-based Availability Zones, with DR in another Availability Zone, and not need to do any more. But for those organisations looking for resiliency across geographical regions, this is a great opportunity to really do some interesting stuff from a disaster recovery perspective. And while it seems delightfully antiquated to think that, in this global world we live in, some information can’t cross state lines, there are plenty of organisations in Victoria facing just that issue, and looking at ways to store that data in a sensible fashion close to home. Finally, we talk a lot about data having gravity, and this provides many organisations in Victoria with the ability to run workloads closer to that centre of data gravity.
If you’d like to hear me talking about this with my learned colleague Satya, you can check out the video here. Thanks to Satya for prompting me to do the recording, and for putting it all together. We’re aiming to do this more regularly on a variety of VMware-related topics, so keep an eye out.
In this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to cover Managed Storage Policy Profiles (MSPPs) on the VMware-managed VMware Cloud on AWS platform.
VMware Cloud on AWS has MSPPs deployed on clusters to ensure that customers have sufficient resilience built into the cluster to withstand disk or node failures. By default, clusters are configured with RAID 1, Failures to Tolerate (FTT):1 for 2 – 5 nodes, and RAID 6, FTT:2 for clusters with 6 or more nodes. Note that single-node clusters have no Service Level Agreement (SLA) attached to them, as you generally only run those on a trial basis, and if the node fails, there’s nowhere for the data to go. You can read more about vSAN Storage Polices and MSPPs here, and there’s a great Tech Zone article here. The point of these policies is that they are designed to ensure your cluster(s) remain in compliance with the SLAs for the platform. You can view the policies in your environment by going to Policies and Profiles in vCenter and selecting VM Storage Policies.
Can I Change Them?
The MSPPs are maintained by VMware, and so it’s not a great idea to change the default policies on your cluster, as the system will change them back at some stage. And why would you want to change the policies on your cluster? Well, you might decide that 4 or 5 nodes could actually run better (from a capacity perspective) using RAID 5, rather than RAID 1. This is a reasonable thing to want to do, and as the SLA talks about FTT numbers, not RAID types, you can change the RAID type and remain in compliance. And the capacity difference can be material in some cases, particularly if you’re struggling to fit your workloads onto a smaller node count.
So How Do I Do It Then?
Clone The Policy
There are a few ways to approach this, but the simplest is by cloning an existing policy. In this example, I’ll clone the vSAN Default Storage Policy. In the VMware Cloud on AWS, there is an MSPP assigned to each cluster with the name “VMC Workload Storage Policy – ClusterName“. Select the policy you want to clone and then click on Clone.
The first step is to give the VM Storage Policy a name. Something cool with your initials should do the trick.
You can edit the policy structure at this point, or just click Next.
Here you can configure your Availability options. You can also do other things, like configure Tags and Advanced Policy Rules.
Once this is configured, the system will check that your vSAN datastore are compatible with your policy.
And then you’re ready to go. Click Finish, make yourself a beverage, bask in the glory of it all.
Apply The Policy
So you have a fresh new policy, now what? You can choose to apply it to your workload datastore, or apply it to specific Virtual Machines. To apply it to your datastore, select the datastore you want to modify, click on General, then click on Edit next to the Default Storage Policy option. The process to apply the policy to VMs is outlined here. Note that if you create a non-compliant policy and apply it to your datastore, you’ll get hassled about it and you should likely consider changing your approach.
The thing about managed platforms is that the service provider is on the hook for architecture decisions that reduce the resilience of the platform. And the provider is trying to keep the platform running within the parameters of the SLA. This is why you’ll come across configuration items in VMware Cloud on AWS that you either can’t change, or have some default options that seem conservative. Many of these decisions have been made with the SLAs and the various use cases in mind for the platform. That said, it doesn’t mean there’s no flexibility here. If you need a little more capacity, particularly in smaller environments, there are still options available that won’t reduce the platform’s resilience, while still providing additional capacity options.
In this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to delve into the topic of cluster conversions on the VMware-managed VMware Cloud on AWS platform.
With the end of sale announcement of the I3.metal node type in VMware Cloud on AWS, I’ve had a few customers ask about how the cluster conversion process works. We’ve previously offered the ability to convert nodes from I3.metal to I3en.metal, and we’ve taken that process and made it possible for the I4i.metal node type as well. The process is outlined in some detail here. From a technical perspective, you’ll need to be on SDDC version 1.18v8 or 1.20v2 at a minimum. From a commercial perspective, to use your existing subscriptions, they’ll need to be flexible, or you can choose to add new subscriptions. Your account team can help with that.
Sounds Easy, What’s the Catch?
I’ve had a few customers run through this process now in my part of the world, and more and more folks are converting across to I4i.metal every week. One of the key considerations when planning the conversion, particularly with smaller environments, is sizing and storage policies. When the team converts your cluster, they will do some sizing estimates prior to the activity, and the results of this sizing might be higher than you’d expect. For example, we talk about the I4i.metal being something in the order of 1.6 – 2 times as powerful as the I3.metal node. But this really depends on a variety of factors, including the vSAN RAID policy in use, the types of workloads running on the cluster, and so forth. I’ve seen scenarios where a customer has wanted to convert a 6-node I3.metal cluster to 4 I4i.metal nodes. From a calculated capacity perspective, this should be a no-brainer. But what you’ll find, when working with the conversion team, is that they will likely come back to you saying that 6 nodes will be the target. The reason for this is that they’re assuming your cluster is running RAID 6.
How do you solve this problem? Think about the vSAN policy you want to run moving forward. If you’re happy to drop to RAID 5, for example, you have a way forward. Once the cluster conversion is complete, jump on and change the default policy to RAID 5 / FTT:1. This will cause vSAN to modify the policy for all of the VMs on the cluster. This is a background process, and won’t interfere with normal operations. Once you’ve done that, you can then remove the additional nodes. It’s a little fiddly, and will require some amount of coordination with the conversion team and your account team, but it’s a fairly simple task, and will get you running on new shiny boxes without having to muck about with setting up another cluster (or SDDC) and manually migrating workloads across.
You’ll want to ensure that changing your RAID policy won’t have an impact on your available storage. Every workload is different, but at a high level, you can use the public sizer to work through some of these numbers. A 16-node I3.metal cluster with RAID 6 configured will give you roughly 165.89 TiB of useable capacity (ignoring management workload overheads and vSAN slack space), and a similar storage footprint can be had with a 8 or 9-node cluster of I4i.metal nodes. You’ll also want to be sure your organisation is comfortable with the vSAN policy you’re moving to. If you’re moving from 16 nodes to 8 or 9 nodes, for example, this isn’t really a problem, as you’ll likely be sticking with RAID 6 for clusters that large. But if you’re going from 6 nodes to 3 nodes, you’re going from RAID 6 to RAID 1.
Thoughts and Further Reading
The neat thing about the VMware Cloud on AWS offering is that it’s a managed service from VMware, and we do a good job of managing boring stuff like this for you, reducing the impact of software and hardware changes by leveraging core VMware technologies that aren’t otherwise available on native cloud platforms. If you’d like to read more about the I4i.metal node – check out our FAQ here.
In this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to delve into some questions around recent(ish) changes to Elastic DRS policies and capacity on the VMware-managed VMware Cloud on AWS platform.
I’ve had a few customers ask about changes VMware has made to Elastic DRS policies on VMware Cloud on AWS. I’ve talked a little about eDRS previously, and the release notes cover the changes here (go to March 27th, 2023). In short the changes are as follows:
- Elastic DRS optimize for rapid scaling policy now supports rapid scaling-in to enable faster scaling use cases like VDI, disaster recovery or any other business needs.
- The Elastic DRS Cost Policy improvement will allow automated scale-in of a cluster if the storage utilization falls below 40% instead of the current 20% limit.
What does it mean from a practical perspective? Not a lot for customers using the default baseline policy. But if you’re using “Optimize for Lower Cost” or “Rapid Scaling”, it might be worth looking into.
Optimize for Lowest Cost
The documentation does a great job of describing how this works: “When scaling in, this policy removes hosts quickly to maintain baseline performance while keeping host counts to a practical minimum. It removes hosts only if it anticipates that storage utilization would not result in a scale out in the near term after host removal”. It has the following thresholds:
|Old High||Old Low||New High||New Low|
|Storage||70%||20%||80% (this changed a while ago)||40%|
You’ll see that the new low has 40% as the threshold for storage now (I added in the change from 70 – 80% as well, but this was done a while ago). Generally speaking, the algorithm is designed not to do silly things, but we’ve added in this number to enable customers to scale in workloads sooner, helping to reduce the cost of scaling events.
From the documentation: “[t]his policy adds multiple hosts at a time when needed for memory or CPU, and adds hosts incrementally when needed for storage. By default, hosts are added four at a time. You can specify a larger scale-out increment (8 or 12) if you need faster scaling for disaster recovery, Virtual Desktop Infrastructure (VDI), and similar use cases. As with any EDRS policy, scale-out time increases with increment size. When the increment is large (12 hosts), it can take up to 40 minutes to complete in some configurations.
When scaling in, this policy removes hosts rapidly, maintaining baseline performance while keeping host count to a practical minimum. It does not remove hosts if it anticipates that doing so would degrade performance and force a near-term scale-out. Scale-in stops when the cluster reaches the minimum host count or the number of hosts in the scale-out increment has been removed”. This policy has the following thresholds:
|Old High||Old Low||New High||New Low|
What does that mean? We’ve added in some guardrails for rapid scale-in to ensure that things don’t get too hectic too quickly. And on the flip side, it means that you’ll scale out your environment faster as well. Again, this is useful for bursty workloads such as VDI or, potentially, rapid DR.
Elastic DRS is one of the cooler features of VMware Cloud on AWS. You can do some really interesting things from a scaling perspective, particularly if you’re operating with some volatile / bursty workloads. That said, if you only use the default baseline policy you’ll also likely be in a good spot, as the thing that can really hurt in these kinds of environments is when your hosts run short of storage.
I recently had the opportunity to run through a VMware Cloud on Disaster Recovery deployment with a customer and thought I’d run through the basics. It’s important to note that there a variety of topologies supported with VCDR, and many things that need to be considered before you click deploy, and this is just one way of doing it. In any case, there’s a new document outlining the process on the articles page.
In this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to delve into some questions around TRIM/UNMAP and capacity reclamation on the VMware-managed VMware Cloud on AWS platform.
TRIM/UNMAP, in short, is the capability for operating systems to reclaim no longer used space on thin-provisioned filesystems. Why is this important? Imagine you have a thin-provisioned volume that has 100GB of capacity allocated to it. It consumes maybe 1GB when it’s first deployed. You then add 50GB of data to it. You then delete 50GB of data from the volume. You’ll still see 51GB of capacity being consumed on the filesystem. This is because older operating systems just mark the blocks as deleted, but don’t zero them out. Modern operating systems do support TRIM/UNMAP though, but the hypervisor needs to understand the commands being sent to it. You can read more on that here.
How I Do This For VMware Cloud on AWS?
You can contact your account team, and we raise a ticket to get the feature enabled. We had some minor issues recently that meant we weren’t enabling the feature, but if you’re running M16v12 or M18v5 (or above) on your SDDCs, you should be good to go. Note that this feature is enabled on a per-cluster basis, and you need to reboot the VMs in the cluster for it to take effect.
What About Migrating With HCX?
Do the VMs come across thin? Do you need to reclaim space first? If you’re using HCX to go from thick to thin, you should be fine. If you’re migrating thin to thin, it’s worth checking whether you’ve got any space reclamation in place on your source side. I’ve had customers report back that some environments have migrated across with higher than expected storage usage due to a lack of space reclamation happening on the source storage environment. You can use something like Live Optics to report on your capacity consumed vs allocated, and how much capacity can be reclaimed.
Why Isn’t This Enabled By Default?
I don’t know for sure, but I imagine it has something to do with the fact that TRIM/UNMAP has the potential to have a performance impact from a latency perspective, depending on the workloads running in the environment, and the amount of capacity being reclaimed at any given time. We recommend that you “schedule large space reclamation jobs during off-peak hours to reduce any potential impact”. Given that VMware Cloud on AWS is a fully-managed service, I imagine we want to control as many of the performance variables as possible to ensure our customers enjoy a reliable and stable platform. That said, TRIM/UNMAP is a really useful feature, and you should look at getting it enabled if you’re concerned about the potential for wasted capacity in your SDDC.
At VMware Explore 2022 in the US, VMware announced a number of new offerings for VMware Cloud on AWS, including a new bare-metal instance type: the I4i.metal. You can read the official blog post here. I thought it would be useful to provide some high-level details and cover some of the caveats that punters should be aware of.
By The Numbers
- The CPU is 3rd generation Intel Xeon Ice Lake @ 2.4GHz / Turbo 3.5GHz
- 64 physical cores, supporting 128 logical cores with Hyper Threading (HT)
- 1024 GiB memory
- 30 TiB NVMe (Raw local capacity)
- Up to 75 Gbps networking speed
- US East (N. Virginia) – use1-az1, use1-az2, use1-az4, use1-az5, use1-az6
- US West (Oregon) – usw2-az1, usw2-az2, usw2-az3, usw2-az4
- US West (N. California) – usw1-az1, usw1-az3
- US East (Ohio) – use2-az1, use2-az2, use2-az3
- Canada (Central) – cac1-az1, cac1-az2
- Europe (Ireland) – euw1-az1, euw1-az2, euw1-az3
- Europe (London) – euw2-az1, euw2-az2, euw2-az3
- Europe (Frankfurt) – euc1-az1, euc1-az2, euc1-az3
- Europe (Paris) – euw3-az1, euw3-az2, euw3-az3
- Asia Pacific (Singapore) – apse1-az1, apse1-az2, apse1-az3
- Asia Pacific (Sydney) – apse2-az1, apse2-az2, apse2-az3
- Asia Pacific (Tokyo) – apne1-az1, apne1-az2, apne1-az4
Other Regions will have availability over the coming months.
The i3.metal isn’t going anywhere, but it’s nice to have an option that supports more cores and it a bit more storage and RAM. The I4i.metal is great for SQL workloads and VDI deployments where core count can really make a difference. Coupled with the addition of supplemental storage via VMware Cloud Flex Storage and Amazon FSx for NetApp ONTAP, there are some great options available to deal with the variety of workloads customers are looking to deploy on VMware Cloud on AWS.
On another note, if you want to hear more about all the cloudy news from VMware Explore US, I’ll be presenting at the Brisbane VMUG meeting on October 12th, and my colleague Ray will be doing something in Sydney on October 19th. If you’re in the area, come along.