VMware Cloud on AWS – TMCHAM – Part 10 – Cluster Conversion

In this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to delve into the topic of cluster conversions on the VMware-managed VMware Cloud on AWS platform.

 

Background

With the end of sale announcement of the I3.metal node type in VMware Cloud on AWS, I’ve had a few customers ask about how the cluster conversion process works. We’ve previously offered the ability to convert nodes from I3.metal to I3en.metal, and we’ve taken that process and made it possible for the I4i.metal node type as well. The process is outlined in some detail here. From a technical perspective, you’ll need to be on SDDC version 1.18v8 or 1.20v2 at a minimum. From a commercial perspective, to use your existing subscriptions, they’ll need to be flexible, or you can choose to add new subscriptions. Your account team can help with that.

 

Sounds Easy, What’s the Catch?

I’ve had a few customers run through this process now in my part of the world, and more and more folks are converting across to I4i.metal every week. One of the key considerations when planning the conversion, particularly with smaller environments, is sizing and storage policies. When the team converts your cluster, they will do some sizing estimates prior to the activity, and the results of this sizing might be higher than you’d expect. For example, we talk about the I4i.metal being something in the order of 1.6 – 2 times as powerful as the I3.metal node. But this really depends on a variety of factors, including the vSAN RAID policy in use, the types of workloads running on the cluster, and so forth. I’ve seen scenarios where a customer has wanted to convert a 6-node I3.metal cluster to 4 I4i.metal nodes. From a calculated capacity perspective, this should be a no-brainer. But what you’ll find, when working with the conversion team, is that they will likely come back to you saying that 6 nodes will be the target. The reason for this is that they’re assuming your cluster is running RAID 6.

How do you solve this problem? Think about the vSAN policy you want to run moving forward. If you’re happy to drop to RAID 5, for example, you have a way forward. Once the cluster conversion is complete, jump on and change the default policy to RAID 5 / FTT:1. This will cause vSAN to modify the policy for all of the VMs on the cluster. This is a background process, and won’t interfere with normal operations. Once you’ve done that, you can then remove the additional nodes. It’s a little fiddly, and will require some amount of coordination with the conversion team and your account team, but it’s a fairly simple task, and will get you running on new shiny boxes without having to muck about with setting up another cluster (or SDDC) and manually migrating workloads across.

You’ll want to ensure that changing your RAID policy won’t have an impact on your available storage. Every workload is different, but at a high level, you can use the public sizer to work through some of these numbers. A 16-node I3.metal cluster with RAID 6 configured will give you roughly 165.89 TiB of useable capacity (ignoring management workload overheads and vSAN slack space), and a similar storage footprint can be had with an 8 or 9-node cluster of I4i.metal nodes. You’ll also want to be sure your organisation is comfortable with the vSAN policy you’re moving to. If you’re moving from 16 nodes to 8 or 9 nodes, for example, this isn’t really a problem, as you’ll likely be sticking with RAID 6 for clusters that large. But if you’re going from 6 nodes to 3 nodes, you’re going from RAID 6 to RAID 1, as a 3-node cluster doesn’t have enough hosts to satisfy RAID 5 or RAID 6.
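As a rough way to reason about the trade-off described above, here's a minimal sketch that estimates useable capacity for a given node count and vSAN policy. The overhead multipliers and minimum host counts are the commonly documented vSAN values, treated here as assumptions; the per-node capacity is a made-up placeholder, and the function names are hypothetical. It ignores slack space and management overheads, so the public sizer remains the place to get real numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VsanPolicy:
    name: str
    overhead: float   # raw capacity consumed per unit of useable capacity
    min_hosts: int    # smallest cluster that can satisfy the policy


# Commonly documented vSAN values; treat them as assumptions for this sketch.
POLICIES = {
    "RAID1_FTT1": VsanPolicy("RAID 1 / FTT:1", 2.0, 3),
    "RAID5_FTT1": VsanPolicy("RAID 5 / FTT:1", 4 / 3, 4),
    "RAID6_FTT2": VsanPolicy("RAID 6 / FTT:2", 1.5, 6),
}


def usable_tib(nodes: int, raw_tib_per_node: float, policy_key: str) -> float:
    """Estimate useable TiB, ignoring slack space and management overheads."""
    policy = POLICIES[policy_key]
    if nodes < policy.min_hosts:
        raise ValueError(
            f"{policy.name} needs at least {policy.min_hosts} hosts, got {nodes}"
        )
    return nodes * raw_tib_per_node / policy.overhead


def policies_for(nodes: int) -> list[str]:
    """List the policies a cluster of this size can support."""
    return [key for key, p in POLICIES.items() if nodes >= p.min_hosts]


if __name__ == "__main__":
    RAW_PER_NODE = 20.0  # hypothetical placeholder, not a published figure
    for n in (3, 4, 6):
        for key in policies_for(n):
            print(n, POLICIES[key].name, round(usable_tib(n, RAW_PER_NODE, key), 2))
```

The point of the host-count check is the one made above: dropping to a smaller cluster can quietly rule out the policy you were planning to keep.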

 

Thoughts and Further Reading

The neat thing about the VMware Cloud on AWS offering is that it’s a managed service from VMware, and we do a good job of managing boring stuff like this for you, reducing the impact of software and hardware changes by leveraging core VMware technologies that aren’t otherwise available on native cloud platforms. If you’d like to read more about the I4i.metal node – check out our FAQ here.

VMware Cloud on AWS – TMCHAM – Part 9 – Elastic DRS Policy Changes

In this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to delve into some questions around recent(ish) changes to Elastic DRS policies and capacity on the VMware-managed VMware Cloud on AWS platform.

I’ve had a few customers ask about changes VMware has made to Elastic DRS policies on VMware Cloud on AWS. I’ve talked a little about eDRS previously, and the release notes cover the changes here (go to March 27th, 2023). In short the changes are as follows:

  • Elastic DRS optimize for rapid scaling policy now supports rapid scaling-in to enable faster scaling use cases like VDI, disaster recovery or any other business needs.
  • The Elastic DRS Cost Policy improvement will allow automated scale-in of a cluster if the storage utilization falls below 40% instead of the current 20% limit.

What does it mean from a practical perspective? Not a lot for customers using the default baseline policy. But if you’re using “Optimize for Lowest Cost” or “Rapid Scaling”, it might be worth looking into.

 

Huh?

Optimize for Lowest Cost

The documentation does a great job of describing how this works: “When scaling in, this policy removes hosts quickly to maintain baseline performance while keeping host counts to a practical minimum. It removes hosts only if it anticipates that storage utilization would not result in a scale out in the near term after host removal”. It has the following thresholds:

Resource   Old High   Old Low   New High                         New Low
CPU        90%        60%       90%                              60%
Memory     80%        60%       80%                              60%
Storage    70%        20%       80% (this changed a while ago)   40%

You’ll see that the new low has 40% as the threshold for storage now (I added in the change from 70 – 80% as well, but this was done a while ago). Generally speaking, the algorithm is designed not to do silly things, but we’ve added in this number to enable customers to scale in workloads sooner, helping to reduce the cost of scaling events.
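To make the effect of the new storage threshold a little more concrete, here's a small sketch of how a threshold check like this might be evaluated. It is not the actual Elastic DRS algorithm. It assumes that any resource above its high mark triggers a scale-out, and that every resource needs to be below its low mark before a scale-in is considered. The names (Threshold, recommend) are hypothetical; the percentages come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    high: float  # scale out above this utilisation (%)
    low: float   # candidate for scale-in below this utilisation (%)


# Values from the table above for Optimize for Lowest Cost.
OLD_LOWEST_COST = {
    "cpu": Threshold(high=90, low=60),
    "memory": Threshold(high=80, low=60),
    "storage": Threshold(high=70, low=20),
}
NEW_LOWEST_COST = {
    "cpu": Threshold(high=90, low=60),
    "memory": Threshold(high=80, low=60),
    "storage": Threshold(high=80, low=40),
}


def recommend(utilisation: dict[str, float], thresholds: dict[str, Threshold]) -> str:
    """Return 'scale-out', 'scale-in' or 'none' for a single evaluation.

    Assumption: one hot resource is enough to scale out, but all
    resources must be quiet before scaling in.
    """
    if any(utilisation[r] > t.high for r, t in thresholds.items()):
        return "scale-out"
    if all(utilisation[r] < t.low for r, t in thresholds.items()):
        return "scale-in"
    return "none"


if __name__ == "__main__":
    sample = {"cpu": 40, "memory": 50, "storage": 35}
    print("old:", recommend(sample, OLD_LOWEST_COST))
    print("new:", recommend(sample, NEW_LOWEST_COST))
```

With storage sitting at 35%, the old 20% low mark keeps the extra host around, while the new 40% mark makes the cluster a scale-in candidate.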

Rapid Scaling

From the documentation: “[t]his policy adds multiple hosts at a time when needed for memory or CPU, and adds hosts incrementally when needed for storage. By default, hosts are added four at a time. You can specify a larger scale-out increment (8 or 12) if you need faster scaling for disaster recovery, Virtual Desktop Infrastructure (VDI), and similar use cases. As with any EDRS policy, scale-out time increases with increment size. When the increment is large (12 hosts), it can take up to 40 minutes to complete in some configurations.

When scaling in, this policy removes hosts rapidly, maintaining baseline performance while keeping host count to a practical minimum. It does not remove hosts if it anticipates that doing so would degrade performance and force a near-term scale-out. Scale-in stops when the cluster reaches the minimum host count or the number of hosts in the scale-out increment has been removed”. This policy has the following thresholds:

Resource   Old High   Old Low   New High   New Low
CPU        80%        0%        80%        50%
Memory     80%        0%        80%        50%
Storage    70%        0%        80%        40%

What does that mean? We’ve added in some guardrails for rapid scale-in to ensure that things don’t get too hectic too quickly. And on the flip side, it means that you’ll scale out your environment faster as well. Again, this is useful for bursty workloads such as VDI or, potentially, rapid DR.
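The increment behaviour quoted from the documentation can be sketched roughly as follows. This is an illustration only: the function names are hypothetical, and reading "incrementally" as one host at a time for storage is my assumption rather than something the documentation spells out.

```python
def scale_out_target(current: int, increment: int, max_hosts: int, reason: str) -> int:
    """Host count after one scale-out event.

    CPU or memory pressure adds a full increment (4 by default, or 8 or 12).
    Storage pressure is assumed to add a single host at a time.
    """
    if increment not in (4, 8, 12):
        raise ValueError("increment must be 4, 8 or 12")
    step = increment if reason in ("cpu", "memory") else 1
    return min(current + step, max_hosts)


def scale_in_target(current: int, increment: int, min_hosts: int) -> int:
    """Lowest host count a single scale-in run can reach.

    Scale-in stops at the minimum host count, or once a full
    increment's worth of hosts has been removed.
    """
    return max(current - increment, min_hosts)


if __name__ == "__main__":
    print(scale_out_target(6, 4, 16, "cpu"))      # full increment
    print(scale_out_target(6, 4, 16, "storage"))  # single host
    print(scale_in_target(10, 4, 3))              # bounded by the increment
    print(scale_in_target(5, 4, 3))               # bounded by the minimum
```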

 

Thoughts

Elastic DRS is one of the cooler features of VMware Cloud on AWS. You can do some really interesting things from a scaling perspective, particularly if you’re operating with some volatile / bursty workloads. That said, if you only use the default baseline policy you’ll also likely be in a good spot, as the thing that can really hurt in these kinds of environments is when your hosts run short of storage.

Updated Articles Page

I recently had the opportunity to run through a VMware Cloud Disaster Recovery deployment with a customer and thought I’d run through the basics. It’s important to note that there are a variety of topologies supported with VCDR, and many things that need to be considered before you click deploy, and this is just one way of doing it. In any case, there’s a new document outlining the process on the articles page.

Random Short Take #82

Happy New Year (to those who celebrate). Let’s get random.

VMware Cloud on AWS – TMCHAM – Part 8 – TRIM/UNMAP

In this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to delve into some questions around TRIM/UNMAP and capacity reclamation on the VMware-managed VMware Cloud on AWS platform.

 

Why TRIM/UNMAP?

TRIM/UNMAP, in short, is the capability for operating systems to reclaim no longer used space on thin-provisioned storage. Why is this important? Imagine you have a thin-provisioned volume that has 100GB of capacity allocated to it. It consumes maybe 1GB when it’s first deployed. You then add 50GB of data to it. You then delete 50GB of data from the volume. You’ll still see 51GB of capacity being consumed on the underlying datastore. This is because older operating systems just mark the blocks as deleted in the guest filesystem, but don’t tell the storage underneath that the space can be released. Modern operating systems do support TRIM/UNMAP, but the hypervisor needs to understand the commands being sent to it. You can read more on that here.
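The 100GB example above can be played out with a toy model. This is purely illustrative: ThinVolume is a hypothetical class, not anything from vSphere or vSAN, and it assumes freed blocks are reused before the backend grows again.

```python
class ThinVolume:
    """Toy model of a thin-provisioned volume, with sizes in GB."""

    def __init__(self, provisioned_gb: int, initial_gb: int = 0):
        self.provisioned_gb = provisioned_gb
        self.guest_used_gb = initial_gb      # what the guest reports as used
        self.backend_used_gb = initial_gb    # what the datastore has allocated

    def write(self, gb: int) -> None:
        if self.guest_used_gb + gb > self.provisioned_gb:
            raise ValueError("volume is full")
        self.guest_used_gb += gb
        # The backend only grows once previously freed blocks have been reused.
        self.backend_used_gb = max(self.backend_used_gb, self.guest_used_gb)

    def delete(self, gb: int) -> None:
        # Without TRIM/UNMAP the backend is never told the blocks are free.
        self.guest_used_gb -= min(gb, self.guest_used_gb)

    def unmap(self) -> int:
        """Release freed blocks to the datastore and return the GB reclaimed."""
        reclaimed = self.backend_used_gb - self.guest_used_gb
        self.backend_used_gb = self.guest_used_gb
        return reclaimed


if __name__ == "__main__":
    vol = ThinVolume(provisioned_gb=100, initial_gb=1)
    vol.write(50)
    vol.delete(50)
    print(vol.guest_used_gb, vol.backend_used_gb)  # guest view vs datastore view
    print(vol.unmap(), vol.backend_used_gb)        # after space reclamation
```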

How Do I Do This For VMware Cloud on AWS?

You can contact your account team, and we raise a ticket to get the feature enabled. We had some minor issues recently that meant we weren’t enabling the feature, but if you’re running M16v12 or M18v5 (or above) on your SDDCs, you should be good to go. Note that this feature is enabled on a per-cluster basis, and you need to reboot the VMs in the cluster for it to take effect.

What About Migrating With HCX?

Do the VMs come across thin? Do you need to reclaim space first? If you’re using HCX to go from thick to thin, you should be fine. If you’re migrating thin to thin, it’s worth checking whether you’ve got any space reclamation in place on your source side. I’ve had customers report back that some environments have migrated across with higher than expected storage usage due to a lack of space reclamation happening on the source storage environment. You can use something like Live Optics to report on your capacity consumed vs allocated, and how much capacity can be reclaimed.

Why Isn’t This Enabled By Default?

I don’t know for sure, but I imagine it has something to do with the fact that TRIM/UNMAP has the potential to have a performance impact from a latency perspective, depending on the workloads running in the environment, and the amount of capacity being reclaimed at any given time. We recommend that you “schedule large space reclamation jobs during off-peak hours to reduce any potential impact”. Given that VMware Cloud on AWS is a fully-managed service, I imagine we want to control as many of the performance variables as possible to ensure our customers enjoy a reliable and stable platform. That said, TRIM/UNMAP is a really useful feature, and you should look at getting it enabled if you’re concerned about the potential for wasted capacity in your SDDC.

Random Short Take #78

Welcome to Random Short Take #78. We’re hurtling towards the silly season. Let’s get random.

VMware Cloud on AWS – I4i.metal – A Few Notes …

At VMware Explore 2022 in the US, VMware announced a number of new offerings for VMware Cloud on AWS, including a new bare-metal instance type: the I4i.metal. You can read the official blog post here. I thought it would be useful to provide some high-level details and cover some of the caveats that punters should be aware of.

 

By The Numbers

What do you get from a specifications perspective?

  • The CPU is 3rd generation Intel Xeon Ice Lake @ 2.4GHz / Turbo 3.5GHz
  • 64 physical cores, supporting 128 logical cores with Hyper Threading (HT)
  • 1024 GiB memory
  • 30 TiB NVMe (Raw local capacity)
  • Up to 75 Gbps networking speed

So, how does the I4i.metal compare with the i3.metal? You get roughly 2x compute, storage, and memory, with improved network speed as well.

FAQ Highlights

Can I use custom core counts? Yep, the I4i will support physical custom core counts of 8, 16, 24, 30, 36, 48, 64.

Is there stretched cluster support? Yes, you can deploy these in stretched clusters (of the same host type).

Can I do in-cluster conversions? Yes, read more about that here.

Other Considerations

Why does the sizer say 20 TiB useable for the I4i? Around 7 TiB is consumed by the cache tier at the moment, so you’ll see different numbers in the sizer. And your useable storage numbers will obviously be impacted by the usual constraints around failures to tolerate (FTT) and RAID settings.

Region Support?

The I4i.metal instances will be available in the following Regions (and Availability Zones):
  • US East (N. Virginia) – use1-az1, use1-az2, use1-az4, use1-az5, use1-az6
  • US West (Oregon) – usw2-az1, usw2-az2, usw2-az3, usw2-az4
  • US West (N. California) – usw1-az1, usw1-az3
  • US East (Ohio) – use2-az1, use2-az2, use2-az3
  • Canada (Central) – cac1-az1, cac1-az2
  • Europe (Ireland) – euw1-az1, euw1-az2, euw1-az3
  • Europe (London) – euw2-az1, euw2-az2, euw2-az3
  • Europe (Frankfurt) – euc1-az1, euc1-az2, euc1-az3
  • Europe (Paris) –  euw3-az1, euw3-az2, euw3-az3
  • Asia Pacific (Singapore) – apse1-az1, apse1-az2, apse1-az3
  • Asia Pacific (Sydney) – apse2-az1, apse2-az2, apse2-az3
  • Asia Pacific (Tokyo) – apne1-az1, apne1-az2, apne1-az4

Other Regions will have availability over the coming months.

 

Thoughts

The i3.metal isn’t going anywhere, but it’s nice to have an option that supports more cores and a bit more storage and RAM. The I4i.metal is great for SQL workloads and VDI deployments where core count can really make a difference. Coupled with the addition of supplemental storage via VMware Cloud Flex Storage and Amazon FSx for NetApp ONTAP, there are some great options available to deal with the variety of workloads customers are looking to deploy on VMware Cloud on AWS.

On another note, if you want to hear more about all the cloudy news from VMware Explore US, I’ll be presenting at the Brisbane VMUG meeting on October 12th, and my colleague Ray will be doing something in Sydney on October 19th. If you’re in the area, come along.

VMware Cloud on AWS – Supplemental Storage – A Few Notes …

At VMware Explore 2022 in the US, VMware announced a number of new offerings for VMware Cloud on AWS, including something we’re calling “Supplemental Storage”. There are some great (official) posts that have already been published, so I won’t go through everything here. I thought it would be useful to provide some high-level details and cover some of the caveats that punters should be aware of.

 

The Problem

VMware Cloud on AWS has been around for just over 5 years now, and in that time it’s proven to be a popular platform for a variety of workloads, industry verticals, and organisations of all different sizes. However, one of the challenges that a hyper-converged architecture presents is that resource growth is generally linear (depending on the types of nodes you have available). In the case of VMware Cloud on AWS, we (now) have 3 node types available for use: the I3, I3en, and I4i. Each of these instances provides a fixed amount of CPU, RAM, and vSAN storage for use within your VMC cluster. So when your storage grows past a certain threshold (80%), you need to add an additional node. This is a longwinded way of saying that, even if you don’t need the additional CPU and RAM, you need to add it anyway. To address this challenge, VMware now offers what’s called “Supplemental Storage” for VMware Cloud on AWS. This is essentially external datastores presented to the VMC hosts over NFS. This comes in two flavours: FSx for NetApp ONTAP and VMware Cloud Flex Storage. I’ll cover this in a little more detail below.
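To illustrate the linear growth problem, here's a back-of-the-envelope sketch that works out how many hosts each resource would need on its own, and how many you end up with. The function name, the workload numbers, and the per-node figures in the example are all hypothetical placeholders; only the 80% storage threshold comes from the paragraph above.

```python
import math


def hosts_required(cpu_cores: int, ram_gib: int, storage_tib: float,
                   node_cores: int, node_ram_gib: int, node_usable_tib: float,
                   storage_threshold: float = 0.80) -> dict[str, int]:
    """Hosts needed per resource; the cluster size is the largest of the three."""
    by_resource = {
        "cpu": math.ceil(cpu_cores / node_cores),
        "ram": math.ceil(ram_gib / node_ram_gib),
        # Stay under the storage threshold (80%) mentioned above.
        "storage": math.ceil(storage_tib / (node_usable_tib * storage_threshold)),
    }
    by_resource["cluster"] = max(by_resource.values())
    return by_resource


if __name__ == "__main__":
    # Hypothetical storage-heavy workload on a hypothetical node size.
    print(hosts_required(cpu_cores=100, ram_gib=1500, storage_tib=150,
                         node_cores=64, node_ram_gib=1024, node_usable_tib=20))
```

In a case like this, storage alone drives the host count well past what CPU and RAM need, which is exactly the gap external datastores are meant to fill.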

[image courtesy of VMware]

 

Amazon FSx for NetApp ONTAP

The first cab off the rank is Amazon FSx for NetApp ONTAP (or FSxN to its friends). This one is ONTAP-like storage made available to your VMC environment as a native service. It’s fully customer managed, and VMware managed from a networking perspective.

[image courtesy of VMware]

There’s a 99.99% Availability SLA attached to the service. It’s based on NetApp ONTAP, and offers support for:

  • Multi-Tenancy
  • SnapMirror
  • FlexClone

Note that it currently requires VMware Managed Transit Gateway (vTGW) for Multi-AZ deployment (the only deployment architecture currently supported), and can connect to multiple clusters and SDDCs for scale. You’ll need to be on SDDC version 1.20 (or greater) to leverage this service in your SDDC, and there is currently no support for attachment to stretched clusters. While you can only connect datastores to VMC hosts using NFSv3, there is support for connecting directly to guests via other protocols. More information can be found in the FAQ here. There’s also a simulator you can access here that runs you through the onboarding process.

 

VMware Cloud Flex Storage

The other option for supplemental storage is VMware Cloud Flex Storage (sometimes referred to as VMC-FS). This is a datastore presented to your hosts over NFSv3.

Overview

VMware Cloud Flex Storage is:

  • A natively integrated cloud storage service for VMware Cloud on AWS that is fully managed by VMware;
  • A cost-effective multi-cloud storage solution built on SCFS;
  • Delivered via a two-tier architecture for elasticity and performance (AWS S3 and local NVMe cache); and
  • Equipped with integrated data management.

In short, VMware has taken a lot of the technology used in VMware Cloud Disaster Recovery (the result of the Datrium acquisition in 2020) and used it to deliver up to 400 TiB of storage per SDDC.

[image courtesy of VMware]
The intent of the solution, at this stage at least, is that it is only offered as a datastore for hosts via NFSv3, rather than other protocols directly to guests. There are some limitations around the supported topologies too, with stretched clusters not currently supported. From a disaster recovery perspective, it’s important to note that VMware Cloud Flex Storage is currently only offered on a single-AZ basis (although the supporting components are spread across multiple Availability Zones), and there is currently no support for VMware Cloud Disaster Recovery co-existence with this solution.

 

Thoughts

I’ve only been at VMware for a short period of time, but I’ve had numerous conversations with existing and potential VMware Cloud on AWS customers looking to solve their storage problems without necessarily putting everything on vSAN. There are plenty of reasons why you wouldn’t want to use vSAN for high capacity storage workloads, and I believe these two initial solutions go some way towards solving that issue. Many of the caveats that are wrapped around these two products at General Availability will be removed over time, and the traditional objections relating to VMware Cloud on AWS being not great at high-capacity, cost-effective storage will also have been removed.

Finally, if you’re an existing NetApp ONTAP customer, and were thinking about what you were going to do with that Petabyte of unstructured data you had lying about when you moved to VMware Cloud on AWS, or wanting to take advantage of the sweat equity you’ve poured into managing your ONTAP environment over the years, I think we’ve got you covered as well.

Random Short Take #75

Welcome to Random Short Take #75. Half the year has passed us by already. Let’s get random.

  • I talk about GiB all the time when sizing up VMware Cloud on AWS for customers, but I should take the time to check in with folks if they know what I’m blithering on about. If you don’t know, this explainer from my friend Vincent is easy to follow along with – A little bit about Gigabyte (GB) and Gibibyte (GiB) in computer storage.
  • MinIO has been in the news a bit recently, but this article from my friend Chin-Fah is much more interesting than all of that drama – Beyond the WORM with MinIO object storage.
  • Jeff Geerling seems to do a lot of projects that I either can’t afford to do, or don’t have the time to do. Either way, thanks Jeff. This latest one – Building a fast all-SSD NAS (on a budget) – looked like fun.
  • You like ransomware? What if I told you you can have it cross-platform? Excited yet? Read Melissa’s article on Multiplatform Ransomware for a more thorough view of what’s going on out there.
  • Speaking of storage and clouds, Chris M. Evans recently published a series of videos over at Architecting IT where he talks to NetApp’s Matt Watt about the company’s hybrid cloud strategy. You can see it here.
  • Speaking of traditional infrastructure companies doing things with hyperscalers, here’s the July 2022 edition of What’s New in VMware Cloud on AWS.
  • In press release news, Aparavi and Backblaze have joined forces. You can read more about that here.
  • I’ve spent a lot of money over the years trying to find the perfect media streaming device for home. I currently favour the Apple TV 4K, but only because my Boxee Box can’t keep up with more modern codecs. This article on the Best Device for Streaming for Any User – 2022 seems to line up well with my experiences to date, although I admit I haven’t tried the NVIDIA device yet. I do miss playing ISOs over the network with the HD Mediabox 100, but those were simpler times I guess.

VMware Cloud on AWS – TMCHAM – Part 7 – Elastic DRS and Host Failure Remediation

In this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to delve into some questions around managing host additions and failures on the VMware-managed VMware Cloud on AWS platform.

Elastic DRS

One of the questions I frequently get asked by customers is what happens when you reach a certain capacity in your VMware Cloud on AWS cluster? The good news is we have a feature called Elastic DRS that can take care of that for you. Elastic DRS is a little different to what you might know as the vSphere Distributed Resource Scheduler (DRS). Elastic DRS operates at a host level and takes care of capacity constraints in your VMC environment. The idea is that, when your cluster reaches a certain resource threshold (be it storage, vCPU, or RAM), Elastic DRS takes care of adding in additional host resources as required. 

The algorithm runs every 5 minutes and uses the following parameters:

  • Minimum and maximum number of hosts the algorithm should scale up or down to.
  • Thresholds for CPU, memory and storage utilisation such that host allocation is optimized for cost or performance.

Note also that your cluster may scale back in, assuming the resources stay consistently below the threshold for a number of iterations.
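The "consistently below the threshold for a number of iterations" idea can be sketched with a small sliding window. This is not the real Elastic DRS implementation; the class name and the number of required iterations are hypothetical, and only the 5-minute interval comes from the text above.

```python
from collections import deque

EVALUATION_INTERVAL_MINUTES = 5  # from the text above


class ScaleInTracker:
    """Recommend scale-in only after N consecutive below-threshold samples."""

    def __init__(self, low_threshold: float, required_iterations: int):
        self.low_threshold = low_threshold
        # Keep only the most recent N results; older ones fall off the end.
        self.samples = deque(maxlen=required_iterations)

    def observe(self, utilisation: float) -> bool:
        """Record one evaluation and report whether scale-in is recommended."""
        self.samples.append(utilisation < self.low_threshold)
        return len(self.samples) == self.samples.maxlen and all(self.samples)


if __name__ == "__main__":
    tracker = ScaleInTracker(low_threshold=40, required_iterations=3)
    for pct in (35, 30, 45, 20, 25, 30):
        print(pct, tracker.observe(pct))
```

A single spike above the threshold resets the clock, so a cluster with bursty usage won't flap between host counts every few minutes.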

Settings

There are a few different options for Elastic DRS, with the default being the “Elastic DRS Baseline Policy”. With this policy, a host is automatically added when there’s less than 20% free vSAN storage. Note that this doesn’t apply to single-node SDDC configurations, and only the baseline policy is available with 2-node configurations. Beyond those limitations, though, there are a number of other configurations available and these are outlined here. The neat thing is that there’s some amount of flexibility in how you have your SDDC automatically managed, with options for best performance, lowest cost, or rapid scale-out also available.

Can I Turn It Off?

No, but you can fiddle with the settings from your VMC cloud console.

Other Questions

What happens if I’m adding a host manually? The Elastic DRS recommendations are ignored. Same goes with planned maintenance or SDDC maintenance, where the support team may be adding in an additional host. But what if you’ve lost a host? The auto-remediation process kicks in and the Elastic DRS recommendations are ignored while the failed host is being replaced. You can read more about that process here.

 

Thoughts

One of the things I like about the VMware Cloud on AWS approach is that VMware has looked into a number of common scenarios that occur in the wild (hosts running out of capacity, for example) and built some automation on top of an already streamlined SDDC stack. Elastic DRS and the Auto-Scaler features seem like minor things, but when you’re managing an SDDC of any significant scale, it’s nice to have the little things taken care of.