In this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to delve into some questions around recent(ish) changes to Elastic DRS policies and capacity on the VMware-managed VMware Cloud on AWS platform.
I’ve had a few customers ask about changes VMware has made to Elastic DRS policies on VMware Cloud on AWS. I’ve talked a little about eDRS previously, and the release notes cover the changes here (go to March 27th, 2023). In short the changes are as follows:
- Elastic DRS optimize for rapid scaling policy now supports rapid scaling-in to enable faster scaling use cases like VDI, disaster recovery or any other business needs.
- The Elastic DRS Cost Policy improvement will allow automated scale-in of a cluster if the storage utilization falls below 40% instead of the current 20% limit.
What does it mean from a practical perspective? Not a lot for customers using the default baseline policy. But if you’re using “Optimize for Lower Cost” or “Rapid Scaling”, it might be worth looking into.
Huh?
Optimize for Lowest Cost
The documentation does a great job of describing how this works: “When scaling in, this policy removes hosts quickly to maintain baseline performance while keeping host counts to a practical minimum. It removes hosts only if it anticipates that storage utilization would not result in a scale out in the near term after host removal”. It has the following thresholds:
Old High | Old Low | New High | New Low | |
CPU | 90% | 60% | 90% | 60% |
Memory | 80% | 60% | 80% | 60% |
Storage | 70% | 20% | 80% (this changed a while ago) | 40% |
You’ll see that the new low has 40% as the threshold for storage now (I added in the change from 70 – 80% as well, but this was done a while ago). Generally speaking, the algorithm is designed not to do silly things, but we’ve added in this number to enable customers to scale in workloads sooner, helping to reduce the cost of scaling events.
Rapid Scaling
From the documentation: “[t]his policy adds multiple hosts at a time when needed for memory or CPU, and adds hosts incrementally when needed for storage. By default, hosts are added four at a time. You can specify a larger scale-out increment (8 or 12) if you need faster scaling for disaster recovery, Virtual Desktop Infrastructure (VDI), and similar use cases. As with any EDRS policy, scale-out time increases with increment size. When the increment is large (12 hosts), it can take up to 40 minutes to complete in some configurations.
When scaling in, this policy removes hosts rapidly, maintaining baseline performance while keeping host count to a practical minimum. It does not remove hosts if it anticipates that doing so would degrade performance and force a near-term scale-out. Scale-in stops when the cluster reaches the minimum host count or the number of hosts in the scale-out increment has been removed”. This policy has the following thresholds:
Old High | Old Low | New High | New Low | |
CPU | 80% | 0% | 80% | 50% |
Memory | 80% | 0% | 80% | 50% |
Storage | 70% | 0% | 80% | 40% |
What does that mean? We’ve added in some guardrails for rapid scale-in to ensure that things don’t get too hectic too quickly. And on the flip side, it means that you’ll scale out your environment faster as well. Again, this is useful for bursty workloads such as VDI or, potentially, rapid DR.
Thoughts
Elastic DRS is one of the cooler features of VMware Cloud on AWS. You can do some really interesting things from a scaling perspective, particularly if you’re operating with some volatile / bursty workloads. That said, if you only use the default baseline policy you’ll also likely be in a good spot, as the thing that can really hurt in these kinds of environments is when your hosts run short of storage.