VMware Cloud on AWS – TMCHAM – Part 7 – Elastic DRS and Host Failure Remediation

In this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to delve into some questions around managing host additions and failures on the VMware-managed VMware Cloud on AWS platform.

Elastic DRS

One of the questions I frequently get asked by customers is what happens when you reach a certain capacity in your VMware Cloud on AWS cluster? The good news is we have a feature called Elastic DRS that can take care of that for you. Elastic DRS is a little different to what you might know as the vSphere Distributed Resource Scheduler (DRS). Elastic DRS operates at a host level and takes care of capacity constraints in your VMC environment. The idea is that, when your cluster reaches a certain resource threshold (be it storage, vCPU, or RAM), Elastic DRS takes care of adding in additional host resources as required. 

The algorithm runs every 5 minutes and uses the following parameters:

  • Minimum and maximum number of hosts the algorithm should scale up or down to.
  • Thresholds for CPU, memory and storage utilisation such that host allocation is optimized for cost or performance.

Note also that your cluster may scale back in, assuming the resources stay consistently below the threshold for a number of iterations.
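The behaviour described above can be sketched as a small decision function that is evaluated once per pass. This is only an illustration of the idea, not VMware's implementation: the `Policy` class, the `recommend` function, the threshold values and the number of quiet passes before scale-in are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    min_hosts: int
    max_hosts: int
    # Hypothetical thresholds, as fractions of capacity in use per resource.
    scale_out_above: dict
    scale_in_below: dict
    # Assumed value: consecutive low passes required before scaling in.
    quiet_iterations: int = 3


def recommend(policy, hosts, usage, quiet_count):
    """Return (action, new_quiet_count) for one evaluation pass."""
    # Scale out as soon as any single resource crosses its upper threshold.
    if any(usage[r] > policy.scale_out_above[r] for r in usage):
        action = "add_host" if hosts < policy.max_hosts else "none"
        return action, 0
    # Scale in only when every resource has stayed low for several passes.
    if all(usage[r] < policy.scale_in_below[r] for r in usage):
        quiet_count += 1
        if quiet_count >= policy.quiet_iterations and hosts > policy.min_hosts:
            return "remove_host", 0
        return "none", quiet_count
    # Anything in between resets the quiet counter and changes nothing.
    return "none", 0
```

The caller would invoke `recommend` on each pass (every 5 minutes, per the above) and carry the returned counter into the next pass, which is what makes scale-in slower and more conservative than scale-out.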

Settings

There are a few different options for Elastic DRS, with the default being the “Elastic DRS Baseline Policy”. With this policy, a host is automatically added when there’s less than 20% free vSAN storage. Note that this doesn’t apply to single-node SDDC configurations, and only the baseline policy is available with 2-node configurations. Beyond those limitations, though, there are a number of other configurations available and these are outlined here. The neat thing is that there’s some amount of flexibility in how you have your SDDC automatically managed, with options for best performance, lowest cost, or rapid scale-out also available.

Can I Turn It Off?

No, but you can fiddle with the settings from your VMC cloud console.

Other Questions

What happens if I’m adding a host manually? The Elastic DRS recommendations are ignored. Same goes with planned maintenance or SDDC maintenance, where the support team may be adding in an additional host. But what if you’ve lost a host? The auto-remediation process kicks in and the Elastic DRS recommendations are ignored while the failed host is being replaced. You can read more about that process here.

 

Thoughts

One of the things I like about the VMware Cloud on AWS approach is that VMware has looked into a number of common scenarios that occur in the wild (hosts running out of capacity, for example) and built some automation on top of an already streamlined SDDC stack. Elastic DRS and the Auto-Scaler features seem like minor things, but when you’re managing an SDDC of any significant scale, it’s nice to have the little things taken care of.

VMware Cloud on AWS – TMCHAM – Part 6 – Sizing

In this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to touch briefly on some things you might come across when sizing workloads for the VMware Cloud on AWS platform using the VMware Cloud on AWS Sizer.

VMware Cloud on AWS Sizer

One of the neat things about VMware Cloud on AWS is that you can jump on the publicly available sizing tool and input some numbers (or import RVTools or LiveOptics files) and it will spit out the number of nodes that you’ll (likely) need to support your workloads. Of course, if that’s all there was to it, you wouldn’t need folks like me to help you with sizing. That said, VMware has worked hard to ensure that the sizing part of your VMware Cloud on AWS planning is fairly straightforward. There are a few things to look out for though.

Why Do I See A Weird Number Of Cores In The Sizer?

If you put a workload into the sizer, you might see some odd core counts in the output. For example, the below screenshot shows 4x i3en nodes with 240 cores, but clearly it should be 192 cores (4x 48).

Yet when the same workload is changed to the i3 instance type, the correct number of cores (5x 36 = 180) is displayed.

The reason for this is that the i3en instance types support Hyper-Threading, and the Sizer applies a weighting to calculations. This can be changed via the Global Settings in the Advanced section of the Sizer. If you’re not into HT, set it to 0%. If you’re a believer, set it to 100%. By default it’s set to 25%, hence the 240 cores number in the previous example (48 x 1.25 x 4 nodes).
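If you want to check the Sizer's arithmetic yourself, the weighting works out as below. The function name `sizer_cores` is mine, and this is just the multiplication from the example above rather than anything taken from the Sizer itself.

```python
def sizer_cores(nodes, physical_cores_per_node, ht_weight=0.25):
    """Core count after applying the Hyper-Threading weighting.

    ht_weight is the Sizer percentage expressed as a fraction
    (0.0 = ignore HT, 0.25 = default, 1.0 = count every thread).
    """
    return nodes * physical_cores_per_node * (1 + ht_weight)


print(sizer_cores(4, 48))        # 240.0, the i3en example at the default 25%
print(sizer_cores(4, 48, 0.0))   # 192.0, Hyper-Threading ignored
print(sizer_cores(4, 48, 1.0))   # 384.0, every thread counted as a core
```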

Why Do I Need This Many Nodes?

You might need to satisfy HA Admission Control (HAC) requirements. The current logic of HA Admission Control (as it’s applied in the VMC Sizer) is as follows:

  • A 2-host cluster should have 50.00 percent reserved CPU and memory capacity for HA Admission Control.
  • A 3-host cluster reserves 33.33 percent for HAC.

And so on until you get to

  • A 16-host cluster reserving 6.25 percent of resources for HAC.
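The three figures above are consistent with reserving one host's worth of capacity, i.e. 100 / N percent for an N-host cluster. Assuming the "and so on" follows that same pattern for every size in between (my assumption, as are the function names), the reservation can be sketched as:

```python
def hac_reserved_percent(hosts):
    """Percentage of CPU and memory reserved, assuming a 1/N reservation."""
    if hosts < 2:
        raise ValueError("admission control needs at least two hosts")
    return round(100 / hosts, 2)


def usable_after_hac(hosts, per_host_capacity):
    """Capacity left for workloads once one host's worth is set aside."""
    return (hosts - 1) * per_host_capacity


for n in (2, 3, 16):
    print(n, hac_reserved_percent(n))
# 2 50.0
# 3 33.33
# 16 6.25
```

This is why small clusters look expensive in the Sizer output: at two hosts, half of your CPU and memory is held back, while at 16 hosts it's a little over 6 percent.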

It’s also important to note that a 2-host cluster can accommodate a maximum of 35 VMs. Anything above that will need an extra host. And if you’re planning to run a full HCX configuration on two nodes, you should review this Knowledge Base article. Speaking of running things at capacity, I’ll go into Elastic DRS in another post, but by default we add another host to your cluster when you hit 80% storage capacity.

What About My Storage Consumption?

By default there are some storage policies applied to your vSAN configurations too. A standard cluster with 5 hosts or fewer is set to tolerate 1 Failure / RAID-1, whilst a standard cluster with 6 hosts or more is set to tolerate 2 Failures / RAID-6. You can read more about that here.
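To get a feel for what those defaults do to raw capacity, here's a rough sketch. The host-count rule is the one described above; the overhead multipliers are the usual vSAN figures for these layouts (2x for RAID-1 mirroring with one failure tolerated, 1.5x for RAID-6 with two). The function names are mine, and the estimate deliberately ignores slack space, deduplication, compression and anything else the Sizer accounts for.

```python
def default_storage_policy(hosts):
    """Default policy for a standard cluster, based on host count."""
    if hosts < 1:
        raise ValueError("hosts must be a positive number")
    if hosts <= 5:
        return {"failures_to_tolerate": 1, "raid": "RAID-1", "overhead": 2.0}
    return {"failures_to_tolerate": 2, "raid": "RAID-6", "overhead": 1.5}


def raw_capacity_needed(usable_tib, hosts):
    """Raw TiB consumed to store usable_tib under the default policy."""
    return usable_tib * default_storage_policy(hosts)["overhead"]


print(raw_capacity_needed(10, 4))   # 20.0 (RAID-1 mirrors everything)
print(raw_capacity_needed(10, 6))   # 15.0 (RAID-6 is more space-efficient)
```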

Conclusion

There’s a bunch of stuff I haven’t covered here, including the choices you have to make between using RVTools and LiveOptics, and whether you should size with a high vCPU-to-core ratio or keep it one to one, the way the old-timers prefer. But hopefully this post has been of some use explaining some of the quirky things that pop up in the Sizer from time to time.

VMware Cloud on AWS – TMCHAM – Part 4 – VM Resource Management

In this episode of Things My Customers Have Asked Me (TMCHAM), I’m going to delve into some questions around resource management for VMs running on the VMware-managed VMware Cloud on AWS platform, and what customers need to know to make it work for them.

Distributed Resource Scheduler

If you’ve used VMware vSphere before, it’s likely that you’ve come across the Distributed Resource Scheduler (DRS) capability. DRS is a way to keep workloads evenly distributed across nodes in a cluster, and moves VMs around based on various performance considerations. The cool thing about this is that you don’t need to manually move workloads around when a particular guest or host goes a little nuts from a CPU or Memory usage perspective. There are cases, however, when you might not want your VMs to be moving around too much. In this instance, you’ll want to create what is called a “Disable DRS vMotion Policy”. You configure this via Compute Policies in vCenter, and you can read more about the process here.

If you don’t like reading documentation though, I’ve got some pictures you can look at instead. Log in to your vSphere Client and click on Policies and Profiles.

Then click on Compute Policies and click Add.

Under Policy type, there’s a dropdown box where you can select Disable DRS vMotion.

You’ll then give the policy a Name and Description. You then need to select the tag category you want to use.

Once you’ve selected the tag category you want to use, you can select the tags you want to apply to the policy.

Click on Create to create the Compute Policy, and you’re good to go.

Memory Overcommit Techniques

I’ve had a few customers ask me about how some of the traditional VMware resource management technologies translate to VMware Cloud on AWS. The good news is there’s quite a lot in common with what you’re used to with on-premises workload management, including memory overcommit techniques. As with anything, the effectiveness or otherwise of these technologies really depends on a number of different factors. If you’re interested in finding out more, I recommend checking out this article.

General Resource Management

Can I use the resource management mechanisms I know and love, such as Reservations, Shares, and Limits? You surely can, and you can read more about that capability here.

Conclusion

Just as you would with on-premises vSphere workloads, you do need to put some thought into your workload resource planning prior to moving your VMs onto the magic sky computers. The good news, however, is that there are quite a few smart technologies built into VMware Cloud on AWS that mean you’ve got a lot of flexibility when it comes to managing your workloads.

VMware Cloud on AWS – TMCHAM – Part 3 – SDDC Lifecycle

In this episode of Things My Customers Have Asked Me (TMCHAM), I’m going to delve into some questions around the lifecycle of the VMware-managed VMware Cloud on AWS platform, and what customers need to know to make sense of it all.

 

The SDDC

If you talk to VMware folks about VMware Cloud on AWS, you’ll hear a lot of talk about software-defined data centres (SDDCs). This is the logical construct in place that you use within your Organization to manage your hosts and clusters, in much the same fashion as you would your on-premises workloads. Unlike most on-premises workloads, however, the feeding and watering of the SDDC, from a software currency perspective, is done by VMware.

Release Notes

If you’ve read the VMware Cloud on AWS Release Notes, you’ll see something like this at the start:

“Beginning with the SDDC version 1.11 release, odd-numbered releases of the SDDC software are optional and available for new SDDC deployments only. By default, all new SDDC deployments and upgrades will use the most recent even-numbered release. If you want to deploy an SDDC with an odd-numbered release version, contact your VMware TAM, sales, or customer success representative to make the request.”

Updated on: 5 April 2022

Essential Release: VMware Cloud on AWS (SDDC Version 1.18) | 5 April 2022

Optional Release: VMware Cloud on AWS (SDDC Version 1.17) | 19 November 2021

Basically, when you deploy onto the platform, you’ll usually get put on what VMware calls an “Essential” release. From time to time, customers may have requirements that mean that they qualify to be deployed on an “Optional” release. This might be because they have a software integration requirement that hasn’t been handled in 1.16, for example, but is available for 1.17. It’s also important to note that each major release will have a variety of minor releases as well, depending on issues that need to be resolved or features that need to be rolled out. So you’ll also see references to 1.16v5 in places, for example.
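If you're staring at a version string and trying to work out which bucket it falls into, the odd/even rule from the release notes is easy to express. This is a sketch only: the `classify_release` name is mine, and the assumption that a string like 1.16v5 is simply "major.minor" plus a "v" patch suffix is based on the examples above rather than on any published format.

```python
import re


def classify_release(version):
    """Classify an SDDC version such as '1.18' or '1.16v5'."""
    match = re.fullmatch(r"(\d+)\.(\d+)(?:v(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # The odd/even convention only applies from SDDC version 1.11 onwards.
    if (major, minor) < (1, 11):
        return None
    return "Optional" if minor % 2 else "Essential"


print(classify_release("1.18"))    # Essential
print(classify_release("1.17"))    # Optional
print(classify_release("1.16v5"))  # Essential
```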

Upgrades and Maintenance

So what happens when your SDDC is going to be upgraded? Well, we let you know in advance, and it’s done in phases, as you’d imagine.

[image courtesy of VMware]

You can read more about the process here, and there’s a blog post that covers the release cadence here. VMware also does the rollout of releases in waves, so not every customer has the upgrade done at the same time. If you’re the type of customer that needs to be on the latest version of everything, or perhaps you have a real requirement to be near the front of the line, you should talk to your account team and they’ll liaise with the folks who can make it happen for you. When the upgrades are happening, you should be careful not to:

  • Perform hot or cold workload migrations. Migrations fail if they are started or in progress during maintenance.
  • Perform workload provisioning (New/Clone VM). Provisioning operations fail if they are started or in progress during maintenance.
  • Make changes to Storage-based Policy Management settings for workload VMs.

You should also ensure that there is enough storage capacity (> 30% slack space) in each cluster.
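A quick way to sanity-check the slack space requirement ahead of a maintenance window is sketched below. The cluster names and numbers are invented, and how you actually collect used and total capacity for each cluster is up to you; this only shows the comparison.

```python
def slack_fraction(used_tib, capacity_tib):
    """Fraction of a cluster's storage capacity that is still free."""
    return (capacity_tib - used_tib) / capacity_tib


def clusters_short_on_slack(clusters, minimum=0.30):
    """Names of clusters that do not have more than `minimum` free."""
    return [
        name
        for name, (used, capacity) in clusters.items()
        if slack_fraction(used, capacity) <= minimum
    ]


# Hypothetical inventory: name -> (used TiB, total TiB)
inventory = {"Cluster-1": (60, 100), "Cluster-2": (80, 100)}
print(clusters_short_on_slack(inventory))   # ['Cluster-2']
```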

How Long Will It Take?

As usual, it depends. But you can make some (very) rough estimates by following the guidance on this page.

Will My SDDC Expire?

Yes, your SDDC version will some day expire. But it will be upgraded before that happens. There’s a page where you can look up the expiration dates of the various SDDC releases. It’s all part of the lifecycle part of the SDDC lifecycle.

Correlating VMware Cloud on AWS with Component Releases

Ever found yourself wondering what component versions are being used in VMware Cloud on AWS? Wonder no more with this very handy reference.

 

Conclusion

There’s obviously a lot more that goes on behind the scenes to keep everything running in tip-top shape for our customers. All of this talk of phases, waves, and release notes can be a little confusing if you’re new to the platform. Having worked in a variety of (managed and unmanaged) service providers over the years, I do like that VMware has bundled up all of this information and put it out there for people to check out. As always, if you’ve got questions about how the various software integrations work, and you can’t find the information in the documentation, reach out to your local account team and they’ll be able to help.

VMware Cloud on AWS – TMCHAM – Part 2 – VCDR Notes

In this episode of “Things My Customers Have Asked Me” (or TMCHAM for short), I’m going to dive into a few questions around VMware Cloud Disaster Recovery (VCDR), a service we offer as an add-on to VMware Cloud on AWS. If you’re unfamiliar with VCDR, you can read a bit more about it here.

VCDR Roles and Permissions

Can RBAC roles be customised? Not really, as these are cascaded down from the Cloud Services hub. As I understand it, you don’t have granular control over them, just the pre-defined, default roles as outlined here, so you need to be careful about what you hand out to folks in your organisation. To see what Service Roles have been assigned to your account, in the VMware Cloud Services console, go to My Account, and then click on My Roles. Under Service Roles, you’ll see a list of services, such as VCDR, Skyline, and so on. You can then check what roles have been assigned.

VCDR Protection Groups

VCDR Protection Groups are the way that we logically group together workloads to be protected with the same RPO, schedule, and retention. There are two types of protection group: standard-frequency and high-frequency. Standard-frequency snapshots can be run as often as every 4 hours, while high-frequency snapshots can go as often as every 30 minutes. You can read more on protection groups here. It’s important to note that there are some caveats to be aware of with high-frequency snapshots. These are outlined here.

30-minute RPOs were introduced in late 2021. Some of the caveats are straightforward, such as the minimum software levels for on-premises protection. But you also need to be mindful that VMs with existing vSphere snapshots will not be included, and, more importantly, high-frequency snapshots can’t be quiesced.

Can you have a VM instance in both a standard- and high-frequency snapshot protection group? Would this allow us to get the best of both worlds – e.g. RPO could be as low as 30 minutes, but with a guaranteed snapshot every 4 hours? Once you do a high-frequency snap on a VM, it keeps using that mechanism thereafter, even if it sits in a protection group using standard protection. Note also that you set a schedule for a protection group, so you can have snapshots running every 30 minutes and kept for a particular period of time (the customer selects this). You could also run snapshots every 4 hours and keep those for a period of time too. While you can technically have a VM in multiple groups, what you’re better off doing is configuring a variety of schedules for your protection groups to meet those different RPOs.
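When you're planning schedules, it can help to write the constraints down as a simple check. The sketch below encodes only the limits mentioned in this post (4-hour minimum for standard-frequency, 30-minute minimum for high-frequency, no quiescing and no existing vSphere snapshots for high-frequency). The function and parameter names are hypothetical and have nothing to do with the VCDR product or its API.

```python
from datetime import timedelta

# Minimum snapshot interval for each protection group type.
MIN_INTERVAL = {
    "standard": timedelta(hours=4),
    "high": timedelta(minutes=30),
}


def schedule_problems(frequency, interval, quiesced=False,
                      vm_has_vsphere_snapshots=False):
    """Return a list of reasons a planned schedule will not work."""
    problems = []
    if interval < MIN_INTERVAL[frequency]:
        problems.append(f"{frequency}-frequency interval is too short")
    if frequency == "high":
        if quiesced:
            problems.append("high-frequency snapshots can't be quiesced")
        if vm_has_vsphere_snapshots:
            problems.append("VMs with existing vSphere snapshots are not included")
    return problems


print(schedule_problems("standard", timedelta(minutes=30)))
# ['standard-frequency interval is too short']
print(schedule_problems("high", timedelta(minutes=30)))
# []
```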

Quiesced Snapshots

What happens to a VM while it’s being quiesced – would we experience micro service outages? The best answer I can give is “it depends”. The process for the standard, quiesced snapshot is similar to the one described here. The VM will be stunned by the process, so depending on what kind of activity is happening on the VM, there may be a micro outage to the service.

Other Considerations

The documentation talks about not changing anything when a scheduled snapshot is being run – how do we manage configuration of the SDDC if jobs are running 24/7? It seems odd that nothing can be changed while a scheduled snapshot is being run. This refers more to the VM that is being snapped, i.e. don’t change configs or make changes to the environment that would impact this VM. It’s not a blanket rule for the whole environment.

Like most things, success with VCDR relies heavily on understanding the outcomes your organisation wants to achieve, and then working backwards from there. It’s also important to understand that this is a great way to do DR, but not necessarily a great way to do standard backup and recovery activities. Hopefully this article helps clarify some of the questions folks have around VCDR, and if it doesn’t, please don’t hesitate to get in contact.

VMware Cloud on AWS – TMCHAM – Part 1 – PCI DSS

I’m starting a new series on the blog. It’s called “Things My Customers Have Asked Me” (or TMCHAM for short). There are frequently occasions where the customer collateral I present on VMware Cloud on AWS doesn’t cover every single use case that my customers are interested in, or perhaps it doesn’t dive deeply enough into some of the material people would like to know more about. The idea behind these posts is that if I have one customer asking about this stuff, chances are another one might like to know about it too. I won’t be talking about internal-only stuff, or roadmap details in these posts (or anywhere publicly, for that matter), but hopefully these articles will be a useful point of information consolidation for folks who are into that sort of thing.

 

PCI DSS?

The Payment Card Industry Data Security Standard (PCI DSS) is the security standard adhered to by organisations handling credit card information from the major card vendors. You can find the official Attestation of Compliance (AoC) in the VMware Cloud Trust Center, and there’s also a comprehensive whitepaper here.

Getting Started on VMware Cloud on AWS

The capability was covered in March 2021, and you can see some of the details in the VMware Cloud on AWS Release Notes. You can also read my learned colleague Greg Vinton’s take on it here, and there’s a YouTube video for people who prefer that sort of thing. To enable PCI compliance on your Organization, you need to request the capability via your VMware account team. It’s not just something that’s configured by default, as some of the requirements around PCI DSS might be considered an unnecessary overhead by some folks. The account team will get it enabled on your Organization, and you can then deploy your SDDC. It’s important to note that your Organization needs to be empty – PCI DSS can’t be enabled on an Organization with SDDCs that are already deployed.

Configuration Changes

There are a number of configuration changes needed to ensure that your SDDC is PCI-compliant too. This includes disabling add-on services like HCX and Site Recovery. To do this, go to Inventory – Settings, and scroll down to Compliance Hardening.

Note that you’ll only see the “Compliance Hardening” section if your Organization has been configured for PCI DSS compliance. You’ll need to finish your HCX migrations before your Organization is compliant. You’ll also need to change your NSX configuration (Network & Security Tab Access). There is some more info on that here and there’s a blog post that also runs through it step by step that you can read here. Note that you’ll need to use the API to change the local NSX Manager user password every 90 days. Information on that can be found here.
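The API call itself is covered in the linked documentation, so I won't reproduce it here, but the 90-day cadence is the sort of thing that's easy to forget. A small reminder calculation might look like this; the names are mine and the dates are invented.

```python
from datetime import date, timedelta

ROTATION_PERIOD = timedelta(days=90)


def rotation_due(last_changed):
    """Date by which the local NSX Manager user password must be changed."""
    return last_changed + ROTATION_PERIOD


def days_remaining(last_changed, today):
    """Days left before the next change is due (negative if overdue)."""
    return (rotation_due(last_changed) - today).days


print(rotation_due(date(2022, 1, 1)))                        # 2022-04-01
print(days_remaining(date(2022, 1, 1), date(2022, 3, 25)))   # 7
```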

Other Considerations

One final thing to note is that this process doesn’t automatically make your Virtual Machines PCI compliant. You’ll still need to ensure that you’ve done the work in that respect. And I can’t repeat this enough – your Organization will only pass a PCI audit if you’ve done these additional steps. Merely requesting that VMware enable this at an Organization level won’t be enough.