VMware Cloud on AWS – TMCHAM – Part 5 – VM Management

In this edition of Things My Customers Have Asked Me (TMCHAM), I’m going to delve into some questions around managing VMs on the VMware-managed VMware Cloud on AWS platform, and talk about what happens to vCenter plugins when you move across.

How Can I Access vCenter?

VMware vCenter has been around since Hector was a pup, and the good news is that it can be used to manage your VMware Cloud on AWS environment. It’s accessible via a few different methods, including PowerCLI. If you want to access the HTML5 UI via the cloud console, you’ll need to ensure there’s a firewall rule in place to allow access via your Management gateway – the official documentation is here. If the rule has already been created and you just need to add your IP to the mix, here’s the process.

The first step is to find out your public IP address. I use WhatIsMyIP.com to do this.
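If you’d rather do it from a shell, a quick PowerShell one-liner works too (api.ipify.org is just my pick of public IP services, not the only option):

# Return your current public IP address as plain text
Invoke-RestMethod -Uri 'https://api.ipify.org'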

In your console, go to Networking & Security -> Inventory -> Groups.

Under Groups, make sure you select Management Groups.

You’ll find a Group that stores the IP addresses of folks who need to access vCenter. In this example, we’ve called it “SET Home IP Addresses”.

Click on the vertical ellipsis and click Edit.

Click on the IPs section.

You’ll then see a spot where you can enter your IP address. You can do a single address or enter a range, as shown below.

Click Apply and then click Save to save the rule. Now you should be able to open vCenter.
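Once the rule is in place, PowerCLI will work against the same vCenter too. Here’s a minimal sketch – the FQDN, username, and password below are placeholders; use the details shown under your SDDC’s Settings tab in the cloud console:

# Connect to the VMC vCenter with the cloudadmin credentials from the cloud console
Connect-VIServer -Server 'vcenter.sddc-a-b-c-d.vmwarevmc.com' -User 'cloudadmin@vmc.local' -Password 'changeme'

# Quick sanity check that the session works
Get-VM | Select-Object Name, PowerState -First 5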

Can I run RVTools and other scripts on my VMC environment?

Yes, you can run RVTools against your environment. In terms of privilege levels with VMware Cloud on AWS, you get CloudAdmin. The level of access is outlined here. It’s important to understand these privilege levels, because some operations will work and some won’t as a result.
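If you want to see exactly what the CloudAdmin role gives you before you let a script loose, something like this will dump the role’s privilege list (assuming you’re already connected with Connect-VIServer, and that the role is named CloudAdmin as per the documentation):

# List the privileges granted to the CloudAdmin role
Get-VIRole -Name 'CloudAdmin' | Select-Object -ExpandProperty PrivilegeList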

Can I lock down my VMs using PowerShell?

You have the ability to set advanced settings on your VMs in the SDDC, but only on a per-VM basis, rather than per-cluster. So if you normally run a script to harden the VM configuration across a cluster, you’ll need to run it against each VM individually instead. A quick PowerCLI sketch of what that looks like is below.
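Something like the following, for example – the setting name and value here are just placeholders lifted from the usual hardening guidance, not a recommendation:

# Apply a hardening-style advanced setting to every VM, one VM at a time
$settingName  = 'isolation.tools.copy.disable'   # example setting only
$settingValue = 'true'

foreach ($vm in Get-VM) {
    # -Force creates the setting if it doesn't exist, or updates it if it does
    New-AdvancedSetting -Entity $vm -Name $settingName -Value $settingValue -Force -Confirm:$false
}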

What about vCenter plugins?

We don’t have a concept of vCenter plugins in VMware Cloud on AWS, so there are different ways to get the information you’d normally need. vROps, for example, can monitor VMware Cloud on AWS using either the on-premises version or the cloud version. There’s information on that here, but note that the plugin isn’t supported with the VMC vCenter.

What about my Site Recovery Manager plugin? The mechanism for managing this will change depending on whether you’re using SRaaS or VCDR to protect your workloads. There’s some good info on SRaaS here, and some decent VCDR information here. Again, there is no plugin available, but the element managers are available via the cloud console.  

What about NSX-V? VMware Cloud on AWS is all NSX-T, and you can access the NSX Manager via the cloud console.

Conclusion

A big part of the reason people like VMware Cloud on AWS is that the management experience doesn’t differ significantly from what you get with VMware Cloud Foundation or VMware Validated Designs on-premises. That said, there are a few things that do change when you move to VMware Cloud on AWS. Things like plugins don’t exist, but you can still run many of the scripts you know and love against the platform. Remember, though, it is a fully managed service, so some of the stuff you used to run against your on-premises environment is no longer necessary.

VMware – VMworld 2017 – STO2063BU – Architecting Site Recovery Manager to Meet Your Recovery Goals

Disclaimer: I recently attended VMworld 2017 – US.  My flights were paid for by ActualTech Media, VMware provided me with a free pass to the conference and various bits of swag, and Tech Field Day picked up my hotel costs. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.

Here are my rough notes from “STO2063BU – Architecting Site Recovery Manager to Meet Your Recovery Goals” presented by GS Khalsa. You can grab a PDF version of them from here. It’s mainly bullet points, but I’m sure you know the drill.

 

Terminology

  • RPO – Last viable restore point
  • RTO – How long it will take before all functionality is recovered

You should break these down to an application or service tier level.

 

Protection Groups and Recovery Plans

What is a Protection Group?

Group of VMs that will be recovered together

  • Application
  • Department
  • System Type
  • ?

Different depending on replication type

A VM can only belong to one Protection Group

 

How do protection groups fit into Recovery Plans?

[Image via https://blogs.vmware.com/vsphere/2015/05/srm-protection-group-design.html]

 

vSphere Replication Protection Groups

  • Group VMs as desired into Protection Groups
  • What storage they are located on doesn’t matter

 

Array Based Protection Groups

If you want your protection groups to align to your applications – you’ll need to shuffle storage around

 

Policy Driven Protection

  • New style Protection Group leveraging storage profiles
  • High level of automation compared to traditional protection groups
  • Policy based approach reduces OpEx
  • Simpler integration of VM provisioning, migration, and decommissioning

This was introduced in SRM 6.1

 

How Should You Organise Your Protection Groups?

More Protection Groups

  • Higher RTO
  • Easier testing
  • Only what is needed
  • More granular and complex

Fewer Protection Groups

  • Lower RTO
  • Less granular, complex and flexible

This varies by customer and will be dictated by the appropriate combination of complexity and flexibility

 

Topologies

SRM Supports Multiple DR Topologies

Active-Passive Failover

  • Dedicated resources for recovery

Active-Active Failover

  • Run low priority apps on recovery infrastructure

Bi-directional Failover

  • Production applications at both sites
  • Each site acts as the recovery site for the other

Multi-site

  • Many-to-one failover
  • Useful for Remote Office / Branch Office

 

Enhanced Topology Support

There are a few different topologies that are supported.

[Images via https://blogs.vmware.com/virtualblocks/2016/07/28/srm-multisite/]

  • 10 SRM pairs per vCenter
  • A VM can only be protected once

 

SRM & Stretched Storage

[Image via https://blogs.vmware.com/virtualblocks/2015/09/01/srm-6-1-whats-new/]

Supported as of SRM 6.1

 

SRM and vSAN Stretched Cluster

[Image via https://blogs.vmware.com/virtualblocks/2015/08/31/whats-new-vmware-virtual-san-6-1/]

Failover to the third site (not the 2 sites comprising the cluster)

 

Enhanced Linked Mode

You can find more information on Enhanced Linked Mode here. It makes it easier to manage your environment and was introduced in vSphere 6.0.

 

Impacts to RTO

Decision Time

How long does it take to decide to fail over?

 

IP Customisation

Workflow without customisation

  • Power on VM and wait for VMtools heartbeats

Workflow with IP customisation

  • Power on VM with network disconnected
  • Customise IP utilising VMtools
  • Power off VM
  • Power on VM and wait for VMtools heartbeats

Alternatives

  • Stretched Layer 2
  • Move VLAN / Subnet

It’s going to take some time when you fail over a guest

 

Priorities and Dependencies vs Priorities Only

Organisation for lower RTO

  • Fewer / larger NFS datastores / LUNs
  • Fewer protection groups
  • Don’t replicate VM swap files
  • Fewer recovery plans

 

VM Configuration

  • VMware Tools installed in all VMs
  • Suspend VMs on Recovery vs PowerOff VMs
  • Array-based replication vs vSphere Replication

 

Recovery Site Sizing

  • vCenter sizing – it works harder than you think
  • Number of hosts – more is better
  • Enable DRS – why wouldn’t you?
  • Different recovery plans target different clusters

 

Recommendations

Be Clear with the Business

What is / are their

  • RPOs?
  • RTOs?
  • Cost of downtime?
  • Application priorities?
  • Units of failover?
  • Externalities?

Do you have Executive buy-in?

 

Risk with Infrequent DR Plan Testing

  • Parallel and cutover tests provide the best verification, but are very resource intensive and time consuming
  • Cutover tests are disruptive, may take days to complete and leave the business at risk

 

Frequent DR Testing Reduces Risk

  • Increased confidence that the plan will work
  • Recovery can be tested at any time without impact to production

 

Test Network

Use VLAN or isolated network for test environment

  • Default “auto” setting does not allow VM communication between hosts

Different PortGroup can be specified in SRM for test vs actual run

  • Specified in Network Mapping and / or Recovery Plan
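As an aside, standing up a dedicated, isolated test portgroup like the one described above is easy to script with PowerCLI. A rough sketch for a standard vSwitch follows – the cluster, vSwitch, portgroup name, and VLAN ID are all made up for the example (you’d use New-VDPortgroup instead if you’re on a distributed switch):

# Create an isolated "test bubble" portgroup on each recovery host's standard vSwitch
foreach ($esx in Get-Cluster 'Recovery-Cluster' | Get-VMHost) {
    $vSwitch = Get-VirtualSwitch -VMHost $esx -Name 'vSwitch0'
    New-VirtualPortGroup -VirtualSwitch $vSwitch -Name 'SRM-Test-Bubble' -VLanId 999
}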

 

Test Network – Multiple Options

Two Options

  • Disconnect NSX Uplink (this can be easily scripted)
  • Use NSX to create duplicate “Test” networks

RTO = dollars

Demos

 

Conclusion and Further Reading

I enjoy these kinds of sessions, as they provide a nice overview of the product capabilities that ties in well with business requirements. SRM is a pretty neat solution, and something you might consider using if you need to move workloads from one DC to another. If you’re after a technical overview of Site Recovery Manager 6.5, this site is pretty good too. 4.5 stars.

VMware – VMworld 2017 – STO1179BU – Understanding the Availability Features of vSAN

Disclaimer: I recently attended VMworld 2017 – US.  My flights were paid for by ActualTech Media, VMware provided me with a free pass to the conference and various bits of swag, and Tech Field Day picked up my hotel costs. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.

Here are my rough notes from “STO1179BU – Understanding the Availability Features of vSAN”, presented by GS Khalsa (@gurusimran) and Jeff Hunter (@jhuntervmware). You can grab a PDF of the notes from here. Note that these posts don’t provide much in the way of opinion, analysis, or opinionalysis. They’re really just a way of providing you with a snapshot of what I saw. Death by bullet point if you will.

 

Components and Failure

vSAN Objects Consist of Components

VM

  • VM Home – multiple components
  • Virtual Disk – multiple components
  • Swap File – multiple components

vSAN has a cache tier and capacity tier (objects are stored here)

 

Quorum

Greater than 50% must be online to achieve quorum

  • Each component has one vote by default
  • Odd number of votes required to break tie – preserves data integrity
  • Greater than 50% of components (votes) must be online
  • Components can have more than one vote
  • Votes added by vSAN, if needed, to ensure odd number

 

Component Vote Counts Are Visible Using RVC CLI

/<vcenter>/datacenter/vms> vsan.vm_object_info <vm>

 

Storage Policy Determines Component Number and Placement

  • Primary level of failures to tolerate
  • Failure Tolerance Method

Primary level of failures to tolerate = 0 means only one copy

  • Maximum component size is 255GB
  • vSAN will split larger VMDKs into multiple smaller components
  • RAID-5/6 erasure coding uses stripes and parity (requires all-flash)
  • Consumes less raw capacity
  • Number of stripes also affects component counts
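For what it’s worth, these policies are easy to knock together with the PowerCLI SPBM cmdlets. A rough sketch is below – the policy name is made up, and the vSAN capability names are as I remember them, so check Get-SpbmCapability in your own environment before relying on this:

# Build a simple vSAN policy: tolerate one host failure, stripe each object across two disks
$ftt    = New-SpbmRule -Capability (Get-SpbmCapability -Name 'VSAN.hostFailuresToTolerate') -Value 1
$stripe = New-SpbmRule -Capability (Get-SpbmCapability -Name 'VSAN.stripeWidth') -Value 2

$ruleSet = New-SpbmRuleSet -AllOfRules $ftt, $stripe
New-SpbmStoragePolicy -Name 'FTT1-Stripe2' -AnyOfRuleSets $ruleSet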

 

Each Host is an Implicit Fault Domain

  • Multiple components can end up in the same rack
  • Configure Fault Domains in the UI
  • Add at least one more host or fault domain for rebuilds

 

Component States Change as a Result of a Failure

  • Active
  • Absent
  • Degraded

vSAN selects most efficient way

Which is most efficient? Repair or Rebuild? It depends. Partial repairs are performed if full repair is not possible

 

vSAN Maintenance Mode

Three vSAN Options for Host Maintenance Mode

  • Evacuate all data to other hosts
  • Ensure data accessibility from other hosts
  • No data evacuation
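If you’re scripting host maintenance, these options map to the -VsanDataMigrationMode parameter on Set-VMHost – a quick sketch, with a made-up host name:

# Put a host into maintenance mode, keeping vSAN data accessible rather than doing a full evacuation
Get-VMHost 'esx01.lab.local' | Set-VMHost -State Maintenance -VsanDataMigrationMode EnsureAccessibility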

 

Degraded Device Handling (DDH) in vSAN 6.6

  • vSAN 6.6 is more “intelligent”, builds on previous versions of DDH
  • When device is degraded, components are evaluated …
  • If component does not belong to last replica, mark as absent – “Lazy” evacuation since another replica of the object exists
  • If component belongs to last replica, start evacuation
  • Degraded devices will not be used for new component placement
  • Evacuation failures reported in UI

 

DDH and S.M.A.R.T.

Following items logged in vmkernel.log when drive is identified as unhealthy

  • Sectors successfully reallocated 0x05
  • Reported uncorrectable sectors 0xBB
  • Disk command timeouts 0xBC
  • Sector reallocation events 0xC4
  • Pending sector reallocations 0xC5
  • Uncorrectable sectors 0xC6

Helps GSS determine what to do with drive after evacuation

 

Stretched Clusters

Stretched Cluster Failure Scenarios

  • Extend the idea of fault domains from racks to sites
  • Witness component (tertiary site) – witness host
  • 5ms RTT (around 60 miles)
  • VM will have a preferred and secondary site
  • When a component fails, vSAN starts rebuilding on the preferred site

 

Stretched Cluster Local Failure Protection – new in vSAN 6.6

  • Redundancy against host failure and site failure
  • If site fails, vSAN maintains local redundancy in surviving site
  • No change in stretched cluster configuration steps
  • Optimised logic to minimise I/O traffic across sites
    • Local read, local resync
    • Single inter-site write for multiple replicas
  • RAID-1 between the sites, and then RAID-5 in the local sites

What happens during network partition or site failure?

  • HA Restart

Inter-site network disconnected (split brain)

  • HA Power-off

Witness Network Disconnected

  • Witness leaves cluster

VMs continue to operate normally. It’s very simple to redeploy a new witness. The recommended host isolation response in a stretched cluster is power off

Witness Host Offline

  • Recover or redeploy witness host

New in 6.6 – change witness host

 

vSAN Backup, Replication and DR

Data Protection

  • vSphere APIs – Data Protection
  • Same as other datastore (VMFS, etc)
  • Verify support with backup vendor
  • Production and backup data on vSAN
    • Pros: Simple, rapid restore
    • Cons: Both copies lost if vSAN datastore is lost, can consume considerable capacity

 

Solutions …

  • Store backup data on another datastore
    • SAN or NAS
    • Another vSAN cluster
    • Local drives
  • Dell EMC Avamar and NetWorker
  • Veeam Backup and Replication
  • Cohesity
  • Rubrik
  • Others …

vSphere Replication included with Essentials Plus Kit and higher. With this you get per-VM RPOs as low as 5 minutes

 

Automated DR with Site Recovery Manager

  • HA with Stretched Cluster, Automated DR with SRM
  • SRM at the tertiary site

Useful session. 4 stars.

New book on VMware SRM now available

Good news, “Disaster Recovery Using VMware vSphere Replication and vCenter Site Recovery Manager – Second Edition” has just been released via Packt Publishing. It was written by Abhilash G B and I had the pleasure of serving as the technical reviewer. While SRM 6.5 has just been announced, this is nonetheless a handy manual with some great guidance (and pictures!) on how to effectively use SRM with both array-based and vSphere Replication-based protection. There’s an ebook version available for purchase, with a print copy also available for order.


VMware – VMworld 2016 – STO7973 – Architecting Site Recovery Manager to Meet Your Recovery Goals

Disclaimer: I recently attended VMworld 2016 – US.  My flights were paid for by myself, VMware provided me with a free pass to the conference and various bits of swag, and Tech Field Day picked up my hotel costs. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.


Here are my rough notes from “STO7973 – Architecting Site Recovery Manager to Meet Your Recovery Goals” presented by GS Khalsa and Ivan Jordanov. This was a session I’d been looking forward to all week and it didn’t disappoint.


 

Agenda

  • Protection Groups and Recovery Plans
  • Topologies
  • Impacts to RTO
  • Recommendations
  • Recap

(Terminology)

  • DR – Disaster Recovery
  • RPO – how much data you’re going to lose
  • RTO – how long does it take you to get back up and running

 

Protection Groups and Recovery Plans

What is a protection group?

  • Group of VMs that will be recovered together
  • Application
  • Department
  • System type
  • ?
  • Different depending on replication type
  • A VM can only belong to one Protection Group

vSphere Replication Protection Groups

  • Group VMs as desired into PGs
  • Storage they are located on doesn’t matter
  • Replicates only deltas, option for compression

Array-based PGs

  • Protect all the VMs on a datastore, not granular

Storage Policy PG

  • new style PG leveraging storage policies
  • high level of automation compared to traditional protection groups
  • policy based approach reduces OpEx
  • simpler integration of VM provisioning, migration and decommissioning

How should you organise your Protection Groups?

  • Majority of outages are partial (not the entire DC)
  • Design accordingly
  • Failover only what is needed
  • These requirements vary by customer – everyone’s different

Fewer PGs = smaller RTO = less flexibility

Topologies

  • Active-Passive failover
  • Active-Active failover
  • Bi-directional failover
  • Multi-site

Shared Protection and Shared Recovery Site


Image from here. vCenter and SRM deployed at your remote sites

Enhanced topology support

[Image: SRM shared recovery site with central vCenter]

Image from here.

[Image: SRM three-site configuration]

Original image taken from here. Remember that each VM can be protected and replicated only once.

SRM and Stretched Storage


Original image from here.

  • Best of both worlds
  • zero downtime with orchestrated cross-VC vMotion
  • non-disruptive testing

 

SRM & Virtual SAN Stretched Cluster

  • VSAN 6.1 – stretched cluster support
  • Uses witness host (should be outside VSAN stretched cluster but accessible)

Enhanced Linked Mode

  • Join PSC nodes in a single SSO domain
  • you need this to do cross VC vMotion on top of stretched storage
  • Central point of management for vCenters in the SSO domain

Multi-site guidelines

  • keep it simple
  • each VM only protected once
  • each VM only replicated once
  • Use enhanced linked mode

 

Impacts to RTO

Decision Time

How long does it take to decide to fail over? Decision time plus failover time = RTO.

IP Customisation

This will impact the time it takes to recover (guest shutdown, customise IP, guest startup and wait for VMware Tools)

Alternatives? Stretched Layer-2 (eww), Move VLAN/Subnet

Priorities and Dependencies

Priorities only – think about how you need to organise things to achieve the goal that you want

Organisation for lower RTO

  • Fewer/larger NFS datastores/LUNs
  • Fewer Protection Groups (depending on your business goals)
  • Don’t replicate VM swap files
  • Fewer Recovery Plans

VM Configuration

  • VMware Tools installed in all VMs – if Tools isn’t installed, recovery will sit waiting for heartbeats and eventually time out
  • Suspend VMs on recovery – do this as part of a recovery plan
  • PowerOff VMs – you can do this as part of a recovery plan as well

Recovery Site

  • vCenter sizing – it works harder than you think
  • Number of hosts – more is better
  • Enable DRS – why wouldn’t you?
  • Different recovery plans target different clusters

More good stuff

  • script timeouts – make use of them
  • Install and admin guides – good source for best practices
  • Separate vCenter and SRM DBs

 

Recommendations

Be clear with your business

What is/are their

  • RPOs?
  • RTOs?
  • Cost of downtime?
  • Application priorities?
  • Units of failover?
  • Externalities?

Do you have

  • Executive buy-in?

Service Level Agreements

  • Do you have documented SLAs?
  • Do your SLAs clearly communicate the RPO, RTO and availability of your service tiers?
  • Are your SLA documents readily available to everyone in the company?

Recommendations

  • Use Service Tiers
  • Minimal requirements/decisions

vRealize Infrastructure Navigator – good for mapping dependencies

Risk with infrequent DR Plan Testing

  • Parallel and cutover tests provide the best verification, but are very resource intensive and time consuming
  • Cutover tests are disruptive, may take days to complete and leave the business at risk

Frequent DR testing reduces risk

  • Increased confidence that the plan will work
  • Recovery can be tested at any time without impact to production

Test network

  • Use VLAN or isolated network for test environment
    • default “Auto” setting does not allow VM communication between hosts
  • Different PortGroup can be specified in SRM for test vs actual run
    • specified in Network Mapping and/or Recovery Plan

SRM & Cross vCenter NSX 6.2

Test Network – Multiple Options

A lot of stuff can now be done via API

  • Don’t protect everything
  • Apps that provide their own protection (e.g. Active Directory)
  • Locate VMs where they do their work

Demos

Q&A

Great session and a highlight of the conference (but I do loves me some SRM). 5 stars.

 

Brisbane VMUG – November 2015

The November Brisbane VMUG will be held on Thursday 19th November at EMC’s office in the city (Level 11, 345 Queen Street, Brisbane) from 4 – 6 pm. It’s sponsored by EMC and should be a ripper.

Here’s the agenda:

  • Latest SRM Features and VMworld Highlights – VMware
  • EMC VPLEX Metro with Site Recovery Manager Presentation
  • EMC Demonstration
  • Pizza and Refreshments (I promise)

You can find out more information and register for the event here. I hope to see you there. Also, if you’re interested in sponsoring one of these events, please get in touch with me and I can help make it happen.

VMware – Deploying vSphere Replication 5.8

As part of a recent vSphere 5.5 deployment, I installed a small vSphere Replication 5.8 proof-of-concept for the customer to trial site-to-site replication and get their minds around how they can do some simple DR activities. The appliance is fairly simple to deploy, so I thought I’d just provide a few links to articles that I found useful. Firstly, esxi-guy has a very useful soup-to-nuts post on the steps required to deploy a replication environment, and the steps to recover a VM. You can check it out here. Secondly, here’s a link to the official vSphere Replication documentation in PDF and eBook formats – just the sort of thing you’ll want to read while on the treadmill or sitting on the bus on the way home from the salt mines. Finally, if you’re working in an environment that has a number of firewalls in play, this list of ports you need to open is pretty handy.

One problem we did have was that we’d forgotten what the password was on the appliance we’d deployed at each site. I’m not the greatest cracker in any case, and so we agreed that re-deploying the appliance would be the simplest course of action. So I deleted the VM at each site and went through the “Deploy from OVF” process again. The only thing of note was that it warned me I had previously deployed a vSphere Replication instance with that name and IP address, and that I should get rid of the stale version. I did that at each site, joined them together again, and was good to go. I’m now trying to convince the customer that SRM might be of some use to them too. But baby steps, right?

Note also that, if you want to deploy additional vSphere Replication VMs to assist with load-balancing in your environment, you need to use the vSphere_Replication_AddOn_OVF10.ovf file for the additional appliances.
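If you’re deploying those additional appliances in any volume, PowerCLI can handle the OVF import for you. A rough sketch is below – the path, appliance name, host, and datastore are all placeholders, and you’d fill in the OVF properties (networking, passwords and so on) via the configuration object to suit your environment:

# Pull in the OVF's configurable properties so they can be set before import
$ovfPath   = 'C:\Temp\vSphere_Replication_AddOn_OVF10.ovf'
$ovfConfig = Get-OvfConfiguration -Ovf $ovfPath
# ... set $ovfConfig values (networking, passwords, etc.) here ...

# Deploy the add-on appliance to a host and datastore of your choosing
Import-VApp -Source $ovfPath -OvfConfiguration $ovfConfig -Name 'vr-addon-01' `
    -VMHost (Get-VMHost 'esx01.lab.local') -Datastore (Get-Datastore 'DS01') -DiskStorageFormat Thin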

VMware – SRM 5.8 – You had one job!

The Problem

A colleague of mine has been doing some data centre failover testing for a customer recently and ran into an issue with VMware’s Site Recovery Manager (SRM) 5.8 running on vSphere 5.5 U2. If you attempt to perform a recovery while running Linked Mode and the protected site is offline, the recovery may fail. The upshot of this is “The user is unable to perform a recovery at the recovery site, in the event of a DR scenario”. Here’s what it looks like.


 

The Reason and Resolution

You can read more about the problem in this VMware KB article: Performing a Recovery using the Web Client in VMware vCenter Site Recovery Manager 5.8 reports the error: Failed to connect Site Recovery Manager Server(s). In short, there’s a PowerShell script you can run to make the recovery happen.


 

Conclusion

I don’t know what to say about this. I’d like to put the boot into whoever at VMware is responsible for this SNAFU, but I’m guessing they’ve already had a hard time of it. At least, I guess, there’s a workaround, if not a fix. But you’d be a bit upset if this happened for the first time during a real failover. But that’s why we test before we hand over. And what is it with everything going pear-shaped when Linked Mode is in use?

 

*Update – 29/10/2015*

Marcel van den Berg recently pointed out that updating to SRM 5.8.1 resolves this issue. Further detail can be found here.

VMware – SRM advanced setting for snap prefix

We haven’t been doing this in our production configurations, but if you want to change the behaviour of SRM with regards to the “snap-xxx” prefix on replica datastores, you need to modify an advanced setting in SRM. So, go to the vSphere client – SRM, and right-click on Site Recovery and select Advanced Settings. Under SanProvider, there’s an option called “SanProvider.fixRecoveredDatastoreNames” with a little checkbox that needs to be ticked to prevent the recovered datastores being renamed with the unsightly prefix.

You can also do this when manually mounting snapshots or mirrors with the help of the esxcfg-volume command – but that’s a story for another time.

EMC MirrorView – New Article

I wrote a brief article on configuring EMC MirrorView from scratch through to being ready for VMware SRM usage. It’s not really from scratch, because I don’t go through the steps required to load the MirrorView enabler on the frame. But I figured that’s more a CE-type activity in any case. I hope to follow it up with a brief article on the SRM side of things. You can find my other articles here. Enjoy!