VMware – VMworld 2016 – STO7973 – Architecting Site Recovery Manager to Meet Your Recovery Goals

Disclaimer: I recently attended VMworld 2016 – US.  My flights were paid for by myself, VMware provided me with a free pass to the conference and various bits of swag, and Tech Field Day picked up my hotel costs. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.

Here are my rough notes from “STO7973 – Architecting Site Recovery Manager to Meet Your Recovery Goals” presented by GS Khalsa and Ivan Jordanov. This was a session I’d been looking forward to all week and it didn’t disappoint.

Agenda

  • Protection Groups and Recovery Plans
  • Topologies
  • Impacts to RTO
  • Recommendations
  • Recap

(Terminology)

  • DR – Disaster Recovery
  • RPO – Recovery Point Objective: how much data you’re going to lose
  • RTO – Recovery Time Objective: how long it takes you to get back up and running

 

Protection Groups and Recovery Plans

What is a protection group?

  • Group of VMs that will be recovered together
    • Application
    • Department
    • System type
    • ?
  • Different depending on replication type
  • A VM can only belong to one Protection Group
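
The one-group-per-VM rule is easy to check on paper before anything is built. Here is a minimal sketch, assuming a hypothetical planning structure that maps protection group names to VM names. It is plain Python rather than the SRM API, and every name in it is made up for illustration.

```python
from collections import defaultdict


def find_multi_group_vms(protection_groups):
    """Return {vm: [groups]} for every VM listed in more than one group.

    `protection_groups` is a hypothetical planning structure:
    {protection group name: [VM names]}.
    """
    membership = defaultdict(list)
    for group, vms in protection_groups.items():
        for vm in vms:
            # Ignore a VM repeated inside the same group.
            if group not in membership[vm]:
                membership[vm].append(group)
    return {vm: groups for vm, groups in membership.items() if len(groups) > 1}


# Illustrative names only.
planned_groups = {
    "pg-finance": ["fin-db01", "fin-app01"],
    "pg-databases": ["fin-db01", "hr-db01"],
}
conflicts = find_multi_group_vms(planned_groups)
print(conflicts)
```

Any VM that shows up in the result needs to be assigned to a single group before the design will work.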

vSphere Replication Protection Groups

  • Group VMs as desired into PGs
  • Storage they are located on doesn’t matter
  • Replicates only deltas, option for compression

Array-based PGs

  • Protect all the VMs on a datastore, not granular

Storage Policy PG

  • new style PG leveraging storage policies
  • high level of automation compared to traditional protection groups
  • policy based approach reduces OpEx
  • simpler integration of VM provisioning, migration and decommissioning

How should you organise your Protection Groups?

  • Majority of outages are partial (not the entire DC)
  • Design accordingly
  • Failover only what is needed
  • These requirements vary by customer – everyone’s different

Fewer PGs = smaller RTO = less flexibility

Topologies

  • Active-passive failover
  • Active-active failover
  • Bi-directional failover
  • Multi-site

Shared Protection and Shared Recovery Site

(Image: SRM shared protection and shared recovery site topology.) vCenter and SRM are deployed at your remote sites.

Enhanced topology site

(Image: SRM shared recovery site with a central vCenter.)

(Image: SRM three-site configuration.) Remember that each VM can be protected and replicated only once.

SRM and Stretched Storage

(Image: SRM with stretched storage.)

  • Best of both worlds
  • zero downtime with orchestrated cross-VC vMotion
  • non-disruptive testing

 

SRM & Virtual SAN Stretched Cluster

  • VSAN 6.1 – stretched cluster support
  • Uses witness host (should be outside VSAN stretched cluster but accessible)

Enhanced Linked Mode

  • Join PSC nodes in a single SSO domain
  • you need this to do cross VC vMotion on top of stretched storage
  • Central point of management for vCenters in the SSO domain

Multi-site guidelines

  • keep it simple
  • each VM only protected once
  • each VM only replicated once
  • Use enhanced linked mode

 

Impacts to RTO

Decision Time

How long does it take to decide to fail over? Decision time plus failover time = RTO.
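
As a worked example of that sum, here is a minimal sketch with made-up numbers. The function name and the figures are illustrative assumptions, not values from the session.

```python
def estimated_rto_minutes(decision_minutes, failover_minutes):
    """RTO as described above: decision time plus failover time."""
    return decision_minutes + failover_minutes


# Hypothetical figures: a 45-minute failover looks fine against a
# 2-hour target until a 90-minute decision process is added.
target_rto = 120
rto = estimated_rto_minutes(decision_minutes=90, failover_minutes=45)
print(rto, "minutes; meets target:", rto <= target_rto)
```

In this made-up case the automation is not the bottleneck, the decision process is.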

IP Customisation

This adds to the time it takes to recover (guest shutdown, customise IP, guest startup and wait for VMware Tools).

Alternatives? Stretched Layer-2 (eww), Move VLAN/Subnet

Priorities and Dependencies

Priorities only – think about how you need to organise things to achieve the goal that you want
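
To make priority-only ordering concrete, here is a rough sketch that batches VMs by priority group, assuming groups numbered 1 (started first) to 5 (started last). The VM names and priority assignments are hypothetical, and this is planning code rather than anything that talks to SRM.

```python
from itertools import groupby


def start_batches(vm_priorities):
    """Group VMs into start-up batches, lowest priority number first.

    `vm_priorities` maps VM name -> priority group (1 starts first, 5 last).
    """
    ordered = sorted(vm_priorities.items(), key=lambda item: (item[1], item[0]))
    return [
        (priority, [name for name, _ in members])
        for priority, members in groupby(ordered, key=lambda item: item[1])
    ]


# Illustrative VM names and priorities.
plan = {"web01": 3, "db01": 1, "app01": 2, "web02": 3, "dc01": 1}
batches = start_batches(plan)
for priority, vms in batches:
    print(f"priority {priority}: {vms}")
```

Laying the plan out like this shows quickly whether too many VMs are queued behind one slow batch.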

Organisation for lower RTO

  • Fewer/larger NFS datastores/LUNs
  • Fewer Protection Groups (depending on your business goals)
  • Don’t replicate VM swap files
  • Fewer Recovery Plans

VM Configuration

  • VMware Tools installed in all VMs – if it’s missing, recovery will time out waiting for the heartbeat
  • Suspend VMs on recovery – do this as part of a recovery plan
  • PowerOff VMs – you can do this as part of a recovery plan as well

Recovery Site

  • vCenter sizing – it works harder than you think
  • Number of hosts – more is better
  • Enable DRS – why wouldn’t you?
  • Different recovery plans target different clusters

More good stuff

  • script timeouts – make use of them
  • Install and admin guides – good source for best practices
  • Separate vCenter and SRM DBs

 

Recommendations

Be clear with your business

What is/are their

  • RPOs?
  • RTOs?
  • Cost of downtime?
  • Application priorities?
  • Units of failover?
  • Externalities?

Do you have

  • Executive buy-in?

Service Level Agreements

  • Do you have documented SLAs?
  • Do your SLAs clearly communicate the RPO, RTO and availability of your service tiers?
  • Are your SLA documents readily available to everyone in the company?

Recommendations

  • Use Service Tiers
  • Minimal requirements/decisions
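
One way to picture service tiers is as a small lookup table that every workload is checked against. The sketch below is only an illustration: the tier names and the RPO/RTO values are invented, and the real numbers have to come from the business.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceTier:
    name: str
    rpo_minutes: int
    rto_minutes: int


# Hypothetical tiers; real values come from the business.
TIERS = {
    "gold": ServiceTier("gold", rpo_minutes=15, rto_minutes=60),
    "silver": ServiceTier("silver", rpo_minutes=240, rto_minutes=480),
    "bronze": ServiceTier("bronze", rpo_minutes=1440, rto_minutes=2880),
}


def meets_tier(tier_name, measured_rpo_minutes, measured_rto_minutes):
    """True if a measured test result is within the tier's RPO and RTO."""
    tier = TIERS[tier_name]
    return (measured_rpo_minutes <= tier.rpo_minutes
            and measured_rto_minutes <= tier.rto_minutes)


# A test run that replicated within 10 minutes but took 75 minutes to recover.
print(meets_tier("gold", measured_rpo_minutes=10, measured_rto_minutes=75))
```

With only a few tiers, each new application needs one decision (which tier?) rather than a fresh RPO/RTO negotiation.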

vRealize Infrastructure Navigator – good for mapping dependencies

Risk with infrequent DR Plan Testing

  • Parallel and cutover tests provide the best verification, but are very resource intensive and time consuming
  • Cutover tests are disruptive, may take days to complete and leave the business at risk

Frequent DR testing reduces risk

  • Increased confidence that the plan will work
  • Recovery can be tested at any time without impact to production

Test network

  • Use VLAN or isolated network for test environment
    • default “Auto” setting does not allow VM communication between hosts
  • Different PortGroup can be specified in SRM for test vs actual run
    • specified in Network Mapping and/or Recovery Plan

SRM & Cross vCenter NSX 6.2
Test Network – Multiple Options

A lot of stuff can now be done via API

  • Don’t protect everything
  • Apps that provide their own protection (e.g. Active Directory)
  • Locate VMs where they do their work

Demos

Q&A

Great session and a highlight of the conference (but I do loves me some SRM). 5 stars.