VMware – VMworld 2017 – STO2063BU – Architecting Site Recovery Manager to Meet Your Recovery Goals

Disclaimer: I recently attended VMworld 2017 – US.  My flights were paid for by ActualTech Media, VMware provided me with a free pass to the conference and various bits of swag, and Tech Field Day picked up my hotel costs. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.

Here are my rough notes from “STO2063BU – Architecting Site Recovery Manager to Meet Your Recovery Goals” presented by GS Khalsa. You can grab a PDF version of them from here. It’s mainly bullet points, but I’m sure you know the drill.

 

Terminology

  • RPO (Recovery Point Objective) – Last viable restore point
  • RTO (Recovery Time Objective) – How long it will take before all functionality is recovered

You should break these down to an application or service tier level.
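As a rough illustration of what a per-tier breakdown could look like, the sketch below records an RPO and RTO for a few service tiers and checks a measured recovery against them. The tier names and all of the numbers are hypothetical and were not part of the session; real values come from the business.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    """Recovery targets for one service tier (hypothetical values)."""
    rpo: timedelta  # maximum tolerable age of the last viable restore point
    rto: timedelta  # maximum tolerable time until functionality is recovered


# Hypothetical tiers and targets, for illustration only.
OBJECTIVES = {
    "tier1": RecoveryObjective(rpo=timedelta(minutes=15), rto=timedelta(hours=1)),
    "tier2": RecoveryObjective(rpo=timedelta(hours=4), rto=timedelta(hours=8)),
    "tier3": RecoveryObjective(rpo=timedelta(hours=24), rto=timedelta(hours=48)),
}


def meets_objective(tier: str, data_loss: timedelta, downtime: timedelta) -> bool:
    """Return True if a measured recovery stayed within the tier's RPO and RTO."""
    target = OBJECTIVES[tier]
    return data_loss <= target.rpo and downtime <= target.rto
```

Feeding the results of each DR test into a check like this makes it obvious which tiers are missing their targets.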

 

Protection Groups and Recovery Plans

What is a Protection Group?

Group of VMs that will be recovered together

  • Application
  • Department
  • System Type
  • ?

Different depending on replication type

A VM can only belong to one Protection Group
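Because a VM can only sit in one Protection Group, it can be worth sanity-checking a planned layout before building it. The sketch below is a standalone check over a plain dictionary; it does not talk to SRM, and the group and VM names are made up for the example.

```python
from collections import defaultdict


def find_duplicate_membership(groups: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each VM that appears in more than one group to the groups it appears in."""
    membership = defaultdict(list)
    for group_name, vms in groups.items():
        for vm in set(vms):  # ignore repeats within the same group
            membership[vm].append(group_name)
    return {vm: sorted(names) for vm, names in membership.items() if len(names) > 1}


# Hypothetical planned layout: fin-db01 has been placed in two groups.
planned = {
    "pg-finance": ["fin-app01", "fin-db01"],
    "pg-databases": ["fin-db01", "hr-db01"],
    "pg-hr": ["hr-app01"],
}
print(find_duplicate_membership(planned))
# prints {'fin-db01': ['pg-databases', 'pg-finance']}
```

An empty result means every VM in the plan belongs to exactly one group.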

 

How do protection groups fit into Recovery Plans?

[Image via https://blogs.vmware.com/vsphere/2015/05/srm-protection-group-design.html]

 

vSphere Replication Protection Groups

  • Group VMs as desired into Protection Groups
  • What storage they are located on doesn’t matter

 

Array Based Protection Groups

If you want your protection groups to align to your applications, you’ll need to shuffle storage around so that each application’s VMs share the same replicated datastores

 

Policy Driven Protection

  • New style Protection Group leveraging storage profiles
  • High level of automation compared to traditional protection groups
  • Policy based approach reduces OpEx
  • Simpler integration of VM provisioning, migration, and decommissioning

This was introduced in SRM 6.1

 

How Should You Organise Your Protection Groups?

More Protection Groups

  • Higher RTO
  • Easier testing
  • Only what is needed
  • More granular and complex

Fewer Protection Groups

  • Lower RTO
  • Less granular, complex and flexible

This varies by customer and will be dictated by the appropriate combination of complexity and flexibility

 

Topologies

SRM Supports Multiple DR Topologies

Active-Passive Failover

  • Dedicated resources for recovery

Active-Active Failover

  • Run low priority apps on recovery infrastructure

Bi-directional Failover

  • Production applications at both sites
  • Each site acts as the recovery site for the other

Multi-site

  • Many-to-one failover
  • Useful for Remote Office / Branch Office

 

Enhanced Topology Support

There are a few different topologies that are supported.

[Images via https://blogs.vmware.com/virtualblocks/2016/07/28/srm-multisite/]

  • 10 SRM pairs per vCenter
  • A VM can only be protected once

 

SRM & Stretched Storage

[Image via https://blogs.vmware.com/virtualblocks/2015/09/01/srm-6-1-whats-new/]

Supported as of SRM 6.1

 

SRM and vSAN Stretched Cluster

[Image via https://blogs.vmware.com/virtualblocks/2015/08/31/whats-new-vmware-virtual-san-6-1/]

Failover is to a third site (not one of the two sites comprising the cluster)

 

Enhanced Linked Mode

You can find more information on Enhanced Linked Mode here. It makes it easier to manage your environment and was introduced in vSphere 6.0.

 

Impacts to RTO

Decision Time

How long does it take to decide to fail over?

 

IP Customisation

Workflow without customisation

  • Power on VM and wait for VMware Tools heartbeats

Workflow with IP customisation

  • Power on VM with network disconnected
  • Customise IP utilising VMware Tools
  • Power off VM
  • Power on VM and wait for VMware Tools heartbeats

Alternatives

  • Stretched Layer 2
  • Move VLAN / Subnet

IP customisation adds time to the failover of every guest it is applied to
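To get a feel for how the extra power cycle adds up, here is a back-of-the-envelope sketch that compares the two workflows. The per-step durations are invented placeholders, and the idea that VMs recover in equal-sized waves is a simplifying assumption of mine, not a description of how SRM schedules work. Measure your own environment before trusting any number.

```python
# Hypothetical per-step durations in seconds; replace with your own measurements.
STEP_SECONDS = {
    "power_on_disconnected": 30,
    "customise_ip": 60,
    "power_off": 20,
    "power_on_wait_heartbeat": 90,
}

WITHOUT_CUSTOMISATION = ["power_on_wait_heartbeat"]
WITH_CUSTOMISATION = [
    "power_on_disconnected",
    "customise_ip",
    "power_off",
    "power_on_wait_heartbeat",
]


def per_vm_seconds(workflow: list[str]) -> int:
    """Total time for one VM to complete the given workflow."""
    return sum(STEP_SECONDS[step] for step in workflow)


def extra_seconds(vm_count: int, concurrency: int) -> int:
    """Added recovery time from IP customisation.

    Assumes (as a simplification) that VMs are recovered in waves of
    `concurrency` VMs and that each wave finishes before the next starts.
    """
    waves = -(-vm_count // concurrency)  # ceiling division
    extra_per_vm = per_vm_seconds(WITH_CUSTOMISATION) - per_vm_seconds(WITHOUT_CUSTOMISATION)
    return waves * extra_per_vm


print(extra_seconds(vm_count=100, concurrency=10))  # prints 1100
```

With these placeholder numbers, customising 100 guests adds roughly 18 minutes, which is the kind of overhead that makes stretched Layer 2 or moving the VLAN / subnet worth considering.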

 

Priorities and Dependencies vs Priorities Only

Organisation for lower RTO

  • Fewer / larger NFS datastores / LUNs
  • Fewer protection groups
  • Don’t replicate VM swap files
  • Fewer recovery plans

 

VM Configuration

  • VMware Tools installed in all VMs
  • Suspend VMs on Recovery vs PowerOff VMs
  • Array-based replication vs vSphere Replication

 

Recovery Site Sizing

  • vCenter sizing – it works harder than you think
  • Number of hosts – more is better
  • Enable DRS – why wouldn’t you?
  • Different recovery plans target different clusters

 

Recommendations

Be Clear with the Business

What is / are their

  • RPOs?
  • RTOs?
  • Cost of downtime?
  • Application priorities?
  • Units of failover?
  • Externalities?

Do you have Executive buy-in?

 

Risk with Infrequent DR Plan Testing

  • Parallel and cutover tests provide the best verification, but are very resource intensive and time consuming
  • Cutover tests are disruptive, may take days to complete and leave the business at risk

 

Frequent DR Testing Reduces Risk

  • Increased confidence that the plan will work
  • Recovery can be tested at any time without impact to production

 

Test Network

Use VLAN or isolated network for test environment

  • Default “auto” setting does not allow VM communication between hosts

A different PortGroup can be specified in SRM for a test vs an actual run

  • Specified in Network Mapping and / or Recovery Plan

 

Test Network – Multiple Options

Two Options

  • Disconnect NSX Uplink (this can be easily scripted)
  • Use NSX to create duplicate “Test” networks

RTO = dollars

*Demos

 

Conclusion and Further Reading

I enjoy these kinds of sessions, as they provide a nice overview of the product capabilities that tie in well with business requirements. SRM is a pretty neat solution, and something you might consider using if you need to move workload from one DC to another. If you’re after a technical overview of Site Recovery Manager 6.5, this site is pretty good too. 4.5 stars.