Disclaimer: I recently attended VMworld 2017 – US. My flights were paid for by ActualTech Media, VMware provided me with a free pass to the conference and various bits of swag, and Tech Field Day picked up my hotel costs. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event. Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.
Here are my rough notes from “STO2063BU – Architecting Site Recovery Manager to Meet Your Recovery Goals” presented by GS Khalsa. You can grab a PDF version of them from here. It’s mainly bullet points, but I’m sure you know the drill.
Terminology
- RPO (Recovery Point Objective) – Last viable restore point, i.e. how much data you can afford to lose
- RTO (Recovery Time Objective) – How long it will take before all functionality is recovered
You should break these down to an application, or a service tier level.
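To make the per-tier idea a bit more concrete, here's a minimal sketch of recording RPO / RTO targets per service tier and checking a test result against them. The tier names and numbers are made up for illustration and weren't part of the session.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    rpo: timedelta  # maximum acceptable data loss, expressed as time
    rto: timedelta  # maximum acceptable time to restore functionality


# Hypothetical tiers and targets; agree the real ones with the business.
TARGETS = {
    "tier-1": RecoveryTarget(rpo=timedelta(minutes=15), rto=timedelta(hours=1)),
    "tier-2": RecoveryTarget(rpo=timedelta(hours=4), rto=timedelta(hours=8)),
}


def meets_target(tier, data_loss, recovery_time):
    """Return True if a measured recovery stays within the tier's RPO and RTO."""
    target = TARGETS[tier]
    return data_loss <= target.rpo and recovery_time <= target.rto


# A recovery that lost 10 minutes of data and took 45 minutes to complete.
print(meets_target("tier-1", timedelta(minutes=10), timedelta(minutes=45)))
```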
Protection Groups and Recovery Plans
What is a Protection Group?
Group of VMs that will be recovered together
- Application
- Department
- System Type
- ?
Different depending on replication type
A VM can only belong to one Protection Group
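If you're planning protection groups on paper (or in a spreadsheet) before building them, the one-group-per-VM rule is easy to sanity check. This is a rough sketch that works on a plain dictionary of planned groups. It doesn't talk to SRM, and the group and VM names are invented.

```python
from collections import defaultdict


def find_multi_group_vms(protection_groups):
    """Return {vm: [groups]} for every VM planned into more than one group."""
    membership = defaultdict(list)
    for group, vms in protection_groups.items():
        for vm in vms:
            membership[vm].append(group)
    return {
        vm: sorted(groups)
        for vm, groups in membership.items()
        if len(groups) > 1
    }


# Hypothetical plan: one group per application, one per department.
planned_groups = {
    "payroll-app": ["pay-web01", "pay-db01"],
    "finance-dept": ["fin-app01", "pay-db01"],
}

conflicts = find_multi_group_vms(planned_groups)
print(conflicts)  # {'pay-db01': ['finance-dept', 'payroll-app']}
```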
How do protection groups fit into Recovery Plans?
[Image via https://blogs.vmware.com/vsphere/2015/05/srm-protection-group-design.html]
vSphere Replication Protection Groups
- Group VMs as desired into Protection Groups
- What storage they are located on doesn’t matter
Array Based Protection Groups
If you want your protection groups to align to your applications – you’ll need to shuffle storage around
Policy Driven Protection
- New style Protection Group leveraging storage profiles
- High level of automation compared to traditional protection groups
- Policy based approach reduces OpEx
- Simpler integration of VM provisioning, migration, and decommissioning
This was introduced in SRM 6.1
How Should You Organise Your Protection Groups?
More Protection Groups
- Higher RTO
- Easier testing
- Only what is needed
- More granular and complex
Fewer Protection Groups
- Lower RTO
- Less granular, complex and flexible
This varies by customer and will be dictated by the appropriate combination of complexity and flexibility
Topologies
SRM Supports Multiple DR Topologies
Active-Passive Failover
- Dedicated resources for recovery
Active-Active Failover
- Run low priority apps on recovery infrastructure
Bi-directional Failover
- Production applications at both sites
- Each site acts as the recovery site for the other
Multi-site
- Many-to-one failover
- Useful for Remote Office / Branch Office
Enhanced Topology Support
There are a few different topologies that are supported.
[Images via https://blogs.vmware.com/virtualblocks/2016/07/28/srm-multisite/]
- 10 SRM pairs per vCenter
- A VM can only be protected once
SRM & Stretched Storage
[Image via https://blogs.vmware.com/virtualblocks/2015/09/01/srm-6-1-whats-new/]
Supported as of SRM 6.1
SRM and vSAN Stretched Cluster
[Image via https://blogs.vmware.com/virtualblocks/2015/08/31/whats-new-vmware-virtual-san-6-1/]
Failover to the third site (not the 2 sites comprising the cluster)
Enhanced Linked Mode
You can find more information on Enhanced Linked Mode here. It makes it easier to manage your environment and was introduced in vSphere 6.0.
Impacts to RTO
Decision Time
How long does it take to decide to fail over?
IP Customisation
Workflow without customisation
- Power on VM and wait for VMware Tools heartbeats
Workflow with IP customisation
- Power on VM with network disconnected
- Customise IP utilising VMware Tools
- Power off VM
- Power on VM and wait for VMware Tools heartbeats
Alternatives
- Stretched Layer 2
- Move VLAN / Subnet
IP customisation adds time to each guest’s failover, since the VM has to be powered on twice.
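To get a feel for the difference between the two workflows above, here's a back-of-the-envelope sketch that adds up the steps for each. The step names are my own labels and the durations are placeholders, not figures from the session, so measure your own environment before relying on anything like this.

```python
# Steps for each workflow, in order. The names are labels for this sketch only.
PLAIN_STEPS = ["power_on_wait_heartbeat"]
CUSTOMISED_STEPS = [
    "power_on_disconnected",
    "customise_ip",
    "power_off",
    "power_on_wait_heartbeat",
]

# Placeholder durations in seconds (assumed, not measured).
durations = {
    "power_on_wait_heartbeat": 90,
    "power_on_disconnected": 60,
    "customise_ip": 30,
    "power_off": 20,
}


def per_vm_seconds(steps, step_durations):
    """Sum the time a single VM spends in the given workflow."""
    return sum(step_durations[step] for step in steps)


plain = per_vm_seconds(PLAIN_STEPS, durations)
customised = per_vm_seconds(CUSTOMISED_STEPS, durations)
print(f"Without customisation: {plain}s per VM")
print(f"With IP customisation: {customised}s per VM")
print(f"Extra time per VM: {customised - plain}s")
```

This ignores any parallelism in the recovery plan, so treat the per-VM overhead as an upper bound on what each guest adds rather than a prediction of total recovery time.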
Priorities and Dependencies vs Priorities Only
Organisation for lower RTO
- Fewer / larger NFS datastore / LUNs
- Fewer protection groups
- Don’t replicate VM swap files
- Fewer recovery plans
VM Configuration
- VMware Tools installed in all VMs
- Suspend VMs on Recovery vs PowerOff VMs
- Array-based replication vs vSphere Replication
Recovery Site Sizing
- vCenter sizing – it works harder than you think
- Number of hosts – more is better
- Enable DRS – why wouldn’t you?
- Different recovery plans target different clusters
Recommendations
Be Clear with the Business
What is / are their
- RPOs?
- RTOs?
- Cost of downtime?
- Application priorities?
- Units of failover?
- Externalities?
Do you have Executive buy-in?
Risk with Infrequent DR Plan Testing
- Parallel and cutover tests provide the best verification, but are very resource intensive and time consuming
- Cutover tests are disruptive, may take days to complete and leave the business at risk
Frequent DR Testing Reduces Risk
- Increased confidence that the plan will work
- Recovery can be tested at any time without impact to production
Test Network
Use VLAN or isolated network for test environment
- Default “auto” setting does not allow VM communication between hosts
Different PortGroup can be specified in SRM for test vs actual run
- Specified in Network Mapping and / or Recovery Plan
Test Network – Multiple Options
Two Options
- Disconnect NSX Uplink (this can be easily scripted)
- Use NSX to create duplicate “Test” networks
RTO = dollars
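One simple way to read that last point is to multiply the outage length by what an hour of downtime costs the business. A minimal sketch, assuming a flat hourly cost (real downtime costs are rarely that linear), with made-up numbers:

```python
def downtime_cost(rto_hours, cost_per_hour):
    """Estimate the cost of an outage lasting as long as the RTO."""
    return rto_hours * cost_per_hour


# Hypothetical figures: a 4 hour RTO versus a 1 hour RTO at $10,000 per hour.
slow = downtime_cost(4, 10_000)
fast = downtime_cost(1, 10_000)
print(f"4 hour RTO: ${slow:,}")
print(f"1 hour RTO: ${fast:,}")
print(f"Difference: ${slow - fast:,}")
```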
*Demos
Conclusion and Further Reading
I enjoy these kinds of sessions, as they provide a nice overview of the product capabilities that ties in well with business requirements. SRM is a pretty neat solution, and something you might consider using if you need to move workloads from one DC to another. If you’re after a technical overview of Site Recovery Manager 6.5, this site is pretty good too. 4.5 stars.