Disclaimer: I recently attended VMworld 2016 – US. My flights were paid for by myself, VMware provided me with a free pass to the conference and various bits of swag, and Tech Field Day picked up my hotel costs. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event. Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.
Here are my rough notes from “STO7973 – Architecting Site Recovery Manager to Meet Your Recovery Goals” presented by GS Khalsa and Ivan Jordanov. This was a session I’d been looking forward to all week and it didn’t disappoint.
- Protection Groups and Recovery Plans
- Impacts to RTO
- DR – Disaster Recovery
- RPO – how much data you’re going to lose
- RTO – how long does it take you to get back up and running
Protection Groups and Recovery Plans
What is a protection group?
- Group of VMs that will be recovered together
- System type
- Different depending on replication type
- A VM can only belong to one Protection Group
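The one-group-per-VM rule is easy to check against an inventory export before you start building groups. This is a minimal sketch only: the group and VM names are made up, and the input is a plain dictionary rather than anything pulled from SRM itself.

```python
from collections import defaultdict


def vms_in_multiple_groups(groups):
    """Return {vm: [group, ...]} for every VM listed in more than one group.

    `groups` maps a protection group name to the list of VM names in it.
    The structure is hypothetical; it is not an SRM data format.
    """
    membership = defaultdict(list)
    for group_name, vm_names in groups.items():
        for vm in vm_names:
            membership[vm].append(group_name)
    return {vm: pgs for vm, pgs in membership.items() if len(pgs) > 1}


# Hypothetical planning data
planned_groups = {
    "pg-web": ["web01", "web02"],
    "pg-db": ["db01", "web02"],  # web02 is listed twice by mistake
}

for vm, pgs in vms_in_multiple_groups(planned_groups).items():
    print(f"{vm} appears in more than one protection group: {', '.join(pgs)}")
```

An empty result means every VM in the plan sits in exactly one group.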
vSphere Replication Protection Groups
- Group VMs as desired into PGs
- Storage they are located on doesn’t matter
- Replicates only deltas, option for compression
- Array-based replication PGs, by contrast, protect all the VMs on a datastore, not granular
Storage Policy PG
- new style PG leveraging storage policies
- high level of automation compared to traditional protection groups
- policy based approach reduces OpEx
- simpler integration of VM provisioning, migration and decommissioning
How should you organise your Protection Groups?
- Majority of outages are partial (not the entire DC)
- Design accordingly
- Failover only what is needed
- These requirements vary by customer – everyone’s different
Fewer PGs = smaller RTO = less flexibility
- Active-passive failover
- Active-active failover
- Bi-directional failover
Shared Protection and Shared Recovery Site
vCenter and SRM deployed at your remote sites
Enhanced topology site
Remember that each VM can be protected and replicated only once.
SRM and Stretched Storage
- Best of both worlds
- zero downtime with orchestrated cross-VC vMotion
- non-disruptive testing
SRM & Virtual SAN Stretched Cluster
- VSAN 6.1 – stretched cluster support
- Uses witness host (should be outside VSAN stretched cluster but accessible)
Enhanced Linked Mode
- Join PSC nodes in a single SSO domain
- you need this to do cross VC vMotion on top of stretched storage
- Central point of management for vCenters in the SSO domain
- keep it simple
- each VM only protected once
- each VM only replicated once
- Use enhanced linked mode
Impacts to RTO
How long does it take to decide to fail over? Decision time plus failover time = RTO
IP customisation adds to the time it takes to recover (guest shutdown, customise IP, guest startup and wait for VMware Tools)
Alternatives? Stretched Layer-2 (eww), Move VLAN/Subnet
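To make the "decision time plus failover time = RTO" arithmetic concrete, here is a rough sketch. Every duration below is a placeholder I made up, and the batching model (VMs recovered in equal-sized waves) is a simplifying assumption of mine, not how SRM schedules work. The four overhead steps are the ones listed in the notes above.

```python
from datetime import timedelta

# Hypothetical base time to power on one VM with no customisation.
BASE_POWER_ON = timedelta(seconds=120)

# Extra per-VM steps from the notes above. Durations are placeholders;
# measure your own from test recoveries.
EXTRA_STEPS = {
    "guest_shutdown": timedelta(seconds=60),
    "customise_ip": timedelta(seconds=90),
    "guest_startup": timedelta(seconds=120),
    "wait_for_tools": timedelta(seconds=60),
}


def failover_time(vm_count, concurrent_vms, with_extra_steps=True):
    """Crude estimate: VMs are recovered in waves of `concurrent_vms`."""
    per_vm = BASE_POWER_ON
    if with_extra_steps:
        per_vm += sum(EXTRA_STEPS.values(), timedelta())
    waves = -(-vm_count // concurrent_vms)  # ceiling division
    return per_vm * waves


def rto(decision_time, failover):
    """RTO as defined in the session: decision time plus failover time."""
    return decision_time + failover


decision = timedelta(minutes=30)  # hypothetical time to declare a disaster
print("With extra steps:   ", rto(decision, failover_time(10, 5)))
print("Without extra steps:", rto(decision, failover_time(10, 5, with_extra_steps=False)))
```

Even with invented numbers, the point from the session holds: the decision time is part of the RTO, and per-VM overhead multiplies quickly across a recovery plan.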
Priorities and Dependencies
Priorities only – think about how you need to organise things to achieve the goal that you want
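One way to reason about priorities and dependencies on paper is to work out a start order before you build the recovery plan. The sketch below assumes numbered priority tiers where lower numbers start first, and that a dependency only matters between VMs in the same tier (that is my understanding of how SRM treats them, so check the admin guide). The VM names and the input format are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def startup_order(vms):
    """Return VM names in a valid start order.

    `vms` maps a VM name to (priority, [names of VMs it depends on]).
    Tiers are processed lowest number first. Dependencies on VMs in another
    tier are ignored here, on the assumption that the tier order already
    covers them.
    """
    order = []
    for priority in sorted({prio for prio, _ in vms.values()}):
        tier = {
            name: [d for d in deps if d in vms and vms[d][0] == priority]
            for name, (prio, deps) in vms.items()
            if prio == priority
        }
        # static_order() yields each VM after the VMs it depends on
        order.extend(TopologicalSorter(tier).static_order())
    return order


# Hypothetical three-tier application plus infrastructure services
plan = {
    "web01": (2, ["app01"]),
    "app01": (2, ["db01"]),
    "db01": (1, []),
    "dns01": (1, []),
}
print(startup_order(plan))
```

Using priorities only, as suggested above, corresponds to leaving every dependency list empty, which keeps the plan simpler to reason about.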
Organisation for lower RTO
- Fewer/larger NFS datastores/LUNs
- Fewer Protection Groups (depending on your business goals)
- Don’t replicate VM swap files
- Fewer Recovery Plans
- VMware Tools installed in all VMs – if it isn’t installed, recovery will time out waiting for the heartbeat
- Suspend VMs on recovery – do this as part of a recovery plan
- PowerOff VMs – you can do this as part of a recovery plan as well
- vCenter sizing – it works harder than you think
- Number of hosts – more is better
- Enable DRS – why wouldn’t you?
- Different recovery plans target different clusters
More good stuff
- script timeouts – make use of them
- Install and admin guides – good source for best practices
- Separate vCenter and SRM DBs
Be clear with your business
What is/are their
- Cost of downtime?
- Application priorities?
- Units of failover?
Do you have
- Executive buy-in?
Service Level Agreements
- Do you have documented SLAs?
- Do your SLAs clearly communicate the RPO, RTO and availability of your service tiers?
- Are your SLA documents readily available to everyone in the company?
- Use Service Tiers
- Minimal requirements/decisions
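If you do go down the service tier path, it can help to write the tiers down as data so that DR test results can be checked against them. This is only a sketch: the tier names and every RPO, RTO and availability figure are invented for illustration, not taken from the session.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ServiceTier:
    name: str
    rpo: timedelta           # maximum tolerable data loss
    rto: timedelta           # maximum tolerable time to recover
    availability_pct: float  # availability target for the tier


# Hypothetical tiers; agree the real numbers with the business.
TIERS = {
    "gold": ServiceTier("gold", timedelta(minutes=15), timedelta(hours=1), 99.99),
    "silver": ServiceTier("silver", timedelta(hours=4), timedelta(hours=8), 99.9),
    "bronze": ServiceTier("bronze", timedelta(hours=24), timedelta(hours=48), 99.0),
}


def meets_sla(tier, measured_data_loss, measured_recovery_time):
    """True if a DR test result is within the tier's RPO and RTO."""
    return measured_data_loss <= tier.rpo and measured_recovery_time <= tier.rto


# Example: a test recovery lost 10 minutes of data and took 90 minutes
result = meets_sla(TIERS["gold"], timedelta(minutes=10), timedelta(minutes=90))
print("gold tier met" if result else "gold tier missed")
```

Keeping the list of tiers short matches the "minimal requirements/decisions" point: each application only needs to be assigned to one tier.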
vRealize Infrastructure Navigator – good for mapping dependencies
Risk with infrequent DR Plan Testing
- Parallel and cutover tests provide the best verification but are very resource intensive and time consuming
- Cutover tests are disruptive, may take days to complete and leave the business at risk
Frequent DR testing reduces risk
- Increased confidence that the plan will work
- Recovery can be tested at any time without impact to production
- Use VLAN or isolated network for test environment
- default “Auto” setting does not allow VM communication between hosts
- Different PortGroup can be specified in SRM for test vs actual run
- specified in Network Mapping and/or Recovery Plan
SRM & Cross vCenter NSX 6.2
Test Network – Multiple Options
A lot of stuff can now be done via API
- Don’t protect everything
- Apps that provide their own protection (e.g. Active Directory)
- Locate VMs where they do their work
Great session and a highlight of the conference (but I do loves me some SRM). 5 stars.