VMware – VMworld 2017 – STO2063BU – Architecting Site Recovery Manager to Meet Your Recovery Goals

Disclaimer: I recently attended VMworld 2017 – US.  My flights were paid for by ActualTech Media, VMware provided me with a free pass to the conference and various bits of swag, and Tech Field Day picked up my hotel costs. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.

Here are my rough notes from “STO2063BU – Architecting Site Recovery Manager to Meet Your Recovery Goals” presented by GS Khalsa. You can grab a PDF version of them from here. It’s mainly bullet points, but I’m sure you know the drill.

 

Terminology

  • RPO (Recovery Point Objective) – the last viable restore point, i.e. how much data you can afford to lose
  • RTO (Recovery Time Objective) – how long it will take before all functionality is recovered

You should break these down to an application or service-tier level.

 

Protection Groups and Recovery Plans

What is a Protection Group?

A group of VMs that will be recovered together, for example by:

  • Application
  • Department
  • System Type
  • ?

Different depending on replication type

A VM can only belong to one Protection Group
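
If you want to see how VMs map to protection groups in an existing environment, the following PowerCLI sketch is one way to do it. It assumes the SRM cmdlets are available (Connect-SrmServer shipped with PowerCLI 6.x); the server name is a placeholder and the API property names are from memory, so verify against your version.

# Minimal sketch: enumerate protection groups and protected VM counts.
Connect-VIServer -Server "vcenter01.example.com"   # placeholder vCenter
$srm = Connect-SrmServer
$srmApi = $srm.ExtensionData
foreach ($pg in $srmApi.Protection.ListProtectionGroups()) {
    $info = $pg.GetInfo()
    Write-Host ("{0}: {1} protected VM(s)" -f $info.Name, $pg.ListProtectedVms().Count)
}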

 

How do protection groups fit into Recovery Plans?

[Image via https://blogs.vmware.com/vsphere/2015/05/srm-protection-group-design.html]

 

vSphere Replication Protection Groups

  • Group VMs as desired into Protection Groups
  • What storage they are located on doesn’t matter

 

Array Based Protection Groups

Array-based protection groups protect every VM on a replicated datastore, so if you want your protection groups to align with your applications, you'll need to shuffle storage around.
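
As a hedged illustration of that shuffling (PowerCLI, with invented VM and datastore names), a Storage vMotion is usually all it takes to land an application's VMs on the replicated datastore backing the right protection group:

# Sketch only: consolidate an application's VMs onto a replicated datastore.
$targetDs = Get-Datastore -Name "Replicated-App01-DS"   # placeholder datastore
Get-VM -Name "app01-*" | Move-VM -Datastore $targetDs   # svMotion the app's VMs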

 

Policy Driven Protection

  • New style Protection Group leveraging storage profiles
  • High level of automation compared to traditional protection groups
  • Policy based approach reduces OpEx
  • Simpler integration of VM provisioning, migration, and decommissioning

This was introduced in SRM 6.1. A rough sketch of the storage policy side follows.
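
As a sketch of what the policy side can look like (assuming the PowerCLI tagging and SPBM cmdlets; every name below is invented), you might tag your replicated datastores and build a policy for the protection group to consume:

# Sketch: tag-based storage policy for policy-driven protection.
New-TagCategory -Name "SRM" -Cardinality Single -EntityType Datastore
$tag = New-Tag -Name "Replicated-Tier1" -Category "SRM"
foreach ($ds in Get-Datastore -Name "Tier1-*") {
    New-TagAssignment -Tag $tag -Entity $ds              # tag replicated datastores
}
New-SpbmStoragePolicy -Name "Tier1-Protected" `
    -AnyOfRuleSets (New-SpbmRuleSet -AllOfRules (New-SpbmRule -AnyOfTags $tag))

VMs provisioned against that policy should then be picked up by the associated protection group automatically.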

 

How Should You Organise Your Protection Groups?

More Protection Groups

  • Higher RTO
  • Easier testing
  • Only what is needed
  • More granular and complex

Fewer Protection Groups

  • Lower RTO
  • Less granular, less complex, and less flexible

This varies by customer and will be dictated by the appropriate combination of complexity and flexibility

 

Topologies

SRM Supports Multiple DR Topologies

Active-Passive Failover

  • Dedicated resources for recovery

Active-Active Failover

  • Run low priority apps on recovery infrastructure

Bi-directional Failover

  • Production applications at both sites
  • Each site acts as the recovery site for the other

Multi-site

  • Many-to-one failover
  • Useful for Remote Office / Branch Office

 

Enhanced Topology Support

There are a few different topologies that are supported.

[Images via https://blogs.vmware.com/virtualblocks/2016/07/28/srm-multisite/]

  • 10 SRM pairs per vCenter
  • A VM can only be protected once

 

SRM & Stretched Storage

[Image via https://blogs.vmware.com/virtualblocks/2015/09/01/srm-6-1-whats-new/]

Supported as of SRM 6.1

 

SRM and vSAN Stretched Cluster

[Image via https://blogs.vmware.com/virtualblocks/2015/08/31/whats-new-vmware-virtual-san-6-1/]

Failover is to the third site (not to either of the two sites comprising the stretched cluster).

 

Enhanced Linked Mode

You can find more information on Enhanced Linked Mode here. It makes it easier to manage your environment and was introduced in vSphere 6.0.

 

Impacts to RTO

Decision Time

How long does it take to decide to fail over?

 

IP Customisation

Workflow without customisation

  • Power on VM and wait for VMware Tools heartbeats

Workflow with IP customisation

  • Power on VM with network disconnected
  • Customise IP utilising VMware Tools
  • Power off VM
  • Power on VM and wait for VMware Tools heartbeats

Alternatives

  • Stretched Layer 2
  • Move VLAN / Subnet

This adds time every time you fail over a guest; one way to gauge the cost is sketched below.
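
A hedged PowerCLI sketch (placeholder VM name) that times the power-on-and-wait-for-Tools step SRM performs twice when IP customisation is in play:

# Sketch: time the "power on and wait for VMware Tools" step for one VM.
$vm = Get-VM -Name "app01"                               # placeholder VM
Measure-Command {
    Start-VM -VM $vm -Confirm:$false | Out-Null
    Wait-Tools -VM $vm -TimeoutSeconds 600 | Out-Null    # wait for Tools heartbeats
}

Multiply that by two power cycles and by the number of VMs in the plan and you can see why people look at the alternatives.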

 

Priorities and Dependencies vs Priorities Only

Organisation for lower RTO

  • Fewer / larger NFS datastores / LUNs
  • Fewer protection groups
  • Don’t replicate VM swap files
  • Fewer recovery plans

 

VM Configuration

  • VMware Tools installed in all VMs
  • Suspend VMs on Recovery vs Power Off VMs
  • Array-based replication vs vSphere Replication

 

Recovery Site Sizing

  • vCenter sizing – it works harder than you think
  • Number of hosts – more is better
  • Enable DRS – why wouldn’t you? (see the sketch after this list)
  • Different recovery plans target different clusters
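
On the DRS point, it's a one-liner to sort out ahead of time (a sketch; the cluster name is a placeholder):

# Sketch: ensure DRS is enabled on the recovery cluster so placement is handled for you.
Get-Cluster -Name "Recovery-Cluster" |
    Set-Cluster -DrsEnabled:$true -DrsAutomationLevel FullyAutomated -Confirm:$false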

 

Recommendations

Be Clear with the Business

What is / are their

  • RPOs?
  • RTOs?
  • Cost of downtime?
  • Application priorities?
  • Units of failover?
  • Externalities?

Do you have Executive buy-in?

 

Risk with Infrequent DR Plan Testing

  • Parallel and cutover tests provide the best verification, but are very resource intensive and time consuming
  • Cutover tests are disruptive, may take days to complete, and leave the business at risk

 

Frequent DR Testing Reduces Risk

  • Increased confidence that the plan will work
  • Recovery can be tested at any time without impact to production (a scripted approach is sketched below)
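
A hedged sketch via the SRM API (the type and method names are from memory of the 6.x API and the plan name is a placeholder, so verify against your environment):

# Sketch: kick off a non-disruptive test of a recovery plan through the SRM API.
$srm = Connect-SrmServer
$plan = $srm.ExtensionData.Recovery.ListPlans() |
    Where-Object { $_.GetInfo().Name -eq "Tier1-Plan" }  # placeholder plan name
$plan.Start([VMware.VimAutomation.Srm.Views.SrmRecoveryPlanRecoveryMode]::Test)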

 

Test Network

Use VLAN or isolated network for test environment

  • Default “auto” setting does not allow VM communication between hosts

Different PortGroup can be specified in SRM for test vs actual run

  • Specified in Network Mapping and / or Recovery Plan (see the sketch below)
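
For example (a sketch with placeholder names and an invented VLAN ID; standard vSwitch assumed), you can carve out a dedicated test port group and point the SRM test network mapping at it:

# Sketch: dedicated test port group on an isolated VLAN for SRM test failovers.
$esx = Get-VMHost -Name "esx01.example.com"              # placeholder host
$vsw = Get-VirtualSwitch -VMHost $esx -Name "vSwitch0"
New-VirtualPortGroup -VirtualSwitch $vsw -Name "SRM-Test-Bubble" -VLanId 999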

 

Test Network – Multiple Options

Two Options

  • Disconnect NSX Uplink (this can be easily scripted)
  • Use NSX to create duplicate “Test” networks

RTO = dollars

Demos

 

Conclusion and Further Reading

I enjoy these kinds of sessions, as they provide a nice overview of product capabilities that ties in well with business requirements. SRM is a pretty neat solution, and something you might consider using if you need to move workloads from one DC to another. If you’re after a technical overview of Site Recovery Manager 6.5, this site is pretty good too. 4.5 stars.

VMware – VMworld 2017 – STO1179BU – Understanding the Availability Features of vSAN

Disclaimer: I recently attended VMworld 2017 – US.  My flights were paid for by ActualTech Media, VMware provided me with a free pass to the conference and various bits of swag, and Tech Field Day picked up my hotel costs. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.

Here are my rough notes from “STO1179BU – Understanding the Availability Features of vSAN”, presented by GS Khalsa (@gurusimran) and Jeff Hunter (@jhuntervmware). You can grab a PDF of the notes from here. Note that these posts don’t provide much in the way of opinion, analysis, or opinionalysis. They’re really just a way of providing you with a snapshot of what I saw. Death by bullet point if you will.

 

Components and Failure

vSAN Objects Consist of Components

VM

  • VM Home – multiple components
  • Virtual Disk – multiple components
  • Swap File – multiple components

vSAN has a cache tier and a capacity tier (objects are stored in the capacity tier).

 

Quorum

Greater than 50% must be online to achieve quorum

  • Each component has one vote by default
  • Odd number of votes required to break tie – preserves data integrity
  • Greater than 50% of components (votes) must be online
  • Components can have more than one vote
  • Votes added by vSAN, if needed, to ensure odd number
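
Worked example: with Primary level of failures to tolerate = 1 and RAID-1, an object gets two data replicas plus a witness component – three votes in total – so any two of the three must stay online for the object to remain accessible.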

 

Component Vote Counts Are Visible Using RVC CLI

/<vcenter>/<datacenter>/vms> vsan.vm_object_info <vm>

 

Storage Policy Determines Component Number and Placement

  • Primary level of failures to tolerate
  • Failure Tolerance Method

Primary level of failures to tolerate = 0 means only one copy.

  • Maximum component size is 255GB
  • vSAN will split larger objects into multiple smaller components
  • RAID-5/6 erasure coding uses stripes and parity (requires all-flash)
  • Consumes less raw capacity than mirroring
  • Number of stripes also affects component counts
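
Worked example of the raw capacity difference: a 100GB VMDK protected with FTT=1 consumes roughly 200GB raw as a RAID-1 mirror, but roughly 133GB as RAID-5 (3 data + 1 parity). At FTT=2, RAID-6 (4 data + 2 parity) consumes roughly 150GB versus 300GB for a three-way mirror.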

 

Each Host is an Implicit Fault Domain

  • Multiple components can end up in the same rack
  • Configure Fault Domains in the UI
  • Add at least one more host or fault domain for rebuilds

 

Component States Change as a Result of a Failure

  • Active
  • Absent
  • Degraded

When recovering from a failure, vSAN selects the most efficient approach. Which is more efficient – repair or rebuild? It depends. Partial repairs are performed if a full repair is not possible.

 

vSAN Maintenance Mode

Three vSAN Options for Host Maintenance Mode

  • Evacuate all data to other hosts
  • Ensure data accessibility from other hosts
  • No data evacuation

 

Degraded Device Handling (DDH) in vSAN 6.6

  • vSAN 6.6 is more “intelligent”, builds on previous versions of DDH
  • When device is degraded, components are evaluated …
  • If component does not belong to last replica, mark as absent – “Lazy” evacuation since another replica of the object exists
  • If component belongs to last replica, start evacuation
  • Degraded devices will not be used for new component placement
  • Evacuation failures reported in UI

 

DDH and S.M.A.R.T.

The following items are logged in vmkernel.log when a drive is identified as unhealthy:

  • Sectors successfully reallocated (0x05)
  • Reported uncorrectable sectors (0xBB)
  • Disk command timeouts (0xBC)
  • Sector reallocation events (0xC4)
  • Pending sector reallocations (0xC5)
  • Uncorrectable sectors (0xC6)

This helps GSS determine what to do with the drive after evacuation.

 

Stretched Clusters

Stretched Cluster Failure Scenarios

  • Extend the idea of fault domains from racks to sites
  • Witness component (tertiary site) – witness host
  • 5ms RTT (around 60 miles)
  • VM will have a preferred and secondary site
  • When a component fails, rebuilding starts on the preferred site

 

Stretched Cluster Local Failure Protection – new in vSAN 6.6

  • Redundancy against host failure and site failure
  • If site fails, vSAN maintains local redundancy in surviving site
  • No change in stretched cluster configuration steps
  • Optimised logic to minimise I/O traffic across sites
    • Local read, local resync
    • Single inter-site write for multiple replicas
  • RAID-1 between the sites, and then RAID-5 in the local sites

What happens during network partition or site failure?

  • HA Restart

Inter-site network disconnected (split brain)

  • HA Power-off

Witness Network Disconnected

  • Witness leaves cluster

VMs continue to operate normally, and it is very simple to redeploy a new witness. The recommended host isolation response in a stretched cluster is to power off.

Witness Host Offline

  • Recover or redeploy witness host

New in 6.6 – change witness host

 

vSAN Backup, Replication and DR

Data Protection

  • vSphere APIs for Data Protection (VADP)
  • Same as any other datastore type (VMFS, etc.)
  • Verify support with backup vendor
  • Production and backup data on vSAN
    • Pros: Simple, rapid restore
    • Cons: Both copies lost if vSAN datastore is lost, can consume considerable capacity

 

Solutions …

  • Store backup data on another datastore
    • SAN or NAS
    • Another vSAN cluster
    • Local drives
  • Dell EMC Avamar and NetWorker
  • Veeam Backup and Replication
  • Cohesity
  • Rubrik
  • Others …

vSphere Replication is included with Essentials Plus Kit and higher. With this you get per-VM RPOs as low as 5 minutes.

 

Automated DR with Site Recovery Manager

  • HA with Stretched Cluster, Automated DR with SRM
  • SRM at the tertiary site

Useful session. 4 stars.

New book on VMware SRM now available

Good news: “Disaster Recovery Using VMware vSphere Replication and vCenter Site Recovery Manager – Second Edition” has just been released via Packt Publishing. It was written by Abhilash G B, and I had the pleasure of serving as the technical reviewer. While SRM 6.5 has just been announced, this is nonetheless a handy manual with some great guidance (and pictures!) on how to effectively use SRM with both array-based and vSphere Replication-based protection. There’s an ebook version available for purchase, with a print copy also available for order.


VMware – VMworld 2016 – STO7973 – Architecting Site Recovery Manager to Meet Your Recovery Goals

Disclaimer: I recently attended VMworld 2016 – US.  My flights were paid for by myself, VMware provided me with a free pass to the conference and various bits of swag, and Tech Field Day picked up my hotel costs. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.


Here are my rough notes from “STO7973 – Architecting Site Recovery Manager to Meet Your Recovery Goals” presented by GS Khalsa and Ivan Jordanov. This was a session I’d been looking forward to all week and it didn’t disappoint.


 

Agenda

  • Protection Groups and Recovery Plans
  • Topologies
  • Impacts to RTO
  • Recommendations
  • Recap

Terminology

  • DR – Disaster Recovery
  • RPO – how much data you’re going to lose
  • RTO – how long does it take you to get back up and running

 

Protection Groups and Recovery Plans

What is a protection group?

  • Group of VMs that will be recovered together
  • Application
  • Department
  • System type
  • ?
  • Different depending on replication type
  • A VM can only belong to one Protection Group

vSphere Replication Protection Groups

  • Group VMs as desired into PGs
  • Storage they are located on doesn’t matter
  • Replicates only deltas, option for compression

Array-based PGs

  • Protect all the VMs on a datastore, not granular

Storage Policy PG

  • New-style PG leveraging storage policies
  • High level of automation compared to traditional protection groups
  • Policy-based approach reduces OpEx
  • Simpler integration of VM provisioning, migration and decommissioning

How should you organise your Protection Groups?

  • Majority of outages are partial (not the entire DC)
  • Design accordingly
  • Failover only what is needed
  • These requirements vary by customer – everyone’s different

Fewer PGs = smaller RTO = less flexibility

Topologies

  • Active-Passive failover
  • Active-Active failover
  • Bi-directional failover
  • Multi-site

Shared Protection and Shared Recovery Site

[Image: SRM shared protection and shared recovery site topology]

vCenter and SRM deployed at your remote sites.

Enhanced topology support

[Image: SRM shared recovery site with central vCenter]

[Image: SRM three-site configuration]

Remember that each VM can be protected and replicated only once.

SRM and Stretched Storage

[Image: SRM with stretched storage]

  • Best of both worlds
  • Zero downtime with orchestrated cross-vCenter vMotion
  • Non-disruptive testing

 

SRM & Virtual SAN Stretched Cluster

  • vSAN 6.1 – stretched cluster support
  • Uses a witness host (should be outside the vSAN stretched cluster but accessible)

Enhanced Linked Mode

  • Join PSC nodes in a single SSO domain
  • You need this to do cross-vCenter vMotion on top of stretched storage
  • Central point of management for vCenters in the SSO domain

Multi-site guidelines

  • Keep it simple
  • Each VM only protected once
  • Each VM only replicated once
  • Use Enhanced Linked Mode

 

Impacts to RTO

Decision Time

How long does it take to decide to fail over? Decision time plus failover time = RTO.

IP Customisation

This will impact the time it takes to recover (guest shutdown, customise IP, guest startup and wait for VMware Tools).

Alternatives? Stretched Layer-2 (eww), Move VLAN/Subnet.

Priorities and Dependencies

Priorities only – think about how you need to organise things to achieve the goal that you want.

Organisation for lower RTO

  • Fewer/larger NFS datastores/LUNs
  • Fewer Protection Groups (depending on your business goals)
  • Don’t replicate VM swap files
  • Fewer Recovery Plans

VM Configuration

  • VMware Tools installed in all VMs – recovery will wait for heartbeats and time out if Tools isn’t installed
  • Suspend VMs on recovery – do this as part of a recovery plan
  • PowerOff VMs – you can do this as part of a recovery plan as well

Recovery Site

  • vCenter sizing – it works harder than you think
  • Number of hosts – more is better
  • Enable DRS – why wouldn’t you?
  • Different recovery plans target different clusters

More good stuff

  • Script timeouts – make use of them
  • Install and admin guides – good source for best practices
  • Separate vCenter and SRM DBs

 

Recommendations

Be clear with your business

What is/are their

  • RPOs?
  • RTOs?
  • Cost of downtime?
  • Application priorities?
  • Units of failover?
  • Externalities?

Do you have

  • Executive buy-in?

Service Level Agreements

  • Do you have documented SLAs?
  • Do your SLAs clearly communicate the RPO, RTO and availability of your service tiers?
  • Are your SLA documents readily available to everyone in the company?

Recommendations

  • Use Service Tiers
  • Minimal requirements/decisions

vRealize Infrastructure Navigator – good for mapping dependencies.

Risk with infrequent DR Plan Testing

  • Parallel and cutover tests provide the best verification, but are very resource intensive and time consuming
  • Cutover tests are disruptive, may take days to complete, and leave the business at risk

Frequent DR testing reduces risk

  • Increased confidence that the plan will work
  • Recovery can be tested at any time without impact to production

Test network

  • Use VLAN or isolated network for test environment
    • default “Auto” setting does not allow VM communication between hosts
  • Different PortGroup can be specified in SRM for test vs actual run
    • specified in Network Mapping and/or Recovery Plan

SRM & Cross vCenter NSX 6.2

Test Network – Multiple Options

A lot of stuff can now be done via API

  • Don’t protect everything
  • Apps that provide their own protection (e.g. Active Directory)
  • Locate VMs where they do their work

Demos

Q&A

Great session and a highlight of the conference (but I do loves me some SRM). 5 stars.

 

Brisbane VMUG – November 2015

The November Brisbane VMUG will be held on Thursday 19th November at EMC’s office in the city (Level 11, 345 Queen Street, Brisbane) from 4 – 6 pm. It’s sponsored by EMC and should be a ripper.

Here’s the agenda:

  • Latest SRM Features and VMworld Highlights – VMware
  • EMC VPLEX Metro with Site Recovery Manager Presentation
  • EMC Demonstration
  • Pizza and Refreshments (I promise)

You can find out more information and register for the event here. I hope to see you there. Also, if you’re interested in sponsoring one of these events, please get in touch with me and I can help make it happen.

VMware – Deploying vSphere Replication 5.8

As part of a recent vSphere 5.5 deployment, I installed a small vSphere Replication 5.8 proof-of-concept for the customer to trial site-to-site replication and get their minds around how they can do some simple DR activities. The appliance is fairly simple to deploy, so I thought I’d just provide a few links to articles that I found useful. Firstly, esxi-guy has a very useful soup-to-nuts post on the steps required to deploy a replication environment, and the steps to recover a VM. You can check it out here. Secondly, here’s a link to the official vSphere Replication documentation in PDF and eBook formats – just the sort of thing you’ll want to read while on the treadmill or sitting on the bus on the way home from the salt mines. Finally, if you’re working in an environment that has a number of firewalls in play, this list of ports you need to open is pretty handy.

One problem we did have was that we’d forgotten what the password was on the appliance we’d deployed at each site. I’m not the greatest cracker in any case, and so we agreed that re-deploying the appliance would be the simplest course of action. So I deleted the VM at each site and went through the “Deploy from OVF” thing again. The only thing of note was that it warned me I had previously deployed a vSphere Replication instance with that name and IP address, and that I should get rid of the stale version. I did that at each site, joined them together again, and was good to go. I’m now trying to convince the customer that SRM might be of some use to them too. But baby steps, right?

Note also that, if you want to deploy additional vSphere Replication VMs to assist with load-balancing in your environment, you need to use the vSphere_Replication_AddOn_OVF10.ovf file for the additional appliances.

VMware – SRM 5.8 – You had one job!

The Problem

A colleague of mine has been doing some data centre failover testing for a customer recently and ran into an issue with VMware’s Site Recovery Manager (SRM) 5.8 running on vSphere 5.5 U2. If you attempt to perform a recovery while running Linked Mode and the protected site is offline, the recovery may fail. The upshot of this is that “the user is unable to perform a recovery at the recovery site, in the event of a DR scenario”. Here’s what it looks like.


 

The Reason and Resolution

You can read more about the problem in this VMware KB article: Performing a Recovery using the Web Client in VMware vCenter Site Recovery Manager 5.8 reports the error: Failed to connect Site Recovery Manager Server(s). In short, there’s a PowerShell script you can run to make the recovery happen.


 

Conclusion

I don’t know what to say about this. I’d like to put the boot into whoever at VMware is responsible for this SNAFU, but I’m guessing they’ve already had a hard time of it. At least, I guess, there’s a workaround, if not a fix. But you’d be a bit upset if this happened for the first time during a real failover. That’s why we test before we hand over. And what is it with everything going pear-shaped when Linked Mode is in use?

 

*Update – 29/10/2015*

Marcel van den Berg recently pointed out that updating to SRM 5.8.1 resolves this issue. Further detail can be found here.

VMware – SRM advanced setting for snap prefix

We haven’t been doing this in our production configurations, but if you want to change the behaviour of SRM with regards to the “snap-xxx” prefix on replica datastores, you need to modify an advanced setting in SRM. In the vSphere client, go to SRM, right-click on Site Recovery and select Advanced Settings. Under SanProvider, there’s an option called “SanProvider.fixRecoveredDatastoreNames” with a little checkbox that needs to be ticked to prevent the recovered datastores being renamed with the unsightly prefix.

You can also do this when manually mounting snapshots or mirrors with the help of the esxcfg-volume command – but that’s a story for another time.
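
(For reference, and hedged against version differences: esxcfg-volume -l lists unresolved snapshot / replica volumes, esxcfg-volume -m mounts one while keeping its existing signature, -M mounts it persistently, and -r resignatures it.)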

EMC MirrorView – New Article

I wrote a brief article on configuring EMC MirrorView from scratch through to being ready for VMware SRM usage. It’s not really from scratch, because I don’t go through the steps required to load the MirrorView enabler on the frame. But I figured that’s more a CE-type activity in any case. I hope to follow it up with a brief article on the SRM side of things. You can find my other articles here. Enjoy!

2009 and penguinpunk.net

It was a busy year, and I don’t normally do this type of post, but I thought I’d try to do a year-in-review type thing so I can look back at the end of 2010 and see what kind of promises I’ve broken. Also, the Exchange Guy will no doubt enjoy the size comparison. You can see what I mean by that here.

In any case, here’re some broad stats on the site. In 2008 the site had 14966 unique visitors according to Advanced Web Statistics 6.5 (build 1.857). In 2009, it had 15856 unique visitors – according to the same Advanced Web Statistics 6.5 (build 1.857). That’s an increase of some 890 unique visitors, or year-on-year growth of approximately 6%. My maths are pretty bad at the best of times, but I normally work with storage arrays, not web statistics. In any case, most of the traffic is no doubt down to me spending time editing posts and uploading articles, but it’s nice to think that it’s been relatively consistent, if not a little lower than I’d hoped. This year (2010 for those of you playing at home) will be the site’s first full year using Google analytics, so assuming I don’t stuff things up too badly, I’ll have some prettier graphs to present this time next year. That said, MYOB / smartyhost are updating the web backend shortly, so I can’t make any promises that I’ll have solid stats for this year, or even a website :)

What were the top posts? Couldn’t tell you. I do, however, have some blogging-type goals for the year:

1. Blog with more focus and frequency – although this doesn’t mean I won’t throw in random youtube clips at times.

2. Work more on the promotion of the site. Not that there’s a lot of point promoting something if it lacks content.

3. Revisit the articles section and revise where necessary. Add more articles to the articles page.

On the work front, I’m architecting the move of my current employer from a single data centre to a 2+1 active / active architecture (from a storage and virtualisation perspective). There’s more blades, more CLARiiON, more MV/S, some vSphere and SRM stuff, and that blasted Cisco MDS fabric stuff is involved too. Plus a bunch of stuff I’ve probably forgotten. So I think it will be a lot of fun, and a great achievement if we actually get anything done by June this year. I expect there’ll be some moments of sheer boredom as I work my way through 100s of incremental SAN Copies and sVMotions. But I also expect there will be moments of great excitement when we flick the switch on various things and watch a bunch of visio illustrations turn into something meaningful.

Or I might just pursue my dream of blogging about the various media streaming devices on the market. Not sure yet. In any case, thanks for reading, keep on reading, tell your friends, and click on the damn Google ads.