Dell EqualLogic MasterClass 2012 – Part 2

Disclaimer: I recently attended the Dell EqualLogic MasterClass 2012. This was a free event delivered by Dell and I was provided with breakfast and lunch; however, there was no requirement for me to blog about any of the content presented and I was not compensated in any way for my time at the event.

This is the second of three posts covering the Dell EqualLogic MasterClass that I recently attended. The first one is here.

EqualLogic MasterClass 201: Advanced Features

Dell described this as follows: “The EqualLogic MasterClass 201 course builds upon our 101 Core technology course and provides the opportunity for students to explore further the advanced features and software included in PS Series Storage Array, namely replication and SAN Headquarters Monitoring Tool. By understanding and utilising these advanced features you will be able to maximise your use of the storage array to fulfill business requirements today and well into the future.  In this session we will cover:

Disaster Recovery using replication

  • Auto-replication
  • Auto Snapshot Manager (Microsoft ed)
  • Auto Snapshot Manager (VMware ed)
  • Off-host backup techniques with EqualLogic
  • Visibility to your EqualLogic Storage by leveraging SAN HQ monitoring”

So here’re the rough notes from Session 2.

[Note: I missed the first 15 minutes of this session and had a coffee with @DanMoz instead.]

Snapshot Architecture

There are three different types of snapshots you can take:

  • Crash-consistent (using Group Manager);
  • Application-aware; and
  • Hypervisor-aware.

Steven then went on to provide an in-depth look at how snapshots work using redirect-on-write. Audience question – what if you’re referencing data from both the A and A1 pages? EQL will move data in the background to optimise for sequential reads. Reads may be proxied initially (data movement is done when the arrays are not under load). Replicas use the same engine as snapshots, but with smaller segments (talking in KB, not MB).

The default snapshot space percentage when you create a volume is dictated by the Group configuration. You can set this to 0% if you want, and then set snapshot space on each volume as required. You can set a volume to have no snapshot space, but this will cause VSS-based applications to fail (because there’s no snap space). The default Group configuration is 100% (like-for-like) snapshot space – with this you can guarantee that you will always have at least one point-in-time copy of the data.

The key to setting up snapshots is knowing what the rate of change is. Start with a conservative figure, then look at RTO / RPO and how many on-line copies you want. Think about the keep count – how many snaps you want to keep before the oldest ones get deleted. Audience question – are there any tools to help identify the rate of change? Steven suggested you could use SAN HQ to measure this by taking a daily snap and running a report every day to see how much snapshot reserve is being used. Audience question – is there a high watermark for snaps? The policy is to delete older snaps once 100% of reserved space is reached (or take the snapshot off-line). FW6 introduces the concept of snapshot borrowing. If you’re using thin-provisioned volumes, snapshots are thin-provisioned as well. Audience question – are snapshots distributed across members in the same way as volumes are? Yes.
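To make that sizing guidance a bit more concrete, here’s a rough back-of-envelope sketch in Python (mine, not a Dell tool) that turns a measured daily rate of change and a keep count into a snapshot reserve percentage. The function name and the 25% headroom figure are assumptions for illustration only.

```python
# Rough snapshot reserve sizing sketch (illustrative only, not an EQL tool).
# Assumes the daily change rate has been measured, e.g. via a daily snap
# and a SAN HQ report on snapshot reserve usage.

def snapshot_reserve_pct(volume_gb, daily_change_gb, keep_count, headroom=1.25):
    """Estimate snapshot reserve as a percentage of the volume size.

    volume_gb       -- size of the volume in GB
    daily_change_gb -- measured change per day, in GB
    keep_count      -- number of daily snapshots to keep on-line
    headroom        -- safety margin for bursts (25% assumed here)
    """
    required_gb = daily_change_gb * keep_count * headroom
    return 100.0 * required_gb / volume_gb

# Example: 500 GB volume, ~10 GB change per day, keeping 7 daily snaps.
print(f"{snapshot_reserve_pct(500, 10, 7):.0f}% reserve suggested "
      "(vs. the 100% like-for-like group default)")
```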

Manual Tiering

Steven then moved on to a discussion of manual tiering / RAID preferencing. When you create a volume the RAID preference is set to auto – you can’t change it until after the volume is created. The RAID preference will either be honoured or not honoured; RAID preferencing can be used to override the capacity load balancer. What happens if you have more volumes asking for a certain RAID type than you have space of that RAID type? At that point the array will randomly choose which volume gets the preference – this can’t be configured. You can’t do a tier within a tier (i.e. these volumes must be RAID 1/0, those ones would merely be nice to have on RAID 1/0). If you have 80% RAID 1/0 capacity, the capacity load balancer will take over and stripe it 50/50 instead, not 80/20.

There is one other method, via the CLI, using the bind command, which binds a particular volume to a particular member. This is generally not recommended. RAID preferencing vs binding? If you had two RAID 1/0 members in your pool, RAID preferencing would give you wide-striping across both RAID 1/0 members; binding would not. Audience question – what happens when the member you’ve bound a volume to fails? Volumes affected by a member failure are unavailable until the member is brought back on-line. You don’t lose the data, but the volumes are unavailable. Audience question – what if you’ve bound a volume to a member and then evacuate that member? It should give you a warning about unbinding first.

The last method of tiering is to use different pools (the old way of doing things) – a pool of fast disk and a pool of slow disk, for example.
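To illustrate the difference between RAID preferencing and binding described above, here’s a simplified Python model. This is my own sketch, not EqualLogic code – the Member class, place_volume() and the member names are made up, and it ignores details such as the random choice between competing volumes when a preferred tier is oversubscribed.

```python
# Simplified model of RAID preferencing vs binding (illustrative only).

from dataclasses import dataclass

@dataclass
class Member:
    name: str
    raid_type: str   # e.g. "raid10", "raid6"
    free_gb: int

def place_volume(size_gb, members, raid_pref=None, bind_to=None):
    """Return the members the volume's pages may be spread across."""
    if bind_to:
        # bind (CLI only): the volume lives on exactly one member and is
        # unavailable while that member is down.
        return [m for m in members if m.name == bind_to]
    if raid_pref:
        preferred = [m for m in members if m.raid_type == raid_pref]
        if sum(m.free_gb for m in preferred) >= size_gb:
            # Preference honoured: wide-striped across the preferred members.
            return preferred
    # Preference not honoured (or set to auto): the capacity load balancer
    # stripes across the whole pool.
    return members

pool = [Member("member01", "raid10", 2000),
        Member("member02", "raid10", 2000),
        Member("member03", "raid6", 8000)]

print([m.name for m in place_volume(500, pool, raid_pref="raid10")])  # both RAID 1/0 members
print([m.name for m in place_volume(500, pool, bind_to="member01")])  # pinned to one member
```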

There was a brief discussion about HVS – a production-ready, EqualLogic-supported storage VM running on ESX – which is due for release late next year. Some of the suggested use cases include remote office deployments, or cloud providers combining on-premises hardware with an off-premises virtual solution.

Snapshots for VMware

This is a CentOS appliance in OVF format that talks to both the EQL Group and vCenter. It tells vCenter to place a VM in snapshot mode (*snapshot.vmdk), tells EQL to take a volume (hardware) snap, tells vCenter to put the VM back into production mode, then merges the snapshot vmdk with the original. Steven noted that the process leverages vCenter capabilities and is not a proprietary process. It uses snapshot reserve space on the array, as opposed to native VMware snapshots, which only use space on the datastore.
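That coordination sequence can be sketched roughly as below. This is my own outline, not Dell’s appliance code: the vCenter calls are standard pyVmomi methods, vm is assumed to be a pyVmomi VirtualMachine object already retrieved from vCenter, and eql_group.snapshot_volume() is a hypothetical placeholder for the array-side hardware snapshot.

```python
# Rough outline of the snapshot coordination described above (not ASM/VE code).
from pyVim.task import WaitForTask

def consistent_hardware_snapshot(vm, eql_group, datastore_volume):
    # 1. Ask vCenter to put the VM into snapshot mode, so new writes go to a
    #    delta *snapshot.vmdk and the base disk stops changing.
    task = vm.CreateSnapshot_Task(name="asm-temp",
                                  description="temporary snap before hardware snapshot",
                                  memory=False, quiesce=True)
    WaitForTask(task)
    vm_snap = task.info.result
    try:
        # 2. Take the hardware snapshot of the underlying EQL volume
        #    (hypothetical placeholder call).
        eql_group.snapshot_volume(datastore_volume)
    finally:
        # 3. Put the VM back into production mode: removing the VMware
        #    snapshot merges the delta vmdk back into the original.
        WaitForTask(vm_snap.RemoveSnapshot_Task(removeChildren=False))
```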

Demo time

VMware Virtual Storage Manager 3.5 (this was an early production release – General Availability is in the next few weeks). This can now address multiple EQL groups in the back-end. An audience member noted that you still need 2 VSMs if you’re running vCenter in linked mode; apparently linked mode is not recommended by EQL support for the VSM. Previously you needed (for replication) a vCenter and VSM appliance at each site to talk to each site’s EQL environment. Now you only need one vCenter and one appliance to talk to multiple groups (i.e. Prod and DR). It now supports NFS datastore provisioning, and thin-provision stun (with FW6 and ESX5): when a thin-provisioned datastore runs out of space, it puts the VMs on the datastore into a paused state, caches in memory, and notifies the admin, who can remediate and then bring the VMs back on-line. This is a free download from Dell – so get on it.

The remote setup wizard is really now only for array initialisation tasks. Previously you could initialise the array, set up MPIO and configure PS Group access – the last two tasks have been moved to ASM/ME. If your VSS snapshots have been failing, check whether the host is able to properly access the VSS volume presented by EQL.

Asynchronous Replication

Steven talked briefly about the difference between synchronous and asynchronous replication (async’s smallest window is every 5 minutes). With synchronous replication the array needs an acknowledgement from the secondary array (as well as the primary) before telling the host the I/O has been received. Async replication works between two groups, and bi-directional replication is supported. A single volume can only have one replication partner. IP replication is over iSCSI. What about the MTU? EQL autosenses replication traffic and brings it down to 1500.

The local reserve space is defined at the volume level (%) – it’s normally temporary, with a 5% minimum and up to 200% able to be reserved. If the minimum is exceeded, space can be “borrowed”. The remote reserve space is also defined at the volume level (%) – e.g. for a 100GB volume, a minimum of 105GB (one full copy plus 5% of changes) at the remote site. Delegated space on the target is a Group parameter; it holds replicas from other partners and is defined in GB/TB.

For “fast failback”, increase the local reserve space and incorporate fast failback snapshot space. This keeps a last good copy of a volume (the pages, not just the page table) at the source site. For example, if you’ve got a volume in sync, you fail it over to DR, do something at Prod, change some stuff on the volume at DR, then fail back: because the pages are still resident at the original source site (Prod), only the deltas are sent back. To do this you’ll need 100% or more of local reserve space configured. If you don’t use a fast failback snapshot, you need to send the whole lot back when doing a failback. Consider how fast you want to fail back, as this feature will have an impact on the space you consume to achieve it. A manual transfer utility also exists (this enables you to use a portable device to seed the data on the DR array before it’s shipped to the DR site, for example).
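As a worked example of the reserve figures above, here’s a quick Python sketch (mine, not a Dell calculator) that turns a volume size and an expected change rate into local and remote reserve figures. The function name and defaults are assumptions used purely for illustration.

```python
# Back-of-envelope async replication space sizing (illustrative only).

def replication_space(volume_gb, change_pct=5, fast_failback=False):
    """Return (local_reserve_gb, remote_reserve_gb) for a single volume.

    change_pct    -- expected change between replicas, as % of volume size
    fast_failback -- keep a last-good copy at the source so only deltas
                     are sent back after a failover (needs >= 100% local
                     reserve)
    """
    local_pct = max(5, change_pct)        # 5% minimum, up to 200%
    if fast_failback:
        local_pct = max(100, local_pct)   # room for the fast failback snapshot
    local_gb = volume_gb * local_pct / 100.0

    # Remote reserve: at least one full copy plus room for changes (105% minimum).
    remote_gb = volume_gb * (100 + max(5, change_pct)) / 100.0
    return local_gb, remote_gb

# Example: a 100 GB volume with ~5% change per replication cycle.
print(replication_space(100))                      # (5.0, 105.0)
print(replication_space(100, fast_failback=True))  # (100.0, 105.0)

# Delegated space on the DR group then needs to hold the remote reserves
# of every volume replicated to it, across all partners.
```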

More info on EqualLogic Array Software can be found here. For more information on EqualLogic Synchronous Replication (SyncRep), have a look at this Tech Report.