VMware – VMworld 2016 – STO7875 – A Day in the Life of a VSAN I/O

Disclaimer: I recently attended VMworld 2016 – US.  My flights were paid for by myself, VMware provided me with a free pass to the conference and various bits of swag, and Tech Field Day picked up my hotel costs. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.


Here are my rough notes from “STO7875 – A Day in the Life of a VSAN I/O”, presented by Duncan Epping and John Nicholson (Senior Technical Marketing Manager).


 

Agenda

  • Introduction
  • Virtual SAN, what is it?
  • Virtual SAN, a bit of a deeper dive
  • What about failures? (John)
  • etc

 

Introduction

The SDDC – two of the big challenges we’ve had have been networking and storage.

  • “hardware evolution started the infrastructure revolution” – this is a pretty good point, and one that I think is too often overlooked.
  • a lazy admin is the best admin
  • simplicity – operational/management
  • the hypervisor is the strategic high ground (VMware vSphere)

Storage policy-based management provides application-centric automation. The cool thing is this gives you:

  • Intelligent placement
  • Fine control of services at VM level
  • Automation at scale through policy
  • Need new services for a VM? Change the current policy on the fly, or attach a new policy on the fly

 

Virtual SAN Primer

  • HCI
  • SDS
  • Distributed, scale-out architecture
  • Integrated with vSphere platform
  • Ready for today’s vSphere use cases

Comprised of:

  • Generic x86 hardware
  • Integrated with hypervisor
  • Leveraging local storage resources
  • Exposing a single shared datastore

There are currently over 5000 customers using VSAN. The use cases are increasing as well:

  • Business critical apps;
  • End user computing
  • DMZ – I think this is a great idea, given the struggles I’ve had with InfoSec teams and their requirement to keep workloads on discrete storage.
  • DR/DA
  • Test/Dev
  • Management
  • Staging
  • ROBO

VSAN comes in tiered hybrid and all-flash varieties. All writes and the vast majority of reads are served by flash storage.

1. Write-back buffer (30% of the cache device, or 100% in all-flash)

  • writes acknowledged as soon as they are persisted on flash (on all replicas)

2. Read cache (70%)

  • active data set kept in flash; hot data replaces cold data
  • cache miss – read data from HDD and put in cache
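
To make the split concrete, here is a minimal sketch (my own illustration, not VSAN code) of how a single disk group’s cache device is divided in each mode:

  # Illustrative only: how a single disk group's cache device is carved up,
  # based on the 30%/70% hybrid split and the 100% write buffer in all-flash.
  def cache_layout(cache_device_gb, all_flash):
      """Return (write_buffer_gb, read_cache_gb) for one disk group."""
      if all_flash:
          # All-flash: the whole cache tier buffers writes; reads come from
          # the capacity flash devices (and the client cache).
          return cache_device_gb, 0
      return 0.3 * cache_device_gb, 0.7 * cache_device_gb   # hybrid split

  print(cache_layout(400, all_flash=False))   # (120.0, 280.0)
  print(cache_layout(400, all_flash=True))    # (400, 0)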

 

VSAN (Deeper)

The VM is treated as a set of objects on VSAN

  • Define a policy first
  • Each object has multiple components – allows you to meet availability and performance requirements
  • Data is distributed based on the storage policy
  • Number of failures to tolerate
  • Number of disk stripes per object
  • Fault domains, increasing availability through rack awareness
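
As a rough illustration of how those policy settings translate into components with RAID-1 mirroring (my own sketch; the witness/quorum handling in real VSAN is more involved than a flat “plus one witness”):

  # Illustrative sketch of how policy settings drive object layout with
  # RAID-1 mirroring: FTT+1 full replicas, each split into stripe-width
  # components, plus a witness for quorum (simplified).
  def mirrored_components(failures_to_tolerate, stripes_per_object):
      replicas = failures_to_tolerate + 1            # full data copies
      data_components = replicas * stripes_per_object
      witnesses = 1                                  # real VSAN may place more
      return data_components + witnesses

  print(mirrored_components(1, 1))   # FTT=1, stripe width 1 -> 3 components
  print(mirrored_components(1, 2))   # FTT=1, stripe width 2 -> 5 components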

 

What about failures?

1 host isolated – HA restart

set the isolation response to power off the VM, not shut down

2 hosts partitioned – HA restart

4 hosts partitioned – HA restart

 

VSAN IO flow – write acknowledgement

VSAN mirrors write IOs to all active mirrors

These are acknowledged when they hit the write buffer.

[..]
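
In pseudocode, that write path looks roughly like this (a sketch only; Replica and write_to_buffer are stand-ins for the real internals):

  # Sketch (not VSAN internals) of the write acknowledgement flow: the owner
  # fans a write out to every active mirror and acknowledges the guest only
  # once each mirror has persisted the IO in its flash write buffer.
  class Replica:
      def __init__(self):
          self.write_buffer = {}             # stands in for the flash cache tier
      def write_to_buffer(self, block, data):
          self.write_buffer[block] = data    # "persisted" in the write buffer
          return True

  def handle_guest_write(mirrors, block, data):
      acks = [r.write_to_buffer(block, data) for r in mirrors]
      if all(acks):
          return "ACK to guest"              # destaging to capacity happens later
      return "handle failed replica"

  print(handle_guest_write([Replica(), Replica()], block=42, data=b"payload"))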

 

Anatomy of a hybrid read

1. guest issues a read on virtual disk

2. owner chooses replica to read from

  • load balance across replicas
  • not necessarily the local replica (if one exists)
  • a block always reads from the same replica

3. At the chosen replica, read data from the flash read cache or client cache, if present

4. otherwise, read from HDD and place data in flash Read cache – replace cold data

5. return data to owner

6. complete read and return data to VM
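
A minimal sketch of that hybrid read path (my own illustration; in particular, mapping a block to a replica with a simple modulo is just one way to get the “same block, same replica” behaviour, not necessarily how VSAN does it):

  # Sketch (not VSAN code) of the hybrid read path: pick a replica
  # deterministically per block, try the client cache, then the flash read
  # cache, and finally fall back to HDD, warming the read cache on a miss.
  def read_block(block, replicas, client_cache):
      replica = replicas[block % len(replicas)]       # same block -> same replica
      if block in client_cache:
          return client_cache[block]                  # host-local memory cache hit
      if block in replica["read_cache"]:
          return replica["read_cache"][block]         # flash read cache hit
      data = replica["hdd"][block]                    # cache miss: read from HDD
      replica["read_cache"][block] = data             # populate cache (eviction omitted)
      return data

  replicas = [{"read_cache": {}, "hdd": {7: b"data-on-replica-0"}},
              {"read_cache": {}, "hdd": {7: b"data-on-replica-1"}}]
  print(read_block(7, replicas, client_cache={}))     # -> b'data-on-replica-1'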

 

Anatomy of an all-flash read

1. guest OS issues a read on virtual disk

2. Owner chooses replica to read from

  • load balance across replicas
  • not necessarily the local replica (if one exists)
  • a block is always read from the same replica

3. at the chosen replica, read data from the (write) flash cache or client cache, if present

4. Otherwise, read from capacity flash device

5. return data to owner

6. complete read and return data to VM

 

Client cache

  • Always local
  • Up to 1GB of memory per host
  • Memory latency < network latency
  • Horizon 7 testing – 75% fewer read IOPS, 25% better latency
  • Complements Content-Based Read Cache (CBRC)
  • Enabled by default in 6.2

 

Anatomy of checksum

1. guest OS issues a write on virtual disk

2. the host generates a checksum before the data leaves the host

3. transferred over network

4. the checksum is verified on the host where the data will be written to disk

5. ACK is returned to the VM

6. on read, the checksum is verified by the host running the VM. If any component fails verification, it is repaired from the other copy or parity.

7. scrubs of cold data are performed once a year (this is adjustable)
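
As a sketch of the end-to-end checksum idea (illustration only; I’m using zlib’s CRC-32 as a stand-in, since the actual algorithm wasn’t covered in the session):

  import zlib

  # Illustration of end-to-end checksumming (zlib CRC-32 as a stand-in): the
  # checksum is computed before the data leaves the source host, verified on
  # the host that writes it to disk, and verified again on read, with a repair
  # from the other copy (or parity) on mismatch.
  def send_write(data):
      return data, zlib.crc32(data)                   # checksum before leaving the host

  def receive_and_persist(data, checksum, disk):
      if zlib.crc32(data) != checksum:                # verify before writing to disk
          raise IOError("checksum mismatch in transit")
      disk["data"], disk["checksum"] = data, checksum

  def read_and_verify(disk, other_copy):
      if zlib.crc32(disk["data"]) == disk["checksum"]:
          return disk["data"]
      return other_copy["data"]                       # repair from the other copy/parity

  disk, mirror = {}, {"data": b"hello vsan"}
  data, csum = send_write(b"hello vsan")
  receive_and_persist(data, csum, disk)
  print(read_and_verify(disk, mirror))                # -> b'hello vsan'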

 

Deduplication and compression for space efficiency

  • deduplication and compression at the disk group level
    • enabled at the cluster level
    • fixed block length deduplication (4KB blocks)
  • compression after deduplication
    • LZ4 is used, low CPU overhead
    • single feature, no schedules required
    • the file system stripes all IO across the disk group
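
A rough sketch of the dedup-then-compress pipeline (my own illustration using SHA-1 over fixed 4KB blocks and LZ4 afterwards, per the slides; keeping the compressed form only when it actually shrinks is my assumption about how incompressible data is handled):

  import hashlib
  import lz4.frame   # pip install lz4

  BLOCK = 4096       # fixed 4KB deduplication block size

  # Sketch of dedup-then-compress at destage time: hash each 4KB block with
  # SHA-1, keep only one copy per fingerprint, then LZ4-compress unique blocks.
  def destage(data, dedup_map):
      for i in range(0, len(data), BLOCK):
          block = data[i:i + BLOCK]
          fingerprint = hashlib.sha1(block).hexdigest()
          if fingerprint in dedup_map:
              continue                                # duplicate: reference only
          compressed = lz4.frame.compress(block)
          dedup_map[fingerprint] = compressed if len(compressed) < len(block) else block

  dedup_map = {}
  destage(b"A" * BLOCK * 3 + b"B" * BLOCK, dedup_map)  # 4 blocks written, 2 unique
  print(len(dedup_map))                                # -> 2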

 

Deduplication and compression disk group stripes

  • deduplication and compression at the disk group level
    • data is striped across the disk group
  • Fault domain isolated to the disk group
    • failure of a device leads to a rebuild of the disk group
    • stripes reduce hotspots
    • endurance/throughput impact

 

Deduplication and compression (IO path)

  • avoids the downsides of both inline and post-process approaches
  • performed at disk group level
  • 4KB fixed block
  • LZ4 compression after deduplication

1. VM issues write

2. Write acknowledged by cache

3. cold data to memory

4. deduplication

5. compression

6. data written to capacity

 

Deduplication process (all-flash only)

  • SHA-1 is Fast
  • Hash map not fully in memory
  • Avoids fragmentation

1. VM issues write

2. write acknowledged by cache

3. cold data to memory

4. deduplication

5. compression

6. data written to capacity

Compression process (all-flash only)

  • LZ4 is fast
  • avoids compressing duplicate data (compression runs after deduplication)
  • what about incompressible data?

 

RAID 5/6

  • can be enabled on a per-object basis via storage policy

 

RAID 5 inline erasure coding

  • When Number of Failures to Tolerate = 1 and Failure Tolerance Method = Capacity -> RAID 5
    • 3+1 (4 host minimum)
    • 1.33x instead of 2x overhead (a 20GB disk consumes 40GB with RAID 1, but only ~27GB with RAID 5)
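
The overhead arithmetic works out like this (simple sketch; the 4+2 RAID-6 ratio is my addition from the standard layout and wasn’t in my notes):

  # Raw capacity consumed for a given amount of VM data under each layout.
  # RAID-1 with FTT=1 stores 2 full copies; RAID-5 is the 3+1 layout from the
  # slide (1.33x); the 4+2 RAID-6 ratio (1.5x) is my addition.
  def raw_capacity_gb(vm_data_gb, layout):
      overhead = {
          "raid1_ftt1": 2.0,       # 2 full copies
          "raid5_3p1": 4 / 3,      # 3 data + 1 parity
          "raid6_4p2": 6 / 4,      # 4 data + 2 parity (FTT=2)
      }[layout]
      return vm_data_gb * overhead

  print(raw_capacity_gb(20, "raid1_ftt1"))           # 40.0
  print(round(raw_capacity_gb(20, "raid5_3p1"), 1))  # 26.7 (the ~27GB above)
  print(raw_capacity_gb(20, "raid6_4p2"))            # 30.0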

 

Swap placement

(new in 6.2)

  • Sparse Swap
  • reclaims space used by VM memory swap files
  • enabled via a host advanced option

How to set it?

  • esxcfg-advcfg -g /VSAN/SwapThickProvisionDisabled shows the current setting; esxcfg-advcfg -s 1 /VSAN/SwapThickProvisionDisabled enables sparse swap
  • https://github.com/jasemccarty/SparseSwap

 

Snapshots for VSAN

(new in 6.0)

  • Not using VMFS redo logs
  • writes are allocated in 4MB allocation units
  • snapshot metadata cache (avoids read amplification)
  • Performs pre-fetch of metadata cache
  • Maximum of 31 snapshots
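
To illustrate the idea (a much-simplified sketch of my own, not the vsanSparse on-disk format): each snapshot level only records the 4MB extents written while it was active, and a metadata cache remembers which level owns an extent so a read doesn’t have to walk every level’s metadata.

  CHUNK = 4 * 1024 * 1024   # 4MB allocation unit

  # Much-simplified sketch of an allocation-based snapshot chain: each level
  # maps 4MB-aligned offsets to the data written while it was active, and a
  # metadata cache records which level owns an offset so reads don't have to
  # consult every level's metadata (avoiding read amplification).
  class SnapshotChain:
      def __init__(self):
          self.levels = [{}]          # levels[0] = base, last = running point
          self.metadata_cache = {}    # offset -> owning level index

      def snapshot(self):
          self.levels.append({})      # new writable level

      def write(self, offset, data):
          key = offset // CHUNK
          self.levels[-1][key] = data
          self.metadata_cache[key] = len(self.levels) - 1

      def read(self, offset):
          key = offset // CHUNK
          level = self.metadata_cache.get(key)
          if level is not None:                  # cache hit: single lookup
              return self.levels[level].get(key)
          for lvl in reversed(self.levels):      # cache miss: walk the chain
              if key in lvl:
                  return lvl[key]
          return None

  chain = SnapshotChain()
  chain.write(0, b"base data")
  chain.snapshot()
  chain.write(CHUNK, b"post-snapshot data")
  print(chain.read(0), chain.read(CHUNK))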

 

Wrapping Up

VMware recently launched a new portal for Storage and Availability technical documents. You should also check out the Virtually Speaking podcast.

Good session – 4.5 stars