Disclaimer: I recently attended VMworld 2017 – US. My flights were paid for by ActualTech Media, VMware provided me with a free pass to the conference and various bits of swag, and Tech Field Day picked up my hotel costs. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event. Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.
Here are my rough notes from “MGT3342BUS – Architecting Data Protection with Rubrik” presented by Rebecca Fitzhugh and Andrew Miller at VMworld US 2017. You can download my rough notes from here. Here’s a proof of life shot of Rebecca and Andrew.
Why bother with Data Protection?
There’s one big reason. Your stuff is important. However, the business expectations of a company’s DR / data protection frequently != the IT capabilities for DR / data protection.
What are you really protecting yourself against?
- Lost or postponed sales and income
- Regulatory fines
- Delay of new business plans
- Loss of contractual bonuses
- Customer dissatisfaction
- Timing and duration of disruption
- Increased expenses such as overtime labor and outsourcing
- Employee burnout
Disaster – what does that really look like?
- Natural – tornadoes, earthquakes, etc; and
- Man-made – power loss, human error.
Where do we begin? How do we deal with this?
What is a Business Impact Analysis (BIA)? Something you need to do if you haven’t done it already.
A process to understand:
- What is the monetary impact of a disaster or failure?
- What are the most time-critical and information-critical business processes?
- How does the business REALLY rely upon IT service and application availability?
- What availability or recoverability capabilities are justifiable based on these requirements, potential impact and costs?
Composed of two components
- Technical discovery – data gathering
- Human conversation – talk to people!
Example output – recovery priority tiers.
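To make the idea of recovery priority tiers a little more concrete, here is a minimal Python sketch (not from the session). It assumes the BIA has produced an estimated cost per hour of downtime for each service. The `Service` class, the service names, the dollar figures and the tier thresholds are all hypothetical and only there for illustration.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    cost_per_hour_down: float  # estimated monetary impact taken from the BIA


def recovery_tier(service: Service,
                  tier1_floor: float = 10_000,
                  tier2_floor: float = 1_000) -> int:
    """Map a service to a recovery priority tier (1 = recover first).

    The thresholds are made-up defaults; real ones come out of the
    conversations with the business, not out of a script.
    """
    if service.cost_per_hour_down >= tier1_floor:
        return 1
    if service.cost_per_hour_down >= tier2_floor:
        return 2
    return 3


# Hypothetical services and impact figures.
services = [
    Service("order-processing", 25_000),
    Service("internal-wiki", 200),
    Service("email", 4_000),
]

# List services in the order they should be recovered.
for svc in sorted(services, key=recovery_tier):
    print(f"Tier {recovery_tier(svc)}: {svc.name}")
```

The technical discovery gives you the numbers. The human conversation tells you whether the tiers they produce actually match what the business expects.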
What is an SLA?
A contract between an external service provider and its customers, or between an IT department and the internal business units it services.
- Two 9s – 99% = 3.65 days of downtime per year (easy to achieve, less expensive)
- Three 9s – 99.9% = 8.76 hours of downtime per year
- Four 9s – 99.99% = 52.6 minutes of downtime per year
- Five 9s – 99.999% = 5.26 minutes of downtime per year (difficult to achieve, expensive!)
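The downtime figures above fall straight out of the availability percentage. Here is a minimal Python sketch (not from the session) of the arithmetic, assuming a 365-day year. The function name is my own.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Maximum downtime per year, in hours, allowed by an availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours / 24:.2f} days, "
          f"{hours:.2f} hours, {hours * 60:.2f} minutes per year")
```

Each extra 9 cuts the allowed downtime by a factor of ten, which is why the cost climbs so quickly.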
DR – key measures
- RPO (Recovery Point Objective): how much data can I lose?
- RTO (Recovery Time Objective): targeted amount of time to restart a business service after a disaster event
The smaller your RTOs and RPOs – the more money you’ll spend
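One way to reason about these two measures is shown in the minimal Python sketch below (not from the session). It assumes the worst case for RPO, where a failure hits just before the next backup runs, so up to one full backup interval of data is lost. The function names and the example numbers are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, one full backup interval of data is lost,
    so the interval must not exceed the RPO."""
    return backup_interval <= rpo


def meets_rto(estimated_restore_time: timedelta, rto: timedelta) -> bool:
    """The service must be restartable within the RTO."""
    return estimated_restore_time <= rto


# Hypothetical example: nightly backups against a 4-hour RPO.
print(meets_rpo(timedelta(hours=24), timedelta(hours=4)))
# Hypothetical example: a 30-minute restore against a 1-hour RTO.
print(meets_rto(timedelta(minutes=30), timedelta(hours=1)))
```

Nightly backups cannot satisfy a 4-hour RPO, so meeting it means more frequent backups, which is where the extra spend comes from.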
BC vs DR vs OR – Say What?
Business Continuity (BC)
- All goes on as normal despite an incident
- Could lose a site and have no impact on business operations (active/active sites)
Disaster Recovery (DR)
- To cope with and recover from an IT crisis that moves work to an alternative system in a non-routine way
- A real “disaster” is large in scope and impact
- DR typically implies failure of the primary data centre and recovery to an alternate site
Operational Recovery (OR)
- Addresses more “routine” types of failure (server, network, storage, etc)
- Events are smaller in scope and impact than a full disaster
- Typically implies recovering to alternate equipment within the primary DC
Each should have its own clearly defined objectives – at minimum you should know the difference.
Where Rubrik Helps
Complexity is the enemy. Whatever you do. Whatever you buy. Simplify your architecture & expect more.
Key Evaluation Criteria
What Rubrik have seen that makes a difference:
1. Reliability of data recovery
- Simplicity of setup and day 2 operations – SLA policies!
- Immutability – is your data there when you need it?
2. Speed of data recovery
- Search and Live Mount
- API usage / automation to enhance restore capabilities
Not a lot has changed in data management since the 1990s. Last decade we introduced disk-based backup and deduplication. The problem is we added capabilities to existing architectures. This didn’t necessarily make things simpler.
Rubrik Cloud Data Management
Software fabric for orchestrating apps and data across clouds. No forklift upgrades.
How it Works
- Quick start – Rack and go. Auto discovery.
- Rapid Ingest – Flash-optimized, parallel ingest accelerates snapshots and eliminates stun. Content-aware dedupe. One global namespace.
- Automate – Intelligent SLA policy engine for effortless management.
- Instant Recovery – Live mount VMs and SQL. Instant search and file restore.
- Secure – end-to-end encryption. Immutability to fight ransomware.
- Cloud – “CloudOut” instantly accessible with global search. Launch apps with “CloudOn” for DR or test/dev. Run apps in cloud.
Data Management in the Cloud
SLAs are important, and you’ll likely need to consider the following aspects.
- Availability Duration (Retention)
- When to archive (RTO)
- Replication Schedule (DR)
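The three aspects above can be pictured as fields on a single policy object. The following is a minimal Python sketch of that idea (not from the session), and it is not Rubrik's actual policy schema or API. The `SlaPolicy` class, its field names and the "gold" values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SlaPolicy:
    """Hypothetical SLA policy; the field names are illustrative only."""
    name: str
    snapshot_every: timedelta   # how often a snapshot is taken
    retention: timedelta        # how long a snapshot is kept at all
    archive_after: timedelta    # age at which a snapshot moves to archive
    replicate_every: timedelta  # how often data is copied to the DR site

    def location_at_age(self, age: timedelta) -> str:
        """Where a snapshot of the given age would live under this policy."""
        if age > self.retention:
            return "expired"
        if age >= self.archive_after:
            return "archive"
        return "local"


# Hypothetical "gold" policy values.
gold = SlaPolicy(
    name="gold",
    snapshot_every=timedelta(hours=4),
    retention=timedelta(days=365),
    archive_after=timedelta(days=30),
    replicate_every=timedelta(hours=4),
)

for days in (7, 90, 400):
    print(days, gold.location_at_age(timedelta(days=days)))
```

The archive threshold presumably ties back to RTO because a restore from archive is likely to take longer than a restore from local storage.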
Under the hood – Interface, Logic, Core.
“Simple is hard”
Use an API-first platform to create powerful automation workflows
“Don’t Backup. Go Forward”
It should be no secret that I’m quite a fan of the Rubrik architecture and approach to data protection. I’ve written about them before on this blog. I like when data protection firms talk to me about what’s important to the business and the kinds of scenarios they protect against. I also like the focus on BIA and SLAs. Rubrik have made some great strides in the marketplace and are delivering new features at a rapid clip. If you haven’t had time to look at them and you’re looking for a new approach to data protection, I recommend you look into their solution.