Come And Splash Around In NetApp’s Data Lake

Disclaimer: I recently attended Storage Field Day 15.  My flights, accommodation and other expenses were paid for by Tech Field Day. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.

NetApp recently presented at Storage Field Day 15. You can see videos of their presentation here, and download my rough notes from here.

 

You Say Day-ta, I Say Dar-ta

Santosh Rao (Senior Technical Director, Workloads and Ecosystems) took us through some of the early big data platform challenges NetApp are looking to address.

 

Early Generation Big Data Analytics Platform

These were designed to deliver initial analytics solutions and were:

  • Implemented as proofs of concept; and
  • Solved a point project need.

The primary considerations for these solutions were usually cost and agility. The focus was to:

  • Limit up-front costs and get the system operational quickly; and
  • Leave scalability, availability, and governance as afterthoughts.

A typical approach was to use cloud or commodity infrastructure, and this often ended up becoming the final architecture. The problem with this approach, according to NetApp, is that it led to unpredictable behaviour as copies of the data proliferated. You'd end up with 3-5 replicas of data copied across lines of business and various functions. Not a great situation.

 

Early Generation Analytics Platform Challenges

Other challenges with this architecture included:

  • Unpredictable performance;
  • Inefficient storage utilisation;
  • Media and node failures;
  • Total cost of ownership;
  • Not enterprise ready; and
  • Storage and compute tied (creating imbalance).

 

Next Generation Data Pipeline

So what do we really need from a data pipeline? According to NetApp, the key is “Unified Insights across LoBs and Functions”. By this they mean:

  • A unified enterprise data lake;
  • Federated data sources across the 2nd and 3rd platforms;
  • In-place access to the data pipeline (copy avoidance);
  • Spanned across edge, core and cloud; and
  • Future proofed to allow shifts in architecture.

Another key consideration is the deployment. The first proof of concept is performed by the business unit, but it needs to scale for production use.

  • Scale edge, core and cloud as a single pipeline
  • Predictable availability
  • Governance, data protection, security on data pipeline

This provides for a lower TCO over the life of the solution.

 

Data Pipeline Requirements

We’re not just playing in the core any more, or exclusively in the cloud. This stuff is everywhere. And everywhere you look the requirements differ as well.

Edge

  • Massive data (a few TB/device/day; see the rough numbers after these lists)
  • Real-time Edge Analytics / AI
  • Ultra Low Latency
  • Network Bandwidth
  • Smart Data Movement

Core

  • Ultra-high I/O bandwidth (20-200+ GB/s)
  • Ultra-low latency (microsecond to nanosecond)
  • Linear scale (1-128 node AI)
  • Overall TCO for 1-100+ PB

Cloud

  • Cloud analytics, AI/DL/ML
  • Consume and not operate
  • Cloud vendor vs on-premises stack
  • Cost-effective archive
  • Need to avoid cloud lock-in
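
To put the edge and core figures above side by side, here's a rough back-of-envelope sketch. The device count and per-device volume below are my own illustrative assumptions (not NetApp's figures), but they show why the presentation keeps coming back to network bandwidth and smart data movement: even a modest fleet of edge devices generates data at a rate that lands squarely in the quoted core bandwidth range if you try to ship all of it back.

```python
# Rough, illustrative numbers only (my assumptions, not NetApp's):
# 1,000 edge devices, each generating 2 TB of data per day.
TB = 10**12              # bytes in a (decimal) terabyte
SECONDS_PER_DAY = 86_400

devices = 1_000
per_device_per_day = 2 * TB

# Total daily edge data, and the sustained rate needed to ship it all to the core.
total_per_day = devices * per_device_per_day
sustained_gb_per_sec = total_per_day / SECONDS_PER_DAY / 10**9

print(f"Daily edge data: {total_per_day / TB:,.0f} TB")
print(f"Sustained rate to move it all: {sustained_gb_per_sec:.1f} GB/s")
# ~23 GB/s sustained, before any analytics I/O on top - hence the emphasis on
# reducing and filtering data at the edge rather than copying everything back.
```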

Here’s a picture of what the data pipeline looks like for NetApp.

[Image courtesy of NetApp]

 

NetApp provided the following overview of what the data pipeline looks like for AI / Deep Learning environments. You can read more about that here.

[Image courtesy of NetApp]

 

What Does It All Mean?

NetApp have a lot of tools at their disposal, and a comprehensive vision for meeting the requirements of big data, AI and deep learning workloads from a number of different angles. It’s not just about performance, it’s about understanding where the data needs to be in order to be useful to the business. I think there’s a good story to tell here with NetApp’s Data Fabric, but it felt a little like there’s still some integration work to do. Big data, AI and deep learning mean different things to different people, and there’s sometimes a reluctance to change the way people do things for the sake of adopting a new product. NetApp’s biggest challenge will be demonstrating the additional value they bring to the table, and the other ways in which they can help enterprises succeed.

NetApp, like some of the other Tier 1 storage vendors, has a broad portfolio of products at its disposal. The Data Fabric play is a big bet on being able to tie this all together in a way that their competitors haven’t managed to do yet. Ultimately, the success of this strategy will rely on NetApp’s ability to listen to customers and continue to meet their needs. As a few companies have found out the hard way, it doesn’t matter how cool you think your idea is, or how technically innovative it is, if you’re not delivering results for the business you’re going to struggle to gain traction in the market. At this stage I think NetApp are in a good place, and hopefully they can stay there by continuing to listen to their existing (and potentially new) customers.

For an alternative perspective, I recommend reading Chin-Fah’s thoughts from Storage Field Day 15 here.