Primary Data – Seeing the Future

It’s that time of year when public relations companies send out a heap of “What’s going to happen in 2018” type press releases for us blogger types to take advantage of. I’m normally reluctant to do these “futures” based posts, as I’m notoriously bad at seeing the future (as are most people). These types of articles also invariably push the narrative in a certain direction based on whatever the vendor being represented is selling. That said I have a bit of a soft spot for Lance Smith and the team at Primary Data, so I thought I’d entertain the suggestion that I at least look at what’s on his mind. Unfortunately, scheduling difficulties meant that we couldn’t talk in person about what he’d sent through, so this article is based entirely on the paragraphs I was sent, and Lance hasn’t had the opportunity to explain himself :)


SDS, What Else?

Here’s what Lance had to say about software-defined storage (SDS). “Few IT professionals admit to a love of buzzwords, and one of the biggest offenders in the last few years is the term, “software-defined storage.” With marketers borrowing from the successes of “software-defined-networking”, the use of “SDS” attempts all kinds of claims. Yet the term does little to help most of us to understand what a specific SDS product can do. Despite the well-earned dislike of the phrase, true software-defined storage solutions will continue to gain traction because they try to bridge the gap between legacy infrastructure and modern storage needs. In fact, even as hardware sales declines, IDC forecasts that the SDS market will grow at a rate of 13.5% from 2017 – 2021, growing to a $16.2B market by the end of the forecast period.”

I think Lance raises an interesting point here. There’re a lot of companies claiming to deliver software-defined storage solutions in the marketplace. Some of these, however, are still heavily tied to particular hardware solutions. This isn’t always because they need the hardware to deliver functionality, but rather because the company selling the solution also sells hardware. This is fine as far as it goes, but I find myself increasingly wary of SDS solutions that are tied to a particular vendor’s interpretation of what off-the-shelf hardware is.

The killer feature of SDS is the idea that you can do policy-based provisioning and management of data storage in a programmatic fashion, and do this independently of the underlying hardware. Arguably, with everything offering some kind of RESTful API capability, this is the case. But I think it’s the vendors who are thinking beyond simply dishing up NFS mount points or S3-compliant buckets that will ultimately come out on top. People want to be able to run this stuff anywhere – on crappy whitebox servers and in the public cloud – and feel comfortable knowing that they’ll be able to manage their storage based on a set of business-focused rules, not a series of constraints set out by a hardware vendor. I think we’re close to seeing that with a number of solutions, but I think there’s still some way to go.
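To make that idea concrete, here’s a rough sketch of what business-focused, hardware-agnostic provisioning might look like. Everything in it (the policy fields, the endpoint in the docstring, the payload shape) is hypothetical, not any particular vendor’s API:

```python
import json

# A hypothetical business-focused storage policy, expressed as data rather
# than as hardware-specific settings. An SDS control plane would accept
# something like this via its REST API and work out placement itself.
policy = {
    "name": "finance-archive",
    "protection": {"replicas": 2, "encryption": True},
    "performance": {"max_latency_ms": 20},
    "placement": ["on-premises", "public-cloud"],  # anywhere that satisfies the rules
}

def provision_request(policy: dict) -> str:
    """Serialise a policy into the JSON body of a hypothetical
    POST /api/v1/volumes provisioning call."""
    return json.dumps({"action": "provision", "policy": policy}, indent=2)

print(provision_request(policy))
```

The point is that the request describes business rules, not LUNs or RAID groups, so the same payload should work whether the backend is a crappy whitebox server or a public cloud bucket.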


HCI As Silo. Discuss.

His thoughts on HCI were, in my opinion, a little more controversial. “Hyperconverged infrastructure (HCI) aims to meet data’s changing needs through automatic tiering and centralized management. HCI systems have plenty of appeal as a fast fix to pay as you grow, but in the long run, these systems represent just another larger silo for enterprises to manage. In addition, since hyperconverged systems frequently require proprietary or dedicated hardware, customer choice is limited when more compute or storage is needed. Most environments don’t require both compute and storage in equal measure, so their budget is wasted when only more CPU or more capacity is really what applications need. Most HCI architectures rely on layers of caches to ensure good storage performance. Unfortunately, performance is not guaranteed when a set of applications running in a compute node overruns a cache’s capacity. As IT begins to custom-tailor storage capabilities to real data needs with metadata management software, enterprises will begin to move away from bulk deployments of hyperconverged infrastructure and instead embrace a more strategic data management role that leverages precise storage capabilities on premises and into the cloud.”

There’re a few nuggets in this one that I’d like to look at further. Firstly, the idea that HCI becomes just another silo to manage is an interesting one. It’s true that HCI as a technology is a bit different to the traditional compute / storage / network paradigm that we’ve been managing for the last few decades. I’m not convinced, however, that it introduces another silo of management. Or maybe, what I’m thinking is that you don’t need to let it become another silo to manage. Rather, I’ve been encouraging enterprises to look at their platform management at a higher level, focusing on the layer above the compute / storage / network to deliver automation, orchestration and management. If you build that capability into your environment, then whether you consume compute via rackmount servers, blades or HCI becomes less and less relevant. It’s easier said than done, of course, as it takes a lot of time and effort to get that layer working well. But the sweat investment is worth it.

Secondly, the notion that “[m]ost environments don’t require both compute and storage in equal measure, so their budget is wasted when only more CPU or more capacity is really what applications need” is accurate, but most HCI vendors now offer a way to expand storage or compute without necessarily growing the other components (think Nutanix with their storage-only nodes and NetApp’s approach to HCI). I’d posit that architectures among the HCI market leaders have changed enough that this is no longer a real issue.

Finally, I’m not convinced that “performance is not guaranteed when a set of applications running in a compute node overruns a cache’s capacity” is as much of a problem as it was a few years ago. Modern hypervisors have a lot of smarts built into them in terms of service quality, and the modelling for capacity and performance sizing has improved significantly.



I like Lance, and I like what Primary Data bring to the table with their policy-based SDS solution. I don’t necessarily agree with him on some of these points (particularly as I think HCI solutions have matured a bunch in the last few years) but I do enjoy the opportunity to think about some of these ideas when I otherwise wouldn’t. So what will 2018 bring in my opinion? No idea, but it’s going to be interesting, that’s for sure.

Primary Data Announces Survey Results (And A Very Useful Tool)

Survey Says …

Primary Data conducted a “Storage Census” at VMworld this year, and I had the opportunity to talk about the results with Lance Smith (CEO, Primary Data). You can read the press release here.


The Enterprise Responds

The majority of the 313 respondents worked at large enterprises with 1,000 to 5,000 or more employees, and their companies were mostly over ten years old.

[image courtesy of Primary Data]


Performance and Cold Data

The survey also found that:

  • 38 percent of respondents selected performance as their biggest challenge for IT, making it the most pressing IT issue named in the survey;
  • Data migrations were the second most common headache for 36 percent of those surveyed;
  • Budget challenges (35 percent) and cloud adoption (27 percent) closely followed among the top concerns for respondents;
  • The majority of organizations estimate that at least 60 percent of their data is cold; and
  • 44 percent of organizations manage ten or more storage systems, and 27 percent manage twenty or more.

Smith said that “[p]erformance remains a noisy problem, and the lack of insight into data forces IT to overprovision and overspend even though the professionals surveyed know that the majority of their data is actually sitting idle”. This leads me to the next part of Primary Data’s announcement: Data Profiler.


Data Profiler Could Be A Very Useful Tool

The team then ran me through a demo of a new tool available via DataSphere called Data Profiler. The Data Profiler provides a number of data points that change according to the chosen objectives, including:

  • Cost per storage tier;
  • Number of files per tier;
  • Capacity per tier; and
  • Total cost of the analyzed global storage ecosystem.

Data tiers can be easily added to the Data Profiler to evaluate policies using the resources available in each customer’s unique environment.

[image courtesy of Primary Data]

The cool thing about this tool is you can do all your own modelling by plugging in various tiers of storage, cost per MB/GB/TB, etc., and have it give you a clear view of potential cost savings, assuming that your data is, indeed, cold. You can then build these policies (Objectives) in DataSphere and have it move everything around to align with the policy you want to use. You want your hot data to live in the performance-oriented tiers, and you want the cold stuff to live in the cheap and deep parts. This is a simple thing to achieve when you have one array. But when you have a range of arrays from different vendors it becomes a little more challenging.
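As a rough illustration of the kind of modelling the tool does, here’s a toy version with made-up tiers and $/GB figures (none of these numbers come from Primary Data):

```python
# A toy version of Data Profiler-style modelling: plug in your own tiers and
# per-GB prices (these numbers are invented), then estimate the savings from
# moving cold data off the performance tier.
tiers = {
    "performance": {"cost_per_gb": 0.50},
    "capacity":    {"cost_per_gb": 0.10},
    "cloud":       {"cost_per_gb": 0.02},
}

def savings_if_cold_moved(data_gb: float, cold_fraction: float,
                          hot_tier: str, cold_tier: str) -> float:
    """Cost difference between keeping everything on the hot tier versus
    moving the cold portion to a cheaper tier."""
    all_hot = data_gb * tiers[hot_tier]["cost_per_gb"]
    split = (data_gb * (1 - cold_fraction) * tiers[hot_tier]["cost_per_gb"]
             + data_gb * cold_fraction * tiers[cold_tier]["cost_per_gb"])
    return all_hot - split

# 80 TB of data, 60% cold (the survey's low estimate for cold data)
print(round(savings_if_cold_moved(80_000, 0.60, "performance", "cloud"), 2))
```

Even with crude made-up numbers, the survey’s low estimate of 60 percent cold data cuts the bill by more than half, which is exactly the conversation the tool is designed to start.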



I wasn’t entirely surprised by the results of the survey, as a number of my enterprise customers have reported the same concerns. Enterprise storage can be hard to get right, and it’s clear that a lot of people don’t really understand the composition of their storage environment from an active data perspective.

I’ve been a fan of Pd’s approach to data virtualisation for some time, having observed the development of the company over the course of a number of Storage Field Day events. Any tool that you can use to get a better idea of how to best deploy your storage, particularly across heterogeneous environments, is a good thing in my book. I was advised by the team that this is built into DataSphere at the moment, although it’s easy enough to run a small script to gather the data to send back to their SEs for analysis. I believe there may also be an option to install the tool locally, but I’d recommend engaging with Primary Data’s team to find out more about that. As with most things in IT, the solution isn’t for everyone. If you’re running a single, small array, you may have a good idea of what it costs to use various tiers of storage, how old your data is and how frequently it’s being accessed. But there are plenty of enterprises out there with a painful variety of storage solutions deployed for any number of (usually) good reasons. These places could potentially benefit greatly from some improved insights into their environments.

I’ve said it before and I’ll keep saying it – Enterprise IT can be hard to do right. Any tools you can use to make things easier and potentially more cost effective should be on your list of things to investigate.

Primary Data Attacks Application Ignorance

Disclaimer: I recently attended Storage Field Day 13.  My flights, accommodation and other expenses were paid for by Tech Field Day and Pure Storage. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.


Primary Data recently presented at Storage Field Day 13. You can see videos of their presentation here, and download my rough notes from here.


My Applications Are Ignorant

I had the good fortune of being in a Primary Data presentation at Storage Field Day. I’ve written about them a few times (here, here and here) so I won’t go into the basic overview again. What I would like to talk about is the idea raised by Primary Data that “applications are unaware”. Unaware of what exactly? Usually it’s the underlying infrastructure. When you deploy applications they generally want to run as fast as they can, or as fast as they need to. When you think about it, though, it’s unusual for them to be able to determine the platform they run on and pace themselves to what the supporting infrastructure allows. This is different to phone applications, for example, which are normally written to operate within the constraints of the hardware.

An application’s ignorance has an impact. According to Primary Data, this impact can be in terms of performance, cost, protection (or all three). The cost of unawareness can have the following impact on your environment:

  • Bottlenecks hinder performance
  • Cold data hogs hot capacity
  • Over provisioning creates significant overspending
  • Migration headaches keep data stuck until retirements
  • Vendor lock-in limits agility and adds more cost

As well as this, the following trends are being observed in the data centre:

  • Cost: Budgets are getting smaller;
  • Time: We never have enough; and
  • Resources: We have limited resources and infrastructure to run all of this stuff.

Put this all together and you’ve got a problem on your hands.


Primary Data In The Mix

Primary Data tells us it solves these pain points with DataSphere, “a metadata engine that automates the flow of data across the enterprise infrastructure and the cloud to meet evolving application demands”. It:

  • Is storage and vendor agnostic;
  • Virtualises the view of data;
  • Automates the flow of data; and
  • Solves the inefficiency of traditional storage and compute architectures.


Can We Be More Efficient?

Probably. But the traditional approach of architecting infrastructure for various workloads isn’t really working as well as we’d like. I like the way Primary Data are solving the problem of application ignorance. But I think it’s treating a symptom, rather than providing a cure. I’m not suggesting that I think what Pd are doing is wrong by any stretch, but rather that my applications will still remain ignorant. They’re still not going to have an appreciation of the infrastructure they’re running on, and they’re still going to run at the speed they want to run at. That said, with the approach that Primary Data takes to managing data, I have a better chance of having applications running with access to the storage resources they need.

Application awareness means different things to different people. For some people, it’s about understanding how the application is going to behave based on the constraints it was designed within, and what resources they think it will need to run as expected. For other people, it’s about learning the behaviour of the application based on past experiences of how the application has run and providing infrastructure that can accommodate that behaviour. And some people want their infrastructure to react to the needs of the application in real time. I think this is probably the nirvana of infrastructure and application interaction.

Ultimately, I think Primary Data provides a really cool way of managing various bits of heterogeneous storage in a way that aligns with some interesting notions of how applications should behave. I think the way they pick up on the various behaviours of applications within the infrastructure and move data around accordingly is also pretty neat. I think we’re still some ways away from running the kind of infrastructure that interacts intelligently with applications at the right level, but Primary Data’s solution certainly helps with some of the pain of running ignorant applications.

You can read more about DataSphere Extended Services (DSX) here, and the DataSphere metadata engine here.

Data Virtualisation is More Than Just Migration for Primary Data

Disclaimer: I recently attended Storage Field Day 10.  My flights, accommodation and other expenses were paid for by Tech Field Day. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.


Before I get started, you can find a link to my raw notes on Primary Data’s presentation here. You can also see videos of the presentation here. I’ve seen Primary Data present at SFD7 and SFD8, and I’ve typically been impressed with their approach to Software-Defined Storage (SDS) and data virtualisation generally. And I’m also quite a fan of David Flynn‘s whiteboarding chops.



Data Virtualisation is More Than Just Migration

Primary Data spent some time during their presentation at SFD10 talking about Data Migration vs Data Mobility.


[image courtesy of Primary Data]

Data migration can be a real pain to manage. It’s quite often a manual process and is invariably tied to the capabilities of the underlying storage platform hosting the data. The cool thing about Primary Data’s solution is that it offers dynamic data mobility, aligning “data’s needs (objectives) with storage capabilities (service levels) through automated mobility, arbitrated by economic value and reported as compliance”. Sounds like a mouthful, but it’s a nice way of defining pretty much what everyone’s been trying to achieve with storage virtualisation solutions for the last decade or longer.

What I like about this approach is that it’s data-centric, rather than storage-platform-focused. Primary Data supports “anything that can be presented to Linux as a block device”, so the options to deploy this stuff are fairly broad. Once you’ve presented your data to DSX, there are some smart service level objectives (SLOs) that can be applied to the data. These can be broken down into the categories of protection, performance, and price/penalty:


Protection

  • Durability
  • Availability
  • Recoverability
  • Security
  • Priority
  • Sovereignty

Performance

  • IOPS / Bandwidth / Latency (Read / Write)
  • Sustained / Burst

Price / Penalty

  • Per File
  • Per Byte
  • Per Operation

Access Control can also be applied to your data. With Primary Data, “[e]very storage container is a landlord with floorspace to lease and utilities available (capacity and performance)”.
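Pulling those categories together, here’s how I imagine an objective might be matched against storage “landlords” advertising their capabilities. This is my own sketch, not DataSphere’s actual data model, and every field name and figure is invented:

```python
# A hypothetical data objective covering the three SLO categories:
# protection, performance, and price/penalty.
objective = {
    "protection": {"sovereignty": "AU"},
    "performance": {"read_iops": 5000, "write_latency_ms": 5},
    "price_per_gb": 0.25,  # the most this data is worth paying for
}

# Storage "landlords" with floorspace to lease, each advertising capabilities.
containers = [
    {"name": "all-flash", "sovereignty": "AU", "read_iops": 50_000,
     "write_latency_ms": 1, "price_per_gb": 0.60},
    {"name": "hybrid",    "sovereignty": "AU", "read_iops": 8_000,
     "write_latency_ms": 4, "price_per_gb": 0.20},
]

def meets_objective(c: dict, o: dict) -> bool:
    """True if a container satisfies the objective's protection,
    performance, and price requirements."""
    return (c["sovereignty"] == o["protection"]["sovereignty"]
            and c["read_iops"] >= o["performance"]["read_iops"]
            and c["write_latency_ms"] <= o["performance"]["write_latency_ms"]
            and c["price_per_gb"] <= o["price_per_gb"])

print([c["name"] for c in containers if meets_objective(c, objective)])
```

Note that the all-flash container meets every technical requirement but fails on price, which is exactly the sort of trade-off the price/penalty category exists to arbitrate.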


Further Reading and Final Thoughts

I like the approach to data virtualisation that Primary Data have taken. There are a number of tools on the market that claim to fully virtualise storage and offer mobility across platforms. Some of them do it well, and some focus more on the benefits provided around ease of migration from one platform to another.

That said, there’s certainly some disagreement in the market place on whether Primary Data could be considered a fully-fledged SDS solution. Be that as it may, I really like the focus on data, rather than silos of storage. I’m also a big fan of applying SLOs to data, particularly when it can be automated to improve the overall performance of the solution and make the data more accessible and, ultimately, more valuable.

Primary Data has a bunch of use cases that extend beyond data mobility as well, with deployment options including hyperconverged, software-defined NAS, and clustering across existing storage platforms. Primary Data want to “do for storage what VMware did for compute”. I think the approach they’ve taken has certainly gotten them on the right track, and the platform has matured greatly in the last few years.

If you’re after some alternative (and better thought out) posts on Primary Data, you can read Jon‘s post here. Max also did a good write-up here, while Chris M. Evans did a nice preview post on Primary Data that you can find here.

Primary Data – Because we all want our storage to do well

Disclaimer: I recently attended Storage Field Day 8.  My flights, accommodation and other expenses were paid for by Tech Field Day. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.

For each of the presentations I attended at SFD8, there are a few things I want to include in the post. Firstly, you can see video footage of the Primary Data presentation here. You can also download my raw notes from the presentation here. Finally, here’s a link to the Primary Data website that covers some of what they presented.


Primary Data presented at Storage Field Day 7 – and I did a bit of a write-up here. At the time they had no shipping product, but now they do. Primary Data were big on putting data in the right place, and they’ve continued that theme with DataSphere.


Because we all want our storage to do well

I talk to my customers a lot about the concept of service catalogues for their infrastructure. Everyone is talking about X as a Service and the somewhat weird concept of applying consumer behaviour to the enterprise. But for a long time this approach has been painful, at least with mid-range storage products, because coming up with classifications of performance and availability for these environments is a non-trivial task. In larger environments, it’s also likely you won’t have consistent storage types across applications, with buckets of data being stored all over the place, and accessible via a bunch of different protocols. The following image demonstrates nicely the different kinds of performance levels you might apply to your environment, as not all applications were created equal. Neither are storage arrays, come to think of it.


[image courtesy of Primary Data]

Primary Data say that “every single capability and characteristic of the storage system can be thought of in terms of whether the data needs it or not”. As I said before, you need to look at your application requirements in terms of both:

  • Performance (reads vs writes, IOPS, bandwidth, latency); and
  • Protection (data durability, availability, security).

Of course, this isn’t simple when you then attempt to apply compute requirements and network requirements to the applications as well, but let’s just stick with storage requirements for the time being. Once you understand the client, and the value of the objectives being met on the data, you can start to apply Objectives and “Smart Objectives” (Objectives applied to particular types of data) to the data. With this approach, you can begin to understand the cost of specific performance and protection levels. Everyone wants Platinum, until they start having to pay for it. These costs can then be translated and presented as Service Level Agreements in your organisation’s service catalogue for consumption by various applications.
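To illustrate the “everyone wants Platinum until they start having to pay for it” point, here’s a toy service catalogue with a simple chargeback calculation. The tiers, capabilities and prices are all invented for the example:

```python
# A toy service catalogue; the tiers, capabilities and prices are invented.
catalogue = {
    "Platinum": {"iops": 20_000, "availability": "99.99%", "price_gb_month": 0.80},
    "Gold":     {"iops": 5_000,  "availability": "99.9%",  "price_gb_month": 0.30},
    "Bronze":   {"iops": 500,    "availability": "99%",    "price_gb_month": 0.05},
}

def monthly_bill(tier: str, gb: int) -> float:
    """What an application owner would be charged back per month."""
    return catalogue[tier]["price_gb_month"] * gb

# Pricing 10 TB on each tier makes the cost of "wanting Platinum" obvious
for tier in catalogue:
    print(f"{tier}: ${monthly_bill(tier, 10_000):,.2f} per month for 10 TB")
```

Once those per-tier numbers show up in the service catalogue as chargeback, application owners tend to get a lot more realistic about which SLA they actually need.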


Closing Thoughts and Further Reading

Primary Data has a lot of smart people involved in the product, and they always put on one hell of a whiteboard session when I see them present. To my mind, the key thing to understand with DataSphere isn’t that it will automatically transform your storage environment into a fully optimised, service catalogue enabled, application performance nirvana. Rather, it’s simply providing the smarts to leverage the insights that you provide. If you don’t have a service catalogue, or a feeling in your organisation that this might be a good thing to have, then you’re not going to get the full value out of Primary Data’s offering.

And while you’re at it, check out Cormac’s post for a typically thorough overview of the technology involved.



Storage Field Day 7 – Day 1 – Primary Data

Disclaimer: I recently attended Storage Field Day 7.  My flights, accommodation and other expenses were paid for by Tech Field Day. There is no requirement for me to blog about any of the content presented and I am not compensated in any way for my time at the event.  Some materials presented were discussed under NDA and don’t form part of my blog posts, but could influence future discussions.

For each of the presentations I attended at SFD7, there are a few things I want to include in the post. Firstly, you can see video footage of the Primary Data presentation here. You can also download my raw notes from the presentation here. Finally, here’s a link to the Primary Data website that covers some of what they presented.

Company Overview

Here’s a slightly wonky photo of Lance Smith providing the company overview.



If you haven’t heard of Primary Data before, they came out of stealth in November. Their primary goal is “Automated data mobility through data virtualisation” with a focus on intelligent, policy-driven automation for storage environments. Some of the key principles driving the development of the product are:

  • Dynamic Data Mobility – see and place data across all storage resources within a single global data space;
  • Policy-driven agility – non-disruptive, policy-based data movement;
  • Intelligent automation – automated real-time infrastructure dynamically aligns supply and demand;
  • Linear scalability – performance and capacity scales linearly and incrementally; and
  • Global compatibility – single hardware-agnostic solution enhances coexisting legacy IT and modern scale-out and hybrid cloud architectures.



David Flynn then launched into an engaging whiteboard session on the Primary Data architecture.


With storage, you have three needs – Performance, Price, Protection (Fast, Safe, Cheap). As most of us know but few of us wish to admit, you can’t have all three at the same time. This isn’t going to change, according to David. Indeed, the current approach is to manage data via the storage container that holds it, which David described as the tail wagging the dog.

So how does Primary Data get around the problem? Separate the metadata from the data.

Primary Data:

  • Uses a pNFS client;
  • Offers file on file, on block, on object, on DAS;
  • Block as file;
  • Object; and
  • Splits the metadata and control path off to the side.

Primary Data also claim that 80% of IOPS to primary storage is to storage that doesn’t need to exist after a crash (temp, scratch, swap, etc).

David talked about how, when VMware first did virtualisation, there were a few phases:

1. Utilisation – This was the “doorknocker” use case that got people interested in virtualisation.

2. Manageability – This is what got people sticking with virtualisation.

Now along comes Primary Data, doing Data Virtualisation that also offers:

3. Performance.

Because, once you’ve virtualised the data, the problem becomes setting the objectives for the storage and the needs of the data. This is where Primary Data claim that their policy-based automation really helps organisations get the most from their storage platforms, and thus, their applications and data.


Closing Thoughts and Further Reading

Primary Data have some great pedigree and a lot of prior experience in the storage industry. There’s a lot more to the product than I’ve covered here, and it’s worth your while revisiting the video presentation they did at SFD7. They’ve taken an interesting approach, and I’m looking forward to hearing more about how it goes for them when they start shipping GA code (which they expect to do later this year).


Mark has a good write-up here, while Keith’s preview blog post is here and his excellent post-presentation discussion post can be found here.