Storage Field Day – I’ll Be At Storage Field Day 18

Here’s some good news for you. I’ll be heading to the US in late February for another Storage Field Day event. If you haven’t heard of the very excellent Tech Field Day events, you should check them out. I’m looking forward to time travel and spending time with some really smart people for a few days. It’s also worth checking back on the Storage Field Day 18 website during the event (February 27 – March 1) as there’ll be video streaming and updated links to additional content. You can also see the list of delegates and event-related articles that have been published.

I think it’s a great line-up of both delegates and presenting companies (including a “secret company”) this time around. I know them all pretty well, but there may also still be a few companies added to the line-up. I’ll update this if and when they’re announced.

I’d like to publicly thank in advance the nice folks from Tech Field Day who’ve seen fit to have me back, as well as my employer for letting me take time off to attend these events. Also big thanks to the companies presenting. It’s going to be a lot of fun. Seriously. If you’re in the Bay Area and want to catch up prior to the event, please get in touch. I’ll have some free time, so perhaps we could check out a Warriors game on the 23rd and discuss the state of the industry? ;)

OpenMediaVault – Good Times With mdadm

Happy 2019. I’ve been on holidays for three full weeks and it was amazing. I’ll get back to writing about boring stuff soon, but I thought I’d post a quick summary of some issues I’ve had with my home-built NAS recently and what I did to fix it.

Where Have The Disks Gone?

I got an email one evening with the following message.

I do enjoy the “Faithfully yours, etc” and the postscript is the most enlightening bit. See where it says [UU____UU]? Yeah, that’s not good. There are 8 disks that make up that device (/dev/md0), so it should look more like [UUUUUUUU]. But why would 4 out of 8 disks just up and disappear? I thought it was a little odd myself. I had a look at the ITX board everything was attached to and realised that those 4 drives were plugged into a PCI SATA-II card. It seems that either the slot on the board or the card is now failing intermittently. I say “seems” because that’s all I can think of, as the S.M.A.R.T. status of the drives is fine.
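If you want to read that status string programmatically rather than by eye, here is a minimal sketch. It assumes only what the message shows: one character per member slot, with "U" meaning the member is up and "_" meaning it is missing. The function name summarise_md_status is made up for illustration and is not part of mdadm.

```python
def summarise_md_status(status):
    """Summarise an md status string such as "[UU____UU]"."""
    flags = status.strip().strip("[]")
    # Anything other than U (up) or _ (missing) is outside what this sketch handles.
    if not flags or set(flags) - {"U", "_"}:
        raise ValueError(f"unexpected status string: {status!r}")
    missing = [slot for slot, flag in enumerate(flags) if flag == "_"]
    return {
        "total": len(flags),
        "up": len(flags) - len(missing),
        "missing_slots": missing,
    }


summary = summarise_md_status("[UU____UU]")
print(summary["up"], "of", summary["total"], "members up; missing slots:", summary["missing_slots"])
```

For the string in the email this reports slots 2 to 5 as missing, which lines up with the four slots that have their event counts forced in the mdadm output further down.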

Resolution, Baby

The short-term fix to get the filesystem back online and usable was the classic “assemble” switch with mdadm. Long-time readers of this blog may have witnessed me doing something similar with my QNAP devices from time to time. After panic rebooting the box a number of times (a silly thing to do, really), it finally responded to pings. Checking out /proc/mdstat wasn’t good though.

dan@openmediavault:~$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
unused devices: <none>

Notice the lack of, erm, devices there? That’s non-optimal. The fix requires a forced assembly of the devices comprising /dev/md0.

dan@openmediavault:~$ sudo mdadm --assemble --force --verbose /dev/md0 /dev/sd[abcdefhi]
[sudo] password for dan:
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sde is identified as a member of /dev/md0, slot 5.
mdadm: /dev/sdf is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sdh is identified as a member of /dev/md0, slot 7.
mdadm: /dev/sdi is identified as a member of /dev/md0, slot 6.
mdadm: forcing event count in /dev/sdd(2) from 40639 upto 40647
mdadm: forcing event count in /dev/sdc(3) from 40639 upto 40647
mdadm: forcing event count in /dev/sdf(4) from 40639 upto 40647
mdadm: forcing event count in /dev/sde(5) from 40639 upto 40647
mdadm: clearing FAULTY flag for device 3 in /dev/md0 for /dev/sdd
mdadm: clearing FAULTY flag for device 2 in /dev/md0 for /dev/sdc
mdadm: clearing FAULTY flag for device 5 in /dev/md0 for /dev/sdf
mdadm: clearing FAULTY flag for device 4 in /dev/md0 for /dev/sde
mdadm: Marking array /dev/md0 as 'clean'
mdadm: added /dev/sdb to /dev/md0 as 1
mdadm: added /dev/sdd to /dev/md0 as 2
mdadm: added /dev/sdc to /dev/md0 as 3
mdadm: added /dev/sdf to /dev/md0 as 4
mdadm: added /dev/sde to /dev/md0 as 5
mdadm: added /dev/sdi to /dev/md0 as 6
mdadm: added /dev/sdh to /dev/md0 as 7
mdadm: added /dev/sda to /dev/md0 as 0
mdadm: /dev/md0 has been started with 8 drives.

In this example you’ll see that /dev/sdg isn’t included in my command. That device is the SSD I use to boot the system. Sometimes Linux device conventions confuse me too. If you’re in this situation and you think this is just a one-off thing, then you should be okay to unmount the filesystem, run fsck over it, and re-mount it. In my case, this has happened twice already, so I’m in the process of moving data off the NAS onto some scratch space and have procured a cheap little QNAP box to fill its role.
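If the bracket expression in that command looks cryptic, the sketch below shows which device names it selects. It uses Python's fnmatchcase, which follows similar bracket rules to the shell, against a hypothetical list of nine device names. A real shell only expands the pattern to paths that actually exist, so treat this as an illustration of the pattern rather than of what is on any given system.

```python
from fnmatch import fnmatchcase

PATTERN = "/dev/sd[abcdefhi]"  # the pattern from the mdadm command above

# Hypothetical device list: sda through sdi, as on the box described here.
candidates = [f"/dev/sd{letter}" for letter in "abcdefghi"]

selected = [dev for dev in candidates if fnmatchcase(dev, PATTERN)]
skipped = [dev for dev in candidates if not fnmatchcase(dev, PATTERN)]

print("passed to mdadm:", " ".join(selected))
print("left out:", " ".join(skipped))
```

The only name left out is /dev/sdg, the boot SSD.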

Conclusion

My rush to replace the homebrew device with a QNAP isn’t a knock on the OpenMediaVault project by any stretch. OMV itself has been very reliable and has done everything I needed it to do. Rather, my ability to build semi-resilient devices on a budget has simply proven quite poor. I’ve seen some nasty stuff happen with QNAP devices too, but at least any issues will be covered by some kind of manufacturer’s support team and warranty. My NAS is only covered by me, and I’m just not that interested in working out what could be going wrong here. If I’d built something decent I’d get some alerting back from the box telling me what’s happened to the card that keeps failing. But then I would have spent a lot more on this box than I would have wanted to.

I’ve been lucky thus far in that I haven’t lost any data of real import (the NAS devices are used to store media that I have on DVD or Blu-Ray – the important documents are backed up using Time Machine and Backblaze). It is nice, however, that a tool like mdadm can bring you back from the brink of disaster in a pretty efficient fashion.

Incidentally, if you’re a macOS user, you might have a bunch of .DS_Store files on your filesystem. Or stuff like .@Thumb or some such. These things are fine, but macOS doesn’t seem to like them when you’re trying to move folders around. This post provides some handy guidance on how to get rid of those files in a jiffy.
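For what it's worth, here is one way to do that clean-up in Python. It is a sketch of the general idea and not necessarily the method the linked post describes. The names find_junk and remove_junk are made up for illustration, it only targets .DS_Store, and it defaults to a dry run so nothing is deleted until you ask for it. The demo runs against a throwaway temporary directory rather than a real share.

```python
import os
import tempfile
from pathlib import Path

JUNK_NAMES = {".DS_Store"}  # assumption: extend this set to suit your own clutter


def find_junk(root, junk_names=JUNK_NAMES):
    """Yield every file under root whose name is in junk_names."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in junk_names:
                yield Path(dirpath) / name


def remove_junk(root, dry_run=True):
    """Return the junk files found; delete them only when dry_run is False."""
    found = sorted(find_junk(root))
    if not dry_run:
        for path in found:
            path.unlink()
    return found


# Demo against a temporary directory standing in for a media share.
with tempfile.TemporaryDirectory() as tmp:
    films = Path(tmp) / "media" / "films"
    films.mkdir(parents=True)
    (films / ".DS_Store").write_bytes(b"")
    (films / "keep.mkv").write_bytes(b"")

    listed = remove_junk(tmp)                   # dry run: report only
    still_there = (films / ".DS_Store").exists()
    removed = remove_junk(tmp, dry_run=False)   # actually delete
    gone = not (films / ".DS_Store").exists()
    kept = (films / "keep.mkv").exists()
```

Run it in dry-run mode first and check the list before deleting anything on a volume you care about.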

As always, if the data you’re storing on your NAS device (be it home-built or off the shelf) is important, please make sure you back it up. Preferably in a number of places. Don’t get yourself in a position where this blog post is your only hope of getting your one copy of your firstborn’s pictures from the first day of school back.

Random Short Take #10

Here are a few links to some random news items and other content that I found interesting. You might find it interesting too. Maybe. This will be the last one for this year. I hope you and yours have a safe and merry Christmas / holiday break.

  • Scale Computing have finally entered the Aussie market in partnership with Amnesium. You can read more about that here.
  • Alastair is back in the classroom, teaching folks about AWS. He published a bunch of very useful notes from a recent class here.
  • The folks at Backblaze are running a “Refer-A-Friend” promotion. If you’re looking to become a new Backblaze customer and sign up with my referral code, you’ll get some free time on your account. And I will too! Hooray! I’ve waxed lyrical about Backblaze before, and I recommend it. The offer runs out on January 6th 2019, so get a move on.
  • Howard did a nice article on VVols that I recommend checking out.
  • GDPR has been a challenge (within and outside the EU), but I enjoyed Mark Browne’s take on Cohesity’s GDPR compliance.
  • I’m quite a fan of the Netflix Tech Blog, and this article on the Netflix Media Database was a ripper.
  • From time to time I like to poke fun at my friends in the US for what seems like an excessive amount of shenanigans happening in that country, but there’s plenty of boneheaded stuff happening in Australia too. Read Preston’s article on the recently passed anti-encryption laws to get a feel for the heady heights of stupidity that we’ve been able to reach recently.

Updated Articles Page

I recently had the opportunity to upgrade my Cohesity lab environment using Helios and thought I’d run through the basics. There’s a new document outlining the process on the articles page.

Google WiFi – A Few Notes

Like a lot of people who work in IT as their day job, the IT situation at my house is a bit of a mess. I think the real reason for this is that, once the working day is done, I don’t want to put any thought into doing this kind of stuff. As a result, like a lot of tech folk, I have way more devices and blinking lights in my house than I really need. And I’m always sure to pile on a good helping of technical debt any time I make any changes at home. It wouldn’t be any fun without random issues to deal with from time to time.

Some Background – Apple Airport

I’ve been running an Apple Airport Extreme and a number of Airport Express devices in my house for a while in a mesh network configuration. Our house is 2 storeys and it was too hard to wire up properly with Ethernet after we bought it. I liked the Apple devices primarily because of the easy-to-use interface (via browser or phone), and Airplay, in my mind at least, was a killer feature. So I’ve stuck with these things for some time, despite the frequent flakiness I experienced with the mesh network (I’d often end up connected to an isolated access point with no network access – a reboot of the base station seemed to fix this) and the sometimes frustrating lack of visibility into what was going on in the network.

Enter Google Wifi

I had some Frequent Flier points available that meant I could get a 3-pack of Google access points for under $200 AU (I think that’s about $15 in US currency). I’d already put up the Christmas tree, so I figured I could waste a few hours on re-doing the home network. I’m not going to do a full review of the Google Wifi solution, but if you’re interested in that kind of thing, Josh Odgers does a great job of that here. In short, it took me about an hour to place the three access points in the house and get everything connected. I have about 30 – 40 devices running, some of which are hardwired to a switch connected to my ISP’s NBN gateway, and most of which connect wirelessly. 

So What’s The Problem?

The problem was that I’d kind of just jammed the primary Google Wifi point into the network (attached to a dumb switch downstream of the modem). As a result, everything connecting wirelessly via the Google network had an address in the 192.168.86.x range, and all of my other devices were in the existing 10.x.x.x range. This wasn’t a massive problem, as the Google solution does a great job of routing stuff between the “wan” and “lan” subnets, but I started to notice that my pi-hole device wasn’t picking up hostnames properly, and some devices were getting confused about which DNS to use. Oh, and my port mapping for Plex was a bit messed up too. I also had wired devices (e.g. my desktop machine) that couldn’t see Airplay devices on the wireless network without turning on Wifi.
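A quick way to see the split is to check which of the two ranges a pair of addresses falls into. The sketch below uses Python's standard ipaddress module. The /24 and /8 prefix lengths are assumptions (the post only gives "192.168.86.x" and "10.x.x.x"), the host addresses are invented, and same_segment is a made-up helper name.

```python
import ipaddress

# Assumed prefix lengths; the post only says "192.168.86.x" and "10.x.x.x".
GOOGLE_WIFI_LAN = ipaddress.ip_network("192.168.86.0/24")
EXISTING_LAN = ipaddress.ip_network("10.0.0.0/8")


def same_segment(a, b, networks=(GOOGLE_WIFI_LAN, EXISTING_LAN)):
    """Return True if both addresses sit inside the same one of the given networks."""
    addr_a, addr_b = ipaddress.ip_address(a), ipaddress.ip_address(b)
    return any(addr_a in net and addr_b in net for net in networks)


# Hypothetical hosts: a wireless speaker, a wired desktop, and the pi-hole.
speaker, desktop, pihole = "192.168.86.25", "10.0.0.12", "10.0.0.53"

print(same_segment(speaker, desktop))  # different ranges
print(same_segment(desktop, pihole))   # same range
```

This probably explains the Airplay symptom as well: Airplay discovery relies on multicast DNS (Bonjour), which is link-local and is not normally forwarded by a router between subnets, so a wired machine on one side would not hear announcements from the other.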

The Solution?

After a lot of Googling, I found part of the solution via this Reddit thread. Basically, what I needed to do was follow a more structured topology, with my primary Google device hanging off my ISP’s switch (and connected via the “wan” port on the Google Wifi device). I then connected the “lan” port on the Google device to my downstream switch (the one with the pi-hole, NAS devices, and other stuff connected to it). 

Now the pi-hole could play nicely on the network, and I could point my devices to it as the DNS server via the Google interface. I also added a few more reservations into my existing list of hostnames on the pi-hole (instructions here) so that it could correctly identify any non-DHCP clients. I also changed the DHCP range on the Google Wifi to a single IP address (the one used by the pi-hole) and made sure that there was a reservation set for the pi-hole on the Google side of things. The reason for this (I think) is that you can’t disable DHCP on the Google Wifi device. To solve the Plex port mapping issue, I set a manual port mapping on my ISP modem and pointed it to the static IP address of the primary Google Wifi device. I then created a port mapping on the Google side of things to point to my Plex Media Server. It took a little while, but eventually everything started to work. 

It’s also worth noting that I was able to reconfigure the Airport Express devices connected to speakers to join the new Wifi network and I can still use Airplay around the house as I did before.

Conclusion 

This seems like a lot of mucking about for what is meant to be a plug and play wireless solution. In Google’s defence though, my home network topology is a bit more fiddly than the average punter’s would be. If I wasn’t so in love with pi-hole, and didn’t have devices that I wanted to use static IP addresses and DNS, then I wouldn’t have had as many problems as I did with the setup. From a performance and usability standpoint, I think the Google solution is excellent. Of course, this might all go to hell in a handbasket when I ramp up IPv6 in the house, but for now it’s been working well. Couple that with the fact that my networking skills are pretty subpar, and we should all just be happy I was able to post this article on the Internet from my house.

Elastifile Announces Cloud File Service

Elastifile recently announced a partnership with Google to deliver a fully-managed file service via the Google Cloud Platform. I had the opportunity to speak with Jerome McFarland and Dr Allon Cohen about the announcement and thought I’d share some thoughts here.

What Is It?

Elastifile Cloud File Service delivers a self-service SaaS experience, providing the ability to consume scalable file storage that’s deeply integrated with Google infrastructure. You could think of it as similar to Amazon’s EFS.

[image courtesy of Elastifile]

Benefits

Easy to Use

Why would you want to use this service? It:

  • Eliminates manual infrastructure management;
  • Provisions turnkey file storage capacity in minutes; and
  • Can be delivered in any zone, and any region.

Elastic

It’s also cloudy in a lot of the right ways you want things to be cloudy, including:

  • Pay-as-you-go, consumption-based pricing;
  • Flexible pricing tiers to match workflow requirements; and
  • The ability to start small and scale out or in as needed and on-demand.

Google Native

One of the real benefits of this kind of solution, though, is the deep integration with Google’s Cloud Platform.

  • The UI, deployment, monitoring, and billing are fully integrated;
  • You get a single bill from Google; and
  • The solution has been co-engineered to be GCP-native.

[image courtesy of Elastifile]

What About Cloud Filestore?

With Google’s recently announced Cloud Filestore, you get:

  • A single storage tier selection (Standard or SSD);
  • In-cloud availability only; and
  • The ability to grow capacity or performance up to a tier’s capacity.

With Elastifile’s Cloud File Service, you get access to the following features:

  • Aggregated performance and capacity of many VMs
  • Elastic scale-out or scale-in, on demand
  • Multiple service tiers for cost flexibility
  • Hybrid cloud, multi-zone / region and cross-cloud support

You can also use ClearTier to perform tiering between file and object without any application modification.

Thoughts

I’ve been a fan of Elastifile for a little while now, and I thought their 3.0 release had a fair bit going for it. As you can see from the list of features above, Elastifile are really quite good at leveraging all of the cool things about cloud – it’s software only (someone else’s infrastructure), reasonably priced, flexible, and scalable. It’s a nice change from some vendors who have focussed on being in the cloud without necessarily delivering the flexibility that cloud solutions have promised for so long. Couple that with a robust managed service and some preferential treatment from Google and you’ve got a compelling solution.

Not everyone will want or need a managed service to go with their file storage requirements, but if you’re an existing GCP and / or Elastifile customer, this will make some sense from a technical assurance perspective. The ability to take advantage of features such as ClearTier, combined with the simplicity of keeping it all under the Google umbrella, has a lot of appeal. Elastifile are in the box seat now as far as these kinds of offerings are concerned, and I’m keen to see how the market responds to the solution. If you’re interested in this kind of thing, the Early Access Program opens December 11th with general availability in Q1 2019. In the meantime, if you’d like to try out ECFS on GCP, you can sign up here.

Cisco IT Blog Awards

I’m very happy to announce that this blog is a finalist in the 2018 Cisco IT Blog Awards under the category of “Most Entertaining”. Voting is open until January 4th 2019, so if you’ve felt entertained at any point this year when reading my witty articles please go to http://cs.co/itblogawards and pop in a vote for “PenguinPunk”.

And if you are not entertained, check out some of the other entrants in any case – they’re pretty ace.

Big Switch Announces AWS Public Cloud Monitoring

Big Switch Networks recently announced Big Mon for AWS. I had the opportunity to speak with Prashant Gandhi (Chief Product Officer) about the announcement and thought I’d share some thoughts here.

The Announcement

Big Switch describe Big Monitoring Fabric Public Cloud (its real product name) as “a seamless deep packet monitoring solution that enables workload monitoring within customer specified Virtual Private Clouds (VPCs). All components of the solution are virtual, with elastic scale-out capability based on traffic volumes.”

[image courtesy of Big Switch]

There are some real benefits to be had, including:

  • Complete AWS Visibility;
  • Multi-VPC support;
  • Elastic scaling; and
  • Consistency with the on-premises offering.

Capabilities

  • Centralised packet and flow-based monitoring of all VPCs of a user account
  • Visibility-related traffic is kept local for security purposes and cost savings
  • Monitoring and security tools are centralised and tagged within the dedicated VPC for ease of configuration
  • Role-based access control enables multiple teams to operate Big Mon 
  • Supports centralised AWS VPC tool farm to reduce monitoring cost
  • Integrated with Big Switch’s Multi-Cloud Director for centralised hybrid cloud management

Thoughts and Further Reading

It might seem a little odd that I’m covering news from a network platform vendor on this blog, given the heavy focus I’ve had over the years on storage and virtualisation technologies. But the world is changing. I work for a Telco now and cloud is dominating every infrastructure and technology conversation I’m having. Whether it’s private or public or hybrid, cloud is everywhere, and networks are a big part of that cloud conversation (much as they have been in the data centre), as is visibility into those networks.

Big Switch have been around for under 10 years, but they’ve already made some decent headway with their switching platform and east-west monitoring tools. They understand cloud networking, and particularly the challenges facing organisations leveraging complicated cloud networking topologies. 

I’m the first guy to admit that my network chops aren’t as sharp as they could be (if you watched me set up some Google WiFi devices over the weekend, you’d understand). But I also appreciate that visibility is key to having control over what can sometimes be an overly elastic / dynamic infrastructure. It’s been hard to see traffic between availability zones, between instances, and contained in VPNs. I also like that they’ve focussed on a consistent experience between the on-premises offering and the public cloud offering.

If you’re interested in learning more about Big Switch Networks, I also recommend checking out their labs.

Cohesity – Cohesity Cluster Virtual Edition ESXi – A Few Notes

I’ve covered the Cohesity appliance deployment in a howto article previously. I’ve also made use of the VMware-compatible Virtual Edition in our lab to test things like cluster to cluster replication and cloud tiering. The benefits of virtual appliances are numerous. They’re generally easy to deploy, don’t need dedicated hardware, can be re-deployed quickly when you break something, and can be a quick and easy way to validate a particular process or idea. They can also be a problem with regards to performance, and are at the mercy of the platform administrator to a point. But aren’t we all? With 6.1, Cohesity have made available a clustered virtual edition (the snappily titled Cohesity Cluster Virtual Edition ESXi). If you have access to the documentation section of the Cohesity support site, there’s a PDF you can download that explains everything. I won’t go into too much detail but there are a few things to consider before you get started.

Specifications

Base Appliance 

Just like the non-clustered virtual edition, there are small and large configurations you can choose from. The small configuration supports up to 8TB for the Data disk, while the large configuration supports up to 16TB for the Data disk. The small configuration supports 4 vCPUs and 16GB of memory, while the large configuration supports 8 vCPUs and 32GB of memory.

Disk Configuration

Once you’ve deployed the appliance, you’ll need to add the Metadata disk and Data disk to each VM. The Metadata disk should be between 512GB and 1TB. For the large configuration, you can also apparently configure 2x 512GB disks, but I haven’t tried this. The Data disk needs to be between 512GB and 8TB for the small configuration and up to 16TB for the large configuration (with support for 2x 8TB disks). Cohesity recommends that these are formatted as Thick Provision Lazy Zeroed and deployed in Independent – Persistent mode. Each disk should be attached to its own SCSI controller as well, so you’ll have the system disk on SCSI 0:0, the Metadata disk on SCSI 1:0, and so on.
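Those size limits are easy to get wrong when you are adding disks to several VMs by hand, so here is a small pre-flight check. It is a sketch built only from the figures quoted above and is not anything Cohesity ships. The name check_disks is made up, binary units are assumed, and the 512GB minimum for the large Data disk is an assumption since the text only gives the upper bound. It checks sizes only and does not look at how vSphere classifies the datastore.

```python
TB_IN_GB = 1024  # assumption: binary units; the post just says "GB" and "TB"

# (minimum, maximum) sizes in GB, taken from the figures quoted above.
# The 512GB lower bound for the large Data disk is an assumption.
LIMITS_GB = {
    "small": {"metadata": (512, 1 * TB_IN_GB), "data": (512, 8 * TB_IN_GB)},
    "large": {"metadata": (512, 1 * TB_IN_GB), "data": (512, 16 * TB_IN_GB)},
}


def check_disks(config, metadata_gb, data_gb):
    """Return a list of problems; an empty list means the sizes are in range."""
    limits = LIMITS_GB[config]
    problems = []
    for role, size in (("metadata", metadata_gb), ("data", data_gb)):
        low, high = limits[role]
        if not low <= size <= high:
            problems.append(
                f"{role} disk is {size}GB; expected {low}GB to {high}GB "
                f"for the {config} configuration"
            )
    return problems


# A 512GB Metadata disk with a 3TB Data disk on the small configuration.
lab = check_disks("small", metadata_gb=512, data_gb=3 * TB_IN_GB)

# A 12TB Data disk is too big for small, but fits within large.
too_big = check_disks("small", metadata_gb=512, data_gb=12 * TB_IN_GB)
print(lab, too_big)
```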

I did discover a weird issue when deploying the appliance on a Pure Storage FA-450 array in the lab. In vSphere, this particular array’s datastore type is identified as “Flash”. For my testing I had a 512GB Metadata disk and 3TB Data disk configured on the same datastore, with the three nodes living on three different datastores on the FlashArray. This caused errors with the cluster configuration, with the configuration wizard complaining that my SSD volumes were too big.

I moved the Data disk (with Storage vMotion) to an all-flash Nimble array (that for some reason was identified by vSphere as “HDD”) and the problem disappeared. Interestingly, I didn’t have this problem with the single node configuration of 6.0.1 deployed with the same configuration. I raised a ticket with Cohesity support and they got back to me stating that this was expected behaviour in 6.1.0a. They tell me, however, that they’ve modified the behaviour of the configuration routine in an upcoming version so fools like me can run virtualised secondary storage on primary storage.

Erasure Coding

You can configure the appliance for increased resiliency at the Storage Domain level as well. If you go to Platform – Cluster – Storage Domains you can modify the DefaultStorageDomain (and other ones that you may have created). Depending on the size of the cluster you’ve deployed, you can choose the number of failures to tolerate and whether or not you want erasure coding enabled.

You can also decide whether you want EC to be a post-process activity or something that happens inline.

Process

Once you’ve deployed (a minimum of) 3 copies of the Clustered VE, you’ll need to manually add Metadata and Data disks to each VM. The specifications for these are listed above. Fire up the VMs and go to the IP of one of the nodes. You’ll need to log in as the admin user with the appropriate password and you can then start the cluster configuration.

This bit is pretty much the same as any Cohesity cluster deployment, and you’ll need to specify things like a hostname for the cluster partition. As always, it’s a good idea to ensure your DNS records are up to date. You can get away with using IP addresses but, frankly, people will talk about you behind your back if you do.

At this point you can also decide to enable encryption at the cluster level. If you decide not to enable it now, you can enable it on a per-Domain basis later.

Click on Create Cluster and you should see something like the following screen.

Once the cluster is created, you can hit the virtual IP you’ve configured, or any one of the attached nodes, to log in to the cluster. Once you log in, you’ll need to agree to the EULA and enter a license key.

Thoughts

The availability of virtual appliance versions for storage and data protection solutions isn’t a new idea, but it’s certainly one I’m a big fan of. These things give me an opportunity to test new code releases in a controlled environment before pushing updates into my production environment. They can help with validating different replication topologies quickly, and validating other configuration ideas before putting them into the wild (or in front of customers). Of course, the performance may not be up to scratch for some larger environments, but for smaller deployments and edge or remote office solutions, you’re only limited by the available host resources (which can be substantial in a lot of cases).

The addition of a clustered version of the virtual edition for ESXi and Hyper-V is a welcome sight for those of us still deploying on-premises Cohesity solutions (I think the Azure version has been clustered for a few revisions now). It gets around the main issue of resiliency by having multiple copies running, and can also address some of the performance concerns associated with running virtual versions of the appliance. There are a number of reasons why it may not be the right solution for you, and you should work with your Cohesity team to size any solution to fit your environment. But if you’re running Cohesity in your environment already, talk to your account team about how you can leverage the virtual edition. It really is pretty neat. I’ll be looking into the resiliency of the solution in the near future and will hopefully be able to post my findings in the next few weeks.

Random Short Take #9

Here are a few links to some random news items and other content that I found interesting. You might find it interesting too. Maybe.