I frequently get briefing invitations from various companies focused on storage and data centre infrastructure. Sometimes their product isn’t directly related to things I might write about, but I like to take these briefings when I can because it gives me something new to learn. Whilst infrastructure and application monitoring plays a big part in the data centre, it’s not something I write about with any great frequency. All this is a long way of saying that I took a briefing with Uila recently and was pleasantly surprised.
What’s a Uila Then?
Pronounced “wee-luh”, Uila is focused on full-stack visibility. They aim to provide you with the ability to:
Troubleshoot complex issues to root cause quickly
Monitor end user, application performance, availability and infrastructure health
Perform planning, optimisation and issue prevention
The cool thing is they don’t just focus on virtualised workloads.
[image courtesy of Uila]
Application and Network Intelligence
The key to Uila is the network-centric approach to monitoring. This is done primarily via the Virtual Smart Traffic Taps (vST):
Distributed VMs (vST) sniff packets from (D)vSwitch
Deep packet inspection
Network performance and flow analysis
4000+ application identification and metadata analysis
Application transaction response time & volume tracking
Compute & Storage & OS Process Intelligence
The Virtual Information Controller (vIC) takes care of all the integration pieces, offering:
API integration with cloud virtualisation system;
SNMP integration with network switches;
SSH & WMI API integration with application server; and
Service availability monitoring via active tests.
Management & Analytics System
There’s also a “Management Analytics System” available as either a SaaS offering from Uila or on-premises. It offers:
Scalability & Redundancy with Hadoop/Hbase;
Full stack correlation for root cause identification; and
An analytics visualisation engine.
IT is Hard
IT operations can be hard at the best of times. At any given time in all but the most mature infrastructure organisations something is on fire. Sometimes literally. Understanding where to look for the problems is difficult. It’s also difficult to identify the root cause of these issues in a fast and efficient manner. The first reaction is often to treat the symptom, not the cause. Another thing I’ve noticed is that the various silos of support staff (storage, virtualisation, OS support, network, security, etc.) all like to use their own tools to do their troubleshooting. I once worked in a place that had deployed 4 or 5 monitoring platforms in various states of usefulness. When there was a problem it took hours just to get everyone to look in the same place at the issue.
As much as I’m reluctant to trust a lot of what networking folks say, I think Uila’s approach to monitoring and root cause analysis is a smart one. This isn’t the nineties, and networks are everywhere in your enterprise nowadays. Why not leverage that pervasiveness and get a real feel for what exactly is going on in your environment? But it’s not just about collecting data, it’s about what you do with that data. And this is where I think Uila shines, based on the demonstration I saw and what I’ve read thus far. Having a bunch of data at hand is great, but oftentimes we need to get to the root cause of the problem to understand what’s really happening (and how to fix the problem). Uila are heavily focused on making this a quick and easy process. I’m looking forward to looking at their offering some more in the future (when I get my act together and put the lab back into production).
This is part 3 of a series where I will go into a little more detail on how you use HeatMap Analyzer. In this article I will look at the different sorts of charts that can be drawn using the Analyzer component.
This is a standard Line chart. Line style charts (Line, Stacked and Line by Day) all comprise two sections: the main chart (upper) and the control section (lower). The control section allows you to zoom and pan the chart. If, for example, you are viewing several months’ worth of performance data (as above) then analysing specific points in time becomes quite difficult. Dragging the sliders in the control section closer together will allow you to view the data in greater detail.
Hovering over a data point will display a tool-tip with the details of that point.
Line by Day Chart
The Line by Day chart breaks down a standard Line chart and overlays each day. This can be useful in finding time related trends in data sets. For example, looking at a particular SP’s utilization over the past 3 weeks might look something like this:
Trying to work out time-based trends in this data is visually difficult. Charting it as a Line by Day chart, however, lets you analyze the data more intuitively. You could draw the following conclusions:
1. Utilization over the weekend is generally lower.
2. There is consistently a peak in utilization at 5 am.
3. The highest peak during business hours is at 9 am.
4. Utilization of the array begins to increase from around 6 pm.
Bear in mind that charting multiple objects in the one chart will potentially make analysis more difficult. Also, while tool tips are shown when you hover over a data point, only the time in the tool tip is correct: the date of every point is converted back to 1st January 2000.
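The overlay idea behind a Line by Day chart can be sketched in a few lines of Python. This is only an illustration and not HMA's own code: the function name `line_by_day` and the sample data are made up, and the only detail taken from the article is that every point ends up on 1st January 2000 while keeping its time of day.

```python
from collections import defaultdict
from datetime import datetime

# Reference date that every point is moved onto (the date seen in the tool tips).
REFERENCE_YEAR, REFERENCE_MONTH, REFERENCE_DAY = 2000, 1, 1


def line_by_day(samples):
    """Split (timestamp, value) samples into one series per calendar day.

    Each timestamp keeps its time of day but is moved onto the reference
    date, so all days can share a single 24-hour x-axis.
    """
    series = defaultdict(list)
    for timestamp, value in samples:
        overlaid = timestamp.replace(
            year=REFERENCE_YEAR, month=REFERENCE_MONTH, day=REFERENCE_DAY
        )
        series[timestamp.date()].append((overlaid, value))
    return dict(series)


# Hypothetical SP utilization samples taken at 5 am on two consecutive days.
samples = [
    (datetime(2013, 3, 4, 5, 0), 80.0),
    (datetime(2013, 3, 5, 5, 0), 78.0),
]
for day, points in line_by_day(samples).items():
    print(day, points)
```

Each day becomes its own line, which is why a recurring 5 am peak lines up vertically and is easy to spot.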
A stacked chart aggregates the data points for the selected objects, and shows the total of the attribute charted for a particular time. Only certain types of Attributes can be charted in a Stacked chart (and Pie charts). For example it doesn’t make a lot of sense to draw a Stacked chart of SP utilization. However, where the unit of measurement is “aggregatable” like Total Bandwidth (measured in MB/s) or Total Throughput (measured in IOPS), it can be charted this way.
If you really want to be able to chart these other unit types using Stacked charts, you can modify the unit’s aggregatable flag in the Configuration tab under the “Attribute Units” section by removing the unit and re-adding it with the aggregatable flag ticked. This should enable this chart type for those attributes.
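To make the aggregation concrete, here is a minimal Python sketch of what a stacked total looks like. It is an illustration only: `stack_series` and the sample numbers are hypothetical, and it simply sums the values of every selected object at each timestamp.

```python
from collections import defaultdict


def stack_series(series_by_object):
    """Sum the values of all objects at each timestamp.

    series_by_object maps an object name to a list of (timestamp, value)
    pairs. The result maps each timestamp to the total across objects.
    """
    totals = defaultdict(float)
    for points in series_by_object.values():
        for timestamp, value in points:
            totals[timestamp] += value
    return dict(sorted(totals.items()))


# Hypothetical Total Bandwidth (MB/s) for two LUNs, sampled every 30 minutes.
bandwidth = {
    "LUN 1": [(0, 10.0), (1800, 20.0)],
    "LUN 2": [(0, 5.0), (1800, 2.5)],
}
print(stack_series(bandwidth))
```

The same sum applied to a percentage such as SP utilization would happily report two SPs at 60% each as 120%, which is why only aggregatable units are offered for this chart type.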
When selecting a Pie Chart, you are unable to select which objects are to be drawn (all objects are included in the chart calculations). Specifying Max Slices will limit the chart to display n-1 objects. The objects that are shown will be ordered largest to smallest, and all other objects will be combined under the “Other” slice.
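The Max Slices behaviour can be sketched roughly as follows. This is not the tool's code: `pie_slices` and the sample totals are hypothetical, and the sketch only follows the rule described here (the largest n-1 objects are kept and everything else is folded into “Other”).

```python
def pie_slices(totals, max_slices):
    """Return at most max_slices (name, value) pairs, largest first.

    When there are more than max_slices - 1 objects, the largest
    max_slices - 1 are kept and the rest are combined under "Other".
    """
    if max_slices < 1:
        raise ValueError("max_slices must be at least 1")
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < max_slices:
        return ranked
    kept = ranked[: max_slices - 1]
    other = sum(value for _, value in ranked[max_slices - 1 :])
    return kept + [("Other", other)]


# Hypothetical throughput totals for four LUNs, limited to three slices.
print(pie_slices({"LUN 1": 50, "LUN 2": 30, "LUN 3": 15, "LUN 4": 5}, 3))
```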
A distribution chart allows you to determine the percentage of data sets that fall within particular bands. The above chart shows that 46% of SP A’s response times are in the 0-3 ms band, whereas 45% of SP B’s response times are in the 6-9 ms band. When drawing a Distribution chart, leaving the Minimum and Maximum blank lets the script determine a range that contains the entire data set. Setting the Minimum and Maximum changes the range that is charted, and changing the Intervals value changes the number of bands displayed (10 by default).
Changing the Minimum to 0 and the Maximum to 20 would result in the above chart, breaking the distribution down into smaller bands.
This option displays a text table of the Attribute and Objects that you have selected. You can select all the data and paste it into your favorite analytic tool.
Connectivity charts show the hierarchical connectivity of the array (currently this chart is only available for CLARiiON arrays):
Front End Ports → Storage Processor → LUN → MetaLUN Component → RAID Group / Pool → Private RAID Group → Disk
Modifying the depth field will change the depth to which the chart is drawn. Modifying the Attribute field will change the Attribute that is represented on the chart (Connectivity is a special attribute that weights each link the same). Link size is relative to the average of the selected attribute for the particular object. Where the object does not expose that attribute (e.g. Ports don’t expose Utilization), a thin link is shown.
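If you want to reproduce this kind of banding on data you have exported yourself, a rough Python sketch looks like this. It is not the script's own algorithm: `distribution` is a hypothetical helper, and how out-of-range values are treated (ignored, but still counted in the percentage base) is my assumption.

```python
def distribution(values, minimum=None, maximum=None, intervals=10):
    """Return (band_start, band_end, percent) for equal-width bands.

    A blank (None) minimum or maximum falls back to the range of the data.
    Assumption: values outside the chosen range are not placed in any band
    but still count towards the total used for the percentages.
    """
    if not values:
        return []
    low = min(values) if minimum is None else minimum
    high = max(values) if maximum is None else maximum
    if high <= low:
        raise ValueError("maximum must be greater than minimum")
    width = (high - low) / intervals
    counts = [0] * intervals
    for value in values:
        if value < low or value > high:
            continue
        # A value equal to the maximum belongs to the last band.
        index = min(int((value - low) / width), intervals - 1)
        counts[index] += 1
    total = len(values)
    return [
        (low + i * width, low + (i + 1) * width, 100.0 * count / total)
        for i, count in enumerate(counts)
    ]


# Hypothetical response times (ms), charted from 0 to 20 in four bands.
response_times = [1, 2, 4, 7, 8, 8, 10, 13, 16, 19]
for start, end, percent in distribution(response_times, 0, 20, 4):
    print(f"{start:g}-{end:g} ms: {percent:g}%")
```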
NOTE: Arrays with more complex configurations and larger number of disks don’t necessarily display very clearly in this format.
In Part 4 we will look at the Configuration tab and how to automate NAR file collection.
This is part 2 of a series where I will go into a little more detail on how you use HeatMap Analyzer.
Being able to filter and order objects can be useful, especially when there are a lot of them and you are trying to find where a problem may be originating.
Expanding the Filter option will display the following window.
This can be broken into four main sections. These sections are effectively ANDed together when the filter is executed.
Filter by Attribute: If this section is enabled it allows you to retrieve the top|bottom “n” objects ordered by avg|min|max attribute.
Object Type: For objects that expose a “Type” attribute (currently this is only LUNs and Disks), this option allows you to show only specific Object Types. For example, if you only wanted to view Public RAID Group LUNs and Public Pool LUNs then selecting these options will filter out all other types of LUNs. NOTE: only object types that are present in the array will be available in the list, and the label that particular Object Types are given may vary depending on the version of Flare code.
Include: This option allows you to include objects based on whether the extended data for that object contains the text entered in this field. This may include Name, Owner, Type, Host (depending on what extended data is available for that object). This can be a comma separated list and is case insensitive. NOTE: leading or trailing spaces will be included in the search.
Exclude: This option allows you to exclude objects based on whether the extended data for that object contains the text entered in this field. This may include Name, Owner, Type, Host (depending on what extended data is available for that object). This can be a comma separated list and is case insensitive. NOTE: leading or trailing spaces will be included in the search.
So the following filter selections would display the top 4 Public Pool LUNs by average Queue Length where the extended information includes SP A and excludes Hypervisor.
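The way the four filter sections combine can be sketched in Python. This is only an approximation of the behaviour described above, not HMA's implementation: the `apply_filter` function, the dictionary layout and the sample LUNs are all hypothetical. I have also assumed that any one Include term is enough to match, and that the top “n” is taken after the other three sections have been applied.

```python
def apply_filter(objects, top_n=None, stat=None, object_types=None,
                 include="", exclude=""):
    """Apply Object Type, Include, Exclude and top-n filters together."""

    def terms(text):
        # Comma separated and case insensitive. Leading or trailing spaces
        # are deliberately kept, as they are part of the search text.
        return [term.lower() for term in text.split(",") if term != ""]

    include_terms = terms(include)
    exclude_terms = terms(exclude)
    selected = []
    for obj in objects:
        extended = obj["extended"].lower()
        if object_types and obj["type"] not in object_types:
            continue
        if include_terms and not any(t in extended for t in include_terms):
            continue
        if any(t in extended for t in exclude_terms):
            continue
        selected.append(obj)
    if top_n is not None:
        selected.sort(key=lambda obj: obj[stat], reverse=True)
        selected = selected[:top_n]
    return selected


# Hypothetical LUNs with extended data and an average Queue Length.
luns = [
    {"name": "LUN 1", "type": "Public Pool LUN", "extended": "LUN 1 SP A host1", "avg_queue_length": 4.0},
    {"name": "LUN 2", "type": "Public Pool LUN", "extended": "LUN 2 SP B host2", "avg_queue_length": 9.0},
    {"name": "LUN 3", "type": "Public Pool LUN", "extended": "LUN 3 SP A Hypervisor", "avg_queue_length": 7.0},
    {"name": "LUN 4", "type": "Private RAID Group LUN", "extended": "LUN 4 SP A host3", "avg_queue_length": 8.0},
    {"name": "LUN 5", "type": "Public Pool LUN", "extended": "LUN 5 SP A host4", "avg_queue_length": 6.0},
]
matches = apply_filter(luns, top_n=4, stat="avg_queue_length",
                       object_types={"Public Pool LUN"},
                       include="SP A", exclude="Hypervisor")
print([lun["name"] for lun in matches])
```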
Once you have filtered the object list you are able to chart the objects against any of the attributes that are exposed by that object type.
In Part 3 we will look at the different chart types and what they may be used for.
Mat has agreed to do some posts on the basics of using the EMC HMA. In this episode, he’s looking at manually loading NAR files and other cool stuff. Enjoy.
This is part 1 of a series where I will go into a little more detail on how you use HeatMap Analyzer. In this article we go through the process of manually loading a NAR file, basic charting and HeatMaps.
At this point you should have the HeatMap Analyzer appliance configured and you may be wondering what you can begin to do with it. I wrote the original HeatMap tool so I could visualize what was occurring over an entire array or parts thereof at any one point in time. HMA is the evolution of that script. It now allows you to look closer at individual components of the array. While this is something that can be done with the current EMC Analyzer tool set, that tool set is limited by the amount of data that you can load at any one time.
Manually Load NAR File
Point your browser at the appliance http://18.104.22.168/HMA.html and go to the Configuration tab.
Expand the “Manual Load NAR Files” section and select the “Choose Files” button. Here you can select one or more NAR files from a single array or from multiple arrays. Depending on the number of “objects” in the NAR file (disks, LUNs, Pools, Ports, etc), processing of each NAR file may take a reasonable amount of time. As a rough guide, for every 100 objects in the NAR file it will take about 30 seconds to process (assuming the standard 300 data points for each object). Some of my CX4-960 arrays that have 3000-odd objects take 10-15 minutes to process each file. NOTE: if you are processing large NAR files or lots of smaller NAR files then the browser may currently time out waiting for the processing to finish. If this is the case you can monitor the process through the Server Status tab or from the command line of the appliance (see below).
Note: you should let all processing complete before you begin to draw HeatMaps or charts under the Analyzer tab. If you don’t, you may get some unexpected results due to the non-granular locking that SQLite uses while updating the database.
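The rough processing guide (about 30 seconds per 100 objects at 300 data points each) is easy to turn into a quick estimate. The helper below is hypothetical, and the linear scaling with the number of data points is my assumption rather than something that has been measured.

```python
SECONDS_PER_100_OBJECTS = 30   # rough guide quoted for the appliance
STANDARD_DATA_POINTS = 300     # data points per object assumed by that guide


def estimate_processing_seconds(object_count, points_per_object=STANDARD_DATA_POINTS):
    """Estimate NAR processing time, assuming it scales linearly."""
    scale = points_per_object / STANDARD_DATA_POINTS
    return object_count / 100 * SECONDS_PER_100_OBJECTS * scale


# A CX4-960 with around 3000 objects.
seconds = estimate_processing_seconds(3000)
print(f"about {seconds / 60:.0f} minutes")
```

For 3000 objects this works out to 900 seconds, or about 15 minutes, which is at the top of the 10-15 minute range seen in practice.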
Monitoring the NAR file load process
If you want to monitor the processes that are working on your NAR file(s) there are some tools available on the Server Status tab.
For basic process monitoring expand the “Process Monitoring” section, and tick the “Auto Update” option. If you are currently processing a NAR file you will see the ProcessNAR.pl script running.
For more detailed monitoring you can view the logs created by each script. On the Server Status tab expand the “Logs Viewer” section, select the log file that you want to monitor and select “Auto Update” and “Auto Scroll”. You may want to combine this with the basic process monitoring to get a better picture of what is occurring.
Note: it’s worthwhile turning off Auto Update after you have finished using these tools, as leaving it running will place unnecessary load on the appliance.
To draw HeatMaps for a particular array go to the HeatMaps tab.
Array: If you have multiple arrays you can select the array that you want to draw a HeatMap for.
Array Type: This gives a description of the array type (currently only CLARiiON and Data Domain types are supported).
Start/End: This is the Start and End date of the HeatMap. NOTE: Selecting a long duration for the HeatMap may take a while to process, and it is not recommended that you draw a HeatMap that spans more than several days without increasing the Average Interval.
Avg Interval: The number of seconds between each chart point.
Object List: This is a list of objects that you can select attributes from to chart, and may include:
All LUNs / Storage Processors / Disks / RAID Group / Port / Thin Pool / CPU / Asynchronous Mirror / Snap Session.
The objects that are shown will depend on what you have configured on your array (if you don’t have any Asynchronous Mirrors defined then you won’t see this option).
Selecting one of these drop down lists will show a list of attributes that can be charted for this object.
Select an attribute and it will be added to the list of object/attributes that will be drawn in the HeatMap. Select the object / attribute combinations that you want to display in the HeatMap and then press the “Draw HeatMap” button. The HeatMap configuration pane can be hidden by clicking on the arrow in the top left corner.
The way the HeatMap works is by defining a minimum, a maximum and a skew for each object and attribute. These can be modified in the Configuration Tab under the HeatMap Attribute Normalization.
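The article doesn't spell out how the minimum, maximum and skew are combined, so the following Python sketch is a guess at one common approach rather than a description of HMA itself. The `normalise` function is hypothetical, and treating the skew as an exponent applied to the 0-1 fraction is purely my assumption.

```python
def normalise(value, minimum, maximum, skew=1.0):
    """Map a raw attribute value onto a 0.0-1.0 "heat" scale.

    Values are clamped to the [minimum, maximum] range first.
    Assumption: skew is an exponent, so a skew below 1 makes low values
    look hotter and a skew above 1 makes them look cooler.
    """
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    fraction = (value - minimum) / (maximum - minimum)
    fraction = max(0.0, min(1.0, fraction))
    return fraction ** skew


# Hypothetical utilization of 25% on a 0-100 scale, with and without skew.
print(normalise(25, 0, 100))
print(normalise(25, 0, 100, skew=0.5))
```

Whatever the real formula is, the point of normalising is the same: attributes with very different units (%, ms, IOPS) can share one colour scale.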
Analyzing the Data
For more detailed analysis of individual objects in the array go to the Analyzer tab. On this tab you will initially see three panes. The first left hand pane allows you to configure and draw charts for various objects, the second upper right pane gives global details for the array and the third lower right pane gives you details for the selected object.
Array: If you have multiple arrays you can select the array that you want to draw a chart for.
Array Type: This gives a description of the array type (currently only CLARiiON and Data Domain types are supported).
Object Type: The type of object to chart (i.e. Storage Processor, LUN, Disk, etc)
Start/End: This is the Start and End date of the chart.
Filter: Filter options that allow you to include / exclude objects based on set criteria (I’ll go into more detail about this later).
Object: The object(s) that you want to chart.
Attribute: The metric that you want to chart for the object. Each object type exposes a different set of attributes.
Chart dependent options:
Line Charts, Stacked Charts and Tables
Avg Interval: The number of seconds to average the data points out to. Leaving this as zero or blank will process the data as is.
Max Slices: The Maximum number of slices to draw in a Pie chart. If there are more than “n-1” objects then extra objects will be grouped under “Other”.
Minimum: The minimum number to chart.
Maximum: The maximum number to chart.
Intervals: The number of intervals to chart.
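As an illustration of what the Avg Interval option does, here is a small Python sketch that averages samples into fixed-width time buckets. It is not taken from HMA: `average_interval` is a hypothetical helper, timestamps are assumed to be in epoch seconds, and aligning buckets to multiples of the interval is my assumption.

```python
from collections import defaultdict


def average_interval(samples, interval_seconds):
    """Average (epoch_seconds, value) samples into fixed-width buckets.

    An interval of zero or None returns the data as is.
    """
    if not interval_seconds:
        return list(samples)
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - timestamp % interval_seconds
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Hypothetical 5-minute samples averaged out to 10-minute points.
samples = [(0, 10.0), (300, 20.0), (600, 30.0), (900, 50.0)]
print(average_interval(samples, 600))
```

Averaging like this is what keeps long date ranges manageable, at the cost of flattening short peaks.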
Line: A standard line chart. Multiple objects can be selected and charted against one another.
Line by Day: A standard line chart where each day is overlaid on the others. While you can select multiple objects, it’s not recommended as it makes the chart difficult to interpret.
Stacked: A stacked line chart. This option is available where the attribute being graphed is able to be aggregated (i.e. MB, KB or GB rather than % Utilization). Again, multiple objects can be selected.
Pie: A pie chart, this will draw a pie chart for all of the objects for the selected “Object Type”.
Distribution: A bar chart showing the % of values that fall between certain ranges.
Table: A text table of the data.
Connectivity: A Connectivity / Sankey chart. This shows how each of the objects in the array is connected. It is only available for CLARiiON devices.
Once you have selected the Object Type, Object(s) and Attribute that you want to chart, click the “Graph” button and a chart will be drawn for that selection.
Depending on the type of chart drawn you may have various controls that allow you to explore the data more closely. Line style charts have the slider control at the bottom, which allows you to focus on a subset of the data. Tables allow you to select all the data so it can be copied. Connectivity charts allow you to control the depth of the chart and the attribute that controls the weight of each of the connections. All chart types except Tables allow you to save the chart as a JPEG.
In the next article I will go into more detail regarding Filtering and the different Chart Types.
It’s alive. Mat has been coding like crazy and enhancing the HeatMap script and turning it into like, an appliance kind of thing. You can grab it from the Utilities page and it comes in two parts – the core code and third-party scripts package. While the combined package size is small, it saves redistributing stuff that hasn’t changed. In any case, download it, give it a spin and let us know your thoughts. Obviously, it’s still a bit ugly, and still a bit version 0.1, but that’s what you get for free. Tell your friends.
Mat’s been doing some useful scripting again. This time it’s a small Perl script that identifies the allocation owner and default owner of a pool LUN on a CX4 or VNX and lets you know whether the LUN is “non-optimal” or not. For those of you playing along at home, I found the following information on this (but can’t remember where I found it). “Allocation owner of a pool LUN is the SP that owns and maintains the metadata for that LUN. It is not advised to trespass the LUNs to an SP that is not the allocation owner. This introduces lag. The SP that provides the best performance for the pool LUN. The allocation owner SP is set by the system to match the default SP owner when you create the LUN. You cannot change the allocation owner after the LUN is created. If you change the default owner for the LUN, the software will display a warning that a performance penalty will occur if you continue.”
There’s a useful article by Jithin Nadukandathil on the ECN site, as well as a most excellent writeup by fellow EMC Elect member Jon Klaus here. In short, if you identify NonOptimal LUN ownership, your best option is to create a new LUN and migrate the data to that LUN via the LUN Migration tool. You can download a copy of the script here. Feel free to look at the other scripts that are on offer as well. Here’s what the output looks like.
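The check itself boils down to comparing two owners. The Python sketch below is not Mat's script and doesn't talk to an array. `is_non_optimal` and the sample LUN list are hypothetical, and I'm assuming “non-optimal” simply means the default owner no longer matches the allocation owner.

```python
def is_non_optimal(allocation_owner, default_owner):
    """Return True when a pool LUN's default owner is not its allocation owner."""
    return allocation_owner.strip().upper() != default_owner.strip().upper()


# Hypothetical pool LUNs as (name, allocation owner, default owner).
pool_luns = [
    ("LUN 10", "SP A", "SP A"),
    ("LUN 11", "SP A", "SP B"),
]
for name, allocation_owner, default_owner in pool_luns:
    state = "non-optimal" if is_non_optimal(allocation_owner, default_owner) else "optimal"
    print(f"{name}: allocation={allocation_owner} default={default_owner} -> {state}")
```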
Mat has updated the DIY Heatmaps for EMC CLARiiON and VNX arrays to version 4.01. You can get it from the Utilities page. Any and all feedback welcome.
Updates and Changes to the script
Add database storage / retrieval for performance stats. The database size will be approximately 2.1 x the size of the NAR file based on the default interval of 30 minutes. On my PC it took a bit over 9 hours to process 64 NAR files into a database. The NAR files totalled 1.95GB and the resulting database was 4.18GB. However, running the script over the database to produce a heatmap only takes seconds.
Changed to use temporary tables for transitional data. This should slightly reduce the size of the database file, as the temporary data is not written to disk.
Changed the way the script processes multiple NAR files. The script previously bunched all NAR files into a single naviseccli process, which was problematic if you were processing multiple large NAR files. The script now processes them one at a time.
Add command line options:
--output_db Output the processed NAR file to the nominated database
--input_db Use the nominated database as the source of data for the heatmap
--s_date Specify a start date/time in the format “mm/dd/yyyy hh:mm:ss” (with quotes if specifying both date and time)
--e_date Specify an end date/time
--retrieve_all_nar When retrieving NAR files from the array, you can now retrieve all NAR files (it won’t overwrite files already downloaded)
--process_only_new If you are downloading NAR files, only process files that haven’t been downloaded previously
--max_nar_files Set the maximum number of files to download and process
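If you are generating values for the s_date and e_date options from another script, it can help to validate them first. The Python sketch below only checks the “mm/dd/yyyy hh:mm:ss” format described above. `parse_date_option` is a hypothetical helper, and accepting a date without a time is my reading of the note about quotes, not something confirmed by the script.

```python
from datetime import datetime

DATE_TIME_FORMAT = "%m/%d/%Y %H:%M:%S"  # mm/dd/yyyy hh:mm:ss
DATE_ONLY_FORMAT = "%m/%d/%Y"           # assumption: a date on its own is accepted


def parse_date_option(text):
    """Parse a start or end date/time string into a datetime."""
    for date_format in (DATE_TIME_FORMAT, DATE_ONLY_FORMAT):
        try:
            return datetime.strptime(text.strip(), date_format)
        except ValueError:
            continue
    raise ValueError(f"expected mm/dd/yyyy or mm/dd/yyyy hh:mm:ss, got {text!r}")


print(parse_date_option("03/04/2013 05:00:00"))
print(parse_date_option("03/04/2013"))
```

The quotes matter on the command line because the value contains a space. Without them the shell would pass the date and the time as two separate arguments.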
Please let us know if you find any bugs or problems with the script, or if you have any further suggestions for changes and enhancements.