[gpfsug-discuss] Aggregating filesystem performance

Oesterlin, Robert Robert.Oesterlin at nuance.com
Mon Jul 18 01:39:29 BST 2016


OK, after a bit of a delay due to a hectic travel week, here is some more information on my GPFS performance collection. At the bottom, I have links to my server and client zimon config files and a link to my presentation at SSUG Argonne in June. I didn't actually present it but included it in case there was interest.

I used to run a home-brew system of periodic calls to mmpmon to collect data, pushing the results into Kafka. This was a bit cumbersome, and when SS 4.2 arrived I switched over to the built-in performance sensors (zimon) to collect the data. IBM has an "as-is" bridge between Grafana and the Zimon collector that works reasonably well - they were supposed to release it, but it's been delayed. I will ask about it again and post more information if I get it.
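For anyone curious, the old home-brew approach boiled down to something like the rough sketch below - the path, the 10-second interval, and printing to stdout are placeholders for illustration only; the real version shipped the samples to Kafka.

#!/usr/bin/env python3
# Rough sketch of a periodic mmpmon poller: issue a parseable (-p) fs_io_s
# request and pull out the per-filesystem byte counters. Needs root on a
# GPFS node. Interval and output handling are placeholders, not my setup.
import subprocess
import time

MMPMON = "/usr/lpp/mmfs/bin/mmpmon"   # standard GPFS binary location
INTERVAL = 10                         # seconds between polls (placeholder)

def poll_fs_io():
    """Run one parseable fs_io_s request and return per-filesystem counters."""
    out = subprocess.run([MMPMON, "-p"], input="fs_io_s\n",
                         capture_output=True, text=True, check=True).stdout
    samples = []
    for line in out.splitlines():
        if not line.startswith("_fs_io_s_"):
            continue
        toks = line.split()
        # After the record tag, tokens alternate "_key_" / value.
        kv = dict(zip(toks[1::2], toks[2::2]))
        samples.append({
            "filesystem": kv.get("_fs_"),
            "bytes_read": int(kv.get("_br_", 0)),
            "bytes_written": int(kv.get("_bw_", 0)),
        })
    return samples

while True:
    for sample in poll_fs_io():
        print(sample)          # the real version pushed these into Kafka
    time.sleep(INTERVAL)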

My biggest struggle with the zimon configuration is the large memory requirement of the collector on large clusters (many clients, file systems, NSDs). I ended up deploying a six-collector federation with 16 GB per collector for my larger clusters - even then I have to limit the number of stats and the amount of time I retain them. IBM is aware of the memory issue and I believe they are looking at ways to reduce it.

As for what specific metrics I tend to look at (a rough query example follows the list):

gpfs_fis_bytes_read (written) - aggregated file system read and write stats
gpfs_nsdpool_bytes_read (written) - aggregated pool stats, as I have data and metadata split
gpfs_fs_tot_disk_wait_rd (wr) - NSD disk wait stats
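If you want to pull metrics like these without the Grafana bridge, the collector answers plain-text queries on its query port. Here's a minimal sketch - the host name is a placeholder, 9084 should be the default query port but check your ZIMonCollector.cfg, and the query syntax is from memory, so verify it against the performance monitoring docs.

#!/usr/bin/env python3
# Rough sketch: query a ZIMon collector directly over its text query port.
# Host, port, and query syntax are assumptions to be checked, not a tested
# recipe.
import socket

COLLECTOR = ("zimon-collector.example.com", 9084)   # placeholder host
QUERY = "get metrics gpfs_fis_bytes_read,gpfs_fis_bytes_written last 10 bucket_size 1\n"

def run_query(query):
    """Send one query and read until the terminating '.' line (or EOF)."""
    with socket.create_connection(COLLECTOR, timeout=10) as sock:
        sock.sendall(query.encode())
        buf = b""
        while not buf.endswith(b"\n.\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    return buf.decode()

print(run_query(QUERY))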

These give me a good overall picture of how things are going. I have a bunch of other, more detailed dashboards for individual file systems and clients that help me drill down. The built-in SS GUI is pretty good for small clusters, and it's getting some improvements in 4.2.1 that might make me take a closer look at it again.

I also look at the RPC waiters stats - not present in the 4.2.0 Grafana bridge, but I hear they are coming in 4.2.1.

My SSUG Argonne Presentation (I didn't talk due to time constraints): http://files.gpfsug.org/presentations/2016/anl-june/SSUG_Nuance_PerfTools.pdf

Zimon server config file: https://www.dropbox.com/s/gvtfhhqfpsknfnh/ZIMonSensors.cfg.server?dl=0
Zimon client config file: https://www.dropbox.com/s/k5i6rcnaco4vxu6/ZIMonSensors.cfg.client?dl=0


Bob Oesterlin
Sr Storage Engineer, Nuance HPC Grid


From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Brian Marshall <mimarsh2 at vt.edu>
Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: Wednesday, July 13, 2016 at 8:43 AM
To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Subject: [EXTERNAL] Re: [gpfsug-discuss] Aggregating filesystem performance (Oesterlin, Robert)

Robert,

1) Do you see any noticeable performance impact by running the performance monitoring?

2) Can you share the zimon configuration that you use? i.e. what metrics do you find most useful?

Thank you,
Brian Marshall