[gpfsug-discuss] Problem Determination
Simon Thompson (Research Computing - IT Services)
S.J.Thompson at bham.ac.uk
Fri Oct 2 17:58:41 BST 2015
I agree on docs, particularly on mmdiag; I think things like --lroc are not documented. I'm also not sure that --network always gives accurate network stats. (We were doing some HA failure testing where we have split-site IB fabrics, yet the network counters didn't change even when the local IB NSD servers were shut down.)
It would also be nice to have a set of Icinga/Nagios plugins from IBM, maybe in samples, which are updated on each release with checks for new features.
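As an illustration of the kind of plugin meant here, below is a minimal sketch of a Nagios-style check that maps the node-state table printed by `mmgetstate -a` to Nagios exit codes (0=OK, 1=WARNING, 2=CRITICAL). The table layout assumed in the comments is approximate, not taken from IBM documentation, so treat the parsing as a starting point only.

```python
def check_gpfs_state(mmgetstate_output):
    """Return (nagios_exit_code, message) from `mmgetstate -a` output.

    Assumed data rows (format is an assumption, verify on your release):
        <node number>  <node name>  <GPFS state>
    Header and separator lines are skipped because their first field
    is not numeric.
    """
    down, arbitrating, active = [], [], []
    for line in mmgetstate_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit():
            node, state = fields[1], fields[2]
            if state == "active":
                active.append(node)
            elif state == "arbitrating":
                arbitrating.append(node)
            else:  # down, unknown, etc.
                down.append(node)
    if down:
        return 2, "CRITICAL: nodes not active: %s" % ", ".join(down)
    if arbitrating:
        return 1, "WARNING: nodes arbitrating: %s" % ", ".join(arbitrating)
    return 0, "OK: %d nodes active" % len(active)
```

A wrapper script would feed it the output of `subprocess.check_output(["mmgetstate", "-a"])` and `sys.exit()` with the returned code; an IBM-maintained version shipped in samples could track state names across releases.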
And not problem determination, but I'd really like to see an in-flight, non-disruptive upgrade path. Particularly as we run VMs off GPFS, it's not always practical or possible to move VMs, so it would be nice to have in-flight upgrades (not suggesting this would be a quick thing to implement).
From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Oesterlin, Robert [Robert.Oesterlin at nuance.com]
Sent: 01 October 2015 13:39
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Problem Determination
I was going to mail you directly, but this may help spark some discussion in this area. GPFS (pardon the use of the “old school” term; you need something easier to type than Spectrum Scale) problem determination is one of those areas that is (sometimes) more of an art than a science. IBM publishes a PD guide, and it’s a good start, but it doesn’t cover all the bases.
- In the GPFS log (/var/mmfs/gen/mmfslog) there are a lot of messages generated. I continue to come across ones that are not documented – or documented poorly. EVERYTHING that ends up in ANY log needs to be documented.
- The PD guide gives some basic things to look at for many of the error messages, but doesn’t go into alternative explanations for many errors. Example: when a node gets expelled, the PD guide tells you it’s a communication issue, when in fact it may be related to other things like Linux network tuning. Covering all the possible causes is hard, but you can improve this.
- GPFS waiter information: understanding and analyzing this is key to getting to the bottom of many problems, yet the waiter information is not well documented. You should include at least a basic guide on how to use waiter information in determining cluster problems. Related: undocumented config options. You can come across some by doing “mmdiag --config”. Using some of these can help you, or get you in trouble in the long run. If I can see the option, document it.
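To make the waiter point concrete, here is a small sketch that pulls long-running waiters out of `mmdiag --waiters` output. The line format in the regex (duration, thread name, quoted reason) is an assumption based on typical output and varies between releases, so it is illustrative rather than definitive.

```python
import re

# Assumed shape of one `mmdiag --waiters` line (hypothetical, varies by release):
#   0x7F21D8002C30 waiting 12.003 seconds, NSDThread: on ThCond 0x7F21D8002D08
#       (MsgRecordCondvar), reason 'RPC wait'
WAITER_RE = re.compile(
    r"waiting (?P<secs>[\d.]+) seconds, (?P<thread>\S+):.*reason '(?P<reason>[^']*)'"
)

def long_waiters(mmdiag_output, threshold=10.0):
    """Return (seconds, thread, reason) tuples for waiters at or over
    threshold seconds, longest first. Lines that don't match are skipped."""
    found = []
    for line in mmdiag_output.splitlines():
        m = WAITER_RE.search(line)
        if m and float(m.group("secs")) >= threshold:
            found.append(
                (float(m.group("secs")), m.group("thread"), m.group("reason"))
            )
    return sorted(found, reverse=True)
```

Sorting the surviving waiters by age and grouping them by reason string is often the quickest way to spot the one slow component everything else is queued behind.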
- Make sure that all information I might come across online is accurate, especially on those sites managed by IBM. The Developerworks wiki has great information, but there is a lot of information out there that’s out of date or inaccurate. This leads to confusion.
- The automatic deadlock detection implemented in 4.1 can be useful, but it can also be problematic when a large cluster gets into trouble: firing off traces and taking dumps in an automated manner can cause more problems than it solves. I ended up turning it off.
- GPFS doesn’t have anything set up to alert you when conditions occur that may require your attention. There are some alerting capabilities that you can customize, but something out of the box would be useful. I know there is work going on in this area.
mmces - I did some early testing on this but haven’t had a chance to upgrade my protocol nodes to the new level. Upgrading 1000s of nodes across many clusters is challenging :-) The newer commands are a great start. I like the ability to list out events related to a particular protocol.
I could go on… Feel free to contact me directly for a more detailed discussion: robert.oesterlin @ nuance.com
Sr Storage Engineer, Nuance Communications
From: <gpfsug-discuss-bounces at gpfsug.org<mailto:gpfsug-discuss-bounces at gpfsug.org>> on behalf of Patrick Byrne
Reply-To: gpfsug main discussion list
Date: Thursday, October 1, 2015 at 5:09 AM
To: "gpfsug-discuss at gpfsug.org<mailto:gpfsug-discuss at gpfsug.org>"
Subject: [gpfsug-discuss] Problem Determination
As I'm sure some of you are aware, problem determination is an area where we are looking to make significant improvements over the coming releases of Spectrum Scale. To help us target the areas we work to improve, and to make it as useful as possible, I am trying to get as much feedback as I can about the different problems users have and how people go about solving them.
I am interested in hearing everything from day to day annoyances to problems that have caused major frustration in trying to track down the root cause. Where possible it would be great to hear how the problems were dealt with as well, so that others can benefit from your experience. Feel free to reply to the mailing list - maybe others have seen similar problems and could provide tips for the future - or to me directly if you'd prefer (patbyrne at uk.ibm.com<mailto:patbyrne at uk.ibm.com>).
On a related note, in 4.1.1 there was a component added that monitors the state of the various protocols that are now supported (NFS, SMB, Object). The output from this is available with the 'mmces state' and 'mmces events' CLIs, and I would like to get feedback from anyone who has had the chance to make use of this. Is it useful? How could it be improved? We are looking at the possibility of extending this component to cover more than just protocols, so any feedback would be greatly appreciated.
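One way such state output gets consumed in practice is by feeding it into site monitoring. The sketch below reduces a node-by-service state table to just the problem entries; the column layout it assumes (a header row of service names, then one row per node with states like HEALTHY or DEGRADED) is hypothetical and should be checked against the actual `mmces state show` output on your release.

```python
def unhealthy_services(state_table):
    """Return {node: [service, ...]} for services not reporting HEALTHY.

    Assumed (hypothetical) table shape:
        NODE   AUTH   NETWORK   NFS   OBJ   SMB   CES
        node1  HEALTHY HEALTHY DEGRADED HEALTHY HEALTHY DEGRADED
    """
    rows = [line.split() for line in state_table.splitlines() if line.strip()]
    services = rows[0][1:]  # header row minus the NODE column
    problems = {}
    for row in rows[1:]:
        node, states = row[0], row[1:]
        bad = [svc for svc, st in zip(services, states) if st != "HEALTHY"]
        if bad:
            problems[node] = bad
    return problems
```

A report like this, polled per cluster, is close to the "out of the box alerting" asked for earlier in the thread: empty dict means all clear, anything else pages someone.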
Thanks in advance,
IBM Spectrum Scale - Development Engineer
IBM Systems - Manchester Lab
IBM UK Limited