[gpfsug-discuss] Problem Determination

Fri Oct 2 17:44:24 BST 2015

I would like to strongly echo what Bob has stated, especially about missing or wrong documentation, and I have in-lined some comments below.

I liken GPFS to a critical care patient at the hospital.  You have to check on the state regularly, know the running heart rate (e.g. waiters), and know the response of every component, from disks to networks to server load.  When a problem occurs, running tests (such as nsdperf) to help isolate the problem quickly is crucial.  Capturing GPFS trace data is also very important if the problem isn’t obvious.  But then you have to wait for IBM support to parse the information and give you their analysis of the situation.  It would be great to get an advanced troubleshooting document that describes how to read the output of `mmfsadm dump` commands and the GPFS trace report that is generated.
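
To make the "heart rate" check a little more concrete, here is a minimal Python sketch of a waiter filter. It assumes you have already captured waiter text (for example from `mmdiag --waiters`) and that each waiter line contains the word "waiting" followed by a number of seconds. The exact line layout differs between releases, so the regular expression, the 10-second threshold, and the sample lines below are all assumptions for illustration, not real cluster output.

```python
import re

# Assumed line shape: "... waiting <seconds> sec..." (case-insensitive "w").
# Real waiter output varies by release; adjust the pattern to what you see.
WAITER_RE = re.compile(r"[Ww]aiting\s+(?P<secs>\d+(?:\.\d+)?)\s+sec")


def long_waiters(text, threshold_secs=10.0):
    """Return (seconds, line) pairs at or above the threshold, longest first."""
    hits = []
    for line in text.splitlines():
        match = WAITER_RE.search(line)
        if match and float(match.group("secs")) >= threshold_secs:
            hits.append((float(match.group("secs")), line.strip()))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


# Hypothetical sample lines, written by hand for this example.
SAMPLE_WAITERS = """\
Waiting 0.0042 sec since 10:15:01, monitored, thread 1234 NSDThread: for I/O completion
Waiting 45.3100 sec since 10:14:16, monitored, thread 5678 SomeThread: reason 'wait for token'
0x7F01 waiting 12.500000000 seconds, CleanBufferThread: for I/O completion
"""

if __name__ == "__main__":
    for secs, line in long_waiters(SAMPLE_WAITERS):
        print(f"{secs:8.1f}s  {line}")
```

Run periodically against fresh waiter output, something like this gives a quick view of which waiters have been stuck the longest. It does not interpret what the waiter reason means, which is the part that still needs documentation.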

Cheers,
-Bryan

From: gpfsug-discuss-bounces at gpfsug.org [mailto:gpfsug-discuss-bounces at gpfsug.org] On Behalf Of Oesterlin, Robert
Sent: Thursday, October 01, 2015 7:39 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Problem Determination

Hi Patrick

I was going to mail you directly - but this may help spark some discussion in this area.  GPFS (pardon the use of the “old school” term - you need something easier to type than Spectrum Scale) problem determination is one of those areas that is (sometimes) more of an art than a science. IBM publishes a PD guide, and it’s a good start but doesn’t cover all the bases.

- In the GPFS log (/var/adm/ras/mmfs.log.latest) there are a lot of messages generated. I continue to come across ones that are not documented - or documented poorly. EVERYTHING that ends up in ANY log needs to be documented.
- The PD guide gives some basic things to look at for many of the error messages, but doesn’t go into alternative explanations for many errors. Example: When a node gets expelled, the PD guide tells you it’s a communication issue, when in fact it may be related to other things like Linux network tuning. Covering all the possible causes is hard, but you can improve this.
- GPFS waiter information - understanding and analyzing this is key to getting to the bottom of many problems. The waiter information is not well documented. You should include at least a basic guide on how to use waiter information in determining cluster problems. Related: Undocumented config options. You can come across some by doing “mmdiag --config”. Using some of these can help you - or get you in trouble in the long run. If I can see the option, document it.
                [Bryan: Also please, please provide a way to check whether or not the configuration parameters need to be changed.  I assume that there is an `mmfsadm dump` command that can tell you whether the config parameter needs to be changed; if not, make one!  Just stating something like “This could be increased to XX value for very large clusters” is not very helpful.]
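
As a starting point for reviewing configuration, the following Python sketch lists the parameters that differ from their defaults in captured `mmdiag --config` text. It assumes the output is one "name value" pair per line and that a leading "!" marks a value changed from the default. Treat both as assumptions to verify on your release. The values and the `someUndocumentedOption` name in the sample are hypothetical. Note that this only shows what has been changed, not whether a value should be changed, which is the gap described above.

```python
def parse_config(text):
    """Parse 'name value' lines into {name: (value, is_non_default)}.

    Assumes a leading '!' flags a non-default value and that lines
    starting with '===' are section headers to skip.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("==="):
            continue
        changed = line.startswith("!")
        if changed:
            line = line[1:].strip()
        name, _, value = line.partition(" ")
        params[name] = (value.strip(), changed)
    return params


def non_default(params):
    """Return only the parameters flagged as changed from the default."""
    return {name: value for name, (value, changed) in params.items() if changed}


# Hypothetical sample text, written by hand for this example.
SAMPLE_CONFIG = """\
=== mmdiag: config ===
   maxFilesToCache 4000
 ! pagepool 4294967296
 ! workerThreads 512
   someUndocumentedOption 1
"""

if __name__ == "__main__":
    for name, value in sorted(non_default(parse_config(SAMPLE_CONFIG)).items()):
        print(f"{name} = {value}")
```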

- Make sure that all information I might come across online is accurate, especially on those sites managed by IBM. The Developerworks wiki has great information, but there is a lot of information out there that’s out of date or inaccurate. This leads to confusion.
                [Bryan: I know that Scott Fadden is a busy man, so I would recommend helping distribute the workload of maintaining the wiki documentation.  This data should be reviewed on a more regular basis, at least once for each major release I would hope, and updated or deleted if found to be out of date.]

- The automatic deadlock detection implemented in 4.1 can be useful, but it can also be problematic in a large cluster when you get into problems. Firing off traces and taking dumps in an automated manner can cause more problems if you have a large cluster. I ended up turning it off.
                [Bryan: From what I’ve heard, IBM is actively working to make the deadlock amelioration logic better.  I agree that firing off traces can cause more problems, and we have turned off the automated collection as well.  We are going to work on enabling the collection of some data during these events to help ensure we get enough data for IBM to analyze the problem.]

- GPFS doesn’t have anything set up to alert you when conditions occur that may require your attention. There are some alerting capabilities that you can customize, but something out of the box might be useful. I know there is work going on in this area.
                [Bryan: The GPFS callback facilities are very useful for setting up alerts, but they are not well documented or advertised in the GPFS manuals.  I hope to see more callback capabilities added to help monitor all aspects of the GPFS cluster and file systems.]
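
For readers who have not used callbacks, here is a minimal Python sketch of an alert handler that a callback could invoke. It assumes the script is registered with `mmaddcallback` and that the callback passes the event name and the node name as two command-line arguments. The script name, the severity mapping, and the message format are hypothetical choices for this example. Check the `mmaddcallback` documentation for your release for the events and parameter variables that are actually available.

```python
import sys
from datetime import datetime, timezone

# Hypothetical severity mapping; extend it with the events you register for.
SEVERITY = {"quorumLoss": "CRITICAL", "nodeLeave": "WARNING"}


def format_alert(event, node, now=None):
    """Build a one-line alert message for an event reported on a node."""
    now = now or datetime.now(timezone.utc)
    severity = SEVERITY.get(event, "INFO")
    return f"{now:%Y-%m-%dT%H:%M:%SZ} {severity} gpfs event={event} node={node}"


if __name__ == "__main__":
    # Assumed invocation: gpfs_alert.py <eventName> <nodeName>
    # Placeholder values are used when the script is run by hand without arguments.
    if len(sys.argv) >= 3:
        event, node = sys.argv[1], sys.argv[2]
    else:
        event, node = "manualTest", "localhost"
    # Printing stands in for whatever alerting channel you use (mail, syslog, pager).
    print(format_alert(event, node))
```

Keeping the handler this small matters: it runs on cluster nodes during an event, so it should return quickly and avoid touching the GPFS file system itself.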

mmces - I did some early testing on this but haven’t had a chance to upgrade my protocol nodes to the new level. Upgrading 1000s of nodes across many clusters is - challenging :-) The newer commands are a great start. I like the ability to list out events related to a particular protocol.

I could go on… Feel free to contact me directly for a more detailed discussion: robert.oesterlin @ nuance.com

Bob Oesterlin
Sr Storage Engineer, Nuance Communications

From: <gpfsug-discuss-bounces at gpfsug.org> on behalf of Patrick Byrne
Reply-To: gpfsug main discussion list
Date: Thursday, October 1, 2015 at 5:09 AM
To: "gpfsug-discuss at gpfsug.org"
Subject: [gpfsug-discuss] Problem Determination

Hi all,

As I'm sure some of you are aware, problem determination is an area that we are looking to try and make significant improvements to over the coming releases of Spectrum Scale. To help us target the areas we work to improve and make it as useful as possible I am trying to get as much feedback as I can about different problems users have, and how people go about solving them.

I am interested in hearing everything from day to day annoyances to problems that have caused major frustration in trying to track down the root cause. Where possible it would be great to hear how the problems were dealt with as well, so that others can benefit from your experience. Feel free to reply to the mailing list - maybe others have seen similar problems and could provide tips for the future - or to me directly if you'd prefer (patbyrne at uk.ibm.com).

On a related note, in 4.1.1 there was a component added that monitors the state of the various protocols that are now supported (NFS, SMB, Object). The output from this is available with the 'mmces state' and 'mmces events' CLIs and I would like to get feedback from anyone who has had the chance to make use of this. Is it useful? How could it be improved? We are looking at the possibility of extending this component to cover more than just protocols, so any feedback would be greatly appreciated.

Thanks in advance,

Patrick Byrne
IBM Spectrum Scale - Development Engineer
IBM Systems - Manchester Lab
IBM UK Limited
