[gpfsug-discuss] filesystem manager crashes every time mmdelsnapshot (from either the filesystem manager or some other nsd/client) is called

Bryan Banister bbanister at jumptrading.com
Fri May 30 14:35:55 BST 2014


This sounds like a serious problem and you should open a PMR with IBM to get direct guidance.

I normally take a GPFS trace during a problem like this, on all of the nodes that are affected or directly involved in the operation.
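
As a rough sketch of what that can look like when scripted, the Python below builds `mmtracectl --start` and `mmtracectl --stop` command lines for a list of nodes. The install path, the helper names, and the second node name are assumptions for illustration, and the options should be checked against the mmtracectl man page for your GPFS level. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Assumed default install location of the GPFS commands; adjust for your site.
MMTRACECTL = "/usr/lpp/mmfs/bin/mmtracectl"


def trace_command(action, nodes):
    """Build an mmtracectl command line that starts or stops tracing on the given nodes."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    # -N takes a comma-separated node list.
    return [MMTRACECTL, "--" + action, "-N", ",".join(nodes)]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


# Hypothetical node list: the filesystem manager plus the node issuing mmdelsnapshot.
nodes = ["mako-nsd1", "mako-nsd2"]
run(trace_command("start", nodes))
# ... reproduce the mmdelsnapshot failure here ...
run(trace_command("stop", nodes))
```

Starting the trace just before the failing command and stopping it right after keeps the trace files small enough to attach to the PMR.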

Hope that helps,
-Bryan

From: gpfsug-discuss-bounces at gpfsug.org [mailto:gpfsug-discuss-bounces at gpfsug.org] On Behalf Of Sabuj Pattanayek
Sent: Thursday, May 29, 2014 8:34 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] filesystem manager crashes every time mmdelsnapshot (from either the filesystem manager or some other nsd/client) is called

This is still happening in 3.5.0.18. In addition, while a snapshot is being deleted, NFS read speeds slow to a crawl (native GPFS access and NFS writes are not affected).

On Thu, May 15, 2014 at 7:48 AM, Sabuj Pattanayek <sabujp at gmail.com> wrote:
Hi all,

We're running 3.5.0.17 now, and it looks like the filesystem manager automatically reboots (and sometimes fails to automatically reboot) after mmdelsnapshot is called, either from the filesystem manager itself or from some other NSD server or client node. It didn't start happening immediately after we updated to 17, but we never had this issue when we were at 3.5.0.11. The error mmdelsnapshot throws at some point is:

Lost connection to file system daemon.
mmdelsnapshot: An internode connection between GPFS nodes was disrupted.
mmdelsnapshot: Command failed.  Examine previous error messages to determine cause.

It also causes an mmfs generic error and/or a "kernel: BUG: soft lockup - CPU#15 stuck for 67s! [mmfsd:39266]". After the mmfs generic error the system reboots itself; after the soft lockup it does not reboot, which is actually worse.



It also causes some havoc with CNFS file locking, even after the filesystem manager has rebooted and come back up:

May 15 07:10:12 mako-nsd1 sm-notify[19387]: Failed to bind RPC socket: Address already in use

May 15 07:21:03 mako-nsd1 sm-notify[11052]: Invalid bind address or port for RPC socket: Name or service not known
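
For anyone wanting to check their own logs for the same symptoms, here is a minimal Python sketch that scans log lines for the messages quoted in this thread. The pattern labels and the `scan` helper are made up for illustration, and the regular expressions are taken only from the lines shown here, so they may need loosening for other kernel or nfs-utils versions.

```python
import re

# Patterns derived from the messages quoted in this thread; extend as needed.
PATTERNS = {
    "soft_lockup": re.compile(
        r"BUG: soft lockup - CPU#\d+ stuck for \d+s! \[mmfsd:\d+\]"
    ),
    "sm_notify_bind": re.compile(
        r"sm-notify\[\d+\]: (Failed to bind RPC socket|Invalid bind address or port)"
    ),
    "lost_daemon": re.compile(r"Lost connection to file system daemon"),
}


def scan(lines):
    """Return (label, line) pairs for every line that matches a known symptom."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.rstrip("\n")))
                break  # one label per line is enough
    return hits
```

It can be run against a syslog file with something like `scan(open("/var/log/messages", errors="replace"))`, where the path is an assumption about where the node logs these messages.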



I saw some snapshot-related fixes in 3.5.0.18. Has anyone seen this behavior, or does anyone know if it's fixed in 18?



Thanks,
Sabuj


