[gpfsug-discuss] Monitor NSD server queue?

Yuri L Volobuev volobuev at us.ibm.com
Wed Aug 17 21:34:57 BST 2016


Unfortunately, at the moment there's no safe mechanism to show the usage
statistics for different NSD queues.  "mmfsadm saferdump nsd" as
implemented doesn't acquire locks when parsing internal data structures.
Now, NSD data structures are fairly static, as such things go, so the risk
of following a stale pointer and hitting a segfault isn't particularly
significant.  I don't think I remember ever seeing mmfsd crash with NSD
dump code on the stack.  That said, this isn't code that's tested and known
to be safe for production use.  I haven't seen a case myself where an mmfsd
thread gets stuck running this dump command, either, but Bob has.  If that
condition ever reoccurs, I'd be interested in seeing debug data.
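
For anyone who still runs the dump occasionally while no supported
interface exists, here is a minimal sketch (an illustration only, not
IBM-provided or tested code) of keeping the collecting script itself from
blocking when the command hangs. The function name, the status strings and
the timeout value are all hypothetical. Note the limits: killing a hung
client does not make the dump any safer for mmfsd and cannot free a thread
that is already stuck inside mmfsd. It only bounds how long the script
waits and records that a hang happened, which is the moment to start
collecting debug data.

```python
import subprocess
from typing import List, Tuple


def run_with_timeout(cmd: List[str], timeout_s: float) -> Tuple[str, str]:
    """Run cmd, giving up after timeout_s seconds.

    Returns (status, stdout), where status is "ok", "failed" or "timeout".
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising. That only
        # cleans up the client side; it does not release anything the
        # command may have left stuck inside the daemon it talked to.
        return "timeout", ""
    except OSError:
        # The command is missing or not executable on this node.
        return "failed", ""
    return ("ok" if proc.returncode == 0 else "failed"), proc.stdout


# Hypothetical use on an NSD server (30 seconds is an arbitrary choice):
#   status, text = run_with_timeout(["mmfsadm", "saferdump", "nsd"], 30.0)
```

A "timeout" result is a signal to stop polling and investigate rather than
to retry, since repeated attempts could tie up further mmfsd threads.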

I agree that there's value in giving a sysadmin insight into the inner
workings of the NSD server machinery, in particular the queue dynamics.
mmdiag should be enhanced to allow this.  That'd be a very reasonable (and
doable) RFE.

yuri



From:	"Oesterlin, Robert" <Robert.Oesterlin at nuance.com>
To:	gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
Date:	08/17/2016 04:45 AM
Subject:	Re: [gpfsug-discuss] Monitor NSD server queue?
Sent by:	gpfsug-discuss-bounces at spectrumscale.org



Hi Aaron

You did a perfect job of explaining a situation I've run into time after
time - high latency on the disk subsystem causing a backup in the NSD
queues. I was doing what you suggested not to do - "mmfsadm saferdump nsd"
and looking at the queues. In my case "mmfsadm saferdump" would usually
work or hang, rather than kill mmfsd. But - the hang usually resulted in a
tied up thread in mmfsd, so that's no good either.

I wish I had better news - this is the only way I've found to get
visibility to these queues. IBM hasn't seen fit to give us a way to safely
look at these. I personally think it's a bug that we can't safely dump
these structures, as they give insight as to what's actually going on
inside the NSD server.

Yuri, Sven - thoughts?


Bob Oesterlin
Sr Storage Engineer, Nuance HPC Grid



From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of "Knister,
Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" <aaron.s.knister at nasa.gov>
Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: Tuesday, August 16, 2016 at 8:46 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: [EXTERNAL] [gpfsug-discuss] Monitor NSD server queue?

Hi Everyone,

We ran into a rather interesting situation over the past week. We had a job
that was pounding the ever loving crap out of one of our filesystems
(called dnb02) doing about 15GB/s of reads. We had other jobs experience a
slowdown on a different filesystem (called dnb41) that uses entirely
separate backend storage. What I can't figure out is why this other
filesystem was affected. I've checked IB bandwidth and congestion, Fibre
Channel bandwidth and errors, Ethernet bandwidth and congestion, looked at the
mmpmon nsd_ds counters (including disk request wait time), and checked out
the disk iowait values from collectl. I simply can't account for the
slowdown on the other filesystem. The only thing I can think of is the high
latency on dnb02's NSDs caused the mmfsd NSD queues to back up.

Here's my question-- how can I monitor the state of the NSD queues? I can't
find anything in mmdiag. An mmfsadm saferdump NSD shows me the queues and
their status. I'm just not sure calling saferdump NSD every 10 seconds to
monitor this data is going to end well. I've seen saferdump NSD cause mmfsd
to die and that's from a task we only run every 6 hours that calls
saferdump NSD.

Any thoughts/ideas here would be great.

Thanks!

-Aaron