[gpfsug-discuss] strange waiters + filesystem deadlock

Aaron Knister aaron.s.knister at nasa.gov
Fri Mar 24 18:08:44 GMT 2017


I believe it was created with -n 5000. Here's the exact command that was 
used:

/usr/lpp/mmfs/bin/mmcrfs dnb03 -F ./disc_mmcrnsd_dnb03.lst -T 
/gpfsm/dnb03 -j cluster -B 1M -n 5000 -N 20M -r1 -R2 -m2 -M2 -A no -Q 
yes -v yes -i 512 --metadata-block-size=256K -L 8388608

-Aaron

On 3/24/17 2:05 PM, Sven Oehme wrote:
> was this filesystem created with -n 5000 ? or was that changed later
> with mmchfs ?
> please send the mmlsconfig/mmlscluster output to me at oehmes at us.ibm.com
>
>
>
> On Fri, Mar 24, 2017 at 10:58 AM Aaron Knister <aaron.s.knister at nasa.gov> wrote:
>
>     I feel a little awkward about posting lists of IPs and hostnames on
>     the mailing list (even though they're all internal) but I'm happy to
>     send to you directly. I've attached both an lsfs and an mmdf output of
>     the fs in question here since that may be useful for others to see. Just
>     a note about disk d23_02_021-- it's been evacuated for several weeks now
>     due to a hardware issue in the disk enclosure.
>
>     The fs is rather full percentage-wise (93%) but in terms of capacity
>     there's a good amount free. 93% full of a 7PB filesystem still leaves
>     551T. Metadata, as you'll see, is 31% free (roughly 800GB).
>
>     The fs has 40M inodes allocated and 12M free.
>
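>     (If it helps anyone reproduce the inode numbers, a minimal check,
>     assuming the same device name as above, would be:
>
>     /usr/lpp/mmfs/bin/mmdf dnb03 -F
>
>     which reports just the number of inodes allocated and free.)
>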
>     -Aaron
>
>     On 3/24/17 1:41 PM, Sven Oehme wrote:
>     > ok, that seems a different problem than i was thinking.
>     > can you send output of mmlscluster, mmlsconfig, mmlsfs all ?
>     > also are you getting close to fill grade on inodes or capacity on any of
>     > the filesystems ?
>     >
>     > sven
>     >
>     >
>     > On Fri, Mar 24, 2017 at 10:34 AM Aaron Knister <aaron.s.knister at nasa.gov> wrote:
>     >
>     >     Here's the screenshot from the other node with the high cpu utilization.
>     >
>     >     On 3/24/17 1:32 PM, Aaron Knister wrote:
>     >     > heh, yep we're on sles :)
>     >     >
>     >     > here's a screenshot of the fs manager from the deadlocked filesystem. I
>     >     > don't think there's an nsd server or manager node that's running full
>     >     > throttle across all cpus. There is one that's got relatively high CPU
>     >     > utilization though (300-400%). I'll send a screenshot of it in a sec.
>     >     >
>     >     > no zimon yet but we do have other tools to see cpu utilization.
>     >     >
>     >     > -Aaron
>     >     >
>     >     > On 3/24/17 1:22 PM, Sven Oehme wrote:
>     >     >> you must be on sles as this segfaults only on sles to my knowledge :-)
>     >     >>
>     >     >> i am looking for an NSD or manager node in your cluster that runs at 100%
>     >     >> cpu usage.
>     >     >>
>     >     >> do you have zimon deployed to look at cpu utilization across your nodes ?
>     >     >>
>     >     >> sven
>     >     >>
>     >     >>
>     >     >>
>     >     >> On Fri, Mar 24, 2017 at 10:08 AM Aaron Knister <aaron.s.knister at nasa.gov> wrote:
>     >     >>
>     >     >>     Hi Sven,
>     >     >>
>     >     >>     Which NSD server should I run top on, the fs manager? If so the
>     >     >>     CPU load is about 155%. I'm working on perf top but not off to a
>     >     >>     great start...
>     >     >>
>     >     >>     # perf top
>     >     >>         PerfTop:    1095 irqs/sec  kernel:61.9%  exact:  0.0% [1000Hz
>     >     >>     cycles],  (all, 28 CPUs)
>     >     >>
>     >     >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>     >     >>
>     >     >>     Segmentation fault
>     >     >>
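>     >     >>     (If the interactive perf top keeps segfaulting, a non-interactive
>     >     >>     capture might work as a fallback; just a sketch, not verified on
>     >     >>     this node:
>     >     >>
>     >     >>     perf record -a -g -- sleep 10
>     >     >>     perf report --stdio | head -50
>     >     >>
>     >     >>     same sampling, just written to perf.data and summarized offline.)
>     >     >>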
>     >     >>     -Aaron
>     >     >>
>     >     >>     On 3/24/17 1:04 PM, Sven Oehme wrote:
>     >     >>     > while this is happening run top and see if there is very high cpu
>     >     >>     > utilization at this time on the NSD Server.
>     >     >>     >
>     >     >>     > if there is, run perf top (you might need to install the perf command)
>     >     >>     > and see if the top cpu contender is a spinlock. if so, send a screenshot
>     >     >>     > of perf top as i may know what that is and how to fix it.
>     >     >>     >
>     >     >>     > sven
>     >     >>     >
>     >     >>     >
>     >     >>     > On Fri, Mar 24, 2017 at 9:43 AM Aaron Knister <aaron.s.knister at nasa.gov> wrote:
>     >     >>     >
>     >     >>     >     Since yesterday morning we've noticed some deadlocks on one of our
>     >     >>     >     filesystems that seem to be triggered by writing to it. The waiters on
>     >     >>     >     the clients look like this:
>     >     >>     >
>     >     >>     >     0x19450B0 (   6730) waiting 2063.294589599 seconds, SyncHandlerThread:
>     >     >>     >       on ThCond 0x1802585CB10 (0xFFFFC9002585CB10) (InodeFlushCondVar),
>     >     >>     >       reason 'waiting for the flush flag to commit metadata'
>     >     >>     >     0x7FFFDA65E200 (  22850) waiting 0.000246257 seconds, AllocReduceHelperThread:
>     >     >>     >       on ThCond 0x7FFFDAC7FE28 (0x7FFFDAC7FE28) (MsgRecordCondvar),
>     >     >>     >       reason 'RPC wait' for allocMsgTypeRelinquishRegion on node 10.1.52.33 <c0n3271>
>     >     >>     >     0x197EE70 (   6776) waiting 0.000198354 seconds, FileBlockWriteFetchHandlerThread:
>     >     >>     >       on ThCond 0x7FFFF00CD598 (0x7FFFF00CD598) (MsgRecordCondvar),
>     >     >>     >       reason 'RPC wait' for allocMsgTypeRequestRegion on node 10.1.52.33 <c0n3271>
>     >     >>     >
>     >     >>     >     (10.1.52.33/c0n3271 is the fs manager for the filesystem in question)
>     >     >>     >
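>     >     >>     >     (For anyone following along, a per-node waiter list like the one
>     >     >>     >     above can be dumped with
>     >     >>     >
>     >     >>     >     mmdiag --waiters
>     >     >>     >
>     >     >>     >     on an affected client.)
>     >     >>     >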
>     >     >>     >     there's a single process running on this node writing to the filesystem
>     >     >>     >     in question (well, trying to write; it's been blocked doing nothing for
>     >     >>     >     half an hour now). There are ~10 other client nodes in this situation
>     >     >>     >     right now. We had many more last night before the problem seemed to
>     >     >>     >     disappear in the early hours of the morning, and now it's back.
>     >     >>     >
>     >     >>     >     Waiters on the fs manager look like this. While each individual waiter
>     >     >>     >     is short, it's a near constant stream:
>     >     >>     >
>     >     >>     >     0x7FFF60003540 (   8269) waiting 0.001151588 seconds, Msg handler allocMsgTypeRequestRegion:
>     >     >>     >       on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>     >     >>     >     0x7FFF601C8860 (  20606) waiting 0.001115712 seconds, Msg handler allocMsgTypeRelinquishRegion:
>     >     >>     >       on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>     >     >>     >     0x7FFF91C10080 (  14723) waiting 0.000959649 seconds, Msg handler allocMsgTypeRequestRegion:
>     >     >>     >       on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>     >     >>     >     0x7FFFB03C2910 (  12636) waiting 0.000769611 seconds, Msg handler allocMsgTypeRequestRegion:
>     >     >>     >       on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>     >     >>     >     0x7FFF8C092850 (  18215) waiting 0.000682275 seconds, Msg handler allocMsgTypeRelinquishRegion:
>     >     >>     >       on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>     >     >>     >     0x7FFF9423F730 (  12652) waiting 0.000641915 seconds, Msg handler allocMsgTypeRequestRegion:
>     >     >>     >       on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>     >     >>     >     0x7FFF9422D770 (  12625) waiting 0.000494256 seconds, Msg handler allocMsgTypeRequestRegion:
>     >     >>     >       on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>     >     >>     >     0x7FFF9423E310 (  12651) waiting 0.000437760 seconds, Msg handler allocMsgTypeRelinquishRegion:
>     >     >>     >       on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>     >     >>     >
>     >     >>     >     I don't know if this data point is useful, but both yesterday and today
>     >     >>     >     the metadata NSDs for this filesystem have had a constant aggregate
>     >     >>     >     stream of reads at roughly 25MB/s and 4k ops/s during each episode (very
>     >     >>     >     low latency, though, so I don't believe the storage is a bottleneck
>     >     >>     >     here). Writes are only a few hundred ops and didn't strike me as odd.
>     >     >>     >
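>     >     >>     >     (A per-node view of that read stream, for anyone who wants to
>     >     >>     >     correlate, should be available from the recent I/O history:
>     >     >>     >
>     >     >>     >     mmdiag --iohist
>     >     >>     >
>     >     >>     >     lists the latest I/Os on a node along with their latencies.)
>     >     >>     >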
>     >     >>     >     I have a PMR open for this but I'm curious if folks have seen this in
>     >     >>     >     the wild and what it might mean.
>     >     >>     >
>     >     >>     >     -Aaron
>     >     >>     >
>     >     >>     >     --
>     >     >>     >     Aaron Knister
>     >     >>     >     NASA Center for Climate Simulation (Code 606.2)
>     >     >>     >     Goddard Space Flight Center
>     >     >>     >     (301) 286-2776
>     >     >>     >     _______________________________________________
>     >     >>     >     gpfsug-discuss mailing list
>     >     >>     >     gpfsug-discuss at spectrumscale.org
>     >     >>     >     http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>     >     >>     >
>     >     >>     >
>     >     >>     >
>     >     >>     > _______________________________________________
>     >     >>     > gpfsug-discuss mailing list
>     >     >>     > gpfsug-discuss at spectrumscale.org
>     >     >>     > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>     >     >>     >
>     >     >>
>     >     >>     --
>     >     >>     Aaron Knister
>     >     >>     NASA Center for Climate Simulation (Code 606.2)
>     >     >>     Goddard Space Flight Center
>     >     >>     (301) 286-2776
>     >     >>     _______________________________________________
>     >     >>     gpfsug-discuss mailing list
>     >     >>     gpfsug-discuss at spectrumscale.org
>     >     >>     http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>     >     >>
>     >     >>
>     >     >>
>     >     >> _______________________________________________
>     >     >> gpfsug-discuss mailing list
>     >     >> gpfsug-discuss at spectrumscale.org
>     >     >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>     >     >>
>     >     >
>     >     >
>     >     >
>     >     > _______________________________________________
>     >     > gpfsug-discuss mailing list
>     >     > gpfsug-discuss at spectrumscale.org
>     >     > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>     >     >
>     >
>     >     --
>     >     Aaron Knister
>     >     NASA Center for Climate Simulation (Code 606.2)
>     >     Goddard Space Flight Center
>     >     (301) 286-2776
>     >     _______________________________________________
>     >     gpfsug-discuss mailing list
>     >     gpfsug-discuss at spectrumscale.org
>     >     http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>     >
>     >
>     >
>     > _______________________________________________
>     > gpfsug-discuss mailing list
>     > gpfsug-discuss at spectrumscale.org
>     > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>     >
>
>     --
>     Aaron Knister
>     NASA Center for Climate Simulation (Code 606.2)
>     Goddard Space Flight Center
>     (301) 286-2776
>     _______________________________________________
>     gpfsug-discuss mailing list
>     gpfsug-discuss at spectrumscale.org
>     http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


