<div dir="ltr">was this filesystem creates with -n 5000 ? or was that changed later with mmchfs ? <div>please send the mmlsconfig/mmlscluster output to me at <a href="mailto:oehmes@us.ibm.com">oehmes@us.ibm.com</a></div><div><br></div><div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr">On Fri, Mar 24, 2017 at 10:58 AM Aaron Knister <<a href="mailto:aaron.s.knister@nasa.gov">aaron.s.knister@nasa.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I feel a little awkward about posting wlists of IP's and hostnames on<br class="gmail_msg">
the mailing list (even though they're all internal) but I'm happy to
send them to you directly. I've attached both an lsfs and an mmdf output of
the fs in question here since that may be useful for others to see. Just
a note about disk d23_02_021 -- it's been evacuated for several weeks now
due to a hardware issue in the disk enclosure.

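For reference, an evacuated disk's state can be confirmed with mmlsdisk -- a
sketch, filesystem name "fs1" assumed:

   # mmlsdisk fs1 -e    # list only disks whose availability/status is not up/ready
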
The fs is rather full percentage-wise (93%), but in terms of capacity
there's a good amount free: 93% full on a 7PB filesystem still leaves
551T. Metadata, as you'll see, is 31% free (roughly 800GB).

The fs has 40M inodes allocated and 12M free.

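Those numbers come out of mmdf -- a sketch for anyone reproducing them,
filesystem name "fs1" assumed:

   # mmdf fs1       # per-pool capacity, including the metadata pool
   # mmdf fs1 -F    # allocated vs. free inode counts
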
-Aaron

On 3/24/17 1:41 PM, Sven Oehme wrote:
> Ok, that seems like a different problem than I was thinking of.
> Can you send the output of mmlscluster, mmlsconfig, mmlsfs all?
> Also, are you getting close to full on inodes or capacity on any of
> the filesystems?
>
> sven
>
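To make those easy to mail, the three outputs can be captured in one go -- the
output path here is just an example:

   # { mmlscluster; mmlsconfig; mmlsfs all; } > /tmp/gpfs-config-$(hostname).txt
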
> On Fri, Mar 24, 2017 at 10:34 AM Aaron Knister <aaron.s.knister@nasa.gov> wrote:
>
> Here's the screenshot from the other node with the high cpu utilization.
>
> On 3/24/17 1:32 PM, Aaron Knister wrote:
> > heh, yep we're on sles :)
> >
> > here's a screenshot of the fs manager from the deadlocked filesystem. I
> > don't think there's an nsd server or manager node that's running full
> > throttle across all cpus. There is one that's got relatively high CPU
> > utilization though (300-400%). I'll send a screenshot of it in a sec.
> >
> > no zimon yet, but we do have other tools to see cpu utilization.
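
The kind of quick cluster-wide CPU sweep such tools do can be approximated with
pdsh -- a sketch, the node range here is hypothetical:

   # pdsh -w nsd[01-12] 'top -bn1 | head -4' | dshbak -c
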
> >
> > -Aaron
> >
> > On 3/24/17 1:22 PM, Sven Oehme wrote:
> >> You must be on SLES, as this segfaults only on SLES to my knowledge :-)
> >>
> >> I am looking for an NSD or manager node in your cluster that runs at 100%
> >> cpu usage.
> >>
> >> Do you have zimon deployed to look at cpu utilization across your nodes?
> >>
> >> sven
> >>
> >> On Fri, Mar 24, 2017 at 10:08 AM Aaron Knister <aaron.s.knister@nasa.gov> wrote:
> >>
> >> Hi Sven,
> >>
> >> Which NSD server should I run top on, the fs manager? If so, the CPU load
> >> is about 155%. I'm working on perf top but not off to a great start...
> >>
> >> # perf top
> >> PerfTop: 1095 irqs/sec kernel:61.9% exact: 0.0% [1000Hz cycles], (all, 28 CPUs)
> >> ---------------------------------------------------------------------------
> >>
> >> Segmentation fault
> >>
> >> -Aaron
> >>
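fwiw, when perf top segfaults like this, one fallback is to record samples and
inspect them offline -- standard perf usage:

   # perf record -a -g -- sleep 10    # sample all CPUs for ~10 seconds
   # perf report --stdio | head -50   # top symbols; look for a spinlock at the top
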
> >> On 3/24/17 1:04 PM, Sven Oehme wrote:
> >> > While this is happening, run top and see if there is very high cpu
> >> > utilization at this time on the NSD server.
> >> >
> >> > If there is, run perf top (you might need to install the perf command)
> >> > and see if the top cpu contender is a spinlock. If so, send a screenshot
> >> > of perf top, as I may know what that is and how to fix it.
> >> >
> >> > sven
> >> >
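To identify which node to watch, the filesystem manager can be found with
mmlsmgr -- a sketch, filesystem name "fs1" assumed:

   # mmlsmgr fs1    # print the manager node for fs1 (mmlsmgr alone lists all filesystems)
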
> >> > On Fri, Mar 24, 2017 at 9:43 AM Aaron Knister <aaron.s.knister@nasa.gov> wrote:
> >> >
> >> > Since yesterday morning we've noticed some deadlocks on one of our
> >> > filesystems that seem to be triggered by writing to it. The waiters on
> >> > the clients look like this:
> >> >
> >> > 0x19450B0 ( 6730) waiting 2063.294589599 seconds, SyncHandlerThread:
> >> > on ThCond 0x1802585CB10 (0xFFFFC9002585CB10) (InodeFlushCondVar),
> >> > reason 'waiting for the flush flag to commit metadata'
> >> > 0x7FFFDA65E200 ( 22850) waiting 0.000246257 seconds,
> >> > AllocReduceHelperThread: on ThCond 0x7FFFDAC7FE28 (0x7FFFDAC7FE28)
> >> > (MsgRecordCondvar), reason 'RPC wait' for allocMsgTypeRelinquishRegion
> >> > on node 10.1.52.33 <c0n3271>
> >> > 0x197EE70 ( 6776) waiting 0.000198354 seconds,
> >> > FileBlockWriteFetchHandlerThread: on ThCond 0x7FFFF00CD598
> >> > (0x7FFFF00CD598) (MsgRecordCondvar), reason 'RPC wait' for
> >> > allocMsgTypeRequestRegion on node 10.1.52.33 <c0n3271>
> >> >
> >> > (10.1.52.33/c0n3271 is the fs manager for the filesystem in question)
> >> >
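Waiters like these can be dumped on any node with mmdiag:

   # mmdiag --waiters    # list currently waiting mmfsd threads on this node
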
> >> > there's a single process running on this node writing to the filesystem
> >> > in question (well, trying to write -- it's been blocked, doing nothing,
> >> > for half an hour now). There are ~10 other client nodes in this
> >> > situation right now. We had many more last night before the problem
> >> > seemed to disappear in the early hours of the morning, and now it's back.
> >> >
> >> > Waiters on the fs manager look like this. While each individual waiter
> >> > is short, it's a near-constant stream:
> >> >
> >> > 0x7FFF60003540 ( 8269) waiting 0.001151588 seconds, Msg handler
> >> > allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0
> >> > (0xFFFFC9002163A2E0) (AllocManagerMutex)
> >> > 0x7FFF601C8860 ( 20606) waiting 0.001115712 seconds, Msg handler
> >> > allocMsgTypeRelinquishRegion: on ThMutex 0x1802163A2E0
> >> > (0xFFFFC9002163A2E0) (AllocManagerMutex)
> >> > 0x7FFF91C10080 ( 14723) waiting 0.000959649 seconds, Msg handler
> >> > allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0
> >> > (0xFFFFC9002163A2E0) (AllocManagerMutex)
> >> > 0x7FFFB03C2910 ( 12636) waiting 0.000769611 seconds, Msg handler
> >> > allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0
> >> > (0xFFFFC9002163A2E0) (AllocManagerMutex)
> >> > 0x7FFF8C092850 ( 18215) waiting 0.000682275 seconds, Msg handler
> >> > allocMsgTypeRelinquishRegion: on ThMutex 0x1802163A2E0
> >> > (0xFFFFC9002163A2E0) (AllocManagerMutex)
> >> > 0x7FFF9423F730 ( 12652) waiting 0.000641915 seconds, Msg handler
> >> > allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0
> >> > (0xFFFFC9002163A2E0) (AllocManagerMutex)
> >> > 0x7FFF9422D770 ( 12625) waiting 0.000494256 seconds, Msg handler
> >> > allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0
> >> > (0xFFFFC9002163A2E0) (AllocManagerMutex)
> >> > 0x7FFF9423E310 ( 12651) waiting 0.000437760 seconds, Msg handler
> >> > allocMsgTypeRelinquishRegion: on ThMutex 0x1802163A2E0
> >> > (0xFFFFC9002163A2E0) (AllocManagerMutex)
> >> >
> >> > I don't know if this data point is useful, but both yesterday and today
> >> > the metadata NSDs for this filesystem have seen a constant aggregate
> >> > stream of 25MB/s / 4k op/s of reads during each episode (very low
> >> > latency, though, so I don't believe the storage is a bottleneck here).
> >> > Writes are only a few hundred ops and didn't strike me as odd.
> >> >
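Per-filesystem I/O counters like these can also be sampled with mmpmon and its
standard fs_io_s request -- a sketch:

   # echo fs_io_s | mmpmon -p -r 10 -d 1000    # per-fs read/write bytes and ops, 10 samples 1s apart
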
> >> > I have a PMR open for this, but I'm curious whether folks have seen
> >> > this in the wild and what it might mean.
> >> >
> >> > -Aaron
> >> >
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss