[gpfsug-discuss] Hanging file-systems

Sven Oehme oehmes at gmail.com
Tue Nov 27 18:19:04 GMT 2018


Hi,

Now I need to swap back in a lot of information about GPFS that I tried to
swap out :-)

I bet kswapd is not doing what its name suggests here, which is handling
swap space. I claim the kswapd thread is trying to throw dentries out of
the cache, and what it actually tries to get rid of are entries for
directories very high up in the tree which GPFS still holds a refcount on,
so it can't free them. While it does this there is a single thread
(unfortunately this was never implemented with multiple threads) walking
down the tree to find entries to steal; if it can't find any it goes to
the next, and the next, and so on, and on a busy system it can take
forever to free anything up. There have been multiple fixes in this area
in 5.0.1.x and 5.0.2 which I pushed for in the weeks before I left IBM.
You never see this in a trace with default trace levels, which is why
nobody would have suspected it; you need to set special trace levels to
even see it.
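To make that concrete, here is a minimal sketch of what I believe that
single-threaded steal scan looks like. This is written from memory and
without source access; the names (cache_entry, steal_scan) are mine, not
GPFS's:

#include <stddef.h>

struct cache_entry {
    int refcount;              /* > 0 while GPFS still references it */
    struct cache_entry *next;  /* single LRU-ordered steal list      */
};

/* A single thread walks the whole list. Pinned entries near the head
 * (directories high up in the tree) must be skipped on every pass, so
 * on a busy system each reclaim attempt turns into a long, mostly
 * fruitless scan. */
static struct cache_entry *steal_scan(struct cache_entry *head)
{
    for (struct cache_entry *e = head; e != NULL; e = e->next)
        if (e->refcount == 0)
            return e;          /* first stealable entry, if any       */
    return NULL;               /* nothing freeable: the caller blocks */
}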
I don't know the exact version the changes went into, but it was somewhere
in the 5.0.1.x timeframe. The change separated the cache list so that
files are preferred for stealing over directories, and it also keeps a
minimum percentage of directories in the cache (10% by default) before it
will ever try to get rid of a directory. It also tries to keep a list of
free entries at all times (meaning it cleans them proactively), and it
allows going over the hard limit instead of just blocking as in previous
versions. So I assume you run a version prior to 5.0.1.x, and what you see
is kswapd desperately trying to get rid of entries: it can't find any,
it's already at the limit, so it blocks and doesn't allow a new entry to
be created or promoted from the statcache.
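Continuing the sketch above, the post-5.0.1.x policy as I understand it
would look roughly like this; everything here except the 10% default is my
own naming and guesswork:

struct cache_lists {
    struct cache_entry *files;     /* stolen from first               */
    struct cache_entry *dirs;      /* only touched above the floor    */
    struct cache_entry *free_list; /* pre-cleaned, ready for reuse    */
    size_t ndirs, ntotal;
};

enum { DIR_FLOOR_PCT = 10 };       /* keep >= 10% directories cached  */

static struct cache_entry *pop(struct cache_entry **list)
{
    struct cache_entry *e = *list;
    if (e != NULL)
        *list = e->next;
    return e;
}

static struct cache_entry *steal(struct cache_lists *c)
{
    struct cache_entry *e;

    if ((e = pop(&c->free_list)) != NULL)    /* proactive free list   */
        return e;
    if ((e = steal_scan(c->files)) != NULL)  /* prefer files to dirs  */
        return e;
    if (c->ndirs * 100 > c->ntotal * DIR_FLOOR_PCT)
        return steal_scan(c->dirs);          /* dirs only above floor */
    return NULL; /* instead of blocking here, newer code can let the
                    cache temporarily exceed the hard limit           */
}

The point is that a fruitless pass no longer stalls everything: there is
usually something on the free list, and directories are protected instead
of being the first victims.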

Again, all of this is without source code access and is speculation on my
part based on experience :-)

What version are you running? Also, please share the output of mmdiag
--stats from that node.

sven

On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson <S.J.Thompson at bham.ac.uk>
wrote:

> Thanks Sven …
>
>
>
> We found a node with kswapd running 100% (and swap was off)…
>
>
>
> Killing that node made access to the FS spring into life.
>
>
>
> Simon
>
>
>
> *From: *<gpfsug-discuss-bounces at spectrumscale.org> on behalf of "oehmes at gmail.com" <oehmes at gmail.com>
> *Reply-To: *"gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
> *Date: *Tuesday, 27 November 2018 at 16:14
> *To: *"gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
> *Subject: *Re: [gpfsug-discuss] Hanging file-systems
>
> 1. Are you under memory pressure, or even worse, have you started swapping?

