[gpfsug-discuss] Re: GPFS Remote Cluster Co-existence with CTDB/NFS Re-exporting

Chris hunter chris.hunter at yale.edu
Fri Dec 11 00:11:29 GMT 2015


Hi Stewart,
Can't comment on the NFS or snapshot issues. However, it's common to change 
the configuration parameters "maxMissedPingTimeout" and "minMissedPingTimeout" 
when adding remote clusters.
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Tuning%20Parameters
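
For what it's worth, these are set cluster-wide with mmchconfig; a rough 
sketch of the kind of change, with purely illustrative values (pick numbers 
that suit your own network):

   # see whether either setting has already been changed from its default
   mmlsconfig | grep -i missedpingtimeout

   # example values only: raise the tolerance for missed pings before a
   # node is declared dead and expelled
   mmchconfig minMissedPingTimeout=60,maxMissedPingTimeout=120

Check the mmchconfig man page for your release to see whether the change can 
take effect immediately or needs the daemon recycled.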

Below is an earlier gpfsug thread about remote cluster expels:
> Re: [gpfsug-discuss] data interface and management interface.
> Bob Oesterlin, oester at gmail.com
> Mon Jul 13 18:42:47 BST 2015
> Some thoughts on node expels, based on the last 2-3 months of "expel hell"
> here. We've spent a lot of time looking at this issue, across multiple
> clusters. A big thanks to IBM for helping us home in on the right issues.
> First, you need to understand whether the expels are due to an "expired
> lease" message or to "communication issues". It sounds like you are
> talking about the latter. In the case of nodes being expelled due to
> communication issues, it's more likely the problem is related to network
> congestion. This can occur at many levels - the node, the network, or the
> switch.
>
> When it's a communication issue, changing parameters like "missed ping
> timeout" isn't going to help you. The problem for us ended up being that
> GPFS wasn't getting a response to a periodic "keep alive" poll to the node,
> and after 300 seconds it declared the node dead and expelled it. You can
> tell if this is the issue by looking at the RPC waiters just before the
> expel. If you see something like a "Waiting for poll on sock" RPC, that
> means the node is waiting for that periodic poll to return, and it's not
> seeing it. The response is either lost in the network, sitting on the
> network queue, or the node is too busy to send it. You may also see RPCs
> like "waiting for exclusive use of connection" - this is another clear
> indication of network congestion.
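
A quick way to see those waiters on a suspect node is mmdiag; the grep 
pattern below is only an example matching the strings Bob quotes:

   # dump the current RPC waiters on this node
   mmdiag --waiters

   # or watch just for the congestion-related waiters mentioned above
   mmdiag --waiters | grep -Ei 'poll on sock|exclusive use of connection'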
>
> Look at the GPFSUG presentations (http://www.gpfsug.org/presentations/) for
> one by Jason Hick (NERSC) - he also talks about these issues. You need to
> take a look at net.ipv4.tcp_wmem and net.ipv4.tcp_rmem, especially if you
> have client nodes that are on slower network interfaces.
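
For reference, those are ordinary sysctl knobs; something like the lines 
below is common for 10GbE-class networks, but the buffer sizes are example 
values only, not a recommendation for any particular site:

   # inspect the current TCP autotuning limits (min, default, max in bytes)
   sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem

   # example values only: raise the maximum receive/send buffer sizes
   sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
   sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

   # add the same settings to /etc/sysctl.conf to make them persistent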
>
> In our case, it was a number of factors - adjusting these settings,
> looking at congestion at the switch level, and some physical hardware
> issues.
>
> Bob Oesterlin, Sr Storage Engineer, Nuance Communications
> robert.oesterlin at nuance.com 
chris hunter
chris.hunter at yale.edu

> -----Original Message-----
> Sent: Friday, 11 December 2015 2:14 AM
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Subject: Re: [gpfsug-discuss] GPFS Remote Cluster Co-existence with CTDB/NFS Re-exporting
>
> Hi Again Everybody,
>
> Ok, so we got resolution on this.  Recall that I had said we'd just added ~300 remote cluster GPFS clients and started having problems with CTDB the very same day...
>
> Among those clients, there were three that had misconfigured firewalls, such that they could reach our home cluster nodes on port 1191, but our home cluster nodes could *not* reach them on 1191 *or* on any of the ephemeral ports.  This situation played absolute *havoc* with the stability of the filesystem.  From what we could tell, it seemed that these three nodes would establish a harmless-looking connection and mount the filesystem.  However, as soon as one of them acquired a resource (lock token or similar?) that the home cluster needed back...watch out!
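
For anyone else adding remote clients, a crude check along these lines, run 
in *both* directions, will catch this kind of half-open connectivity.  The 
host names are placeholders, and 1191 is the standard GPFS daemon port; if 
your daemons are configured to use additional ports, test those as well:

   # from the new remote client: can it reach a home cluster node on 1191?
   nc -z -w 5 home-node01 1191 && echo reachable || echo BLOCKED

   # from a home cluster node: can it reach the new client on 1191 too?
   # (this reverse direction is the one that is easy to forget)
   nc -z -w 5 new-client01 1191 && echo reachable || echo BLOCKED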
>
> In the GPFS logs on our side, we would see messages asking for the expulsion of these nodes about 4 - 5 times per day and a ton of messages about timeouts when trying to contact them.  These nodes would then re-join the cluster, since they could contact us, and this would entail repeated "delay N seconds for recovery" events.
>
> During these recovery periods, the filesystem would become unresponsive for up to 60 or more seconds at a time.  This seemed to cause various NFS processes to fall on their faces.  Sometimes, the victim would be nfsd itself;  other times, it would be rpc.mountd.  CTDB would then come check on NFS, find that it was floundering, and start a recovery run.  To make things worse, at those very times the CTDB shared accounting files would *also* be unavailable, since they reside on the same GPFS filesystem that they are serving (thanks to Doug for pointing out the flaw in this design; we're currently looking for an alternate home for these shared files).
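
If the shared file in question is the CTDB recovery lock, it is normally set 
in CTDB's sysconfig on each node, roughly as below; the path is only a 
placeholder, and whatever replaces it still has to be a filesystem that every 
CTDB node can see, ideally one that is not itself being re-exported:

   # /etc/sysconfig/ctdb  (path is a placeholder, not a recommendation)
   CTDB_RECOVERY_LOCK=/some/other/shared/fs/.ctdb/reclock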
>
> This all added up to a *lot* of flapping, in NFS as well as with CTDB itself.  However, the problems with CTDB/NFS were a *symptom* in this case, not a root cause.  The *cause* was the imperfect connectivity of just three out of 300 new clients.  I think the moral of the story here is this:  if you're adding remote cluster clients, make *absolutely* sure that all communications work going both ways between your home cluster and *every* new client.  If there is asymmetrical connectivity such as we had last week, you are in for one wild ride.  I would also point out that the flapping did not stop until we resolved connectivity for *all* of the clients, so remember that even having one single half-connected client is poisonous to your stability.
>
> Thanks to everybody for all of your help!  Unless something changes, I'm declaring that our site is out of the woods on this one.
>
> Stewart


