<div><br></div><div>I *think* I've seen this. In our case netstat on the client showed open TCP connections to the NFS server, but those connections were not visible in netstat on the NFS server side.</div><div><br></div><div>Unfortunately I don't remember what the fix was.</div><div><br></div><div><br></div><div><br></div><div>  -jf</div><div><br><div class="gmail_quote"><div>Tue, 25 Apr 2017 at 16:06, Simon Thompson (IT Research Support) <<a href="mailto:S.J.Thompson@bham.ac.uk">S.J.Thompson@bham.ac.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

From what I can see, Ganesha uses the Export_Id option in the config file<br>

(which is managed by CES) for this. I did find some reference in the<br>

Ganesha devs list that if it's not set, it would read the FSID from<br>

the GPFS file-system. Either way, they should surely be consistent across<br>

all the nodes. The posts I found were from someone with an IBM email<br>

address, so I guess someone in the IBM teams.<br>

<br>

I checked a couple of my protocol nodes and they use the same Export_Id<br>

consistently, though I guess that might not be the same as the FSID value.<br>
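
<br>

A minimal sketch of that consistency check, assuming each node's Ganesha export configuration has already been collected as text. The node names, paths and IDs below are hypothetical sample data, and the parser only handles simple EXPORT blocks where Export_Id and Path appear before any nested block:<br>

```python
import re
from collections import defaultdict

# Hypothetical sample configs, one per protocol node; real files will differ.
SAMPLE_CONFIGS = {
    "node1": 'EXPORT { Export_Id = 1; Path = "/gpfs/data"; }',
    "node2": 'EXPORT { Export_Id = 1; Path = "/gpfs/data"; }',
    "node3": 'EXPORT { Export_Id = 2; Path = "/gpfs/data"; }',
}

# Simplification: the non-greedy match stops at the first closing brace,
# so settings placed after a nested block are not seen.
EXPORT_BLOCK = re.compile(r"EXPORT\s*\{(.*?)\}", re.IGNORECASE | re.DOTALL)
EXPORT_ID = re.compile(r"Export_Id\s*=\s*(\d+)\s*;", re.IGNORECASE)
PATH = re.compile(r"\bPath\s*=\s*\"?([^\";]+)\"?\s*;", re.IGNORECASE)


def export_ids_by_path(configs):
    """Map each exported path to {export_id: [nodes using that id]}."""
    seen = defaultdict(lambda: defaultdict(list))
    for node, text in configs.items():
        for block in EXPORT_BLOCK.findall(text):
            id_match = EXPORT_ID.search(block)
            path_match = PATH.search(block)
            if id_match and path_match:
                path = path_match.group(1).strip()
                seen[path][int(id_match.group(1))].append(node)
    return seen


def inconsistent_paths(configs):
    """Return only the paths that carry more than one Export_Id."""
    return {
        path: dict(ids)
        for path, ids in export_ids_by_path(configs).items()
        if len(ids) > 1
    }


MISMATCHES = inconsistent_paths(SAMPLE_CONFIGS)
print(MISMATCHES)
```

An empty result means every node uses the same Export_Id for each path. As noted above, that still says nothing about whether the FSID is the same.<br>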

<br>

Perhaps someone from IBM could comment on if FSID is likely to be the cause<br>

of my problems?<br>

<br>

Thanks<br>

<br>

Simon<br>

<br>

On 25/04/2017, 14:51, "<a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" target="_blank">gpfsug-discuss-bounces@spectrumscale.org</a> on behalf<br>

of Ouwehand, JJ" <<a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" target="_blank">gpfsug-discuss-bounces@spectrumscale.org</a> on behalf of<br>

<a href="mailto:j.ouwehand@vumc.nl" target="_blank">j.ouwehand@vumc.nl</a>> wrote:<br>

<br>

>Hello,<br>

><br>

>First, a short introduction. My name is Jaap Jan Ouwehand, and I work at a<br>

>Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM<br>

>Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical<br>

>(office, research and clinical data) business processes. We have three<br>

>large GPFS filesystems for different purposes.<br>

><br>

>We also had such a situation with cNFS. A failover (IP takeover) worked<br>

>technically, but clients experienced "stale filehandles". We opened<br>

>a PMR at IBM and, after testing, delivering logs and tcpdumps, and a few<br>

>months of waiting, the solution appeared to be in the fsid option.<br>

><br>

>An NFS filehandle is built by a combination of fsid and a hash function<br>

>on the inode. After a failover, the fsid value can be different and the<br>

>client has a "stale filehandle". To avoid this, the fsid value can be<br>

>statically specified. See:<br>

><br>

><a href="https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_nfslin.htm" rel="noreferrer" target="_blank">https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_nfslin.htm</a><br>
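
<br>

A toy model of the mechanism described above, to show why a changed fsid makes an otherwise valid handle stale. The byte layout, the SHA-256 hash and the fsid and inode numbers are assumptions chosen for illustration, not the real NFS or GPFS handle format:<br>

```python
import hashlib
import struct


def make_filehandle(fsid: int, inode: int) -> bytes:
    """Toy handle: 8-byte fsid followed by an 8-byte digest of the inode."""
    inode_hash = hashlib.sha256(struct.pack(">Q", inode)).digest()[:8]
    return struct.pack(">Q", fsid) + inode_hash


def server_accepts(handle: bytes, server_fsid: int) -> bool:
    """The server recognises a handle only if its embedded fsid matches."""
    (handle_fsid,) = struct.unpack(">Q", handle[:8])
    return handle_fsid == server_fsid


INODE = 4711  # hypothetical inode number
HANDLE = make_filehandle(fsid=745, inode=INODE)  # issued before failover

# The node that takes over uses the same fsid: the handle stays valid.
print(server_accepts(HANDLE, server_fsid=745))

# The node that takes over derived a different fsid: the handle is stale.
print(server_accepts(HANDLE, server_fsid=746))
```

In this model, pinning the fsid to the same static value on every node corresponds to the first case.<br>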

><br>

>Maybe there is also a value in Ganesha that changes after a failover,<br>

>especially since most sessions will be re-established after a failback.<br>

>You may see more debug information with tcpdump.<br>

><br>

><br>

>Kind regards,<br>

><br>

>Jaap Jan Ouwehand<br>

>ICT Specialist (Storage & Linux)<br>

>VUmc - ICT<br>

>E: <a href="mailto:jj.ouwehand@vumc.nl" target="_blank">jj.ouwehand@vumc.nl</a><br>

>W: <a href="http://www.vumc.com" rel="noreferrer" target="_blank">www.vumc.com</a><br>

><br>

><br>

><br>

>-----Original message-----<br>

>From: <a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" target="_blank">gpfsug-discuss-bounces@spectrumscale.org</a><br>

>[mailto:<a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" target="_blank">gpfsug-discuss-bounces@spectrumscale.org</a>] On behalf of Simon Thompson<br>

>(IT Research Support)<br>

>Sent: Tuesday 25 April 2017 13:21<br>

>To: <a href="mailto:gpfsug-discuss@spectrumscale.org" target="_blank">gpfsug-discuss@spectrumscale.org</a><br>

>Subject: [gpfsug-discuss] NFS issues<br>

><br>

>Hi,<br>

><br>

>We have recently started deploying NFS in addition to our existing SMB<br>

>exports on our protocol nodes.<br>

><br>

>We use a RR DNS name that points to 4 VIPs for SMB services and failover<br>

>seems to work fine with SMB clients. We figured we could use the same<br>

>name and IPs and run Ganesha on the protocol servers, however we are<br>

>seeing issues with NFS clients when IP failover occurs.<br>

><br>

>In normal operation on a client, we might see several mounts from<br>

>different IPs obviously due to the way the DNS RR is working, but it all<br>

>works fine.<br>

><br>

>In a failover situation, the IP will move to another node and some<br>

>clients will carry on, others will hang IO to the mount points referred<br>

>to by the IP which has moved. We can *sometimes* trigger this by manually<br>

>suspending a CES node, but not always and some clients mounting from the<br>

>IP moving will be fine, others won't.<br>

><br>

>If we resume a node and it fails back, the clients that are hanging will<br>

>usually recover fine. We can reboot a client prior to failback and it<br>

>will be fine; stopping and starting the ganesha service on a protocol<br>

>node will also sometimes resolve the issues.<br>

><br>

>So, has anyone seen this sort of issue, and are there any suggestions for<br>

>how we could either debug further or work around it?<br>

><br>

>We are currently running the packages<br>

>nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones).<br>

><br>

>At one point we were seeing it a lot, and could track it back to an<br>

>underlying GPFS network issue that was causing protocol nodes to be<br>

>expelled occasionally. We resolved that and the issues became less<br>

>apparent, but maybe we just fixed one failure mode and so see it less often.<br>

><br>

>On the clients, we use -o sync,hard BTW as in the IBM docs.<br>

><br>

>On a client showing the issues, we'll see in dmesg, NFS related messages<br>

>like:<br>

>[Wed Apr 12 16:59:53 2017] nfs: server <a href="http://MYNFSSERVER.bham.ac.uk" rel="noreferrer" target="_blank">MYNFSSERVER.bham.ac.uk</a> not<br>

>responding, timed out<br>

><br>

>Which explains the client hang on certain mount points.<br>
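
<br>

One way to see which servers a client is still waiting on is to scan the kernel log for these messages. This is a rough sketch: it assumes the "not responding, timed out" wording quoted above and that recovery is logged as "nfs: server NAME OK". Apart from the first line, the sample log lines and host names are hypothetical:<br>

```python
import re

NOT_RESPONDING = re.compile(r"nfs: server (\S+) not responding, timed out")
RECOVERED = re.compile(r"nfs: server (\S+) OK")


def stuck_servers(log_lines):
    """Return the servers whose last logged state is 'not responding'."""
    stuck = set()
    for line in log_lines:
        timed_out = NOT_RESPONDING.search(line)
        if timed_out:
            stuck.add(timed_out.group(1))
            continue
        recovered = RECOVERED.search(line)
        if recovered:
            stuck.discard(recovered.group(1))
    return sorted(stuck)


# Sample input; only the first line is taken from the report above.
SAMPLE_LOG = [
    "[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not responding, timed out",
    "[Wed Apr 12 17:00:10 2017] nfs: server other.example.org not responding, timed out",
    "[Wed Apr 12 17:02:31 2017] nfs: server other.example.org OK",
]

STUCK = stuck_servers(SAMPLE_LOG)
print(STUCK)
```

Running this on several clients during a failover test might help correlate the hanging mounts with the IP that moved.<br>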

><br>

>The symptoms feel very much like those logged in this Gluster/ganesha bug:<br>

><a href="https://bugzilla.redhat.com/show_bug.cgi?id=1354439" rel="noreferrer" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1354439</a><br>

><br>

><br>

>Thanks<br>

><br>

>Simon<br>

><br>

>_______________________________________________<br>

>gpfsug-discuss mailing list<br>

>gpfsug-discuss at <a href="http://spectrumscale.org" rel="noreferrer" target="_blank">spectrumscale.org</a><br>

><a href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss" rel="noreferrer" target="_blank">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a><br>

<br>

</blockquote></div></div>