[gpfsug-discuss] NFS issues

Jan-Frode Myklebust janfrode at tanso.net
Tue Apr 25 18:04:41 BST 2017


I *think* I've seen this, and that we then had open TCP connections from
the client to the NFS server according to netstat on the client, but these
connections were not visible in netstat on the NFS-server side.
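
The kind of check I mean, roughly (2049 being the standard NFS port):

  # on the client: the connection to the server shows as ESTABLISHED
  netstat -tn | grep :2049

  # on the NFS server: no matching entry for that client's IP shows up
  netstat -tn | grep <client-ip>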

Unfortunately I don't remember what the fix was..



  -jf

On Tue, 25 Apr 2017 at 16:06, Simon Thompson (IT Research Support) <
S.J.Thompson at bham.ac.uk> wrote:

> Hi,
>
> From what I can see, Ganesha uses the Export_Id option in the config file
> (which is managed by CES) for this. I did find some reference on the
> Ganesha devs list that if it's not set, then it would read the FSID from
> the GPFS file-system; either way, they should surely be consistent across
> all the nodes. The posts I found were from someone with an IBM email
> address, so I guess someone on the IBM teams.
>
> I checked a couple of my protocol nodes and they use the same Export_Id
> consistently, though I guess that might not be the same as the FSID value.
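>
> For context, a minimal sketch of the sort of Ganesha export block I mean
> (the path and numbers here are invented; Export_Id and Filesystem_Id are
> the options in question):
>
>   EXPORT {
>     Export_Id = 1;               # managed by CES, should match on all nodes
>     Path = "/gpfs/fs1/export";   # example path
>     Pseudo = "/gpfs/fs1/export";
>     Filesystem_Id = 666.666;     # if omitted, Ganesha derives an fsid itself
>     FSAL {
>       Name = GPFS;
>     }
>   }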
>
> Perhaps someone from IBM could comment on whether FSID is likely to be the
> cause of my problems?
>
> Thanks
>
> Simon
>
> On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf
> of Ouwehand, JJ" <gpfsug-discuss-bounces at spectrumscale.org on behalf of
> j.ouwehand at vumc.nl> wrote:
>
> >Hello,
> >
> >First, a short introduction. My name is Jaap Jan Ouwehand; I work at a
> >Dutch hospital, the "VU Medical Center" in Amsterdam. We make daily use of
> >IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical
> >(office, research and clinical data) business processes. We have three
> >large GPFS filesystems for different purposes.
> >
> >We also had such a situation, with cNFS. A failover (IP takeover) worked
> >fine technically, but clients experienced "stale filehandle" errors. We
> >opened a PMR with IBM and, after testing, delivering logs and tcpdumps,
> >and a few months of waiting, the solution turned out to be the fsid
> >option.
> >
> >An NFS filehandle is built from a combination of the fsid and a hash of
> >the inode. After a failover, the fsid value can be different, so the
> >client gets a "stale filehandle". To avoid this, the fsid value can be
> >specified statically. See:
> >
> >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_nfslin.htm
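> >
> >For the kernel NFS server used by cNFS, that meant pinning the fsid per
> >export in /etc/exports on every node, roughly like this (path and options
> >are only an example; the same fsid must be used on all nodes that can
> >serve the export):
> >
> >  /gpfs/fs1  *(rw,sync,no_root_squash,fsid=745)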
> >
> >Maybe there is also a value in Ganesha that changes after a failover,
> >especially since most sessions will be re-established after a failback.
> >Maybe you can see more debug information with tcpdump.
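> >
> >Something along these lines on one of the protocol nodes, roughly (the
> >interface name is an example):
> >
> >  tcpdump -i eth0 -s 0 -w /tmp/nfs-failover.pcap port 2049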
> >
> >
> >Kind regards,
> >
> >Jaap Jan Ouwehand
> >ICT Specialist (Storage & Linux)
> >VUmc - ICT
> >E: jj.ouwehand at vumc.nl
> >W: www.vumc.com
> >
> >
> >
> >-----Original message-----
> >From: gpfsug-discuss-bounces at spectrumscale.org
> >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On behalf of Simon
> >Thompson (IT Research Support)
> >Sent: Tuesday, 25 April 2017 13:21
> >To: gpfsug-discuss at spectrumscale.org
> >Subject: [gpfsug-discuss] NFS issues
> >
> >Hi,
> >
> >We have recently started deploying NFS in addition to our existing SMB
> >exports on our protocol nodes.
> >
> >We use an RR DNS name that points to 4 VIPs for SMB services, and failover
> >seems to work fine with SMB clients. We figured we could use the same
> >name and IPs and run Ganesha on the protocol servers; however, we are
> >seeing issues with NFS clients when IP failover occurs.
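> >
> >(By RR DNS I mean something like the following, where the one name returns
> >our four CES VIPs in rotating order; the name and addresses here are just
> >placeholders:)
> >
> >  $ dig +short MYNFSSERVER.bham.ac.uk
> >  10.0.0.11
> >  10.0.0.12
> >  10.0.0.13
> >  10.0.0.14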
> >
> >In normal operation on a client, we might see several mounts from
> >different IPs obviously due to the way the DNS RR is working, but it all
> >works fine.
> >
> >In a failover situation, the IP will move to another node and some
> >clients will carry on, while others will hang IO to the mount points
> >referred to by the IP which has moved. We can *sometimes* trigger this by
> >manually suspending a CES node, but not always, and some clients mounting
> >from the moving IP will be fine while others won't.
> >
> >If we resume a node and it fails back, the clients that are hanging will
> >usually recover fine. We can reboot a client prior to failback and it
> >will be fine; stopping and starting the ganesha service on a protocol
> >node will also sometimes resolve the issues.
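> >
> >(The suspend/resume and ganesha restart above are the standard CES
> >commands, something like the following; from memory, so check the mmces
> >syntax on your release:
> >
> >  mmces node suspend        # take a CES node out of service
> >  mmces node resume         # bring it back / fail addresses back
> >  mmces service stop NFS
> >  mmces service start NFS   # restart ganesha on a node
> >)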
> >
> >So, has anyone seen this sort of issue, and does anyone have suggestions
> >for how we could either debug it further or work around it?
> >
> >We are currently running the nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64
> >packages (the 4.2.2-2 release ones).
> >
> >At one point we were seeing it a lot, and could track it back to an
> >underlying GPFS network issue that was causing protocol nodes to be
> >expelled occasionally. We resolved that and the issues became less
> >apparent, but maybe we just fixed one failure mode so we see it less
> >often.
> >
> >On the clients, we use -o sync,hard BTW as in the IBM docs.
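> >
> >That is, mounts of roughly this form (the export path is a placeholder):
> >
> >  mount -t nfs -o sync,hard MYNFSSERVER.bham.ac.uk:/gpfs/export /mnt/export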
> >
> >On a client showing the issues, we'll see NFS-related messages in dmesg
> >like:
> >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not
> >responding, timed out
> >
> >Which explains the client hang on certain mount points.
> >
> >The symptoms feel very much like those logged in this Gluster/ganesha bug:
> >https://bugzilla.redhat.com/show_bug.cgi?id=1354439
> >
> >
> >Thanks
> >
> >Simon
> >
> >_______________________________________________
> >gpfsug-discuss mailing list
> >gpfsug-discuss at spectrumscale.org
> >http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

