[gpfsug-discuss] Filesystem access issues via CES NFS

Leonardo Sala leonardo.sala at psi.ch
Fri Oct 4 07:32:42 BST 2019


Dear Malahal,

thanks for the answer. Concerning SSSD, we are also using it; should we 
then use 5.0.2-PTF3? We would like to avoid 5.0.2.2, as it has issues 
with recent RHEL 7.6 kernels [*] and we are affected: would you suggest 
using 5.0.3.3 instead?

cheers

leo


[*] 
https://www.ibm.com/support/pages/ibm-spectrum-scale-gpfs-releases-42313-or-later-and-5022-or-later-have-issues-where-kernel-crashes-rhel76-0

Paul Scherrer Institut
Dr. Leonardo Sala
Group Leader High Performance Computing
Deputy Section Head Science IT
Science IT
WHGA/106
5232 Villigen PSI
Switzerland

Phone: +41 56 310 3369
leonardo.sala at psi.ch
www.psi.ch

On 03.10.19 19:15, Malahal R Naineni wrote:
> >> @Malahal: Looks like you have written the netgroup caching code, 
> feel free to ask for further details if required.
> Hi Ulrich, Ganesha uses the innetgr() call for netgroup information, and 
> sssd has too many issues in its implementation. Red Hat said that they 
> are going to fix the sssd synchronization issues in RHEL 8. It is on my 
> plate to serialize the innetgr() calls in Ganesha to match kernel NFS 
> server usage! I would expect an sssd issue to produce EACCES/EPERM kinds 
> of errors, though, not EINVAL.
> If you are using sssd, you are most likely hitting an sssd issue. 
> Ganesha has a host-IP cache fix in 5.0.2 PTF3. Please make sure you 
> use Ganesha version V2.5.3-ibm030.01 if you are using netgroups 
> (shipped with 5.0.2 PTF3, but it can be used with Scale 5.0.1 or later).
> Regards, Malahal.
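
As an illustration of the innetgr() serialization mentioned above, a minimal
sketch could look like the following. This is an assumption about the general
approach, not the actual Ganesha/IBM patch, and wrapped_innetgr() is a made-up
helper name:

    /* Minimal sketch (not the actual fix): funnel all innetgr() lookups
     * through a single mutex so that concurrent export-access checks
     * cannot race inside the NSS/sssd netgroup backend. */
    #include <netdb.h>
    #include <pthread.h>

    static pthread_mutex_t netgr_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Returns nonzero if 'host' is a member of 'netgroup'. */
    int wrapped_innetgr(const char *netgroup, const char *host)
    {
        int rc;

        pthread_mutex_lock(&netgr_lock);
        rc = innetgr(netgroup, host, NULL, NULL);  /* user/domain not checked */
        pthread_mutex_unlock(&netgr_lock);

        return rc;
    }

The point is simply that only one netgroup query is in flight at a time, so a
lookup cannot observe the sssd cache in a half-updated state.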
>
>     ----- Original message -----
>     From: Ulrich Sibiller <u.sibiller at science-computing.de>
>     Sent by: gpfsug-discuss-bounces at spectrumscale.org
>     To: gpfsug-discuss at spectrumscale.org
>     Cc:
>     Subject: Re: [gpfsug-discuss] Filesystem access issues via CES NFS
>     Date: Thu, Dec 13, 2018 7:32 PM
>     On 23.11.2018 14:41, Andreas Mattsson wrote:
>     > Yes, this is repeating.
>     >
>     > We’ve ascertained that it has nothing to do at all with file
>     operations on the GPFS side.
>     >
>     > Randomly throughout the filesystem mounted via NFS, ls or file
>     access will give
>     >
>     > "ls: reading directory /gpfs/filessystem/test/testdir: Invalid argument"
>     >
>     > Trying again later might work on that folder, but might fail
>     somewhere else.
>     >
>     > We have tried exporting the same filesystem via a standard
>     kernel NFS server instead of the CES Ganesha NFS server, and the
>     problem does not occur there.
>     >
>     > So it is definitely related to the Ganesha NFS server, or to its
>     interaction with the file system.
>     >
>     > Will see if I can get a tcpdump of the issue.
>
>     We see this, too, but we cannot trigger it at will. Fortunately I
>     have managed to capture some logs with debugging enabled. I have
>     now dug into the Ganesha 2.5.3 code, and I think the netgroup
>     caching is the culprit.
>
>     Here is some FULL_DEBUG output:
>     2018-12-13 11:53:41 : epoch 0009008d : server1 :
>     gpfs.ganesha.nfsd-258762[work-250]
>     export_check_access :EXPORT :M_DBG :Check for address 1.2.3.4 for
>     export id 1 path /gpfsexport
>     2018-12-13 11:53:41 : epoch 0009008d : server1 :
>     gpfs.ganesha.nfsd-258762[work-250] client_match
>     :EXPORT :M_DBG :Match V4: 0xcf7fe0 NETGROUP_CLIENT: netgroup1
>     (options=421021e2root_squash   , RWrw,
>     3--, ---, TCP, ----, Manage_Gids   , -- Deleg, anon_uid=  -2,
>     anon_gid=    -2, sys)
>     2018-12-13 11:53:41 : epoch 0009008d : server1 :
>     gpfs.ganesha.nfsd-258762[work-250] nfs_ip_name_get
>     :DISP :F_DBG :Cache get hit for 1.2.3.4->client1.domain
>     2018-12-13 11:53:41 : epoch 0009008d : server1 :
>     gpfs.ganesha.nfsd-258762[work-250] client_match
>     :EXPORT :M_DBG :Match V4: 0xcfe320 NETGROUP_CLIENT: netgroup2
>     (options=421021e2root_squash   , RWrw,
>     3--, ---, TCP, ----, Manage_Gids   , -- Deleg, anon_uid=  -2,
>     anon_gid=    -2, sys)
>     2018-12-13 11:53:41 : epoch 0009008d : server1 :
>     gpfs.ganesha.nfsd-258762[work-250] nfs_ip_name_get
>     :DISP :F_DBG :Cache get hit for 1.2.3.4->client1.domain
>     2018-12-13 11:53:41 : epoch 0009008d : server1 :
>     gpfs.ganesha.nfsd-258762[work-250] client_match
>     :EXPORT :M_DBG :Match V4: 0xcfe380 NETGROUP_CLIENT: netgroup3
>     (options=421021e2root_squash   , RWrw,
>     3--, ---, TCP, ----, Manage_Gids   , -- Deleg, anon_uid=  -2,
>     anon_gid=    -2, sys)
>     2018-12-13 11:53:41 : epoch 0009008d : server1 :
>     gpfs.ganesha.nfsd-258762[work-250] nfs_ip_name_get
>     :DISP :F_DBG :Cache get hit for 1.2.3.4->client1.domain
>     2018-12-13 11:53:41 : epoch 0009008d : server1 :
>     gpfs.ganesha.nfsd-258762[work-250]
>     export_check_access :EXPORT :M_DBG :EXPORT  (options=03303002    
>              ,     ,    ,
>           ,               , -- Deleg,                ,        )
>     2018-12-13 11:53:41 : epoch 0009008d : server1 :
>     gpfs.ganesha.nfsd-258762[work-250]
>     export_check_access :EXPORT :M_DBG :EXPORT_DEFAULTS
>     (options=42102002root_squash   , ----, 3--, ---,
>     TCP, ----, Manage_Gids   ,         , anon_uid=    -2, anon_gid=  
>      -2, sys)
>     2018-12-13 11:53:41 : epoch 0009008d : server1 :
>     gpfs.ganesha.nfsd-258762[work-250]
>     export_check_access :EXPORT :M_DBG :default options
>     (options=03303002root_squash   , ----, 34-, UDP,
>     TCP, ----, No Manage_Gids, -- Deleg, anon_uid=    -2, anon_gid=  
>      -2, none, sys)
>     2018-12-13 11:53:41 : epoch 0009008d : server1 :
>     gpfs.ganesha.nfsd-258762[work-250]
>     export_check_access :EXPORT :M_DBG :Final options
>     (options=42102002root_squash   , ----, 3--, ---,
>     TCP, ----, Manage_Gids   , -- Deleg, anon_uid=    -2, anon_gid=  
>      -2, sys)
>     2018-12-13 11:53:41 : epoch 0009008d : server1 :
>     gpfs.ganesha.nfsd-258762[work-250] nfs_rpc_execute
>     :DISP :INFO :DISP: INFO: Client ::ffff:1.2.3.4 is not allowed to
>     access Export_Id 1 /gpfsexport,
>     vers=3, proc=18
>
>     The client "client1" is definitely a member of "netgroup1".
>     However, the NETGROUP_CLIENT lookups for "netgroup2" and
>     "netgroup3" can only happen if the netgroup caching code
>     reports that "client1" is NOT a member of "netgroup1".
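
To double-check what the C library (and thus sssd) returns outside of
Ganesha's own cache, a small standalone test along the following lines could
be run on the CES node. This is an illustrative sketch added here for clarity,
not part of the original report; the netgroup and host names are placeholders
from the anonymized log above:

    /* Minimal sketch: query netgroup membership directly via innetgr(),
     * bypassing Ganesha's netgroup cache entirely. */
    #include <netdb.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <netgroup> <hostname>\n", argv[0]);
            return 2;
        }

        /* innetgr() returns 1 on membership, 0 otherwise. */
        int member = innetgr(argv[1], argv[2], NULL, NULL);
        printf("%s is %sa member of %s\n",
               argv[2], member ? "" : "NOT ", argv[1]);
        return member ? 0 : 1;
    }

Running this repeatedly (e.g. ./innetgr_test netgroup1 client1.domain) while
the errors occur would show whether the inconsistency comes from sssd itself
or only from the Ganesha-side cache.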
>
>     I have also opened a support case at IBM for this.
>
>     @Malahal: Looks like you have written the netgroup caching code,
>     feel free to ask for further
>     details if required.
>
>     Kind regards,
>
>     Ulrich Sibiller
>
>     --
>     Dipl.-Inf. Ulrich Sibiller           science + computing ag
>     System Administration                    Hagellocher Weg 73
>                                          72070 Tuebingen, Germany
>     https://atos.net/de/deutschland/sc
>     --
>     Science + Computing AG
>     Vorstandsvorsitzender/Chairman of the board of management:
>     Dr. Martin Matzke
>     Vorstand/Board of Management:
>     Matthias Schempp, Sabine Hohenstein
>     Vorsitzender des Aufsichtsrats/
>     Chairman of the Supervisory Board:
>     Philippe Miltin
>     Aufsichtsrat/Supervisory Board:
>     Martin Wibbe, Ursula Morgenstern
>     Sitz/Registered Office: Tuebingen
>     Registergericht/Registration Court: Stuttgart
>     Registernummer/Commercial Register No.: HRB 382196
>
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss