[gpfsug-discuss] Ganesha daemon has 400'000 open files - is this unusual?

Billich Heinrich Rainer (ID SD) heinrich.billich at id.ethz.ch
Mon Sep 23 11:43:06 BST 2019


Hello Malahal,

Thank you. Actually I don’t see the parameter Cache_FDs in our ganesha config. But when I trace LRU processing I see that almost no FDs get released, and the number of FDs given in the log messages doesn’t match what I see in /proc/<pid of ganesha>/fd/: I see 512k open files while the logfile gives 600k. Even 4 hours after I suspended the node and all I/O activity stopped, I still see 500k open files, and LRU processing doesn’t close any of them.
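For reference, this is roughly how I compare the two numbers, counting the entries under /proc against the open_fd_count that shows up in the LRU debug lines (the pgrep pattern and the log path are what we use on our CES nodes and may differ elsewhere):

  ls /proc/$(pgrep -f gpfs.ganesha.nfsd)/fd | wc -l                 # what the kernel sees
  grep -o 'open_fd_count:[0-9]*' /var/log/ganesha.log | tail -1     # what ganesha reports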

This looks like a bug in gpfs.nfs-ganesha-2.5.3-ibm036.10.el7. I’ll open a case with IBM. We did see ganesha fail to open new files, and hence client requests fail. I assume that 500k FDs, compared to the roughly 10k we saw before, creates some notable overhead for ganesha, Spectrum Scale and the kernel, and takes resources away from Samba.

I’ll post to the list once we have some results.

Cheers,

Heiner


Start of LRU processing

2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run :INODE LRU :DEBUG :Reaping up to 50 entries from lane 1027
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run :INODE LRU :F_DBG :formeropen=607025 totalwork=0 workpass=51350 totalclosed:6
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run_lane :INODE LRU :DEBUG :Reaping up to 50 entries from lane 1027
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run_lane :INODE LRU :DEBUG :Actually processed 50 entries on lane 1027 closing 0 descriptors
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run :INODE LRU :DEBUG :Reaping up to 50 entries from lane 1028
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run :INODE LRU :F_DBG :formeropen=607025 totalwork=0 workpass=51400 totalclosed:6
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run_lane :INODE LRU :DEBUG :Reaping up to 50 entries from lane 1028
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run_lane :INODE LRU :DEBUG :Actually processed 50 entries on lane 1028 closing 0 descriptors

End of log
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run_lane :INODE LRU :DEBUG :Reaping up to 50 entries from lane 1029
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run_lane :INODE LRU :DEBUG :Actually processed 50 entries on lane 1029 closing 0 descriptors
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run :INODE LRU :DEBUG :Reaping up to 50 entries from lane 1030
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run :INODE LRU :F_DBG :formeropen=607025 totalwork=0 workpass=51500 totalclosed:6
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run_lane :INODE LRU :DEBUG :Reaping up to 50 entries from lane 1030
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run_lane :INODE LRU :DEBUG :Actually processed 50 entries on lane 1030 closing 0 descriptors
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run :INODE LRU :DEBUG :After work, open_fd_count:607024  count:29503718 fdrate:1908874353 threadwait=9
2019-09-23 11:37:30 : epoch 00100524 : nas12ces01 : gpfs.ganesha.nfsd-100816[cache_lru] lru_run :INODE LRU :F_DBG :currentopen=607024 futility=0 totalwork=51550 biggest_window=335544 extremis=0 lanes=1031 fds_lowat=167772

From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Malahal R Naineni <mnaineni at in.ibm.com>
Reply to: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: Thursday, 19 September 2019 at 20:39
To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Cc: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] Ganesha daemon has 400'000 open files - is this unusual?

NFSv3 doesn't have open/close requests, so nfs-ganesha opens a file for read/write when there is an NFSv3 read/write request. It does cache file descriptors, so its open count can be very large. If you have 'Cache_FDs = true' in your config, ganesha aggressively caches file descriptors.
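If I remember the 2.5 sample configs correctly, the setting lives in the CACHEINODE block of ganesha.conf, roughly like this (block name and defaults may differ in the IBM build, so treat it as a sketch):

  CACHEINODE {
      # Don't aggressively cache file descriptors
      Cache_FDs = false;
  }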

Taking traces with the COMPONENT_CACHE_INODE_LRU level set to full debug should give us better insight into what is happening when the open file descriptor count is very high.

When the I/O failure happens or when the open fd count is high, you could do the following:

1. ganesha_mgr set_log COMPONENT_CACHE_INODE_LRU FULL_DEBUG
2. wait for 90 seconds, then run
3. ganesha_mgr set_log COMPONENT_CACHE_INODE_LRU EVENT
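
With the log level raised you can then pull the relevant lines out of the ganesha log afterwards, for example (assuming the default log location on CES nodes, which may differ on your setup):

  grep 'INODE LRU' /var/log/ganesha.log | tail -50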

Regards, Malahal.

----- Original message -----
From: "Billich Heinrich Rainer (ID SD)" <heinrich.billich at id.ethz.ch>
Sent by: gpfsug-discuss-bounces at spectrumscale.org
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Cc:
Subject: [EXTERNAL] [gpfsug-discuss] Ganesha daemon has 400'000 open files - is this unusual?
Date: Thu, Sep 19, 2019 7:51 PM

Hello,



Is it usual to see 200’000-400’000 open files for a single ganesha process? Or does this indicate that something is wrong?



We have some issues with ganesha (on Spectrum Scale protocol nodes) reporting NFS3ERR_IO in the log. I noticed that the affected nodes have a large number of open files, 200’000-400’000 open files per daemon (and 500 threads and about 250 client connections). Other nodes have only 1’000-10’000 open files for ganesha and don’t show the issue.
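
For context, I collected the numbers above roughly as follows (the pgrep pattern and the NFS port are assumptions on my side):

  pid=$(pgrep -f gpfs.ganesha.nfsd)
  ls /proc/$pid/fd | wc -l                                             # open files
  ls /proc/$pid/task | wc -l                                           # threads
  ss -tn state established '( sport = :2049 )' | tail -n +2 | wc -l    # client connections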



If someone could explain how ganesha decides which files to keep open and which to close, that would help, too. As NFSv3 is stateless the client doesn’t open/close files, so it’s up to the server to decide when to close them? We do have a few NFSv4 clients, too.



Are there certain access patterns that can trigger such a large number of open files? Maybe traversing and reading a large number of small files?



Thank you,

Heiner



I counted the open files by counting the entries in /proc/<pid of ganesha>/fd/. With several 100k entries I failed to do an ‘ls -ls’ to list all the symbolic links, hence I can’t easily relate the open files to the different exports.
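
Reading the link targets directly and aggregating by path prefix does still work at that scale, something like this (the grouping depth of two path components is a guess and likely needs adjusting to the export layout):

  find /proc/$(pgrep -f gpfs.ganesha.nfsd)/fd -type l -printf '%l\n' 2>/dev/null \
    | awk -F/ '{print "/"$2"/"$3}' | sort | uniq -c | sort -rn | head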



I did post this to the ganesha mailing list, too.

--

=======================

Heinrich Billich

ETH Zürich

Informatikdienste

Tel.: +41 44 632 72 56

heinrich.billich at id.ethz.ch

========================




_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


