[gpfsug-discuss] Checking for Stale File Handles

Wahl, Edward ewahl at osc.edu
Fri Aug 9 19:54:48 BST 2019


We use NHC (Node Health Check) from LBNL here, and almost all of our SS clients use an NFS root.   We have a check that looks for access to a couple of dotfiles (we have multiple SS file systems) and marks a node offline if the check fails.
As we all know, many things can contribute to the failure of a single client node.  Our checks are for actual node health on the clients, NOT to assess the health of the file systems themselves.  I will normally see MANY other problems from other monitoring sources long before I see stale file handles at the client level.

We did have to turn up the timeout for the file check to return on very busy clients, but we haven't seen slowdowns due to hundreds of nodes all checking the file at the same time.  Localized node slowdowns will occasionally cause this check to mark a node offline here and there (normally a node that is extremely busy), but the next check will put the node right back online in the batch system.
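
For illustration only, a dotfile check with a timeout might look roughly like the Python sketch below. This is not the NHC check described above (NHC checks are shell functions); the function name check_path, the status strings, and the example dotfile path are all hypothetical. The stat runs in a child process so that a hung mount cannot block the checker itself, and the child reports the errno through its exit code so that ESTALE can be told apart from a plain timeout.

```python
import errno
import subprocess
import sys

# Child snippet: stat the path; exit 0 on success, else exit with the errno.
_CHILD = (
    "import os, sys\n"
    "try:\n"
    "    os.stat(sys.argv[1])\n"
    "except OSError as e:\n"
    "    sys.exit(e.errno or 1)\n"
)


def check_path(path, timeout=10.0):
    """Return 'ok', 'stale', 'timeout', or 'error' for a single path.

    Hypothetical helper: the status names are assumptions made for
    this sketch, not anything defined by NHC or Spectrum Scale.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible I/O may
        # linger, so we do not wait for it after sending the kill.
        proc.kill()
        return "timeout"
    if rc == 0:
        return "ok"
    if rc == errno.ESTALE:
        return "stale"
    return "error"


if __name__ == "__main__":
    # Example dotfile path is an assumption; pass real paths as arguments.
    for p in sys.argv[1:] or ["/gpfs/fs1/.healthcheck"]:
        print(p, check_path(p))
```

Keeping "timeout" separate from "stale" is what lets a merely busy node be treated differently from one with a genuinely stale handle, and a single stat of one small file per file system keeps the load low even with hundreds of nodes checking at once.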

Ed Wahl
Ohio Supercomputer Center
ewahl at osc.edu

________________________________
From: gpfsug-discuss-bounces at spectrumscale.org <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Alexander John Mamach <alex.mamach at northwestern.edu>
Sent: Friday, August 9, 2019 1:46 PM
To: gpfsug-discuss at spectrumscale.org <gpfsug-discuss at spectrumscale.org>
Subject: [gpfsug-discuss] Checking for Stale File Handles


Hi folks,



We’re currently investigating a way to check for stale file handles on the nodes across our cluster in a way that minimizes impact to the filesystem and performance.

Has anyone found a direct way of doing so? We considered a few methods, including simply attempting to ls a GPFS filesystem from each node, but that might produce false positives (detecting slowdowns as stale file handles) and could negatively impact performance with hundreds of nodes doing this simultaneously.



Thanks,



Alex



Senior Systems Administrator

Research Computing Infrastructure
Northwestern University Information Technology (NUIT)

2020 Ridge Ave
Evanston, IL 60208-4311

O: (847) 491-2219
M: (312) 887-1881
www.it.northwestern.edu

