[gpfsug-discuss] High I/O wait times

Jonathan Buzzard jonathan.buzzard at strath.ac.uk
Tue Jul 10 14:12:02 BST 2018


On Mon, 2018-07-09 at 21:44 +0000, Buterbaugh, Kevin L wrote:

[SNIP]

> Interestingly enough, one user showed up waaaayyyyyy more often than
> anybody else.  And many times she was on a node with only one other
> user who we know doesn’t access the GPFS filesystem and other times
> she was the only user on the node.  
> 

On our old HPC system, which had been running fine for three years, I
have seen a particular user with a particular piece of software (and
presumably a particular access pattern) trigger a firmware bug in a SAS
drive (local disk to the node). The drive went offline (dead to the
world, power/presence LED off) and only a power cycle of the node would
bring it back.

At first we thought the drives were failing, because what the hell,
but in the end a firmware update to the drives fixed it and they were
fine.

The moral of the story is don't rule out wacky access patterns from a
single user causing problems.

JAB.

-- 
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG




