[gpfsug-discuss] Associating I/O operations with files/processes

Marc A Kaplan makaplan at us.ibm.com
Tue May 30 16:15:11 BST 2017


In version 4.2.3 you can turn on QOS --fine-stats and --pid-stats and get 
I/O operation statistics for each active process on each node.

https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_mmchqos.htm
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_mmlsqos.htm


The statistics allow you to distinguish single-sector IOPS from 
partial-block multi-sector IOPS and from full-block multi-sector IOPS.

Note that to use this feature you must enable QOS, but by default you 
start out with all throttles set to "unlimited", so nothing is actually 
being throttled. There is some overhead, so you might want to turn it on 
only while you need to find the "bad" processes.
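For example, a minimal sketch of the workflow might look like the 
following (the device name gpfs0 is a placeholder and the exact option 
arguments are assumptions from memory -- check the mmchqos/mmlsqos pages 
linked above for your release):

  # enable QOS (throttles stay at "unlimited") and keep per-PID fine statistics
  mmchqos gpfs0 --enable --fine-stats 60 --pid-stats yes

  # display the accumulated fine-grained statistics
  mmlsqos gpfs0 --fine-stats 60

  # turn the feature off again once you have found the culprit
  mmchqos gpfs0 --disable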

It's a little tricky to use effectively, but we give you a sample script 
that shows some ways to produce, massage and filter the raw data:

samples/charts/qosplotfine.pl

The data is available in CSV format, so it's easy to feed into 
spreadsheets or databases and crunch...
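As one hypothetical way to crunch it, a short Python sketch that totals 
operations per (node, pid) pair -- the column names "node", "pid" and 
"iops" are assumptions here, so adjust them to whatever header your 
mmlsqos output (or qosplotfine.pl) actually produces:

  import csv
  from collections import Counter

  totals = Counter()                              # (node, pid) -> total operations
  with open("finestats.csv", newline="") as f:    # CSV saved from mmlsqos output
      for row in csv.DictReader(f):
          totals[(row["node"], row["pid"])] += int(row["iops"])

  # print the ten busiest (node, pid) pairs
  for (node, pid), n in totals.most_common(10):
      print(node, "pid", pid, n, "ops")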

--marc of GPFS.



From:   "Andreas Petzold (SCC)" <andreas.petzold at kit.edu>
To:     <gpfsug-discuss at spectrumscale.org>
Date:   05/30/2017 08:17 AM
Subject:        [gpfsug-discuss] Associating I/O operations with 
files/processes
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



                 Dear group,

first a quick introduction: at KIT we are running a 20+ PB storage system 
with several large (1-9 PB) file systems. We have a 14-node NSD server 
cluster and 5 small (~10-node) protocol node clusters, each of which 
mounts one of the file systems. The protocol nodes run server software 
(dCache, xrootd) specific to our users, who are primarily the LHC 
experiments at CERN. GPFS version is 4.2.2 everywhere. All servers are 
connected via InfiniBand, while the protocol nodes communicate with their 
clients via Ethernet.

Now let me describe the problem we are facing. For a few days now, one of 
the protocol nodes has been showing very strange and as yet unexplained 
I/O behaviour. Previously we usually saw reads like this (iohist example 
from a well-behaved node):

14:03:37.637526  R        data   32:138835918848  8192   46.626  cli 
0A417D79:58E3B179    172.18.224.19 
14:03:37.660177  R        data   18:12590325760   8192   25.498  cli 
0A4179AD:58E3AE66    172.18.224.14 
14:03:37.640660  R        data   15:106365067264  8192   45.682  cli 
0A4179AD:58E3ADD7    172.18.224.14 
14:03:37.657006  R        data   35:130482421760  8192   30.872  cli 
0A417DAD:58E3B266    172.18.224.21 
14:03:37.643908  R        data   33:107847139328  8192   45.571  cli 
0A417DAD:58E3B206    172.18.224.21 

For the last few days we have been seeing this on the problematic node:

14:06:27.253537  R        data   46:126258287872     8   15.474  cli 
0A4179AB:58E3AE54    172.18.224.13 
14:06:27.268626  R        data   40:137280768624     8    0.395  cli 
0A4179AD:58E3ADE3    172.18.224.14 
14:06:27.269056  R        data   46:56452781528      8    0.427  cli 
0A4179AB:58E3AE54    172.18.224.13 
14:06:27.269417  R        data   47:97273159640      8    0.293  cli 
0A4179AD:58E3AE5A    172.18.224.14 
14:06:27.269293  R        data   49:59102786168      8    0.425  cli 
0A4179AD:58E3AE72    172.18.224.14 
14:06:27.269531  R        data   46:142387326944     8    0.340  cli 
0A4179AB:58E3AE54    172.18.224.13 
14:06:27.269377  R        data   28:102988517096     8    0.554  cli 
0A417879:58E3AD08    172.18.224.10

The number of read ops has gone up by roughly a factor of 1000, which is 
what one would expect when going from 8192-sector reads to 8-sector reads 
(8192 / 8 = 1024).

We have already ruled out problems with the node itself, so we are 
focusing on the applications running on it. What we'd like to do is 
associate the I/O requests either with files or with specific processes 
running on the machine, so that we can blame the correct application. Can 
somebody tell us whether this is possible, and if not, whether there are 
other ways to understand which application is causing this?

                 Thanks,

                                 Andreas

-- 

  Karlsruhe Institute of Technology (KIT)
  Steinbuch Centre for Computing (SCC)

  Andreas Petzold

  Hermann-von-Helmholtz-Platz 1, Building 449, Room 202
  D-76344 Eggenstein-Leopoldshafen

  Tel: +49 721 608 24916
  Fax: +49 721 608 24972
  Email: petzold at kit.edu
  www.scc.kit.edu

  KIT – The Research University in the Helmholtz Association

  Since 2010, KIT has been certified as a family-friendly university.


[attachment "smime.p7s" deleted by Marc A Kaplan/Watson/IBM] 
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss



