In version 4.2.3 you can turn on QOS --fine-stats and --pid-stats and get I/O operation statistics for each active process on each node.

https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_mmchqos.htm
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_mmlsqos.htm

The statistics let you distinguish single-sector IOPS from partial-block multisector IOPS from full-block multisector IOPS.

Notice that to use this feature you must enable QOS, but by default you start out running with all throttles set to "unlimited". There is some overhead, so you may want to use it only when you need to find the "bad" processes.

It's a little tricky to use effectively, but we provide a sample script that shows some ways to produce, massage and filter the raw data:

samples/charts/qosplotfine.pl

The data is available in CSV format, so it's easy to feed into spreadsheets or databases and crunch...
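
As a rough illustration of the sort of crunching qosplotfine.pl does, here is a minimal Python sketch that tallies per-process operation counts from a fine-stats CSV dump. The column names (node, pid, single_sector_iops) are placeholders assumed for illustration only; check the sample script and the mmlsqos documentation for the actual field layout.

    # Hypothetical sketch -- the CSV column names below are assumptions
    # for illustration, not the actual mmlsqos fine-stats layout.
    import csv
    import sys
    from collections import Counter

    def top_small_io_processes(csv_path, top_n=10):
        """Sum the assumed 'single_sector_iops' column per (node, pid) and rank."""
        totals = Counter()
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                key = (row["node"], row["pid"])                # assumed columns
                totals[key] += int(row["single_sector_iops"])  # assumed column
        return totals.most_common(top_n)

    if __name__ == "__main__":
        for (node, pid), ops in top_small_io_processes(sys.argv[1]):
            print(f"{node}  pid {pid}  {ops} single-sector ops")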

--marc of GPFS.

From:    "Andreas Petzold (SCC)" <andreas.petzold@kit.edu>
To:      <gpfsug-discuss@spectrumscale.org>
Date:    05/30/2017 08:17 AM
Subject: [gpfsug-discuss] Associating I/O operations with files/processes
Sent by: gpfsug-discuss-bounces@spectrumscale.org
----------------------------------------------------------------------

Dear group,

First a quick introduction: at KIT we are running a 20+ PB storage system with several large (1-9 PB) file systems. We have a 14-node NSD server cluster and 5 small (~10-node) protocol node clusters, each of which mounts one of the file systems. The protocol nodes run server software (dCache, xrootd) specific to our users, who are primarily the LHC experiments at CERN. The GPFS version is 4.2.2 everywhere. All servers are connected via InfiniBand, while the protocol nodes communicate with their clients via Ethernet.

Now let me describe the problem we are facing. For the past few days, one of the protocol nodes has been showing a very strange and as yet unexplained I/O behaviour. Previously we usually saw reads like this (iohist example from a well-behaved node):

14:03:37.637526  R  data  32:138835918848  8192  46.626  cli  0A417D79:58E3B179  172.18.224.19
14:03:37.660177  R  data  18:12590325760   8192  25.498  cli  0A4179AD:58E3AE66  172.18.224.14
14:03:37.640660  R  data  15:106365067264  8192  45.682  cli  0A4179AD:58E3ADD7  172.18.224.14
14:03:37.657006  R  data  35:130482421760  8192  30.872  cli  0A417DAD:58E3B266  172.18.224.21
14:03:37.643908  R  data  33:107847139328  8192  45.571  cli  0A417DAD:58E3B206  172.18.224.21

For the past few days we have been seeing this on the problematic node:

14:06:27.253537  R  data  46:126258287872  8  15.474  cli  0A4179AB:58E3AE54  172.18.224.13
14:06:27.268626  R  data  40:137280768624  8   0.395  cli  0A4179AD:58E3ADE3  172.18.224.14
14:06:27.269056  R  data  46:56452781528   8   0.427  cli  0A4179AB:58E3AE54  172.18.224.13
14:06:27.269417  R  data  47:97273159640   8   0.293  cli  0A4179AD:58E3AE5A  172.18.224.14
14:06:27.269293  R  data  49:59102786168   8   0.425  cli  0A4179AD:58E3AE72  172.18.224.14
14:06:27.269531  R  data  46:142387326944  8   0.340  cli  0A4179AB:58E3AE54  172.18.224.13
14:06:27.269377  R  data  28:102988517096  8   0.554  cli  0A417879:58E3AD08  172.18.224.10

The number of read ops has gone up by a factor of O(1000), which is what one would expect when going from 8192-sector reads to 8-sector reads (8192 / 8 = 1024).
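
For what it's worth, one quick way to quantify this pattern is to post-process the iohist output with a short script. Below is a minimal Python sketch that tallies read entries by I/O size (in sectors) and by the node address in the last column; the field positions are assumed from the sample lines above and may need adjusting for other iohist formats.

    # Minimal sketch: summarize iohist read entries by size and by node address.
    # Field positions assumed from the sample lines above:
    # time  R/W  buftype  disk:sector  nSec  ms  type  NSD-id  node-IP
    import sys
    from collections import Counter

    def summarize(lines):
        by_size, by_node = Counter(), Counter()
        for line in lines:
            fields = line.split()
            if len(fields) < 9 or fields[1] != "R":
                continue                      # skip headers and non-read entries
            by_size[fields[4]] += 1           # nSec column
            by_node[fields[8]] += 1           # node address column
        return by_size, by_node

    if __name__ == "__main__":
        sizes, nodes = summarize(sys.stdin)
        print("reads by size (sectors):", dict(sizes))
        print("reads by node address: ", dict(nodes))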

We have already excluded problems with the node itself, so we are focusing on the applications running on it. What we would like to do is associate the I/O requests either with files or with specific processes running on the machine, in order to be able to blame the correct application. Can somebody tell us whether this is possible and, if not, whether there are other ways to understand which application is causing this?

Thanks,

Andreas

-- 

 Karlsruhe Institute of Technology (KIT)
 Steinbuch Centre for Computing (SCC)

 Andreas Petzold

 Hermann-von-Helmholtz-Platz 1, Building 449, Room 202
 D-76344 Eggenstein-Leopoldshafen

 Tel: +49 721 608 24916
 Fax: +49 721 608 24972
 Email: petzold@kit.edu
 www.scc.kit.edu

 KIT – The Research University in the Helmholtz Association

 Since 2010, KIT has been certified as a family-friendly university.

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss