[gpfsug-discuss] Associating I/O operations with files/processes

John Hearns john.hearns at asml.com
Tue May 30 13:28:17 BST 2017


Andreas,
This is a stupid reply, but please bear with me.
Not exactly GPFS related, but I once managed an SGI CXFS (Clustered XFS filesystem) setup.
We also had a new application which did post-processing. One of the users reported that a post-processing job would take about 30 minutes.
However, when two or more instances of the same application were running, the job would take several hours.

We finally found that the slowdown was due to the I/O size: the application was using the default size.
We only found this by stracing the application and spending hours staring at the trace...

I am sure there are better tools for this, and I do hope you don't have to strace every application... really.
A good tool to get a general feel for I/O patterns is 'iotop'. It might help?
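If you do end up stracing, a small script can summarize syscall I/O sizes instead of staring at raw traces for hours. A minimal sketch, assuming output captured with something like `strace -f -e trace=read,write -o app.strace <app>` (the exact strace line format can vary between versions, so treat the regex as a starting point):

```python
import re
from collections import Counter

# Matches strace lines like: read(3, "..."..., 4096) = 4096
SYSCALL_RE = re.compile(r'(read|write)\((\d+),.*?,\s*(\d+)\)\s*=\s*(-?\d+)')

def size_histogram(lines):
    """Count completed read/write calls, bucketed by requested size."""
    hist = Counter()
    for line in lines:
        m = SYSCALL_RE.search(line)
        if m:
            call, req_size, ret = m.group(1), int(m.group(3)), int(m.group(4))
            if ret >= 0:  # ignore failed calls
                hist[(call, req_size)] += 1
    return hist

# Illustrative trace lines (hypothetical, not from a real run)
sample = [
    'read(3, "\\0\\0"..., 4096) = 4096',
    'read(3, "\\0\\0"..., 4096) = 4096',
    'write(4, "x"..., 512) = 512',
]
print(size_histogram(sample))
```

A histogram dominated by tiny request sizes is exactly the pattern we were hunting for in the CXFS case above.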




-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Andreas Petzold (SCC)
Sent: Tuesday, May 30, 2017 2:17 PM
To: gpfsug-discuss at spectrumscale.org
Subject: [gpfsug-discuss] Associating I/O operations with files/processes

Dear group,

first a quick introduction: at KIT we are running a 20+PB storage system with several large (1-9PB) file systems. We have a 14 node NSD server cluster and 5 small (~10 nodes) protocol node clusters which each mount one of the file systems. The protocol nodes run server software (dCache, xrootd) specific to our users which primarily are the LHC experiments at CERN. GPFS version is 4.2.2 everywhere. All servers are connected via IB, while the protocol nodes communicate via Ethernet to their clients.

Now let me describe the problem we are facing. Since a few days, one of the protocol nodes shows a very strange and as of yet unexplained I/O behaviour. Before we were usually seeing reads like this (iohist example from a well behaved node):

14:03:37.637526  R        data   32:138835918848  8192   46.626  cli  0A417D79:58E3B179    172.18.224.19
14:03:37.660177  R        data   18:12590325760   8192   25.498  cli  0A4179AD:58E3AE66    172.18.224.14
14:03:37.640660  R        data   15:106365067264  8192   45.682  cli  0A4179AD:58E3ADD7    172.18.224.14
14:03:37.657006  R        data   35:130482421760  8192   30.872  cli  0A417DAD:58E3B266    172.18.224.21
14:03:37.643908  R        data   33:107847139328  8192   45.571  cli  0A417DAD:58E3B206    172.18.224.21

Since a few days we see this on the problematic node:

14:06:27.253537  R        data   46:126258287872     8   15.474  cli  0A4179AB:58E3AE54    172.18.224.13
14:06:27.268626  R        data   40:137280768624     8    0.395  cli  0A4179AD:58E3ADE3    172.18.224.14
14:06:27.269056  R        data   46:56452781528      8    0.427  cli  0A4179AB:58E3AE54    172.18.224.13
14:06:27.269417  R        data   47:97273159640      8    0.293  cli  0A4179AD:58E3AE5A    172.18.224.14
14:06:27.269293  R        data   49:59102786168      8    0.425  cli  0A4179AD:58E3AE72    172.18.224.14
14:06:27.269531  R        data   46:142387326944     8    0.340  cli  0A4179AB:58E3AE54    172.18.224.13
14:06:27.269377  R        data   28:102988517096     8    0.554  cli  0A417879:58E3AD08    172.18.224.10

The number of read ops has gone up by roughly a factor of 1000, which is what one would expect when going from 8192-sector reads to 8-sector reads (8192/8 = 1024).

We have already excluded problems with the node itself, so we are focusing on the applications running on the node. What we'd like to do is associate the I/O requests either with files or with specific processes running on the machine, in order to be able to blame the correct application. Can somebody tell us if this is possible and, if not, whether there are other ways to understand which application is causing this?
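While it doesn't name a process or file, aggregating the iohist output per client at least narrows the source down. A minimal sketch that parses lines in the column layout pasted above (the layout is assumed from this `mmdiag --iohist`-style output and is not a stable interface):

```python
from collections import defaultdict

def summarize_iohist(lines):
    """Aggregate I/O op count and total sectors per client IP from iohist lines."""
    stats = defaultdict(lambda: {"ops": 0, "sectors": 0})
    for line in lines:
        fields = line.split()
        # Expected columns: time, R/W, pool, disk:sector, nSectors,
        # latency_ms, type, node_id, client_ip
        if len(fields) < 9 or fields[1] not in ("R", "W"):
            continue  # skip headers and blank lines
        sectors, client_ip = int(fields[4]), fields[8]
        stats[client_ip]["ops"] += 1
        stats[client_ip]["sectors"] += sectors
    return dict(stats)

# Sample lines taken from the problematic node's output above
sample = [
    "14:06:27.253537  R  data  46:126258287872  8  15.474  cli  0A4179AB:58E3AE54  172.18.224.13",
    "14:06:27.268626  R  data  40:137280768624  8   0.395  cli  0A4179AD:58E3ADE3  172.18.224.14",
    "14:06:27.269056  R  data  46:56452781528   8   0.427  cli  0A4179AB:58E3AE54  172.18.224.13",
]
for ip, s in summarize_iohist(sample).items():
    print(ip, s["ops"], s["sectors"])
```

An average request size near 8 sectors for one client, against 8192 for the others, would point straight at the node (and hence the protocol service) to investigate further.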

Thanks,

Andreas

--

  Karlsruhe Institute of Technology (KIT)
  Steinbuch Centre for Computing (SCC)

  Andreas Petzold

  Hermann-von-Helmholtz-Platz 1, Building 449, Room 202
  D-76344 Eggenstein-Leopoldshafen

  Tel: +49 721 608 24916
  Fax: +49 721 608 24972
  Email: petzold at kit.edu
  www.scc.kit.edu

  KIT – The Research University in the Helmholtz Association

  Since 2010, KIT has been certified as a family-friendly university.



