[gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process

Czauz, Kamil Kamil.Czauz at Squarepoint-Capital.com
Fri Nov 13 02:33:17 GMT 2020


Hi Uwe -

I hit the issue again today: no major waiters, and nothing useful from the iohist report.  Nothing interesting in the logs either.

I was able to get a trace today while the issue was happening.  I took two traces a few minutes apart.

The beginning of each trace looks something like this:
Overwrite trace parameters:
  buffer size: 134217728
   64 kernel trace streams, indices 0-63 (selected by low bits of processor ID)
  128 daemon trace streams, indices 64-191 (selected by low bits of thread ID)
Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles
Measured cycle count update rate to be 2599997559 per second <---- using this value
OS reported cycle count update rate as 2599999000 per second
Trace milestones:
  kernel trace enabled  Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220)
  daemon trace enabled  Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152)
  all streams included  Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here
  trace quiesced        Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <----   to here
Approximate number of times the trace buffer was filled: 553.529


Here is the output of
trsum.awk details=0
I'm not quite sure what to make of it - can you help me decipher it?  The 'lookup' operations are taking a hell of a long time; what does that mean?
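
For reference, this is roughly how I produced the summary (the trcrpt path comes from the mmtracectl output; /tmp/mmfs and <this_client> below are just the usual default and a placeholder, so treat them as assumptions):

# stop the trace; mmtracectl prints where the formatted trcrpt.* report lands
mmtracectl --stop -N <this_client>
# summarize the report with the sample tool, details switched off
awk -f /usr/lpp/mmfs/samples/debugtools/trsum.awk details=0 /tmp/mmfs/trcrpt.*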

Capture 1

Unfinished operations:

   21234 ***************** lookup               ************** 0.165851604
  290020 ***************** lookup               ************** 0.151032241
  302757 ***************** lookup               ************** 0.168723402
  301677 ***************** lookup               ************** 0.070016530
  230983 ***************** lookup               ************** 0.127699082
   21233 ***************** lookup               ************** 0.060357257
  309046 ***************** lookup               ************** 0.157124551
  301643 ***************** lookup               ************** 0.165543982
  304042 ***************** lookup               ************** 0.172513838
  167794 ***************** lookup               ************** 0.056056815
  189680 ***************** lookup               ************** 0.062022237
  362216 ***************** lookup               ************** 0.072063619
  406314 ***************** lookup               ************** 0.114121838
  167776 ***************** lookup               ************** 0.114899642
  303016 ***************** lookup               ************** 0.144491120
  290021 ***************** lookup               ************** 0.142311603
  167762 ***************** lookup               ************** 0.144240366
  248530 ***************** lookup               ************** 0.168728131
       0       0.000000000 *********  Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF
       0       0.000000000 *********  Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE
       0       0.000000000 *********  Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF
       0       0.000000000 *********  Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF
       0       0.000000000 *********  Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE
       0       0.000000000 *********  Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF
       0       0.000000000 *********  Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE
       0       0.000000000 *********  Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF
       0       0.000000000 *********  Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF

Elapsed trace time:                                    0.182617894 seconds
Elapsed trace time from first VFS call to last:        0.182617893
Time idle between VFS calls:                           0.000006317 seconds

Operations stats:             total time(s)  count    avg-usecs        wait-time(s)    avg-usecs
  rdwr                       0.012021696        35      343.477
  read_inode2                0.000100787        43        2.344
  follow_link                0.000050609         8        6.326
  pagein                     0.000097806        10        9.781
  revalidate                 0.000010884       156        0.070
  open                       0.001001824        18       55.657
  lookup                     1.152449696        36    32012.492
  delete_inode               0.000036816        38        0.969
  permission                 0.000080574        14        5.755
  release                    0.000470096        18       26.116
  mmap                       0.000340095         9       37.788
  llseek                     0.000001903         9        0.211


User thread stats: GPFS-time(sec)    Appl-time  GPFS-%  Appl-%     Ops
    221919       0.000000244       0.050064080   0.00% 100.00%       4
    167794       0.000011891       0.000069707  14.57%  85.43%       4
    309046       0.147664569       0.000074663  99.95%   0.05%       9
    349767       0.000000070       0.000000000 100.00%   0.00%       1
    301677       0.017638372       0.048741086  26.57%  73.43%      12
     84407       0.000010448       0.000016977  38.10%  61.90%       3
    406314       0.000002279       0.000122367   1.83%  98.17%       7
     25464       0.043270937       0.000006200  99.99%   0.01%       2
    362216       0.000005617       0.000017498  24.30%  75.70%       2
    379982       0.000000626       0.000000000 100.00%   0.00%       1
    230983       0.123947465       0.000056796  99.95%   0.05%       6
     21233       0.047877661       0.004887113  90.74%   9.26%      17
    302757       0.154486003       0.010695642  93.52%   6.48%      24
    248530       0.000006763       0.000035442  16.02%  83.98%       3
    303016       0.014678039       0.000013098  99.91%   0.09%       2
    301643       0.088025575       0.054036566  61.96%  38.04%      33
      3339       0.000034997       0.178199426   0.02%  99.98%      35
     21234       0.164240073       0.000262711  99.84%   0.16%      39
    167762       0.000011886       0.000041865  22.11%  77.89%       3
    336006       0.000001246       0.100519562   0.00% 100.00%      16
    304042       0.121322325       0.019218406  86.33%  13.67%      33
    301644       0.054325242       0.087715613  38.25%  61.75%      37
    301680       0.000015005       0.020838281   0.07%  99.93%       9
    290020       0.147713357       0.000121422  99.92%   0.08%      19
    290021       0.000476072       0.000085833  84.72%  15.28%      10
     44777       0.040819757       0.000010957  99.97%   0.03%       3
    189680       0.000000044       0.000002376   1.82%  98.18%       1
    241759       0.000000698       0.000000000 100.00%   0.00%       1
    184839       0.000001621       0.150341986   0.00% 100.00%      28
    362220       0.000010818       0.000020949  34.05%  65.95%       2
    104687       0.000000495       0.000000000 100.00%   0.00%       1

# total App-read/write = 45 Average duration = 0.000269322 sec
#  time(sec)  count         %     %ile       read      write  avgBytesR  avgBytesW
0.000500         34  0.755556 0.755556         34          0      32889          0
0.001000         10  0.222222 0.977778         10          0     108136          0
0.004000          1  0.022222 1.000000          1          0          8          0

# max concurrant App-read/write = 2
# conc    count         %     %ile
   1         38  0.844444 0.844444
   2          7  0.155556 1.000000









Capture 2

Unfinished operations:

  335096 ***************** lookup               ************** 0.289127895
  334691 ***************** lookup               ************** 0.225380797
  362246 ***************** lookup               ************** 0.052106493
  334694 ***************** lookup               ************** 0.048567769
  362220 ***************** lookup               ************** 0.054825580
  333972 ***************** lookup               ************** 0.275355791
  406314 ***************** lookup               ************** 0.283219905
  334686 ***************** lookup               ************** 0.285973208
  289606 ***************** lookup               ************** 0.064608288
   21233 ***************** lookup               ************** 0.074923689
  189680 ***************** lookup               ************** 0.089702578
  335100 ***************** lookup               ************** 0.151553955
  334685 ***************** lookup               ************** 0.117808430
  167700 ***************** lookup               ************** 0.119441314
  336813 ***************** lookup               ************** 0.120572137
  334684 ***************** lookup               ************** 0.124718126
   21234 ***************** lookup               ************** 0.131124745
   84407 ***************** lookup               ************** 0.132442945
  334696 ***************** lookup               ************** 0.140938740
  335094 ***************** lookup               ************** 0.201637910
  167735 ***************** lookup               ************** 0.164059859
  334687 ***************** lookup               ************** 0.252930745
  334695 ***************** lookup               ************** 0.278037098
  341818       0.291815990 *********  Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5
  341818       0.291822084 MSG FSnd: nsdMsgReadExt msg_id 2894690129 Sduration 199.688 + us
  100041       0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us

Elapsed trace time:                                    0.292021772 seconds
Elapsed trace time from first VFS call to last:        0.292021771
Time idle between VFS calls:                           0.001436519 seconds

Operations stats:             total time(s)  count    avg-usecs        wait-time(s)    avg-usecs
  rdwr                       0.000831801         4      207.950
  read_inode2                0.000082347        31        2.656
  pagein                     0.000033905         3       11.302
  revalidate                 0.000013109       156        0.084
  open                       0.000237969        22       10.817
  lookup                     1.233407280        10   123340.728
  delete_inode               0.000013877        33        0.421
  permission                 0.000046486         8        5.811
  release                    0.000172456        21        8.212
  mmap                       0.000064411         2       32.206
  llseek                     0.000000391         2        0.196
  readdir                    0.000213657        36        5.935


User thread stats: GPFS-time(sec)    Appl-time  GPFS-%  Appl-%     Ops
    335094       0.053506265       0.000170270  99.68%   0.32%      16
    167700       0.000008522       0.000027547  23.63%  76.37%       2
    167776       0.000008293       0.000019462  29.88%  70.12%       2
    334684       0.000023562       0.000160872  12.78%  87.22%       8
    349767       0.000000467       0.250029787   0.00% 100.00%       5
     84407       0.000000230       0.000017947   1.27%  98.73%       2
    334685       0.000028543       0.000094147  23.26%  76.74%       8
    406314       0.221755229       0.000009720 100.00%   0.00%       2
    334694       0.000024913       0.000125229  16.59%  83.41%      10
    335096       0.254359005       0.000240785  99.91%   0.09%      18
    334695       0.000028966       0.000127823  18.47%  81.53%      10
    334686       0.223770082       0.000267271  99.88%   0.12%      24
    334687       0.000031265       0.000132905  19.04%  80.96%       9
    334696       0.000033808       0.000131131  20.50%  79.50%       9
    129075       0.000000102       0.000000000 100.00%   0.00%       1
    341842       0.000000318       0.000000000 100.00%   0.00%       1
    335100       0.059518133       0.000287934  99.52%   0.48%      19
    224423       0.000000471       0.000000000 100.00%   0.00%       1
    336812       0.000042720       0.000193294  18.10%  81.90%      10
     21233       0.000556984       0.000083399  86.98%  13.02%      11
    289606       0.000000088       0.000018043   0.49%  99.51%       2
    362246       0.014440188       0.000046516  99.68%   0.32%       4
     21234       0.000524848       0.000162353  76.37%  23.63%      13
    336813       0.000046426       0.000175666  20.90%  79.10%       9
      3339       0.000011816       0.272396876   0.00% 100.00%      29
    341818       0.000000778       0.000000000 100.00%   0.00%       1
    167735       0.000007866       0.000049468  13.72%  86.28%       3
    175480       0.000000278       0.000000000 100.00%   0.00%       1
    336006       0.000001170       0.250020470   0.00% 100.00%      16
     44777       0.000000367       0.250149757   0.00% 100.00%       6
    189680       0.000002717       0.000006518  29.42%  70.58%       1
    184839       0.000003001       0.250144214   0.00% 100.00%      35
    145858       0.000000687       0.000000000 100.00%   0.00%       1
    333972       0.218656404       0.000043897  99.98%   0.02%       4
    334691       0.187695040       0.000295117  99.84%   0.16%      25

# total App-read/write = 7 Average duration = 0.000123672 sec
#  time(sec)  count         %     %ile       read      write  avgBytesR  avgBytesW
0.000500          7  1.000000 1.000000          7          0       1172          0




-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org <gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Uwe Falke
Sent: Wednesday, November 11, 2020 8:57 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process

Hi, Kamil,
I suppose you'd rather avoid such issues altogether than pursue the ugly workaround of killing off processes.
In such situations, the first things to look at are the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and the current waiters (that is, the list of currently waiting threads) on the hanging client.

-> /var/adm/ras/mmfs.log.latest
mmdiag --waiters

That might give you a first idea what is taking long and which components are involved.
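
If the hang lasts a while, it might also help to sample the waiters a few times, roughly like this (just a sketch; the loop count and interval are arbitrary):

# on the hanging client, while the issue is visible
tail -n 100 /var/adm/ras/mmfs.log.latest
for i in 1 2 3 4 5; do date; mmdiag --waiters; sleep 2; done   # waiters that persist across samples are the interesting ones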

Also,
mmdiag --iohist
shows you the last IOs and some stats (service time, size) for them.
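
It can be worth saving that output with a timestamp so you can correlate it later (again just a sketch; the target path is arbitrary):

# keep the recent IO history for later comparison
mmdiag --iohist > /tmp/iohist.$(hostname).$(date +%Y%m%d-%H%M%S).txt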

Either that clue is already sufficient, or you go on (for example, if you see DIO somewhere, direct I/O is being used, which might slow things down).
GPFS has a nice tracing facility which you can configure, or you can just run the default trace.

Running a dedicated (low-level) I/O trace can be achieved by
mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N <your_critical_node>
Then, when the issue is seen, stop the trace by
mmtracectl --stop -N <your_critical_node>

Do not wait to stop the trace once you've seen the issue; the trace cyclically overwrites its output. If the issue lasts some time, you could also start the trace while you see it, run it for, say, 20 seconds, and stop it again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.* (usually in /tmp/mmfs; check the command output).
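
So a complete capture could look roughly like this (the node name is a placeholder, 20 seconds as mentioned above):

mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N client01   # client01 is a placeholder
sleep 20                                                                    # let the trace run while the issue is visible
mmtracectl --stop -N client01                                               # stopping produces the ASCII trcrpt.* report
ls -l /tmp/mmfs/trcrpt.*                                                    # usual location; check the mmtracectl output to be sure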


There you should see lines with  FIO which carry the inode of the related file after the "tag" keyword.
example:
0.000745100  25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150

-> inode is 248415
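
To get a quick ranking of which inodes turn up most often in FIO lines, something like this should do (a sketch; adjust the trcrpt path to yours):

# count FIO records per inode (the field right after the "tag" keyword)
grep 'FIO:' /tmp/mmfs/trcrpt.* | awk '{for (i = 1; i <= NF; i++) if ($i == "tag") print $(i+1)}' | sort -n | uniq -c | sort -rn | head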

There is a utility, tsfindinode, to translate that into the file path.
You need to build it first if not yet done:
cd /usr/lpp/mmfs/samples/util ; make
then run
./tsfindinode -i <inode_num> <fs_mount_point>
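
Put together, a hypothetical loop over the busiest inodes (the inode number and the mount point /gpfs/fs1 below are placeholders):

cd /usr/lpp/mmfs/samples/util && make      # build tsfindinode once, as above
for ino in 248415; do                      # replace with the inode numbers from the FIO ranking
  ./tsfindinode -i $ino /gpfs/fs1          # /gpfs/fs1 stands for your file system mount point
done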

For the I/O trace analysis there is an older tool:
/usr/lpp/mmfs/samples/debugtools/trsum.awk

Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README)

Hope that helps a bit.

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefalke at de.ibm.com

IBM Services

IBM Data Privacy Statement

IBM Deutschland Business & Technology Services GmbH
Geschäftsführung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen
Registergericht: Amtsgericht Stuttgart, HRB 17122



From:    "Czauz, Kamil" <Kamil.Czauz at Squarepoint-Capital.com>
To:      "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Date:    11/11/2020 23:36
Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process
Sent by: gpfsug-discuss-bounces at spectrumscale.org



We regularly run into performance issues on our clients where the client seems to hang when accessing any GPFS mount; even something simple like an ls can take a few minutes to complete. This affects every GPFS mount on the client, but other clients are working just fine. Also, the mmfsd process at this point is spinning at something like 300-500% CPU.

The only way I have found to solve this is by killing processes that may be doing heavy I/O to the GPFS mounts - but this is more of an art than a science.  I often end up killing many processes before finding the offending one.

My question is really about finding the offending process more easily.  Is there something similar to iotop, or a trace I can enable, that can tell me which files/processes are being heavily used by the mmfsd process on the client?

-Kamil






