[gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process

Czauz, Kamil Kamil.Czauz at Squarepoint-Capital.com
Fri Nov 20 19:13:41 GMT 2020


Here is the output of waiters on 2 hosts that were having the issue today:

HOST 1
 [2020-11-20 09:07:53 root at nyzls149m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters

=== mmdiag: waiters ===
Waiting 0.0035 sec since 09:08:07, monitored, thread 135497 FileBlockReadFetchHandlerThread: on ThCond 0x7F615C152468 (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.64.44.180 <c0n105>
Waiting 0.0036 sec since 09:08:07, monitored, thread 139228 PrefetchWorkerThread: on ThCond 0x7F627000D5D8 (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.64.44.181 <c0n106>

[2020-11-20 09:08:07 root at nyzls149m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters

=== mmdiag: waiters ===

HOST 2
[2020-11-20 09:08:49 root at nyzls150m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters

=== mmdiag: waiters ===
Waiting 0.0034 sec since 09:08:50, monitored, thread 345318 SharedHashTabFetchHandlerThread: on ThCond 0x7F049C001F08 (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.64.44.133 <c1n2>

[2020-11-20 09:08:50 root at nyzls150m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters

=== mmdiag: waiters ===

[2020-11-20 09:08:52 root at nyzls150m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters

=== mmdiag: waiters ===


You can see the waiters go from 0 to 1-2, but they are hardly blocking.

Yes, there are separate metadata pools for all of the filesystems here.


I did another trace today when the problem was happening - this time I was able to get a longer trace using the following command:
/usr/lpp/mmfs/bin/mmtracectl --start --trace=io --trace-file-size=512M --tracedev-write-mode=blocking --tracedev-buffer-size=64M -N nyzls149m
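
For completeness, after reproducing the problem the trace was stopped and the resulting trcrpt file summarized roughly as follows (file name from this run; I believe trsum.awk can be invoked directly if it is executable, otherwise prepend 'awk -f'):

/usr/lpp/mmfs/bin/mmtracectl --stop -N nyzls149m
/usr/lpp/mmfs/samples/debugtools/trsum.awk details=0 trcrpt.2020-11-20_09.20.38.283986.nyzls149m > trsum.out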


This is what the trsum output looks like:

Elapsed trace time:                                   62.412092000 seconds
Elapsed trace time from first VFS call to last:       62.412091999
Time idle between VFS calls:                           0.002913000 seconds

Operations stats:             total time(s)  count    avg-usecs        wait-time(s)    avg-usecs
  readpage                   0.003487000         9      387.444
  rdwr                       0.273721000       183     1495.743
  read_inode2                0.007304000       325       22.474
  follow_link                0.013952000        58      240.552
  pagein                     0.025974000        66      393.545
  getattr                    0.002792000        26      107.385
  revalidate                 0.009406000      2172        4.331
  create                    66.194479000         3 22064826.333
  open                       1.725505000        88    19608.011
  unlink                    18.685099000         1 18685099.000
  setattr                    0.011627000        14      830.500
  lookup                  2379.215514000       502  4739473.135
  delete_inode               0.015553000       328       47.418
  rename                    98.099073000         5 19619814.600
  release                    0.050574000        89      568.247
  permission                 0.007454000        73      102.110
  getxattr                   0.002346000        32       73.312
  statfs                     0.000081000         6       13.500
  mmap                       0.049809000        18     2767.167
  removexattr                0.000827000        14       59.071
  llseek                     0.000441000        47        9.383
  readdir                    0.002667000        34       78.441
Ops      4093 Secs      62.409178999  Ops/Sec       65.583

MaxFilesToCache is set to 12000:
[common]
maxFilesToCache 12000
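
For reference, a rough sketch of how this can be checked and, if needed, raised for the client (the value below is only an example, and as far as I know a maxFilesToCache change only takes effect after the GPFS daemon on that node is restarted):

/usr/lpp/mmfs/bin/mmlsconfig maxFilesToCache
/usr/lpp/mmfs/bin/mmchconfig maxFilesToCache=100000 -N nyzls149m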


I only see gpfs_i_lookup in the trace file, no gpfs_v_lookup entries.
#  grep gpfs_i_lookup trcrpt.2020-11-20_09.20.38.283986.nyzls149m |wc -l
1097

They mostly look like this -

  62.346560 238895 TRACE_VNODE: gpfs_i_lookup exit: new inode 0xFFFF922178971A40 iNum 21980113 (0x14F63D1) cnP 0xFFFF922178971C88 retP 0x0 code 0 rc 0
  62.346955 238895 TRACE_VNODE: gpfs_i_lookup enter: diP 0xFFFF91A8A4991E00 dentryP 0xFFFF92C545A93500 name '20170323.txt' d_flags 0x80 d_count 1 unhashed 1
  62.367701 218442 TRACE_VNODE: gpfs_i_lookup exit: new inode 0xFFFF922071300000 iNum 29629892 (0x1C41DC4) cnP 0xFFFF922071300248 retP 0x0 code 0 rc 0
  62.367734 218444 TRACE_VNODE: gpfs_i_lookup enter: diP 0xFFFF9193CF457800 dentryP 0xFFFF9229527A89C0 name 'node.py' d_flags 0x80 d_count 1 unhashed 1


-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org <gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Uwe Falke
Sent: Monday, November 16, 2020 8:46 AM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process

Hi,
While the other nodes can well block the local one, as Frederick suggests, there should at least be something visible locally waiting for those other nodes.
Looking at all waiters might be a good thing, but this case looks strange in other ways. Mind the statement that there are almost no local waiters and none of them gets older than 10 ms.

I am no developer nor do I have the code, so don't expect too much.  Can you tell what lookups you see (check in the trcrpt file, could be like gpfs_i_lookup or gpfs_v_lookup)?
Lookups are metadata ops, do you have a separate pool for your metadata?
How is that pool set up (down to the physical block devices)?
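(For example, something like

mmlsdisk <fsname> -L

should show, per disk, whether it holds data, metadata or both, and which storage pool it belongs to - exact columns depend on your release.)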
Your trsum output revealed 36 lookups, each one on average taking >30 ms. That is a lot (albeit the respective waiters won't show up at first glance as suspicious ...).
So, which waiters did you see (hope you saved them; if not, do it next time).

What are the node you see this on, and the whole cluster, used for? What is the maxFilesToCache setting (for that node and for others)? What HW is that, and how big are your nodes (memory, CPU)?
To check the unreasonably short trace capture time: how large are the trcrpt files you obtain?


Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefalke at de.ibm.com

IBM Services

IBM Data Privacy Statement

IBM Deutschland Business & Technology Services GmbH
Geschäftsführung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen
Registergericht: Amtsgericht Stuttgart, HRB 17122



From:   "Czauz, Kamil" <Kamil.Czauz at Squarepoint-Capital.com>
To:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:   13/11/2020 14:33
Subject:        [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Hi Uwe -

Regarding your previous message - waiters were coming and going, with just 1-2 waiters when I ran the mmdiag command, and very low wait times (<0.01 s).

We are running version 4.2.3

I did another capture today while the client is functioning normally and this was the header result:

Overwrite trace parameters:
buffer size: 134217728
64 kernel trace streams, indices 0-63 (selected by low bits of processor ID)
128 daemon trace streams, indices 64-191 (selected by low bits of thread ID)
Interval for calibrating clock rate was 25.996957 seconds and 67592121252 cycles
Measured cycle count update rate to be 2600001271 per second <---- using this value
OS reported cycle count update rate as 2599999000 per second
Trace milestones:
kernel trace enabled Fri Nov 13 08:20:01.800558000 2020 (TOD 1605273601.800558, cycles 20807897445779444)
daemon trace enabled Fri Nov 13 08:20:01.910017000 2020 (TOD 1605273601.910017, cycles 20807897730372442)
all streams included Fri Nov 13 08:20:26.423085049 2020 (TOD 1605273626.423085, cycles 20807961464381068) <---- useful part of trace extends from here
trace quiesced Fri Nov 13 08:20:27.797515000 2020 (TOD 1605273627.000797, cycles 20807965037900696) <---- to here
Approximate number of times the trace buffer was filled: 14.631

Still a very small capture (1.3s), but the trsum.awk output was not filled with lookup commands / large lookup times. Can you help debug what those long lookup operations mean?

Unfinished operations:

27967 ***************** pagein ************** 1.362382116
27967 ***************** readpage ************** 1.362381516
139130 1.362448448 ********* Unfinished IO: buffer/disk 3002F670000 20:107498951168^\archive_data_16
104686 1.362022068 ********* Unfinished IO: buffer/disk 50011878000 1:47169618944^\archive_data_1
0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFE
0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFF
0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFE
341710 1.362423815 ********* Unfinished IO: buffer/disk 20022218000 19:107498951680^\archive_data_15
0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFF
0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFE
0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\00000000FFFFFFFF
0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFF
139150 1.361122006 ********* Unfinished IO: buffer/disk 50012018000 2:47169622016^\archive_data_2
0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\00000000FFFFFFFF
95782 1.361112791 ********* Unfinished IO: buffer/disk 40016300000 20:107498950656^\archive_data_16
0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\00000000FFFFFFFF
271076 1.361579585 ********* Unfinished IO: buffer/disk 20023DB8000 4:47169606656^\archive_data_4
341676 1.362018599 ********* Unfinished IO: buffer/disk 40038140000 5:47169614336^\archive_data_5
139150 1.361131599 MSG FSnd: nsdMsgReadExt msg_id 2930654492 Sduration 13292.382 + us
341676 1.362027104 MSG FSnd: nsdMsgReadExt msg_id 2930654495 Sduration 12396.877 + us
95782 1.361124739 MSG FSnd: nsdMsgReadExt msg_id 2930654491 Sduration 13299.242 + us
271076 1.361587653 MSG FSnd: nsdMsgReadExt msg_id 2930654493 Sduration 12836.328 + us
92182 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us
341710 1.362429643 MSG FSnd: nsdMsgReadExt msg_id 2930654497 Sduration 11994.338 + us
341662 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us
139130 1.362458376 MSG FSnd: nsdMsgReadExt msg_id 2930654498 Sduration 11965.605 + us
104686 1.362028772 MSG FSnd: nsdMsgReadExt msg_id 2930654496 Sduration 12395.209 + us
412373 0.775676657 MSG FRep: nsdMsgReadExt msg_id 304915249 Rduration 598747.324 us Rlen 262144 Hduration 598752.112 + us
341770 0.589739579 MSG FRep: nsdMsgReadExt msg_id 338079050 Rduration 784684.402 us Rlen 4 Hduration 784692.651 + us
143315 0.536252844 MSG FRep: nsdMsgReadExt msg_id 631945522 Rduration 838171.137 us Rlen 233472 Hduration 838174.299 + us
341878 0.134331812 MSG FRep: nsdMsgReadExt msg_id 338079023 Rduration 1240092.169 us Rlen 262144 Hduration 1240094.403 + us
175478 0.587353287 MSG FRep: nsdMsgReadExt msg_id 338079047 Rduration 787070.694 us Rlen 262144 Hduration 787073.990 + us
139558 0.633517347 MSG FRep: nsdMsgReadExt msg_id 631945538 Rduration 740906.634 us Rlen 102400 Hduration 740910.172 + us
143308 0.958832110 MSG FRep: nsdMsgReadExt msg_id 631945542 Rduration 415591.871 us Rlen 262144 Hduration 415597.056 + us


Elapsed trace time: 1.374423981 seconds
Elapsed trace time from first VFS call to last: 1.374423980
Time idle between VFS calls: 0.001603738 seconds

Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs
readpage 1.151660085 1874 614.546
rdwr 0.431456904 581 742.611
read_inode2 0.001180648 934 1.264
follow_link 0.000029502 7 4.215
getattr 0.000048413 9 5.379
revalidate 0.000007080 67 0.106
pagein 1.149699537 1877 612.520
create 0.007664829 9 851.648
open 0.001032657 19 54.350
unlink 0.002563726 14 183.123
delete_inode 0.000764598 826 0.926
lookup 0.312847947 953 328.277
setattr 0.020651226 824 25.062
permission 0.000015018 1 15.018
rename 0.000529023 4 132.256
release 0.001613800 22 73.355
getxattr 0.000030494 6 5.082
mmap 0.000054767 1 54.767
llseek 0.000001130 4 0.283
readdir 0.000033947 2 16.973
removexattr 0.002119736 820 2.585

User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops
42625 0.000000138 0.000031017 0.44% 99.56% 3
42378 0.000586959 0.011596801 4.82% 95.18% 32
42627 0.000000272 0.000013421 1.99% 98.01% 2
42641 0.003284590 0.012593594 20.69% 79.31% 35
42628 0.001522335 0.000002748 99.82% 0.18% 2
25464 0.003462795 0.500281914 0.69% 99.31% 12
301420 0.000016711 0.052848218 0.03% 99.97% 38
95103 0.000000544 0.000000000 100.00% 0.00% 1
145858 0.000000659 0.000794896 0.08% 99.92% 2
42221 0.000011484 0.000039445 22.55% 77.45% 5
371718 0.000000707 0.001805425 0.04% 99.96% 2
95109 0.000000880 0.008998763 0.01% 99.99% 2
95337 0.000010330 0.503057866 0.00% 100.00% 8
42700 0.002442175 0.012504429 16.34% 83.66% 35
189680 0.003466450 0.500128627 0.69% 99.31% 9
42681 0.006685396 0.000391575 94.47% 5.53% 16
42702 0.000048203 0.000000500 98.97% 1.03% 2
42703 0.000033280 0.140102087 0.02% 99.98% 9
224423 0.000000195 0.000000000 100.00% 0.00% 1
42706 0.000541098 0.000014713 97.35% 2.65% 3
106275 0.000000456 0.000000000 100.00% 0.00% 1
42721 0.000372857 0.000000000 100.00% 0.00% 1


-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org
<gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Uwe Falke
Sent: Friday, November 13, 2020 4:37 AM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process

Hi Kamil,
In my mail just a few minutes ago I'd overlooked that the buffer size in your trace was indeed 128M (I suppose the trace adapts that size if it is not set explicitly). That is very strange: even under high load, the trace should then capture a period longer than 10 secs and, most of all, it should contain many more activities than the few you had.
That is very mysterious.
I am out of ideas for the moment, and a bit short of time to dig here.

To check your tracing, you could run a trace like before but when everything is normal and check that out - you should see many records, and trsum.awk should list just a small portion of unfinished ops at the end, ... If that is fine, then the tracing itself is affected by your critical condition (never experienced that before - GPFS would rather grind to a halt than abandon the trace), and that might well be worth a support ticket.

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefalke at de.ibm.com

IBM Services

IBM Data Privacy Statement

IBM Deutschland Business & Technology Services GmbH
Geschäftsführung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft:
Ehningen
Registergericht: Amtsgericht Stuttgart, HRB 17122



From: Uwe Falke/Germany/IBM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: 13/11/2020 10:21
Subject: Re: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process


Hi, Kamil,
looks like your trace file setting has been too low:
all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here
trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here
means you effectively captured a period of only about 180 ms ... you can't see much from that.

I'd assumed the default trace file size would be sufficient here, but it doesn't seem to be.
try running with something like
mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N <your_critical_node>.

However, if you say "no major waiters" - how many waiters did you see at any time? What kind of waiters were the oldest, and how long had they waited?
It could indeed well be that some job is just creating a killer workload.

The very short cycle time of the trace points, OTOH, to high activity; OTOH, the trace file setting appears quite low (trace=io doesn't collect much trace info, just basic IO stuff).
If I might ask: what version of GPFS are you running?

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefalke at de.ibm.com

IBM Services

IBM Data Privacy Statement

IBM Deutschland Business & Technology Services GmbH
Geschäftsführung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft:
Ehningen
Registergericht: Amtsgericht Stuttgart, HRB 17122




From: "Czauz, Kamil" <Kamil.Czauz at Squarepoint-Capital.com>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: 13/11/2020 03:33
Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process
Sent by: gpfsug-discuss-bounces at spectrumscale.org



Hi Uwe -

I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either.

I was able to get a trace today while the issue was happening. I took 2 traces a few min apart.

The beginning of the traces look something like this:
Overwrite trace parameters:
buffer size: 134217728
64 kernel trace streams, indices 0-63 (selected by low bits of processor ID)
128 daemon trace streams, indices 64-191 (selected by low bits of thread ID)
Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles
Measured cycle count update rate to be 2599997559 per second <---- using this value
OS reported cycle count update rate as 2599999000 per second
Trace milestones:
kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220)
daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152)
all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here
trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here
Approximate number of times the trace buffer was filled: 553.529


Here is the output of trsum.awk details=0.
I'm not quite sure what to make of it - can you help me decipher it? The 'lookup' operations are taking a hell of a long time; what does that mean?

Capture 1

Unfinished operations:

21234 ***************** lookup ************** 0.165851604
290020 ***************** lookup ************** 0.151032241
302757 ***************** lookup ************** 0.168723402
301677 ***************** lookup ************** 0.070016530
230983 ***************** lookup ************** 0.127699082
21233 ***************** lookup ************** 0.060357257
309046 ***************** lookup ************** 0.157124551
301643 ***************** lookup ************** 0.165543982
304042 ***************** lookup ************** 0.172513838
167794 ***************** lookup ************** 0.056056815
189680 ***************** lookup ************** 0.062022237
362216 ***************** lookup ************** 0.072063619
406314 ***************** lookup ************** 0.114121838
167776 ***************** lookup ************** 0.114899642
303016 ***************** lookup ************** 0.144491120
290021 ***************** lookup ************** 0.142311603
167762 ***************** lookup ************** 0.144240366
248530 ***************** lookup ************** 0.168728131
0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF
0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE
0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF
0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF
0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE
0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF
0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE
0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF
0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF

Elapsed trace time: 0.182617894 seconds
Elapsed trace time from first VFS call to last: 0.182617893
Time idle between VFS calls: 0.000006317 seconds

Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs
rdwr 0.012021696 35 343.477
read_inode2 0.000100787 43 2.344
follow_link 0.000050609 8 6.326
pagein 0.000097806 10 9.781
revalidate 0.000010884 156 0.070
open 0.001001824 18 55.657
lookup 1.152449696 36 32012.492
delete_inode 0.000036816 38 0.969
permission 0.000080574 14 5.755
release 0.000470096 18 26.116
mmap 0.000340095 9 37.788
llseek 0.000001903 9 0.211


User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops
221919 0.000000244 0.050064080 0.00% 100.00% 4
167794 0.000011891 0.000069707 14.57% 85.43% 4
309046 0.147664569 0.000074663 99.95% 0.05% 9
349767 0.000000070 0.000000000 100.00% 0.00% 1
301677 0.017638372 0.048741086 26.57% 73.43% 12
84407 0.000010448 0.000016977 38.10% 61.90% 3
406314 0.000002279 0.000122367 1.83% 98.17% 7
25464 0.043270937 0.000006200 99.99% 0.01% 2
362216 0.000005617 0.000017498 24.30% 75.70% 2
379982 0.000000626 0.000000000 100.00% 0.00% 1
230983 0.123947465 0.000056796 99.95% 0.05% 6
21233 0.047877661 0.004887113 90.74% 9.26% 17
302757 0.154486003 0.010695642 93.52% 6.48% 24
248530 0.000006763 0.000035442 16.02% 83.98% 3
303016 0.014678039 0.000013098 99.91% 0.09% 2
301643 0.088025575 0.054036566 61.96% 38.04% 33
3339 0.000034997 0.178199426 0.02% 99.98% 35
21234 0.164240073 0.000262711 99.84% 0.16% 39
167762 0.000011886 0.000041865 22.11% 77.89% 3
336006 0.000001246 0.100519562 0.00% 100.00% 16
304042 0.121322325 0.019218406 86.33% 13.67% 33
301644 0.054325242 0.087715613 38.25% 61.75% 37
301680 0.000015005 0.020838281 0.07% 99.93% 9
290020 0.147713357 0.000121422 99.92% 0.08% 19
290021 0.000476072 0.000085833 84.72% 15.28% 10
44777 0.040819757 0.000010957 99.97% 0.03% 3
189680 0.000000044 0.000002376 1.82% 98.18% 1
241759 0.000000698 0.000000000 100.00% 0.00% 1
184839 0.000001621 0.150341986 0.00% 100.00% 28
362220 0.000010818 0.000020949 34.05% 65.95% 2
104687 0.000000495 0.000000000 100.00% 0.00% 1

# total App-read/write = 45 Average duration = 0.000269322 sec
# time(sec) count % %ile read write avgBytesR avgBytesW
0.000500 34 0.755556 0.755556 34 0 32889 0
0.001000 10 0.222222 0.977778 10 0 108136 0
0.004000 1 0.022222 1.000000 1 0 8 0

# max concurrant App-read/write = 2
# conc count % %ile
1 38 0.844444 0.844444
2 7 0.155556 1.000000









Capture 2

Unfinished operations:

335096 ***************** lookup ************** 0.289127895
334691 ***************** lookup ************** 0.225380797
362246 ***************** lookup ************** 0.052106493
334694 ***************** lookup ************** 0.048567769
362220 ***************** lookup ************** 0.054825580
333972 ***************** lookup ************** 0.275355791
406314 ***************** lookup ************** 0.283219905
334686 ***************** lookup ************** 0.285973208
289606 ***************** lookup ************** 0.064608288
21233 ***************** lookup ************** 0.074923689
189680 ***************** lookup ************** 0.089702578
335100 ***************** lookup ************** 0.151553955
334685 ***************** lookup ************** 0.117808430
167700 ***************** lookup ************** 0.119441314
336813 ***************** lookup ************** 0.120572137
334684 ***************** lookup ************** 0.124718126
21234 ***************** lookup ************** 0.131124745
84407 ***************** lookup ************** 0.132442945
334696 ***************** lookup ************** 0.140938740
335094 ***************** lookup ************** 0.201637910
167735 ***************** lookup ************** 0.164059859
334687 ***************** lookup ************** 0.252930745
334695 ***************** lookup ************** 0.278037098
341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5
341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 2894690129 Sduration 199.688 + us
100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us

Elapsed trace time: 0.292021772 seconds
Elapsed trace time from first VFS call to last: 0.292021771
Time idle between VFS calls: 0.001436519 seconds

Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs
rdwr 0.000831801 4 207.950
read_inode2 0.000082347 31 2.656
pagein 0.000033905 3 11.302
revalidate 0.000013109 156 0.084
open 0.000237969 22 10.817
lookup 1.233407280 10 123340.728
delete_inode 0.000013877 33 0.421
permission 0.000046486 8 5.811
release 0.000172456 21 8.212
mmap 0.000064411 2 32.206
llseek 0.000000391 2 0.196
readdir 0.000213657 36 5.935


User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops
335094 0.053506265 0.000170270 99.68% 0.32% 16
167700 0.000008522 0.000027547 23.63% 76.37% 2
167776 0.000008293 0.000019462 29.88% 70.12% 2
334684 0.000023562 0.000160872 12.78% 87.22% 8
349767 0.000000467 0.250029787 0.00% 100.00% 5
84407 0.000000230 0.000017947 1.27% 98.73% 2
334685 0.000028543 0.000094147 23.26% 76.74% 8
406314 0.221755229 0.000009720 100.00% 0.00% 2
334694 0.000024913 0.000125229 16.59% 83.41% 10
335096 0.254359005 0.000240785 99.91% 0.09% 18
334695 0.000028966 0.000127823 18.47% 81.53% 10
334686 0.223770082 0.000267271 99.88% 0.12% 24
334687 0.000031265 0.000132905 19.04% 80.96% 9
334696 0.000033808 0.000131131 20.50% 79.50% 9
129075 0.000000102 0.000000000 100.00% 0.00% 1
341842 0.000000318 0.000000000 100.00% 0.00% 1
335100 0.059518133 0.000287934 99.52% 0.48% 19
224423 0.000000471 0.000000000 100.00% 0.00% 1
336812 0.000042720 0.000193294 18.10% 81.90% 10
21233 0.000556984 0.000083399 86.98% 13.02% 11
289606 0.000000088 0.000018043 0.49% 99.51% 2
362246 0.014440188 0.000046516 99.68% 0.32% 4
21234 0.000524848 0.000162353 76.37% 23.63% 13
336813 0.000046426 0.000175666 20.90% 79.10% 9
3339 0.000011816 0.272396876 0.00% 100.00% 29
341818 0.000000778 0.000000000 100.00% 0.00% 1
167735 0.000007866 0.000049468 13.72% 86.28% 3
175480 0.000000278 0.000000000 100.00% 0.00% 1
336006 0.000001170 0.250020470 0.00% 100.00% 16
44777 0.000000367 0.250149757 0.00% 100.00% 6
189680 0.000002717 0.000006518 29.42% 70.58% 1
184839 0.000003001 0.250144214 0.00% 100.00% 35
145858 0.000000687 0.000000000 100.00% 0.00% 1
333972 0.218656404 0.000043897 99.98% 0.02% 4
334691 0.187695040 0.000295117 99.84% 0.16% 25

# total App-read/write = 7 Average duration = 0.000123672 sec
# time(sec) count % %ile read write avgBytesR avgBytesW
0.000500 7 1.000000 1.000000 7 0 1172 0




-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org
<gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Uwe Falke
Sent: Wednesday, November 11, 2020 8:57 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process

Hi, Kamil,
I suppose you'd rather avoid the issue altogether than pursue the ugly work-around of killing off processes.
In such situations, the first things to look at are the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and the current waiters (that is, the list of currently waiting threads) on the hanging client.

-> /var/adm/ras/mmfs.log.latest
mmdiag --waiters

That might give you a first idea what is taking long and which components are involved.

Also,
mmdiag --iohist
shows you the last IOs and some stats (service time, size) for them.

Either that clue is already sufficient, or you go on (if you see DIO somewhere, for example, direct IO is used, which might slow things down).
GPFS has a nice tracing facility which you can configure, or you can just run the default trace.

Running a dedicated (low-level) io trace can be achieved by
mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N <your_critical_node>
then, when the issue is seen, stop the trace by
mmtracectl --stop -N <your_critical_node>

Do not wait long to stop the trace once you've seen the issue - the trace cyclically overwrites its output. If the issue lasts some time, you could also start the trace while you see it, let it run for, say, 20 secs, and stop it again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.* (usually in /tmp/mmfs; check the command output).


There you should see lines with FIO which carry the inode of the related file after the "tag" keyword.
example:
0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150

-> inode is 248415

There is a utility, tsfindinode, to translate that into the file path.
You need to build it first if not yet done:
cd /usr/lpp/mmfs/samples/util ; make
, then run
./tsfindinode -i <inode_num> <fs_mount_point>
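
A rough, untested sketch of how to combine the two - count which inodes show up most often in the FIO records, then map the busiest ones to paths (field positions as in the FIO example above; adjust the trcrpt path/name to yours):

grep 'FIO:' /tmp/mmfs/trcrpt.* | sed -n 's/.* tag \([0-9]*\) .*/\1/p' | sort | uniq -c | sort -rn | head
./tsfindinode -i <inode_num> <fs_mount_point>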

For the IO trace analysis there is an older tool:
/usr/lpp/mmfs/samples/debugtools/trsum.awk.
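
A typical call would be something like the following (from memory, so check the comments at the top of the script for the exact options; details=0 gives the summary view used elsewhere in this thread):

/usr/lpp/mmfs/samples/debugtools/trsum.awk details=0 /tmp/mmfs/trcrpt.<timestamp>.<node> > trsum.out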

Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README)

Hope that helps a bit.

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefalke at de.ibm.com

IBM Services

IBM Data Privacy Statement

IBM Deutschland Business & Technology Services GmbH
Geschäftsführung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft:
Ehningen
Registergericht: Amtsgericht Stuttgart, HRB 17122



From: "Czauz, Kamil" <Kamil.Czauz at Squarepoint-Capital.com>
To: "gpfsug-discuss at spectrumscale.org"
<gpfsug-discuss at spectrumscale.org>
Date: 11/11/2020 23:36
Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process
Sent by: gpfsug-discuss-bounces at spectrumscale.org



We regularly run into performance issues on our clients where the client seems to hang when accessing any GPFS mount - even something simple like an ls can take a few minutes to complete. This affects every GPFS mount on the client, but other clients are working just fine. Also, the mmfsd process is spinning at something like 300-500% CPU at that point.

The only way I have found to solve this is by killing processes that may be doing heavy I/O to the GPFS mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one.

My question is really about finding the offending process more easily. Is there something similar to iotop, or a trace that I can enable, that can tell me what files/processes are being heavily used by the mmfsd process on the client?

-Kamil
Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation.
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss





-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20201120/8259daa6/attachment-0002.htm>


More information about the gpfsug-discuss mailing list