[gpfsug-discuss] Clarification of mmdiag --iohist output

Sven Oehme oehmes at gmail.com
Tue Feb 19 19:46:36 GMT 2019

Just to add a bit more detail to that: if you want to track down an individual I/O, or all I/O to a particular file, you can do it with mmfsadm dump iohist (mmdiag doesn’t give you everything you need):


So run /usr/lpp/mmfs/bin/mmfsadm dump iohist > iohist on the server as well as on the client:


I/O history:


 I/O start time RW    Buf type disk:sectorNum     nSec  time ms      tag1         tag2           Disk UID typ        NSD node   context thread                           comment

--------------- -- ----------- ----------------- -----  ------- --------- ------------ ------------------ --- --------------- --------- -------------------------------- -------

12:22:41.880663  W        data    1:5602050048   32768  927.272 258249737          900  0A2405A0:5C6BD9B0 cli Prefetch  WritebehindWorkerThread

12:22:42.038653  W        data    4:5815107584   32768  803.106 258249737          903  0A2405A0:5C6BD9B6 cli Prefetch  WritebehindWorkerThread

12:22:42.504966  W        data    3:695664640    32768  375.272 258249737          918  0A2405A0:5C6BD9B2 cli Prefetch  WritebehindWorkerThread

12:22:42.592712  W        data    1:1121779712   32768  311.026 258249737          920  0A2405A0:5C6BD9B0 cli Prefetch  WritebehindWorkerThread

12:22:42.641689  W        data    2:1334837248   32768  350.373 258249737          921  0A2405A0:5C6BD9B4 cli Prefetch  WritebehindWorkerThread

12:22:42.301120  W        data    1:6667337728   32768  758.629 258249737          912  0A2405A0:5C6BD9B0 cli Prefetch  WritebehindWorkerThread

12:22:42.176365  W        data    1:6241222656   32768  895.423 258249737          908  0A2405A0:5C6BD9B0 cli Prefetch  WritebehindWorkerThread

12:22:42.283152  W        data    4:6454280192   32768  840.528 258249737          911  0A2405A0:5C6BD9B6 cli Prefetch  WritebehindWorkerThread

12:22:42.149964  W        data    4:6028165120   32768  981.661 258249737          907  0A2405A0:5C6BD9B6 cli Prefetch  WritebehindWorkerThread

12:22:42.130402  W        data    3:6028165120   32768 1021.175 258249737          906  0A2405A0:5C6BD9B2 cli Prefetch  WritebehindWorkerThread

12:22:42.838850  W        data    2:1867481088   32768  343.912 258249737          925  0A2405A0:5C6BD9B4 cli Prefetch  WritebehindWorkerThread

12:22:42.841800  W        data    3:1867481088   32768  397.089 258249737          926  0A2405A0:5C6BD9B2 cli Prefetch  WritebehindWorkerThread

12:22:42.652912  W        data    3:1334837248   32768  637.628 258249737          922  0A2405A0:5C6BD9B2 cli Prefetch  WritebehindWorkerThread

12:22:42.883946  W        data    1:1974009856   32768  442.953 258249737          928  0A2405A0:5C6BD9B0 cli Prefetch  WritebehindWorkerThread

12:22:42.903782  W        data    3:1974009856   32768  424.285 258249737          930  0A2405A0:5C6BD9B2 cli Prefetch  WritebehindWorkerThread

12:22:42.329905  W        data    4:269549568    32768 1061.313 258249737          915  0A2405A0:5C6BD9B6 cli Prefetch  WritebehindWorkerThread

12:22:42.392467  W        data    1:376078336    32768  998.770 258249737          916  0A2405A0:5C6BD9B0 cli Prefetch  WritebehindWorkerThread


In this example I only care about one file, with inode number 258249737 (which is stored in tag1):


Now, on the server, simply run: grep '258249737' iohist


19:22:42.533259  W        data    1:5602050048   32768  283.016 258249737          900  0A2405A0:5C6BD9B0 srv NSDWorker NSDThread

19:22:42.604062  W        data    1:1121779712   32768  308.015 258249737          920  0A2405A0:5C6BD9B0 srv NSDWorker NSDThread

19:22:42.751549  W        data    1:6667337728   32768  316.536 258249737          912  0A2405A0:5C6BD9B0 srv NSDWorker NSDThread

19:22:42.722716  W        data    1:6241222656   32768  357.409 258249737          908  0A2405A0:5C6BD9B0 srv NSDWorker NSDThread

19:22:43.030353  W        data    1:1974009856   32768  304.887 258249737          928  0A2405A0:5C6BD9B0 srv NSDWorker NSDThread

19:22:43.103745  W        data    1:376078336    32768  295.835 258249737          916  0A2405A0:5C6BD9B0 srv NSDWorker NSDThread


So you can now see all the blocks of that file (tag2) that went to this particular NSD server and how much time each took to issue against the media. For each tag1:tag2 pair on the client, find the corresponding line on the server. Subtracting the server time from the client time for each line gives you the network/client delay.
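A minimal sketch (not an official tool) of that matching step: index the server dump by its tag1:tag2 pair, then print the client-minus-server service time for each matching I/O. The file names iohist.client and iohist.server are illustrative, and the sample lines below are taken from the two dumps shown above.

```shell
# Sample lines from the client-side dump (tag pair 258249737:900 and :920).
cat > iohist.client <<'EOF'
12:22:41.880663  W  data  1:5602050048  32768  927.272  258249737  900  0A2405A0:5C6BD9B0  cli  Prefetch  WritebehindWorkerThread
12:22:42.592712  W  data  1:1121779712  32768  311.026  258249737  920  0A2405A0:5C6BD9B0  cli  Prefetch  WritebehindWorkerThread
EOF

# The same two I/Os as seen in the server-side dump.
cat > iohist.server <<'EOF'
19:22:42.533259  W  data  1:5602050048  32768  283.016  258249737  900  0A2405A0:5C6BD9B0  srv  NSDWorker  NSDThread
19:22:42.604062  W  data  1:1121779712  32768  308.015  258249737  920  0A2405A0:5C6BD9B0  srv  NSDWorker  NSDThread
EOF

# After whitespace splitting: $6 = service time (ms), $7 = tag1, $8 = tag2.
# First pass stores server times keyed on tag1:tag2; second pass prints
# the per-I/O delta, i.e. the network/client delay described above.
awk 'NR==FNR { srv[$7 ":" $8] = $6; next }
     ($7 ":" $8) in srv { printf "%s:%s network/client delay %.3f ms\n",
                          $7, $8, $6 - srv[$7 ":" $8] }' \
    iohist.server iohist.client
```

For the first I/O above this prints a delay of 927.272 - 283.016 = 644.256 ms spent outside the server's media I/O.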






From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Steve Crusan <scrusan at ddn.com>
Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: Tuesday, February 19, 2019 at 12:29 AM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output


Context is key here. Where you run mmdiag --iohist matters: client side or NSD server side.

From what I have seen, and possibly understand: on the client, the time field indicates how long it took for an I/O to be fully serviced (arriving at the VFS layer, to be sent on to the application), including the RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took the server to write or read data to or from disk.

For example, a few years ago I had to fix a system that was described as unacceptably slow after an upgrade from GPFS 3.3 to 3.4 or 3.5 (I don’t fully remember which).

iohist on the client side was showing many I/Os waiting anywhere from 10 all the way up to 50 SECONDS, not ms. Server side, iohist showed I/O being serviced in less than 5 ms, which was also verified with iostat; each NSD server was doing a paltry 25 MB/s.

What happened is that the small vs. large I/O queueing system changed in that version of GPFS, so there were hundreds of large I/Os queued (found via mmfsadm dump nsd on the server side) due to the limited number of large queues and threads on the servers. A quick mmchconfig fixed the problem, but if I had only looked at the servers, it would have appeared that things were fine, because the I/O backend was sitting around twiddling its thumbs.
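For reference, the knobs involved live in mmchconfig. The parameter values and the node class name below are illustrative, not a recommendation (check the NSD server tuning guidance for your release before changing anything), but this is the family of settings that controls how many NSD queues and worker threads the server creates:

```shell
# Illustrative only -- correct values depend on release and hardware.
# "nsdNodes" is a placeholder for a node class containing the NSD servers.
mmchconfig nsdMaxWorkerThreads=1024 -N nsdNodes   # total NSD worker threads
mmchconfig nsdThreadsPerQueue=12 -N nsdNodes      # threads serving each queue
mmchconfig nsdSmallThreadRatio=1 -N nsdNodes      # small:large queue ratio
```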

I don’t have access to the code, but all of the behavior I have seen leads me to believe that client-side iohist includes the network RTT.


From: gpfsug-discuss-bounces at spectrumscale.org <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Aaron Knister <aaron.knister at gmail.com>
Sent: Sunday, February 17, 2019 8:26:23 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output 


Hi Kevin,


It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server, which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code.


Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. My belief is that the TRACE_IO QIO entry marks the point at which the "i/o timer" reported by mmdiag --iohist starts ticking, and the testing data seems to support this. If I'm correct, the trace shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this is that in my testing, mmdiag --iohist on the client shows an average latency of ~230us for a 4K read, whereas mmdiag --iohist on the NSD server shows an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX).


0.000218276     37005   TRACE_DISK: doReplicatedRead: da 34:490710888
0.000218424     37005   TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread)
0.000218566     37005   TRACE_DLEASE: checkLeaseForIO: rc 0


0.000218628     37005   TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0
0.000218672     37005   TRACE_FS: verify4KIO exit: code 4 err 0
0.000219106     37005   TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0
0.000219408     37005   TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0
0.000220105     37005   TRACE_TS: sendMessage msg_id 22993695: dest <c0n2> sto03
0.000220436     37005   TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node <c0n2> msg_id 22993695 len 92

0.000221111      37005   TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11
0.000221662     37005   TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC

0.000426716     16691   TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144

0.000432140     37005   TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None
0.000432163     37005   TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0
0.000433707     37005   TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0
0.000433777     37005   TRACE_NSD: nsdDoIO exit: err 0 0


0.000433844     37005   TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0
0.000434236     37005   TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread)


I'd suggest looking at "mmdiag --iohist" on the NSD server itself and seeing if/how it differs from the client. The other thing you could do is check whether your NSD server queues are backing up (e.g. run "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). A backlog doesn't necessarily mean you need to tune your queues, but if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly taxed) while requests are still queuing, you could likely benefit from queue tuning.
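A rough sketch of that check. The saferdump output format varies by release, so the sample below is a mock-up of the fields named above, not captured output; on a live system you would pipe /usr/lpp/mmfs/bin/mmfsadm saferdump nsd into the awk instead.

```shell
# Mock-up of the relevant queue lines -- NOT real saferdump output.
cat > nsd.dump <<'EOF'
Queue 0: LARGE active 4 requests pending 212
Queue 1: LARGE active 0 requests pending 0
Queue 2: SMALL active 1 requests pending 0
EOF

# Flag queues that are both busy (active > 0) and backing up (pending > 0),
# the combination the paragraph above says is worth investigating.
awk '/requests pending/ {
       for (i = 1; i <= NF; i++) {
         if ($i == "active")  a = $(i + 1)
         if ($i == "pending") p = $(i + 1)
       }
       if (a > 0 && p > 0) print $1, $2, "requests pending:", p
     }' nsd.dump
```

With the mock data, only Queue 0 is flagged: it has 4 active threads and 212 requests waiting.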




On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L <Kevin.Buterbaugh at vanderbilt.edu> wrote:

Hi All, 


Been reading man pages, docs, and Googling, and haven’t found a definitive answer to this question, so I knew exactly where to turn… ;-)


I’m dealing with some slow I/Os to certain storage arrays in our environments … like really, really slow I/Os … here’s just one example from one of my NSD servers of a 10-second I/O:


08:49:34.943186  W        data   30:41615622144   2048 10115.192  srv   dm-92                  <client IP redacted>


So here’s my question … when mmdiag --iohist tells me that that I/O took slightly over 10 seconds, is that:


1.  The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client?

2.  The time from when the client issued the I/O request until it received the data back from the NSD server?

3.  Something else?


I’m thinking it’s #1, but want to confirm.  Which one it is has very obvious implications for our troubleshooting steps.  Thanks in advance…




Kevin Buterbaugh - Senior System Administrator

Vanderbilt University - Advanced Computing Center for Research and Education

Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633


gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org


