[gpfsug-discuss] Clarification of mmdiag --iohist output

Tue Feb 19 21:02:51 GMT 2019

Steve,
Would the small vs large queuing setting be visible in mmlsconfig output somewhere? Which settings would control that?

-jimk

> On Feb 17, 2019, at 6:48 PM, Steve Crusan <scrusan at ddn.com> wrote:
> 
> Context is key here. Where you run mmdiag --iohist matters: client side or NSD server side.
> 
> From what I have seen and possibly understand, from the client, the time field indicates when an I/O was fully serviced (arrived into VFS layer, to be sent to the application), including RTT from the servers/disk. If you run the same command server side, my understanding is that the time field indicates how long it took for the server to write/read data to or from disk.
> 
> For example, a few years ago I had to fix a system which was described as unacceptably slow after an upgrade from GPFS 3.3 to 3.4 or 3.5 (don’t fully remember).
> 
> Iohist client side was showing many I/Os waiting anywhere from 10 all the way up to 50 SECONDS, not ms. Server side, iohist showed I/O being serviced in less than 5 ms. I also verified with iostat: each NSD server was doing a paltry 25MB/s.
> 
> What happened is that the small vs large queueing system changed in that version of GPFS, so there were hundreds of large I/Os queued (found via mmfsadm dump nsd server side) due to the limited number of large queues and threads server side. A quick mmchconfig fixed the problem, but if I had only looked at the servers, it would’ve appeared things were fine, because the I/O backend was sitting around twirling its thumbs.
> 
> I don’t have access to the code, but all of the behavior I have seen leads me to believe client side iohist includes network RTT.
> 
> -Steve 
> From: gpfsug-discuss-bounces at spectrumscale.org <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Aaron Knister <aaron.knister at gmail.com>
> Sent: Sunday, February 17, 2019 8:26:23 AM
> To: gpfsug main discussion list
> Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output
>  
> Hi Kevin,
> 
> It's funny you bring this up because I was looking at this yesterday. My belief is that it's the time from when the I/O request was queued internally by the client to when the I/O response was received from the NSD server, which means it absolutely includes the network RTT. It would be great to get formal confirmation of this from someone who knows the code.
> 
> Here's some trace data showing a single 4K read from an NSD client. I've stripped out a bunch of uninteresting stuff. It's my belief that the TRACE_IO QIO entry indicates the point at which the "i/o timer" reported on by mmdiag --iohist begins ticking. The testing data seems to support this. If I'm correct, the testing data shows that the RDMA I/O to the NSD server occurs within the TRACE_IO timing window. The other thing that makes me believe this is that, in my testing, mmdiag --iohist on the client shows an average latency of ~230us for a 4K read, whereas mmdiag --iohist on the NSD server appears to show an average latency of ~170us when servicing those 4K reads from the back-end disk (a DDN SFA14KX).
> 
> 0.000218276     37005   TRACE_DISK: doReplicatedRead: da 34:490710888
> 0.000218424     37005   TRACE_IO: QIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdName S01_DMD_NSD_034 da 34:490710888 nSectors 8 align 0 by iocMBHandler (DioHandlerThread)
> 0.000218566     37005   TRACE_DLEASE: checkLeaseForIO: rc 0
> 
> 0.000218628     37005   TRACE_IO: SIO: ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 doioP 0x1806994D300 isDaemonAddr 0
> 0.000218672     37005   TRACE_FS: verify4KIO exit: code 4 err 0
> 0.000219106     37005   TRACE_NSD: nsdDoIO enter: read ioVecSize 1 1st bufAddr 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nBytes 4096 isDaemonAddr 0
> 0.000219408     37005   TRACE_GCRYPTO: EncBufPool::getTmpBuf(): bsize=4096 code=205 err=0 bufP=0x180656C1058 outBufP=0x180656C1058 index=0
> 0.000220105     37005   TRACE_TS: sendMessage msg_id 22993695: dest <c0n2> 10.3.17.3 sto03
> 0.000220436     37005   TRACE_RDMA: verbs::verbsSend_i: enter: index 11 cookie 12 node <c0n2> msg_id 22993695 len 92
> 0.000221111     37005   TRACE_RDMA: verbsConn::postSend: currRpcHeadP 0x7F6D7FA106E8 sendType SEND_BUFFER_RPC index 11 cookie 12 threadId 37005 bufferId.index 11
> 0.000221662     37005   TRACE_MUTEX: Waiting on fast condvar for signal 0x7F6D7FA106F8 RdmaSend_NSD_SVC
> 
> 0.000426716     16691   TRACE_RDMA: handleRecvComp: success 1 of 1 nWrSuccess 1 index 11 cookie 12 wr_id 0xB0E6E00000000 bufferP 0x7F6D7EEE1700 byte_len 4144
> 
> 0.000432140     37005   TRACE_NSD: nsdDoIO_ReadAndCheck: read complete, len 0 status 6 err 0 bufP 0x180656C1058 dioIsOverRdma 1 ioDataP 0x200000BC000 ckSumType NsdCksum_None
> 0.000432163     37005   TRACE_NSD: nsdDoIO_ReadAndCheck: exit err 0
> 0.000433707     37005   TRACE_GCRYPTO: EncBufPool::releaseTmpBuf(): exit bsize=8192 err=0 inBufP=0x180656C1058 bufP=0x180656C1058 index=0
> 0.000433777     37005   TRACE_NSD: nsdDoIO exit: err 0 0
> 
> 0.000433844     37005   TRACE_IO: FIO: read data tag 743942 108137 ioVecSize 1 1st buf 0x122F000 nsdId 0A011103:5C59DBAC da 34:490710888 nSectors 8 err 0
> 0.000434236     37005   TRACE_DISK: postIO: qosid A00D91E read data disk FFFFFFFF ioVecSize 1 1st buf 0x122F000 err 0 duration 0.000215000 by iocMBHandler (DioHandlerThread)
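
To make the timing argument above concrete, here is a minimal Python sketch that subtracts the timestamps in the trace. It assumes, as Aaron does, that the QIO entry starts the iohist timer and the FIO entry stops it; that is a reading of the trace, not something confirmed from the code. The trace lines are abbreviated copies of the ones quoted above, and the helper names are made up for illustration.

```python
# Abbreviated copies of four lines from the client-side trace above.
# Field 1 is the timestamp in seconds, field 2 the thread id.
TRACE = """\
0.000218424 37005 TRACE_IO: QIO: read data tag 743942 108137
0.000220436 37005 TRACE_RDMA: verbs::verbsSend_i: enter
0.000426716 16691 TRACE_RDMA: handleRecvComp: success 1 of 1
0.000433844 37005 TRACE_IO: FIO: read data tag 743942 108137
"""


def first_timestamp(trace, marker):
    """Return the timestamp (seconds) of the first line containing marker."""
    for line in trace.splitlines():
        if marker in line:
            return float(line.split()[0])
    raise ValueError(f"marker not found: {marker}")


def window_us(trace, start_marker, end_marker):
    """Elapsed time in microseconds between two trace markers."""
    start = first_timestamp(trace, start_marker)
    end = first_timestamp(trace, end_marker)
    return (end - start) * 1e6


# Assumed iohist window: QIO (queued) to FIO (finished).
io_window = window_us(TRACE, "TRACE_IO: QIO", "TRACE_IO: FIO")
# RDMA round trip: send posted to receive completion.
rdma_window = window_us(TRACE, "verbsSend_i", "handleRecvComp")

print(f"QIO -> FIO window:     {io_window:.2f} us")
print(f"RDMA send -> recv:     {rdma_window:.2f} us")
print(f"left for client side:  {io_window - rdma_window:.2f} us")
```

The QIO to FIO window comes out at about 215 us, which matches the "duration 0.000215000" in the postIO line, and the RDMA round trip of about 206 us sits entirely inside it. That is consistent with the client-side number including the network RTT, though it does not prove where the timer starts.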
> 
> I'd suggest looking at "mmdiag --iohist" on the NSD server itself and see if/how that differs from the client. The other thing you could do is see if your NSD server queues are backing up (e.g. "mmfsadm saferdump nsd" and look for "requests pending" on queues where the "active" field is > 0). That doesn't necessarily mean you need to tune your queues but I'd suggest that if the disk I/O on your NSD server looks healthy (e.g. low latency, not overly-taxed) that you could benefit from queue tuning.
> 
> -Aaron
> 
> On Sat, Feb 16, 2019 at 9:47 AM Buterbaugh, Kevin L <Kevin.Buterbaugh at vanderbilt.edu> wrote:
> Hi All,
> 
> Been reading man pages, docs, and Googling, and haven’t found a definitive answer to this question, so I knew exactly where to turn… ;-)
> 
> I’m dealing with some slow I/O’s to certain storage arrays in our environments … like really, really slow I/O’s … here’s just one example from one of my NSD servers of a 10 second I/O:
> 
> 08:49:34.943186  W        data   30:41615622144   2048 10115.192  srv   dm-92                  <client IP redacted>
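
For picking entries like this out of a longer history, a small Python sketch along the following lines may help. The field names are an assumption based on the usual iohist column order (start time, R/W, buffer type, disk:sector, sector count, time in ms, type, device); check them against the header your own mmdiag --iohist prints. The 1000 ms threshold is arbitrary and purely illustrative.

```python
from typing import NamedTuple


class IoHistEntry(NamedTuple):
    # Field names are assumed from the usual iohist header, not from the code.
    start: str
    rw: str
    buf_type: str
    disk_sector: str
    n_sectors: int
    time_ms: float
    io_type: str
    device: str


def parse_iohist_line(line):
    """Split one whitespace-separated iohist line into named fields."""
    fields = line.split()
    if len(fields) < 8:
        raise ValueError(f"not an iohist data line: {line!r}")
    return IoHistEntry(
        start=fields[0],
        rw=fields[1],
        buf_type=fields[2],
        disk_sector=fields[3],
        n_sectors=int(fields[4]),
        time_ms=float(fields[5]),
        io_type=fields[6],
        device=fields[7],
    )


# The sample line from the message above, copied verbatim.
SAMPLE = ("08:49:34.943186  W        data   30:41615622144   2048 10115.192"
          "  srv   dm-92                  <client IP redacted>")

SLOW_MS = 1000.0  # illustrative threshold, not a recommended value

entry = parse_iohist_line(SAMPLE)
is_slow = entry.time_ms > SLOW_MS
print(f"{entry.start} {entry.rw} {entry.device}: {entry.time_ms / 1000:.1f} s, slow={is_slow}")
```

Running the same filter over iohist output collected on both the client and the NSD server, as suggested elsewhere in this thread, would show whether the slow entries appear on one side only.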
> 
> So here’s my question … when mmdiag --iohist tells me that that I/O took slightly over 10 seconds, is that:
> 
> 1.  The time from when the NSD server received the I/O request from the client until it shipped the data back onto the wire towards the client?
> 2.  The time from when the client issued the I/O request until it received the data back from the NSD server?
> 3.  Something else?
> 
> I’m thinking it’s #1, but want to confirm.  Which one it is has very obvious implications for our troubleshooting steps.  Thanks in advance…
> 
> Kevin
> —
> Kevin Buterbaugh - Senior System Administrator
> Vanderbilt University - Advanced Computing Center for Research and Education
> Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633
> 
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
