[gpfsug-discuss] Clarification of mmdiag --iohist output

Sven Oehme oehmes at gmail.com
Fri Mar 1 01:33:58 GMT 2019


Hi,

Using nsdSmallThreadRatio 1 is not necessarily correct, as it
'significantly depends' (the most used word combination of performance
engineers) on your workload. To give some more background: on reads you
need many more threads for small I/Os than for large I/Os to get maximum
performance. The reason is that a small I/O usually reads only one strip
of data (sitting on one physical device), while a large I/O reads an
entire stripe (which typically spans multiple devices). As a more
concrete example, in an 8+2P RAID setup a single full-stripe read
triggers internal reads to 8 different targets in parallel, so for small
I/Os you would need 8 times as many read requests (and therefore
threads) to keep the drives busy at the same level. Writes are even more
complex: a large full-stripe write usually just writes to all target
disks, while a tiny write in the middle of a stripe may force a
read/modify/write, which can have huge write amplification and cause
more work than a large full-track I/O. RAID controller caches also play
a significant role here and make this especially hard to optimize: you
need to know exactly what and where to measure when you tune, so that
you improve your real-world workload rather than just improving your
synthetic test while actually hurting your real application performance.
I should write a book about this some day ;-)

Hope that helps. Sven




On Thu, Feb 21, 2019 at 4:23 AM Frederick Stock <stockf at us.ibm.com> wrote:

> Kevin, I'm assuming you have seen the article on IBM developerWorks about
> the GPFS NSD queues.  It provides useful background for analyzing the dump
> nsd information.  Here I'll list some thoughts for items that you can
> investigate/consider.
>
> If your NSD servers are doing both large (greater than 64K) and small (64K
> or less) IOs then you want to have the nsdSmallThreadRatio set to 1 as it
> seems you do for the NSD servers.  This provides an equal number of SMALL
> and LARGE NSD queues.  You can also increase the total number of queues
> (currently 256) but I cannot determine if that is necessary from the data
> you provided.  Only on rare occasions have I seen a need to increase the
> number of queues.
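> If you do decide to change either one, the usual knob is mmchconfig.  As
> a sketch only (the node class name nsdNodes is just a placeholder, and
> you should verify both parameters against your code level before
> touching anything):
>
>     # hypothetical example: equal SMALL/LARGE queues on the NSD servers
>     mmchconfig nsdSmallThreadRatio=1 -N nsdNodes
>     # hypothetical example: raise the total NSD queue count from 256
>     mmchconfig nsdMultiQueue=512 -N nsdNodes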
>
> The fact that you have 71 highest pending on your LARGE queues and 73
> highest pending on your SMALL queues would imply your IOs are queueing for
> a good while either waiting for resources in GPFS or waiting for IOs to
> complete.  Your maximum buffer size is 16M which is defined to be the
> largest IO that can be requested by GPFS.  This is the buffer size that
> GPFS will use for LARGE IOs.  You indicated you had sufficient memory on
> the NSD servers but what is the value for the pagepool on those servers,
> and what is the value of the nsdBufSpace parameter?  If a node serves only
> as an NSD server then nsdBufSpace is usually set to 70.  The IO buffers
> used by the NSD server come from the pagepool, so you need sufficient
> space there for the maximum number of LARGE IO buffers that would be in
> use concurrently by GPFS, or threads will have to wait for buffers to
> become available.  Essentially you want to ensure that the memory needed
> for the maximum number of concurrent LARGE IOs stays below 70% of the
> pagepool size.
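> As a rough sanity check (a sketch only, not exact accounting, since real
> buffer usage depends on the mix of IO sizes):
>
>     # pagepool and nsdBufSpace as configured
>     mmlsconfig pagepool
>     mmlsconfig nsdBufSpace
>     # worst case for LARGE buffers with your settings would be roughly
>     #   1024 worker threads x 16 MiB buffers = 16 GiB
>     # which needs to fit within nsdBufSpace% (typically 70%) of the pagepool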
>
> You could look at the settings for the FC cards to ensure they are
> configured to do the largest IOs possible.  I forget the actual values
> (have not done this for a while) but there are settings for the adapters
> that control the maximum IO size that will be sent.  I think you want this
> to be as large as the adapter can handle to reduce the number of messages
> needed to complete the large IOs done by GPFS.
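> As a starting point on Linux, the block layer limits are visible in
> sysfs (a sketch; the HBA driver module options themselves vary by vendor
> and driver version, so check the QLogic documentation for those):
>
>     # largest IO, in KiB, the kernel will currently issue to this device
>     cat /sys/block/dm-92/queue/max_sectors_kb
>     # the hardware/driver ceiling for the same device
>     cat /sys/block/dm-92/queue/max_hw_sectors_kb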
>
>
> Fred
> __________________________________________________
> Fred Stock | IBM Pittsburgh Lab | 720-430-8821
> stockf at us.ibm.com
>
>
>
> ----- Original message -----
> From: "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu>
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>
> Cc:
> Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output
> Date: Thu, Feb 21, 2019 6:39 AM
>
> Hi All,
>
> My thanks to Aaron, Sven, Steve, and whoever responded for the GPFS team.
> You confirmed what I suspected … my example 10 second I/O was _from an NSD
> server_ … and since we’re in an 8 Gb FC SAN environment, it therefore means
> - correct me if I’m wrong about this someone - that I’ve got a problem
> somewhere in one (or more) of the following 3 components:
>
> 1) the NSD servers
> 2) the SAN fabric
> 3) the storage arrays
>
> I’ve been looking at all of the above and none of them are showing any
> obvious problems.  I’ve actually got a techie from the storage array vendor
> stopping by on Thursday, so I’ll see if he can spot anything there.  Our FC
> switches are QLogic’s, so I’m kinda screwed there in terms of getting any
> help.  But I don’t see any errors in the switch logs and “show perf” on the
> switches is showing I/O rates of 50-100 MB/sec on the in use ports, so I
> don’t _think_ that’s the issue.
>
> And this is the GPFS mailing list, after all … so let’s talk about the NSD
> servers.  Neither memory (64 GB) nor CPU (2 x quad-core Intel Xeon E5620’s)
> appear to be an issue.  But I have been looking at the output of “mmfsadm
> saferdump nsd” based on what Aaron and then Steve said.  Here’s some fairly
> typical output from one of the SMALL queues (I’ve checked several of my 8
> NSD servers and they’re all showing similar output):
>
>     Queue NSD type NsdQueueTraditional [244]: SMALL, threads started 12,
> active 3, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0
>      requests pending 0, highest pending 73, total processed 4859732
>      mutex 0x7F3E449B8F10, reqCond 0x7F3E449B8F58, thCond 0x7F3E449B8F98,
> queue 0x7F3E449B8EF0, nFreeNsdRequests 29
>
> And for a LARGE queue:
>
>     Queue NSD type NsdQueueTraditional [8]: LARGE, threads started 12,
> active 1, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0
>      requests pending 0, highest pending 71, total processed 2332966
>      mutex 0x7F3E441F3890, reqCond 0x7F3E441F38D8, thCond 0x7F3E441F3918,
> queue 0x7F3E441F3870, nFreeNsdRequests 31
>
> So my large queues seem to be slightly less utilized than my small queues
> overall … i.e. I see more inactive large queues and they generally have a
> smaller “highest pending” value.
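> (For what it's worth, I pulled those queue numbers out of the dump with
> a simple grep; the exact field layout may differ on other code levels:)
>
>     mmfsadm saferdump nsd | grep -E 'Queue NSD type|requests pending'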
>
> Question:  are those non-zero “highest pending” values something to be
> concerned about?
>
> I have the following thread-related parameters set:
>
> [common]
> maxReceiverThreads 12
> nsdMaxWorkerThreads 640
> nsdThreadsPerQueue 4
> nsdSmallThreadRatio 3
> workerThreads 128
>
> [serverLicense]
> nsdMaxWorkerThreads 1024
> nsdThreadsPerQueue 12
> nsdSmallThreadRatio 1
> pitWorkerThreadsPerNode 3
> workerThreads 1024
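> (Those are the configured values; the effective values on a given NSD
> server can be double-checked with something like:)
>
>     mmdiag --config | grep -iE 'nsdMaxWorkerThreads|nsdThreadsPerQueue|nsdSmallThreadRatio|workerThreads'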
>
> Also, at the top of the “mmfsadm saferdump nsd” output I see:
>
> Total server worker threads: running 1008, desired 147, forNSD 147, forGNR
> 0, nsdBigBufferSize 16777216
> nsdMultiQueue: 256, nsdMultiQueueType: 1, nsdMinWorkerThreads: 16,
> nsdMaxWorkerThreads: 1024
>
> Question:  is the fact that 1008 is pretty close to 1024 a concern?
>
> Anything jump out at anybody?  I don’t mind sharing full output, but it is
> rather lengthy.  Is this worthy of a PMR?
>
> Thanks!
>
> --
> Kevin Buterbaugh - Senior System Administrator
> Vanderbilt University - Advanced Computing Center for Research and
> Education
> Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633
>
>
> On Feb 17, 2019, at 1:01 PM, IBM Spectrum Scale <scale at us.ibm.com> wrote:
>
> Hi Kevin,
>
> The I/O history shown by mmdiag --iohist depends on the node from which
> you run the command.
> If you run it on an NSD server node, it shows the time taken to
> complete/serve the read or write I/O operation sent from the client node.
> If you run it on a client (non NSD server) node, it shows the complete
> time taken for the read or write I/O operation requested by that client
> node to finish.
> So, in a nutshell: for the NSD server case it is just the latency of the
> I/O done on disk by the server, whereas for the NSD client case it also
> includes the latency of sending the I/O request to the NSD server and
> receiving the reply, on top of the latency of the I/O done on disk by
> the NSD server.
> I hope this answers your query.
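> (A quick way to see the difference in practice, as a sketch: capture the
> history on an NSD server and on a client during a slow period and
> compare the times reported for the same disk sectors.)
>
>     # on the NSD server the time column is the disk service time only;
>     # on an NSD client it additionally includes the network round trip
>     mmdiag --iohist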
>
>
> Regards, The Spectrum Scale (GPFS) team
>
>
> ------------------------------------------------------------------------------------------------------------------
> If you feel that your question can benefit other users of  Spectrum Scale
> (GPFS), then please post it to the public IBM developerWorks Forum at
> https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479
> .
>
> If your query concerns a potential software error in Spectrum Scale (GPFS)
> and you have an IBM software maintenance contract please contact
> 1-800-237-5511 in the United States or your local IBM
> Service Center in other countries.
>
> The forum is informally monitored as time permits and should not be used
> for priority messages to the Spectrum Scale (GPFS) team.
>
>
>
> From:        "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu>
> To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date:        02/16/2019 08:18 PM
> Subject:        [gpfsug-discuss] Clarification of mmdiag --iohist output
> Sent by:        gpfsug-discuss-bounces at spectrumscale.org
> ------------------------------
>
>
>
> Hi All,
>
> Been reading man pages, docs, and Googling, and haven’t found a definitive
> answer to this question, so I knew exactly where to turn… ;-)
>
> I’m dealing with some slow I/O’s to certain storage arrays in our
> environments … like really, really slow I/O’s … here’s just one example
> from one of my NSD servers of a 10 second I/O:
>
> 08:49:34.943186  W        data   30:41615622144   2048 10115.192  srv
> dm-92                  <client IP redacted>
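> (For reference, I spotted these by filtering the history for long
> service times; a quick hack, and the column positions may differ by
> release:)
>
>     # print any IO in the history that took longer than 1000 ms
>     # (the elapsed-time column is the 6th field in this output)
>     mmdiag --iohist | awk '$6+0 > 1000'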
>
> So here’s my question … when mmdiag --iohist tells me that that I/O took
> slightly over 10 seconds, is that:
>
> 1.  The time from when the NSD server received the I/O request from the
> client until it shipped the data back onto the wire towards the client?
> 2.  The time from when the client issued the I/O request until it received
> the data back from the NSD server?
> 3.  Something else?
>
> I’m thinking it’s #1, but want to confirm.  Which one it is has very
> obvious implications for our troubleshooting steps.  Thanks in advance…
>
> Kevin
> Kevin Buterbaugh - Senior System Administrator
> Vanderbilt University - Advanced Computing Center for Research and
> Education
> Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
>
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>


More information about the gpfsug-discuss mailing list