[gpfsug-discuss] iowait?

Yuri L Volobuev volobuev at us.ibm.com
Mon Aug 29 21:31:17 BST 2016


I would advise caution on using "mmdiag --iohist" heavily.  In more recent
code streams (V4.1, V4.2) there's a problem with internal locking that
could, under certain conditions, lead to symptoms that look very
similar to sporadic network blockage.  Basically, if "mmdiag --iohist" gets
blocked for long periods of time (e.g. due to local disk/NFS performance
issues), this may end up blocking an mmfsd receiver thread, delaying RPC
processing.  The problem was discovered fairly recently, and the fix hasn't
made it out to all service streams yet.
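In the meantime, if you do need to sample iohist, one way to reduce the
exposure (purely a workaround sketch, not a fix) is to keep the queries
infrequent and to send the output somewhere that can't stall the mmdiag
client, e.g. tmpfs, so that a slow local disk or an NFS home directory
isn't in the output path:

  /usr/lpp/mmfs/bin/mmdiag --iohist > /dev/shm/iohist.out 2>&1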

More generally, IO history is a valuable tool for troubleshooting disk IO
performance issues, but the tool doesn't have the right semantics for
regular, systemic IO performance sampling and monitoring.  The query
operation is too expensive, the coverage is subject to load, and the output
is somewhat unstructured.  With some effort, one can still build some form
of a roll-your-own monitoring implement, but this is certainly not an
optimal way of approaching the problem.  The data should be available in a
structured form, through a channel that supports light-weight, flexible
querying that doesn't impact mainline IO processing.  In Spectrum Scale,
this type of data is fed from mmfsd to Zimon, via an mmpmon interface, and
end users can then query Zimon for raw or partially processed data.  When it
comes to high-volume stats, retaining raw data at its full resolution is
only practical for relatively short periods of time (seconds, or perhaps a
small number of minutes), and some form of aggregation is necessary for
covering longer periods of time (hours to days).  In the current versions
of the product, there's a very similar type of data available this way: RPC
stats.  There are plans to make IO history data available in a similar
fashion.  The entire approach may need to be re-calibrated, however.
Making RPC stats available doesn't appear to have generated a surge of user
interest.  This is probably because the data is too complex for casual
processing, and while without doubt a lot of very valuable insight can be
gained by analyzing RPC stats, the actual effort required to do so is too
much for most users.  That is, we need to provide some tools for raw data
analytics.  Largely the same argument applies to IO stats.  In fact, on an
NSD client, IO stats are actually a subset of RPC stats.  With some effort,
one can perform a comprehensive analysis of NSD client IO stats by
analyzing NSD client-to-server RPC traffic.  One can certainly argue that
the effort required is a bit much though.
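To illustrate the roll-your-own route: the documented mmpmon interface is
the saner entry point for periodic sampling on a client.  A minimal sketch
(file names and option values here are purely illustrative):

  # ask for cumulative per-file-system IO counters
  echo fs_io_s > /tmp/mmpmon.in
  # -p: machine-parseable output, -s: suppress prompts,
  # -r 12 -d 5000: take 12 samples, 5000 ms apart
  /usr/lpp/mmfs/bin/mmpmon -p -s -i /tmp/mmpmon.in -r 12 -d 5000

The counters are cumulative, so a monitoring agent has to difference
successive samples; that still won't give per-IO latency the way iohist
does, but it's cheap enough to run everywhere.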

Getting back to the original question: would the proposed cxiWaitEventWait()
change work?  It'll likely result in nr_iowait being incremented every
time a thread in GPFS code performs an uninterruptible wait.  This could be
an act of performing an actual IO request, or something else, e.g. waiting
for a lock.  Those may be the desirable semantics in some scenarios, but I
wouldn't agree that it's the right behavior for any uninterruptible wait.
io_schedule() is intended for block device IO waits, so using it
this way is not in line with the code intent, which is never a good idea.
Besides, relative to schedule(), io_schedule() has some overhead that could
have performance implications of an uncertain nature.
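For reference, the extra work is visible in the kernel itself.  Roughly, in
a 3.x-era kernel (paraphrased here, details vary by version), io_schedule()
amounts to:

  void __sched io_schedule(void)
  {
          struct rq *rq = raw_rq();

          delayacct_blkio_start();      /* per-task delay accounting hook */
          atomic_inc(&rq->nr_iowait);   /* idle time now accounted as iowait */
          blk_flush_plug(current);      /* flush this task's plugged block IO */
          current->in_iowait = 1;
          schedule();                   /* the actual context switch */
          current->in_iowait = 0;
          atomic_dec(&rq->nr_iowait);
          delayacct_blkio_end();
  }

The per-runqueue atomics, the delay-accounting hooks and the plug flush are
the overhead referred to above, and the nr_iowait manipulation is exactly
what produces the iowait numbers the proposed change is after.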

yuri



From:	Bryan Banister <bbanister at jumptrading.com>
To:	gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
Date:	08/29/2016 11:06 AM
Subject:	Re: [gpfsug-discuss] iowait?
Sent by:	gpfsug-discuss-bounces at spectrumscale.org



Try this:

mmchconfig ioHistorySize=1024 # Or however big you want!

Cheers,
-Bryan

-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org [
mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Aaron Knister
Sent: Monday, August 29, 2016 1:05 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] iowait?

That's an interesting idea. I took a look at mmdiag --iohist on a busy node
and it doesn't seem to capture more than literally 1 second of history.
Is there a better way to grab the data or have GPFS capture more of it?

Just to give some more context, as part of our monthly reporting
requirements we calculate job efficiency by comparing the number of cpu
cores requested by a given job with the cpu % utilization during that job's
time window. Currently a job that's doing a sleep 9000 would show up the
same as a job blocked on I/O. Having GPFS wait time included in iowait
would allow us to easily make this distinction.

-Aaron

On 8/29/16 1:56 PM, Bryan Banister wrote:
> There is the iohist data that may have what you're looking for, -Bryan
>
> -----Original Message-----
> From: gpfsug-discuss-bounces at spectrumscale.org
> [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Aaron
> Knister
> Sent: Monday, August 29, 2016 12:54 PM
> To: gpfsug-discuss at spectrumscale.org
> Subject: Re: [gpfsug-discuss] iowait?
>
> Sure, we can and we do use both iostat/sar and collectl to collect disk
> utilization on our nsd servers. That doesn't give us insight, though, into
> any individual client node of which we've got 3500. We do log mmpmon data
> from each node but that doesn't give us any insight into how much time is
> being spent waiting on I/O. Having GPFS report iowait on client nodes would
> give us this insight.
>
> On 8/29/16 1:50 PM, Alex Chekholko wrote:
>> Any reason you can't just use iostat or collectl or any of a number
>> of other standard tools to look at disk utilization?
>>
>> On 08/29/2016 10:33 AM, Aaron Knister wrote:
>>> Hi Everyone,
>>>
>>> Would it be easy to have GPFS report iowait values in linux? This
>>> would be a huge help for us in determining whether a node's low
>>> utilization is due to some issue with the code running on it or if
>>> it's blocked on I/O, especially in a historical context.
>>>
>>> I naively tried on a test system changing schedule() in
>>> cxiWaitEventWait() on line ~2832 in gpl-linux/cxiSystem.c to this:
>>>
>>> again:
>>>   /* call the scheduler */
>>>   if ( waitFlags & INTERRUPTIBLE )
>>>     schedule();
>>>   else
>>>     io_schedule();
>>>
>>> Seems to actually do what I'm after but generally bad things happen
>>> when I start pretending I'm a kernel developer.
>>>
>>> Any thoughts? If I open an RFE would this be something that's
>>> relatively easy to implement (not asking for a commitment *to*
>>> implement it, just that I'm not asking for something seemingly
>>> simple that's actually fairly hard to implement)?
>>>
>>> -Aaron
>>>
>>
>
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight
> Center
> (301) 286-2776

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center
(301) 286-2776

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

