<font size=2 face="sans-serif">------------------------------------------<br>
Sven Oehme <br>
Scalable Storage Research <br>
email: oehmes@us.ibm.com <br>
Phone: +1 (408) 824-8904 <br>
IBM Almaden Research Lab <br>
------------------------------------------</font>
<br>
<br><tt><font size=2>gpfsug-discuss-bounces@gpfsug.org wrote on 09/04/2014
03:43:36 AM:<br>
<br>
> From: Salvatore Di Nardo <sdinardo@ebi.ac.uk></font></tt>
<br><tt><font size=2>> To: gpfsug main discussion list <gpfsug-discuss@gpfsug.org></font></tt>
<br><tt><font size=2>> Date: 09/04/2014 03:44 AM</font></tt>
<br><tt><font size=2>> Subject: Re: [gpfsug-discuss] gpfs performance
monitoring</font></tt>
<br><tt><font size=2>> Sent by: gpfsug-discuss-bounces@gpfsug.org</font></tt>
<br><tt><font size=2>> <br>
> On 04/09/14 01:50, Sven Oehme wrote:</font></tt>
<br><tt><font size=2>> > Hello everybody,<br>
> <br>
> Hi <br>
> <br>
> > here I come again, this time to ask for some hints about how to <br>
> monitor GPFS.<br>
> > <br>
> > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is <br>
> > that they return numbers based only on the requests done on the <br>
> > current host, so I would have to run them on all the clients (over 600 <br>
> > nodes), which is quite impractical. Instead I would like to know from <br>
> > the servers what's going on, and I came across the vio_s statistics, <br>
> > which are less documented, and I don't know exactly what they mean. <br>
> > There is also the script "/usr/lpp/mmfs/samples/vdisk/viostat" that<br>
> > runs VIO_S.<br>
> > <br>
> > My problem with the output of this command: <br>
> > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1<br>
> > <br>
> > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second<br>
> > timestamp: 1409763206/477366<br>
> > recovery group: *<br>
> > declustered array: *<br>
> > vdisk: *<br>
> > client reads: 2584229<br>
> > client short writes: 55299693<br>
> > client medium writes: 190071<br>
> > client promoted full track writes: 465145<br>
> > client full track writes: 9249<br>
> > flushed update writes: 4187708<br>
> > flushed promoted full track writes: 123<br>
> > migrate operations: 114<br>
> > scrub operations: 450590<br>
> > log writes: 28509602 <br>
> > <br>
> > it says "VIOPS per second", but they seem to me to be just counters, as <br>
> > every time I re-run the command the numbers increase by a bit. <br>
> > Can anyone confirm whether those numbers are counters or ops/sec?<br>
> <br>
> the numbers are cumulative, so every time you run the command they <br>
> show the values since start (or last reset) time. </font></tt>
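Since the counters are cumulative, a per-second rate can be derived by sampling vio_s twice and dividing the delta by the interval. A minimal sketch (the two samples are inlined here for illustration; in practice they would come from `echo vio_s | /usr/lpp/mmfs/bin/mmpmon`, and the "client reads" field name is taken from the output above):

```shell
# Sketch: turn two cumulative 'client reads' counters into an ops/sec rate.
# sample1/sample2 stand in for two vio_s dumps taken $interval seconds apart.
sample1="client reads: 2584229"
sample2="client reads: 2589229"
interval=10

c1=$(echo "$sample1" | awk -F': *' '{print $2}')
c2=$(echo "$sample2" | awk -F': *' '{print $2}')
echo "client reads/sec: $(( (c2 - c1) / interval ))"
```

The same diff-and-divide applies to any of the other counters in the dump.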
<br><tt><font size=2>> OK, you confirmed my thoughts, thanks.<br>
</font></tt>
<br><tt><font size=2>> <br>
> > <br>
> > On a closer look, I don't understand what most of those values <br>
> > mean. For example, what exactly is a "flushed promoted full track write"? <br>
> > I tried to find documentation about this output, but could not <br>
> > find any. Can anyone point me to a link where the output of vio_s is explained?<br>
> > <br>
> > Another thing I don't understand about those numbers is whether they are <br>
> > just operation counts, or the number of blocks that were read/written/etc. <br>
> <br>
> it's just operations, and if I explained what the numbers mean I <br>
> might confuse you even more, because this is not what you are really <br>
> looking for. <br>
> what you are looking for is what the client I/Os look like on the <br>
> server side, while the VIO layer is the server's side to the disks, so<br>
> it is one level lower than what you are looking for, from what I could read<br>
> of the description above. </font></tt>
<br><tt><font size=2>> No, what I'm looking at is exactly how busy the disks<br>
> are keeping up with the requests. Obviously that's not all I'm looking at, <br>
> but I feel the need to monitor those things as well. I'll explain why. <br>
> <br>
> It happens, when our storage is quite busy (180Gb/s of read/write), <br>
> that the FS starts to be slow on normal cd or ls requests. This might <br>
> be normal, but in those situations I want to know where the <br>
> bottleneck is. Is it the server CPU? Memory? Network? Spindles? Knowing<br>
> where the bottleneck is might help me understand whether we can tweak <br>
> the system a bit more.<br>
</font></tt>
<br><tt><font size=2>if cd or ls is very slow in GPFS, in the majority
of cases it has nothing to do with NSD server bottlenecks, or only indirectly.
</font></tt>
<br><tt><font size=2>the main reason ls is slow in the field is that you have
some very powerful nodes all doing buffered writes into one or more files in
the same directory while you run the ls on a different node. the ls you ran is
most likely an alias for ls -l, or something even more complex with color
display, etc., but the point is that it most likely returns file sizes. GPFS
doesn't lie about the file size; we only return accurate stat information,
and while that design is arguable, it's a fact today. </font></tt>
<br><tt><font size=2>so what happens is that the stat on each file triggers
a token revoke on the node that is currently writing to the file you stat.
say that node has 1 GB of dirty data in its memory for this file (as it
writes the data buffered): this 1 GB of data now gets written to the NSD
server, and only then does the client update the inode info and return the
correct size. </font></tt>
<br><tt><font size=2>let's say you have a very fast network and a fast
storage device like GSS (which I see you have): each flush will finish in a
few hundred ms, but the problem is that this now happens serialized for every
single file in this directory that people are writing into, as for each one
we need to get the exact stat info to satisfy your ls -l request.
</font></tt>
<br><tt><font size=2>this is what takes so long, not the storage device
being slow or too much metadata activity going on; this is token traffic,
which means network traffic, and it is obviously latency dependent.
</font></tt>
<br>
<br><tt><font size=2>the best way to see this is to look at waiters on
the client where you run the ls and see what they are waiting for.</font></tt>
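A sketch of what that inspection might look like: on the client you would run `/usr/lpp/mmfs/bin/mmdiag --waiters` and filter for token-related waits. The waiter lines below are illustrative placeholders, not the exact mmdiag output format:

```shell
# Hedged sketch: filter waiter output for token/revoke-related waits.
# 'waiters' stands in for the output of: /usr/lpp/mmfs/bin/mmdiag --waiters
waiters='thread 101 waiting 0.512 sec: RPC wait for tmMsgRevoke
thread 102 waiting 0.004 sec: local disk I/O
thread 103 waiting 0.498 sec: waiting for token'

# Long token waits during a slow ls point at revoke latency, not disk speed:
echo "$waiters" | grep -iE 'token|revoke'
```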
<br>
<br><tt><font size=2>there are various ways to tune this to get better
'felt' ls response times, but the effect never completely goes away. </font></tt>
<br><tt><font size=2>if all you are trying to do with ls is check whether a
file exists in the directory, run unalias ls and check whether ls runs fast
after that, as it shouldn't do the -l under the covers anymore. </font></tt>
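The difference is easy to demonstrate locally: a plain listing only reads directory entries, while -l must stat() every file, and it is those stat() calls that trigger the token revokes described above. A small sketch (temporary directory, nothing GPFS-specific):

```shell
# Plain ls reads directory entries only; ls -l additionally stat()s each file.
d=$(mktemp -d)
touch "$d"/f1 "$d"/f2 "$d"/f3

# 'command' bypasses any ls alias, so no implicit -l sneaks in:
command ls "$d"                 # names only, no per-file stat()
command ls -l "$d" > /dev/null  # one stat() per file -> token revokes on GPFS

rm -rf "$d"
```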
<br>
<br><tt><font size=2>> <br>
> If it's the CPU on the servers, then there is not much to do besides <br>
> replacing them or adding more servers. If it's not the CPU, maybe more <br>
> memory would help? Maybe it's just the network that filled up, so I can <br>
> add more links? <br>
> <br>
> Or, if we have reached the point where the bottleneck is the spindles, <br>
> then there is not much point looking somewhere else; we have just <br>
> reached the hardware limit.<br>
> <br>
> Sometimes it also happens that there is very low IO (10Gb/s), <br>
> almost no CPU usage on the servers, but huge slowness (ls can take 10<br>
> seconds). Why does that happen? There are not many data ops, but we <br>
> think there is a huge amount of metadata ops. So what I want to <br>
> know is whether the metadata vdisks are busy or not. If this is our <br>
> problem, could some SSD disks dedicated to metadata help? <br>
</font></tt>
<br><tt><font size=2>whether SSDs would help or not is hard to say without
knowing the root cause, and as I tried to explain above, the most likely
cause is token revokes, not disk i/o. obviously, the busier your disks are,
the longer a token revoke will take. </font></tt>
<br>
<br><tt><font size=2>> <br>
> <br>
> In particular, I'm a bit puzzled by the design of our GSS storage.<br>
> Each recovery group has 3 declustered arrays, and each declustered<br>
> array has 1 data and 1 metadata vdisk, but in the end both metadata <br>
> and data vdisks use the same spindles. The problem is that I <br>
> don't understand whether we have a metadata bottleneck there. Maybe some <br>
> SSD disks in a dedicated declustered array would perform much <br>
> better, but this is just theory. I really would like to be able to <br>
> monitor IO activity on the metadata vdisks.<br>
</font></tt>
<br><tt><font size=2>the short answer is that we WANT the metadata disks to
be on the same spindles as the data disks. compared to other storage systems,
GSS is capable of handling different RAID codes for different virtual disks
on the same physical disks. this way we create RAID1-like 'LUNs' for metadata
and RAID6-like 'LUNs' for data, so the penalty for a small metadata i/o is
very small compared to a read/modify/write on the data disks.</font></tt>
<br>
<br><tt><font size=2>> <br>
> <br>
</font></tt>
<br><tt><font size=2>> <br>
> <br>
> so the layer you care about is the NSD server layer, which sits on <br>
> top of the VIO layer (which is essentially the SW RAID layer in GNR). <br>
> <br>
> > I'm asking because, if they are just ops, I don't know how <br>
> > useful they could be. For example, one write operation could mean <br>
> > writing 1 block or writing a 100GB file. If those are operations, <br>
> > is there a way to get the output in bytes or blocks? <br>
> <br>
> there are multiple ways to get info on the NSD layer; one would be <br>
> to use the dstat plugin (see /usr/lpp/mmfs/sample/util), but those are <br>
> counters again. </font></tt>
<br><tt><font size=2>> <br>
> Counters are not a problem. I can collect them and create some <br>
> graphs in a monitoring tool. I will check that.<br>
</font></tt>
<br><tt><font size=2>if you upgrade (or have someone upgrade) your system to
GSS 2.0, you get graphical monitoring as part of it. if you want, I can send
you additional information on that in a direct email outside the group. </font></tt>
<br>
<br><tt><font size=2>> <br>
> the alternative option is to use mmdiag --iohist. this shows you a <br>
> history of the last X I/O operations on either the client <br>
> or the server side, e.g. on a client: <br>
> <br>
> # mmdiag --iohist <br>
> <br>
> === mmdiag: iohist === <br>
> <br>
> I/O history: <br>
> <br>
> I/O start time RW Buf type disk:sectorNum nSec time ms qTime ms RpcTimes ms Type Device/NSD ID NSD server <br>
> --------------- -- ----------- ----------------- ----- ------- -------- ----------------- ---- ------------------ --------------- <br>
> 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 <br>
> 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 <br>
> 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 <br>
> 14:25:53.668262 R inode 2:1081373696 8 14.117 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 <br>
> 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 <br>
> 14:25:53.692019 R inode 2:1064356608 8 14.899 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 <br>
> 14:25:53.707100 R inode 2:1077830152 8 16.499 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 <br>
> 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 <br>
> 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 <br>
> 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 <br>
> 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 <br>
> 14:25:57.906556 R inode 2:1083476520 8 11.723 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 <br>
> 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 <br>
> 14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 <br>
> 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 <br>
> 14:25:57.941441 R inode 2:1069885984 8 11.686 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 <br>
> 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 <br>
> 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 <br>
> 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 <br>
> 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 <br>
> <br>
> you basically see whether it's an inode or a data block, what size it has (in<br>
> sectors), which NSD server the request was sent to, etc. <br>
> <br>
> on the server side you see the type, which physical disk it goes to,<br>
> and also what size of disk I/O it causes, like: <br>
> <br>
> 14:26:50.129995 R inode 12:3211886376 64 14.261 0.000 0.000 0.000 pd sdis <br>
> 14:26:50.137102 R inode 19:3003969520 64 9.004 0.000 0.000 0.000 pd sdad <br>
> 14:26:50.136116 R inode 55:3591710992 64 11.057 0.000 0.000 0.000 pd sdoh <br>
> 14:26:50.141510 R inode 21:3066810504 64 5.909 0.000 0.000 0.000 pd sdaf <br>
> 14:26:50.130529 R inode 89:2962370072 64 17.437 0.000 0.000 0.000 pd sddi <br>
> 14:26:50.131063 R inode 78:1889457000 64 17.062 0.000 0.000 0.000 pd sdsj <br>
> 14:26:50.143403 R inode 36:3323035688 64 4.807 0.000 0.000 0.000 pd sdmw <br>
> 14:26:50.131044 R inode 37:2513579736 128 17.181 0.000 0.000 0.000 pd sddv <br>
> 14:26:50.138181 R inode 72:3868810400 64 10.951 0.000 0.000 0.000 pd sdbz <br>
> 14:26:50.138188 R inode 131:2443484784 128 11.792 0.000 0.000 0.000 pd sdug <br>
> 14:26:50.138003 R inode 102:3696843872 64 11.994 0.000 0.000 0.000 pd sdgp <br>
> 14:26:50.137099 R inode 145:3370922504 64 13.225 0.000 0.000 0.000 pd sdmi <br>
> 14:26:50.141576 R inode 62:2668579904 64 9.313 0.000 0.000 0.000 pd sdou <br>
> 14:26:50.134689 R inode 159:2786164648 64 16.577 0.000 0.000 0.000 pd sdpq <br>
> 14:26:50.145034 R inode 34:2097217320 64 7.409 0.000 0.000 0.000 pd sdmt <br>
> 14:26:50.138140 R inode 139:2831038792 64 14.898 0.000 0.000 0.000 pd sdlw <br>
> 14:26:50.130954 R inode 164:282120312 64 22.274 0.000 0.000 0.000 pd sdzd <br>
> 14:26:50.137038 R inode 41:3421909608 64 16.314 0.000 0.000 0.000 pd sdef <br>
> 14:26:50.137606 R inode 104:1870962416 64 16.644 0.000 0.000 0.000 pd sdgx <br>
> 14:26:50.141306 R inode 65:2276184264 64 16.593 0.000 0.000 0.000 pd sdrk <br>
> <br>
</font></tt>
<br><tt><font size=2>> <br>
> mmdiag --iohist is another thing I looked at, but I could not <br>
> find a good explanation for all the "buf type" values (third column)<br>
</font></tt>
<br><tt><font size=2>> allocSeg<br>
> data<br>
> iallocSeg<br>
> indBlock<br>
> inode<br>
> LLIndBlock<br>
> logData<br>
> logDesc<br>
> logWrap<br>
> metadata<br>
> vdiskAULog<br>
> vdiskBuf<br>
> vdiskFWLog<br>
> vdiskMDLog<br>
> vdiskMeta<br>
> vdiskRGDesc</font></tt>
<br><tt><font size=2>> If I want to monitor metadata operations, what
should I look at? Just <br>
</font></tt>
<br><tt><font size=2>inode = inodes, *alloc* = file or data allocation
blocks, *ind* = indirect blocks (for very large files), plus the metadata
type; everything else is data or internal i/o's. </font></tt>
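Given that classification, a rough way to split an iohist dump into metadata vs. data traffic is to bucket on the third column. A sketch (the sample lines are shortened copies of the output above, and the match patterns simply encode the rule of thumb from this paragraph):

```shell
# Bucket mmdiag --iohist records into metadata vs data by buf type (column 3).
iohist='14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073
14:25:22.182723 R inode 1:1071252480 8 6.970
14:25:57.877416 R metadata 2:678978560 16 13.343
14:26:01.000000 W data 12:3211886376 64 14.261'

echo "$iohist" | awk '{
  t = $3
  # inode, metadata, *alloc*, *ind* -> metadata; everything else -> data
  if (t == "inode" || t == "metadata" || t ~ /alloc/ || t ~ /[Ii]nd/) md++
  else dt++
} END { printf "metadata ops: %d, data ops: %d\n", md, dt }'
```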
<br><tt><font size=2> </font></tt>
<br><tt><font size=2>> the metadata flag, or also inode? This command also
takes long to <br>
> run; especially if I run it a second time, it hangs for a while before <br>
> running again, so I'm not sure that running it every 30 secs or every <br>
> minute is viable, but I will look into that as well. Is there any <br>
> documentation that clearly describes the whole output? What I found <br>
> is quite generic and doesn't go into details...<br>
</font></tt>
<br><tt><font size=2>the reason it takes so long is that it collects tens of
thousands of i/os in a table, and to not slow down the system when we dump
the data, we copy it to a separate buffer so we don't need locks
:-) </font></tt>
<br><tt><font size=2>you can adjust the number of entries you want to collect
by adjusting the ioHistorySize config parameter</font></tt>
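For example, the history size could be raised like this (the value 4096 is illustrative; check the documentation of your GPFS level for the parameter's valid range, and note that a larger table costs memory):

```shell
# Config sketch: enlarge the I/O history table, then dump it.
# -i applies the change immediately (assumed available at this code level).
/usr/lpp/mmfs/bin/mmchconfig ioHistorySize=4096 -i
/usr/lpp/mmfs/bin/mmdiag --iohist
```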
<br>
<br>
<br><tt><font size=2>> > <br>
> > Last but not least, and this is what I really would like to <br>
> > accomplish: I would like to be able to monitor the latency of metadata <br>
> operations. <br>
> <br>
> you can't do this on the server side, as you don't know how much time<br>
> is spent on the client, the network, or anything between the app and <br>
> the physical disk, so you can only reliably look at this from the <br>
> client. the iohist output only shows you the server disk i/o <br>
> processing time, but that can be a fraction of the overall time (in <br>
> other cases it can obviously also be the dominant part, depending <br>
> on your workload). <br>
> <br>
> the easiest way on the client is to run <br>
> <br>
> mmfsadm vfsstats enable <br>
> from now on vfs stats are collected until you restart GPFS. <br>
> <br>
> then run : <br>
> <br>
> vfs statistics currently enabled <br>
> started at: Fri Aug 29 13:15:05.380 2014 <br>
> duration: 448446.970 sec <br>
> <br>
> name        calls      time per call    total time <br>
> --------    --------   -------------    -------------- <br>
> statfs      9          0.000002         0.000021 <br>
> startIO     246191176  0.005853         1441049.976740 <br>
> <br>
> to dump whatever was collected so far on this node. <br>
</font></tt>
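From that dump, the average latency per VFS call is just total time divided by calls. A sketch using the startIO line above (the line is inlined here; in practice it would be grepped out of the stats dump):

```shell
# Average latency per VFS op = total time / calls, converted to ms.
line='startIO 246191176 0.005853 1441049.976740'
echo "$line" | awk '{ printf "%s: %.3f ms/call\n", $1, ($4 / $2) * 1000 }'
```

Tracking this value over time (per node) is a cheap way to spot when client-side metadata latency degrades.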
<br><tt><font size=2>> <br>
> We already do that, but as I said, I want to check specifically how <br>
> the GSS servers are keeping up with the requests, to identify or <br>
> exclude server-side bottlenecks.<br>
> <br>
> <br>
> Thanks for your help, you gave me definitely few things where to look
at.<br>
> <br>
> Salvatore<br>
> _______________________________________________<br>
> gpfsug-discuss mailing list<br>
> gpfsug-discuss at gpfsug.org<br>
> </font></tt><a href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss"><tt><font size=2>http://gpfsug.org/mailman/listinfo/gpfsug-discuss</font></tt></a><tt><font size=2><br>
</font></tt>