[gpfsug-discuss] gpfs performance monitoring
oehmes at us.ibm.com
Sat Sep 6 01:12:42 BST 2014
On your GSS nodes you have tuning files that we suggest customers use for
mixed-workload clients. The files are in /usr/lpp/mmfs/samples/gss/.
If you create a node class for all your clients, you can run
/usr/lpp/mmfs/samples/gss/gssClientConfig.sh NODECLASS
and it applies all the settings to them, so they will be active on the next
restart of GPFS.
This should be a very good starting point for your config. Please try that
and let me know if it doesn't.
There are also several enhancements in GPFS 4.1 that reduce contention in
multiple areas, which would help as well if you have the option to update.
By the way, the GSS 2.0 package will also update your GSS nodes to 4.1.
Scalable Storage Research
email: oehmes at us.ibm.com
Phone: +1 (408) 824-8904
IBM Almaden Research Lab
From: Salvatore Di Nardo <sdinardo at ebi.ac.uk>
To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
Date: 09/05/2014 03:57 AM
Subject: Re: [gpfsug-discuss] gpfs performance monitoring
Sent by: gpfsug-discuss-bounces at gpfsug.org
Our ls is plain ls; there is no alias.
Consider that all those things are already set up properly, as EBI has run
computing farms for many years, so those things were fixed a long
time ago. We have very little experience with GPFS, but good knowledge
of LSF farms, and we own multiple NFS storage systems (several petabytes in
size). About NIS: all clients run NSCD, which caches all information to avoid
this type of slowness. In fact, when ls is slow, ls -n is also slow.
Besides that, a "cd" sometimes hangs too, so it has nothing to do with
Just to clarify a bit more: GSS usually seems to work fine now. We have
users running jobs on the farms that push 180Gb/s of reads (reading and
writing files of 100GB in size). GPFS works very well there, where other
systems had performance problems accessing portions of data in such huge
Sadly, on the other hand, other users run jobs that do a huge amount of
metadata operations, like tons of ls in directories with many files, or
creating a silly number of temporary files just to synchronize the jobs
between the farm nodes, or just to store temporary data for a few
milliseconds and then immediately delete those temporary files. Imagine
constantly creating thousands of files just to write a few bytes and then
deleting them after a few milliseconds...
When those things happen we see 10-15Gb/sec throughput and low CPU usage on
the servers (80% idle), but any cd, ls, or whatever takes a few seconds.
So my question is whether the bottleneck could be the spindles, or whether
the clients could be tuned a bit more.
I read your PDF and all the parameters seem already well configured
except "maxFilesToCache", but I'm not sure how we should configure a few of
those parameters on the clients. As an example, I cannot imagine a client
that requires a 38g pagepool.
So what's the correct pagepool on a client? What about the others?
Right now all the clients have a 1 GB pagepool. In theory, we can
afford to use more (I think we can easily go up to 8GB) as they have
plenty of available memory. If this could help, we can do that, but do the
clients really need more than 1G? They are just clients after all,
so the memory in theory should be used for jobs, not just for "caching".
Last question, about "maxFilesToCache": you say that it must be large on
small clusters but small on large clusters. What do you consider 6 servers
and almost 700 clients?
On clients we have:
On servers we have:
On 05/09/14 01:48, Sven Oehme wrote:
gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM:
> From: Salvatore Di Nardo <sdinardo at ebi.ac.uk>
> To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
> Date: 09/04/2014 03:44 AM
> Subject: Re: [gpfsug-discuss] gpfs performance monitoring
> Sent by: gpfsug-discuss-bounces at gpfsug.org
> On 04/09/14 01:50, Sven Oehme wrote:
> > Hello everybody,
> > here I come again, this time to ask for some hints about how to
> > monitor GPFS.
> > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is
> > that they return numbers based only on the requests done on the
> > current host, so I have to run them on all the clients (over 600
> > nodes), which is quite impractical. Instead I would like to know from
> > the servers what's going on, and I came across the vio_s statistics,
> > which are less documented, and I don't know exactly what they mean.
> > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that
> > runs vio_s.
> > My problems with the output of this command:
> > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1
> > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second
> > timestamp: 1409763206/477366
> > recovery group: *
> > declustered array: *
> > vdisk: *
> > client reads: 2584229
> > client short writes: 55299693
> > client medium writes: 190071
> > client promoted full track writes: 465145
> > client full track writes: 9249
> > flushed update writes: 4187708
> > flushed promoted full track writes: 123
> > migrate operations: 114
> > scrub operations: 450590
> > log writes: 28509602
> > It says "VIOPS per second", but they seem to me to be just counters, as
> > every time I re-run the command, the numbers increase by a bit.
> > Can anyone confirm if those numbers are counters or if they are
> The numbers are cumulative, so every time you run the command they just
> show the value since start (or last reset) time.
> OK, you confirmed my thoughts, thanks.
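Since the vio_s values are cumulative counters, a per-second rate has to be derived from two samples. The following is only a minimal sketch of that idea: the helper names (parse_vio_s, rates), the 10-second interval, and the second sample's values are hypothetical; only the first sample's numbers come from the output shown above.

```python
import re

def parse_vio_s(text):
    """Collect 'name: integer' lines from pasted vio_s output into a dict."""
    counters = {}
    for line in text.splitlines():
        # Drop mail-quote markers and padding if the sample came from an email.
        line = line.lstrip("> ").strip()
        match = re.fullmatch(r"([A-Za-z ]+):\s*(\d+)", line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters

def rates(before, after, interval_s):
    """Per-second rate for every counter present in both samples."""
    return {name: (after[name] - before[name]) / interval_s
            for name in after if name in before}

sample_1 = """
client reads: 2584229
client short writes: 55299693
log writes: 28509602
"""
# Hypothetical second sample, as if taken 10 seconds after the first one.
sample_2 = """
client reads: 2584729
client short writes: 55309693
log writes: 28512602
"""

vio_rates = rates(parse_vio_s(sample_1), parse_vio_s(sample_2), 10.0)
for name, per_second in sorted(vio_rates.items()):
    print(f"{name}: {per_second:.1f} ops/s")
```

Lines such as the timestamp or the "*" placeholders do not match the pattern and are skipped, so the whole mmpmon output can be pasted in unchanged.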
> > Looking closer, I don't understand what most of those values
> > mean. For example, what exactly are "flushed promoted full track writes"?
> > I tried to find documentation about this output, but could not
> > find any. Can anyone point me to a link where the output of vio_s is
> > Another thing I don't understand about those numbers is whether they are
> > just operations, or the number of blocks that were read/written/etc.
> They are just operations, and if I explained what the numbers mean I
> might confuse you even more, because this is not what you are really
> looking for.
> What you are looking for is what the client I/Os look like on the
> server side, while the VIO layer is the server side to the disks, so
> one level lower than what you are looking for, from what I could read
> out of the description above.
> No, what I'm looking at is exactly how busy the disks are serving the
> requests. Obviously I'm not looking at just that, but I feel the need
> to monitor those things as well. I'll explain why.
> It happens when our storage is quite busy (180Gb/s of read/write)
> that the FS starts to be slow on normal cd or ls requests. This might
> be normal, but in those situations I want to know where the
> bottleneck is. Is it the server CPU? Memory? Network? Spindles? Knowing
> where the bottleneck is might help me understand if we can tweak
> the system a bit more.
If cd or ls is very slow in GPFS, in the majority of cases it has
nothing to do with NSD server bottlenecks, or only indirectly.
The main reason ls is slow in the field is that you have some very powerful
nodes that all do buffered writes into the same directory, into one or
multiple files, while you do the ls on a different node. What happens now
is that the ls you ran is most likely an alias for ls -l or something
even more complex with color display, etc., but the point is it most likely
returns the file size. GPFS doesn't lie about the file size; we only return
accurate stat information, and while this is arguable, it's a fact today.
So what happens is that the stat on each file triggers a token revoke on
the node that is currently writing to the file you stat. Let's say it
has 1 GB of dirty data in its memory for this file (as it writes data
buffered): this 1 GB of data now gets written to the NSD server, the client
updates the inode info and returns the correct size.
Let's say you have a very fast network and a fast storage device
like GSS (which I see you have): it will be able to do this in a few hundred
ms, but the problem is that this now happens serialized for each single file
in this directory that people write into, as for each one we need to get the
exact stat info to satisfy your ls -l request.
This is what takes so long, not the fact that the storage device might be
slow or that too much metadata activity is going on. This is token traffic,
which means network traffic, and is obviously latency dependent.
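To put rough numbers on the serialization described above, here is a small back-of-the-envelope sketch. Everything in it is an assumption made for illustration except the 1 GB of dirty data per file: the function name, the 50 files, the 5 GB/s flush bandwidth (chosen so one flush lands in the "few hundred ms" range) and the 1 ms round trip are all hypothetical.

```python
def serialized_ls_time(num_files, dirty_bytes, flush_bytes_per_s, round_trip_s):
    """Estimate ls -l time when each stat waits for one token revoke.

    Simplified model: every revoke costs one network round trip plus the
    time to flush that file's dirty data, and revokes happen one by one.
    """
    per_file_s = round_trip_s + dirty_bytes / flush_bytes_per_s
    return num_files * per_file_s

estimate_s = serialized_ls_time(
    num_files=50,            # hypothetical: files currently being written
    dirty_bytes=1e9,         # 1 GB of dirty data per file, as in the example
    flush_bytes_per_s=5e9,   # hypothetical flush bandwidth per writer
    round_trip_s=0.001,      # hypothetical network round trip
)
print(f"estimated ls -l time: {estimate_s:.2f} s")
```

The point of the model is that the total grows linearly with the number of files being written, even though each individual flush is fast.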
The best way to see this is to look at the waiters on the client where you
run the ls and see what they are waiting for.
There are various ways to tune this to get better 'felt' ls responses, but
it's not going away completely.
If all you want to check with ls is whether there is a file in the
directory, run unalias ls and check if ls runs fast after that, as it
shouldn't do the -l under the covers anymore.
> If it's the CPU on the servers then there is not much to do besides
> replacing or adding servers. If it's not the CPU, maybe more memory
> would help? Maybe it's just the network that filled up, so I can add
> more links?
> Or if we reached the point where the bottleneck is the spindles,
> then there is not much point in looking somewhere else; we just reached
> the hardware limit.
> Sometimes it also happens that there is very low IO (10Gb/s),
> almost no CPU usage on the servers, but huge slowness (ls can take 10
> seconds). Why does that happen? There are not many data ops, but we
> think there is a huge amount of metadata ops. So what I want to
> know is whether the metadata vdisks are busy or not. If this is our
> problem, could some SSD disks dedicated to metadata help?
Whether SSDs would help or not is hard to say without knowing the
root cause, and as I tried to explain above, the most likely cause is token
revoke, not disk I/O. Obviously, the busier your disks are, the longer the
token revoke will take.
> In particular I'm a bit puzzled by the design of our GSS storage.
> Each recovery group has 3 declustered arrays, and each declustered
> array has 1 data and 1 metadata vdisk, but in the end both metadata
> and data vdisks use the same spindles. The problem is that I
> don't understand whether we have a metadata bottleneck there. Maybe some
> SSD disks in a dedicated declustered array would perform much
> better, but this is just theory. I really would like to be able to
> monitor IO activity on the metadata vdisks.
The short answer is that we WANT the metadata disks to be on the same
spindles as the data disks. Compared to other storage systems, GSS is able to
handle different RAID codes for different virtual disks on the same
physical disks. This way we create RAID1-ish 'LUNs' for metadata and
RAID6-ish 'LUNs' for data, so the small-I/O penalty for metadata is very
small compared to a read/modify/write on the data disks.
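The small-write penalty mentioned here can be illustrated with the textbook disk-I/O counts for mirroring versus parity RAID. This is a generic sketch, not a description of how GNR implements its RAID codes; the function names and the choice of 2 copies and 2 parity strips are assumptions made for the example.

```python
def mirror_small_write_ios(copies):
    """Mirroring: a small update is one write per copy, with no reads."""
    return copies

def parity_rmw_small_write_ios(parity_strips):
    """Parity RAID read/modify/write of one data strip.

    Read the old data strip and each old parity strip, then write the
    new data strip and each new parity strip.
    """
    reads = 1 + parity_strips
    writes = 1 + parity_strips
    return reads + writes

mirror_ios = mirror_small_write_ios(copies=2)              # RAID1-style
raid6_ios = parity_rmw_small_write_ios(parity_strips=2)    # RAID6-style
print(f"mirror: {mirror_ios} disk I/Os, RAID6 read/modify/write: {raid6_ios} disk I/Os")
```

Under this model a small metadata update on a mirrored vdisk costs a third of the disk I/Os of the same update on a RAID6-style vdisk, and none of them are reads that must complete before the writes can start.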
> So the layer you care about is the NSD server layer, which sits on
> top of the VIO layer (which is essentially the SW RAID layer in GNR).
> > I'm asking because if they are just ops, I don't know how
> > useful they could be. For example, one write operation could mean
> > writing 1 block or writing a file of 100GB. If those are operations,
> > is there a way to get the output in bytes or blocks?
> There are multiple ways to get info on the NSD layer. One would be
> to use the dstat plugin (see /usr/lpp/mmfs/samples/util), but that's
> counts again.
> Counters are not a problem. I can collect them and create some
> graphs in a monitoring tool. I will check that.
If you upgrade your system (or have it upgraded) to GSS 2.0, you get
graphical monitoring as part of it. If you want, I can send you a direct
email outside the group with additional information on that.
> The alternative option is to use mmdiag --iohist. This shows you a
> history of the last X I/O operations on either the client
> or the server side. For example, on a client:
> # mmdiag --iohist
> === mmdiag: iohist ===
> I/O history:
> I/O start time RW Buf type disk:sectorNum nSec time ms
> qTime ms RpcTimes ms Type Device/NSD ID NSD server
> --------------- -- ----------- ----------------- ----- -------
> -------- ----------------- ---- ------------------ ---------------
> 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073
> 0.000 12.959 0.063 cli C0A70401:53BEEA7F 22.214.171.124
> 14:25:22.182723 R inode 1:1071252480 8 6.970
> 0.000 6.908 0.038 cli C0A70401:53BEEA7F 126.96.36.199
> 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309
> 0.000 8.210 0.046 cli C0A70401:53BEEA7F 188.8.131.52
> 14:25:53.668262 R inode 2:1081373696 8 14.117
> 0.000 14.032 0.058 cli C0A70402:53BEEA5E 184.108.40.206
> 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254
> 0.000 9.180 0.038 cli C0A70401:53BEEA7F 220.127.116.11
> 14:25:53.692019 R inode 2:1064356608 8 14.899
> 0.000 14.847 0.029 cli C0A70402:53BEEA5E 18.104.22.168
> 14:25:53.707100 R inode 2:1077830152 8 16.499
> 0.000 16.449 0.025 cli C0A70402:53BEEA5E 22.214.171.124
> 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280
> 0.000 4.203 0.040 cli C0A70401:53BEEA7F 126.96.36.199
> 14:25:53.728082 R inode 2:1081918976 8 7.760
> 0.000 7.710 0.027 cli C0A70402:53BEEA5E 188.8.131.52
> 14:25:57.877416 R metadata 2:678978560 16 13.343
> 0.000 13.254 0.053 cli C0A70402:53BEEA5E 184.108.40.206
> 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491
> 0.000 15.401 0.058 cli C0A70401:53BEEA7F 220.127.116.11
> 14:25:57.906556 R inode 2:1083476520 8 11.723
> 0.000 11.676 0.029 cli C0A70402:53BEEA5E 18.104.22.168
> 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062
> 0.000 8.001 0.032 cli C0A70401:53BEEA7F 22.214.171.124
> 14:25:57.926592 R inode 1:1076503480 8 8.087
> 0.000 8.043 0.026 cli C0A70401:53BEEA7F 126.96.36.199
> 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572
> 0.000 6.510 0.033 cli C0A70401:53BEEA7F 188.8.131.52
> 14:25:57.941441 R inode 2:1069885984 8 11.686
> 0.000 11.641 0.024 cli C0A70402:53BEEA5E 184.108.40.206
> 14:25:57.953294 R inode 2:1083476936 8 8.951
> 0.000 8.912 0.021 cli C0A70402:53BEEA5E 220.127.116.11
> 14:25:57.965475 R inode 1:1076503504 8 0.477
> 0.000 0.053 0.000 cli C0A70401:53BEEA7F 18.104.22.168
> 14:25:57.965755 R inode 2:1083476488 8 0.410
> 0.000 0.061 0.321 cli C0A70402:53BEEA5E 22.214.171.124
> 14:25:57.965787 R inode 2:1083476512 8 0.439
> 0.000 0.053 0.342 cli C0A70402:53BEEA5E 126.96.36.199
> You basically see whether it's an inode or a data block, what size it has
> (in sectors), which NSD server you sent the request to, etc.
> On the server side you see the type, which physical disk it goes to,
> and also what size of disk I/O it causes, like:
> 14:26:50.129995 R inode 12:3211886376 64 14.261
> 0.000 0.000 0.000 pd sdis
> 14:26:50.137102 R inode 19:3003969520 64 9.004
> 0.000 0.000 0.000 pd sdad
> 14:26:50.136116 R inode 55:3591710992 64 11.057
> 0.000 0.000 0.000 pd sdoh
> 14:26:50.141510 R inode 21:3066810504 64 5.909
> 0.000 0.000 0.000 pd sdaf
> 14:26:50.130529 R inode 89:2962370072 64 17.437
> 0.000 0.000 0.000 pd sddi
> 14:26:50.131063 R inode 78:1889457000 64 17.062
> 0.000 0.000 0.000 pd sdsj
> 14:26:50.143403 R inode 36:3323035688 64 4.807
> 0.000 0.000 0.000 pd sdmw
> 14:26:50.131044 R inode 37:2513579736 128 17.181
> 0.000 0.000 0.000 pd sddv
> 14:26:50.138181 R inode 72:3868810400 64 10.951
> 0.000 0.000 0.000 pd sdbz
> 14:26:50.138188 R inode 131:2443484784 128 11.792
> 0.000 0.000 0.000 pd sdug
> 14:26:50.138003 R inode 102:3696843872 64 11.994
> 0.000 0.000 0.000 pd sdgp
> 14:26:50.137099 R inode 145:3370922504 64 13.225
> 0.000 0.000 0.000 pd sdmi
> 14:26:50.141576 R inode 62:2668579904 64 9.313
> 0.000 0.000 0.000 pd sdou
> 14:26:50.134689 R inode 159:2786164648 64 16.577
> 0.000 0.000 0.000 pd sdpq
> 14:26:50.145034 R inode 34:2097217320 64 7.409
> 0.000 0.000 0.000 pd sdmt
> 14:26:50.138140 R inode 139:2831038792 64 14.898
> 0.000 0.000 0.000 pd sdlw
> 14:26:50.130954 R inode 164:282120312 64 22.274
> 0.000 0.000 0.000 pd sdzd
> 14:26:50.137038 R inode 41:3421909608 64 16.314
> 0.000 0.000 0.000 pd sdef
> 14:26:50.137606 R inode 104:1870962416 64 16.644
> 0.000 0.000 0.000 pd sdgx
> 14:26:50.141306 R inode 65:2276184264 64 16.593
> 0.000 0.000 0.000 pd sdrk
> mmdiag --iohist is another thing I looked at, but I could not
> find a good explanation for all the "buf type" values (third column).
> If I want to monitor metadata operations, what should I look at? Just
> the metadata flag, or also inode?
inode = inodes, *alloc* = file or data allocation blocks, *ind* =
indirect blocks (for very large files), and metadata; everything else is
data or internal I/Os.
> This command also takes long to
> run; especially if I run it a second time, it hangs for a long time before
> running again, so I'm not sure that running it every 30 seconds or every
> minute is viable, but I will look into that too. Is there any
> documentation that clearly describes the whole output? What I found
> is quite generic and doesn't go into details...
The reason it takes so long is that it collects tens of thousands of
I/Os in a table, and to avoid slowing down the system when we dump the data
we copy it to a separate buffer so we don't need locks :-)
You can adjust the number of entries you want to collect by adjusting the
ioHistorySize config parameter.
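As an illustration of how the buffer-type rule from this thread (inode, *alloc*, *ind*, metadata, everything else) could be applied to saved mmdiag --iohist output, here is a rough sketch. It assumes each history entry has been joined back onto a single line, that the buffer type is a single token in the third column, and that the sixth column is the "time ms" value; the function names are hypothetical, and the sample rows are copied from the client output quoted earlier.

```python
from collections import defaultdict

def classify(buf_type):
    """Map an iohist buffer type to a coarse category."""
    name = buf_type.lower()
    if name == "inode":
        return "inode"
    if "alloc" in name:
        return "allocation"
    if "ind" in name:
        return "indirect"
    if name == "metadata":
        return "metadata"
    return "data or internal"

def summarize(lines):
    """Return {category: (count, mean time in ms)} for iohist rows."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        fields = line.split()
        # Skip headers, separators, and anything that is not an R/W row.
        if len(fields) < 6 or fields[1] not in ("R", "W"):
            continue
        try:
            time_ms = float(fields[5])
        except ValueError:
            continue
        entry = totals[classify(fields[2])]
        entry[0] += 1
        entry[1] += time_ms
    return {kind: (count, total / count) for kind, (count, total) in totals.items()}

sample_lines = [
    "I/O start time RW Buf type disk:sectorNum nSec time ms",
    "14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 0.000 12.959 0.063 cli C0A70401:53BEEA7F 22.214.171.124",
    "14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 6.908 0.038 cli C0A70401:53BEEA7F 126.96.36.199",
    "14:25:53.668262 R inode 2:1081373696 8 14.117 0.000 14.032 0.058 cli C0A70402:53BEEA5E 184.108.40.206",
    "14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 13.254 0.053 cli C0A70402:53BEEA5E 184.108.40.206",
]

summary = summarize(sample_lines)
for kind, (count, mean_ms) in sorted(summary.items()):
    print(f"{kind}: {count} I/Os, mean {mean_ms:.3f} ms")
```

Because only the first six columns are read, the same function also accepts the shorter server-side rows (type pd).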
> > Last but not least, and this is what I would really like to
> > accomplish: I would like to be able to monitor the latency of metadata
> You can't do this on the server side, as you don't know how much time
> you spend on the client, the network, or anything between the app and
> the physical disk, so you can only reliably look at this from the
> client. The iohist output only shows you the server disk I/O
> processing time, but that can be a fraction of the overall time (in
> other cases it can obviously also be the dominant part, depending
> on your workload).
> The easiest way on the client is to run
> mmfsadm vfsstats enable
> From now on vfs stats are collected until you restart GPFS.
> Then run:
> vfs statistics currently enabled
> started at: Fri Aug 29 13:15:05.380 2014
> duration: 448446.970 sec
> name calls time per call total time
> -------------------- -------- -------------- --------------
> statfs 9 0.000002 0.000021
> startIO 246191176 0.005853 1441049.976740
> to dump whatever you have collected so far on this node.
> We already do that, but as I said, I want to check specifically how
> the GSS servers are keeping up with the requests, to identify or exclude
> server-side bottlenecks.
> Thanks for your help, you definitely gave me a few things to look at.
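A dump like the vfs statistics table above can be post-processed to see which operations dominate. The sketch below is one possible way to do that, under the assumptions that each data row has exactly four whitespace-separated fields and that the times are in seconds; the VfsStat type and parse_vfsstats function are hypothetical names, and the two rows are the ones shown above.

```python
from typing import NamedTuple

class VfsStat(NamedTuple):
    name: str
    calls: int
    time_per_call_s: float
    total_time_s: float

def parse_vfsstats(text):
    """Parse 'name calls time-per-call total-time' rows, largest total first."""
    stats = []
    for line in text.splitlines():
        fields = line.lstrip("> ").split()
        if len(fields) != 4:
            continue  # banner, header, and duration lines have other widths
        try:
            stats.append(VfsStat(fields[0], int(fields[1]),
                                 float(fields[2]), float(fields[3])))
        except ValueError:
            continue  # separator rows and any other non-numeric lines
    return sorted(stats, key=lambda s: s.total_time_s, reverse=True)

dump = """
vfs statistics currently enabled
started at: Fri Aug 29 13:15:05.380 2014
duration: 448446.970 sec
name calls time per call total time
-------------------- -------- -------------- --------------
statfs 9 0.000002 0.000021
startIO 246191176 0.005853 1441049.976740
"""

vfs_stats = parse_vfsstats(dump)
for stat in vfs_stats:
    mean_ms = 1000.0 * stat.total_time_s / stat.calls
    print(f"{stat.name}: {stat.calls} calls, mean {mean_ms:.3f} ms")
```

Recomputing the mean from total time and call count gives a cross-check against the "time per call" column printed by the tool.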
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org