From S.J.Thompson at bham.ac.uk Mon Sep 1 20:44:45 2014 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Mon, 1 Sep 2014 19:44:45 +0000 Subject: [gpfsug-discuss] GPFS admin host name vs subnets Message-ID: I was just reading through the docs at: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/GPFS+Network+Communication+Overview And was wondering about using admin host name bs using subnets. My reading of the page is that if say I have a 1GbE network and a 40GbE network, I could have an admin host name on the 1GbE network. But equally from the docs, it looks like I could also use subnets to achieve the same whilst allowing the admin network to be a fall back for data if necessary. For example, create the cluster using the primary name on the 1GbE network, then use the subnets property to use set the network on the 40GbE network as the first and the network on the 1GbE network as the second in the list, thus GPFS data will pass over the 40GbE network in preference and the 1GbE network will, by default only be used for admin traffic as the admin host name will just be the name of the host on the 1GbE network. Is my reading of the docs correct? Or do I really want to be creating the cluster using the 40GbE network hostnames and set the admin node name to the name of the 1GbE network interface? (there's actually also an FDR switch in there somewhere for verbs as well) Thanks Simon From ewahl at osc.edu Tue Sep 2 14:44:29 2014 From: ewahl at osc.edu (Ed Wahl) Date: Tue, 2 Sep 2014 13:44:29 +0000 Subject: [gpfsug-discuss] GPFS admin host name vs subnets In-Reply-To: References: Message-ID: Seems like you are on the correct track. This is similar to my setup. subnett'ed daemon 10GbE, 1GbE with main being QDR RDMA, admin 1GbE. To my mind the most important part is Setting "privateSubnetOverride" to 1. This allows both your 1GbE and your 40GbE to be on a private subnet. Serving block over public IPs just seems wrong on SO many levels. Whether truly private/internal or not. And how many people use public IPs internally? Wait, maybe I don't want to know... Using 'verbsRdma enable' for your FDR seems to override Daemon node name for block, at least in my experience. I love the fallback to 10GbE and then 1GbE in case of disaster when using IB. Lately we seem to be generating bugs in OpenSM at a frightening rate so that has been _extremely_ helpful. Now if we could just monitor when it happens more easily than running mmfsadm test verbs conn, say by logging a failure of RDMA? Ed OSC ________________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Simon Thompson (Research Computing - IT Services) [S.J.Thompson at bham.ac.uk] Sent: Monday, September 01, 2014 3:44 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] GPFS admin host name vs subnets I was just reading through the docs at: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/GPFS+Network+Communication+Overview And was wondering about using admin host name bs using subnets. My reading of the page is that if say I have a 1GbE network and a 40GbE network, I could have an admin host name on the 1GbE network. But equally from the docs, it looks like I could also use subnets to achieve the same whilst allowing the admin network to be a fall back for data if necessary. 
For example, create the cluster using the primary name on the 1GbE network, then use the subnets property to use set the network on the 40GbE network as the first and the network on the 1GbE network as the second in the list, thus GPFS data will pass over the 40GbE network in preference and the 1GbE network will, by default only be used for admin traffic as the admin host name will just be the name of the host on the 1GbE network. Is my reading of the docs correct? Or do I really want to be creating the cluster using the 40GbE network hostnames and set the admin node name to the name of the 1GbE network interface? (there's actually also an FDR switch in there somewhere for verbs as well) Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From oehmes at gmail.com Tue Sep 2 15:11:03 2014 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 2 Sep 2014 07:11:03 -0700 Subject: [gpfsug-discuss] GPFS admin host name vs subnets In-Reply-To: References: Message-ID: Ed, if you enable RDMA, GPFS will always use this as preferred data transfer. if you have subnets configured, GPFS will prefer this for communication with higher priority as the default interface. so the order is RDMA , subnets, default. if RDMA will fail for whatever reason we will use the subnets defined interface and if that fails as well we will use the default interface. the easiest way to see what is used is to run mmdiag --network (only avail on more recent versions of GPFS) it will tell you if RDMA is enabled between individual nodes as well as if a subnet connection is used or not : [root at client05 ~]# mmdiag --network === mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 192.167.13.5/16 (eth0) my addr list 192.1.13.5/16 (ib1) 192.0.13.5/16 (ib0)/ client04.clientad.almaden.ibm.com 192.167.13.5/16 (eth0) my node number 17 TCP Connections between nodes: Device ib0: hostname node destination status err sock sent(MB) recvd(MB) ostype client04n1 192.0.4.1 connected 0 69 0 37 Linux/L client04n2 192.0.4.2 connected 0 70 0 37 Linux/L client04n3 192.0.4.3 connected 0 68 0 0 Linux/L Device ib1: hostname node destination status err sock sent(MB) recvd(MB) ostype clientcl21 192.1.201.21 connected 0 65 0 0 Linux/L clientcl25 192.1.201.25 connected 0 66 0 0 Linux/L clientcl26 192.1.201.26 connected 0 67 0 0 Linux/L clientcl21 192.1.201.21 connected 0 71 0 0 Linux/L clientcl22 192.1.201.22 connected 0 63 0 0 Linux/L client10 192.1.13.10 connected 0 73 0 0 Linux/L client08 192.1.13.8 connected 0 72 0 0 Linux/L RDMA Connections between nodes: Fabric 1 - Device mlx4_0 Port 1 Width 4x Speed FDR lid 13 hostname idx CM state VS buff RDMA_CT(ERR) RDMA_RCV_MB RDMA_SND_MB VS_CT(ERR) VS_SND_MB VS_RCV_MB WAIT_CON_SLOT WAIT_NODE_SLOT clientcl21 0 N RTS (Y)903 0 (0 ) 0 0 192 (0 ) 0 0 0 0 client04n1 0 N RTS (Y)477 0 (0 ) 0 0 12367404(0 ) 107905 594 0 0 client04n1 1 N RTS (Y)477 0 (0 ) 0 0 12367404(0 ) 107901 593 0 0 client04n2 0 N RTS (Y)477 0 (0 ) 0 0 12371352(0 ) 107911 594 0 0 client04n2 2 N RTS (Y)477 0 (0 ) 0 0 12371352(0 ) 107902 594 0 0 clientcl21 0 N RTS (Y)880 0 (0 ) 0 0 11 (0 ) 0 0 0 0 client04n3 0 N RTS (Y)969 0 (0 ) 0 0 5 (0 ) 0 0 0 0 clientcl26 0 N RTS (Y)702 0 (0 ) 0 0 35 (0 ) 0 0 0 0 client08 0 N RTS (Y)637 0 (0 ) 0 0 16 (0 ) 0 0 0 0 clientcl25 0 N RTS (Y)574 0 (0 ) 0 0 14 (0 ) 0 0 0 0 clientcl22 0 N RTS (Y)507 0 (0 ) 0 0 2 (0 ) 0 0 0 0 client10 0 N RTS (Y)568 0 (0 ) 
0 0 121 (0 ) 0 0 0 0 Fabric 2 - Device mlx4_0 Port 2 Width 4x Speed FDR lid 65 hostname idx CM state VS buff RDMA_CT(ERR) RDMA_RCV_MB RDMA_SND_MB VS_CT(ERR) VS_SND_MB VS_RCV_MB WAIT_CON_SLOT WAIT_NODE_SLOT clientcl21 1 N RTS (Y)904 0 (0 ) 0 0 192 (0 ) 0 0 0 0 client04n1 2 N RTS (Y)477 0 (0 ) 0 0 12367404(0 ) 107897 593 0 0 client04n2 1 N RTS (Y)477 0 (0 ) 0 0 12371352(0 ) 107903 594 0 0 clientcl21 1 N RTS (Y)881 0 (0 ) 0 0 10 (0 ) 0 0 0 0 clientcl26 1 N RTS (Y)701 0 (0 ) 0 0 35 (0 ) 0 0 0 0 client08 1 N RTS (Y)637 0 (0 ) 0 0 16 (0 ) 0 0 0 0 clientcl25 1 N RTS (Y)574 0 (0 ) 0 0 14 (0 ) 0 0 0 0 clientcl22 1 N RTS (Y)507 0 (0 ) 0 0 2 (0 ) 0 0 0 0 client10 1 N RTS (Y)568 0 (0 ) 0 0 121 (0 ) 0 0 0 0 in this example you can see thet my client (client05) has multiple subnets configured as well as RDMA. so to connected to the various TCP devices (ib0 and ib1) to different cluster nodes and also has a RDMA connection to a different set of nodes. as you can see there is basically no traffic on the TCP devices, as all the traffic uses the 2 defined RDMA fabrics. there is not a single connection using the daemon interface (eth0) as all nodes are either connected via subnets or via RDMA. hope this helps. Sven On Tue, Sep 2, 2014 at 6:44 AM, Ed Wahl wrote: > Seems like you are on the correct track. This is similar to my setup. > subnett'ed daemon 10GbE, 1GbE with main being QDR RDMA, admin 1GbE. To > my mind the most important part is Setting "privateSubnetOverride" to 1. > This allows both your 1GbE and your 40GbE to be on a private subnet. > Serving block over public IPs just seems wrong on SO many levels. Whether > truly private/internal or not. And how many people use public IPs > internally? Wait, maybe I don't want to know... > > Using 'verbsRdma enable' for your FDR seems to override Daemon node > name for block, at least in my experience. I love the fallback to 10GbE > and then 1GbE in case of disaster when using IB. Lately we seem to be > generating bugs in OpenSM at a frightening rate so that has been > _extremely_ helpful. Now if we could just monitor when it happens more > easily than running mmfsadm test verbs conn, say by logging a failure of > RDMA? > > > Ed > OSC > > ________________________________________ > From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] > on behalf of Simon Thompson (Research Computing - IT Services) [ > S.J.Thompson at bham.ac.uk] > Sent: Monday, September 01, 2014 3:44 PM > To: gpfsug main discussion list > Subject: [gpfsug-discuss] GPFS admin host name vs subnets > > I was just reading through the docs at: > > > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/GPFS+Network+Communication+Overview > > And was wondering about using admin host name bs using subnets. My reading > of the page is that if say I have a 1GbE network and a 40GbE network, I > could have an admin host name on the 1GbE network. But equally from the > docs, it looks like I could also use subnets to achieve the same whilst > allowing the admin network to be a fall back for data if necessary. 
> > For example, create the cluster using the primary name on the 1GbE > network, then use the subnets property to use set the network on the 40GbE > network as the first and the network on the 1GbE network as the second in > the list, thus GPFS data will pass over the 40GbE network in preference and > the 1GbE network will, by default only be used for admin traffic as the > admin host name will just be the name of the host on the 1GbE network. > > Is my reading of the docs correct? Or do I really want to be creating the > cluster using the 40GbE network hostnames and set the admin node name to > the name of the 1GbE network interface? > > (there's actually also an FDR switch in there somewhere for verbs as well) > > Thanks > > Simon > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Wed Sep 3 18:27:44 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Wed, 03 Sep 2014 18:27:44 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring Message-ID: <54074F90.7000303@ebi.ac.uk> Hello everybody, here i come here again, this time to ask some hint about how to monitor GPFS. I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is that they return number based only on the request done in the current host, so i have to run them on all the clients ( over 600 nodes) so its quite unpractical. Instead i would like to know from the servers whats going on, and i came across the vio_s statistics wich are less documented and i dont know exacly what they mean. There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that runs VIO_S. My problems with the output of this command: echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second timestamp: 1409763206/477366 recovery group: * declustered array: * vdisk: * client reads: 2584229 client short writes: 55299693 client medium writes: 190071 client promoted full track writes: 465145 client full track writes: 9249 flushed update writes: 4187708 flushed promoted full track writes: 123 migrate operations: 114 scrub operations: 450590 log writes: 28509602 it sais "VIOPS per second", but they seem to me just counters as every time i re-run the command, the numbers increase by a bit.. Can anyone confirm if those numbers are counter or if they are OPS/sec. On a closer eye about i dont understand what most of thosevalues mean. For example, what exacly are "flushed promoted full track write" ?? I tried to find a documentation about this output , but could not find any. can anyone point me a link where output of vio_s is explained? Another thing i dont understand about those numbers is if they are just operations, or the number of blocks that was read/write/etc . I'm asking that because if they are just ops, i don't know how much they could be usefull. For example one write operation could eman write 1 block or write a file of 100GB. If those are oprations, there is a way to have the oupunt in bytes or blocks? Last but not least.. and this is what i really would like to accomplish, i would to be able to monitor the latency of metadata operations. 
In my environment there are users that litterally overhelm our storages with metadata request, so even if there is no massive throughput or huge waiters, any "ls" could take ages. I would like to be able to monitor metadata behaviour. There is a way to to do that from the NSD servers? Thanks in advance for any tip/help. Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Wed Sep 3 21:55:14 2014 From: chekh at stanford.edu (Alex Chekholko) Date: Wed, 03 Sep 2014 13:55:14 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54074F90.7000303@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> Message-ID: <54078032.2050605@stanford.edu> The usual way to do that is to re-architect your filesystem so that the system pool is metadata-only, and then you can just look at the storage layer and see total metadata throughput that way. Otherwise your metadata ops are mixed in with your data ops. Of course, both NSDs and clients also have metadata caches. On 09/03/2014 10:27 AM, Salvatore Di Nardo wrote: > > Last but not least.. and this is what i really would like to accomplish, > i would to be able to monitor the latency of metadata operations. > In my environment there are users that litterally overhelm our storages > with metadata request, so even if there is no massive throughput or huge > waiters, any "ls" could take ages. I would like to be able to monitor > metadata behaviour. There is a way to to do that from the NSD servers? -- Alex Chekholko chekh at stanford.edu 347-401-4860 From oehmes at us.ibm.com Thu Sep 4 01:50:25 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Wed, 3 Sep 2014 17:50:25 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54074F90.7000303@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> Message-ID: > Hello everybody, Hi > here i come here again, this time to ask some hint about how to monitor GPFS. > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > that they return number based only on the request done in the > current host, so i have to run them on all the clients ( over 600 > nodes) so its quite unpractical. Instead i would like to know from > the servers whats going on, and i came across the vio_s statistics > wich are less documented and i dont know exacly what they mean. > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > runs VIO_S. > > My problems with the output of this command: > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > timestamp: 1409763206/477366 > recovery group: * > declustered array: * > vdisk: * > client reads: 2584229 > client short writes: 55299693 > client medium writes: 190071 > client promoted full track writes: 465145 > client full track writes: 9249 > flushed update writes: 4187708 > flushed promoted full track writes: 123 > migrate operations: 114 > scrub operations: 450590 > log writes: 28509602 > > it sais "VIOPS per second", but they seem to me just counters as > every time i re-run the command, the numbers increase by a bit.. > Can anyone confirm if those numbers are counter or if they are OPS/sec. the numbers are accumulative so everytime you run them they just show the value since start (or last reset) time. > > On a closer eye about i dont understand what most of thosevalues > mean. For example, what exacly are "flushed promoted full track write" ?? 
> I tried to find a documentation about this output , but could not > find any. can anyone point me a link where output of vio_s is explained? > > Another thing i dont understand about those numbers is if they are > just operations, or the number of blocks that was read/write/etc . its just operations and if i would explain what the numbers mean i might confuse you even more because this is not what you are really looking for. what you are looking for is what the client io's look like on the Server side, while the VIO layer is the Server side to the disks, so one lever lower than what you are looking for from what i could read out of the description above. so the Layer you care about is the NSD Server layer, which sits on top of the VIO layer (which is essentially the SW RAID Layer in GNR) > I'm asking that because if they are just ops, i don't know how much > they could be usefull. For example one write operation could eman > write 1 block or write a file of 100GB. If those are oprations, > there is a way to have the oupunt in bytes or blocks? there are multiple ways to get infos on the NSD layer, one would be to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts again. the alternative option is to use mmdiag --iohist. this shows you a history of the last X numbers of io operations on either the client or the server side like on a client : # mmdiag --iohist === mmdiag: iohist === I/O history: I/O start time RW Buf type disk:sectorNum nSec time ms qTime ms RpcTimes ms Type Device/NSD ID NSD server --------------- -- ----------- ----------------- ----- ------- -------- ----------------- ---- ------------------ --------------- 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.668262 R inode 2:1081373696 8 14.117 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.692019 R inode 2:1064356608 8 14.899 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 14:25:53.707100 R inode 2:1077830152 8 16.499 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.906556 R inode 2:1083476520 8 11.723 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.941441 R inode 2:1069885984 8 11.686 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 0.061 0.321 cli 
C0A70402:53BEEA5E 192.167.4.2 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 you basically see if its a inode , data block , what size it has (in sectors) , which nsd server you did send this request to, etc. on the Server side you see the type , which physical disk it goes to and also what size of disk i/o it causes like : 14:26:50.129995 R inode 12:3211886376 64 14.261 0.000 0.000 0.000 pd sdis 14:26:50.137102 R inode 19:3003969520 64 9.004 0.000 0.000 0.000 pd sdad 14:26:50.136116 R inode 55:3591710992 64 11.057 0.000 0.000 0.000 pd sdoh 14:26:50.141510 R inode 21:3066810504 64 5.909 0.000 0.000 0.000 pd sdaf 14:26:50.130529 R inode 89:2962370072 64 17.437 0.000 0.000 0.000 pd sddi 14:26:50.131063 R inode 78:1889457000 64 17.062 0.000 0.000 0.000 pd sdsj 14:26:50.143403 R inode 36:3323035688 64 4.807 0.000 0.000 0.000 pd sdmw 14:26:50.131044 R inode 37:2513579736 128 17.181 0.000 0.000 0.000 pd sddv 14:26:50.138181 R inode 72:3868810400 64 10.951 0.000 0.000 0.000 pd sdbz 14:26:50.138188 R inode 131:2443484784 128 11.792 0.000 0.000 0.000 pd sdug 14:26:50.138003 R inode 102:3696843872 64 11.994 0.000 0.000 0.000 pd sdgp 14:26:50.137099 R inode 145:3370922504 64 13.225 0.000 0.000 0.000 pd sdmi 14:26:50.141576 R inode 62:2668579904 64 9.313 0.000 0.000 0.000 pd sdou 14:26:50.134689 R inode 159:2786164648 64 16.577 0.000 0.000 0.000 pd sdpq 14:26:50.145034 R inode 34:2097217320 64 7.409 0.000 0.000 0.000 pd sdmt 14:26:50.138140 R inode 139:2831038792 64 14.898 0.000 0.000 0.000 pd sdlw 14:26:50.130954 R inode 164:282120312 64 22.274 0.000 0.000 0.000 pd sdzd 14:26:50.137038 R inode 41:3421909608 64 16.314 0.000 0.000 0.000 pd sdef 14:26:50.137606 R inode 104:1870962416 64 16.644 0.000 0.000 0.000 pd sdgx 14:26:50.141306 R inode 65:2276184264 64 16.593 0.000 0.000 0.000 pd sdrk > > Last but not least.. and this is what i really would like to > accomplish, i would to be able to monitor the latency of metadata operations. you can't do this on the server side as you don't know how much time you spend on the client , network or anything between the app and the physical disk, so you can only reliably look at this from the client, the iohist output only shows you the Server disk i/o processing time, but that can be a fraction of the overall time (in other cases this obviously can also be the dominant part depending on your workload). the easiest way on the client is to run mmfsadm vfsstats enable from now on vfs stats are collected until you restart GPFS. then run : vfs statistics currently enabled started at: Fri Aug 29 13:15:05.380 2014 duration: 448446.970 sec name calls time per call total time -------------------- -------- -------------- -------------- statfs 9 0.000002 0.000021 startIO 246191176 0.005853 1441049.976740 to dump what ever you collected so far on this node. > In my environment there are users that litterally overhelm our > storages with metadata request, so even if there is no massive > throughput or huge waiters, any "ls" could take ages. I would like > to be able to monitor metadata behaviour. There is a way to to do > that from the NSD servers? not this simple as described above. > > Thanks in advance for any tip/help. > > Regards, > Salvatore_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From service at metamodul.com Thu Sep 4 11:05:18 2014 From: service at metamodul.com (service at metamodul.com) Date: Thu, 4 Sep 2014 12:05:18 +0200 (CEST) Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54074F90.7000303@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> Message-ID: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> > , any "ls" could take ages. Check if you large directories either with many files or simply large. Verify if you have NFS exported GPFS. Verify that your cache settings on the clients are large enough ( maxStatCache , maxFilesToCache , sharedMemLimit ) Verify that you have dedicated metadata luns ( metadataOnly ) Reference: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters Note: If possible monitor your metadata luns on the storage directly. hth Hajo -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Sep 4 11:43:36 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 11:43:36 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: References: <54074F90.7000303@ebi.ac.uk> Message-ID: <54084258.90508@ebi.ac.uk> On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just show > the value since start (or last reset) time. OK, you confirmed my toughts, thatks > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track > write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. 
> what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. No.. what I'm looking its exactly how the disks are busy to keep the requests. Obviously i'm not looking just that, but I feel the needs to monitor _*also*_ those things. Ill explain you why. It happens when our storage is quite busy ( 180Gb/s of read/write ) that the FS start to be slowin normal /*cd*/ or /*ls*/ requests. This might be normal, but in those situation i want to know where the bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing where the bottlenek is might help me to understand if we can tweak the system a bit more. If its the CPU on the servers then there is no much to do beside replacing or add more servers.If its not the CPU, maybe more memory would help? Maybe its just the network that filled up? so i can add more links Or if we reached the point there the bottleneck its the spindles, then there is no much point o look somethere else, we just reached the hardware limit.. Sometimes, it also happens that there is very low IO (10Gb/s ), almost no cpu usage on the servers but huge slownes ( ls can take 10 seconds). Why that happens? There is not much data ops , but we think there is a huge ammount of metadata ops. So what i want to know is if the metadata vdisks are busy or not. If this is our problem, could some SSD disks dedicated to metadata help? In particular im, a bit puzzled with the design of our GSS storage. Each recovery groups have 3 declustered arrays, and each declustered aray have 1 data and 1 metadata vdisk, but in the end both metadata and data vdisks use the same spindles. The problem that, its that I dont understand if we have a metadata bottleneck there. Maybe some SSD disks in a dedicated declustered array would perform much better, but this is just theory. I really would like to be able to monitor IO activities on the metadata vdisks. > > > so the Layer you care about is the NSD Server layer, which sits on top > of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be to > use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts > again. Counters its not a problem. I can collect them and create some graphs in a monitoring tool. I will check that. > > the alternative option is to use mmdiag --iohist. 
this shows you a > history of the last X numbers of io operations on either the client or > the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms qTime > ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 > 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 > 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 > 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 > 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 > 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 > 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 > 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 > 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. 
> > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > mmdiag --iohist its another think i looked at it, but i could not find good explanation for all the "buf type" ( third column ) allocSeg data iallocSeg indBlock inode LLIndBlock logData logDesc logWrap metadata vdiskAULog vdiskBuf vdiskFWLog vdiskMDLog vdiskMeta vdiskRGDesc If i want to monifor metadata operation whan should i look at? just the metadata flag or also inode? this command takes also long to run, especially if i run it a second time it hangs for a lot before to rerun again, so i'm not sure that run it every 30secs or minute its viable, but i will look also into that. THere is any documentation that descibes clearly the whole output? what i found its quite generic and don't go into details... > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and the > physical disk, so you can only reliably look at this from the client, > the iohist output only shows you the Server disk i/o processing time, > but that can be a fraction of the overall time (in other cases this > obviously can also be the dominant part depending on your workload). > > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. 
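A minimal sketch of the kind of periodic summary discussed above, assuming the mmdiag --iohist column layout shown in the quoted examples (column 3 is the buffer type, column 6 the service time in milliseconds); this is only an illustration, not an official tool:

#!/bin/bash
# Hedged sketch: summarise the recent I/O history on an NSD server by buffer
# type, assuming column 3 = buf type and column 6 = service time in ms as in
# the output quoted above.
/usr/lpp/mmfs/bin/mmdiag --iohist | awk '
    /^[0-9]+:[0-9]+:[0-9]+/ {          # history rows start with a timestamp
        count[$3]++                    # operations per buffer type
        tot[$3] += $6                  # accumulated service time in ms
    }
    END {
        printf "%-12s %10s %12s\n", "buf type", "ops", "avg ms"
        for (t in count)
            printf "%-12s %10d %12.3f\n", t, count[t], tot[t] / count[t]
    }'

Counting the inode, metadata, indBlock and LLIndBlock rows together gives a rough view of the metadata load, while the data rows approximate the streaming traffic.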
> We already do that, but as I said, I want to check specifically how gss servers are keeping the requests to identify or exlude server side bottlenecks. Thanks for your help, you gave me definitely few things where to look at. Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Sep 4 11:58:51 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 11:58:51 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> References: <54074F90.7000303@ebi.ac.uk> <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> Message-ID: <540845EB.1020202@ebi.ac.uk> Little clarification, the filsystemn its not always slow. It happens that became very slow with particular users jobs in the farm. Maybe its just an indication thant we have huge ammount of metadata requestes, that's why i want to be able to monitor them On 04/09/14 11:05, service at metamodul.com wrote: > > , any "ls" could take ages. > Check if you large directories either with many files or simply large. it happens that the files are very large ( over 100G), but there usually ther are no many files. > Verify if you have NFS exported GPFS. No NFS > Verify that your cache settings on the clients are large enough ( > maxStatCache , maxFilesToCache , sharedMemLimit ) will look at them, but i'm not sure that the best number will be on the client. Obviously i cannot use all the memory of the client because those blients are meant to run jobs.... > Verify that you have dedicated metadata luns ( metadataOnly ) Yes, we have dedicate vdisks for metadata, but they are in the same declustered arrays/recoverygroups, so they whare the same spindles > Reference: > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters > > Note: > If possible monitor your metadata luns on the storage directly. that?s exactly than I'm trying to do !!!! :-D > hth > Hajo > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Thu Sep 4 13:04:21 2014 From: service at metamodul.com (service at metamodul.com) Date: Thu, 4 Sep 2014 14:04:21 +0200 (CEST) Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540845EB.1020202@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> <540845EB.1020202@ebi.ac.uk> Message-ID: <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de> ... , any "ls" could take ages. >Check if you large directories either with many files or simply large. >> it happens that the files are very large ( over 100G), but there usually >> ther are no many files. >>> Please check that the directory size is not large. In a worst case you have a directory with 10MB in size but it contains only one file. In any way GPFS must fetch the whole directory structure might causing unnecassery IO. Thus my request that you check your directory sizes. >Verify that your cache settings on the clients are large enough ( maxStatCache >, maxFilesToCache , sharedMemLimit ) >>will look at them, but i'm not sure that the best number will be on the >>client. 
Obviously i cannot use all the memory of the client because those >>blients are meant to run jobs.... Use lsof on the client to determine the amount of open filese. mmdiag --stats ( >From my memory ) shows a little bit about the cache usage. maxStatCache does not use that much memory. > Verify that you have dedicated metadata luns ( metadataOnly ) >> Yes, we have dedicate vdisks for metadata, but they are in the same >> declustered arrays/recoverygroups, so they whare the same spindles Thats imho not a good approach. Metadata operation are small and random, data io is large and streaming. Just think you have a highway full of large trucks and you try to get with a high speed bike to your destination. You will be blocked. The same problem you have at your destiation. If many large trucks would like to get their stuff off there is no time for somebody with a small parcel. Thats the same reason why you should not access tape storage and disk storage via the same FC adapter. ( Streaming IO version v. random/small IO ) So even without your current problem and motivation for measureing i would strongly suggest to have at least dediacted SSD for metadata and if possible even dedicated NSD server for the metadata. Meaning have a dedicated path for your data and a dedicated path for your metadata. All from a users point of view Hajo -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Sep 4 14:25:09 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 14:25:09 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de> References: <54074F90.7000303@ebi.ac.uk> <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> <540845EB.1020202@ebi.ac.uk> <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de> Message-ID: <54086835.6050603@ebi.ac.uk> > >> Yes, we have dedicate vdisks for metadata, but they are in the same > declustered arrays/recoverygroups, so they whare the same spindles > > Thats imho not a good approach. Metadata operation are small and > random, data io is large and streaming. > > Just think you have a highway full of large trucks and you try to get > with a high speed bike to your destination. You will be blocked. > The same problem you have at your destiation. If many large trucks > would like to get their stuff off there is no time for somebody with a > small parcel. > > Thats the same reason why you should not access tape storage and disk > storage via the same FC adapter. ( Streaming IO version v. > random/small IO ) > > So even without your current problem and motivation for measureing i > would strongly suggest to have at least dediacted SSD for metadata and > if possible even dedicated NSD server for the metadata. > Meaning have a dedicated path for your data and a dedicated path for > your metadata. > > All from a users point of view > Hajo > That's where i was puzzled too. GSS its a gpfs appliance and came configured this way. Also official GSS documentation suggest to create separate vdisks for data and meatadata, but in the same declustered arrays. I always felt this a strange choice, specially if we consider that metadata require a very small abbount of space, so few ssd could do the trick.... 
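Since the vio_s numbers are cumulative (as confirmed earlier in the thread), a per-second view has to be derived by sampling twice and taking the difference. A minimal sketch, assuming the plain-text vio_s output format shown at the start of the thread; the counter parsed ("client reads") and the 30-second interval are arbitrary choices:

#!/bin/bash
# Hedged sketch: derive a rate from the cumulative vio_s counters by taking
# two samples on a GSS/GNR server and diffing them.
INTERVAL=30
snap() {
    echo vio_s | /usr/lpp/mmfs/bin/mmpmon -r 1 | awk '/client reads:/ {print $NF}'
}
before=$(snap)
sleep "$INTERVAL"
after=$(snap)
echo "client reads/sec over the last ${INTERVAL}s: $(( (after - before) / INTERVAL ))"

The same sampling approach works for any of the other counters in the output, and the derived rates can then be collected by whatever graphing tool is already in place.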
From sdinardo at ebi.ac.uk Thu Sep 4 14:32:15 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 14:32:15 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: References: <54074F90.7000303@ebi.ac.uk> Message-ID: <540869DF.5060100@ebi.ac.uk> Sorry to bother you again but dstat have some issues with the plugin: [root at gss01a util]# dstat --gpfs /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) Module dstat_gpfs failed to load. (global name 'select' is not defined) None of the stats you selected are available. I found this solution , but involve dstat recompile.... https://github.com/dagwieers/dstat/issues/44 Are you aware about any easier solution (we use RHEL6.3) ? Regards, Salvatore On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just show > the value since start (or last reset) time. > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track > write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. > what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. > > so the Layer you care about is the NSD Server layer, which sits on top > of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. 
For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be to > use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts > again. > > the alternative option is to use mmdiag --iohist. this shows you a > history of the last X numbers of io operations on either the client or > the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms qTime > ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 > 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 > 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 > 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 > 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 > 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 > 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 > 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 > 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. 
> > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and the > physical disk, so you can only reliably look at this from the client, > the iohist output only shows you the Server disk i/o processing time, > but that can be a fraction of the overall time (in other cases this > obviously can also be the dominant part depending on your workload). > > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. > > > In my environment there are users that litterally overhelm our > > storages with metadata request, so even if there is no massive > > throughput or huge waiters, any "ls" could take ages. I would like > > to be able to monitor metadata behaviour. There is a way to to do > > that from the NSD servers? > > not this simple as described above. > > > > > Thanks in advance for any tip/help. 
> > > > Regards, > > Salvatore_______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at gpfsug.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Sep 4 14:54:37 2014 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 04 Sep 2014 14:54:37 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540869DF.5060100@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> Message-ID: <54086F1D.1000401@ed.ac.uk> On 04/09/14 14:32, Salvatore Di Nardo wrote: > Sorry to bother you again but dstat have some issues with the plugin: > > [root at gss01a util]# dstat --gpfs > /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is > deprecated. Use the subprocess module. > pipes[cmd] = os.popen3(cmd, 't', 0) > Module dstat_gpfs failed to load. (global name 'select' is not > defined) > None of the stats you selected are available. > > I found this solution , but involve dstat recompile.... > > https://github.com/dagwieers/dstat/issues/44 > > Are you aware about any easier solution (we use RHEL6.3) ? > This worked for me the other day on a dev box I was poking at: # rm /usr/share/dstat/dstat_gpfsops* # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 /usr/share/dstat/dstat_gpfsops.py # dstat --gpfsops /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- cr del op/cl rd wr trunc fsync looku gattr sattr other mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... > > Regards, > Salvatore > > On 04/09/14 01:50, Sven Oehme wrote: >> > Hello everybody, >> >> Hi >> >> > here i come here again, this time to ask some hint about how to >> monitor GPFS. >> > >> > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is >> > that they return number based only on the request done in the >> > current host, so i have to run them on all the clients ( over 600 >> > nodes) so its quite unpractical. Instead i would like to know from >> > the servers whats going on, and i came across the vio_s statistics >> > wich are less documented and i dont know exacly what they mean. >> > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that >> > runs VIO_S. >> > >> > My problems with the output of this command: >> > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 >> > >> > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second >> > timestamp: 1409763206/477366 >> > recovery group: * >> > declustered array: * >> > vdisk: * >> > client reads: 2584229 >> > client short writes: 55299693 >> > client medium writes: 190071 >> > client promoted full track writes: 465145 >> > client full track writes: 9249 >> > flushed update writes: 4187708 >> > flushed promoted full track writes: 123 >> > migrate operations: 114 >> > scrub operations: 450590 >> > log writes: 28509602 >> > >> > it sais "VIOPS per second", but they seem to me just counters as >> > every time i re-run the command, the numbers increase by a bit.. 
>> > Can anyone confirm if those numbers are counter or if they are OPS/sec. >> >> the numbers are accumulative so everytime you run them they just show >> the value since start (or last reset) time. >> >> > >> > On a closer eye about i dont understand what most of thosevalues >> > mean. For example, what exacly are "flushed promoted full track >> write" ?? >> > I tried to find a documentation about this output , but could not >> > find any. can anyone point me a link where output of vio_s is explained? >> > >> > Another thing i dont understand about those numbers is if they are >> > just operations, or the number of blocks that was read/write/etc . >> >> its just operations and if i would explain what the numbers mean i >> might confuse you even more because this is not what you are really >> looking for. >> what you are looking for is what the client io's look like on the >> Server side, while the VIO layer is the Server side to the disks, so >> one lever lower than what you are looking for from what i could read >> out of the description above. >> >> so the Layer you care about is the NSD Server layer, which sits on top >> of the VIO layer (which is essentially the SW RAID Layer in GNR) >> >> > I'm asking that because if they are just ops, i don't know how much >> > they could be usefull. For example one write operation could eman >> > write 1 block or write a file of 100GB. If those are oprations, >> > there is a way to have the oupunt in bytes or blocks? >> >> there are multiple ways to get infos on the NSD layer, one would be to >> use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts >> again. >> >> the alternative option is to use mmdiag --iohist. this shows you a >> history of the last X numbers of io operations on either the client or >> the server side like on a client : >> >> # mmdiag --iohist >> >> === mmdiag: iohist === >> >> I/O history: >> >> I/O start time RW Buf type disk:sectorNum nSec time ms qTime >> ms RpcTimes ms Type Device/NSD ID NSD server >> --------------- -- ----------- ----------------- ----- ------- >> -------- ----------------- ---- ------------------ --------------- >> 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 >> 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 >> 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 >> 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.668262 R inode 2:1081373696 8 14.117 >> 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 >> 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.692019 R inode 2:1064356608 8 14.899 >> 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:53.707100 R inode 2:1077830152 8 16.499 >> 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 >> 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 >> 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 >> 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 >> 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.906556 R inode 2:1083476520 8 11.723 >> 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 >> 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 >> 
14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 >> 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 >> 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.941441 R inode 2:1069885984 8 11.686 >> 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 >> 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 >> 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 >> 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 >> 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 >> >> you basically see if its a inode , data block , what size it has (in >> sectors) , which nsd server you did send this request to, etc. >> >> on the Server side you see the type , which physical disk it goes to >> and also what size of disk i/o it causes like : >> >> 14:26:50.129995 R inode 12:3211886376 64 14.261 >> 0.000 0.000 0.000 pd sdis >> 14:26:50.137102 R inode 19:3003969520 64 9.004 >> 0.000 0.000 0.000 pd sdad >> 14:26:50.136116 R inode 55:3591710992 64 11.057 >> 0.000 0.000 0.000 pd sdoh >> 14:26:50.141510 R inode 21:3066810504 64 5.909 >> 0.000 0.000 0.000 pd sdaf >> 14:26:50.130529 R inode 89:2962370072 64 17.437 >> 0.000 0.000 0.000 pd sddi >> 14:26:50.131063 R inode 78:1889457000 64 17.062 >> 0.000 0.000 0.000 pd sdsj >> 14:26:50.143403 R inode 36:3323035688 64 4.807 >> 0.000 0.000 0.000 pd sdmw >> 14:26:50.131044 R inode 37:2513579736 128 17.181 >> 0.000 0.000 0.000 pd sddv >> 14:26:50.138181 R inode 72:3868810400 64 10.951 >> 0.000 0.000 0.000 pd sdbz >> 14:26:50.138188 R inode 131:2443484784 128 11.792 >> 0.000 0.000 0.000 pd sdug >> 14:26:50.138003 R inode 102:3696843872 64 11.994 >> 0.000 0.000 0.000 pd sdgp >> 14:26:50.137099 R inode 145:3370922504 64 13.225 >> 0.000 0.000 0.000 pd sdmi >> 14:26:50.141576 R inode 62:2668579904 64 9.313 >> 0.000 0.000 0.000 pd sdou >> 14:26:50.134689 R inode 159:2786164648 64 16.577 >> 0.000 0.000 0.000 pd sdpq >> 14:26:50.145034 R inode 34:2097217320 64 7.409 >> 0.000 0.000 0.000 pd sdmt >> 14:26:50.138140 R inode 139:2831038792 64 14.898 >> 0.000 0.000 0.000 pd sdlw >> 14:26:50.130954 R inode 164:282120312 64 22.274 >> 0.000 0.000 0.000 pd sdzd >> 14:26:50.137038 R inode 41:3421909608 64 16.314 >> 0.000 0.000 0.000 pd sdef >> 14:26:50.137606 R inode 104:1870962416 64 16.644 >> 0.000 0.000 0.000 pd sdgx >> 14:26:50.141306 R inode 65:2276184264 64 16.593 >> 0.000 0.000 0.000 pd sdrk >> >> >> > >> > Last but not least.. and this is what i really would like to >> > accomplish, i would to be able to monitor the latency of metadata >> operations. >> >> you can't do this on the server side as you don't know how much time >> you spend on the client , network or anything between the app and the >> physical disk, so you can only reliably look at this from the client, >> the iohist output only shows you the Server disk i/o processing time, >> but that can be a fraction of the overall time (in other cases this >> obviously can also be the dominant part depending on your workload). >> >> the easiest way on the client is to run >> >> mmfsadm vfsstats enable >> from now on vfs stats are collected until you restart GPFS. 
>> >> then run : >> >> vfs statistics currently enabled >> started at: Fri Aug 29 13:15:05.380 2014 >> duration: 448446.970 sec >> >> name calls time per call total time >> -------------------- -------- -------------- -------------- >> statfs 9 0.000002 0.000021 >> startIO 246191176 0.005853 1441049.976740 >> >> to dump what ever you collected so far on this node. >> >> > In my environment there are users that litterally overhelm our >> > storages with metadata request, so even if there is no massive >> > throughput or huge waiters, any "ls" could take ages. I would like >> > to be able to monitor metadata behaviour. There is a way to to do >> > that from the NSD servers? >> >> not this simple as described above. >> >> > >> > Thanks in advance for any tip/help. >> > >> > Regards, >> > Salvatore_______________________________________________ >> > gpfsug-discuss mailing list >> > gpfsug-discuss at gpfsug.org >> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Research Facilities (ECDF) Systems Leader Information Services IT Infrastructure Division Tel: 0131 650 4994 skype: orlando.richards The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From sdinardo at ebi.ac.uk Thu Sep 4 15:07:42 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 15:07:42 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54086F1D.1000401@ed.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> Message-ID: <5408722E.6060309@ebi.ac.uk> On 04/09/14 14:54, Orlando Richards wrote: > > > On 04/09/14 14:32, Salvatore Di Nardo wrote: >> Sorry to bother you again but dstat have some issues with the plugin: >> >> [root at gss01a util]# dstat --gpfs >> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is >> deprecated. Use the subprocess module. >> pipes[cmd] = os.popen3(cmd, 't', 0) >> Module dstat_gpfs failed to load. (global name 'select' is not >> defined) >> None of the stats you selected are available. >> >> I found this solution , but involve dstat recompile.... >> >> https://github.com/dagwieers/dstat/issues/44 >> >> Are you aware about any easier solution (we use RHEL6.3) ? >> > > This worked for me the other day on a dev box I was poking at: > > # rm /usr/share/dstat/dstat_gpfsops* > > # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 > /usr/share/dstat/dstat_gpfsops.py > > # dstat --gpfsops > /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use > the subprocess module. > pipes[cmd] = os.popen3(cmd, 't', 0) > ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- > > cr del op/cl rd wr trunc fsync looku gattr sattr other > mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w > 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 > > ... > NICE!! 
The only problem is that the box seems lacking those python scripts: ls /usr/lpp/mmfs/samples/util/ makefile README tsbackup tsbackup.C tsbackup.h tsfindinode tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c tslistall tsreaddir tsreaddir.c tstimes tstimes.c Do you mind sending me those py files? They should be 3 as i see e gpfs options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Sep 4 15:14:02 2014 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 04 Sep 2014 15:14:02 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <5408722E.6060309@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> <5408722E.6060309@ebi.ac.uk> Message-ID: <540873AA.5070401@ed.ac.uk> On 04/09/14 15:07, Salvatore Di Nardo wrote: > > On 04/09/14 14:54, Orlando Richards wrote: >> >> >> On 04/09/14 14:32, Salvatore Di Nardo wrote: >>> Sorry to bother you again but dstat have some issues with the plugin: >>> >>> [root at gss01a util]# dstat --gpfs >>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is >>> deprecated. Use the subprocess module. >>> pipes[cmd] = os.popen3(cmd, 't', 0) >>> Module dstat_gpfs failed to load. (global name 'select' is not >>> defined) >>> None of the stats you selected are available. >>> >>> I found this solution , but involve dstat recompile.... >>> >>> https://github.com/dagwieers/dstat/issues/44 >>> >>> Are you aware about any easier solution (we use RHEL6.3) ? >>> >> >> This worked for me the other day on a dev box I was poking at: >> >> # rm /usr/share/dstat/dstat_gpfsops* >> >> # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 >> /usr/share/dstat/dstat_gpfsops.py >> >> # dstat --gpfsops >> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use >> the subprocess module. >> pipes[cmd] = os.popen3(cmd, 't', 0) >> ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- >> >> cr del op/cl rd wr trunc fsync looku gattr sattr other >> mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w >> 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 >> >> ... >> > > NICE!! The only problem is that the box seems lacking those python scripts: > > ls /usr/lpp/mmfs/samples/util/ > makefile README tsbackup tsbackup.C tsbackup.h tsfindinode > tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c > tslistall tsreaddir tsreaddir.c tstimes tstimes.c > It came from the gpfs.base rpm: # rpm -qf /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 gpfs.base-3.5.0-13.x86_64 > Do you mind sending me those py files? They should be 3 as i see e gpfs > options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) > Only the gpfsops.py is included in the bundle - one for dstat 0.6 and one for dstat 0.7. I've attached it to this mail as well (it seems to be GPL'd). > Regards, > Salvatore > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Research Facilities (ECDF) Systems Leader Information Services IT Infrastructure Division Tel: 0131 650 4994 skype: orlando.richards The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
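For anyone who cannot go the dstat route at all, the cumulative counters that mmpmon returns can still be turned into per-second rates by sampling them twice and taking the difference. A minimal sketch, assuming the plain-text vio_s output format pasted earlier in this thread ("client reads: 2584229" and so on, which may differ between GPFS releases), Python 2 as shipped with RHEL 6, and root access on the NSD/GSS server:

#!/usr/bin/env python
# Minimal sketch: sample the cumulative "vio_s" counters twice and print
# the difference as operations per second. Parsing is based on the
# plain-text output shown earlier in this thread; field names are not
# guaranteed to be stable across GPFS releases.
import subprocess
import time

MMPMON = '/usr/lpp/mmfs/bin/mmpmon'
INTERVAL = 10   # seconds between the two samples

def sample():
    # equivalent of:  echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1
    p = subprocess.Popen([MMPMON, '-r', '1'],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out = p.communicate('vio_s\n')[0]
    counters = {}
    for line in out.splitlines():
        if ':' not in line:
            continue
        name, _, value = line.rpartition(':')
        value = value.strip()
        if value.isdigit():          # skips the timestamp, "*" fields, etc.
            counters[name.strip()] = int(value)
    return counters

before = sample()
time.sleep(INTERVAL)
after = sample()
for name in sorted(after):
    rate = (after[name] - before.get(name, 0)) / float(INTERVAL)
    print '%-40s %12.1f ops/sec' % (name, rate)

The same differencing should work for the fs_io_s and io_s counters on the client side, since those are cumulative in the same way.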
-------------- next part -------------- # # Copyright (C) 2009, 2010 IBM Corporation # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2, or (at your option) # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. # global string, select, os, re, fnmatch import string, select, os, re, fnmatch # Dstat class to display selected gpfs performance counters returned by the # mmpmon "vfs_s", "ioc_s", "vio_s", "vflush_s", and "lroc_s" commands. # # The set of counters displayed can be customized via environment variables: # # DSTAT_GPFS_WHAT # # Selects which of the five mmpmon commands to display. # It is a comma separated list of any of the following: # "vfs": show mmpmon "vfs_s" counters # "ioc": show mmpmon "ioc_s" counters related to NSD client I/O # "nsd": show mmpmon "ioc_s" counters related to NSD server I/O # "vio": show mmpmon "vio_s" counters # "vflush": show mmpmon "vflush_s" counters # "lroc": show mmpmon "lroc_s" counters # "all": equivalent to specifying all of the above # # Example: # # DSTAT_GPFS_WHAT=vfs,lroc dstat -M gpfsops # # will display counters for mmpmon "vfs_s" and "lroc" commands. # # The default setting is "vfs,ioc", i.e., by default only "vfs_s" and NSD # client related "ioc_s" counters are displayed. # # DSTAT_GPFS_VFS # DSTAT_GPFS_IOC # DSTAT_GPFS_VIO # DSTAT_GPFS_VFLUSH # DSTAT_GPFS_LROC # # Allow finer grain control over exactly which values will be displayed for # each of the five mmpmon commands. Each variable is a comma separated list # of counter names with optional column header string. # # Example: # # export DSTAT_GPFS_VFS='create, remove, rd/wr=read+write' # export DSTAT_GPFS_IOC='sync*, logwrap*, oth_rd=*_rd, oth_wr=*_wr' # dstat -M gpfsops # # Under "vfs-ops" this will display three columns, showing creates, deletes # (removes), and a third column labelled "rd/wr" with a combined count of # read and write operations. # Under "disk-i/o" it will display four columns, showing all disk I/Os # initiated by sync, and log wrap, plus two columns labeled "oth_rd" and # "oth_wr" showing counts of all other disk reads and disk writes, # respectively. # # Note: setting one of these environment variables overrides the # corrosponding setting in DSTAT_GPFS_WHAT. For example, setting # DSTAT_GPFS_VFS="" will omit all "vfs_s" counters regardless of whether # "vfs" appears in DSTAT_GPFS_WHAT or not. # # Counter sets are specified as a comma-separated list of entries of one # of the following forms # # counter # label = counter # label = counter1 + counter2 + ... # # If no label is specified, the name of the counter is used as the column # header (truncated to 5 characters). # Counter names may contain shell-style wildcards. For example, the # pattern "sync*" matches the two ioc_s counters "sync_rd" and "sync_wr" and # therefore produce a column containing the combined count of disk reads and # disk writes initiated by sync. 
If a counter appears in or matches a name # pattern in more than one entry, it is included only in the count under the # first entry in which it appears. For example, adding an entry "other = *" # at the end of the list will add a column labeled "other" that shows the # sum of all counter values *not* included in any of the previous columns. # # DSTAT_GPFS_LIST=1 dstat -M gpfsops # # This will show all available counter names and the default definition # for which sets of counter values are displayed. # # An alternative to setting environment variables is to create a file # ~/.dstat_gpfs_rc # with python statements that sets any of the following variables # vfs_wanted: equivalent to setting DSTAT_GPFS_VFS # ioc_wanted: equivalent to setting DSTAT_GPFS_IOC # vio_wanted: equivalent to setting DSTAT_GPFS_VIO # vflush_wanted: equivalent to setting DSTAT_GPFS_VFLUSH # lroc_wanted: equivalent to setting DSTAT_GPFS_LROC # # For example, the following ~/.dstat_gpfs_rc file will produce the same # result as the environment variables in the example above: # # vfs_wanted = 'create, remove, rd/wr=read+write' # ioc_wanted = 'sync*, logwrap*, oth_rd=*_rd, oth_wr=*_wr' # # See also the default vfs_wanted, ioc_wanted, and vio_wanted settings in # the dstat_gpfsops __init__ method below. class dstat_plugin(dstat): def __init__(self): # list of all stats counters returned by mmpmon "vfs_s", "ioc_s", "vio_s", "vflush_s", and "lroc_s" # always ignore the first few chars like : io_s _io_s_ _n_ 172.31.136.2 _nn_ mgmt001st001 _rc_ 0 _t_ 1322526286 _tu_ 415518 vfs_keys = ('_access_', '_close_', '_create_', '_fclear_', '_fsync_', '_fsync_range_', '_ftrunc_', '_getattr_', '_link_', '_lockctl_', '_lookup_', '_map_lloff_', '_mkdir_', '_mknod_', '_open_', '_read_', '_write_', '_mmapRead_', '_mmapWrite_', '_aioRead_', '_aioWrite_','_readdir_', '_readlink_', '_readpage_', '_remove_', '_rename_', '_rmdir_', '_setacl_', '_setattr_', '_symlink_', '_unmap_', '_writepage_', '_tsfattr_', '_tsfsattr_', '_flock_', '_setxattr_', '_getxattr_', '_listxattr_', '_removexattr_', '_encode_fh_', '_decode_fh_', '_get_dentry_', '_get_parent_', '_mount_', '_statfs_', '_sync_', '_vget_') ioc_keys = ('_other_rd_', '_other_wr_','_mb_rd_', '_mb_wr_', '_steal_rd_', '_steal_wr_', '_cleaner_rd_', '_cleaner_wr_', '_sync_rd_', '_sync_wr_', '_logwrap_rd_', '_logwrap_wr_', '_revoke_rd_', '_revoke_wr_', '_prefetch_rd_', '_prefetch_wr_', '_logdata_rd_', '_logdata_wr_', '_nsdworker_rd_', '_nsdworker_wr_','_nsdlocal_rd_','_nsdlocal_wr_', '_vdisk_rd_','_vdisk_wr_', '_pdisk_rd_','_pdisk_wr_', '_logtip_rd_', '_logtip_wr_') vio_keys = ('_r_', '_sw_', '_mw_', '_pfw_', '_ftw_', '_fuw_', '_fpw_', '_m_', '_s_', '_l_', '_rgd_', '_meta_') vflush_keys = ('_ndt_', '_ngdb_', '_nfwlmb_', '_nfipt_', '_nfwwt_', '_ahwm_', '_susp_', '_uwrttf_', '_fftc_', '_nalth_', '_nasth_', '_nsigth_', '_ntgtth_') lroc_keys = ('_Inode_s_', '_Inode_sf_', '_Inode_smb_', '_Inode_r_', '_Inode_rf_', '_Inode_rmb_', '_Inode_i_', '_Inode_imb_', '_Directory_s_', '_Directory_sf_', '_Directory_smb_', '_Directory_r_', '_Directory_rf_', '_Directory_rmb_', '_Directory_i_', '_Directory_imb_', '_Data_s_', '_Data_sf_', '_Data_smb_', '_Data_r_', '_Data_rf_', '_Data_rmb_', '_Data_i_', '_Data_imb_', '_agt_i_', '_agt_i_rm_', '_agt_i_rM_', '_agt_i_ra_', '_agt_r_', '_agt_r_rm_', '_agt_r_rM_', '_agt_r_ra_', '_ssd_w_', '_ssd_w_p_', '_ssd_w_rm_', '_ssd_w_rM_', '_ssd_w_ra_', '_ssd_r_', '_ssd_r_p_', '_ssd_r_rm_', '_ssd_r_rM_', '_ssd_r_ra_') # Default counters to display for each mmpmon category vfs_wanted = '''cr 
= create + mkdir + link + symlink, del = remove + rmdir, op/cl = open + close + map_lloff + unmap, rd = read + readdir + readlink + mmapRead + readpage + aioRead + aioWrite, wr = write + mmapWrite + writepage, trunc = ftrunc + fclear, fsync = fsync + fsync_range, lookup, gattr = access + getattr + getxattr + getacl, sattr = setattr + setxattr + setacl, other = * ''' ioc_wanted1 = '''mb_rd, mb_wr, pref=prefetch_rd, wrbeh=prefetch_wr, steal*, cleaner*, sync*, revoke*, logwrap*, logdata*, oth_rd = other_rd, oth_wr = other_wr ''' ioc_wanted2 = '''rns_r=nsdworker_rd, rns_w=nsdworker_wr, lns_r=nsdlocal_rd, lns_w=nsdlocal_wr, vd_r=vdisk_rd, vd_w=vdisk_wr, pd_r=pdisk_rd, pd_w=pdisk_wr, ''' vio_wanted = '''ClRead=r, ClShWr=sw, ClMdWr=mw, ClPFTWr=pfw, ClFTWr=ftw, FlUpWr=fuw, FlPFTWr=fpw, Migrte=m, Scrub=s, LgWr=l, RGDsc=rgd, Meta=meta ''' vflush_wanted = '''DiTrk = ndt, DiBuf = ngdb, FwLog = nfwlmb, FinPr = nfipt, WraTh = nfwwt, HiWMa = ahwm, Suspd = susp, WrThF = uwrttf, Force = fftc, TrgTh = ntgtth, other = nalth + nasth + nsigth ''' lroc_wanted = '''StorS = Inode_s + Directory_s + Data_s, StorF = Inode_sf + Directory_sf + Data_sf, FetcS = Inode_r + Directory_r + Data_r, FetcF = Inode_rf + Directory_rf + Data_rf, InVAL = Inode_i + Directory_i + Data_i ''' # Coarse counter selection via DSTAT_GPFS_WHAT if 'DSTAT_GPFS_WHAT' in os.environ: what_wanted = os.environ['DSTAT_GPFS_WHAT'].split(',') else: what_wanted = [ 'vfs', 'ioc' ] # If ".dstat_gpfs_rc" exists in user's home directory, run it. # Otherwise, use DSTAT_GPFS_WHAT for counter selection and look for other # DSTAT_GPFS_XXX environment variables for additional customization. userprofile = os.path.join(os.environ['HOME'], '.dstat_gpfs_rc') if os.path.exists(userprofile): ioc_wanted = ioc_wanted1 + ioc_wanted2 exec file(userprofile) else: if 'all' not in what_wanted: if 'vfs' not in what_wanted: vfs_wanted = '' if 'ioc' not in what_wanted: ioc_wanted1 = '' if 'nsd' not in what_wanted: ioc_wanted2 = '' if 'vio' not in what_wanted: vio_wanted = '' if 'vflush' not in what_wanted: vflush_wanted = '' if 'lroc' not in what_wanted: lroc_wanted = '' ioc_wanted = ioc_wanted1 + ioc_wanted2 # Fine grain counter cusomization via DSTAT_GPFS_XXX if 'DSTAT_GPFS_VFS' in os.environ: vfs_wanted = os.environ['DSTAT_GPFS_VFS'] if 'DSTAT_GPFS_IOC' in os.environ: ioc_wanted = os.environ['DSTAT_GPFS_IOC'] if 'DSTAT_GPFS_VIO' in os.environ: vio_wanted = os.environ['DSTAT_GPFS_VIO'] if 'DSTAT_GPFS_VFLUSH' in os.environ: vflush_wanted = os.environ['DSTAT_GPFS_VFLUSH'] if 'DSTAT_GPFS_LROC' in os.environ: lroc_wanted = os.environ['DSTAT_GPFS_LROC'] self.debug = 0 vars1, nick1, keymap1 = self.make_keymap(vfs_keys, vfs_wanted, 'gpfs-vfs-') vars2, nick2, keymap2 = self.make_keymap(ioc_keys, ioc_wanted, 'gpfs-io-') vars3, nick3, keymap3 = self.make_keymap(vio_keys, vio_wanted, 'gpfs-vio-') vars4, nick4, keymap4 = self.make_keymap(vflush_keys, vflush_wanted, 'gpfs-vflush-') vars5, nick5, keymap5 = self.make_keymap(lroc_keys, lroc_wanted, 'gpfs-lroc-') if 'DSTAT_GPFS_LIST' in os.environ or self.debug: self.show_keymap('vfs_s', 'DSTAT_GPFS_VFS', vfs_keys, vfs_wanted, vars1, keymap1, 'gpfs-vfs-') self.show_keymap('ioc_s', 'DSTAT_GPFS_IOC', ioc_keys, ioc_wanted, vars2, keymap2, 'gpfs-io-') self.show_keymap('vio_s', 'DSTAT_GPFS_VIO', vio_keys, vio_wanted, vars3, keymap3, 'gpfs-vio-') self.show_keymap('vflush_stat', 'DSTAT_GPFS_VFLUSH', vflush_keys, vflush_wanted, vars4, keymap4, 'gpfs-vflush-') self.show_keymap('lroc_s', 'DSTAT_GPFS_LROC', lroc_keys, lroc_wanted, vars5, keymap5, 
'gpfs-lroc-') print self.vars = vars1 + vars2 + vars3 + vars4 + vars5 self.varsrate = vars1 + vars2 + vars3 + vars5 self.varsconst = vars4 self.nick = nick1 + nick2 + nick3 + nick4 + nick5 self.vfs_keymap = keymap1 self.ioc_keymap = keymap2 self.vio_keymap = keymap3 self.vflush_keymap = keymap4 self.lroc_keymap = keymap5 names = [] self.addtitle(names, 'gpfs vfs ops', len(vars1)) self.addtitle(names, 'gpfs disk i/o', len(vars2)) self.addtitle(names, 'gpfs vio', len(vars3)) self.addtitle(names, 'gpfs vflush', len(vars4)) self.addtitle(names, 'gpfs lroc', len(vars5)) self.name = '#'.join(names) self.type = 'd' self.width = 5 self.scale = 1000 def make_keymap(self, keys, wanted, prefix): '''Parse the list of counter values to be displayd "keys" is the list of all available counters "wanted" is a string of the form "name1 = key1 + key2 + ..., name2 = key3 + key4 ..." Returns a list of all names found, e.g. ['name1', 'name2', ...], and a dictionary that maps counters to names, e.g., { 'key1': 'name1', 'key2': 'name1', 'key3': 'name2', ... }, ''' vars = [] nick = [] kmap = {} ## print re.split(r'\s*,\s*', wanted.strip()) for n in re.split(r'\s*,\s*', wanted.strip()): l = re.split(r'\s*=\s*', n, 2) if len(l) == 2: v = l[0] kl = re.split(r'\s*\+\s*', l[1]) elif l[0]: v = l[0].strip('*') kl = l else: continue nick.append(v[0:5]) v = prefix + v.replace('/', '-') vars.append(v) for s in kl: for k in keys: if fnmatch.fnmatch(k.strip('_'), s) and k not in kmap: kmap[k] = v return vars, nick, kmap def show_keymap(self, label, envname, keys, wanted, vars, kmap, prefix): 'show available counter names and current counter set definition' linewd = 100 print '\nAvailable counters for "%s":' % label mlen = max([len(k.strip('_')) for k in keys]) ncols = linewd // (mlen + 1) nrows = (len(keys) + ncols - 1) // ncols for r in range(nrows): print ' ', for c in range(ncols): i = c *nrows + r if not i < len(keys): break print keys[i].strip('_').ljust(mlen), print print '\nCurrent counter set selection:' print "\n%s='%s'\n" % (envname, re.sub(r'\s+', '', wanted).strip().replace(',', ', ')) if not vars: return mlen = 5 for v in vars: if v.startswith(prefix): s = v[len(prefix):] else: s = v n = ' %s = ' % s[0:mlen].rjust(mlen) kl = [ k.strip('_') for k in keys if kmap.get(k) == v ] i = 0 while i < len(kl): slen = len(n) + 3 + len(kl[i]) j = i + 1 while j < len(kl) and slen + 3 + len(kl[j]) < linewd: slen += 3 + len(kl[j]) j += 1 print n + ' + '.join(kl[i:j]) i = j n = ' %s + ' % ''.rjust(mlen) def addtitle(self, names, name, ncols): 'pad title given by "name" with minus signs to span "ncols" columns' if ncols == 1: names.append(name.split()[-1].center(6*ncols - 1)) elif ncols > 1: names.append(name.center(6*ncols - 1)) def check(self): 'start mmpmon command' if os.access('/usr/lpp/mmfs/bin/mmpmon', os.X_OK): try: self.stdin, self.stdout, self.stderr = dpopen('/usr/lpp/mmfs/bin/mmpmon -p -s') self.stdin.write('reset\n') readpipe(self.stdout) except IOError: raise Exception, 'Cannot interface with gpfs mmpmon binary' return True raise Exception, 'Needs GPFS mmpmon binary' def extract_vfs(self): 'collect "vfs_s" counter values' self.stdin.write('vfs_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, len(l), 3): try: self.set2[self.vfs_keymap[l[i]]] += long(l[i+1]) except KeyError: pass def extract_ioc(self): 'collect "ioc_s" counter values' self.stdin.write('ioc_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, 
len(l), 3): try: self.set2[self.ioc_keymap[l[i]+'rd_']] += long(l[i+1]) except KeyError: pass try: self.set2[self.ioc_keymap[l[i]+'wr_']] += long(l[i+2]) except KeyError: pass def extract_vio(self): 'collect "vio_s" counter values' self.stdin.write('vio_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(19, len(l), 2): try: if l[i] in self.vio_keymap: self.set2[self.vio_keymap[l[i]]] += long(l[i+1]) except KeyError: pass def extract_vflush(self): 'collect "vflush_stat" counter values' self.stdin.write('vflush_stat\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, len(l), 2): try: if l[i] in self.vflush_keymap: self.set2[self.vflush_keymap[l[i]]] += long(l[i+1]) except KeyError: pass def extract_lroc(self): 'collect "lroc_s" counter values' self.stdin.write('lroc_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, len(l), 2): try: if l[i] in self.lroc_keymap: self.set2[self.lroc_keymap[l[i]]] += long(l[i+1]) except KeyError: pass def extract(self): try: for name in self.vars: self.set2[name] = 0 self.extract_ioc() self.extract_vfs() self.extract_vio() self.extract_vflush() self.extract_lroc() for name in self.varsrate: self.val[name] = (self.set2[name] - self.set1[name]) * 1.0 / elapsed for name in self.varsconst: self.val[name] = self.set2[name] except IOError, e: for name in self.vars: self.val[name] = -1 ## print 'dstat_gpfs: lost pipe to mmpmon,', e except Exception, e: for name in self.vars: self.val[name] = -1 print 'dstat_gpfs: exception', e if self.debug >= 0: self.debug -= 1 if step == op.delay: self.set1.update(self.set2) From ewahl at osc.edu Thu Sep 4 15:13:48 2014 From: ewahl at osc.edu (Ed Wahl) Date: Thu, 4 Sep 2014 14:13:48 +0000 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> References: <54074F90.7000303@ebi.ac.uk>, <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> Message-ID: Another known issue with slow "ls" can be the annoyance that is 'sssd' under newer OSs (rhel 6) and properly configuring this for remote auth. I know on my nsd's I never did and the first ls in a directory where the cache is expired takes forever to make all the remote LDAP calls to get the UID info. bleh. Ed ________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of service at metamodul.com [service at metamodul.com] Sent: Thursday, September 04, 2014 6:05 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs performance monitoring > , any "ls" could take ages. Check if you large directories either with many files or simply large. Verify if you have NFS exported GPFS. Verify that your cache settings on the clients are large enough ( maxStatCache , maxFilesToCache , sharedMemLimit ) Verify that you have dedicated metadata luns ( metadataOnly ) Reference: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters Note: If possible monitor your metadata luns on the storage directly. hth Hajo -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sdinardo at ebi.ac.uk Thu Sep 4 15:18:02 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 15:18:02 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540873AA.5070401@ed.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> <5408722E.6060309@ebi.ac.uk> <540873AA.5070401@ed.ac.uk> Message-ID: <5408749A.9080306@ebi.ac.uk> On 04/09/14 15:14, Orlando Richards wrote: > > > On 04/09/14 15:07, Salvatore Di Nardo wrote: >> >> On 04/09/14 14:54, Orlando Richards wrote: >>> >>> >>> On 04/09/14 14:32, Salvatore Di Nardo wrote: >>>> Sorry to bother you again but dstat have some issues with the plugin: >>>> >>>> [root at gss01a util]# dstat --gpfs >>>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is >>>> deprecated. Use the subprocess module. >>>> pipes[cmd] = os.popen3(cmd, 't', 0) >>>> Module dstat_gpfs failed to load. (global name 'select' is not >>>> defined) >>>> None of the stats you selected are available. >>>> >>>> I found this solution , but involve dstat recompile.... >>>> >>>> https://github.com/dagwieers/dstat/issues/44 >>>> >>>> Are you aware about any easier solution (we use RHEL6.3) ? >>>> >>> >>> This worked for me the other day on a dev box I was poking at: >>> >>> # rm /usr/share/dstat/dstat_gpfsops* >>> >>> # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 >>> /usr/share/dstat/dstat_gpfsops.py >>> >>> # dstat --gpfsops >>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use >>> the subprocess module. >>> pipes[cmd] = os.popen3(cmd, 't', 0) >>> ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- >>> >>> >>> cr del op/cl rd wr trunc fsync looku gattr sattr other >>> mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w >>> 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 0 0 0 0 0 0 0 0 0 0 >>> >>> ... >>> >> >> NICE!! The only problem is that the box seems lacking those python >> scripts: >> >> ls /usr/lpp/mmfs/samples/util/ >> makefile README tsbackup tsbackup.C tsbackup.h tsfindinode >> tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c >> tslistall tsreaddir tsreaddir.c tstimes tstimes.c >> > > It came from the gpfs.base rpm: > > # rpm -qf /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 > gpfs.base-3.5.0-13.x86_64 > > >> Do you mind sending me those py files? They should be 3 as i see e gpfs >> options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) >> > > Only the gpfsops.py is included in the bundle - one for dstat 0.6 and > one for dstat 0.7. > > > I've attached it to this mail as well (it seems to be GPL'd). > Thanks. From J.R.Jones at soton.ac.uk Thu Sep 4 16:15:48 2014 From: J.R.Jones at soton.ac.uk (Jones J.R.) Date: Thu, 4 Sep 2014 15:15:48 +0000 Subject: [gpfsug-discuss] Building the portability layer for Xeon Phi Message-ID: <1409843748.7733.31.camel@uos-204812.clients.soton.ac.uk> Hi folks Has anyone managed to successfully build the portability layer for Xeon Phi? At the moment we are having to export the GPFS mounts from the host machine over NFS, which is proving rather unreliable. 
Jess From oehmes at us.ibm.com Fri Sep 5 01:48:40 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Thu, 4 Sep 2014 17:48:40 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54084258.90508@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> Message-ID: ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM: > From: Salvatore Di Nardo > To: gpfsug main discussion list > Date: 09/04/2014 03:44 AM > Subject: Re: [gpfsug-discuss] gpfs performance monitoring > Sent by: gpfsug-discuss-bounces at gpfsug.org > > On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just > show the value since start (or last reset) time. > OK, you confirmed my toughts, thatks > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. > what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. > No.. what I'm looking its exactly how the disks are busy to keep the > requests. Obviously i'm not looking just that, but I feel the needs > to monitor also those things. Ill explain you why. 
> > It happens when our storage is quite busy ( 180Gb/s of read/write ) > that the FS start to be slowin normal cd or ls requests. This might > be normal, but in those situation i want to know where the > bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing > where the bottlenek is might help me to understand if we can tweak > the system a bit more. if cd or ls is very slow in GPFS in the majority of the cases it has nothing to do with NSD Server bottlenecks, only indirect. the main reason ls is slow in the field is you have some very powerful nodes that all do buffered writes into the same directory into 1 or multiple files while you do the ls on a different node. what happens now is that the ls you did run most likely is a alias for ls -l or something even more complex with color display, etc, but the point is it most likely returns file size. GPFS doesn't lie about the filesize, we only return accurate stat informations and while this is arguable, its a fact today. so what happens is that the stat on each file triggers a token revoke on the node that currently writing to the file you do stat on, lets say it has 1 gb of dirty data in its memory for this file (as its writes data buffered) this 1 GB of data now gets written to the NSD server, the client updates the inode info and returns the correct size. lets say you have very fast network and you have a fast storage device like GSS (which i see you have) it will be able to do this in a few 100 ms, but the problem is this now happens serialized for each single file in this directory that people write into as for each we need to get the exact stat info to satisfy your ls -l request. this is what takes so long, not the fact that the storage device might be slow or to much metadata activity is going on , this is token , means network traffic and obviously latency dependent. the best way to see this is to look at waiters on the client where you run the ls and see what they are waiting for. there are various ways to tune this to get better 'felt' ls responses but its not completely going away if all you try to with ls is if there is a file in the directory run unalias ls and check if ls after that runs fast as it shouldn't do the -l under the cover anymore. > > If its the CPU on the servers then there is no much to do beside > replacing or add more servers.If its not the CPU, maybe more memory > would help? Maybe its just the network that filled up? so i can add > more links > > Or if we reached the point there the bottleneck its the spindles, > then there is no much point o look somethere else, we just reached > the hardware limit.. > > Sometimes, it also happens that there is very low IO (10Gb/s ), > almost no cpu usage on the servers but huge slownes ( ls can take 10 > seconds). Why that happens? There is not much data ops , but we > think there is a huge ammount of metadata ops. So what i want to > know is if the metadata vdisks are busy or not. If this is our > problem, could some SSD disks dedicated to metadata help? the answer if ssd's would help or not are hard to say without knowing the root case and as i tried to explain above the most likely case is token revoke, not disk i/o. obviously as more busy your disks are as longer the token revoke will take. > > > In particular im, a bit puzzled with the design of our GSS storage. > Each recovery groups have 3 declustered arrays, and each declustered > aray have 1 data and 1 metadata vdisk, but in the end both metadata > and data vdisks use the same spindles. 
The problem that, its that I > dont understand if we have a metadata bottleneck there. Maybe some > SSD disks in a dedicated declustered array would perform much > better, but this is just theory. I really would like to be able to > monitor IO activities on the metadata vdisks. the short answer is we WANT the metadata disks to be with the data disks on the same spindles. compared to other storage systems, GSS is capable to handle different raid codes for different virtual disks on the same physical disks, this way we create raid1'ish 'LUNS' for metadata and raid6'is 'LUNS' for data so the small i/o penalty for a metadata is very small compared to a read/modify/write on the data disks. > > > > > so the Layer you care about is the NSD Server layer, which sits on > top of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be > to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats > counts again. > > Counters its not a problem. I can collect them and create some > graphs in a monitoring tool. I will check that. if you (let) upgrade your system to GSS 2.0 you get a graphical monitoring as part of it. if you want i can send you some direct email outside the group with additional informations on that. > > the alternative option is to use mmdiag --iohist. this shows you a > history of the last X numbers of io operations on either the client > or the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms > qTime ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 > 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 > 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 > 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 > 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 
R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 > 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 > 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 > 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 > 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. > > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > > mmdiag --iohist its another think i looked at it, but i could not > find good explanation for all the "buf type" ( third column ) > allocSeg > data > iallocSeg > indBlock > inode > LLIndBlock > logData > logDesc > logWrap > metadata > vdiskAULog > vdiskBuf > vdiskFWLog > vdiskMDLog > vdiskMeta > vdiskRGDesc > If i want to monifor metadata operation whan should i look at? just inodes =inodes , *alloc* = file or data allocation blocks , *ind* = indirect blocks (for very large files) and metadata , everyhing else is data or internal i/o's > the metadata flag or also inode? this command takes also long to > run, especially if i run it a second time it hangs for a lot before > to rerun again, so i'm not sure that run it every 30secs or minute > its viable, but i will look also into that. THere is any > documentation that descibes clearly the whole output? what i found > its quite generic and don't go into details... 
the reason it takes so long is because it collects 10's of thousands of i/os in a table and to not slow down the system when we dump the data we copy it to a separate buffer so we don't need locks :-) you can adjust the number of entries you want to collect by adjusting the ioHistorySize config parameter > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and > the physical disk, so you can only reliably look at this from the > client, the iohist output only shows you the Server disk i/o > processing time, but that can be a fraction of the overall time (in > other cases this obviously can also be the dominant part depending > on your workload). > > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. > > We already do that, but as I said, I want to check specifically how > gss servers are keeping the requests to identify or exlude server > side bottlenecks. > > > Thanks for your help, you gave me definitely few things where to look at. > > Salvatore > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at us.ibm.com Fri Sep 5 01:53:17 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Thu, 4 Sep 2014 17:53:17 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <5408722E.6060309@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> <5408722E.6060309@ebi.ac.uk> Message-ID: if you don't have the files you need to update to a newer version of the GPFS client software on the node. they started shipping with 3.5.0.13 even you get the files you still wouldn't see many values as they never got exposed before. some more details are in a presentation i gave earlier this year which is archived in the list or here --> http://www.gpfsug.org/wp-content/uploads/2014/05/UG10_GPFS_Performance_Session_v10.pdf Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Salvatore Di Nardo To: gpfsug-discuss at gpfsug.org Date: 09/04/2014 07:08 AM Subject: Re: [gpfsug-discuss] gpfs performance monitoring Sent by: gpfsug-discuss-bounces at gpfsug.org On 04/09/14 14:54, Orlando Richards wrote: On 04/09/14 14:32, Salvatore Di Nardo wrote: Sorry to bother you again but dstat have some issues with the plugin: [root at gss01a util]# dstat --gpfs /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) Module dstat_gpfs failed to load. (global name 'select' is not defined) None of the stats you selected are available. 
I found this solution , but involve dstat recompile.... https://github.com/dagwieers/dstat/issues/44 Are you aware about any easier solution (we use RHEL6.3) ? This worked for me the other day on a dev box I was poking at: # rm /usr/share/dstat/dstat_gpfsops* # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 /usr/share/dstat/dstat_gpfsops.py # dstat --gpfsops /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- cr del op/cl rd wr trunc fsync looku gattr sattr other mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... NICE!! The only problem is that the box seems lacking those python scripts: ls /usr/lpp/mmfs/samples/util/ makefile README tsbackup tsbackup.C tsbackup.h tsfindinode tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c tslistall tsreaddir tsreaddir.c tstimes tstimes.c Do you mind sending me those py files? They should be 3 as i see e gpfs options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) Regards, Salvatore _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Fri Sep 5 10:29:27 2014 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Fri, 05 Sep 2014 10:29:27 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54084258.90508@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> Message-ID: <1409909367.30257.151.camel@buzzard.phy.strath.ac.uk> On Thu, 2014-09-04 at 11:43 +0100, Salvatore Di Nardo wrote: [SNIP] > > Sometimes, it also happens that there is very low IO (10Gb/s ), almost > no cpu usage on the servers but huge slownes ( ls can take 10 > seconds). Why that happens? There is not much data ops , but we think > there is a huge ammount of metadata ops. So what i want to know is if > the metadata vdisks are busy or not. If this is our problem, could > some SSD disks dedicated to metadata help? > This is almost always because you are using an external LDAP/NIS server for GECOS information and the values that you need are not cached for whatever reason and you are having to look them up again. Note that the standard aliasing for RHEL based distros of ls also causes it to do a stat on every file for the colouring etc. Also be aware that if you are trying to fill out your cd with TAB auto-completion you will run into similar issues. That is had you typed the path for the cd out in full you would get in instantly, doing a couple of letters and hitting cd it could take a while. You can test this on a RHEL based distro by doing "/bin/ls -n" The idea being to avoid any aliasing and not look up GECOS data and just report the raw numerical stuff. What I would suggest is that you set the cache time on UID/GID lookups for positive lookups to a long time, in general as long as possible because the values should almost never change. Even for a positive look up of a group membership I would have that cached for a couple of hours. For negative lookups something like five or 10 minutes is a good starting point. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. 
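For what it's worth, the stat-versus-name-lookup split is easy to measure directly. A rough sketch, assuming Python 2 on the client and a test directory owned by a mix of LDAP/NIS users (both assumptions, not something taken from this thread):

#!/usr/bin/env python
# Rough illustration: how much of an "ls -l" is the stat() calls and how
# much is the uid/gid -> name translation, which goes through NSS (and so
# LDAP/NIS when nscd has nothing cached). "/bin/ls -n" only needs the
# first part.
import grp
import os
import pwd
import sys
import time

path = sys.argv[1] if len(sys.argv) > 1 else '.'
entries = os.listdir(path)

t0 = time.time()
stats = [os.lstat(os.path.join(path, e)) for e in entries]
t1 = time.time()

# what "-l" adds: one lookup per distinct uid/gid, roughly what ls does,
# since it caches repeated ids within a single run
for uid in set(s.st_uid for s in stats):
    try:
        pwd.getpwuid(uid)
    except KeyError:
        pass
for gid in set(s.st_gid for s in stats):
    try:
        grp.getgrgid(gid)
    except KeyError:
        pass
t2 = time.time()

print 'stat() only      : %.3f s for %d entries' % (t1 - t0, len(entries))
print 'uid/gid -> names : %.3f s' % (t2 - t1)

If the second figure collapses on a repeat run (once nscd has the answers cached), longer positive TTLs for the nscd passwd/group caches should help; if both figures stay small and ls is still slow, the time is going elsewhere (token revokes, network), as described earlier in the thread.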
From sdinardo at ebi.ac.uk Fri Sep 5 11:56:37 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Fri, 05 Sep 2014 11:56:37 +0100 Subject: Re: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> Message-ID: <540996E5.5000502@ebi.ac.uk>
A little clarification: our ls is plain ls; there is no alias. All of that is already set up properly, as EBI has been running high-performance computing farms for many years, so those things were fixed a long time ago. We have very little experience with GPFS, but good knowledge of LSF farms, and we run several NFS storage systems (multiple petabytes in size). Regarding NIS, all clients run nscd, which caches that information precisely to avoid this kind of slowness; in fact, when ls is slow, ls -n is slow too. Besides that, even a plain "cd" sometimes hangs, so it has nothing to do with fetching attributes.

To clarify a bit more: GSS usually works fine. We have users whose farm jobs push 180Gb/s of reads (reading and writing files of 100GB in size), and GPFS works very well there, where other systems had performance problems accessing portions of such huge files. Sadly, other users run jobs that generate a huge amount of metadata operations: tons of ls calls in directories with many files, or creating a silly number of temporary files just to synchronise jobs between farm nodes, or to store temporary data for a few milliseconds before deleting it again. Imagine constantly creating thousands of files just to write a few bytes, then deleting them a few milliseconds later... When that happens we see only 10-15Gb/sec of throughput and low CPU usage on the servers (80% idle), yet any cd, ls or whatever takes several seconds.

So my question is whether the bottleneck could be the spindles, or whether the clients could be tuned a bit more. I read your PDF and all the parameters seem to be well configured already except "maxFilesToCache", but I'm not sure how we should set a few of those parameters on the clients. For example, I cannot imagine a client needing a 38g pagepool. So what is the correct pagepool size on a client? And what about these others: maxFilesToCache, maxBufferDescs, worker1Threads, worker3Threads? Right now all the clients have a 1 GB pagepool. In theory we can afford more (I think we can easily go up to 8GB) as they have plenty of available memory, and if that would help we can do it, but do the clients really need more than 1G? They are just clients after all, so their memory should in theory be used for jobs, not just for caching. A last question about "maxFilesToCache": you say it must be large on small clusters but small on large clusters. What do you consider 6 servers and almost 700 clients?
on clienst we have: maxFilesToCache 4000 on servers we have maxFilesToCache 12288 Regards, Salvatore On 05/09/14 01:48, Sven Oehme wrote: > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM: > > > From: Salvatore Di Nardo > > To: gpfsug main discussion list > > Date: 09/04/2014 03:44 AM > > Subject: Re: [gpfsug-discuss] gpfs performance monitoring > > Sent by: gpfsug-discuss-bounces at gpfsug.org > > > > On 04/09/14 01:50, Sven Oehme wrote: > > > Hello everybody, > > > > Hi > > > > > here i come here again, this time to ask some hint about how to > > monitor GPFS. > > > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > > that they return number based only on the request done in the > > > current host, so i have to run them on all the clients ( over 600 > > > nodes) so its quite unpractical. Instead i would like to know from > > > the servers whats going on, and i came across the vio_s statistics > > > wich are less documented and i dont know exacly what they mean. > > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > > runs VIO_S. > > > > > > My problems with the output of this command: > > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > > timestamp: 1409763206/477366 > > > recovery group: * > > > declustered array: * > > > vdisk: * > > > client reads: 2584229 > > > client short writes: 55299693 > > > client medium writes: 190071 > > > client promoted full track writes: 465145 > > > client full track writes: 9249 > > > flushed update writes: 4187708 > > > flushed promoted full track writes: 123 > > > migrate operations: 114 > > > scrub operations: 450590 > > > log writes: 28509602 > > > > > > it sais "VIOPS per second", but they seem to me just counters as > > > every time i re-run the command, the numbers increase by a bit.. > > > Can anyone confirm if those numbers are counter or if they are > OPS/sec. > > > > the numbers are accumulative so everytime you run them they just > > show the value since start (or last reset) time. > > OK, you confirmed my toughts, thatks > > > > > > > > > On a closer eye about i dont understand what most of thosevalues > > > mean. For example, what exacly are "flushed promoted full track > write" ?? > > > I tried to find a documentation about this output , but could not > > > find any. can anyone point me a link where output of vio_s is > explained? > > > > > > Another thing i dont understand about those numbers is if they are > > > just operations, or the number of blocks that was read/write/etc . > > > > its just operations and if i would explain what the numbers mean i > > might confuse you even more because this is not what you are really > > looking for. > > what you are looking for is what the client io's look like on the > > Server side, while the VIO layer is the Server side to the disks, so > > one lever lower than what you are looking for from what i could read > > out of the description above. > > No.. what I'm looking its exactly how the disks are busy to keep the > > requests. Obviously i'm not looking just that, but I feel the needs > > to monitor also those things. Ill explain you why. 
> > > > It happens when our storage is quite busy ( 180Gb/s of read/write ) > > that the FS start to be slowin normal cd or ls requests. This might > > be normal, but in those situation i want to know where the > > bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing > > where the bottlenek is might help me to understand if we can tweak > > the system a bit more. > > if cd or ls is very slow in GPFS in the majority of the cases it has > nothing to do with NSD Server bottlenecks, only indirect. > the main reason ls is slow in the field is you have some very powerful > nodes that all do buffered writes into the same directory into 1 or > multiple files while you do the ls on a different node. what happens > now is that the ls you did run most likely is a alias for ls -l or > something even more complex with color display, etc, but the point is > it most likely returns file size. GPFS doesn't lie about the filesize, > we only return accurate stat informations and while this is arguable, > its a fact today. > so what happens is that the stat on each file triggers a token revoke > on the node that currently writing to the file you do stat on, lets > say it has 1 gb of dirty data in its memory for this file (as its > writes data buffered) this 1 GB of data now gets written to the NSD > server, the client updates the inode info and returns the correct size. > lets say you have very fast network and you have a fast storage device > like GSS (which i see you have) it will be able to do this in a few > 100 ms, but the problem is this now happens serialized for each single > file in this directory that people write into as for each we need to > get the exact stat info to satisfy your ls -l request. > this is what takes so long, not the fact that the storage device might > be slow or to much metadata activity is going on , this is token , > means network traffic and obviously latency dependent. > > the best way to see this is to look at waiters on the client where you > run the ls and see what they are waiting for. > > there are various ways to tune this to get better 'felt' ls responses > but its not completely going away > if all you try to with ls is if there is a file in the directory run > unalias ls and check if ls after that runs fast as it shouldn't do the > -l under the cover anymore. > > > > > If its the CPU on the servers then there is no much to do beside > > replacing or add more servers.If its not the CPU, maybe more memory > > would help? Maybe its just the network that filled up? so i can add > > more links > > > > Or if we reached the point there the bottleneck its the spindles, > > then there is no much point o look somethere else, we just reached > > the hardware limit.. > > > > Sometimes, it also happens that there is very low IO (10Gb/s ), > > almost no cpu usage on the servers but huge slownes ( ls can take 10 > > seconds). Why that happens? There is not much data ops , but we > > think there is a huge ammount of metadata ops. So what i want to > > know is if the metadata vdisks are busy or not. If this is our > > problem, could some SSD disks dedicated to metadata help? > > the answer if ssd's would help or not are hard to say without knowing > the root case and as i tried to explain above the most likely case is > token revoke, not disk i/o. obviously as more busy your disks are as > longer the token revoke will take. > > > > > > > In particular im, a bit puzzled with the design of our GSS storage. 
> > Each recovery groups have 3 declustered arrays, and each declustered > > aray have 1 data and 1 metadata vdisk, but in the end both metadata > > and data vdisks use the same spindles. The problem that, its that I > > dont understand if we have a metadata bottleneck there. Maybe some > > SSD disks in a dedicated declustered array would perform much > > better, but this is just theory. I really would like to be able to > > monitor IO activities on the metadata vdisks. > > the short answer is we WANT the metadata disks to be with the data > disks on the same spindles. compared to other storage systems, GSS is > capable to handle different raid codes for different virtual disks on > the same physical disks, this way we create raid1'ish 'LUNS' for > metadata and raid6'is 'LUNS' for data so the small i/o penalty for a > metadata is very small compared to a read/modify/write on the data disks. > > > > > > > > > > > > so the Layer you care about is the NSD Server layer, which sits on > > top of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > > > I'm asking that because if they are just ops, i don't know how much > > > they could be usefull. For example one write operation could eman > > > write 1 block or write a file of 100GB. If those are oprations, > > > there is a way to have the oupunt in bytes or blocks? > > > > there are multiple ways to get infos on the NSD layer, one would be > > to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats > > counts again. > > > > Counters its not a problem. I can collect them and create some > > graphs in a monitoring tool. I will check that. > > if you (let) upgrade your system to GSS 2.0 you get a graphical > monitoring as part of it. if you want i can send you some direct email > outside the group with additional informations on that. > > > > > the alternative option is to use mmdiag --iohist. 
this shows you a > > history of the last X numbers of io operations on either the client > > or the server side like on a client : > > > > # mmdiag --iohist > > > > === mmdiag: iohist === > > > > I/O history: > > > > I/O start time RW Buf type disk:sectorNum nSec time ms > > qTime ms RpcTimes ms Type Device/NSD ID NSD server > > --------------- -- ----------- ----------------- ----- ------- > > -------- ----------------- ---- ------------------ --------------- > > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:22.182723 R inode 1:1071252480 8 6.970 > > 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.668262 R inode 2:1081373696 8 14.117 > > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.692019 R inode 2:1064356608 8 14.899 > > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:53.707100 R inode 2:1077830152 8 16.499 > > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.728082 R inode 2:1081918976 8 7.760 > > 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.877416 R metadata 2:678978560 16 13.343 > > 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.906556 R inode 2:1083476520 8 11.723 > > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.926592 R inode 1:1076503480 8 8.087 > > 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.941441 R inode 2:1069885984 8 11.686 > > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.953294 R inode 2:1083476936 8 8.951 > > 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.965475 R inode 1:1076503504 8 0.477 > > 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.965755 R inode 2:1083476488 8 0.410 > > 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.965787 R inode 2:1083476512 8 0.439 > > 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > > > you basically see if its a inode , data block , what size it has (in > > sectors) , which nsd server you did send this request to, etc. 
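(A side note on digesting a dump like the one above: the one-liner below is only a rough sketch for summarising it by buffer type. The field positions are assumed from the sample shown here, and header layouts can differ between GPFS releases, so treat it as illustrative rather than definitive. It works on either the client-side or the server-side form of the output, since both put the buffer type in the third column.)

# count I/Os and average the "time ms" column per buffer type; data records
# are recognised by a numeric nSec field ($5), which skips the header lines
mmdiag --iohist | awk '$5 ~ /^[0-9]+$/ {n[$3]++; ms[$3]+=$6}
    END {for (t in n) printf "%-12s %8d ops %10.3f ms avg\n", t, n[t], ms[t]/n[t]}'

# e.g. add  && $3 ~ /inode|alloc|Ind|metadata/  to the pattern to look at
# metadata-related records only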
> > > > on the Server side you see the type , which physical disk it goes to > > and also what size of disk i/o it causes like : > > > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > > 0.000 0.000 0.000 pd sdis > > 14:26:50.137102 R inode 19:3003969520 64 9.004 > > 0.000 0.000 0.000 pd sdad > > 14:26:50.136116 R inode 55:3591710992 64 11.057 > > 0.000 0.000 0.000 pd sdoh > > 14:26:50.141510 R inode 21:3066810504 64 5.909 > > 0.000 0.000 0.000 pd sdaf > > 14:26:50.130529 R inode 89:2962370072 64 17.437 > > 0.000 0.000 0.000 pd sddi > > 14:26:50.131063 R inode 78:1889457000 64 17.062 > > 0.000 0.000 0.000 pd sdsj > > 14:26:50.143403 R inode 36:3323035688 64 4.807 > > 0.000 0.000 0.000 pd sdmw > > 14:26:50.131044 R inode 37:2513579736 128 17.181 > > 0.000 0.000 0.000 pd sddv > > 14:26:50.138181 R inode 72:3868810400 64 10.951 > > 0.000 0.000 0.000 pd sdbz > > 14:26:50.138188 R inode 131:2443484784 128 11.792 > > 0.000 0.000 0.000 pd sdug > > 14:26:50.138003 R inode 102:3696843872 64 11.994 > > 0.000 0.000 0.000 pd sdgp > > 14:26:50.137099 R inode 145:3370922504 64 13.225 > > 0.000 0.000 0.000 pd sdmi > > 14:26:50.141576 R inode 62:2668579904 64 9.313 > > 0.000 0.000 0.000 pd sdou > > 14:26:50.134689 R inode 159:2786164648 64 16.577 > > 0.000 0.000 0.000 pd sdpq > > 14:26:50.145034 R inode 34:2097217320 64 7.409 > > 0.000 0.000 0.000 pd sdmt > > 14:26:50.138140 R inode 139:2831038792 64 14.898 > > 0.000 0.000 0.000 pd sdlw > > 14:26:50.130954 R inode 164:282120312 64 22.274 > > 0.000 0.000 0.000 pd sdzd > > 14:26:50.137038 R inode 41:3421909608 64 16.314 > > 0.000 0.000 0.000 pd sdef > > 14:26:50.137606 R inode 104:1870962416 64 16.644 > > 0.000 0.000 0.000 pd sdgx > > 14:26:50.141306 R inode 65:2276184264 64 16.593 > > 0.000 0.000 0.000 pd sdrk > > > > > > > mmdiag --iohist its another think i looked at it, but i could not > > find good explanation for all the "buf type" ( third column ) > > > allocSeg > > data > > iallocSeg > > indBlock > > inode > > LLIndBlock > > logData > > logDesc > > logWrap > > metadata > > vdiskAULog > > vdiskBuf > > vdiskFWLog > > vdiskMDLog > > vdiskMeta > > vdiskRGDesc > > If i want to monifor metadata operation whan should i look at? just > > inodes =inodes , *alloc* = file or data allocation blocks , *ind* = > indirect blocks (for very large files) and metadata , everyhing else > is data or internal i/o's > > > the metadata flag or also inode? this command takes also long to > > run, especially if i run it a second time it hangs for a lot before > > to rerun again, so i'm not sure that run it every 30secs or minute > > its viable, but i will look also into that. THere is any > > documentation that descibes clearly the whole output? what i found > > its quite generic and don't go into details... > > the reason it takes so long is because it collects 10's of thousands > of i/os in a table and to not slow down the system when we dump the > data we copy it to a separate buffer so we don't need locks :-) > you can adjust the number of entries you want to collect by adjusting > the ioHistorySize config parameter > > > > > > > > Last but not least.. and this is what i really would like to > > > accomplish, i would to be able to monitor the latency of metadata > > operations. 
> > > > you can't do this on the server side as you don't know how much time > > you spend on the client , network or anything between the app and > > the physical disk, so you can only reliably look at this from the > > client, the iohist output only shows you the Server disk i/o > > processing time, but that can be a fraction of the overall time (in > > other cases this obviously can also be the dominant part depending > > on your workload). > > > > the easiest way on the client is to run > > > > mmfsadm vfsstats enable > > from now on vfs stats are collected until you restart GPFS. > > > > then run : > > > > vfs statistics currently enabled > > started at: Fri Aug 29 13:15:05.380 2014 > > duration: 448446.970 sec > > > > name calls time per call total time > > -------------------- -------- -------------- -------------- > > statfs 9 0.000002 0.000021 > > startIO 246191176 0.005853 1441049.976740 > > > > to dump what ever you collected so far on this node. > > > > > We already do that, but as I said, I want to check specifically how > > gss servers are keeping the requests to identify or exlude server > > side bottlenecks. > > > > > > Thanks for your help, you gave me definitely few things where to > look at. > > > > Salvatore > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at gpfsug.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Fri Sep 5 22:17:47 2014 From: chekh at stanford.edu (Alex Chekholko) Date: Fri, 05 Sep 2014 14:17:47 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540996E5.5000502@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> <540996E5.5000502@ebi.ac.uk> Message-ID: <540A287B.1050202@stanford.edu> On 9/5/14, 3:56 AM, Salvatore Di Nardo wrote: > Little clarification: > Our ls its plain ls, there is no alias. ... > Last question about "maxFIlesToCache" you say that must be large on > small cluster but small on large clusters. What do you consider 6 > servers and almost 700 clients? > > on clienst we have: > maxFilesToCache 4000 > > on servers we have > maxFilesToCache 12288 > > One thing to do is to try your 'ls', see it is slow, then immediately run it again. If it is fast the second and consecutive times, it's because now the stat info is coming out of local cache. e.g. /usr/bin/time ls /path/to/some/dir && /usr/bin/time ls /path/to/some/dir The second time is likely to be almost immediate. So long as your local cache is big enough. I see on one of our older clusters we have: tokenMemLimit 2G maxFilesToCache 40000 maxStatCache 80000 You can also interrogate the local cache to see how full it is. Of course, if many nodes are writing to same dirs, then the cache will need to be invalidated often which causes some overhead. Big local cache is good if clients are usually working in different directories. 
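(To make that concrete, here is a small sketch. mmdiag --config reports the values in effect on the node it is run on; the node-class name and the numbers in the mmchconfig examples are invented placeholders rather than recommendations, and several of these settings only take effect after the GPFS daemon is restarted on the affected nodes.)

# what is currently in effect on this node
mmdiag --config | grep -E 'maxFilesToCache|maxStatCache|pagepool|tokenMemLimit'

# raising the limits for clients only, via a node class (names and values are examples)
mmcrnodeclass clientNodes -N client001,client002
mmchconfig maxFilesToCache=16384,maxStatCache=32768 -N clientNodes
mmchconfig pagepool=2G -N clientNodes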
Regards, -- chekh at stanford.edu From oehmes at us.ibm.com Sat Sep 6 01:12:42 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Fri, 5 Sep 2014 17:12:42 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540996E5.5000502@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> <540996E5.5000502@ebi.ac.uk> Message-ID: on your GSS nodes you have tuning files we suggest customers to use for mixed workloads clients. the files in /usr/lpp/mmfs/samples/gss/ if you create a nodeclass for all your clients you can run /usr/lpp/mmfs/samples/gss/gssClientConfig.sh NODECLASS and it applies all the settings to them so they will be active on next restart of the gpfs daemon. this should be a very good starting point for your config. please try that and let me know if it doesn't. there are also several enhancements in GPFS 4.1 which reduce contention in multiple areas, which would help as well, if you have the choice to update the nodes. btw. the GSS 2.0 package will update your GSS nodes to 4.1 also Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Salvatore Di Nardo To: gpfsug main discussion list Date: 09/05/2014 03:57 AM Subject: Re: [gpfsug-discuss] gpfs performance monitoring Sent by: gpfsug-discuss-bounces at gpfsug.org Little clarification: Our ls its plain ls, there is no alias. Consider that all those things are already set up properly as EBI run hi computing farms from many years, so those things are already fixed loong time ago. We have very little experience with GPFS, but good knowledge with LSF farms and own multiple NFS stotages ( several petabyte sized). about NIS, all clients run NSCD that cashes all informations to avoid such tipe of slownes, in fact then ls isslow, also ls -n is slow. Beside that, also a "cd" sometimes hangs, so it have nothing to do with getting attributes. Just to clarify a bit more. Now GSS usually seems working fine, we have users that run jobs on the farms that pushes 180Gb/s read ( reading and writing files of 100GB size). GPFS works very well there, where other systems had performance problems accessing portion of data in so huge files. Sadly, on the other hand, other users run jobs that do suge ammount of metadata operations, like toons of ls in directory with many files, or creating a silly amount of temporary files just to synchronize the jobs between the farm nodes, or just to store temporary data for few milliseconds and them immediately delete those temporary files. Imagine to create constantly thousands files just to write few bytes and they delete them after few milliseconds... When those thing happens we see 10-15Gb/sec throughput, low CPU usage on the server ( 80% iddle), but any cd, or ls or wathever takes few seconds. So my question is, if the bottleneck could be the spindles, or if the clients could be tuned a bit more? I read your PDF and all the paramenters seems already well configured except "maxFilesToCache", but I'm not sure how we should configure few of those parameters on the clients. As an example I cannot immagine a client that require 38g pagepool size. so what's the correct pagepool on a client? what about those others? maxFilestoCache maxBufferdescs worker1threads worker3threads Right now all the clients have 1 GB pagepool size. 
In theory, we can afford to use more ( i thing we can easily go up to 8GB) as they have plenty or available memory. If this could help, we can do that, but the client really really need more than 1G? They are just clients after all, so the memory in theory should be used for jobs not just for "caching". Last question about "maxFIlesToCache" you say that must be large on small cluster but small on large clusters. What do you consider 6 servers and almost 700 clients? on clienst we have: maxFilesToCache 4000 on servers we have maxFilesToCache 12288 Regards, Salvatore On 05/09/14 01:48, Sven Oehme wrote: ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM: > From: Salvatore Di Nardo > To: gpfsug main discussion list > Date: 09/04/2014 03:44 AM > Subject: Re: [gpfsug-discuss] gpfs performance monitoring > Sent by: gpfsug-discuss-bounces at gpfsug.org > > On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just > show the value since start (or last reset) time. > OK, you confirmed my toughts, thatks > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. 
> what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. > No.. what I'm looking its exactly how the disks are busy to keep the > requests. Obviously i'm not looking just that, but I feel the needs > to monitor also those things. Ill explain you why. > > It happens when our storage is quite busy ( 180Gb/s of read/write ) > that the FS start to be slowin normal cd or ls requests. This might > be normal, but in those situation i want to know where the > bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing > where the bottlenek is might help me to understand if we can tweak > the system a bit more. if cd or ls is very slow in GPFS in the majority of the cases it has nothing to do with NSD Server bottlenecks, only indirect. the main reason ls is slow in the field is you have some very powerful nodes that all do buffered writes into the same directory into 1 or multiple files while you do the ls on a different node. what happens now is that the ls you did run most likely is a alias for ls -l or something even more complex with color display, etc, but the point is it most likely returns file size. GPFS doesn't lie about the filesize, we only return accurate stat informations and while this is arguable, its a fact today. so what happens is that the stat on each file triggers a token revoke on the node that currently writing to the file you do stat on, lets say it has 1 gb of dirty data in its memory for this file (as its writes data buffered) this 1 GB of data now gets written to the NSD server, the client updates the inode info and returns the correct size. lets say you have very fast network and you have a fast storage device like GSS (which i see you have) it will be able to do this in a few 100 ms, but the problem is this now happens serialized for each single file in this directory that people write into as for each we need to get the exact stat info to satisfy your ls -l request. this is what takes so long, not the fact that the storage device might be slow or to much metadata activity is going on , this is token , means network traffic and obviously latency dependent. the best way to see this is to look at waiters on the client where you run the ls and see what they are waiting for. there are various ways to tune this to get better 'felt' ls responses but its not completely going away if all you try to with ls is if there is a file in the directory run unalias ls and check if ls after that runs fast as it shouldn't do the -l under the cover anymore. > > If its the CPU on the servers then there is no much to do beside > replacing or add more servers.If its not the CPU, maybe more memory > would help? Maybe its just the network that filled up? so i can add > more links > > Or if we reached the point there the bottleneck its the spindles, > then there is no much point o look somethere else, we just reached > the hardware limit.. > > Sometimes, it also happens that there is very low IO (10Gb/s ), > almost no cpu usage on the servers but huge slownes ( ls can take 10 > seconds). Why that happens? There is not much data ops , but we > think there is a huge ammount of metadata ops. So what i want to > know is if the metadata vdisks are busy or not. If this is our > problem, could some SSD disks dedicated to metadata help? 
the answer if ssd's would help or not are hard to say without knowing the root case and as i tried to explain above the most likely case is token revoke, not disk i/o. obviously as more busy your disks are as longer the token revoke will take. > > > In particular im, a bit puzzled with the design of our GSS storage. > Each recovery groups have 3 declustered arrays, and each declustered > aray have 1 data and 1 metadata vdisk, but in the end both metadata > and data vdisks use the same spindles. The problem that, its that I > dont understand if we have a metadata bottleneck there. Maybe some > SSD disks in a dedicated declustered array would perform much > better, but this is just theory. I really would like to be able to > monitor IO activities on the metadata vdisks. the short answer is we WANT the metadata disks to be with the data disks on the same spindles. compared to other storage systems, GSS is capable to handle different raid codes for different virtual disks on the same physical disks, this way we create raid1'ish 'LUNS' for metadata and raid6'is 'LUNS' for data so the small i/o penalty for a metadata is very small compared to a read/modify/write on the data disks. > > > > > so the Layer you care about is the NSD Server layer, which sits on > top of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be > to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats > counts again. > > Counters its not a problem. I can collect them and create some > graphs in a monitoring tool. I will check that. if you (let) upgrade your system to GSS 2.0 you get a graphical monitoring as part of it. if you want i can send you some direct email outside the group with additional informations on that. > > the alternative option is to use mmdiag --iohist. 
this shows you a > history of the last X numbers of io operations on either the client > or the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms > qTime ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 > 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 > 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 > 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 > 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 > 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 > 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 > 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 > 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. 
> > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > > mmdiag --iohist its another think i looked at it, but i could not > find good explanation for all the "buf type" ( third column ) > allocSeg > data > iallocSeg > indBlock > inode > LLIndBlock > logData > logDesc > logWrap > metadata > vdiskAULog > vdiskBuf > vdiskFWLog > vdiskMDLog > vdiskMeta > vdiskRGDesc > If i want to monifor metadata operation whan should i look at? just inodes =inodes , *alloc* = file or data allocation blocks , *ind* = indirect blocks (for very large files) and metadata , everyhing else is data or internal i/o's > the metadata flag or also inode? this command takes also long to > run, especially if i run it a second time it hangs for a lot before > to rerun again, so i'm not sure that run it every 30secs or minute > its viable, but i will look also into that. THere is any > documentation that descibes clearly the whole output? what i found > its quite generic and don't go into details... the reason it takes so long is because it collects 10's of thousands of i/os in a table and to not slow down the system when we dump the data we copy it to a separate buffer so we don't need locks :-) you can adjust the number of entries you want to collect by adjusting the ioHistorySize config parameter > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and > the physical disk, so you can only reliably look at this from the > client, the iohist output only shows you the Server disk i/o > processing time, but that can be a fraction of the overall time (in > other cases this obviously can also be the dominant part depending > on your workload). 
> > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. > > We already do that, but as I said, I want to check specifically how > gss servers are keeping the requests to identify or exlude server > side bottlenecks. > > > Thanks for your help, you gave me definitely few things where to look at. > > Salvatore > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From luke.raimbach at oerc.ox.ac.uk Tue Sep 9 11:23:47 2014 From: luke.raimbach at oerc.ox.ac.uk (Luke Raimbach) Date: Tue, 9 Sep 2014 10:23:47 +0000 Subject: [gpfsug-discuss] mmdiag output questions Message-ID: Hi All, When tracing a problem recently (which turned out to be a NIC failure), mmdiag proved useful in tracing broken cluster connections. I have some questions about the output of mmdiag using the --network switch: Occasionally I see nodes in the same cluster grouped, but in no readily identifiable way - for example, the following output has three headings "Device bon0:" with some nodes listed, but the nodes don't seem to share anything in common like status, err, ostype, etc. Also, is anyone able to explain what might be seen under the err column? Do these correspond to GPFS error codes as one might see in mmfs.log.latest? What is the sock column displaying - the number of open sockets or the socket state? Lastly, the sent/recvd columns seem very low. Is there a rolling time window within which these statistics are kept in some internal mmfsd buffer? Cheers. 
=== mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.51/22 (eth0) my addr list 10.200.21.1/16 (bond0)/cpdn.oerc.local 10.100.10.51/22 (eth0) my node number 9 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs01 10.200.1.1 connected 0 32 110 110 Linux/L gpfs02 10.200.2.1 connected 0 36 104 104 Linux/L linux 10.200.101.1 connected 0 37 0 0 Linux/L jupiter 10.200.102.1 connected 0 35 0 0 Windows/L cnfs0 10.200.10.10 connected 0 39 0 0 Linux/L cnfs1 10.200.10.11 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cnfs2 10.200.10.12 connected 0 33 5 5 Linux/L cnfs3 10.200.10.13 init 0 -1 0 0 Linux/L cpdn-ppc02 10.200.61.1 init 0 -1 0 0 Linux/L cpdn-ppc03 10.200.62.1 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cpdn-ppc01 10.200.60.1 connected 0 38 0 0 Linux/L diag verbs: VERBS RDMA class not initialized Conversely, the output of mmdiag --network on the file system manager node for the same cluster looks like this: === mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.21/22 (eth0) my addr list 10.200.1.1/16 (bond0)/cpdn.oerc.local 10.100.10.21/22 (eth0) my node number 1 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs02 10.200.2.1 connected 0 73 219 219 Linux/L linux 10.200.101.1 connected 0 49 180 181 Linux/L jupiter 10.200.102.1 connected 0 33 3 3 Windows/L cnfs0 10.200.10.10 connected 0 61 3 3 Linux/L cnfs1 10.200.10.11 connected 0 81 0 0 Linux/L cnfs2 10.200.10.12 connected 0 64 23 23 Linux/L cnfs3 10.200.10.13 connected 0 60 2 2 Linux/L tsm01 10.200.21.1 connected 0 50 110 110 Linux/L cpdn-ppc02 10.200.61.1 connected 0 63 0 0 Linux/L cpdn-ppc03 10.200.62.1 connected 0 65 0 0 Linux/L cpdn-ppc01 10.200.60.1 connected 0 62 94 94 Linux/L diag verbs: VERBS RDMA class not initialized All neatly connected! -- Luke Raimbach IT Manager Oxford e-Research Centre 7 Keble Road, Oxford, OX1 3QG +44(0)1865 610639 From chair at gpfsug.org Wed Sep 10 15:33:24 2014 From: chair at gpfsug.org (Jez Tucker (Chair)) Date: Wed, 10 Sep 2014 15:33:24 +0100 Subject: [gpfsug-discuss] GPFS Request for Enhancements Message-ID: <54106134.7010902@gpfsug.org> Hi all Just a quick reminder that the RFEs that you all gave feedback at the last UG on are live on IBM's RFE site: goo.gl/1K6LBa Please take the time to have a look and add your votes to the GPFS RFEs. Jez -------------- next part -------------- An HTML attachment was scrubbed... URL: From dmetcalfe at ocf.co.uk Thu Sep 11 21:18:58 2014 From: dmetcalfe at ocf.co.uk (Daniel Metcalfe) Date: Thu, 11 Sep 2014 21:18:58 +0100 Subject: [gpfsug-discuss] mmdiag output questions In-Reply-To: References: Message-ID: Hi Luke, I've seen the same apparent grouping of nodes, I don't believe the nodes are actually being grouped but instead the "Device Bond0:" and column headers are being re-printed to screen whenever there is a node that has the "init" status followed by a node that is "connected". It is something I've noticed on many different versions of GPFS so I imagine it's a "feature". I've not noticed anything but '0' in the err column so I'm not sure if these correspond to error codes in the GPFS logs. 
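(If it helps when eyeballing the output, the repeated headers described above can be collapsed so the nodes read as one table. This is purely cosmetic, and the patterns assume the header wording shown in the outputs above, so treat it as a sketch.)

# print each distinct "Device ..." / column-header line only once
mmdiag --network | awk '/^ *Device |^ *hostname +node/ {if (seen[$0]++) next} {print}'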
If you run the command "mmfsadm dump tscomm", you'll see a bit more detail than the mmdiag -network shows. This suggests the sock column is number of sockets. I've seen the low numbers to for sent / recv using mmdiag --network, again the mmfsadm command above gives a better representation I've found. All that being said, if you want to get in touch with us then we'll happily open a PMR for you and find out the answer to any of your questions. Kind regards, Danny Metcalfe Systems Engineer OCF plc Tel: 0114 257 2200 [cid:image001.jpg at 01CFCE04.575B8380] Twitter Fax: 0114 257 0022 [cid:image002.jpg at 01CFCE04.575B8380] Blog Mob: 07960 503404 [cid:image003.jpg at 01CFCE04.575B8380] Web Please note, any emails relating to an OCF Support request must always be sent to support at ocf.co.uk for a ticket number to be generated or existing support ticket to be updated. Should this not be done then OCF cannot be held responsible for requests not dealt with in a timely manner. OCF plc is a company registered in England and Wales. Registered number 4132533. Registered office address: OCF plc, 5 Rotunda Business Centre, Thorncliffe Park, Chapeltown, Sheffield, S35 2PG This message is private and confidential. If you have received this message in error, please notify us immediately and remove it from your system. -----Original Message----- From: gpfsug-discuss-bounces at gpfsug.org [mailto:gpfsug-discuss-bounces at gpfsug.org] On Behalf Of Luke Raimbach Sent: 09 September 2014 11:24 To: gpfsug-discuss at gpfsug.org Subject: [gpfsug-discuss] mmdiag output questions Hi All, When tracing a problem recently (which turned out to be a NIC failure), mmdiag proved useful in tracing broken cluster connections. I have some questions about the output of mmdiag using the --network switch: Occasionally I see nodes in the same cluster grouped, but in no readily identifiable way - for example, the following output has three headings "Device bon0:" with some nodes listed, but the nodes don't seem to share anything in common like status, err, ostype, etc. Also, is anyone able to explain what might be seen under the err column? Do these correspond to GPFS error codes as one might see in mmfs.log.latest? What is the sock column displaying - the number of open sockets or the socket state? Lastly, the sent/recvd columns seem very low. Is there a rolling time window within which these statistics are kept in some internal mmfsd buffer? Cheers. 
=== mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.51/22 (eth0) my addr list 10.200.21.1/16 (bond0)/cpdn.oerc.local 10.100.10.51/22 (eth0) my node number 9 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs01 10.200.1.1 connected 0 32 110 110 Linux/L gpfs02 10.200.2.1 connected 0 36 104 104 Linux/L linux 10.200.101.1 connected 0 37 0 0 Linux/L jupiter 10.200.102.1 connected 0 35 0 0 Windows/L cnfs0 10.200.10.10 connected 0 39 0 0 Linux/L cnfs1 10.200.10.11 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cnfs2 10.200.10.12 connected 0 33 5 5 Linux/L cnfs3 10.200.10.13 init 0 -1 0 0 Linux/L cpdn-ppc02 10.200.61.1 init 0 -1 0 0 Linux/L cpdn-ppc03 10.200.62.1 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cpdn-ppc01 10.200.60.1 connected 0 38 0 0 Linux/L diag verbs: VERBS RDMA class not initialized Conversely, the output of mmdiag --network on the file system manager node for the same cluster looks like this: === mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.21/22 (eth0) my addr list 10.200.1.1/16 (bond0)/cpdn.oerc.local 10.100.10.21/22 (eth0) my node number 1 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs02 10.200.2.1 connected 0 73 219 219 Linux/L linux 10.200.101.1 connected 0 49 180 181 Linux/L jupiter 10.200.102.1 connected 0 33 3 3 Windows/L cnfs0 10.200.10.10 connected 0 61 3 3 Linux/L cnfs1 10.200.10.11 connected 0 81 0 0 Linux/L cnfs2 10.200.10.12 connected 0 64 23 23 Linux/L cnfs3 10.200.10.13 connected 0 60 2 2 Linux/L tsm01 10.200.21.1 connected 0 50 110 110 Linux/L cpdn-ppc02 10.200.61.1 connected 0 63 0 0 Linux/L cpdn-ppc03 10.200.62.1 connected 0 65 0 0 Linux/L cpdn-ppc01 10.200.60.1 connected 0 62 94 94 Linux/L diag verbs: VERBS RDMA class not initialized All neatly connected! -- Luke Raimbach IT Manager Oxford e-Research Centre 7 Keble Road, Oxford, OX1 3QG +44(0)1865 610639 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ----- No virus found in this message. Checked by AVG - www.avg.com Version: 2014.0.4765 / Virus Database: 4015/8158 - Release Date: 09/05/14 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 4696 bytes Desc: image001.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 4725 bytes Desc: image002.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.jpg Type: image/jpeg Size: 4820 bytes Desc: image003.jpg URL: From stuartb at 4gh.net Tue Sep 23 16:47:09 2014 From: stuartb at 4gh.net (Stuart Barkley) Date: Tue, 23 Sep 2014 11:47:09 -0400 (EDT) Subject: [gpfsug-discuss] filesets and mountpoint naming Message-ID: When we first started using GPFS we created several filesystems and just directly mounted them where seemed appropriate. 
We have something like: /home /scratch /projects /reference /applications We are finding the overhead of separate filesystems to be troublesome and are looking at using filesets inside fewer filesystems to accomplish our goals (we will probably keep /home separate for now). We can put symbolic links in place to provide the same user experience, but I'm looking for suggestions as to where to mount the actual gpfs filesystems. We have multiple compute clusters with multiple gpfs systems; one cluster has a traditional gpfs system and a separate gss system, which will obviously need multiple mount points. We also want to consider possible future cross-cluster mounts. Some thoughts are to just do filesystems as: /gpfs01, /gpfs02, etc.; /mnt/gpfs01, etc.; /mnt/clustera/gpfs01, etc. What have other people done? Are you happy with it? What would you do differently? Thanks, Stuart -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone From sabujp at gmail.com Thu Sep 25 13:39:14 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 25 Sep 2014 07:39:14 -0500 Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs -u in bin/mmnfsfuncs unexportAll and unexportFS Message-ID: Hi all, We're running 3.5.0.19 with CNFS and noticed really slow CNFS VIP failover times (> 4.5 mins). It looks like this is being caused by all the exportfs -u calls being made in the unexportAll and unexportFS functions in bin/mmnfsfuncs. What's the purpose of running exportfs -u on all the exported directories? We're running only NFSv3 and have lots of exports, and for security reasons we can't have one giant NFS export. That may be a possibility with GPFS 4.1 and NFSv4, but we won't be migrating to that anytime soon. Assume the network went down for the CNFS server or the system panicked/crashed: what would the purpose of exportfs -u be in that case, so what's the purpose at all? Thanks, Sabuj -------------- next part -------------- An HTML attachment was scrubbed... URL: From sabujp at gmail.com Thu Sep 25 14:11:18 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 25 Sep 2014 08:11:18 -0500 Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs -u in bin/mmnfsfuncs unexportAll and unexportFS In-Reply-To: References: Message-ID: Our support engineer suggests adding & to the end of the exportfs -u lines in the mmnfsfuncs script, which is a good workaround; can this be added to future GPFS 3.5 and 4.1 releases (haven't even looked at 4.1 yet)? I was looking at the unexport-all path in nfs-utils/exportfs.c and it looks like the limiting factor there would be all the hostname lookups. I don't see what exportfs -u is doing other than slow reverse lookups and removing the export from the NFS stack. On Thu, Sep 25, 2014 at 7:39 AM, Sabuj Pattanayek wrote: > Hi all, > > We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip failover > times > 4.5mins . It looks like it's being caused by all the exportfs -u > calls being made in the unexportAll and the unexportFS function in > bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the > exported directories? We're running only NFSv3 and have lots of exports and > for security reasons can't have one giant NFS export. That may be a > possibility with GPFS4.1 and NFSv4 but we won't be migrating to that > anytime soon. 
> > Assume the network went down for the cnfs server or the system > panicked/crashed, what would be the purpose of exportfs -u be in that case, > so what's the purpose at all? > > Thanks, > Sabuj > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sabujp at gmail.com Thu Sep 25 14:15:19 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 25 Sep 2014 08:15:19 -0500 Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs -u in bin/mmnfsfuncs unexportAll and unexportFS In-Reply-To: References: Message-ID: yes, it's doing a getaddrinfo() call for every hostname that's a fqdn and not an ip addr, which we have lots of in our export entries since sometimes clients update their dns (ip's). On Thu, Sep 25, 2014 at 8:11 AM, Sabuj Pattanayek wrote: > our support engineer suggests adding & to the end of the exportfs -u lines > in the mmnfsfunc script, which is a good workaround, can this be added to > future gpfs 3.5 and 4.1 rels (haven't even looked at 4.1 yet). I was > looking at the unexportfs all in nfs-utils/exportfs.c and it looks like the > limiting factor there would be all the hostname lookups? I don't see what > exportfs -u is doing other than doing slow reverse lookups and removing the > export from the nfs stack. > > On Thu, Sep 25, 2014 at 7:39 AM, Sabuj Pattanayek > wrote: > >> Hi all, >> >> We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip >> failover times > 4.5mins . It looks like it's being caused by all the >> exportfs -u calls being made in the unexportAll and the unexportFS function >> in bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the >> exported directories? We're running only NFSv3 and have lots of exports and >> for security reasons can't have one giant NFS export. That may be a >> possibility with GPFS4.1 and NFSv4 but we won't be migrating to that >> anytime soon. >> >> Assume the network went down for the cnfs server or the system >> panicked/crashed, what would be the purpose of exportfs -u be in that case, >> so what's the purpose at all? >> >> Thanks, >> Sabuj >> >> >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL:
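(To make the workaround discussed in this thread concrete: the hostnames and paths below are invented, and this is only a sketch of the idea rather than the actual code in bin/mmnfsfuncs.)

# serial unexport, as the failover path effectively does today -- each call
# can block on a slow getaddrinfo()/reverse lookup for an FQDN entry
exportfs -u clientA.example.com:/gpfs/export1
exportfs -u clientB.example.com:/gpfs/export2

# backgrounded variant of the same calls, so one slow DNS lookup no longer
# serialises the whole failover; wait collects the jobs before moving on
exportfs -u clientA.example.com:/gpfs/export1 &
exportfs -u clientB.example.com:/gpfs/export2 &
wait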
Sven On Tue, Sep 2, 2014 at 6:44 AM, Ed Wahl wrote: > Seems like you are on the correct track. This is similar to my setup. > subnett'ed daemon 10GbE, 1GbE with main being QDR RDMA, admin 1GbE. To > my mind the most important part is Setting "privateSubnetOverride" to 1. > This allows both your 1GbE and your 40GbE to be on a private subnet. > Serving block over public IPs just seems wrong on SO many levels. Whether > truly private/internal or not. And how many people use public IPs > internally? Wait, maybe I don't want to know... > > Using 'verbsRdma enable' for your FDR seems to override Daemon node > name for block, at least in my experience. I love the fallback to 10GbE > and then 1GbE in case of disaster when using IB. Lately we seem to be > generating bugs in OpenSM at a frightening rate so that has been > _extremely_ helpful. Now if we could just monitor when it happens more > easily than running mmfsadm test verbs conn, say by logging a failure of > RDMA? > > > Ed > OSC > > ________________________________________ > From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] > on behalf of Simon Thompson (Research Computing - IT Services) [ > S.J.Thompson at bham.ac.uk] > Sent: Monday, September 01, 2014 3:44 PM > To: gpfsug main discussion list > Subject: [gpfsug-discuss] GPFS admin host name vs subnets > > I was just reading through the docs at: > > > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/GPFS+Network+Communication+Overview > > And was wondering about using admin host name bs using subnets. My reading > of the page is that if say I have a 1GbE network and a 40GbE network, I > could have an admin host name on the 1GbE network. But equally from the > docs, it looks like I could also use subnets to achieve the same whilst > allowing the admin network to be a fall back for data if necessary. > > For example, create the cluster using the primary name on the 1GbE > network, then use the subnets property to use set the network on the 40GbE > network as the first and the network on the 1GbE network as the second in > the list, thus GPFS data will pass over the 40GbE network in preference and > the 1GbE network will, by default only be used for admin traffic as the > admin host name will just be the name of the host on the 1GbE network. > > Is my reading of the docs correct? Or do I really want to be creating the > cluster using the 40GbE network hostnames and set the admin node name to > the name of the 1GbE network interface? > > (there's actually also an FDR switch in there somewhere for verbs as well) > > Thanks > > Simon > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Wed Sep 3 18:27:44 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Wed, 03 Sep 2014 18:27:44 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring Message-ID: <54074F90.7000303@ebi.ac.uk> Hello everybody, here i come here again, this time to ask some hint about how to monitor GPFS. 
I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is that they return number based only on the request done in the current host, so i have to run them on all the clients ( over 600 nodes) so its quite unpractical. Instead i would like to know from the servers whats going on, and i came across the vio_s statistics wich are less documented and i dont know exacly what they mean. There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that runs VIO_S. My problems with the output of this command: echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second timestamp: 1409763206/477366 recovery group: * declustered array: * vdisk: * client reads: 2584229 client short writes: 55299693 client medium writes: 190071 client promoted full track writes: 465145 client full track writes: 9249 flushed update writes: 4187708 flushed promoted full track writes: 123 migrate operations: 114 scrub operations: 450590 log writes: 28509602 it sais "VIOPS per second", but they seem to me just counters as every time i re-run the command, the numbers increase by a bit.. Can anyone confirm if those numbers are counter or if they are OPS/sec. On a closer eye about i dont understand what most of thosevalues mean. For example, what exacly are "flushed promoted full track write" ?? I tried to find a documentation about this output , but could not find any. can anyone point me a link where output of vio_s is explained? Another thing i dont understand about those numbers is if they are just operations, or the number of blocks that was read/write/etc . I'm asking that because if they are just ops, i don't know how much they could be usefull. For example one write operation could eman write 1 block or write a file of 100GB. If those are oprations, there is a way to have the oupunt in bytes or blocks? Last but not least.. and this is what i really would like to accomplish, i would to be able to monitor the latency of metadata operations. In my environment there are users that litterally overhelm our storages with metadata request, so even if there is no massive throughput or huge waiters, any "ls" could take ages. I would like to be able to monitor metadata behaviour. There is a way to to do that from the NSD servers? Thanks in advance for any tip/help. Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Wed Sep 3 21:55:14 2014 From: chekh at stanford.edu (Alex Chekholko) Date: Wed, 03 Sep 2014 13:55:14 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54074F90.7000303@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> Message-ID: <54078032.2050605@stanford.edu> The usual way to do that is to re-architect your filesystem so that the system pool is metadata-only, and then you can just look at the storage layer and see total metadata throughput that way. Otherwise your metadata ops are mixed in with your data ops. Of course, both NSDs and clients also have metadata caches. On 09/03/2014 10:27 AM, Salvatore Di Nardo wrote: > > Last but not least.. and this is what i really would like to accomplish, > i would to be able to monitor the latency of metadata operations. > In my environment there are users that litterally overhelm our storages > with metadata request, so even if there is no massive throughput or huge > waiters, any "ls" could take ages. I would like to be able to monitor > metadata behaviour. There is a way to to do that from the NSD servers? 
-- Alex Chekholko chekh at stanford.edu 347-401-4860 From oehmes at us.ibm.com Thu Sep 4 01:50:25 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Wed, 3 Sep 2014 17:50:25 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54074F90.7000303@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> Message-ID: > Hello everybody, Hi > here i come here again, this time to ask some hint about how to monitor GPFS. > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > that they return number based only on the request done in the > current host, so i have to run them on all the clients ( over 600 > nodes) so its quite unpractical. Instead i would like to know from > the servers whats going on, and i came across the vio_s statistics > wich are less documented and i dont know exacly what they mean. > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > runs VIO_S. > > My problems with the output of this command: > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > timestamp: 1409763206/477366 > recovery group: * > declustered array: * > vdisk: * > client reads: 2584229 > client short writes: 55299693 > client medium writes: 190071 > client promoted full track writes: 465145 > client full track writes: 9249 > flushed update writes: 4187708 > flushed promoted full track writes: 123 > migrate operations: 114 > scrub operations: 450590 > log writes: 28509602 > > it sais "VIOPS per second", but they seem to me just counters as > every time i re-run the command, the numbers increase by a bit.. > Can anyone confirm if those numbers are counter or if they are OPS/sec. the numbers are accumulative so everytime you run them they just show the value since start (or last reset) time. > > On a closer eye about i dont understand what most of thosevalues > mean. For example, what exacly are "flushed promoted full track write" ?? > I tried to find a documentation about this output , but could not > find any. can anyone point me a link where output of vio_s is explained? > > Another thing i dont understand about those numbers is if they are > just operations, or the number of blocks that was read/write/etc . its just operations and if i would explain what the numbers mean i might confuse you even more because this is not what you are really looking for. what you are looking for is what the client io's look like on the Server side, while the VIO layer is the Server side to the disks, so one lever lower than what you are looking for from what i could read out of the description above. so the Layer you care about is the NSD Server layer, which sits on top of the VIO layer (which is essentially the SW RAID Layer in GNR) > I'm asking that because if they are just ops, i don't know how much > they could be usefull. For example one write operation could eman > write 1 block or write a file of 100GB. If those are oprations, > there is a way to have the oupunt in bytes or blocks? there are multiple ways to get infos on the NSD layer, one would be to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts again. the alternative option is to use mmdiag --iohist. 
this shows you a history of the last X numbers of io operations on either the client or the server side like on a client : # mmdiag --iohist === mmdiag: iohist === I/O history: I/O start time RW Buf type disk:sectorNum nSec time ms qTime ms RpcTimes ms Type Device/NSD ID NSD server --------------- -- ----------- ----------------- ----- ------- -------- ----------------- ---- ------------------ --------------- 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.668262 R inode 2:1081373696 8 14.117 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.692019 R inode 2:1064356608 8 14.899 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 14:25:53.707100 R inode 2:1077830152 8 16.499 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.906556 R inode 2:1083476520 8 11.723 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.941441 R inode 2:1069885984 8 11.686 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 you basically see if its a inode , data block , what size it has (in sectors) , which nsd server you did send this request to, etc. 
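To turn that history into something you can trend, a small sketch along these lines (assuming the column layout shown above, i.e. R/W in the second column, buffer type in the third and service time in ms in the sixth; the same layout appears on clients and servers) summarises the average service time per buffer type:

#!/usr/bin/env python
# Rough sketch: summarise "mmdiag --iohist" by buffer type.
# Assumes the columns shown above: field 2 = R/W, field 3 = buf type,
# field 6 = service time in ms.
import subprocess
from collections import defaultdict

out = subprocess.Popen(["/usr/lpp/mmfs/bin/mmdiag", "--iohist"],
                       stdout=subprocess.PIPE,
                       universal_newlines=True).communicate()[0]

stats = defaultdict(lambda: [0, 0.0])   # (rw, buftype) -> [count, total ms]
for line in out.splitlines():
    f = line.split()
    if len(f) < 6 or f[1] not in ("R", "W"):
        continue                        # skip headers and separator lines
    try:
        ms = float(f[5])
    except ValueError:
        continue
    stats[(f[1], f[2])][0] += 1
    stats[(f[1], f[2])][1] += ms

for (rw, buf), (n, total) in sorted(stats.items()):
    print("%s %-12s %6d ops  avg %7.3f ms" % (rw, buf, n, total / n))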
on the Server side you see the type , which physical disk it goes to and also what size of disk i/o it causes like : 14:26:50.129995 R inode 12:3211886376 64 14.261 0.000 0.000 0.000 pd sdis 14:26:50.137102 R inode 19:3003969520 64 9.004 0.000 0.000 0.000 pd sdad 14:26:50.136116 R inode 55:3591710992 64 11.057 0.000 0.000 0.000 pd sdoh 14:26:50.141510 R inode 21:3066810504 64 5.909 0.000 0.000 0.000 pd sdaf 14:26:50.130529 R inode 89:2962370072 64 17.437 0.000 0.000 0.000 pd sddi 14:26:50.131063 R inode 78:1889457000 64 17.062 0.000 0.000 0.000 pd sdsj 14:26:50.143403 R inode 36:3323035688 64 4.807 0.000 0.000 0.000 pd sdmw 14:26:50.131044 R inode 37:2513579736 128 17.181 0.000 0.000 0.000 pd sddv 14:26:50.138181 R inode 72:3868810400 64 10.951 0.000 0.000 0.000 pd sdbz 14:26:50.138188 R inode 131:2443484784 128 11.792 0.000 0.000 0.000 pd sdug 14:26:50.138003 R inode 102:3696843872 64 11.994 0.000 0.000 0.000 pd sdgp 14:26:50.137099 R inode 145:3370922504 64 13.225 0.000 0.000 0.000 pd sdmi 14:26:50.141576 R inode 62:2668579904 64 9.313 0.000 0.000 0.000 pd sdou 14:26:50.134689 R inode 159:2786164648 64 16.577 0.000 0.000 0.000 pd sdpq 14:26:50.145034 R inode 34:2097217320 64 7.409 0.000 0.000 0.000 pd sdmt 14:26:50.138140 R inode 139:2831038792 64 14.898 0.000 0.000 0.000 pd sdlw 14:26:50.130954 R inode 164:282120312 64 22.274 0.000 0.000 0.000 pd sdzd 14:26:50.137038 R inode 41:3421909608 64 16.314 0.000 0.000 0.000 pd sdef 14:26:50.137606 R inode 104:1870962416 64 16.644 0.000 0.000 0.000 pd sdgx 14:26:50.141306 R inode 65:2276184264 64 16.593 0.000 0.000 0.000 pd sdrk > > Last but not least.. and this is what i really would like to > accomplish, i would to be able to monitor the latency of metadata operations. you can't do this on the server side as you don't know how much time you spend on the client , network or anything between the app and the physical disk, so you can only reliably look at this from the client, the iohist output only shows you the Server disk i/o processing time, but that can be a fraction of the overall time (in other cases this obviously can also be the dominant part depending on your workload). the easiest way on the client is to run mmfsadm vfsstats enable from now on vfs stats are collected until you restart GPFS. then run : vfs statistics currently enabled started at: Fri Aug 29 13:15:05.380 2014 duration: 448446.970 sec name calls time per call total time -------------------- -------- -------------- -------------- statfs 9 0.000002 0.000021 startIO 246191176 0.005853 1441049.976740 to dump what ever you collected so far on this node. > In my environment there are users that litterally overhelm our > storages with metadata request, so even if there is no massive > throughput or huge waiters, any "ls" could take ages. I would like > to be able to monitor metadata behaviour. There is a way to to do > that from the NSD servers? not this simple as described above. > > Thanks in advance for any tip/help. > > Regards, > Salvatore_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
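One more note on the counters: since everything mmpmon reports is cumulative, per-second rates have to be derived by sampling twice and dividing the difference by the interval. A minimal sketch, reusing the exact vio_s invocation from earlier in the thread (the "label: value" parsing is my assumption about the output layout; the same approach works for fs_io_s / io_s on the clients):

#!/usr/bin/env python
# Rough sketch: sample the cumulative vio_s counters twice and print
# per-second rates. The "label: value" parsing is an assumption about
# the mmpmon output layout shown earlier in the thread.
import subprocess
import time

CMD = 'echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1'
INTERVAL = 10.0                          # seconds between the two samples

def sample():
    out = subprocess.Popen(CMD, shell=True, stdout=subprocess.PIPE,
                           universal_newlines=True).communicate()[0]
    counters = {}
    for line in out.splitlines():
        if ":" not in line:
            continue
        label, _, value = line.rpartition(":")
        if value.strip().isdigit():      # keep only plain integer counters
            counters[label.strip()] = int(value)
    return counters

first = sample()
time.sleep(INTERVAL)
second = sample()

for label in sorted(second):
    delta = second[label] - first.get(label, 0)
    print("%-40s %10.1f per second" % (label, delta / INTERVAL))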
URL: From service at metamodul.com Thu Sep 4 11:05:18 2014 From: service at metamodul.com (service at metamodul.com) Date: Thu, 4 Sep 2014 12:05:18 +0200 (CEST) Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54074F90.7000303@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> Message-ID: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> > , any "ls" could take ages. Check if you large directories either with many files or simply large. Verify if you have NFS exported GPFS. Verify that your cache settings on the clients are large enough ( maxStatCache , maxFilesToCache , sharedMemLimit ) Verify that you have dedicated metadata luns ( metadataOnly ) Reference: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters Note: If possible monitor your metadata luns on the storage directly. hth Hajo -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Sep 4 11:43:36 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 11:43:36 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: References: <54074F90.7000303@ebi.ac.uk> Message-ID: <54084258.90508@ebi.ac.uk> On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just show > the value since start (or last reset) time. OK, you confirmed my toughts, thatks > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track > write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. 
> what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. No, what I'm looking for is exactly how busy the disks are in serving the requests. Obviously that is not the only thing I look at, but I feel the need to monitor also those things. I'll explain why. It happens, when our storage is quite busy ( 180Gb/s of read/write ), that the FS starts to be slow on normal cd or ls requests. This might be normal, but in those situations I want to know where the bottleneck is. Is it the server CPU? Memory? Network? Spindles? Knowing where the bottleneck is might help me understand whether we can tweak the system a bit more. If it's the CPU on the servers, then there is not much to do besides replacing them or adding more servers. If it's not the CPU, maybe more memory would help? Maybe it's just the network that filled up, so I can add more links? Or, if we have reached the point where the bottleneck is the spindles, then there is not much point looking somewhere else: we have simply reached the hardware limit. Sometimes it also happens that there is very low IO ( 10Gb/s ), almost no CPU usage on the servers, but huge slowness ( an ls can take 10 seconds ). Why does that happen? There are not many data ops, but we think there is a huge amount of metadata ops. So what I want to know is whether the metadata vdisks are busy or not. If this is our problem, could some SSD disks dedicated to metadata help? In particular, I am a bit puzzled by the design of our GSS storage. Each recovery group has 3 declustered arrays, and each declustered array has 1 data and 1 metadata vdisk, but in the end both the metadata and the data vdisks use the same spindles. The problem is that I don't understand whether we have a metadata bottleneck there. Maybe some SSD disks in a dedicated declustered array would perform much better, but this is just theory. I really would like to be able to monitor IO activity on the metadata vdisks. > > > so the Layer you care about is the NSD Server layer, which sits on top > of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be to > use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts > again. Counters are not a problem: I can collect them and create some graphs in a monitoring tool. I will check that. > > the alternative option is to use mmdiag --iohist.
this shows you a > history of the last X numbers of io operations on either the client or > the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms qTime > ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 > 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 > 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 > 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 > 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 > 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 > 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 > 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 > 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. 
> > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > mmdiag --iohist its another think i looked at it, but i could not find good explanation for all the "buf type" ( third column ) allocSeg data iallocSeg indBlock inode LLIndBlock logData logDesc logWrap metadata vdiskAULog vdiskBuf vdiskFWLog vdiskMDLog vdiskMeta vdiskRGDesc If i want to monifor metadata operation whan should i look at? just the metadata flag or also inode? this command takes also long to run, especially if i run it a second time it hangs for a lot before to rerun again, so i'm not sure that run it every 30secs or minute its viable, but i will look also into that. THere is any documentation that descibes clearly the whole output? what i found its quite generic and don't go into details... > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and the > physical disk, so you can only reliably look at this from the client, > the iohist output only shows you the Server disk i/o processing time, > but that can be a fraction of the overall time (in other cases this > obviously can also be the dominant part depending on your workload). > > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. 
> We already do that, but as I said, I want to check specifically how gss servers are keeping the requests to identify or exlude server side bottlenecks. Thanks for your help, you gave me definitely few things where to look at. Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Sep 4 11:58:51 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 11:58:51 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> References: <54074F90.7000303@ebi.ac.uk> <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> Message-ID: <540845EB.1020202@ebi.ac.uk> Little clarification, the filsystemn its not always slow. It happens that became very slow with particular users jobs in the farm. Maybe its just an indication thant we have huge ammount of metadata requestes, that's why i want to be able to monitor them On 04/09/14 11:05, service at metamodul.com wrote: > > , any "ls" could take ages. > Check if you large directories either with many files or simply large. it happens that the files are very large ( over 100G), but there usually ther are no many files. > Verify if you have NFS exported GPFS. No NFS > Verify that your cache settings on the clients are large enough ( > maxStatCache , maxFilesToCache , sharedMemLimit ) will look at them, but i'm not sure that the best number will be on the client. Obviously i cannot use all the memory of the client because those blients are meant to run jobs.... > Verify that you have dedicated metadata luns ( metadataOnly ) Yes, we have dedicate vdisks for metadata, but they are in the same declustered arrays/recoverygroups, so they whare the same spindles > Reference: > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters > > Note: > If possible monitor your metadata luns on the storage directly. that?s exactly than I'm trying to do !!!! :-D > hth > Hajo > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Thu Sep 4 13:04:21 2014 From: service at metamodul.com (service at metamodul.com) Date: Thu, 4 Sep 2014 14:04:21 +0200 (CEST) Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540845EB.1020202@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> <540845EB.1020202@ebi.ac.uk> Message-ID: <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de> ... , any "ls" could take ages. >Check if you large directories either with many files or simply large. >> it happens that the files are very large ( over 100G), but there usually >> ther are no many files. >>> Please check that the directory size is not large. In a worst case you have a directory with 10MB in size but it contains only one file. In any way GPFS must fetch the whole directory structure might causing unnecassery IO. Thus my request that you check your directory sizes. >Verify that your cache settings on the clients are large enough ( maxStatCache >, maxFilesToCache , sharedMemLimit ) >>will look at them, but i'm not sure that the best number will be on the >>client. 
Obviously i cannot use all the memory of the client because those >>blients are meant to run jobs.... Use lsof on the client to determine the number of open files. mmdiag --stats ( from my memory ) shows a little bit about the cache usage. maxStatCache does not use that much memory. > Verify that you have dedicated metadata luns ( metadataOnly ) >> Yes, we have dedicate vdisks for metadata, but they are in the same >> declustered arrays/recoverygroups, so they whare the same spindles That's IMHO not a good approach. Metadata operations are small and random; data I/O is large and streaming. Just think of a highway full of large trucks while you try to reach your destination on a fast bike: you will be blocked. You have the same problem at the destination: if many large trucks want to unload their cargo, there is no time for somebody with a small parcel. That's the same reason why you should not access tape storage and disk storage via the same FC adapter ( streaming I/O versus random/small I/O ). So even without your current problem and the motivation for measuring, I would strongly suggest having at least dedicated SSDs for metadata and, if possible, even dedicated NSD servers for the metadata. Meaning: have a dedicated path for your data and a dedicated path for your metadata. All from a user's point of view Hajo -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Sep 4 14:25:09 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 14:25:09 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de> References: <54074F90.7000303@ebi.ac.uk> <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> <540845EB.1020202@ebi.ac.uk> <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de> Message-ID: <54086835.6050603@ebi.ac.uk>
From sdinardo at ebi.ac.uk Thu Sep 4 14:32:15 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 14:32:15 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: References: <54074F90.7000303@ebi.ac.uk> Message-ID: <540869DF.5060100@ebi.ac.uk> Sorry to bother you again but dstat have some issues with the plugin: [root at gss01a util]# dstat --gpfs /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) Module dstat_gpfs failed to load. (global name 'select' is not defined) None of the stats you selected are available. I found this solution , but involve dstat recompile.... https://github.com/dagwieers/dstat/issues/44 Are you aware about any easier solution (we use RHEL6.3) ? Regards, Salvatore On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just show > the value since start (or last reset) time. > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track > write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. > what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. > > so the Layer you care about is the NSD Server layer, which sits on top > of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. 
For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be to > use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts > again. > > the alternative option is to use mmdiag --iohist. this shows you a > history of the last X numbers of io operations on either the client or > the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms qTime > ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 > 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 > 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 > 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 > 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 > 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 > 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 > 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 > 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. 
> > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and the > physical disk, so you can only reliably look at this from the client, > the iohist output only shows you the Server disk i/o processing time, > but that can be a fraction of the overall time (in other cases this > obviously can also be the dominant part depending on your workload). > > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. > > > In my environment there are users that litterally overhelm our > > storages with metadata request, so even if there is no massive > > throughput or huge waiters, any "ls" could take ages. I would like > > to be able to monitor metadata behaviour. There is a way to to do > > that from the NSD servers? > > not this simple as described above. > > > > > Thanks in advance for any tip/help. 
> > > > Regards, > > Salvatore_______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at gpfsug.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Sep 4 14:54:37 2014 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 04 Sep 2014 14:54:37 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540869DF.5060100@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> Message-ID: <54086F1D.1000401@ed.ac.uk> On 04/09/14 14:32, Salvatore Di Nardo wrote: > Sorry to bother you again but dstat have some issues with the plugin: > > [root at gss01a util]# dstat --gpfs > /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is > deprecated. Use the subprocess module. > pipes[cmd] = os.popen3(cmd, 't', 0) > Module dstat_gpfs failed to load. (global name 'select' is not > defined) > None of the stats you selected are available. > > I found this solution , but involve dstat recompile.... > > https://github.com/dagwieers/dstat/issues/44 > > Are you aware about any easier solution (we use RHEL6.3) ? > This worked for me the other day on a dev box I was poking at: # rm /usr/share/dstat/dstat_gpfsops* # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 /usr/share/dstat/dstat_gpfsops.py # dstat --gpfsops /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- cr del op/cl rd wr trunc fsync looku gattr sattr other mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... > > Regards, > Salvatore > > On 04/09/14 01:50, Sven Oehme wrote: >> > Hello everybody, >> >> Hi >> >> > here i come here again, this time to ask some hint about how to >> monitor GPFS. >> > >> > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is >> > that they return number based only on the request done in the >> > current host, so i have to run them on all the clients ( over 600 >> > nodes) so its quite unpractical. Instead i would like to know from >> > the servers whats going on, and i came across the vio_s statistics >> > wich are less documented and i dont know exacly what they mean. >> > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that >> > runs VIO_S. >> > >> > My problems with the output of this command: >> > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 >> > >> > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second >> > timestamp: 1409763206/477366 >> > recovery group: * >> > declustered array: * >> > vdisk: * >> > client reads: 2584229 >> > client short writes: 55299693 >> > client medium writes: 190071 >> > client promoted full track writes: 465145 >> > client full track writes: 9249 >> > flushed update writes: 4187708 >> > flushed promoted full track writes: 123 >> > migrate operations: 114 >> > scrub operations: 450590 >> > log writes: 28509602 >> > >> > it sais "VIOPS per second", but they seem to me just counters as >> > every time i re-run the command, the numbers increase by a bit.. 
>> > Can anyone confirm if those numbers are counter or if they are OPS/sec. >> >> the numbers are accumulative so everytime you run them they just show >> the value since start (or last reset) time. >> >> > >> > On a closer eye about i dont understand what most of thosevalues >> > mean. For example, what exacly are "flushed promoted full track >> write" ?? >> > I tried to find a documentation about this output , but could not >> > find any. can anyone point me a link where output of vio_s is explained? >> > >> > Another thing i dont understand about those numbers is if they are >> > just operations, or the number of blocks that was read/write/etc . >> >> its just operations and if i would explain what the numbers mean i >> might confuse you even more because this is not what you are really >> looking for. >> what you are looking for is what the client io's look like on the >> Server side, while the VIO layer is the Server side to the disks, so >> one lever lower than what you are looking for from what i could read >> out of the description above. >> >> so the Layer you care about is the NSD Server layer, which sits on top >> of the VIO layer (which is essentially the SW RAID Layer in GNR) >> >> > I'm asking that because if they are just ops, i don't know how much >> > they could be usefull. For example one write operation could eman >> > write 1 block or write a file of 100GB. If those are oprations, >> > there is a way to have the oupunt in bytes or blocks? >> >> there are multiple ways to get infos on the NSD layer, one would be to >> use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts >> again. >> >> the alternative option is to use mmdiag --iohist. this shows you a >> history of the last X numbers of io operations on either the client or >> the server side like on a client : >> >> # mmdiag --iohist >> >> === mmdiag: iohist === >> >> I/O history: >> >> I/O start time RW Buf type disk:sectorNum nSec time ms qTime >> ms RpcTimes ms Type Device/NSD ID NSD server >> --------------- -- ----------- ----------------- ----- ------- >> -------- ----------------- ---- ------------------ --------------- >> 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 >> 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 >> 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 >> 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.668262 R inode 2:1081373696 8 14.117 >> 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 >> 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.692019 R inode 2:1064356608 8 14.899 >> 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:53.707100 R inode 2:1077830152 8 16.499 >> 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 >> 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 >> 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 >> 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 >> 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.906556 R inode 2:1083476520 8 11.723 >> 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 >> 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 >> 
14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 >> 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 >> 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.941441 R inode 2:1069885984 8 11.686 >> 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 >> 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 >> 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 >> 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 >> 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 >> >> you basically see if its a inode , data block , what size it has (in >> sectors) , which nsd server you did send this request to, etc. >> >> on the Server side you see the type , which physical disk it goes to >> and also what size of disk i/o it causes like : >> >> 14:26:50.129995 R inode 12:3211886376 64 14.261 >> 0.000 0.000 0.000 pd sdis >> 14:26:50.137102 R inode 19:3003969520 64 9.004 >> 0.000 0.000 0.000 pd sdad >> 14:26:50.136116 R inode 55:3591710992 64 11.057 >> 0.000 0.000 0.000 pd sdoh >> 14:26:50.141510 R inode 21:3066810504 64 5.909 >> 0.000 0.000 0.000 pd sdaf >> 14:26:50.130529 R inode 89:2962370072 64 17.437 >> 0.000 0.000 0.000 pd sddi >> 14:26:50.131063 R inode 78:1889457000 64 17.062 >> 0.000 0.000 0.000 pd sdsj >> 14:26:50.143403 R inode 36:3323035688 64 4.807 >> 0.000 0.000 0.000 pd sdmw >> 14:26:50.131044 R inode 37:2513579736 128 17.181 >> 0.000 0.000 0.000 pd sddv >> 14:26:50.138181 R inode 72:3868810400 64 10.951 >> 0.000 0.000 0.000 pd sdbz >> 14:26:50.138188 R inode 131:2443484784 128 11.792 >> 0.000 0.000 0.000 pd sdug >> 14:26:50.138003 R inode 102:3696843872 64 11.994 >> 0.000 0.000 0.000 pd sdgp >> 14:26:50.137099 R inode 145:3370922504 64 13.225 >> 0.000 0.000 0.000 pd sdmi >> 14:26:50.141576 R inode 62:2668579904 64 9.313 >> 0.000 0.000 0.000 pd sdou >> 14:26:50.134689 R inode 159:2786164648 64 16.577 >> 0.000 0.000 0.000 pd sdpq >> 14:26:50.145034 R inode 34:2097217320 64 7.409 >> 0.000 0.000 0.000 pd sdmt >> 14:26:50.138140 R inode 139:2831038792 64 14.898 >> 0.000 0.000 0.000 pd sdlw >> 14:26:50.130954 R inode 164:282120312 64 22.274 >> 0.000 0.000 0.000 pd sdzd >> 14:26:50.137038 R inode 41:3421909608 64 16.314 >> 0.000 0.000 0.000 pd sdef >> 14:26:50.137606 R inode 104:1870962416 64 16.644 >> 0.000 0.000 0.000 pd sdgx >> 14:26:50.141306 R inode 65:2276184264 64 16.593 >> 0.000 0.000 0.000 pd sdrk >> >> >> > >> > Last but not least.. and this is what i really would like to >> > accomplish, i would to be able to monitor the latency of metadata >> operations. >> >> you can't do this on the server side as you don't know how much time >> you spend on the client , network or anything between the app and the >> physical disk, so you can only reliably look at this from the client, >> the iohist output only shows you the Server disk i/o processing time, >> but that can be a fraction of the overall time (in other cases this >> obviously can also be the dominant part depending on your workload). >> >> the easiest way on the client is to run >> >> mmfsadm vfsstats enable >> from now on vfs stats are collected until you restart GPFS. 
>> >> then run : >> >> vfs statistics currently enabled >> started at: Fri Aug 29 13:15:05.380 2014 >> duration: 448446.970 sec >> >> name calls time per call total time >> -------------------- -------- -------------- -------------- >> statfs 9 0.000002 0.000021 >> startIO 246191176 0.005853 1441049.976740 >> >> to dump what ever you collected so far on this node. >> >> > In my environment there are users that litterally overhelm our >> > storages with metadata request, so even if there is no massive >> > throughput or huge waiters, any "ls" could take ages. I would like >> > to be able to monitor metadata behaviour. There is a way to to do >> > that from the NSD servers? >> >> not this simple as described above. >> >> > >> > Thanks in advance for any tip/help. >> > >> > Regards, >> > Salvatore_______________________________________________ >> > gpfsug-discuss mailing list >> > gpfsug-discuss at gpfsug.org >> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Research Facilities (ECDF) Systems Leader Information Services IT Infrastructure Division Tel: 0131 650 4994 skype: orlando.richards The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From sdinardo at ebi.ac.uk Thu Sep 4 15:07:42 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 15:07:42 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54086F1D.1000401@ed.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> Message-ID: <5408722E.6060309@ebi.ac.uk> On 04/09/14 14:54, Orlando Richards wrote: > > > On 04/09/14 14:32, Salvatore Di Nardo wrote: >> Sorry to bother you again but dstat have some issues with the plugin: >> >> [root at gss01a util]# dstat --gpfs >> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is >> deprecated. Use the subprocess module. >> pipes[cmd] = os.popen3(cmd, 't', 0) >> Module dstat_gpfs failed to load. (global name 'select' is not >> defined) >> None of the stats you selected are available. >> >> I found this solution , but involve dstat recompile.... >> >> https://github.com/dagwieers/dstat/issues/44 >> >> Are you aware about any easier solution (we use RHEL6.3) ? >> > > This worked for me the other day on a dev box I was poking at: > > # rm /usr/share/dstat/dstat_gpfsops* > > # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 > /usr/share/dstat/dstat_gpfsops.py > > # dstat --gpfsops > /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use > the subprocess module. > pipes[cmd] = os.popen3(cmd, 't', 0) > ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- > > cr del op/cl rd wr trunc fsync looku gattr sattr other > mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w > 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 > > ... > NICE!! 
The only problem is that the box seems lacking those python scripts: ls /usr/lpp/mmfs/samples/util/ makefile README tsbackup tsbackup.C tsbackup.h tsfindinode tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c tslistall tsreaddir tsreaddir.c tstimes tstimes.c Do you mind sending me those py files? They should be 3 as i see e gpfs options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Sep 4 15:14:02 2014 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 04 Sep 2014 15:14:02 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <5408722E.6060309@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> <5408722E.6060309@ebi.ac.uk> Message-ID: <540873AA.5070401@ed.ac.uk> On 04/09/14 15:07, Salvatore Di Nardo wrote: > > On 04/09/14 14:54, Orlando Richards wrote: >> >> >> On 04/09/14 14:32, Salvatore Di Nardo wrote: >>> Sorry to bother you again but dstat have some issues with the plugin: >>> >>> [root at gss01a util]# dstat --gpfs >>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is >>> deprecated. Use the subprocess module. >>> pipes[cmd] = os.popen3(cmd, 't', 0) >>> Module dstat_gpfs failed to load. (global name 'select' is not >>> defined) >>> None of the stats you selected are available. >>> >>> I found this solution , but involve dstat recompile.... >>> >>> https://github.com/dagwieers/dstat/issues/44 >>> >>> Are you aware about any easier solution (we use RHEL6.3) ? >>> >> >> This worked for me the other day on a dev box I was poking at: >> >> # rm /usr/share/dstat/dstat_gpfsops* >> >> # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 >> /usr/share/dstat/dstat_gpfsops.py >> >> # dstat --gpfsops >> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use >> the subprocess module. >> pipes[cmd] = os.popen3(cmd, 't', 0) >> ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- >> >> cr del op/cl rd wr trunc fsync looku gattr sattr other >> mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w >> 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 >> >> ... >> > > NICE!! The only problem is that the box seems lacking those python scripts: > > ls /usr/lpp/mmfs/samples/util/ > makefile README tsbackup tsbackup.C tsbackup.h tsfindinode > tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c > tslistall tsreaddir tsreaddir.c tstimes tstimes.c > It came from the gpfs.base rpm: # rpm -qf /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 gpfs.base-3.5.0-13.x86_64 > Do you mind sending me those py files? They should be 3 as i see e gpfs > options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) > Only the gpfsops.py is included in the bundle - one for dstat 0.6 and one for dstat 0.7. I've attached it to this mail as well (it seems to be GPL'd). > Regards, > Salvatore > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Research Facilities (ECDF) Systems Leader Information Services IT Infrastructure Division Tel: 0131 650 4994 skype: orlando.richards The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
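As a practical follow-on to the above, here is a minimal sketch of how the sample plugin (attached below) could be installed and left logging counters for later graphing. It assumes the gpfs.base level in use ships the sample file, as in the rpm -qf output above; the 60-second interval and the CSV path are arbitrary illustrative choices:

  # install the dstat 0.7 flavour of the sample plugin shipped in gpfs.base
  cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 \
     /usr/share/dstat/dstat_gpfsops.py

  # record the GPFS vfs/disk-i/o counters every 60 seconds to a CSV file
  dstat --gpfsops --output /var/log/dstat_gpfsops.csv 60 > /dev/null &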
-------------- next part -------------- # # Copyright (C) 2009, 2010 IBM Corporation # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2, or (at your option) # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. # global string, select, os, re, fnmatch import string, select, os, re, fnmatch # Dstat class to display selected gpfs performance counters returned by the # mmpmon "vfs_s", "ioc_s", "vio_s", "vflush_s", and "lroc_s" commands. # # The set of counters displayed can be customized via environment variables: # # DSTAT_GPFS_WHAT # # Selects which of the five mmpmon commands to display. # It is a comma separated list of any of the following: # "vfs": show mmpmon "vfs_s" counters # "ioc": show mmpmon "ioc_s" counters related to NSD client I/O # "nsd": show mmpmon "ioc_s" counters related to NSD server I/O # "vio": show mmpmon "vio_s" counters # "vflush": show mmpmon "vflush_s" counters # "lroc": show mmpmon "lroc_s" counters # "all": equivalent to specifying all of the above # # Example: # # DSTAT_GPFS_WHAT=vfs,lroc dstat -M gpfsops # # will display counters for mmpmon "vfs_s" and "lroc" commands. # # The default setting is "vfs,ioc", i.e., by default only "vfs_s" and NSD # client related "ioc_s" counters are displayed. # # DSTAT_GPFS_VFS # DSTAT_GPFS_IOC # DSTAT_GPFS_VIO # DSTAT_GPFS_VFLUSH # DSTAT_GPFS_LROC # # Allow finer grain control over exactly which values will be displayed for # each of the five mmpmon commands. Each variable is a comma separated list # of counter names with optional column header string. # # Example: # # export DSTAT_GPFS_VFS='create, remove, rd/wr=read+write' # export DSTAT_GPFS_IOC='sync*, logwrap*, oth_rd=*_rd, oth_wr=*_wr' # dstat -M gpfsops # # Under "vfs-ops" this will display three columns, showing creates, deletes # (removes), and a third column labelled "rd/wr" with a combined count of # read and write operations. # Under "disk-i/o" it will display four columns, showing all disk I/Os # initiated by sync, and log wrap, plus two columns labeled "oth_rd" and # "oth_wr" showing counts of all other disk reads and disk writes, # respectively. # # Note: setting one of these environment variables overrides the # corrosponding setting in DSTAT_GPFS_WHAT. For example, setting # DSTAT_GPFS_VFS="" will omit all "vfs_s" counters regardless of whether # "vfs" appears in DSTAT_GPFS_WHAT or not. # # Counter sets are specified as a comma-separated list of entries of one # of the following forms # # counter # label = counter # label = counter1 + counter2 + ... # # If no label is specified, the name of the counter is used as the column # header (truncated to 5 characters). # Counter names may contain shell-style wildcards. For example, the # pattern "sync*" matches the two ioc_s counters "sync_rd" and "sync_wr" and # therefore produce a column containing the combined count of disk reads and # disk writes initiated by sync. 
If a counter appears in or matches a name # pattern in more than one entry, it is included only in the count under the # first entry in which it appears. For example, adding an entry "other = *" # at the end of the list will add a column labeled "other" that shows the # sum of all counter values *not* included in any of the previous columns. # # DSTAT_GPFS_LIST=1 dstat -M gpfsops # # This will show all available counter names and the default definition # for which sets of counter values are displayed. # # An alternative to setting environment variables is to create a file # ~/.dstat_gpfs_rc # with python statements that sets any of the following variables # vfs_wanted: equivalent to setting DSTAT_GPFS_VFS # ioc_wanted: equivalent to setting DSTAT_GPFS_IOC # vio_wanted: equivalent to setting DSTAT_GPFS_VIO # vflush_wanted: equivalent to setting DSTAT_GPFS_VFLUSH # lroc_wanted: equivalent to setting DSTAT_GPFS_LROC # # For example, the following ~/.dstat_gpfs_rc file will produce the same # result as the environment variables in the example above: # # vfs_wanted = 'create, remove, rd/wr=read+write' # ioc_wanted = 'sync*, logwrap*, oth_rd=*_rd, oth_wr=*_wr' # # See also the default vfs_wanted, ioc_wanted, and vio_wanted settings in # the dstat_gpfsops __init__ method below. class dstat_plugin(dstat): def __init__(self): # list of all stats counters returned by mmpmon "vfs_s", "ioc_s", "vio_s", "vflush_s", and "lroc_s" # always ignore the first few chars like : io_s _io_s_ _n_ 172.31.136.2 _nn_ mgmt001st001 _rc_ 0 _t_ 1322526286 _tu_ 415518 vfs_keys = ('_access_', '_close_', '_create_', '_fclear_', '_fsync_', '_fsync_range_', '_ftrunc_', '_getattr_', '_link_', '_lockctl_', '_lookup_', '_map_lloff_', '_mkdir_', '_mknod_', '_open_', '_read_', '_write_', '_mmapRead_', '_mmapWrite_', '_aioRead_', '_aioWrite_','_readdir_', '_readlink_', '_readpage_', '_remove_', '_rename_', '_rmdir_', '_setacl_', '_setattr_', '_symlink_', '_unmap_', '_writepage_', '_tsfattr_', '_tsfsattr_', '_flock_', '_setxattr_', '_getxattr_', '_listxattr_', '_removexattr_', '_encode_fh_', '_decode_fh_', '_get_dentry_', '_get_parent_', '_mount_', '_statfs_', '_sync_', '_vget_') ioc_keys = ('_other_rd_', '_other_wr_','_mb_rd_', '_mb_wr_', '_steal_rd_', '_steal_wr_', '_cleaner_rd_', '_cleaner_wr_', '_sync_rd_', '_sync_wr_', '_logwrap_rd_', '_logwrap_wr_', '_revoke_rd_', '_revoke_wr_', '_prefetch_rd_', '_prefetch_wr_', '_logdata_rd_', '_logdata_wr_', '_nsdworker_rd_', '_nsdworker_wr_','_nsdlocal_rd_','_nsdlocal_wr_', '_vdisk_rd_','_vdisk_wr_', '_pdisk_rd_','_pdisk_wr_', '_logtip_rd_', '_logtip_wr_') vio_keys = ('_r_', '_sw_', '_mw_', '_pfw_', '_ftw_', '_fuw_', '_fpw_', '_m_', '_s_', '_l_', '_rgd_', '_meta_') vflush_keys = ('_ndt_', '_ngdb_', '_nfwlmb_', '_nfipt_', '_nfwwt_', '_ahwm_', '_susp_', '_uwrttf_', '_fftc_', '_nalth_', '_nasth_', '_nsigth_', '_ntgtth_') lroc_keys = ('_Inode_s_', '_Inode_sf_', '_Inode_smb_', '_Inode_r_', '_Inode_rf_', '_Inode_rmb_', '_Inode_i_', '_Inode_imb_', '_Directory_s_', '_Directory_sf_', '_Directory_smb_', '_Directory_r_', '_Directory_rf_', '_Directory_rmb_', '_Directory_i_', '_Directory_imb_', '_Data_s_', '_Data_sf_', '_Data_smb_', '_Data_r_', '_Data_rf_', '_Data_rmb_', '_Data_i_', '_Data_imb_', '_agt_i_', '_agt_i_rm_', '_agt_i_rM_', '_agt_i_ra_', '_agt_r_', '_agt_r_rm_', '_agt_r_rM_', '_agt_r_ra_', '_ssd_w_', '_ssd_w_p_', '_ssd_w_rm_', '_ssd_w_rM_', '_ssd_w_ra_', '_ssd_r_', '_ssd_r_p_', '_ssd_r_rm_', '_ssd_r_rM_', '_ssd_r_ra_') # Default counters to display for each mmpmon category vfs_wanted = '''cr 
= create + mkdir + link + symlink, del = remove + rmdir, op/cl = open + close + map_lloff + unmap, rd = read + readdir + readlink + mmapRead + readpage + aioRead + aioWrite, wr = write + mmapWrite + writepage, trunc = ftrunc + fclear, fsync = fsync + fsync_range, lookup, gattr = access + getattr + getxattr + getacl, sattr = setattr + setxattr + setacl, other = * ''' ioc_wanted1 = '''mb_rd, mb_wr, pref=prefetch_rd, wrbeh=prefetch_wr, steal*, cleaner*, sync*, revoke*, logwrap*, logdata*, oth_rd = other_rd, oth_wr = other_wr ''' ioc_wanted2 = '''rns_r=nsdworker_rd, rns_w=nsdworker_wr, lns_r=nsdlocal_rd, lns_w=nsdlocal_wr, vd_r=vdisk_rd, vd_w=vdisk_wr, pd_r=pdisk_rd, pd_w=pdisk_wr, ''' vio_wanted = '''ClRead=r, ClShWr=sw, ClMdWr=mw, ClPFTWr=pfw, ClFTWr=ftw, FlUpWr=fuw, FlPFTWr=fpw, Migrte=m, Scrub=s, LgWr=l, RGDsc=rgd, Meta=meta ''' vflush_wanted = '''DiTrk = ndt, DiBuf = ngdb, FwLog = nfwlmb, FinPr = nfipt, WraTh = nfwwt, HiWMa = ahwm, Suspd = susp, WrThF = uwrttf, Force = fftc, TrgTh = ntgtth, other = nalth + nasth + nsigth ''' lroc_wanted = '''StorS = Inode_s + Directory_s + Data_s, StorF = Inode_sf + Directory_sf + Data_sf, FetcS = Inode_r + Directory_r + Data_r, FetcF = Inode_rf + Directory_rf + Data_rf, InVAL = Inode_i + Directory_i + Data_i ''' # Coarse counter selection via DSTAT_GPFS_WHAT if 'DSTAT_GPFS_WHAT' in os.environ: what_wanted = os.environ['DSTAT_GPFS_WHAT'].split(',') else: what_wanted = [ 'vfs', 'ioc' ] # If ".dstat_gpfs_rc" exists in user's home directory, run it. # Otherwise, use DSTAT_GPFS_WHAT for counter selection and look for other # DSTAT_GPFS_XXX environment variables for additional customization. userprofile = os.path.join(os.environ['HOME'], '.dstat_gpfs_rc') if os.path.exists(userprofile): ioc_wanted = ioc_wanted1 + ioc_wanted2 exec file(userprofile) else: if 'all' not in what_wanted: if 'vfs' not in what_wanted: vfs_wanted = '' if 'ioc' not in what_wanted: ioc_wanted1 = '' if 'nsd' not in what_wanted: ioc_wanted2 = '' if 'vio' not in what_wanted: vio_wanted = '' if 'vflush' not in what_wanted: vflush_wanted = '' if 'lroc' not in what_wanted: lroc_wanted = '' ioc_wanted = ioc_wanted1 + ioc_wanted2 # Fine grain counter cusomization via DSTAT_GPFS_XXX if 'DSTAT_GPFS_VFS' in os.environ: vfs_wanted = os.environ['DSTAT_GPFS_VFS'] if 'DSTAT_GPFS_IOC' in os.environ: ioc_wanted = os.environ['DSTAT_GPFS_IOC'] if 'DSTAT_GPFS_VIO' in os.environ: vio_wanted = os.environ['DSTAT_GPFS_VIO'] if 'DSTAT_GPFS_VFLUSH' in os.environ: vflush_wanted = os.environ['DSTAT_GPFS_VFLUSH'] if 'DSTAT_GPFS_LROC' in os.environ: lroc_wanted = os.environ['DSTAT_GPFS_LROC'] self.debug = 0 vars1, nick1, keymap1 = self.make_keymap(vfs_keys, vfs_wanted, 'gpfs-vfs-') vars2, nick2, keymap2 = self.make_keymap(ioc_keys, ioc_wanted, 'gpfs-io-') vars3, nick3, keymap3 = self.make_keymap(vio_keys, vio_wanted, 'gpfs-vio-') vars4, nick4, keymap4 = self.make_keymap(vflush_keys, vflush_wanted, 'gpfs-vflush-') vars5, nick5, keymap5 = self.make_keymap(lroc_keys, lroc_wanted, 'gpfs-lroc-') if 'DSTAT_GPFS_LIST' in os.environ or self.debug: self.show_keymap('vfs_s', 'DSTAT_GPFS_VFS', vfs_keys, vfs_wanted, vars1, keymap1, 'gpfs-vfs-') self.show_keymap('ioc_s', 'DSTAT_GPFS_IOC', ioc_keys, ioc_wanted, vars2, keymap2, 'gpfs-io-') self.show_keymap('vio_s', 'DSTAT_GPFS_VIO', vio_keys, vio_wanted, vars3, keymap3, 'gpfs-vio-') self.show_keymap('vflush_stat', 'DSTAT_GPFS_VFLUSH', vflush_keys, vflush_wanted, vars4, keymap4, 'gpfs-vflush-') self.show_keymap('lroc_s', 'DSTAT_GPFS_LROC', lroc_keys, lroc_wanted, vars5, keymap5, 
'gpfs-lroc-') print self.vars = vars1 + vars2 + vars3 + vars4 + vars5 self.varsrate = vars1 + vars2 + vars3 + vars5 self.varsconst = vars4 self.nick = nick1 + nick2 + nick3 + nick4 + nick5 self.vfs_keymap = keymap1 self.ioc_keymap = keymap2 self.vio_keymap = keymap3 self.vflush_keymap = keymap4 self.lroc_keymap = keymap5 names = [] self.addtitle(names, 'gpfs vfs ops', len(vars1)) self.addtitle(names, 'gpfs disk i/o', len(vars2)) self.addtitle(names, 'gpfs vio', len(vars3)) self.addtitle(names, 'gpfs vflush', len(vars4)) self.addtitle(names, 'gpfs lroc', len(vars5)) self.name = '#'.join(names) self.type = 'd' self.width = 5 self.scale = 1000 def make_keymap(self, keys, wanted, prefix): '''Parse the list of counter values to be displayd "keys" is the list of all available counters "wanted" is a string of the form "name1 = key1 + key2 + ..., name2 = key3 + key4 ..." Returns a list of all names found, e.g. ['name1', 'name2', ...], and a dictionary that maps counters to names, e.g., { 'key1': 'name1', 'key2': 'name1', 'key3': 'name2', ... }, ''' vars = [] nick = [] kmap = {} ## print re.split(r'\s*,\s*', wanted.strip()) for n in re.split(r'\s*,\s*', wanted.strip()): l = re.split(r'\s*=\s*', n, 2) if len(l) == 2: v = l[0] kl = re.split(r'\s*\+\s*', l[1]) elif l[0]: v = l[0].strip('*') kl = l else: continue nick.append(v[0:5]) v = prefix + v.replace('/', '-') vars.append(v) for s in kl: for k in keys: if fnmatch.fnmatch(k.strip('_'), s) and k not in kmap: kmap[k] = v return vars, nick, kmap def show_keymap(self, label, envname, keys, wanted, vars, kmap, prefix): 'show available counter names and current counter set definition' linewd = 100 print '\nAvailable counters for "%s":' % label mlen = max([len(k.strip('_')) for k in keys]) ncols = linewd // (mlen + 1) nrows = (len(keys) + ncols - 1) // ncols for r in range(nrows): print ' ', for c in range(ncols): i = c *nrows + r if not i < len(keys): break print keys[i].strip('_').ljust(mlen), print print '\nCurrent counter set selection:' print "\n%s='%s'\n" % (envname, re.sub(r'\s+', '', wanted).strip().replace(',', ', ')) if not vars: return mlen = 5 for v in vars: if v.startswith(prefix): s = v[len(prefix):] else: s = v n = ' %s = ' % s[0:mlen].rjust(mlen) kl = [ k.strip('_') for k in keys if kmap.get(k) == v ] i = 0 while i < len(kl): slen = len(n) + 3 + len(kl[i]) j = i + 1 while j < len(kl) and slen + 3 + len(kl[j]) < linewd: slen += 3 + len(kl[j]) j += 1 print n + ' + '.join(kl[i:j]) i = j n = ' %s + ' % ''.rjust(mlen) def addtitle(self, names, name, ncols): 'pad title given by "name" with minus signs to span "ncols" columns' if ncols == 1: names.append(name.split()[-1].center(6*ncols - 1)) elif ncols > 1: names.append(name.center(6*ncols - 1)) def check(self): 'start mmpmon command' if os.access('/usr/lpp/mmfs/bin/mmpmon', os.X_OK): try: self.stdin, self.stdout, self.stderr = dpopen('/usr/lpp/mmfs/bin/mmpmon -p -s') self.stdin.write('reset\n') readpipe(self.stdout) except IOError: raise Exception, 'Cannot interface with gpfs mmpmon binary' return True raise Exception, 'Needs GPFS mmpmon binary' def extract_vfs(self): 'collect "vfs_s" counter values' self.stdin.write('vfs_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, len(l), 3): try: self.set2[self.vfs_keymap[l[i]]] += long(l[i+1]) except KeyError: pass def extract_ioc(self): 'collect "ioc_s" counter values' self.stdin.write('ioc_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, 
len(l), 3): try: self.set2[self.ioc_keymap[l[i]+'rd_']] += long(l[i+1]) except KeyError: pass try: self.set2[self.ioc_keymap[l[i]+'wr_']] += long(l[i+2]) except KeyError: pass def extract_vio(self): 'collect "vio_s" counter values' self.stdin.write('vio_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(19, len(l), 2): try: if l[i] in self.vio_keymap: self.set2[self.vio_keymap[l[i]]] += long(l[i+1]) except KeyError: pass def extract_vflush(self): 'collect "vflush_stat" counter values' self.stdin.write('vflush_stat\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, len(l), 2): try: if l[i] in self.vflush_keymap: self.set2[self.vflush_keymap[l[i]]] += long(l[i+1]) except KeyError: pass def extract_lroc(self): 'collect "lroc_s" counter values' self.stdin.write('lroc_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, len(l), 2): try: if l[i] in self.lroc_keymap: self.set2[self.lroc_keymap[l[i]]] += long(l[i+1]) except KeyError: pass def extract(self): try: for name in self.vars: self.set2[name] = 0 self.extract_ioc() self.extract_vfs() self.extract_vio() self.extract_vflush() self.extract_lroc() for name in self.varsrate: self.val[name] = (self.set2[name] - self.set1[name]) * 1.0 / elapsed for name in self.varsconst: self.val[name] = self.set2[name] except IOError, e: for name in self.vars: self.val[name] = -1 ## print 'dstat_gpfs: lost pipe to mmpmon,', e except Exception, e: for name in self.vars: self.val[name] = -1 print 'dstat_gpfs: exception', e if self.debug >= 0: self.debug -= 1 if step == op.delay: self.set1.update(self.set2) From ewahl at osc.edu Thu Sep 4 15:13:48 2014 From: ewahl at osc.edu (Ed Wahl) Date: Thu, 4 Sep 2014 14:13:48 +0000 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> References: <54074F90.7000303@ebi.ac.uk>, <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> Message-ID: Another known issue with slow "ls" can be the annoyance that is 'sssd' under newer OSs (rhel 6) and properly configuring this for remote auth. I know on my nsd's I never did and the first ls in a directory where the cache is expired takes forever to make all the remote LDAP calls to get the UID info. bleh. Ed ________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of service at metamodul.com [service at metamodul.com] Sent: Thursday, September 04, 2014 6:05 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs performance monitoring > , any "ls" could take ages. Check if you large directories either with many files or simply large. Verify if you have NFS exported GPFS. Verify that your cache settings on the clients are large enough ( maxStatCache , maxFilesToCache , sharedMemLimit ) Verify that you have dedicated metadata luns ( metadataOnly ) Reference: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters Note: If possible monitor your metadata luns on the storage directly. hth Hajo -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sdinardo at ebi.ac.uk Thu Sep 4 15:18:02 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 15:18:02 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540873AA.5070401@ed.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> <5408722E.6060309@ebi.ac.uk> <540873AA.5070401@ed.ac.uk> Message-ID: <5408749A.9080306@ebi.ac.uk> On 04/09/14 15:14, Orlando Richards wrote: > > > On 04/09/14 15:07, Salvatore Di Nardo wrote: >> >> On 04/09/14 14:54, Orlando Richards wrote: >>> >>> >>> On 04/09/14 14:32, Salvatore Di Nardo wrote: >>>> Sorry to bother you again but dstat have some issues with the plugin: >>>> >>>> [root at gss01a util]# dstat --gpfs >>>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is >>>> deprecated. Use the subprocess module. >>>> pipes[cmd] = os.popen3(cmd, 't', 0) >>>> Module dstat_gpfs failed to load. (global name 'select' is not >>>> defined) >>>> None of the stats you selected are available. >>>> >>>> I found this solution , but involve dstat recompile.... >>>> >>>> https://github.com/dagwieers/dstat/issues/44 >>>> >>>> Are you aware about any easier solution (we use RHEL6.3) ? >>>> >>> >>> This worked for me the other day on a dev box I was poking at: >>> >>> # rm /usr/share/dstat/dstat_gpfsops* >>> >>> # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 >>> /usr/share/dstat/dstat_gpfsops.py >>> >>> # dstat --gpfsops >>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use >>> the subprocess module. >>> pipes[cmd] = os.popen3(cmd, 't', 0) >>> ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- >>> >>> >>> cr del op/cl rd wr trunc fsync looku gattr sattr other >>> mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w >>> 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 0 0 0 0 0 0 0 0 0 0 >>> >>> ... >>> >> >> NICE!! The only problem is that the box seems lacking those python >> scripts: >> >> ls /usr/lpp/mmfs/samples/util/ >> makefile README tsbackup tsbackup.C tsbackup.h tsfindinode >> tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c >> tslistall tsreaddir tsreaddir.c tstimes tstimes.c >> > > It came from the gpfs.base rpm: > > # rpm -qf /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 > gpfs.base-3.5.0-13.x86_64 > > >> Do you mind sending me those py files? They should be 3 as i see e gpfs >> options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) >> > > Only the gpfsops.py is included in the bundle - one for dstat 0.6 and > one for dstat 0.7. > > > I've attached it to this mail as well (it seems to be GPL'd). > Thanks. From J.R.Jones at soton.ac.uk Thu Sep 4 16:15:48 2014 From: J.R.Jones at soton.ac.uk (Jones J.R.) Date: Thu, 4 Sep 2014 15:15:48 +0000 Subject: [gpfsug-discuss] Building the portability layer for Xeon Phi Message-ID: <1409843748.7733.31.camel@uos-204812.clients.soton.ac.uk> Hi folks Has anyone managed to successfully build the portability layer for Xeon Phi? At the moment we are having to export the GPFS mounts from the host machine over NFS, which is proving rather unreliable. 
Jess From oehmes at us.ibm.com Fri Sep 5 01:48:40 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Thu, 4 Sep 2014 17:48:40 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54084258.90508@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> Message-ID: ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM: > From: Salvatore Di Nardo > To: gpfsug main discussion list > Date: 09/04/2014 03:44 AM > Subject: Re: [gpfsug-discuss] gpfs performance monitoring > Sent by: gpfsug-discuss-bounces at gpfsug.org > > On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just > show the value since start (or last reset) time. > OK, you confirmed my toughts, thatks > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. > what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. > No.. what I'm looking its exactly how the disks are busy to keep the > requests. Obviously i'm not looking just that, but I feel the needs > to monitor also those things. Ill explain you why. 
> > It happens when our storage is quite busy ( 180Gb/s of read/write ) > that the FS start to be slowin normal cd or ls requests. This might > be normal, but in those situation i want to know where the > bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing > where the bottlenek is might help me to understand if we can tweak > the system a bit more. if cd or ls is very slow in GPFS in the majority of the cases it has nothing to do with NSD Server bottlenecks, only indirect. the main reason ls is slow in the field is you have some very powerful nodes that all do buffered writes into the same directory into 1 or multiple files while you do the ls on a different node. what happens now is that the ls you did run most likely is a alias for ls -l or something even more complex with color display, etc, but the point is it most likely returns file size. GPFS doesn't lie about the filesize, we only return accurate stat informations and while this is arguable, its a fact today. so what happens is that the stat on each file triggers a token revoke on the node that currently writing to the file you do stat on, lets say it has 1 gb of dirty data in its memory for this file (as its writes data buffered) this 1 GB of data now gets written to the NSD server, the client updates the inode info and returns the correct size. lets say you have very fast network and you have a fast storage device like GSS (which i see you have) it will be able to do this in a few 100 ms, but the problem is this now happens serialized for each single file in this directory that people write into as for each we need to get the exact stat info to satisfy your ls -l request. this is what takes so long, not the fact that the storage device might be slow or to much metadata activity is going on , this is token , means network traffic and obviously latency dependent. the best way to see this is to look at waiters on the client where you run the ls and see what they are waiting for. there are various ways to tune this to get better 'felt' ls responses but its not completely going away if all you try to with ls is if there is a file in the directory run unalias ls and check if ls after that runs fast as it shouldn't do the -l under the cover anymore. > > If its the CPU on the servers then there is no much to do beside > replacing or add more servers.If its not the CPU, maybe more memory > would help? Maybe its just the network that filled up? so i can add > more links > > Or if we reached the point there the bottleneck its the spindles, > then there is no much point o look somethere else, we just reached > the hardware limit.. > > Sometimes, it also happens that there is very low IO (10Gb/s ), > almost no cpu usage on the servers but huge slownes ( ls can take 10 > seconds). Why that happens? There is not much data ops , but we > think there is a huge ammount of metadata ops. So what i want to > know is if the metadata vdisks are busy or not. If this is our > problem, could some SSD disks dedicated to metadata help? the answer if ssd's would help or not are hard to say without knowing the root case and as i tried to explain above the most likely case is token revoke, not disk i/o. obviously as more busy your disks are as longer the token revoke will take. > > > In particular im, a bit puzzled with the design of our GSS storage. > Each recovery groups have 3 declustered arrays, and each declustered > aray have 1 data and 1 metadata vdisk, but in the end both metadata > and data vdisks use the same spindles. 
The problem that, its that I > dont understand if we have a metadata bottleneck there. Maybe some > SSD disks in a dedicated declustered array would perform much > better, but this is just theory. I really would like to be able to > monitor IO activities on the metadata vdisks. the short answer is we WANT the metadata disks to be with the data disks on the same spindles. compared to other storage systems, GSS is capable to handle different raid codes for different virtual disks on the same physical disks, this way we create raid1'ish 'LUNS' for metadata and raid6'is 'LUNS' for data so the small i/o penalty for a metadata is very small compared to a read/modify/write on the data disks. > > > > > so the Layer you care about is the NSD Server layer, which sits on > top of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be > to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats > counts again. > > Counters its not a problem. I can collect them and create some > graphs in a monitoring tool. I will check that. if you (let) upgrade your system to GSS 2.0 you get a graphical monitoring as part of it. if you want i can send you some direct email outside the group with additional informations on that. > > the alternative option is to use mmdiag --iohist. this shows you a > history of the last X numbers of io operations on either the client > or the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms > qTime ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 > 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 > 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 > 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 > 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 
R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 > 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 > 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 > 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 > 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. > > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > > mmdiag --iohist its another think i looked at it, but i could not > find good explanation for all the "buf type" ( third column ) > allocSeg > data > iallocSeg > indBlock > inode > LLIndBlock > logData > logDesc > logWrap > metadata > vdiskAULog > vdiskBuf > vdiskFWLog > vdiskMDLog > vdiskMeta > vdiskRGDesc > If i want to monifor metadata operation whan should i look at? just inodes =inodes , *alloc* = file or data allocation blocks , *ind* = indirect blocks (for very large files) and metadata , everyhing else is data or internal i/o's > the metadata flag or also inode? this command takes also long to > run, especially if i run it a second time it hangs for a lot before > to rerun again, so i'm not sure that run it every 30secs or minute > its viable, but i will look also into that. THere is any > documentation that descibes clearly the whole output? what i found > its quite generic and don't go into details... 
the reason it takes so long is because it collects 10's of thousands of i/os in a table and to not slow down the system when we dump the data we copy it to a separate buffer so we don't need locks :-) you can adjust the number of entries you want to collect by adjusting the ioHistorySize config parameter > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and > the physical disk, so you can only reliably look at this from the > client, the iohist output only shows you the Server disk i/o > processing time, but that can be a fraction of the overall time (in > other cases this obviously can also be the dominant part depending > on your workload). > > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. > > We already do that, but as I said, I want to check specifically how > gss servers are keeping the requests to identify or exlude server > side bottlenecks. > > > Thanks for your help, you gave me definitely few things where to look at. > > Salvatore > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at us.ibm.com Fri Sep 5 01:53:17 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Thu, 4 Sep 2014 17:53:17 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <5408722E.6060309@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> <5408722E.6060309@ebi.ac.uk> Message-ID: if you don't have the files you need to update to a newer version of the GPFS client software on the node. they started shipping with 3.5.0.13 even you get the files you still wouldn't see many values as they never got exposed before. some more details are in a presentation i gave earlier this year which is archived in the list or here --> http://www.gpfsug.org/wp-content/uploads/2014/05/UG10_GPFS_Performance_Session_v10.pdf Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Salvatore Di Nardo To: gpfsug-discuss at gpfsug.org Date: 09/04/2014 07:08 AM Subject: Re: [gpfsug-discuss] gpfs performance monitoring Sent by: gpfsug-discuss-bounces at gpfsug.org On 04/09/14 14:54, Orlando Richards wrote: On 04/09/14 14:32, Salvatore Di Nardo wrote: Sorry to bother you again but dstat have some issues with the plugin: [root at gss01a util]# dstat --gpfs /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) Module dstat_gpfs failed to load. (global name 'select' is not defined) None of the stats you selected are available. 
I found this solution , but involve dstat recompile.... https://github.com/dagwieers/dstat/issues/44 Are you aware about any easier solution (we use RHEL6.3) ? This worked for me the other day on a dev box I was poking at: # rm /usr/share/dstat/dstat_gpfsops* # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 /usr/share/dstat/dstat_gpfsops.py # dstat --gpfsops /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- cr del op/cl rd wr trunc fsync looku gattr sattr other mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... NICE!! The only problem is that the box seems lacking those python scripts: ls /usr/lpp/mmfs/samples/util/ makefile README tsbackup tsbackup.C tsbackup.h tsfindinode tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c tslistall tsreaddir tsreaddir.c tstimes tstimes.c Do you mind sending me those py files? They should be 3 as i see e gpfs options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) Regards, Salvatore _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Fri Sep 5 10:29:27 2014 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Fri, 05 Sep 2014 10:29:27 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54084258.90508@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> Message-ID: <1409909367.30257.151.camel@buzzard.phy.strath.ac.uk> On Thu, 2014-09-04 at 11:43 +0100, Salvatore Di Nardo wrote: [SNIP] > > Sometimes, it also happens that there is very low IO (10Gb/s ), almost > no cpu usage on the servers but huge slownes ( ls can take 10 > seconds). Why that happens? There is not much data ops , but we think > there is a huge ammount of metadata ops. So what i want to know is if > the metadata vdisks are busy or not. If this is our problem, could > some SSD disks dedicated to metadata help? > This is almost always because you are using an external LDAP/NIS server for GECOS information and the values that you need are not cached for whatever reason and you are having to look them up again. Note that the standard aliasing for RHEL based distros of ls also causes it to do a stat on every file for the colouring etc. Also be aware that if you are trying to fill out your cd with TAB auto-completion you will run into similar issues. That is had you typed the path for the cd out in full you would get in instantly, doing a couple of letters and hitting cd it could take a while. You can test this on a RHEL based distro by doing "/bin/ls -n" The idea being to avoid any aliasing and not look up GECOS data and just report the raw numerical stuff. What I would suggest is that you set the cache time on UID/GID lookups for positive lookups to a long time, in general as long as possible because the values should almost never change. Even for a positive look up of a group membership I would have that cached for a couple of hours. For negative lookups something like five or 10 minutes is a good starting point. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. 
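A quick way to test this theory and to lengthen nscd's positive caching, sketched under the assumption of a stock RHEL6 nscd setup (the TTL values and the directory path are only illustrative):

  # compare a name-resolving listing with a purely numeric one
  time /bin/ls -l /gpfs/some/dir > /dev/null
  time /bin/ls -ln /gpfs/some/dir > /dev/null   # -n skips the UID/GID-to-name lookups

  # show nscd cache statistics (hit rates for the passwd/group caches)
  nscd -g

  # raise the positive TTLs in /etc/nscd.conf, then restart nscd:
  #   positive-time-to-live  passwd  3600
  #   positive-time-to-live  group   3600
  service nscd restart

If the numeric listing is consistently fast while the name-resolving one is only slow on a cold cache, the delay is in the directory-service lookups rather than in GPFS itself.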
From sdinardo at ebi.ac.uk Fri Sep 5 11:56:37 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Fri, 05 Sep 2014 11:56:37 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> Message-ID: <540996E5.5000502@ebi.ac.uk>

A little clarification: our ls is a plain ls, there is no alias. All of those things are already set up properly; EBI has been running high-performance computing farms for many years, so they were fixed a long time ago. We have very little experience with GPFS, but good knowledge of LSF farms, and we run multiple NFS storage systems (several petabytes in size). As for NIS, all clients run nscd, which caches that information to avoid this type of slowness; in fact, when ls is slow, ls -n is slow too. Besides that, a plain "cd" sometimes hangs as well, so it has nothing to do with fetching attributes.

To clarify a bit more: GSS usually works fine. We have users whose farm jobs push 180Gb/s of reads (reading and writing files of 100GB size), and GPFS works very well there, where other systems had performance problems accessing portions of such huge files. Sadly, other users run jobs that do a huge amount of metadata operations: tons of ls in directories with many files, or creating a silly number of temporary files just to synchronize jobs between farm nodes, or to hold temporary data for a few milliseconds before deleting it again. Imagine constantly creating thousands of files just to write a few bytes, then deleting them a few milliseconds later... When that happens we see 10-15Gb/s of throughput and low CPU usage on the servers (80% idle), yet any cd or ls or whatever takes a few seconds.

So my question is whether the bottleneck could be the spindles, or whether the clients could be tuned a bit more. I read your PDF and all the parameters already seem well configured except "maxFilesToCache", but I'm not sure how we should set a few of those parameters on the clients. For example, I cannot imagine a client needing a 38GB pagepool. So what is the correct *pagepool* on a client? And what about these others? *maxFilesToCache* *maxBufferDescs* *worker1Threads* *worker3Threads* Right now all the clients have a 1 GB pagepool. In theory we can afford more (I think we can easily go up to 8GB) as they have plenty of available memory. If it would help we can do that, but do the clients really need more than 1GB? They are just clients after all, so their memory should in theory be used for jobs, not just for "caching".

Last question about "maxFilesToCache": you say it must be large on small clusters but small on large clusters. What do you consider 6 servers and almost 700 clients?
on clienst we have: maxFilesToCache 4000 on servers we have maxFilesToCache 12288 Regards, Salvatore On 05/09/14 01:48, Sven Oehme wrote: > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM: > > > From: Salvatore Di Nardo > > To: gpfsug main discussion list > > Date: 09/04/2014 03:44 AM > > Subject: Re: [gpfsug-discuss] gpfs performance monitoring > > Sent by: gpfsug-discuss-bounces at gpfsug.org > > > > On 04/09/14 01:50, Sven Oehme wrote: > > > Hello everybody, > > > > Hi > > > > > here i come here again, this time to ask some hint about how to > > monitor GPFS. > > > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > > that they return number based only on the request done in the > > > current host, so i have to run them on all the clients ( over 600 > > > nodes) so its quite unpractical. Instead i would like to know from > > > the servers whats going on, and i came across the vio_s statistics > > > wich are less documented and i dont know exacly what they mean. > > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > > runs VIO_S. > > > > > > My problems with the output of this command: > > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > > timestamp: 1409763206/477366 > > > recovery group: * > > > declustered array: * > > > vdisk: * > > > client reads: 2584229 > > > client short writes: 55299693 > > > client medium writes: 190071 > > > client promoted full track writes: 465145 > > > client full track writes: 9249 > > > flushed update writes: 4187708 > > > flushed promoted full track writes: 123 > > > migrate operations: 114 > > > scrub operations: 450590 > > > log writes: 28509602 > > > > > > it sais "VIOPS per second", but they seem to me just counters as > > > every time i re-run the command, the numbers increase by a bit.. > > > Can anyone confirm if those numbers are counter or if they are > OPS/sec. > > > > the numbers are accumulative so everytime you run them they just > > show the value since start (or last reset) time. > > OK, you confirmed my toughts, thatks > > > > > > > > > On a closer eye about i dont understand what most of thosevalues > > > mean. For example, what exacly are "flushed promoted full track > write" ?? > > > I tried to find a documentation about this output , but could not > > > find any. can anyone point me a link where output of vio_s is > explained? > > > > > > Another thing i dont understand about those numbers is if they are > > > just operations, or the number of blocks that was read/write/etc . > > > > its just operations and if i would explain what the numbers mean i > > might confuse you even more because this is not what you are really > > looking for. > > what you are looking for is what the client io's look like on the > > Server side, while the VIO layer is the Server side to the disks, so > > one lever lower than what you are looking for from what i could read > > out of the description above. > > No.. what I'm looking its exactly how the disks are busy to keep the > > requests. Obviously i'm not looking just that, but I feel the needs > > to monitor also those things. Ill explain you why. 
> > > > It happens when our storage is quite busy ( 180Gb/s of read/write ) > > that the FS start to be slowin normal cd or ls requests. This might > > be normal, but in those situation i want to know where the > > bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing > > where the bottlenek is might help me to understand if we can tweak > > the system a bit more. > > if cd or ls is very slow in GPFS in the majority of the cases it has > nothing to do with NSD Server bottlenecks, only indirect. > the main reason ls is slow in the field is you have some very powerful > nodes that all do buffered writes into the same directory into 1 or > multiple files while you do the ls on a different node. what happens > now is that the ls you did run most likely is a alias for ls -l or > something even more complex with color display, etc, but the point is > it most likely returns file size. GPFS doesn't lie about the filesize, > we only return accurate stat informations and while this is arguable, > its a fact today. > so what happens is that the stat on each file triggers a token revoke > on the node that currently writing to the file you do stat on, lets > say it has 1 gb of dirty data in its memory for this file (as its > writes data buffered) this 1 GB of data now gets written to the NSD > server, the client updates the inode info and returns the correct size. > lets say you have very fast network and you have a fast storage device > like GSS (which i see you have) it will be able to do this in a few > 100 ms, but the problem is this now happens serialized for each single > file in this directory that people write into as for each we need to > get the exact stat info to satisfy your ls -l request. > this is what takes so long, not the fact that the storage device might > be slow or to much metadata activity is going on , this is token , > means network traffic and obviously latency dependent. > > the best way to see this is to look at waiters on the client where you > run the ls and see what they are waiting for. > > there are various ways to tune this to get better 'felt' ls responses > but its not completely going away > if all you try to with ls is if there is a file in the directory run > unalias ls and check if ls after that runs fast as it shouldn't do the > -l under the cover anymore. > > > > > If its the CPU on the servers then there is no much to do beside > > replacing or add more servers.If its not the CPU, maybe more memory > > would help? Maybe its just the network that filled up? so i can add > > more links > > > > Or if we reached the point there the bottleneck its the spindles, > > then there is no much point o look somethere else, we just reached > > the hardware limit.. > > > > Sometimes, it also happens that there is very low IO (10Gb/s ), > > almost no cpu usage on the servers but huge slownes ( ls can take 10 > > seconds). Why that happens? There is not much data ops , but we > > think there is a huge ammount of metadata ops. So what i want to > > know is if the metadata vdisks are busy or not. If this is our > > problem, could some SSD disks dedicated to metadata help? > > the answer if ssd's would help or not are hard to say without knowing > the root case and as i tried to explain above the most likely case is > token revoke, not disk i/o. obviously as more busy your disks are as > longer the token revoke will take. > > > > > > > In particular im, a bit puzzled with the design of our GSS storage. 
> > Each recovery groups have 3 declustered arrays, and each declustered > > aray have 1 data and 1 metadata vdisk, but in the end both metadata > > and data vdisks use the same spindles. The problem that, its that I > > dont understand if we have a metadata bottleneck there. Maybe some > > SSD disks in a dedicated declustered array would perform much > > better, but this is just theory. I really would like to be able to > > monitor IO activities on the metadata vdisks. > > the short answer is we WANT the metadata disks to be with the data > disks on the same spindles. compared to other storage systems, GSS is > capable to handle different raid codes for different virtual disks on > the same physical disks, this way we create raid1'ish 'LUNS' for > metadata and raid6'is 'LUNS' for data so the small i/o penalty for a > metadata is very small compared to a read/modify/write on the data disks. > > > > > > > > > > > > so the Layer you care about is the NSD Server layer, which sits on > > top of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > > > I'm asking that because if they are just ops, i don't know how much > > > they could be usefull. For example one write operation could eman > > > write 1 block or write a file of 100GB. If those are oprations, > > > there is a way to have the oupunt in bytes or blocks? > > > > there are multiple ways to get infos on the NSD layer, one would be > > to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats > > counts again. > > > > Counters its not a problem. I can collect them and create some > > graphs in a monitoring tool. I will check that. > > if you (let) upgrade your system to GSS 2.0 you get a graphical > monitoring as part of it. if you want i can send you some direct email > outside the group with additional informations on that. > > > > > the alternative option is to use mmdiag --iohist. 
this shows you a > > history of the last X numbers of io operations on either the client > > or the server side like on a client : > > > > # mmdiag --iohist > > > > === mmdiag: iohist === > > > > I/O history: > > > > I/O start time RW Buf type disk:sectorNum nSec time ms > > qTime ms RpcTimes ms Type Device/NSD ID NSD server > > --------------- -- ----------- ----------------- ----- ------- > > -------- ----------------- ---- ------------------ --------------- > > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:22.182723 R inode 1:1071252480 8 6.970 > > 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.668262 R inode 2:1081373696 8 14.117 > > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.692019 R inode 2:1064356608 8 14.899 > > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:53.707100 R inode 2:1077830152 8 16.499 > > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.728082 R inode 2:1081918976 8 7.760 > > 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.877416 R metadata 2:678978560 16 13.343 > > 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.906556 R inode 2:1083476520 8 11.723 > > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.926592 R inode 1:1076503480 8 8.087 > > 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.941441 R inode 2:1069885984 8 11.686 > > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.953294 R inode 2:1083476936 8 8.951 > > 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.965475 R inode 1:1076503504 8 0.477 > > 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.965755 R inode 2:1083476488 8 0.410 > > 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.965787 R inode 2:1083476512 8 0.439 > > 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > > > you basically see if its a inode , data block , what size it has (in > > sectors) , which nsd server you did send this request to, etc. 
> > > > on the Server side you see the type , which physical disk it goes to > > and also what size of disk i/o it causes like : > > > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > > 0.000 0.000 0.000 pd sdis > > 14:26:50.137102 R inode 19:3003969520 64 9.004 > > 0.000 0.000 0.000 pd sdad > > 14:26:50.136116 R inode 55:3591710992 64 11.057 > > 0.000 0.000 0.000 pd sdoh > > 14:26:50.141510 R inode 21:3066810504 64 5.909 > > 0.000 0.000 0.000 pd sdaf > > 14:26:50.130529 R inode 89:2962370072 64 17.437 > > 0.000 0.000 0.000 pd sddi > > 14:26:50.131063 R inode 78:1889457000 64 17.062 > > 0.000 0.000 0.000 pd sdsj > > 14:26:50.143403 R inode 36:3323035688 64 4.807 > > 0.000 0.000 0.000 pd sdmw > > 14:26:50.131044 R inode 37:2513579736 128 17.181 > > 0.000 0.000 0.000 pd sddv > > 14:26:50.138181 R inode 72:3868810400 64 10.951 > > 0.000 0.000 0.000 pd sdbz > > 14:26:50.138188 R inode 131:2443484784 128 11.792 > > 0.000 0.000 0.000 pd sdug > > 14:26:50.138003 R inode 102:3696843872 64 11.994 > > 0.000 0.000 0.000 pd sdgp > > 14:26:50.137099 R inode 145:3370922504 64 13.225 > > 0.000 0.000 0.000 pd sdmi > > 14:26:50.141576 R inode 62:2668579904 64 9.313 > > 0.000 0.000 0.000 pd sdou > > 14:26:50.134689 R inode 159:2786164648 64 16.577 > > 0.000 0.000 0.000 pd sdpq > > 14:26:50.145034 R inode 34:2097217320 64 7.409 > > 0.000 0.000 0.000 pd sdmt > > 14:26:50.138140 R inode 139:2831038792 64 14.898 > > 0.000 0.000 0.000 pd sdlw > > 14:26:50.130954 R inode 164:282120312 64 22.274 > > 0.000 0.000 0.000 pd sdzd > > 14:26:50.137038 R inode 41:3421909608 64 16.314 > > 0.000 0.000 0.000 pd sdef > > 14:26:50.137606 R inode 104:1870962416 64 16.644 > > 0.000 0.000 0.000 pd sdgx > > 14:26:50.141306 R inode 65:2276184264 64 16.593 > > 0.000 0.000 0.000 pd sdrk > > > > > > > mmdiag --iohist its another think i looked at it, but i could not > > find good explanation for all the "buf type" ( third column ) > > > allocSeg > > data > > iallocSeg > > indBlock > > inode > > LLIndBlock > > logData > > logDesc > > logWrap > > metadata > > vdiskAULog > > vdiskBuf > > vdiskFWLog > > vdiskMDLog > > vdiskMeta > > vdiskRGDesc > > If i want to monifor metadata operation whan should i look at? just > > inodes =inodes , *alloc* = file or data allocation blocks , *ind* = > indirect blocks (for very large files) and metadata , everyhing else > is data or internal i/o's > > > the metadata flag or also inode? this command takes also long to > > run, especially if i run it a second time it hangs for a lot before > > to rerun again, so i'm not sure that run it every 30secs or minute > > its viable, but i will look also into that. THere is any > > documentation that descibes clearly the whole output? what i found > > its quite generic and don't go into details... > > the reason it takes so long is because it collects 10's of thousands > of i/os in a table and to not slow down the system when we dump the > data we copy it to a separate buffer so we don't need locks :-) > you can adjust the number of entries you want to collect by adjusting > the ioHistorySize config parameter > > > > > > > > Last but not least.. and this is what i really would like to > > > accomplish, i would to be able to monitor the latency of metadata > > operations. 
> > > > you can't do this on the server side as you don't know how much time > > you spend on the client , network or anything between the app and > > the physical disk, so you can only reliably look at this from the > > client, the iohist output only shows you the Server disk i/o > > processing time, but that can be a fraction of the overall time (in > > other cases this obviously can also be the dominant part depending > > on your workload). > > > > the easiest way on the client is to run > > > > mmfsadm vfsstats enable > > from now on vfs stats are collected until you restart GPFS. > > > > then run : > > > > vfs statistics currently enabled > > started at: Fri Aug 29 13:15:05.380 2014 > > duration: 448446.970 sec > > > > name calls time per call total time > > -------------------- -------- -------------- -------------- > > statfs 9 0.000002 0.000021 > > startIO 246191176 0.005853 1441049.976740 > > > > to dump what ever you collected so far on this node. > > > > > We already do that, but as I said, I want to check specifically how > > gss servers are keeping the requests to identify or exlude server > > side bottlenecks. > > > > > > Thanks for your help, you gave me definitely few things where to > look at. > > > > Salvatore > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at gpfsug.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Fri Sep 5 22:17:47 2014 From: chekh at stanford.edu (Alex Chekholko) Date: Fri, 05 Sep 2014 14:17:47 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540996E5.5000502@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> <540996E5.5000502@ebi.ac.uk> Message-ID: <540A287B.1050202@stanford.edu> On 9/5/14, 3:56 AM, Salvatore Di Nardo wrote: > Little clarification: > Our ls its plain ls, there is no alias. ... > Last question about "maxFIlesToCache" you say that must be large on > small cluster but small on large clusters. What do you consider 6 > servers and almost 700 clients? > > on clienst we have: > maxFilesToCache 4000 > > on servers we have > maxFilesToCache 12288 > > One thing to do is to try your 'ls', see it is slow, then immediately run it again. If it is fast the second and consecutive times, it's because now the stat info is coming out of local cache. e.g. /usr/bin/time ls /path/to/some/dir && /usr/bin/time ls /path/to/some/dir The second time is likely to be almost immediate. So long as your local cache is big enough. I see on one of our older clusters we have: tokenMemLimit 2G maxFilesToCache 40000 maxStatCache 80000 You can also interrogate the local cache to see how full it is. Of course, if many nodes are writing to same dirs, then the cache will need to be invalidated often which causes some overhead. Big local cache is good if clients are usually working in different directories. 
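A quick way to reproduce the cold/warm comparison described above, and to see what the node is actually waiting on while the slow pass runs, is sketched below; the directory path is only a placeholder, and mmdiag --waiters simply snapshots the currently outstanding requests on this node.

  d=/gpfs/some/busy/dir                          # placeholder - pick a directory that lists slowly
  /usr/bin/time ls -l "$d" > /dev/null &         # cold pass, run in the background
  sleep 1; /usr/lpp/mmfs/bin/mmdiag --waiters    # what this client is blocked on right now
  wait                                           # let the cold pass finish
  /usr/bin/time ls -l "$d" > /dev/null           # warm pass - near-instant if the local cache held everything

If the waiters point at token or RPC waits rather than disk I/O, the delay is likely in revokes and the network rather than the spindles, which matches the explanation given earlier in this thread.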
Regards, -- chekh at stanford.edu From oehmes at us.ibm.com Sat Sep 6 01:12:42 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Fri, 5 Sep 2014 17:12:42 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540996E5.5000502@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> <540996E5.5000502@ebi.ac.uk> Message-ID: on your GSS nodes you have tuning files we suggest customers to use for mixed workloads clients. the files in /usr/lpp/mmfs/samples/gss/ if you create a nodeclass for all your clients you can run /usr/lpp/mmfs/samples/gss/gssClientConfig.sh NODECLASS and it applies all the settings to them so they will be active on next restart of the gpfs daemon. this should be a very good starting point for your config. please try that and let me know if it doesn't. there are also several enhancements in GPFS 4.1 which reduce contention in multiple areas, which would help as well, if you have the choice to update the nodes. btw. the GSS 2.0 package will update your GSS nodes to 4.1 also Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Salvatore Di Nardo To: gpfsug main discussion list Date: 09/05/2014 03:57 AM Subject: Re: [gpfsug-discuss] gpfs performance monitoring Sent by: gpfsug-discuss-bounces at gpfsug.org Little clarification: Our ls its plain ls, there is no alias. Consider that all those things are already set up properly as EBI run hi computing farms from many years, so those things are already fixed loong time ago. We have very little experience with GPFS, but good knowledge with LSF farms and own multiple NFS stotages ( several petabyte sized). about NIS, all clients run NSCD that cashes all informations to avoid such tipe of slownes, in fact then ls isslow, also ls -n is slow. Beside that, also a "cd" sometimes hangs, so it have nothing to do with getting attributes. Just to clarify a bit more. Now GSS usually seems working fine, we have users that run jobs on the farms that pushes 180Gb/s read ( reading and writing files of 100GB size). GPFS works very well there, where other systems had performance problems accessing portion of data in so huge files. Sadly, on the other hand, other users run jobs that do suge ammount of metadata operations, like toons of ls in directory with many files, or creating a silly amount of temporary files just to synchronize the jobs between the farm nodes, or just to store temporary data for few milliseconds and them immediately delete those temporary files. Imagine to create constantly thousands files just to write few bytes and they delete them after few milliseconds... When those thing happens we see 10-15Gb/sec throughput, low CPU usage on the server ( 80% iddle), but any cd, or ls or wathever takes few seconds. So my question is, if the bottleneck could be the spindles, or if the clients could be tuned a bit more? I read your PDF and all the paramenters seems already well configured except "maxFilesToCache", but I'm not sure how we should configure few of those parameters on the clients. As an example I cannot immagine a client that require 38g pagepool size. so what's the correct pagepool on a client? what about those others? maxFilestoCache maxBufferdescs worker1threads worker3threads Right now all the clients have 1 GB pagepool size. 
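For what it's worth, the values a client is actually running with can be checked on the node itself; mmdiag --config prints the in-memory settings and mmlsconfig shows what is defined in the cluster configuration. The parameter list in the grep is only an example and the names are the ones used elsewhere in this thread.

  /usr/lpp/mmfs/bin/mmdiag --config | grep -Ei 'pagepool|maxFilesToCache|maxStatCache|worker1Threads'
  /usr/lpp/mmfs/bin/mmlsconfig pagepool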
In theory, we can afford to use more ( i thing we can easily go up to 8GB) as they have plenty or available memory. If this could help, we can do that, but the client really really need more than 1G? They are just clients after all, so the memory in theory should be used for jobs not just for "caching". Last question about "maxFIlesToCache" you say that must be large on small cluster but small on large clusters. What do you consider 6 servers and almost 700 clients? on clienst we have: maxFilesToCache 4000 on servers we have maxFilesToCache 12288 Regards, Salvatore On 05/09/14 01:48, Sven Oehme wrote: ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM: > From: Salvatore Di Nardo > To: gpfsug main discussion list > Date: 09/04/2014 03:44 AM > Subject: Re: [gpfsug-discuss] gpfs performance monitoring > Sent by: gpfsug-discuss-bounces at gpfsug.org > > On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just > show the value since start (or last reset) time. > OK, you confirmed my toughts, thatks > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. 
> what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. > No.. what I'm looking its exactly how the disks are busy to keep the > requests. Obviously i'm not looking just that, but I feel the needs > to monitor also those things. Ill explain you why. > > It happens when our storage is quite busy ( 180Gb/s of read/write ) > that the FS start to be slowin normal cd or ls requests. This might > be normal, but in those situation i want to know where the > bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing > where the bottlenek is might help me to understand if we can tweak > the system a bit more. if cd or ls is very slow in GPFS in the majority of the cases it has nothing to do with NSD Server bottlenecks, only indirect. the main reason ls is slow in the field is you have some very powerful nodes that all do buffered writes into the same directory into 1 or multiple files while you do the ls on a different node. what happens now is that the ls you did run most likely is a alias for ls -l or something even more complex with color display, etc, but the point is it most likely returns file size. GPFS doesn't lie about the filesize, we only return accurate stat informations and while this is arguable, its a fact today. so what happens is that the stat on each file triggers a token revoke on the node that currently writing to the file you do stat on, lets say it has 1 gb of dirty data in its memory for this file (as its writes data buffered) this 1 GB of data now gets written to the NSD server, the client updates the inode info and returns the correct size. lets say you have very fast network and you have a fast storage device like GSS (which i see you have) it will be able to do this in a few 100 ms, but the problem is this now happens serialized for each single file in this directory that people write into as for each we need to get the exact stat info to satisfy your ls -l request. this is what takes so long, not the fact that the storage device might be slow or to much metadata activity is going on , this is token , means network traffic and obviously latency dependent. the best way to see this is to look at waiters on the client where you run the ls and see what they are waiting for. there are various ways to tune this to get better 'felt' ls responses but its not completely going away if all you try to with ls is if there is a file in the directory run unalias ls and check if ls after that runs fast as it shouldn't do the -l under the cover anymore. > > If its the CPU on the servers then there is no much to do beside > replacing or add more servers.If its not the CPU, maybe more memory > would help? Maybe its just the network that filled up? so i can add > more links > > Or if we reached the point there the bottleneck its the spindles, > then there is no much point o look somethere else, we just reached > the hardware limit.. > > Sometimes, it also happens that there is very low IO (10Gb/s ), > almost no cpu usage on the servers but huge slownes ( ls can take 10 > seconds). Why that happens? There is not much data ops , but we > think there is a huge ammount of metadata ops. So what i want to > know is if the metadata vdisks are busy or not. If this is our > problem, could some SSD disks dedicated to metadata help? 
the answer if ssd's would help or not are hard to say without knowing the root case and as i tried to explain above the most likely case is token revoke, not disk i/o. obviously as more busy your disks are as longer the token revoke will take. > > > In particular im, a bit puzzled with the design of our GSS storage. > Each recovery groups have 3 declustered arrays, and each declustered > aray have 1 data and 1 metadata vdisk, but in the end both metadata > and data vdisks use the same spindles. The problem that, its that I > dont understand if we have a metadata bottleneck there. Maybe some > SSD disks in a dedicated declustered array would perform much > better, but this is just theory. I really would like to be able to > monitor IO activities on the metadata vdisks. the short answer is we WANT the metadata disks to be with the data disks on the same spindles. compared to other storage systems, GSS is capable to handle different raid codes for different virtual disks on the same physical disks, this way we create raid1'ish 'LUNS' for metadata and raid6'is 'LUNS' for data so the small i/o penalty for a metadata is very small compared to a read/modify/write on the data disks. > > > > > so the Layer you care about is the NSD Server layer, which sits on > top of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be > to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats > counts again. > > Counters its not a problem. I can collect them and create some > graphs in a monitoring tool. I will check that. if you (let) upgrade your system to GSS 2.0 you get a graphical monitoring as part of it. if you want i can send you some direct email outside the group with additional informations on that. > > the alternative option is to use mmdiag --iohist. 
this shows you a > history of the last X numbers of io operations on either the client > or the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms > qTime ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 > 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 > 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 > 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 > 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 > 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 > 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 > 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 > 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. 
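Since the stated goal is metadata latency as seen by a client, one rough cut is to keep only the metadata-ish buffer types (inode, *alloc*, *Ind* and metadata, per the classification given further down) and average the service time per NSD server. This is just a sketch that assumes the column layout in the sample above.

  mmdiag --iohist | awk '
    $10 == "cli" && tolower($3) ~ /inode|alloc|ind|metadata/ { n[$12]++; t[$12] += $6 }
    END { for (s in n) printf "%-16s %6d metadata I/Os  avg %7.3f ms\n", s, n[s], t[s]/n[s] }'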
> > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > > mmdiag --iohist its another think i looked at it, but i could not > find good explanation for all the "buf type" ( third column ) > allocSeg > data > iallocSeg > indBlock > inode > LLIndBlock > logData > logDesc > logWrap > metadata > vdiskAULog > vdiskBuf > vdiskFWLog > vdiskMDLog > vdiskMeta > vdiskRGDesc > If i want to monifor metadata operation whan should i look at? just inodes =inodes , *alloc* = file or data allocation blocks , *ind* = indirect blocks (for very large files) and metadata , everyhing else is data or internal i/o's > the metadata flag or also inode? this command takes also long to > run, especially if i run it a second time it hangs for a lot before > to rerun again, so i'm not sure that run it every 30secs or minute > its viable, but i will look also into that. THere is any > documentation that descibes clearly the whole output? what i found > its quite generic and don't go into details... the reason it takes so long is because it collects 10's of thousands of i/os in a table and to not slow down the system when we dump the data we copy it to a separate buffer so we don't need locks :-) you can adjust the number of entries you want to collect by adjusting the ioHistorySize config parameter > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and > the physical disk, so you can only reliably look at this from the > client, the iohist output only shows you the Server disk i/o > processing time, but that can be a fraction of the overall time (in > other cases this obviously can also be the dominant part depending > on your workload). 
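On the NSD server side, the same kind of summary keyed on the physical disk name in the last column of the listing above makes a single slow pdisk easy to spot; again only a sketch that assumes that layout.

  mmdiag --iohist | awk '
    $(NF-1) == "pd" { n[$NF]++; t[$NF] += $6 }
    END { for (d in n) printf "%-8s %8d I/Os  avg %7.3f ms\n", d, n[d], t[d]/n[d] }'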
> > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. > > We already do that, but as I said, I want to check specifically how > gss servers are keeping the requests to identify or exlude server > side bottlenecks. > > > Thanks for your help, you gave me definitely few things where to look at. > > Salvatore > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From luke.raimbach at oerc.ox.ac.uk Tue Sep 9 11:23:47 2014 From: luke.raimbach at oerc.ox.ac.uk (Luke Raimbach) Date: Tue, 9 Sep 2014 10:23:47 +0000 Subject: [gpfsug-discuss] mmdiag output questions Message-ID: Hi All, When tracing a problem recently (which turned out to be a NIC failure), mmdiag proved useful in tracing broken cluster connections. I have some questions about the output of mmdiag using the --network switch: Occasionally I see nodes in the same cluster grouped, but in no readily identifiable way - for example, the following output has three headings "Device bon0:" with some nodes listed, but the nodes don't seem to share anything in common like status, err, ostype, etc. Also, is anyone able to explain what might be seen under the err column? Do these correspond to GPFS error codes as one might see in mmfs.log.latest? What is the sock column displaying - the number of open sockets or the socket state? Lastly, the sent/recvd columns seem very low. Is there a rolling time window within which these statistics are kept in some internal mmfsd buffer? Cheers. 
=== mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.51/22 (eth0) my addr list 10.200.21.1/16 (bond0)/cpdn.oerc.local 10.100.10.51/22 (eth0) my node number 9 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs01 10.200.1.1 connected 0 32 110 110 Linux/L gpfs02 10.200.2.1 connected 0 36 104 104 Linux/L linux 10.200.101.1 connected 0 37 0 0 Linux/L jupiter 10.200.102.1 connected 0 35 0 0 Windows/L cnfs0 10.200.10.10 connected 0 39 0 0 Linux/L cnfs1 10.200.10.11 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cnfs2 10.200.10.12 connected 0 33 5 5 Linux/L cnfs3 10.200.10.13 init 0 -1 0 0 Linux/L cpdn-ppc02 10.200.61.1 init 0 -1 0 0 Linux/L cpdn-ppc03 10.200.62.1 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cpdn-ppc01 10.200.60.1 connected 0 38 0 0 Linux/L diag verbs: VERBS RDMA class not initialized Conversely, the output of mmdiag --network on the file system manager node for the same cluster looks like this: === mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.21/22 (eth0) my addr list 10.200.1.1/16 (bond0)/cpdn.oerc.local 10.100.10.21/22 (eth0) my node number 1 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs02 10.200.2.1 connected 0 73 219 219 Linux/L linux 10.200.101.1 connected 0 49 180 181 Linux/L jupiter 10.200.102.1 connected 0 33 3 3 Windows/L cnfs0 10.200.10.10 connected 0 61 3 3 Linux/L cnfs1 10.200.10.11 connected 0 81 0 0 Linux/L cnfs2 10.200.10.12 connected 0 64 23 23 Linux/L cnfs3 10.200.10.13 connected 0 60 2 2 Linux/L tsm01 10.200.21.1 connected 0 50 110 110 Linux/L cpdn-ppc02 10.200.61.1 connected 0 63 0 0 Linux/L cpdn-ppc03 10.200.62.1 connected 0 65 0 0 Linux/L cpdn-ppc01 10.200.60.1 connected 0 62 94 94 Linux/L diag verbs: VERBS RDMA class not initialized All neatly connected! -- Luke Raimbach IT Manager Oxford e-Research Centre 7 Keble Road, Oxford, OX1 3QG +44(0)1865 610639 From chair at gpfsug.org Wed Sep 10 15:33:24 2014 From: chair at gpfsug.org (Jez Tucker (Chair)) Date: Wed, 10 Sep 2014 15:33:24 +0100 Subject: [gpfsug-discuss] GPFS Request for Enhancements Message-ID: <54106134.7010902@gpfsug.org> Hi all Just a quick reminder that the RFEs that you all gave feedback at the last UG on are live on IBM's RFE site: goo.gl/1K6LBa Please take the time to have a look and add your votes to the GPFS RFEs. Jez -------------- next part -------------- An HTML attachment was scrubbed... URL: From dmetcalfe at ocf.co.uk Thu Sep 11 21:18:58 2014 From: dmetcalfe at ocf.co.uk (Daniel Metcalfe) Date: Thu, 11 Sep 2014 21:18:58 +0100 Subject: [gpfsug-discuss] mmdiag output questions In-Reply-To: References: Message-ID: Hi Luke, I've seen the same apparent grouping of nodes, I don't believe the nodes are actually being grouped but instead the "Device Bond0:" and column headers are being re-printed to screen whenever there is a node that has the "init" status followed by a node that is "connected". It is something I've noticed on many different versions of GPFS so I imagine it's a "feature". I've not noticed anything but '0' in the err column so I'm not sure if these correspond to error codes in the GPFS logs. 
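One low-effort way to keep an eye on this is to flag anything that is not a clean connection, i.e. any row whose status is not "connected" or whose err field is non-zero. The field positions are taken from the output above; run it on each node, or fan it out with whatever parallel shell you already use.

  /usr/lpp/mmfs/bin/mmdiag --network | awk '
    $2 ~ /^[0-9]+\./ && ($3 != "connected" || $4 != "0") { print }'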
If you run the command "mmfsadm dump tscomm", you'll see a bit more detail than the mmdiag -network shows. This suggests the sock column is number of sockets. I've seen the low numbers to for sent / recv using mmdiag --network, again the mmfsadm command above gives a better representation I've found. All that being said, if you want to get in touch with us then we'll happily open a PMR for you and find out the answer to any of your questions. Kind regards, Danny Metcalfe Systems Engineer OCF plc Tel: 0114 257 2200 [cid:image001.jpg at 01CFCE04.575B8380] Twitter Fax: 0114 257 0022 [cid:image002.jpg at 01CFCE04.575B8380] Blog Mob: 07960 503404 [cid:image003.jpg at 01CFCE04.575B8380] Web Please note, any emails relating to an OCF Support request must always be sent to support at ocf.co.uk for a ticket number to be generated or existing support ticket to be updated. Should this not be done then OCF cannot be held responsible for requests not dealt with in a timely manner. OCF plc is a company registered in England and Wales. Registered number 4132533. Registered office address: OCF plc, 5 Rotunda Business Centre, Thorncliffe Park, Chapeltown, Sheffield, S35 2PG This message is private and confidential. If you have received this message in error, please notify us immediately and remove it from your system. -----Original Message----- From: gpfsug-discuss-bounces at gpfsug.org [mailto:gpfsug-discuss-bounces at gpfsug.org] On Behalf Of Luke Raimbach Sent: 09 September 2014 11:24 To: gpfsug-discuss at gpfsug.org Subject: [gpfsug-discuss] mmdiag output questions Hi All, When tracing a problem recently (which turned out to be a NIC failure), mmdiag proved useful in tracing broken cluster connections. I have some questions about the output of mmdiag using the --network switch: Occasionally I see nodes in the same cluster grouped, but in no readily identifiable way - for example, the following output has three headings "Device bon0:" with some nodes listed, but the nodes don't seem to share anything in common like status, err, ostype, etc. Also, is anyone able to explain what might be seen under the err column? Do these correspond to GPFS error codes as one might see in mmfs.log.latest? What is the sock column displaying - the number of open sockets or the socket state? Lastly, the sent/recvd columns seem very low. Is there a rolling time window within which these statistics are kept in some internal mmfsd buffer? Cheers. 
=== mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.51/22 (eth0) my addr list 10.200.21.1/16 (bond0)/cpdn.oerc.local 10.100.10.51/22 (eth0) my node number 9 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs01 10.200.1.1 connected 0 32 110 110 Linux/L gpfs02 10.200.2.1 connected 0 36 104 104 Linux/L linux 10.200.101.1 connected 0 37 0 0 Linux/L jupiter 10.200.102.1 connected 0 35 0 0 Windows/L cnfs0 10.200.10.10 connected 0 39 0 0 Linux/L cnfs1 10.200.10.11 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cnfs2 10.200.10.12 connected 0 33 5 5 Linux/L cnfs3 10.200.10.13 init 0 -1 0 0 Linux/L cpdn-ppc02 10.200.61.1 init 0 -1 0 0 Linux/L cpdn-ppc03 10.200.62.1 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cpdn-ppc01 10.200.60.1 connected 0 38 0 0 Linux/L diag verbs: VERBS RDMA class not initialized Conversely, the output of mmdiag --network on the file system manager node for the same cluster looks like this: === mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.21/22 (eth0) my addr list 10.200.1.1/16 (bond0)/cpdn.oerc.local 10.100.10.21/22 (eth0) my node number 1 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs02 10.200.2.1 connected 0 73 219 219 Linux/L linux 10.200.101.1 connected 0 49 180 181 Linux/L jupiter 10.200.102.1 connected 0 33 3 3 Windows/L cnfs0 10.200.10.10 connected 0 61 3 3 Linux/L cnfs1 10.200.10.11 connected 0 81 0 0 Linux/L cnfs2 10.200.10.12 connected 0 64 23 23 Linux/L cnfs3 10.200.10.13 connected 0 60 2 2 Linux/L tsm01 10.200.21.1 connected 0 50 110 110 Linux/L cpdn-ppc02 10.200.61.1 connected 0 63 0 0 Linux/L cpdn-ppc03 10.200.62.1 connected 0 65 0 0 Linux/L cpdn-ppc01 10.200.60.1 connected 0 62 94 94 Linux/L diag verbs: VERBS RDMA class not initialized All neatly connected! -- Luke Raimbach IT Manager Oxford e-Research Centre 7 Keble Road, Oxford, OX1 3QG +44(0)1865 610639 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ----- No virus found in this message. Checked by AVG - www.avg.com Version: 2014.0.4765 / Virus Database: 4015/8158 - Release Date: 09/05/14 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 4696 bytes Desc: image001.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 4725 bytes Desc: image002.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.jpg Type: image/jpeg Size: 4820 bytes Desc: image003.jpg URL: From stuartb at 4gh.net Tue Sep 23 16:47:09 2014 From: stuartb at 4gh.net (Stuart Barkley) Date: Tue, 23 Sep 2014 11:47:09 -0400 (EDT) Subject: [gpfsug-discuss] filesets and mountpoint naming Message-ID: When we first started using GPFS we created several filesystems and just directly mounted them where seemed appropriate. 
We have something like: /home /scratch /projects /reference /applications We are finding the overhead of separate filesystems to be troublesome and are looking at using filesets inside fewer filesystems to accomplish our goals (we will probably keep /home separate for now). We can put symbolic links in place to provide the same user experience, but I'm looking for suggestions as to where to mount the actual gpfs filesystems. We have multiple compute clusters with multiple gpfs systems, one cluster has a traditional gpfs system and a separate gss system which will obviously need multiple mount points. We also want to consider possible future cross cluster mounts. Some thoughts are to just do filesystems as: /gpfs01, /gpfs02, etc. /mnt/gpfs01, etc /mnt/clustera/gpfs01, etc. What have other people done? Are you happy with it? What would you do differently? Thanks, Stuart -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone From sabujp at gmail.com Thu Sep 25 13:39:14 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 25 Sep 2014 07:39:14 -0500 Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs -u in bin/mmnfsfuncs unexportAll and unexportFS Message-ID: Hi all, We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip failover times > 4.5mins . It looks like it's being caused by all the exportfs -u calls being made in the unexportAll and the unexportFS function in bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the exported directories? We're running only NFSv3 and have lots of exports and for security reasons can't have one giant NFS export. That may be a possibility with GPFS4.1 and NFSv4 but we won't be migrating to that anytime soon. Assume the network went down for the cnfs server or the system panicked/crashed, what would be the purpose of exportfs -u be in that case, so what's the purpose at all? Thanks, Sabuj -------------- next part -------------- An HTML attachment was scrubbed... URL: From sabujp at gmail.com Thu Sep 25 14:11:18 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 25 Sep 2014 08:11:18 -0500 Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs -u in bin/mmnfsfuncs unexportAll and unexportFS In-Reply-To: References: Message-ID: our support engineer suggests adding & to the end of the exportfs -u lines in the mmnfsfunc script, which is a good workaround, can this be added to future gpfs 3.5 and 4.1 rels (haven't even looked at 4.1 yet). I was looking at the unexportfs all in nfs-utils/exportfs.c and it looks like the limiting factor there would be all the hostname lookups? I don't see what exportfs -u is doing other than doing slow reverse lookups and removing the export from the nfs stack. On Thu, Sep 25, 2014 at 7:39 AM, Sabuj Pattanayek wrote: > Hi all, > > We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip failover > times > 4.5mins . It looks like it's being caused by all the exportfs -u > calls being made in the unexportAll and the unexportFS function in > bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the > exported directories? We're running only NFSv3 and have lots of exports and > for security reasons can't have one giant NFS export. That may be a > possibility with GPFS4.1 and NFSv4 but we won't be migrating to that > anytime soon. 
> > Assume the network went down for the cnfs server or the system > panicked/crashed, what would be the purpose of exportfs -u be in that case, > so what's the purpose at all? > > Thanks, > Sabuj > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sabujp at gmail.com Thu Sep 25 14:15:19 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 25 Sep 2014 08:15:19 -0500 Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs -u in bin/mmnfsfuncs unexportAll and unexportFS In-Reply-To: References: Message-ID: yes, it's doing a getaddrinfo() call for every hostname that's a fqdn and not an ip addr, which we have lots of in our export entries since sometimes clients update their dns (ip's). On Thu, Sep 25, 2014 at 8:11 AM, Sabuj Pattanayek wrote: > our support engineer suggests adding & to the end of the exportfs -u lines > in the mmnfsfunc script, which is a good workaround, can this be added to > future gpfs 3.5 and 4.1 rels (haven't even looked at 4.1 yet). I was > looking at the unexportfs all in nfs-utils/exportfs.c and it looks like the > limiting factor there would be all the hostname lookups? I don't see what > exportfs -u is doing other than doing slow reverse lookups and removing the > export from the nfs stack. > > On Thu, Sep 25, 2014 at 7:39 AM, Sabuj Pattanayek > wrote: > >> Hi all, >> >> We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip >> failover times > 4.5mins . It looks like it's being caused by all the >> exportfs -u calls being made in the unexportAll and the unexportFS function >> in bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the >> exported directories? We're running only NFSv3 and have lots of exports and >> for security reasons can't have one giant NFS export. That may be a >> possibility with GPFS4.1 and NFSv4 but we won't be migrating to that >> anytime soon. >> >> Assume the network went down for the cnfs server or the system >> panicked/crashed, what would be the purpose of exportfs -u be in that case, >> so what's the purpose at all? >> >> Thanks, >> Sabuj >> >> >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Sep 1 20:44:45 2014 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Mon, 1 Sep 2014 19:44:45 +0000 Subject: [gpfsug-discuss] GPFS admin host name vs subnets Message-ID: I was just reading through the docs at: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/GPFS+Network+Communication+Overview And was wondering about using admin host name bs using subnets. My reading of the page is that if say I have a 1GbE network and a 40GbE network, I could have an admin host name on the 1GbE network. But equally from the docs, it looks like I could also use subnets to achieve the same whilst allowing the admin network to be a fall back for data if necessary. For example, create the cluster using the primary name on the 1GbE network, then use the subnets property to use set the network on the 40GbE network as the first and the network on the 1GbE network as the second in the list, thus GPFS data will pass over the 40GbE network in preference and the 1GbE network will, by default only be used for admin traffic as the admin host name will just be the name of the host on the 1GbE network. Is my reading of the docs correct? 
Or do I really want to be creating the cluster using the 40GbE network hostnames and set the admin node name to the name of the 1GbE network interface? (there's actually also an FDR switch in there somewhere for verbs as well) Thanks Simon From ewahl at osc.edu Tue Sep 2 14:44:29 2014 From: ewahl at osc.edu (Ed Wahl) Date: Tue, 2 Sep 2014 13:44:29 +0000 Subject: [gpfsug-discuss] GPFS admin host name vs subnets In-Reply-To: References: Message-ID: Seems like you are on the correct track. This is similar to my setup. subnett'ed daemon 10GbE, 1GbE with main being QDR RDMA, admin 1GbE. To my mind the most important part is Setting "privateSubnetOverride" to 1. This allows both your 1GbE and your 40GbE to be on a private subnet. Serving block over public IPs just seems wrong on SO many levels. Whether truly private/internal or not. And how many people use public IPs internally? Wait, maybe I don't want to know... Using 'verbsRdma enable' for your FDR seems to override Daemon node name for block, at least in my experience. I love the fallback to 10GbE and then 1GbE in case of disaster when using IB. Lately we seem to be generating bugs in OpenSM at a frightening rate so that has been _extremely_ helpful. Now if we could just monitor when it happens more easily than running mmfsadm test verbs conn, say by logging a failure of RDMA? Ed OSC ________________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Simon Thompson (Research Computing - IT Services) [S.J.Thompson at bham.ac.uk] Sent: Monday, September 01, 2014 3:44 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] GPFS admin host name vs subnets I was just reading through the docs at: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/GPFS+Network+Communication+Overview And was wondering about using admin host name bs using subnets. My reading of the page is that if say I have a 1GbE network and a 40GbE network, I could have an admin host name on the 1GbE network. But equally from the docs, it looks like I could also use subnets to achieve the same whilst allowing the admin network to be a fall back for data if necessary. For example, create the cluster using the primary name on the 1GbE network, then use the subnets property to use set the network on the 40GbE network as the first and the network on the 1GbE network as the second in the list, thus GPFS data will pass over the 40GbE network in preference and the 1GbE network will, by default only be used for admin traffic as the admin host name will just be the name of the host on the 1GbE network. Is my reading of the docs correct? Or do I really want to be creating the cluster using the 40GbE network hostnames and set the admin node name to the name of the 1GbE network interface? (there's actually also an FDR switch in there somewhere for verbs as well) Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From oehmes at gmail.com Tue Sep 2 15:11:03 2014 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 2 Sep 2014 07:11:03 -0700 Subject: [gpfsug-discuss] GPFS admin host name vs subnets In-Reply-To: References: Message-ID: Ed, if you enable RDMA, GPFS will always use this as preferred data transfer. if you have subnets configured, GPFS will prefer this for communication with higher priority as the default interface. 
so the order is RDMA , subnets, default. if RDMA will fail for whatever reason we will use the subnets defined interface and if that fails as well we will use the default interface. the easiest way to see what is used is to run mmdiag --network (only avail on more recent versions of GPFS) it will tell you if RDMA is enabled between individual nodes as well as if a subnet connection is used or not : [root at client05 ~]# mmdiag --network === mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 192.167.13.5/16 (eth0) my addr list 192.1.13.5/16 (ib1) 192.0.13.5/16 (ib0)/ client04.clientad.almaden.ibm.com 192.167.13.5/16 (eth0) my node number 17 TCP Connections between nodes: Device ib0: hostname node destination status err sock sent(MB) recvd(MB) ostype client04n1 192.0.4.1 connected 0 69 0 37 Linux/L client04n2 192.0.4.2 connected 0 70 0 37 Linux/L client04n3 192.0.4.3 connected 0 68 0 0 Linux/L Device ib1: hostname node destination status err sock sent(MB) recvd(MB) ostype clientcl21 192.1.201.21 connected 0 65 0 0 Linux/L clientcl25 192.1.201.25 connected 0 66 0 0 Linux/L clientcl26 192.1.201.26 connected 0 67 0 0 Linux/L clientcl21 192.1.201.21 connected 0 71 0 0 Linux/L clientcl22 192.1.201.22 connected 0 63 0 0 Linux/L client10 192.1.13.10 connected 0 73 0 0 Linux/L client08 192.1.13.8 connected 0 72 0 0 Linux/L RDMA Connections between nodes: Fabric 1 - Device mlx4_0 Port 1 Width 4x Speed FDR lid 13 hostname idx CM state VS buff RDMA_CT(ERR) RDMA_RCV_MB RDMA_SND_MB VS_CT(ERR) VS_SND_MB VS_RCV_MB WAIT_CON_SLOT WAIT_NODE_SLOT clientcl21 0 N RTS (Y)903 0 (0 ) 0 0 192 (0 ) 0 0 0 0 client04n1 0 N RTS (Y)477 0 (0 ) 0 0 12367404(0 ) 107905 594 0 0 client04n1 1 N RTS (Y)477 0 (0 ) 0 0 12367404(0 ) 107901 593 0 0 client04n2 0 N RTS (Y)477 0 (0 ) 0 0 12371352(0 ) 107911 594 0 0 client04n2 2 N RTS (Y)477 0 (0 ) 0 0 12371352(0 ) 107902 594 0 0 clientcl21 0 N RTS (Y)880 0 (0 ) 0 0 11 (0 ) 0 0 0 0 client04n3 0 N RTS (Y)969 0 (0 ) 0 0 5 (0 ) 0 0 0 0 clientcl26 0 N RTS (Y)702 0 (0 ) 0 0 35 (0 ) 0 0 0 0 client08 0 N RTS (Y)637 0 (0 ) 0 0 16 (0 ) 0 0 0 0 clientcl25 0 N RTS (Y)574 0 (0 ) 0 0 14 (0 ) 0 0 0 0 clientcl22 0 N RTS (Y)507 0 (0 ) 0 0 2 (0 ) 0 0 0 0 client10 0 N RTS (Y)568 0 (0 ) 0 0 121 (0 ) 0 0 0 0 Fabric 2 - Device mlx4_0 Port 2 Width 4x Speed FDR lid 65 hostname idx CM state VS buff RDMA_CT(ERR) RDMA_RCV_MB RDMA_SND_MB VS_CT(ERR) VS_SND_MB VS_RCV_MB WAIT_CON_SLOT WAIT_NODE_SLOT clientcl21 1 N RTS (Y)904 0 (0 ) 0 0 192 (0 ) 0 0 0 0 client04n1 2 N RTS (Y)477 0 (0 ) 0 0 12367404(0 ) 107897 593 0 0 client04n2 1 N RTS (Y)477 0 (0 ) 0 0 12371352(0 ) 107903 594 0 0 clientcl21 1 N RTS (Y)881 0 (0 ) 0 0 10 (0 ) 0 0 0 0 clientcl26 1 N RTS (Y)701 0 (0 ) 0 0 35 (0 ) 0 0 0 0 client08 1 N RTS (Y)637 0 (0 ) 0 0 16 (0 ) 0 0 0 0 clientcl25 1 N RTS (Y)574 0 (0 ) 0 0 14 (0 ) 0 0 0 0 clientcl22 1 N RTS (Y)507 0 (0 ) 0 0 2 (0 ) 0 0 0 0 client10 1 N RTS (Y)568 0 (0 ) 0 0 121 (0 ) 0 0 0 0 in this example you can see thet my client (client05) has multiple subnets configured as well as RDMA. so to connected to the various TCP devices (ib0 and ib1) to different cluster nodes and also has a RDMA connection to a different set of nodes. as you can see there is basically no traffic on the TCP devices, as all the traffic uses the 2 defined RDMA fabrics. there is not a single connection using the daemon interface (eth0) as all nodes are either connected via subnets or via RDMA. hope this helps. 
Sven On Tue, Sep 2, 2014 at 6:44 AM, Ed Wahl wrote: > Seems like you are on the correct track. This is similar to my setup. > subnett'ed daemon 10GbE, 1GbE with main being QDR RDMA, admin 1GbE. To > my mind the most important part is Setting "privateSubnetOverride" to 1. > This allows both your 1GbE and your 40GbE to be on a private subnet. > Serving block over public IPs just seems wrong on SO many levels. Whether > truly private/internal or not. And how many people use public IPs > internally? Wait, maybe I don't want to know... > > Using 'verbsRdma enable' for your FDR seems to override Daemon node > name for block, at least in my experience. I love the fallback to 10GbE > and then 1GbE in case of disaster when using IB. Lately we seem to be > generating bugs in OpenSM at a frightening rate so that has been > _extremely_ helpful. Now if we could just monitor when it happens more > easily than running mmfsadm test verbs conn, say by logging a failure of > RDMA? > > > Ed > OSC > > ________________________________________ > From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] > on behalf of Simon Thompson (Research Computing - IT Services) [ > S.J.Thompson at bham.ac.uk] > Sent: Monday, September 01, 2014 3:44 PM > To: gpfsug main discussion list > Subject: [gpfsug-discuss] GPFS admin host name vs subnets > > I was just reading through the docs at: > > > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/GPFS+Network+Communication+Overview > > And was wondering about using admin host name bs using subnets. My reading > of the page is that if say I have a 1GbE network and a 40GbE network, I > could have an admin host name on the 1GbE network. But equally from the > docs, it looks like I could also use subnets to achieve the same whilst > allowing the admin network to be a fall back for data if necessary. > > For example, create the cluster using the primary name on the 1GbE > network, then use the subnets property to use set the network on the 40GbE > network as the first and the network on the 1GbE network as the second in > the list, thus GPFS data will pass over the 40GbE network in preference and > the 1GbE network will, by default only be used for admin traffic as the > admin host name will just be the name of the host on the 1GbE network. > > Is my reading of the docs correct? Or do I really want to be creating the > cluster using the 40GbE network hostnames and set the admin node name to > the name of the 1GbE network interface? > > (there's actually also an FDR switch in there somewhere for verbs as well) > > Thanks > > Simon > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Wed Sep 3 18:27:44 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Wed, 03 Sep 2014 18:27:44 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring Message-ID: <54074F90.7000303@ebi.ac.uk> Hello everybody, here i come here again, this time to ask some hint about how to monitor GPFS. 
I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is that they return number based only on the request done in the current host, so i have to run them on all the clients ( over 600 nodes) so its quite unpractical. Instead i would like to know from the servers whats going on, and i came across the vio_s statistics wich are less documented and i dont know exacly what they mean. There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that runs VIO_S. My problems with the output of this command: echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second timestamp: 1409763206/477366 recovery group: * declustered array: * vdisk: * client reads: 2584229 client short writes: 55299693 client medium writes: 190071 client promoted full track writes: 465145 client full track writes: 9249 flushed update writes: 4187708 flushed promoted full track writes: 123 migrate operations: 114 scrub operations: 450590 log writes: 28509602 it sais "VIOPS per second", but they seem to me just counters as every time i re-run the command, the numbers increase by a bit.. Can anyone confirm if those numbers are counter or if they are OPS/sec. On a closer eye about i dont understand what most of thosevalues mean. For example, what exacly are "flushed promoted full track write" ?? I tried to find a documentation about this output , but could not find any. can anyone point me a link where output of vio_s is explained? Another thing i dont understand about those numbers is if they are just operations, or the number of blocks that was read/write/etc . I'm asking that because if they are just ops, i don't know how much they could be usefull. For example one write operation could eman write 1 block or write a file of 100GB. If those are oprations, there is a way to have the oupunt in bytes or blocks? Last but not least.. and this is what i really would like to accomplish, i would to be able to monitor the latency of metadata operations. In my environment there are users that litterally overhelm our storages with metadata request, so even if there is no massive throughput or huge waiters, any "ls" could take ages. I would like to be able to monitor metadata behaviour. There is a way to to do that from the NSD servers? Thanks in advance for any tip/help. Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Wed Sep 3 21:55:14 2014 From: chekh at stanford.edu (Alex Chekholko) Date: Wed, 03 Sep 2014 13:55:14 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54074F90.7000303@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> Message-ID: <54078032.2050605@stanford.edu> The usual way to do that is to re-architect your filesystem so that the system pool is metadata-only, and then you can just look at the storage layer and see total metadata throughput that way. Otherwise your metadata ops are mixed in with your data ops. Of course, both NSDs and clients also have metadata caches. On 09/03/2014 10:27 AM, Salvatore Di Nardo wrote: > > Last but not least.. and this is what i really would like to accomplish, > i would to be able to monitor the latency of metadata operations. > In my environment there are users that litterally overhelm our storages > with metadata request, so even if there is no massive throughput or huge > waiters, any "ls" could take ages. I would like to be able to monitor > metadata behaviour. There is a way to to do that from the NSD servers? 
-- Alex Chekholko chekh at stanford.edu 347-401-4860 From oehmes at us.ibm.com Thu Sep 4 01:50:25 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Wed, 3 Sep 2014 17:50:25 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54074F90.7000303@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> Message-ID: > Hello everybody, Hi > here i come here again, this time to ask some hint about how to monitor GPFS. > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > that they return number based only on the request done in the > current host, so i have to run them on all the clients ( over 600 > nodes) so its quite unpractical. Instead i would like to know from > the servers whats going on, and i came across the vio_s statistics > wich are less documented and i dont know exacly what they mean. > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > runs VIO_S. > > My problems with the output of this command: > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > timestamp: 1409763206/477366 > recovery group: * > declustered array: * > vdisk: * > client reads: 2584229 > client short writes: 55299693 > client medium writes: 190071 > client promoted full track writes: 465145 > client full track writes: 9249 > flushed update writes: 4187708 > flushed promoted full track writes: 123 > migrate operations: 114 > scrub operations: 450590 > log writes: 28509602 > > it sais "VIOPS per second", but they seem to me just counters as > every time i re-run the command, the numbers increase by a bit.. > Can anyone confirm if those numbers are counter or if they are OPS/sec. the numbers are accumulative so everytime you run them they just show the value since start (or last reset) time. > > On a closer eye about i dont understand what most of thosevalues > mean. For example, what exacly are "flushed promoted full track write" ?? > I tried to find a documentation about this output , but could not > find any. can anyone point me a link where output of vio_s is explained? > > Another thing i dont understand about those numbers is if they are > just operations, or the number of blocks that was read/write/etc . its just operations and if i would explain what the numbers mean i might confuse you even more because this is not what you are really looking for. what you are looking for is what the client io's look like on the Server side, while the VIO layer is the Server side to the disks, so one lever lower than what you are looking for from what i could read out of the description above. so the Layer you care about is the NSD Server layer, which sits on top of the VIO layer (which is essentially the SW RAID Layer in GNR) > I'm asking that because if they are just ops, i don't know how much > they could be usefull. For example one write operation could eman > write 1 block or write a file of 100GB. If those are oprations, > there is a way to have the oupunt in bytes or blocks? there are multiple ways to get infos on the NSD layer, one would be to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts again. the alternative option is to use mmdiag --iohist. 
this shows you a history of the last X numbers of io operations on either the client or the server side like on a client : # mmdiag --iohist === mmdiag: iohist === I/O history: I/O start time RW Buf type disk:sectorNum nSec time ms qTime ms RpcTimes ms Type Device/NSD ID NSD server --------------- -- ----------- ----------------- ----- ------- -------- ----------------- ---- ------------------ --------------- 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.668262 R inode 2:1081373696 8 14.117 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.692019 R inode 2:1064356608 8 14.899 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 14:25:53.707100 R inode 2:1077830152 8 16.499 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.906556 R inode 2:1083476520 8 11.723 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.941441 R inode 2:1069885984 8 11.686 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 you basically see if its a inode , data block , what size it has (in sectors) , which nsd server you did send this request to, etc. 
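One way to make a running check of this is to total the iohist output by buffer type. The snippet below is only a rough sketch, not an official tool: it assumes the column layout shown above (buffer type in the third field, size in sectors in the fifth, service time in milliseconds in the sixth), which may differ between GPFS releases, so compare it against your own mmdiag output before trusting the numbers. Run on an NSD server it gives an approximate split between metadata-related I/O (inode, metadata, *IndBlock) and data I/O, with average service times.

#!/usr/bin/env python
# Rough sketch: summarise "mmdiag --iohist" output by buffer type.
# Field positions are assumed from the sample output in this thread:
# fields[2] = buf type, fields[4] = size in sectors, fields[5] = time in ms.
import subprocess

def iohist_summary():
    out = subprocess.Popen(['/usr/lpp/mmfs/bin/mmdiag', '--iohist'],
                           stdout=subprocess.PIPE).communicate()[0]
    stats = {}                      # buf type -> [ios, sectors, total ms]
    for line in out.splitlines():
        fields = line.split()
        if len(fields) < 7 or fields[1] not in ('R', 'W'):
            continue                # skip headers and separator lines
        try:
            sectors = int(fields[4])
            ms = float(fields[5])
        except ValueError:
            continue
        entry = stats.setdefault(fields[2], [0, 0, 0.0])
        entry[0] += 1
        entry[1] += sectors
        entry[2] += ms
    for buftype in sorted(stats):
        ios, sectors, ms = stats[buftype]
        print '%-12s ios=%-7d sectors=%-9d avg_ms=%.3f' % (
            buftype, ios, sectors, ms / ios)

if __name__ == '__main__':
    iohist_summary()

The same field positions apply to the server-side form of the output, shown next.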
on the Server side you see the type , which physical disk it goes to and also what size of disk i/o it causes like : 14:26:50.129995 R inode 12:3211886376 64 14.261 0.000 0.000 0.000 pd sdis 14:26:50.137102 R inode 19:3003969520 64 9.004 0.000 0.000 0.000 pd sdad 14:26:50.136116 R inode 55:3591710992 64 11.057 0.000 0.000 0.000 pd sdoh 14:26:50.141510 R inode 21:3066810504 64 5.909 0.000 0.000 0.000 pd sdaf 14:26:50.130529 R inode 89:2962370072 64 17.437 0.000 0.000 0.000 pd sddi 14:26:50.131063 R inode 78:1889457000 64 17.062 0.000 0.000 0.000 pd sdsj 14:26:50.143403 R inode 36:3323035688 64 4.807 0.000 0.000 0.000 pd sdmw 14:26:50.131044 R inode 37:2513579736 128 17.181 0.000 0.000 0.000 pd sddv 14:26:50.138181 R inode 72:3868810400 64 10.951 0.000 0.000 0.000 pd sdbz 14:26:50.138188 R inode 131:2443484784 128 11.792 0.000 0.000 0.000 pd sdug 14:26:50.138003 R inode 102:3696843872 64 11.994 0.000 0.000 0.000 pd sdgp 14:26:50.137099 R inode 145:3370922504 64 13.225 0.000 0.000 0.000 pd sdmi 14:26:50.141576 R inode 62:2668579904 64 9.313 0.000 0.000 0.000 pd sdou 14:26:50.134689 R inode 159:2786164648 64 16.577 0.000 0.000 0.000 pd sdpq 14:26:50.145034 R inode 34:2097217320 64 7.409 0.000 0.000 0.000 pd sdmt 14:26:50.138140 R inode 139:2831038792 64 14.898 0.000 0.000 0.000 pd sdlw 14:26:50.130954 R inode 164:282120312 64 22.274 0.000 0.000 0.000 pd sdzd 14:26:50.137038 R inode 41:3421909608 64 16.314 0.000 0.000 0.000 pd sdef 14:26:50.137606 R inode 104:1870962416 64 16.644 0.000 0.000 0.000 pd sdgx 14:26:50.141306 R inode 65:2276184264 64 16.593 0.000 0.000 0.000 pd sdrk > > Last but not least.. and this is what i really would like to > accomplish, i would to be able to monitor the latency of metadata operations. you can't do this on the server side as you don't know how much time you spend on the client , network or anything between the app and the physical disk, so you can only reliably look at this from the client, the iohist output only shows you the Server disk i/o processing time, but that can be a fraction of the overall time (in other cases this obviously can also be the dominant part depending on your workload). the easiest way on the client is to run mmfsadm vfsstats enable from now on vfs stats are collected until you restart GPFS. then run : vfs statistics currently enabled started at: Fri Aug 29 13:15:05.380 2014 duration: 448446.970 sec name calls time per call total time -------------------- -------- -------------- -------------- statfs 9 0.000002 0.000021 startIO 246191176 0.005853 1441049.976740 to dump what ever you collected so far on this node. > In my environment there are users that litterally overhelm our > storages with metadata request, so even if there is no massive > throughput or huge waiters, any "ls" could take ages. I would like > to be able to monitor metadata behaviour. There is a way to to do > that from the NSD servers? not this simple as described above. > > Thanks in advance for any tip/help. > > Regards, > Salvatore_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From service at metamodul.com Thu Sep 4 11:05:18 2014 From: service at metamodul.com (service at metamodul.com) Date: Thu, 4 Sep 2014 12:05:18 +0200 (CEST) Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54074F90.7000303@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> Message-ID: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> > , any "ls" could take ages. Check if you large directories either with many files or simply large. Verify if you have NFS exported GPFS. Verify that your cache settings on the clients are large enough ( maxStatCache , maxFilesToCache , sharedMemLimit ) Verify that you have dedicated metadata luns ( metadataOnly ) Reference: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters Note: If possible monitor your metadata luns on the storage directly. hth Hajo -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Sep 4 11:43:36 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 11:43:36 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: References: <54074F90.7000303@ebi.ac.uk> Message-ID: <54084258.90508@ebi.ac.uk> On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just show > the value since start (or last reset) time. OK, you confirmed my toughts, thatks > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track > write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. 
> what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. No.. what I'm looking its exactly how the disks are busy to keep the requests. Obviously i'm not looking just that, but I feel the needs to monitor _*also*_ those things. Ill explain you why. It happens when our storage is quite busy ( 180Gb/s of read/write ) that the FS start to be slowin normal /*cd*/ or /*ls*/ requests. This might be normal, but in those situation i want to know where the bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing where the bottlenek is might help me to understand if we can tweak the system a bit more. If its the CPU on the servers then there is no much to do beside replacing or add more servers.If its not the CPU, maybe more memory would help? Maybe its just the network that filled up? so i can add more links Or if we reached the point there the bottleneck its the spindles, then there is no much point o look somethere else, we just reached the hardware limit.. Sometimes, it also happens that there is very low IO (10Gb/s ), almost no cpu usage on the servers but huge slownes ( ls can take 10 seconds). Why that happens? There is not much data ops , but we think there is a huge ammount of metadata ops. So what i want to know is if the metadata vdisks are busy or not. If this is our problem, could some SSD disks dedicated to metadata help? In particular im, a bit puzzled with the design of our GSS storage. Each recovery groups have 3 declustered arrays, and each declustered aray have 1 data and 1 metadata vdisk, but in the end both metadata and data vdisks use the same spindles. The problem that, its that I dont understand if we have a metadata bottleneck there. Maybe some SSD disks in a dedicated declustered array would perform much better, but this is just theory. I really would like to be able to monitor IO activities on the metadata vdisks. > > > so the Layer you care about is the NSD Server layer, which sits on top > of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be to > use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts > again. Counters its not a problem. I can collect them and create some graphs in a monitoring tool. I will check that. > > the alternative option is to use mmdiag --iohist. 
this shows you a > history of the last X numbers of io operations on either the client or > the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms qTime > ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 > 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 > 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 > 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 > 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 > 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 > 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 > 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 > 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. 
> > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > mmdiag --iohist its another think i looked at it, but i could not find good explanation for all the "buf type" ( third column ) allocSeg data iallocSeg indBlock inode LLIndBlock logData logDesc logWrap metadata vdiskAULog vdiskBuf vdiskFWLog vdiskMDLog vdiskMeta vdiskRGDesc If i want to monifor metadata operation whan should i look at? just the metadata flag or also inode? this command takes also long to run, especially if i run it a second time it hangs for a lot before to rerun again, so i'm not sure that run it every 30secs or minute its viable, but i will look also into that. THere is any documentation that descibes clearly the whole output? what i found its quite generic and don't go into details... > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and the > physical disk, so you can only reliably look at this from the client, > the iohist output only shows you the Server disk i/o processing time, > but that can be a fraction of the overall time (in other cases this > obviously can also be the dominant part depending on your workload). > > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. 
> We already do that, but as I said, I want to check specifically how gss servers are keeping the requests to identify or exlude server side bottlenecks. Thanks for your help, you gave me definitely few things where to look at. Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Sep 4 11:58:51 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 11:58:51 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> References: <54074F90.7000303@ebi.ac.uk> <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> Message-ID: <540845EB.1020202@ebi.ac.uk> Little clarification, the filsystemn its not always slow. It happens that became very slow with particular users jobs in the farm. Maybe its just an indication thant we have huge ammount of metadata requestes, that's why i want to be able to monitor them On 04/09/14 11:05, service at metamodul.com wrote: > > , any "ls" could take ages. > Check if you large directories either with many files or simply large. it happens that the files are very large ( over 100G), but there usually ther are no many files. > Verify if you have NFS exported GPFS. No NFS > Verify that your cache settings on the clients are large enough ( > maxStatCache , maxFilesToCache , sharedMemLimit ) will look at them, but i'm not sure that the best number will be on the client. Obviously i cannot use all the memory of the client because those blients are meant to run jobs.... > Verify that you have dedicated metadata luns ( metadataOnly ) Yes, we have dedicate vdisks for metadata, but they are in the same declustered arrays/recoverygroups, so they whare the same spindles > Reference: > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters > > Note: > If possible monitor your metadata luns on the storage directly. that?s exactly than I'm trying to do !!!! :-D > hth > Hajo > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Thu Sep 4 13:04:21 2014 From: service at metamodul.com (service at metamodul.com) Date: Thu, 4 Sep 2014 14:04:21 +0200 (CEST) Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540845EB.1020202@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> <540845EB.1020202@ebi.ac.uk> Message-ID: <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de> ... , any "ls" could take ages. >Check if you large directories either with many files or simply large. >> it happens that the files are very large ( over 100G), but there usually >> ther are no many files. >>> Please check that the directory size is not large. In a worst case you have a directory with 10MB in size but it contains only one file. In any way GPFS must fetch the whole directory structure might causing unnecassery IO. Thus my request that you check your directory sizes. >Verify that your cache settings on the clients are large enough ( maxStatCache >, maxFilesToCache , sharedMemLimit ) >>will look at them, but i'm not sure that the best number will be on the >>client. 
Obviously i cannot use all the memory of the client because those >>blients are meant to run jobs.... Use lsof on the client to determine the amount of open filese. mmdiag --stats ( >From my memory ) shows a little bit about the cache usage. maxStatCache does not use that much memory. > Verify that you have dedicated metadata luns ( metadataOnly ) >> Yes, we have dedicate vdisks for metadata, but they are in the same >> declustered arrays/recoverygroups, so they whare the same spindles Thats imho not a good approach. Metadata operation are small and random, data io is large and streaming. Just think you have a highway full of large trucks and you try to get with a high speed bike to your destination. You will be blocked. The same problem you have at your destiation. If many large trucks would like to get their stuff off there is no time for somebody with a small parcel. Thats the same reason why you should not access tape storage and disk storage via the same FC adapter. ( Streaming IO version v. random/small IO ) So even without your current problem and motivation for measureing i would strongly suggest to have at least dediacted SSD for metadata and if possible even dedicated NSD server for the metadata. Meaning have a dedicated path for your data and a dedicated path for your metadata. All from a users point of view Hajo -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Sep 4 14:25:09 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 14:25:09 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de> References: <54074F90.7000303@ebi.ac.uk> <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> <540845EB.1020202@ebi.ac.uk> <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de> Message-ID: <54086835.6050603@ebi.ac.uk> > >> Yes, we have dedicate vdisks for metadata, but they are in the same > declustered arrays/recoverygroups, so they whare the same spindles > > Thats imho not a good approach. Metadata operation are small and > random, data io is large and streaming. > > Just think you have a highway full of large trucks and you try to get > with a high speed bike to your destination. You will be blocked. > The same problem you have at your destiation. If many large trucks > would like to get their stuff off there is no time for somebody with a > small parcel. > > Thats the same reason why you should not access tape storage and disk > storage via the same FC adapter. ( Streaming IO version v. > random/small IO ) > > So even without your current problem and motivation for measureing i > would strongly suggest to have at least dediacted SSD for metadata and > if possible even dedicated NSD server for the metadata. > Meaning have a dedicated path for your data and a dedicated path for > your metadata. > > All from a users point of view > Hajo > That's where i was puzzled too. GSS its a gpfs appliance and came configured this way. Also official GSS documentation suggest to create separate vdisks for data and meatadata, but in the same declustered arrays. I always felt this a strange choice, specially if we consider that metadata require a very small abbount of space, so few ssd could do the trick.... 
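Since the vio_s numbers quoted earlier in the thread are cumulative, one way to watch the vdisk layer on a GSS recovery group server is to sample them twice and print per-second deltas. The sketch below is only an illustration: it drives mmpmon exactly as shown earlier (feeding it "vio_s" and reading the human-readable "name: value" lines), and that output format is not a stable interface, so the supported starting point remains the sample script /usr/lpp/mmfs/samples/vdisk/viostat. Note also that the output quoted above aggregates over all vdisks ("vdisk: *"), so separating the metadata vdisks from the data vdisks still needs per-vdisk reporting along the lines of that sample script.

#!/usr/bin/env python
# Rough sketch: per-second rates from the cumulative vio_s counters,
# taken as the difference between two mmpmon samples INTERVAL seconds apart.
import subprocess, time

INTERVAL = 10                      # seconds between the two samples

def vio_sample():
    p = subprocess.Popen(['/usr/lpp/mmfs/bin/mmpmon', '-r', '1'],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out = p.communicate('vio_s\n')[0]
    counters = {}
    for line in out.splitlines():
        if ':' not in line:
            continue
        name, sep, value = line.rpartition(':')
        name, value = name.strip(), value.strip()
        if value.isdigit():        # keeps "client reads" etc., skips "*" and timestamps
            counters[name] = int(value)
    return counters

if __name__ == '__main__':
    first = vio_sample()
    time.sleep(INTERVAL)
    second = vio_sample()
    for name in sorted(second):
        rate = (second[name] - first.get(name, 0)) / float(INTERVAL)
        print '%-40s %12.1f per sec' % (name, rate)

Run from cron on each recovery group server and logged somewhere central, this at least shows whether the vdisk layer is busy while an "ls" is slow.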
From sdinardo at ebi.ac.uk Thu Sep 4 14:32:15 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 14:32:15 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: References: <54074F90.7000303@ebi.ac.uk> Message-ID: <540869DF.5060100@ebi.ac.uk> Sorry to bother you again but dstat have some issues with the plugin: [root at gss01a util]# dstat --gpfs /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) Module dstat_gpfs failed to load. (global name 'select' is not defined) None of the stats you selected are available. I found this solution , but involve dstat recompile.... https://github.com/dagwieers/dstat/issues/44 Are you aware about any easier solution (we use RHEL6.3) ? Regards, Salvatore On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just show > the value since start (or last reset) time. > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track > write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. > what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. > > so the Layer you care about is the NSD Server layer, which sits on top > of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. 
For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be to > use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts > again. > > the alternative option is to use mmdiag --iohist. this shows you a > history of the last X numbers of io operations on either the client or > the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms qTime > ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 > 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 > 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 > 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 > 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 > 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 > 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 > 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 > 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. 
> > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and the > physical disk, so you can only reliably look at this from the client, > the iohist output only shows you the Server disk i/o processing time, > but that can be a fraction of the overall time (in other cases this > obviously can also be the dominant part depending on your workload). > > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. > > > In my environment there are users that litterally overhelm our > > storages with metadata request, so even if there is no massive > > throughput or huge waiters, any "ls" could take ages. I would like > > to be able to monitor metadata behaviour. There is a way to to do > > that from the NSD servers? > > not this simple as described above. > > > > > Thanks in advance for any tip/help. 
> > > > Regards, > > Salvatore_______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at gpfsug.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Sep 4 14:54:37 2014 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 04 Sep 2014 14:54:37 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540869DF.5060100@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> Message-ID: <54086F1D.1000401@ed.ac.uk> On 04/09/14 14:32, Salvatore Di Nardo wrote: > Sorry to bother you again but dstat have some issues with the plugin: > > [root at gss01a util]# dstat --gpfs > /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is > deprecated. Use the subprocess module. > pipes[cmd] = os.popen3(cmd, 't', 0) > Module dstat_gpfs failed to load. (global name 'select' is not > defined) > None of the stats you selected are available. > > I found this solution , but involve dstat recompile.... > > https://github.com/dagwieers/dstat/issues/44 > > Are you aware about any easier solution (we use RHEL6.3) ? > This worked for me the other day on a dev box I was poking at: # rm /usr/share/dstat/dstat_gpfsops* # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 /usr/share/dstat/dstat_gpfsops.py # dstat --gpfsops /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- cr del op/cl rd wr trunc fsync looku gattr sattr other mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... > > Regards, > Salvatore > > On 04/09/14 01:50, Sven Oehme wrote: >> > Hello everybody, >> >> Hi >> >> > here i come here again, this time to ask some hint about how to >> monitor GPFS. >> > >> > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is >> > that they return number based only on the request done in the >> > current host, so i have to run them on all the clients ( over 600 >> > nodes) so its quite unpractical. Instead i would like to know from >> > the servers whats going on, and i came across the vio_s statistics >> > wich are less documented and i dont know exacly what they mean. >> > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that >> > runs VIO_S. >> > >> > My problems with the output of this command: >> > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 >> > >> > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second >> > timestamp: 1409763206/477366 >> > recovery group: * >> > declustered array: * >> > vdisk: * >> > client reads: 2584229 >> > client short writes: 55299693 >> > client medium writes: 190071 >> > client promoted full track writes: 465145 >> > client full track writes: 9249 >> > flushed update writes: 4187708 >> > flushed promoted full track writes: 123 >> > migrate operations: 114 >> > scrub operations: 450590 >> > log writes: 28509602 >> > >> > it sais "VIOPS per second", but they seem to me just counters as >> > every time i re-run the command, the numbers increase by a bit.. 
>> > Can anyone confirm if those numbers are counter or if they are OPS/sec. >> >> the numbers are accumulative so everytime you run them they just show >> the value since start (or last reset) time. >> >> > >> > On a closer eye about i dont understand what most of thosevalues >> > mean. For example, what exacly are "flushed promoted full track >> write" ?? >> > I tried to find a documentation about this output , but could not >> > find any. can anyone point me a link where output of vio_s is explained? >> > >> > Another thing i dont understand about those numbers is if they are >> > just operations, or the number of blocks that was read/write/etc . >> >> its just operations and if i would explain what the numbers mean i >> might confuse you even more because this is not what you are really >> looking for. >> what you are looking for is what the client io's look like on the >> Server side, while the VIO layer is the Server side to the disks, so >> one lever lower than what you are looking for from what i could read >> out of the description above. >> >> so the Layer you care about is the NSD Server layer, which sits on top >> of the VIO layer (which is essentially the SW RAID Layer in GNR) >> >> > I'm asking that because if they are just ops, i don't know how much >> > they could be usefull. For example one write operation could eman >> > write 1 block or write a file of 100GB. If those are oprations, >> > there is a way to have the oupunt in bytes or blocks? >> >> there are multiple ways to get infos on the NSD layer, one would be to >> use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts >> again. >> >> the alternative option is to use mmdiag --iohist. this shows you a >> history of the last X numbers of io operations on either the client or >> the server side like on a client : >> >> # mmdiag --iohist >> >> === mmdiag: iohist === >> >> I/O history: >> >> I/O start time RW Buf type disk:sectorNum nSec time ms qTime >> ms RpcTimes ms Type Device/NSD ID NSD server >> --------------- -- ----------- ----------------- ----- ------- >> -------- ----------------- ---- ------------------ --------------- >> 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 >> 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 >> 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 >> 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.668262 R inode 2:1081373696 8 14.117 >> 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 >> 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.692019 R inode 2:1064356608 8 14.899 >> 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:53.707100 R inode 2:1077830152 8 16.499 >> 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 >> 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 >> 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 >> 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 >> 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.906556 R inode 2:1083476520 8 11.723 >> 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 >> 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 >> 
14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 >> 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 >> 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.941441 R inode 2:1069885984 8 11.686 >> 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 >> 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 >> 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 >> 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 >> 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 >> >> you basically see if its a inode , data block , what size it has (in >> sectors) , which nsd server you did send this request to, etc. >> >> on the Server side you see the type , which physical disk it goes to >> and also what size of disk i/o it causes like : >> >> 14:26:50.129995 R inode 12:3211886376 64 14.261 >> 0.000 0.000 0.000 pd sdis >> 14:26:50.137102 R inode 19:3003969520 64 9.004 >> 0.000 0.000 0.000 pd sdad >> 14:26:50.136116 R inode 55:3591710992 64 11.057 >> 0.000 0.000 0.000 pd sdoh >> 14:26:50.141510 R inode 21:3066810504 64 5.909 >> 0.000 0.000 0.000 pd sdaf >> 14:26:50.130529 R inode 89:2962370072 64 17.437 >> 0.000 0.000 0.000 pd sddi >> 14:26:50.131063 R inode 78:1889457000 64 17.062 >> 0.000 0.000 0.000 pd sdsj >> 14:26:50.143403 R inode 36:3323035688 64 4.807 >> 0.000 0.000 0.000 pd sdmw >> 14:26:50.131044 R inode 37:2513579736 128 17.181 >> 0.000 0.000 0.000 pd sddv >> 14:26:50.138181 R inode 72:3868810400 64 10.951 >> 0.000 0.000 0.000 pd sdbz >> 14:26:50.138188 R inode 131:2443484784 128 11.792 >> 0.000 0.000 0.000 pd sdug >> 14:26:50.138003 R inode 102:3696843872 64 11.994 >> 0.000 0.000 0.000 pd sdgp >> 14:26:50.137099 R inode 145:3370922504 64 13.225 >> 0.000 0.000 0.000 pd sdmi >> 14:26:50.141576 R inode 62:2668579904 64 9.313 >> 0.000 0.000 0.000 pd sdou >> 14:26:50.134689 R inode 159:2786164648 64 16.577 >> 0.000 0.000 0.000 pd sdpq >> 14:26:50.145034 R inode 34:2097217320 64 7.409 >> 0.000 0.000 0.000 pd sdmt >> 14:26:50.138140 R inode 139:2831038792 64 14.898 >> 0.000 0.000 0.000 pd sdlw >> 14:26:50.130954 R inode 164:282120312 64 22.274 >> 0.000 0.000 0.000 pd sdzd >> 14:26:50.137038 R inode 41:3421909608 64 16.314 >> 0.000 0.000 0.000 pd sdef >> 14:26:50.137606 R inode 104:1870962416 64 16.644 >> 0.000 0.000 0.000 pd sdgx >> 14:26:50.141306 R inode 65:2276184264 64 16.593 >> 0.000 0.000 0.000 pd sdrk >> >> >> > >> > Last but not least.. and this is what i really would like to >> > accomplish, i would to be able to monitor the latency of metadata >> operations. >> >> you can't do this on the server side as you don't know how much time >> you spend on the client , network or anything between the app and the >> physical disk, so you can only reliably look at this from the client, >> the iohist output only shows you the Server disk i/o processing time, >> but that can be a fraction of the overall time (in other cases this >> obviously can also be the dominant part depending on your workload). >> >> the easiest way on the client is to run >> >> mmfsadm vfsstats enable >> from now on vfs stats are collected until you restart GPFS. 
>> >> then run : >> >> vfs statistics currently enabled >> started at: Fri Aug 29 13:15:05.380 2014 >> duration: 448446.970 sec >> >> name calls time per call total time >> -------------------- -------- -------------- -------------- >> statfs 9 0.000002 0.000021 >> startIO 246191176 0.005853 1441049.976740 >> >> to dump what ever you collected so far on this node. >> >> > In my environment there are users that litterally overhelm our >> > storages with metadata request, so even if there is no massive >> > throughput or huge waiters, any "ls" could take ages. I would like >> > to be able to monitor metadata behaviour. There is a way to to do >> > that from the NSD servers? >> >> not this simple as described above. >> >> > >> > Thanks in advance for any tip/help. >> > >> > Regards, >> > Salvatore_______________________________________________ >> > gpfsug-discuss mailing list >> > gpfsug-discuss at gpfsug.org >> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Research Facilities (ECDF) Systems Leader Information Services IT Infrastructure Division Tel: 0131 650 4994 skype: orlando.richards The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From sdinardo at ebi.ac.uk Thu Sep 4 15:07:42 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 15:07:42 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54086F1D.1000401@ed.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> Message-ID: <5408722E.6060309@ebi.ac.uk> On 04/09/14 14:54, Orlando Richards wrote: > > > On 04/09/14 14:32, Salvatore Di Nardo wrote: >> Sorry to bother you again but dstat have some issues with the plugin: >> >> [root at gss01a util]# dstat --gpfs >> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is >> deprecated. Use the subprocess module. >> pipes[cmd] = os.popen3(cmd, 't', 0) >> Module dstat_gpfs failed to load. (global name 'select' is not >> defined) >> None of the stats you selected are available. >> >> I found this solution , but involve dstat recompile.... >> >> https://github.com/dagwieers/dstat/issues/44 >> >> Are you aware about any easier solution (we use RHEL6.3) ? >> > > This worked for me the other day on a dev box I was poking at: > > # rm /usr/share/dstat/dstat_gpfsops* > > # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 > /usr/share/dstat/dstat_gpfsops.py > > # dstat --gpfsops > /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use > the subprocess module. > pipes[cmd] = os.popen3(cmd, 't', 0) > ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- > > cr del op/cl rd wr trunc fsync looku gattr sattr other > mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w > 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 > > ... > NICE!! 
The only problem is that the box seems lacking those python scripts: ls /usr/lpp/mmfs/samples/util/ makefile README tsbackup tsbackup.C tsbackup.h tsfindinode tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c tslistall tsreaddir tsreaddir.c tstimes tstimes.c Do you mind sending me those py files? They should be 3 as i see e gpfs options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Sep 4 15:14:02 2014 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 04 Sep 2014 15:14:02 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <5408722E.6060309@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> <5408722E.6060309@ebi.ac.uk> Message-ID: <540873AA.5070401@ed.ac.uk> On 04/09/14 15:07, Salvatore Di Nardo wrote: > > On 04/09/14 14:54, Orlando Richards wrote: >> >> >> On 04/09/14 14:32, Salvatore Di Nardo wrote: >>> Sorry to bother you again but dstat have some issues with the plugin: >>> >>> [root at gss01a util]# dstat --gpfs >>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is >>> deprecated. Use the subprocess module. >>> pipes[cmd] = os.popen3(cmd, 't', 0) >>> Module dstat_gpfs failed to load. (global name 'select' is not >>> defined) >>> None of the stats you selected are available. >>> >>> I found this solution , but involve dstat recompile.... >>> >>> https://github.com/dagwieers/dstat/issues/44 >>> >>> Are you aware about any easier solution (we use RHEL6.3) ? >>> >> >> This worked for me the other day on a dev box I was poking at: >> >> # rm /usr/share/dstat/dstat_gpfsops* >> >> # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 >> /usr/share/dstat/dstat_gpfsops.py >> >> # dstat --gpfsops >> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use >> the subprocess module. >> pipes[cmd] = os.popen3(cmd, 't', 0) >> ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- >> >> cr del op/cl rd wr trunc fsync looku gattr sattr other >> mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w >> 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 >> >> ... >> > > NICE!! The only problem is that the box seems lacking those python scripts: > > ls /usr/lpp/mmfs/samples/util/ > makefile README tsbackup tsbackup.C tsbackup.h tsfindinode > tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c > tslistall tsreaddir tsreaddir.c tstimes tstimes.c > It came from the gpfs.base rpm: # rpm -qf /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 gpfs.base-3.5.0-13.x86_64 > Do you mind sending me those py files? They should be 3 as i see e gpfs > options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) > Only the gpfsops.py is included in the bundle - one for dstat 0.6 and one for dstat 0.7. I've attached it to this mail as well (it seems to be GPL'd). > Regards, > Salvatore > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Research Facilities (ECDF) Systems Leader Information Services IT Infrastructure Division Tel: 0131 650 4994 skype: orlando.richards The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
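For anyone else hitting the same missing-plugin problem, a small helper along these lines can copy whichever sample variant gpfs.base shipped into dstat's plugin directory. This is only a convenience sketch: the 0.7 sample path is the one quoted above, while the naming of a 0.6 variant and /usr/share/dstat as the plugin directory on RHEL 6 are assumptions to adjust as needed.

#!/usr/bin/env python
# Convenience sketch: install the GPFS dstat plugin shipped in gpfs.base.
# The ".dstat.0.7" sample name comes from this thread; a matching 0.6
# variant and the /usr/share/dstat target directory are assumptions.
import glob, os, shutil, sys

SAMPLE_DIR = '/usr/lpp/mmfs/samples/util'
TARGET = '/usr/share/dstat/dstat_gpfsops.py'

candidates = sorted(glob.glob(os.path.join(SAMPLE_DIR, 'dstat_gpfsops.py.dstat.*')))
if not candidates:
    sys.exit('no dstat_gpfsops sample found under %s - is gpfs.base installed?'
             % SAMPLE_DIR)
shutil.copyfile(candidates[-1], TARGET)   # highest suffix wins, e.g. 0.7 over 0.6
print 'installed %s as %s - now try: dstat --gpfsops' % (candidates[-1], TARGET)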
-------------- next part -------------- # # Copyright (C) 2009, 2010 IBM Corporation # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2, or (at your option) # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. # global string, select, os, re, fnmatch import string, select, os, re, fnmatch # Dstat class to display selected gpfs performance counters returned by the # mmpmon "vfs_s", "ioc_s", "vio_s", "vflush_s", and "lroc_s" commands. # # The set of counters displayed can be customized via environment variables: # # DSTAT_GPFS_WHAT # # Selects which of the five mmpmon commands to display. # It is a comma separated list of any of the following: # "vfs": show mmpmon "vfs_s" counters # "ioc": show mmpmon "ioc_s" counters related to NSD client I/O # "nsd": show mmpmon "ioc_s" counters related to NSD server I/O # "vio": show mmpmon "vio_s" counters # "vflush": show mmpmon "vflush_s" counters # "lroc": show mmpmon "lroc_s" counters # "all": equivalent to specifying all of the above # # Example: # # DSTAT_GPFS_WHAT=vfs,lroc dstat -M gpfsops # # will display counters for mmpmon "vfs_s" and "lroc" commands. # # The default setting is "vfs,ioc", i.e., by default only "vfs_s" and NSD # client related "ioc_s" counters are displayed. # # DSTAT_GPFS_VFS # DSTAT_GPFS_IOC # DSTAT_GPFS_VIO # DSTAT_GPFS_VFLUSH # DSTAT_GPFS_LROC # # Allow finer grain control over exactly which values will be displayed for # each of the five mmpmon commands. Each variable is a comma separated list # of counter names with optional column header string. # # Example: # # export DSTAT_GPFS_VFS='create, remove, rd/wr=read+write' # export DSTAT_GPFS_IOC='sync*, logwrap*, oth_rd=*_rd, oth_wr=*_wr' # dstat -M gpfsops # # Under "vfs-ops" this will display three columns, showing creates, deletes # (removes), and a third column labelled "rd/wr" with a combined count of # read and write operations. # Under "disk-i/o" it will display four columns, showing all disk I/Os # initiated by sync, and log wrap, plus two columns labeled "oth_rd" and # "oth_wr" showing counts of all other disk reads and disk writes, # respectively. # # Note: setting one of these environment variables overrides the # corrosponding setting in DSTAT_GPFS_WHAT. For example, setting # DSTAT_GPFS_VFS="" will omit all "vfs_s" counters regardless of whether # "vfs" appears in DSTAT_GPFS_WHAT or not. # # Counter sets are specified as a comma-separated list of entries of one # of the following forms # # counter # label = counter # label = counter1 + counter2 + ... # # If no label is specified, the name of the counter is used as the column # header (truncated to 5 characters). # Counter names may contain shell-style wildcards. For example, the # pattern "sync*" matches the two ioc_s counters "sync_rd" and "sync_wr" and # therefore produce a column containing the combined count of disk reads and # disk writes initiated by sync. 
If a counter appears in or matches a name # pattern in more than one entry, it is included only in the count under the # first entry in which it appears. For example, adding an entry "other = *" # at the end of the list will add a column labeled "other" that shows the # sum of all counter values *not* included in any of the previous columns. # # DSTAT_GPFS_LIST=1 dstat -M gpfsops # # This will show all available counter names and the default definition # for which sets of counter values are displayed. # # An alternative to setting environment variables is to create a file # ~/.dstat_gpfs_rc # with python statements that sets any of the following variables # vfs_wanted: equivalent to setting DSTAT_GPFS_VFS # ioc_wanted: equivalent to setting DSTAT_GPFS_IOC # vio_wanted: equivalent to setting DSTAT_GPFS_VIO # vflush_wanted: equivalent to setting DSTAT_GPFS_VFLUSH # lroc_wanted: equivalent to setting DSTAT_GPFS_LROC # # For example, the following ~/.dstat_gpfs_rc file will produce the same # result as the environment variables in the example above: # # vfs_wanted = 'create, remove, rd/wr=read+write' # ioc_wanted = 'sync*, logwrap*, oth_rd=*_rd, oth_wr=*_wr' # # See also the default vfs_wanted, ioc_wanted, and vio_wanted settings in # the dstat_gpfsops __init__ method below. class dstat_plugin(dstat): def __init__(self): # list of all stats counters returned by mmpmon "vfs_s", "ioc_s", "vio_s", "vflush_s", and "lroc_s" # always ignore the first few chars like : io_s _io_s_ _n_ 172.31.136.2 _nn_ mgmt001st001 _rc_ 0 _t_ 1322526286 _tu_ 415518 vfs_keys = ('_access_', '_close_', '_create_', '_fclear_', '_fsync_', '_fsync_range_', '_ftrunc_', '_getattr_', '_link_', '_lockctl_', '_lookup_', '_map_lloff_', '_mkdir_', '_mknod_', '_open_', '_read_', '_write_', '_mmapRead_', '_mmapWrite_', '_aioRead_', '_aioWrite_','_readdir_', '_readlink_', '_readpage_', '_remove_', '_rename_', '_rmdir_', '_setacl_', '_setattr_', '_symlink_', '_unmap_', '_writepage_', '_tsfattr_', '_tsfsattr_', '_flock_', '_setxattr_', '_getxattr_', '_listxattr_', '_removexattr_', '_encode_fh_', '_decode_fh_', '_get_dentry_', '_get_parent_', '_mount_', '_statfs_', '_sync_', '_vget_') ioc_keys = ('_other_rd_', '_other_wr_','_mb_rd_', '_mb_wr_', '_steal_rd_', '_steal_wr_', '_cleaner_rd_', '_cleaner_wr_', '_sync_rd_', '_sync_wr_', '_logwrap_rd_', '_logwrap_wr_', '_revoke_rd_', '_revoke_wr_', '_prefetch_rd_', '_prefetch_wr_', '_logdata_rd_', '_logdata_wr_', '_nsdworker_rd_', '_nsdworker_wr_','_nsdlocal_rd_','_nsdlocal_wr_', '_vdisk_rd_','_vdisk_wr_', '_pdisk_rd_','_pdisk_wr_', '_logtip_rd_', '_logtip_wr_') vio_keys = ('_r_', '_sw_', '_mw_', '_pfw_', '_ftw_', '_fuw_', '_fpw_', '_m_', '_s_', '_l_', '_rgd_', '_meta_') vflush_keys = ('_ndt_', '_ngdb_', '_nfwlmb_', '_nfipt_', '_nfwwt_', '_ahwm_', '_susp_', '_uwrttf_', '_fftc_', '_nalth_', '_nasth_', '_nsigth_', '_ntgtth_') lroc_keys = ('_Inode_s_', '_Inode_sf_', '_Inode_smb_', '_Inode_r_', '_Inode_rf_', '_Inode_rmb_', '_Inode_i_', '_Inode_imb_', '_Directory_s_', '_Directory_sf_', '_Directory_smb_', '_Directory_r_', '_Directory_rf_', '_Directory_rmb_', '_Directory_i_', '_Directory_imb_', '_Data_s_', '_Data_sf_', '_Data_smb_', '_Data_r_', '_Data_rf_', '_Data_rmb_', '_Data_i_', '_Data_imb_', '_agt_i_', '_agt_i_rm_', '_agt_i_rM_', '_agt_i_ra_', '_agt_r_', '_agt_r_rm_', '_agt_r_rM_', '_agt_r_ra_', '_ssd_w_', '_ssd_w_p_', '_ssd_w_rm_', '_ssd_w_rM_', '_ssd_w_ra_', '_ssd_r_', '_ssd_r_p_', '_ssd_r_rm_', '_ssd_r_rM_', '_ssd_r_ra_') # Default counters to display for each mmpmon category vfs_wanted = '''cr 
= create + mkdir + link + symlink, del = remove + rmdir, op/cl = open + close + map_lloff + unmap, rd = read + readdir + readlink + mmapRead + readpage + aioRead + aioWrite, wr = write + mmapWrite + writepage, trunc = ftrunc + fclear, fsync = fsync + fsync_range, lookup, gattr = access + getattr + getxattr + getacl, sattr = setattr + setxattr + setacl, other = * ''' ioc_wanted1 = '''mb_rd, mb_wr, pref=prefetch_rd, wrbeh=prefetch_wr, steal*, cleaner*, sync*, revoke*, logwrap*, logdata*, oth_rd = other_rd, oth_wr = other_wr ''' ioc_wanted2 = '''rns_r=nsdworker_rd, rns_w=nsdworker_wr, lns_r=nsdlocal_rd, lns_w=nsdlocal_wr, vd_r=vdisk_rd, vd_w=vdisk_wr, pd_r=pdisk_rd, pd_w=pdisk_wr, ''' vio_wanted = '''ClRead=r, ClShWr=sw, ClMdWr=mw, ClPFTWr=pfw, ClFTWr=ftw, FlUpWr=fuw, FlPFTWr=fpw, Migrte=m, Scrub=s, LgWr=l, RGDsc=rgd, Meta=meta ''' vflush_wanted = '''DiTrk = ndt, DiBuf = ngdb, FwLog = nfwlmb, FinPr = nfipt, WraTh = nfwwt, HiWMa = ahwm, Suspd = susp, WrThF = uwrttf, Force = fftc, TrgTh = ntgtth, other = nalth + nasth + nsigth ''' lroc_wanted = '''StorS = Inode_s + Directory_s + Data_s, StorF = Inode_sf + Directory_sf + Data_sf, FetcS = Inode_r + Directory_r + Data_r, FetcF = Inode_rf + Directory_rf + Data_rf, InVAL = Inode_i + Directory_i + Data_i ''' # Coarse counter selection via DSTAT_GPFS_WHAT if 'DSTAT_GPFS_WHAT' in os.environ: what_wanted = os.environ['DSTAT_GPFS_WHAT'].split(',') else: what_wanted = [ 'vfs', 'ioc' ] # If ".dstat_gpfs_rc" exists in user's home directory, run it. # Otherwise, use DSTAT_GPFS_WHAT for counter selection and look for other # DSTAT_GPFS_XXX environment variables for additional customization. userprofile = os.path.join(os.environ['HOME'], '.dstat_gpfs_rc') if os.path.exists(userprofile): ioc_wanted = ioc_wanted1 + ioc_wanted2 exec file(userprofile) else: if 'all' not in what_wanted: if 'vfs' not in what_wanted: vfs_wanted = '' if 'ioc' not in what_wanted: ioc_wanted1 = '' if 'nsd' not in what_wanted: ioc_wanted2 = '' if 'vio' not in what_wanted: vio_wanted = '' if 'vflush' not in what_wanted: vflush_wanted = '' if 'lroc' not in what_wanted: lroc_wanted = '' ioc_wanted = ioc_wanted1 + ioc_wanted2 # Fine grain counter cusomization via DSTAT_GPFS_XXX if 'DSTAT_GPFS_VFS' in os.environ: vfs_wanted = os.environ['DSTAT_GPFS_VFS'] if 'DSTAT_GPFS_IOC' in os.environ: ioc_wanted = os.environ['DSTAT_GPFS_IOC'] if 'DSTAT_GPFS_VIO' in os.environ: vio_wanted = os.environ['DSTAT_GPFS_VIO'] if 'DSTAT_GPFS_VFLUSH' in os.environ: vflush_wanted = os.environ['DSTAT_GPFS_VFLUSH'] if 'DSTAT_GPFS_LROC' in os.environ: lroc_wanted = os.environ['DSTAT_GPFS_LROC'] self.debug = 0 vars1, nick1, keymap1 = self.make_keymap(vfs_keys, vfs_wanted, 'gpfs-vfs-') vars2, nick2, keymap2 = self.make_keymap(ioc_keys, ioc_wanted, 'gpfs-io-') vars3, nick3, keymap3 = self.make_keymap(vio_keys, vio_wanted, 'gpfs-vio-') vars4, nick4, keymap4 = self.make_keymap(vflush_keys, vflush_wanted, 'gpfs-vflush-') vars5, nick5, keymap5 = self.make_keymap(lroc_keys, lroc_wanted, 'gpfs-lroc-') if 'DSTAT_GPFS_LIST' in os.environ or self.debug: self.show_keymap('vfs_s', 'DSTAT_GPFS_VFS', vfs_keys, vfs_wanted, vars1, keymap1, 'gpfs-vfs-') self.show_keymap('ioc_s', 'DSTAT_GPFS_IOC', ioc_keys, ioc_wanted, vars2, keymap2, 'gpfs-io-') self.show_keymap('vio_s', 'DSTAT_GPFS_VIO', vio_keys, vio_wanted, vars3, keymap3, 'gpfs-vio-') self.show_keymap('vflush_stat', 'DSTAT_GPFS_VFLUSH', vflush_keys, vflush_wanted, vars4, keymap4, 'gpfs-vflush-') self.show_keymap('lroc_s', 'DSTAT_GPFS_LROC', lroc_keys, lroc_wanted, vars5, keymap5, 
'gpfs-lroc-') print self.vars = vars1 + vars2 + vars3 + vars4 + vars5 self.varsrate = vars1 + vars2 + vars3 + vars5 self.varsconst = vars4 self.nick = nick1 + nick2 + nick3 + nick4 + nick5 self.vfs_keymap = keymap1 self.ioc_keymap = keymap2 self.vio_keymap = keymap3 self.vflush_keymap = keymap4 self.lroc_keymap = keymap5 names = [] self.addtitle(names, 'gpfs vfs ops', len(vars1)) self.addtitle(names, 'gpfs disk i/o', len(vars2)) self.addtitle(names, 'gpfs vio', len(vars3)) self.addtitle(names, 'gpfs vflush', len(vars4)) self.addtitle(names, 'gpfs lroc', len(vars5)) self.name = '#'.join(names) self.type = 'd' self.width = 5 self.scale = 1000 def make_keymap(self, keys, wanted, prefix): '''Parse the list of counter values to be displayd "keys" is the list of all available counters "wanted" is a string of the form "name1 = key1 + key2 + ..., name2 = key3 + key4 ..." Returns a list of all names found, e.g. ['name1', 'name2', ...], and a dictionary that maps counters to names, e.g., { 'key1': 'name1', 'key2': 'name1', 'key3': 'name2', ... }, ''' vars = [] nick = [] kmap = {} ## print re.split(r'\s*,\s*', wanted.strip()) for n in re.split(r'\s*,\s*', wanted.strip()): l = re.split(r'\s*=\s*', n, 2) if len(l) == 2: v = l[0] kl = re.split(r'\s*\+\s*', l[1]) elif l[0]: v = l[0].strip('*') kl = l else: continue nick.append(v[0:5]) v = prefix + v.replace('/', '-') vars.append(v) for s in kl: for k in keys: if fnmatch.fnmatch(k.strip('_'), s) and k not in kmap: kmap[k] = v return vars, nick, kmap def show_keymap(self, label, envname, keys, wanted, vars, kmap, prefix): 'show available counter names and current counter set definition' linewd = 100 print '\nAvailable counters for "%s":' % label mlen = max([len(k.strip('_')) for k in keys]) ncols = linewd // (mlen + 1) nrows = (len(keys) + ncols - 1) // ncols for r in range(nrows): print ' ', for c in range(ncols): i = c *nrows + r if not i < len(keys): break print keys[i].strip('_').ljust(mlen), print print '\nCurrent counter set selection:' print "\n%s='%s'\n" % (envname, re.sub(r'\s+', '', wanted).strip().replace(',', ', ')) if not vars: return mlen = 5 for v in vars: if v.startswith(prefix): s = v[len(prefix):] else: s = v n = ' %s = ' % s[0:mlen].rjust(mlen) kl = [ k.strip('_') for k in keys if kmap.get(k) == v ] i = 0 while i < len(kl): slen = len(n) + 3 + len(kl[i]) j = i + 1 while j < len(kl) and slen + 3 + len(kl[j]) < linewd: slen += 3 + len(kl[j]) j += 1 print n + ' + '.join(kl[i:j]) i = j n = ' %s + ' % ''.rjust(mlen) def addtitle(self, names, name, ncols): 'pad title given by "name" with minus signs to span "ncols" columns' if ncols == 1: names.append(name.split()[-1].center(6*ncols - 1)) elif ncols > 1: names.append(name.center(6*ncols - 1)) def check(self): 'start mmpmon command' if os.access('/usr/lpp/mmfs/bin/mmpmon', os.X_OK): try: self.stdin, self.stdout, self.stderr = dpopen('/usr/lpp/mmfs/bin/mmpmon -p -s') self.stdin.write('reset\n') readpipe(self.stdout) except IOError: raise Exception, 'Cannot interface with gpfs mmpmon binary' return True raise Exception, 'Needs GPFS mmpmon binary' def extract_vfs(self): 'collect "vfs_s" counter values' self.stdin.write('vfs_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, len(l), 3): try: self.set2[self.vfs_keymap[l[i]]] += long(l[i+1]) except KeyError: pass def extract_ioc(self): 'collect "ioc_s" counter values' self.stdin.write('ioc_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, 
len(l), 3): try: self.set2[self.ioc_keymap[l[i]+'rd_']] += long(l[i+1]) except KeyError: pass try: self.set2[self.ioc_keymap[l[i]+'wr_']] += long(l[i+2]) except KeyError: pass def extract_vio(self): 'collect "vio_s" counter values' self.stdin.write('vio_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(19, len(l), 2): try: if l[i] in self.vio_keymap: self.set2[self.vio_keymap[l[i]]] += long(l[i+1]) except KeyError: pass def extract_vflush(self): 'collect "vflush_stat" counter values' self.stdin.write('vflush_stat\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, len(l), 2): try: if l[i] in self.vflush_keymap: self.set2[self.vflush_keymap[l[i]]] += long(l[i+1]) except KeyError: pass def extract_lroc(self): 'collect "lroc_s" counter values' self.stdin.write('lroc_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, len(l), 2): try: if l[i] in self.lroc_keymap: self.set2[self.lroc_keymap[l[i]]] += long(l[i+1]) except KeyError: pass def extract(self): try: for name in self.vars: self.set2[name] = 0 self.extract_ioc() self.extract_vfs() self.extract_vio() self.extract_vflush() self.extract_lroc() for name in self.varsrate: self.val[name] = (self.set2[name] - self.set1[name]) * 1.0 / elapsed for name in self.varsconst: self.val[name] = self.set2[name] except IOError, e: for name in self.vars: self.val[name] = -1 ## print 'dstat_gpfs: lost pipe to mmpmon,', e except Exception, e: for name in self.vars: self.val[name] = -1 print 'dstat_gpfs: exception', e if self.debug >= 0: self.debug -= 1 if step == op.delay: self.set1.update(self.set2) From ewahl at osc.edu Thu Sep 4 15:13:48 2014 From: ewahl at osc.edu (Ed Wahl) Date: Thu, 4 Sep 2014 14:13:48 +0000 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> References: <54074F90.7000303@ebi.ac.uk>, <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> Message-ID: Another known issue with slow "ls" can be the annoyance that is 'sssd' under newer OSs (rhel 6) and properly configuring this for remote auth. I know on my nsd's I never did and the first ls in a directory where the cache is expired takes forever to make all the remote LDAP calls to get the UID info. bleh. Ed ________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of service at metamodul.com [service at metamodul.com] Sent: Thursday, September 04, 2014 6:05 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] gpfs performance monitoring > , any "ls" could take ages. Check if you large directories either with many files or simply large. Verify if you have NFS exported GPFS. Verify that your cache settings on the clients are large enough ( maxStatCache , maxFilesToCache , sharedMemLimit ) Verify that you have dedicated metadata luns ( metadataOnly ) Reference: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters Note: If possible monitor your metadata luns on the storage directly. hth Hajo -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sdinardo at ebi.ac.uk Thu Sep 4 15:18:02 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 15:18:02 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540873AA.5070401@ed.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> <5408722E.6060309@ebi.ac.uk> <540873AA.5070401@ed.ac.uk> Message-ID: <5408749A.9080306@ebi.ac.uk> On 04/09/14 15:14, Orlando Richards wrote: > > > On 04/09/14 15:07, Salvatore Di Nardo wrote: >> >> On 04/09/14 14:54, Orlando Richards wrote: >>> >>> >>> On 04/09/14 14:32, Salvatore Di Nardo wrote: >>>> Sorry to bother you again but dstat have some issues with the plugin: >>>> >>>> [root at gss01a util]# dstat --gpfs >>>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is >>>> deprecated. Use the subprocess module. >>>> pipes[cmd] = os.popen3(cmd, 't', 0) >>>> Module dstat_gpfs failed to load. (global name 'select' is not >>>> defined) >>>> None of the stats you selected are available. >>>> >>>> I found this solution , but involve dstat recompile.... >>>> >>>> https://github.com/dagwieers/dstat/issues/44 >>>> >>>> Are you aware about any easier solution (we use RHEL6.3) ? >>>> >>> >>> This worked for me the other day on a dev box I was poking at: >>> >>> # rm /usr/share/dstat/dstat_gpfsops* >>> >>> # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 >>> /usr/share/dstat/dstat_gpfsops.py >>> >>> # dstat --gpfsops >>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use >>> the subprocess module. >>> pipes[cmd] = os.popen3(cmd, 't', 0) >>> ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- >>> >>> >>> cr del op/cl rd wr trunc fsync looku gattr sattr other >>> mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w >>> 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 0 0 0 0 0 0 0 0 0 0 >>> >>> ... >>> >> >> NICE!! The only problem is that the box seems lacking those python >> scripts: >> >> ls /usr/lpp/mmfs/samples/util/ >> makefile README tsbackup tsbackup.C tsbackup.h tsfindinode >> tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c >> tslistall tsreaddir tsreaddir.c tstimes tstimes.c >> > > It came from the gpfs.base rpm: > > # rpm -qf /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 > gpfs.base-3.5.0-13.x86_64 > > >> Do you mind sending me those py files? They should be 3 as i see e gpfs >> options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) >> > > Only the gpfsops.py is included in the bundle - one for dstat 0.6 and > one for dstat 0.7. > > > I've attached it to this mail as well (it seems to be GPL'd). > Thanks. From J.R.Jones at soton.ac.uk Thu Sep 4 16:15:48 2014 From: J.R.Jones at soton.ac.uk (Jones J.R.) Date: Thu, 4 Sep 2014 15:15:48 +0000 Subject: [gpfsug-discuss] Building the portability layer for Xeon Phi Message-ID: <1409843748.7733.31.camel@uos-204812.clients.soton.ac.uk> Hi folks Has anyone managed to successfully build the portability layer for Xeon Phi? At the moment we are having to export the GPFS mounts from the host machine over NFS, which is proving rather unreliable. 
Jess From oehmes at us.ibm.com Fri Sep 5 01:48:40 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Thu, 4 Sep 2014 17:48:40 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54084258.90508@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> Message-ID: ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM: > From: Salvatore Di Nardo > To: gpfsug main discussion list > Date: 09/04/2014 03:44 AM > Subject: Re: [gpfsug-discuss] gpfs performance monitoring > Sent by: gpfsug-discuss-bounces at gpfsug.org > > On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just > show the value since start (or last reset) time. > OK, you confirmed my toughts, thatks > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. > what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. > No.. what I'm looking its exactly how the disks are busy to keep the > requests. Obviously i'm not looking just that, but I feel the needs > to monitor also those things. Ill explain you why. 
> > It happens when our storage is quite busy ( 180Gb/s of read/write ) > that the FS start to be slowin normal cd or ls requests. This might > be normal, but in those situation i want to know where the > bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing > where the bottlenek is might help me to understand if we can tweak > the system a bit more. if cd or ls is very slow in GPFS in the majority of the cases it has nothing to do with NSD Server bottlenecks, only indirect. the main reason ls is slow in the field is you have some very powerful nodes that all do buffered writes into the same directory into 1 or multiple files while you do the ls on a different node. what happens now is that the ls you did run most likely is a alias for ls -l or something even more complex with color display, etc, but the point is it most likely returns file size. GPFS doesn't lie about the filesize, we only return accurate stat informations and while this is arguable, its a fact today. so what happens is that the stat on each file triggers a token revoke on the node that currently writing to the file you do stat on, lets say it has 1 gb of dirty data in its memory for this file (as its writes data buffered) this 1 GB of data now gets written to the NSD server, the client updates the inode info and returns the correct size. lets say you have very fast network and you have a fast storage device like GSS (which i see you have) it will be able to do this in a few 100 ms, but the problem is this now happens serialized for each single file in this directory that people write into as for each we need to get the exact stat info to satisfy your ls -l request. this is what takes so long, not the fact that the storage device might be slow or to much metadata activity is going on , this is token , means network traffic and obviously latency dependent. the best way to see this is to look at waiters on the client where you run the ls and see what they are waiting for. there are various ways to tune this to get better 'felt' ls responses but its not completely going away if all you try to with ls is if there is a file in the directory run unalias ls and check if ls after that runs fast as it shouldn't do the -l under the cover anymore. > > If its the CPU on the servers then there is no much to do beside > replacing or add more servers.If its not the CPU, maybe more memory > would help? Maybe its just the network that filled up? so i can add > more links > > Or if we reached the point there the bottleneck its the spindles, > then there is no much point o look somethere else, we just reached > the hardware limit.. > > Sometimes, it also happens that there is very low IO (10Gb/s ), > almost no cpu usage on the servers but huge slownes ( ls can take 10 > seconds). Why that happens? There is not much data ops , but we > think there is a huge ammount of metadata ops. So what i want to > know is if the metadata vdisks are busy or not. If this is our > problem, could some SSD disks dedicated to metadata help? the answer if ssd's would help or not are hard to say without knowing the root case and as i tried to explain above the most likely case is token revoke, not disk i/o. obviously as more busy your disks are as longer the token revoke will take. > > > In particular im, a bit puzzled with the design of our GSS storage. > Each recovery groups have 3 declustered arrays, and each declustered > aray have 1 data and 1 metadata vdisk, but in the end both metadata > and data vdisks use the same spindles. 
The problem that, its that I > dont understand if we have a metadata bottleneck there. Maybe some > SSD disks in a dedicated declustered array would perform much > better, but this is just theory. I really would like to be able to > monitor IO activities on the metadata vdisks. the short answer is we WANT the metadata disks to be with the data disks on the same spindles. compared to other storage systems, GSS is capable to handle different raid codes for different virtual disks on the same physical disks, this way we create raid1'ish 'LUNS' for metadata and raid6'is 'LUNS' for data so the small i/o penalty for a metadata is very small compared to a read/modify/write on the data disks. > > > > > so the Layer you care about is the NSD Server layer, which sits on > top of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be > to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats > counts again. > > Counters its not a problem. I can collect them and create some > graphs in a monitoring tool. I will check that. if you (let) upgrade your system to GSS 2.0 you get a graphical monitoring as part of it. if you want i can send you some direct email outside the group with additional informations on that. > > the alternative option is to use mmdiag --iohist. this shows you a > history of the last X numbers of io operations on either the client > or the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms > qTime ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 > 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 > 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 > 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 > 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 
R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 > 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 > 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 > 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 > 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. > > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > > mmdiag --iohist its another think i looked at it, but i could not > find good explanation for all the "buf type" ( third column ) > allocSeg > data > iallocSeg > indBlock > inode > LLIndBlock > logData > logDesc > logWrap > metadata > vdiskAULog > vdiskBuf > vdiskFWLog > vdiskMDLog > vdiskMeta > vdiskRGDesc > If i want to monifor metadata operation whan should i look at? just inodes =inodes , *alloc* = file or data allocation blocks , *ind* = indirect blocks (for very large files) and metadata , everyhing else is data or internal i/o's > the metadata flag or also inode? this command takes also long to > run, especially if i run it a second time it hangs for a lot before > to rerun again, so i'm not sure that run it every 30secs or minute > its viable, but i will look also into that. THere is any > documentation that descibes clearly the whole output? what i found > its quite generic and don't go into details... 
the reason it takes so long is because it collects 10's of thousands of i/os in a table and to not slow down the system when we dump the data we copy it to a separate buffer so we don't need locks :-) you can adjust the number of entries you want to collect by adjusting the ioHistorySize config parameter > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and > the physical disk, so you can only reliably look at this from the > client, the iohist output only shows you the Server disk i/o > processing time, but that can be a fraction of the overall time (in > other cases this obviously can also be the dominant part depending > on your workload). > > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. > > We already do that, but as I said, I want to check specifically how > gss servers are keeping the requests to identify or exlude server > side bottlenecks. > > > Thanks for your help, you gave me definitely few things where to look at. > > Salvatore > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at us.ibm.com Fri Sep 5 01:53:17 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Thu, 4 Sep 2014 17:53:17 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <5408722E.6060309@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> <5408722E.6060309@ebi.ac.uk> Message-ID: if you don't have the files you need to update to a newer version of the GPFS client software on the node. they started shipping with 3.5.0.13 even you get the files you still wouldn't see many values as they never got exposed before. some more details are in a presentation i gave earlier this year which is archived in the list or here --> http://www.gpfsug.org/wp-content/uploads/2014/05/UG10_GPFS_Performance_Session_v10.pdf Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Salvatore Di Nardo To: gpfsug-discuss at gpfsug.org Date: 09/04/2014 07:08 AM Subject: Re: [gpfsug-discuss] gpfs performance monitoring Sent by: gpfsug-discuss-bounces at gpfsug.org On 04/09/14 14:54, Orlando Richards wrote: On 04/09/14 14:32, Salvatore Di Nardo wrote: Sorry to bother you again but dstat have some issues with the plugin: [root at gss01a util]# dstat --gpfs /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) Module dstat_gpfs failed to load. (global name 'select' is not defined) None of the stats you selected are available. 
I found this solution , but involve dstat recompile.... https://github.com/dagwieers/dstat/issues/44 Are you aware about any easier solution (we use RHEL6.3) ? This worked for me the other day on a dev box I was poking at: # rm /usr/share/dstat/dstat_gpfsops* # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 /usr/share/dstat/dstat_gpfsops.py # dstat --gpfsops /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- cr del op/cl rd wr trunc fsync looku gattr sattr other mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... NICE!! The only problem is that the box seems lacking those python scripts: ls /usr/lpp/mmfs/samples/util/ makefile README tsbackup tsbackup.C tsbackup.h tsfindinode tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c tslistall tsreaddir tsreaddir.c tstimes tstimes.c Do you mind sending me those py files? They should be 3 as i see e gpfs options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) Regards, Salvatore _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Fri Sep 5 10:29:27 2014 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Fri, 05 Sep 2014 10:29:27 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54084258.90508@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> Message-ID: <1409909367.30257.151.camel@buzzard.phy.strath.ac.uk> On Thu, 2014-09-04 at 11:43 +0100, Salvatore Di Nardo wrote: [SNIP] > > Sometimes, it also happens that there is very low IO (10Gb/s ), almost > no cpu usage on the servers but huge slownes ( ls can take 10 > seconds). Why that happens? There is not much data ops , but we think > there is a huge ammount of metadata ops. So what i want to know is if > the metadata vdisks are busy or not. If this is our problem, could > some SSD disks dedicated to metadata help? > This is almost always because you are using an external LDAP/NIS server for GECOS information and the values that you need are not cached for whatever reason and you are having to look them up again. Note that the standard aliasing for RHEL based distros of ls also causes it to do a stat on every file for the colouring etc. Also be aware that if you are trying to fill out your cd with TAB auto-completion you will run into similar issues. That is had you typed the path for the cd out in full you would get in instantly, doing a couple of letters and hitting cd it could take a while. You can test this on a RHEL based distro by doing "/bin/ls -n" The idea being to avoid any aliasing and not look up GECOS data and just report the raw numerical stuff. What I would suggest is that you set the cache time on UID/GID lookups for positive lookups to a long time, in general as long as possible because the values should almost never change. Even for a positive look up of a group membership I would have that cached for a couple of hours. For negative lookups something like five or 10 minutes is a good starting point. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. 
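A quick way to tell the filesystem stat time apart from the GECOS/name-service lookup time described above is to time the two steps separately on the node where ls is slow. The following is only a sketch using standard Python calls (nothing GPFS-specific), roughly splitting what "/bin/ls -n" does from the extra work "ls -l" adds:

#!/usr/bin/env python
# Hedged sketch: separate "stat the files" time from "resolve UID/GID names" time
# for one directory, to see whether a slow ls points at GPFS or at LDAP/NIS/nscd/sssd.
import grp
import os
import pwd
import sys
import time

def main(path):
    entries = os.listdir(path)

    # Pass 1: raw stat calls only -- roughly what "/bin/ls -n" has to do.
    stats = []
    t0 = time.time()
    for name in entries:
        try:
            stats.append(os.lstat(os.path.join(path, name)))
        except OSError:
            pass  # entry vanished between listdir() and lstat()
    t_stat = time.time() - t0

    # Pass 2: UID/GID -> name lookups -- the extra work "ls -l" adds, which goes
    # to LDAP/NIS (or the nscd/sssd caches) rather than to the filesystem.
    t0 = time.time()
    for st in stats:
        try:
            pwd.getpwuid(st.st_uid)
            grp.getgrgid(st.st_gid)
        except KeyError:
            pass  # unknown UID/GID; ls would just print the number
    t_lookup = time.time() - t0

    print('%d entries: stat %.3fs, uid/gid name lookups %.3fs' %
          (len(entries), t_stat, t_lookup))

if __name__ == '__main__':
    main(sys.argv[1] if len(sys.argv) > 1 else '.')

If the first number dominates, the time is going into stat/token traffic on the GPFS side as Sven described; if the second dominates on a cold name-service cache but collapses on a re-run, the directory-service caching is what needs tuning.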
From sdinardo at ebi.ac.uk Fri Sep 5 11:56:37 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Fri, 05 Sep 2014 11:56:37 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> Message-ID: <540996E5.5000502@ebi.ac.uk>
A little clarification: our ls is plain ls, there is no alias. Consider that all of this is already set up properly, as EBI has run high-performance computing farms for many years, so those things were fixed a long time ago. We have very little experience with GPFS, but good knowledge of LSF farms, and we run multiple NFS storage systems (several petabytes in size). About NIS: all clients run NSCD, which caches that information to avoid exactly this type of slowness; in fact when ls is slow, ls -n is slow too. Besides that, a "cd" sometimes hangs as well, so it has nothing to do with fetching attributes.
Just to clarify a bit more: GSS usually seems to work fine. We have users who run jobs on the farms that push 180Gb/s of reads (reading and writing files of 100GB size). GPFS works very well there, where other systems had performance problems accessing portions of such huge files. Sadly, on the other hand, other users run jobs that do a huge amount of metadata operations: tons of ls in directories with many files, or creating a silly number of temporary files just to synchronize jobs between the farm nodes, or to store temporary data for a few milliseconds and then immediately delete it. Imagine constantly creating thousands of files just to write a few bytes, then deleting them a few milliseconds later... When that happens we see 10-15Gb/s of throughput and low CPU usage on the servers (80% idle), but any cd, ls or whatever takes a few seconds.
So my question is whether the bottleneck could be the spindles, or whether the clients could be tuned a bit more. I read your PDF and all the parameters seem well configured already, except "maxFilesToCache", but I'm not sure how we should set a few of those parameters on the clients. As an example, I cannot imagine a client needing a 38G pagepool, so what is the correct pagepool size for a client? What about these others: maxFilesToCache, maxBufferDescs, worker1Threads, worker3Threads? Right now all the clients have a 1 GB pagepool. In theory we can afford more (I think we can easily go up to 8GB) as they have plenty of available memory. If it would help we can do that, but do the clients really need more than 1G? They are just clients after all, so their memory should in theory go to jobs, not just to "caching".
Last question about "maxFilesToCache": you say it must be large on small clusters but small on large clusters. What do you consider 6 servers and almost 700 clients?
on clienst we have: maxFilesToCache 4000 on servers we have maxFilesToCache 12288 Regards, Salvatore On 05/09/14 01:48, Sven Oehme wrote: > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM: > > > From: Salvatore Di Nardo > > To: gpfsug main discussion list > > Date: 09/04/2014 03:44 AM > > Subject: Re: [gpfsug-discuss] gpfs performance monitoring > > Sent by: gpfsug-discuss-bounces at gpfsug.org > > > > On 04/09/14 01:50, Sven Oehme wrote: > > > Hello everybody, > > > > Hi > > > > > here i come here again, this time to ask some hint about how to > > monitor GPFS. > > > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > > that they return number based only on the request done in the > > > current host, so i have to run them on all the clients ( over 600 > > > nodes) so its quite unpractical. Instead i would like to know from > > > the servers whats going on, and i came across the vio_s statistics > > > wich are less documented and i dont know exacly what they mean. > > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > > runs VIO_S. > > > > > > My problems with the output of this command: > > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > > timestamp: 1409763206/477366 > > > recovery group: * > > > declustered array: * > > > vdisk: * > > > client reads: 2584229 > > > client short writes: 55299693 > > > client medium writes: 190071 > > > client promoted full track writes: 465145 > > > client full track writes: 9249 > > > flushed update writes: 4187708 > > > flushed promoted full track writes: 123 > > > migrate operations: 114 > > > scrub operations: 450590 > > > log writes: 28509602 > > > > > > it sais "VIOPS per second", but they seem to me just counters as > > > every time i re-run the command, the numbers increase by a bit.. > > > Can anyone confirm if those numbers are counter or if they are > OPS/sec. > > > > the numbers are accumulative so everytime you run them they just > > show the value since start (or last reset) time. > > OK, you confirmed my toughts, thatks > > > > > > > > > On a closer eye about i dont understand what most of thosevalues > > > mean. For example, what exacly are "flushed promoted full track > write" ?? > > > I tried to find a documentation about this output , but could not > > > find any. can anyone point me a link where output of vio_s is > explained? > > > > > > Another thing i dont understand about those numbers is if they are > > > just operations, or the number of blocks that was read/write/etc . > > > > its just operations and if i would explain what the numbers mean i > > might confuse you even more because this is not what you are really > > looking for. > > what you are looking for is what the client io's look like on the > > Server side, while the VIO layer is the Server side to the disks, so > > one lever lower than what you are looking for from what i could read > > out of the description above. > > No.. what I'm looking its exactly how the disks are busy to keep the > > requests. Obviously i'm not looking just that, but I feel the needs > > to monitor also those things. Ill explain you why. 
> > > > It happens when our storage is quite busy ( 180Gb/s of read/write ) > > that the FS start to be slowin normal cd or ls requests. This might > > be normal, but in those situation i want to know where the > > bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing > > where the bottlenek is might help me to understand if we can tweak > > the system a bit more. > > if cd or ls is very slow in GPFS in the majority of the cases it has > nothing to do with NSD Server bottlenecks, only indirect. > the main reason ls is slow in the field is you have some very powerful > nodes that all do buffered writes into the same directory into 1 or > multiple files while you do the ls on a different node. what happens > now is that the ls you did run most likely is a alias for ls -l or > something even more complex with color display, etc, but the point is > it most likely returns file size. GPFS doesn't lie about the filesize, > we only return accurate stat informations and while this is arguable, > its a fact today. > so what happens is that the stat on each file triggers a token revoke > on the node that currently writing to the file you do stat on, lets > say it has 1 gb of dirty data in its memory for this file (as its > writes data buffered) this 1 GB of data now gets written to the NSD > server, the client updates the inode info and returns the correct size. > lets say you have very fast network and you have a fast storage device > like GSS (which i see you have) it will be able to do this in a few > 100 ms, but the problem is this now happens serialized for each single > file in this directory that people write into as for each we need to > get the exact stat info to satisfy your ls -l request. > this is what takes so long, not the fact that the storage device might > be slow or to much metadata activity is going on , this is token , > means network traffic and obviously latency dependent. > > the best way to see this is to look at waiters on the client where you > run the ls and see what they are waiting for. > > there are various ways to tune this to get better 'felt' ls responses > but its not completely going away > if all you try to with ls is if there is a file in the directory run > unalias ls and check if ls after that runs fast as it shouldn't do the > -l under the cover anymore. > > > > > If its the CPU on the servers then there is no much to do beside > > replacing or add more servers.If its not the CPU, maybe more memory > > would help? Maybe its just the network that filled up? so i can add > > more links > > > > Or if we reached the point there the bottleneck its the spindles, > > then there is no much point o look somethere else, we just reached > > the hardware limit.. > > > > Sometimes, it also happens that there is very low IO (10Gb/s ), > > almost no cpu usage on the servers but huge slownes ( ls can take 10 > > seconds). Why that happens? There is not much data ops , but we > > think there is a huge ammount of metadata ops. So what i want to > > know is if the metadata vdisks are busy or not. If this is our > > problem, could some SSD disks dedicated to metadata help? > > the answer if ssd's would help or not are hard to say without knowing > the root case and as i tried to explain above the most likely case is > token revoke, not disk i/o. obviously as more busy your disks are as > longer the token revoke will take. > > > > > > > In particular im, a bit puzzled with the design of our GSS storage. 
> > Each recovery groups have 3 declustered arrays, and each declustered > > aray have 1 data and 1 metadata vdisk, but in the end both metadata > > and data vdisks use the same spindles. The problem that, its that I > > dont understand if we have a metadata bottleneck there. Maybe some > > SSD disks in a dedicated declustered array would perform much > > better, but this is just theory. I really would like to be able to > > monitor IO activities on the metadata vdisks. > > the short answer is we WANT the metadata disks to be with the data > disks on the same spindles. compared to other storage systems, GSS is > capable to handle different raid codes for different virtual disks on > the same physical disks, this way we create raid1'ish 'LUNS' for > metadata and raid6'is 'LUNS' for data so the small i/o penalty for a > metadata is very small compared to a read/modify/write on the data disks. > > > > > > > > > > > > so the Layer you care about is the NSD Server layer, which sits on > > top of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > > > I'm asking that because if they are just ops, i don't know how much > > > they could be usefull. For example one write operation could eman > > > write 1 block or write a file of 100GB. If those are oprations, > > > there is a way to have the oupunt in bytes or blocks? > > > > there are multiple ways to get infos on the NSD layer, one would be > > to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats > > counts again. > > > > Counters its not a problem. I can collect them and create some > > graphs in a monitoring tool. I will check that. > > if you (let) upgrade your system to GSS 2.0 you get a graphical > monitoring as part of it. if you want i can send you some direct email > outside the group with additional informations on that. > > > > > the alternative option is to use mmdiag --iohist. 
this shows you a > > history of the last X numbers of io operations on either the client > > or the server side like on a client : > > > > # mmdiag --iohist > > > > === mmdiag: iohist === > > > > I/O history: > > > > I/O start time RW Buf type disk:sectorNum nSec time ms > > qTime ms RpcTimes ms Type Device/NSD ID NSD server > > --------------- -- ----------- ----------------- ----- ------- > > -------- ----------------- ---- ------------------ --------------- > > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:22.182723 R inode 1:1071252480 8 6.970 > > 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.668262 R inode 2:1081373696 8 14.117 > > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.692019 R inode 2:1064356608 8 14.899 > > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:53.707100 R inode 2:1077830152 8 16.499 > > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.728082 R inode 2:1081918976 8 7.760 > > 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.877416 R metadata 2:678978560 16 13.343 > > 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.906556 R inode 2:1083476520 8 11.723 > > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.926592 R inode 1:1076503480 8 8.087 > > 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.941441 R inode 2:1069885984 8 11.686 > > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.953294 R inode 2:1083476936 8 8.951 > > 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.965475 R inode 1:1076503504 8 0.477 > > 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.965755 R inode 2:1083476488 8 0.410 > > 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.965787 R inode 2:1083476512 8 0.439 > > 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > > > you basically see if its a inode , data block , what size it has (in > > sectors) , which nsd server you did send this request to, etc. 
> > > > on the Server side you see the type , which physical disk it goes to > > and also what size of disk i/o it causes like : > > > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > > 0.000 0.000 0.000 pd sdis > > 14:26:50.137102 R inode 19:3003969520 64 9.004 > > 0.000 0.000 0.000 pd sdad > > 14:26:50.136116 R inode 55:3591710992 64 11.057 > > 0.000 0.000 0.000 pd sdoh > > 14:26:50.141510 R inode 21:3066810504 64 5.909 > > 0.000 0.000 0.000 pd sdaf > > 14:26:50.130529 R inode 89:2962370072 64 17.437 > > 0.000 0.000 0.000 pd sddi > > 14:26:50.131063 R inode 78:1889457000 64 17.062 > > 0.000 0.000 0.000 pd sdsj > > 14:26:50.143403 R inode 36:3323035688 64 4.807 > > 0.000 0.000 0.000 pd sdmw > > 14:26:50.131044 R inode 37:2513579736 128 17.181 > > 0.000 0.000 0.000 pd sddv > > 14:26:50.138181 R inode 72:3868810400 64 10.951 > > 0.000 0.000 0.000 pd sdbz > > 14:26:50.138188 R inode 131:2443484784 128 11.792 > > 0.000 0.000 0.000 pd sdug > > 14:26:50.138003 R inode 102:3696843872 64 11.994 > > 0.000 0.000 0.000 pd sdgp > > 14:26:50.137099 R inode 145:3370922504 64 13.225 > > 0.000 0.000 0.000 pd sdmi > > 14:26:50.141576 R inode 62:2668579904 64 9.313 > > 0.000 0.000 0.000 pd sdou > > 14:26:50.134689 R inode 159:2786164648 64 16.577 > > 0.000 0.000 0.000 pd sdpq > > 14:26:50.145034 R inode 34:2097217320 64 7.409 > > 0.000 0.000 0.000 pd sdmt > > 14:26:50.138140 R inode 139:2831038792 64 14.898 > > 0.000 0.000 0.000 pd sdlw > > 14:26:50.130954 R inode 164:282120312 64 22.274 > > 0.000 0.000 0.000 pd sdzd > > 14:26:50.137038 R inode 41:3421909608 64 16.314 > > 0.000 0.000 0.000 pd sdef > > 14:26:50.137606 R inode 104:1870962416 64 16.644 > > 0.000 0.000 0.000 pd sdgx > > 14:26:50.141306 R inode 65:2276184264 64 16.593 > > 0.000 0.000 0.000 pd sdrk > > > > > > > mmdiag --iohist its another think i looked at it, but i could not > > find good explanation for all the "buf type" ( third column ) > > > allocSeg > > data > > iallocSeg > > indBlock > > inode > > LLIndBlock > > logData > > logDesc > > logWrap > > metadata > > vdiskAULog > > vdiskBuf > > vdiskFWLog > > vdiskMDLog > > vdiskMeta > > vdiskRGDesc > > If i want to monifor metadata operation whan should i look at? just > > inodes =inodes , *alloc* = file or data allocation blocks , *ind* = > indirect blocks (for very large files) and metadata , everyhing else > is data or internal i/o's > > > the metadata flag or also inode? this command takes also long to > > run, especially if i run it a second time it hangs for a lot before > > to rerun again, so i'm not sure that run it every 30secs or minute > > its viable, but i will look also into that. THere is any > > documentation that descibes clearly the whole output? what i found > > its quite generic and don't go into details... > > the reason it takes so long is because it collects 10's of thousands > of i/os in a table and to not slow down the system when we dump the > data we copy it to a separate buffer so we don't need locks :-) > you can adjust the number of entries you want to collect by adjusting > the ioHistorySize config parameter > > > > > > > > Last but not least.. and this is what i really would like to > > > accomplish, i would to be able to monitor the latency of metadata > > operations. 
> > > > you can't do this on the server side as you don't know how much time > > you spend on the client , network or anything between the app and > > the physical disk, so you can only reliably look at this from the > > client, the iohist output only shows you the Server disk i/o > > processing time, but that can be a fraction of the overall time (in > > other cases this obviously can also be the dominant part depending > > on your workload). > > > > the easiest way on the client is to run > > > > mmfsadm vfsstats enable > > from now on vfs stats are collected until you restart GPFS. > > > > then run : > > > > vfs statistics currently enabled > > started at: Fri Aug 29 13:15:05.380 2014 > > duration: 448446.970 sec > > > > name calls time per call total time > > -------------------- -------- -------------- -------------- > > statfs 9 0.000002 0.000021 > > startIO 246191176 0.005853 1441049.976740 > > > > to dump what ever you collected so far on this node. > > > > > We already do that, but as I said, I want to check specifically how > > gss servers are keeping the requests to identify or exlude server > > side bottlenecks. > > > > > > Thanks for your help, you gave me definitely few things where to > look at. > > > > Salvatore > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at gpfsug.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Fri Sep 5 22:17:47 2014 From: chekh at stanford.edu (Alex Chekholko) Date: Fri, 05 Sep 2014 14:17:47 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540996E5.5000502@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> <540996E5.5000502@ebi.ac.uk> Message-ID: <540A287B.1050202@stanford.edu> On 9/5/14, 3:56 AM, Salvatore Di Nardo wrote: > Little clarification: > Our ls its plain ls, there is no alias. ... > Last question about "maxFIlesToCache" you say that must be large on > small cluster but small on large clusters. What do you consider 6 > servers and almost 700 clients? > > on clienst we have: > maxFilesToCache 4000 > > on servers we have > maxFilesToCache 12288 > > One thing to do is to try your 'ls', see it is slow, then immediately run it again. If it is fast the second and consecutive times, it's because now the stat info is coming out of local cache. e.g. /usr/bin/time ls /path/to/some/dir && /usr/bin/time ls /path/to/some/dir The second time is likely to be almost immediate. So long as your local cache is big enough. I see on one of our older clusters we have: tokenMemLimit 2G maxFilesToCache 40000 maxStatCache 80000 You can also interrogate the local cache to see how full it is. Of course, if many nodes are writing to same dirs, then the cache will need to be invalidated often which causes some overhead. Big local cache is good if clients are usually working in different directories. 
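That cold-versus-warm test can also be scripted so the difference is easy to record before and after changing maxFilesToCache/maxStatCache on a client. A minimal sketch, again plain Python with nothing GPFS-specific in it:

#!/usr/bin/env python
# Hedged sketch: time repeated stat passes over a directory. A large gap between
# pass 1 and later passes suggests the first pass was filling the client's
# inode/stat cache (maxFilesToCache / maxStatCache); similar times on every pass
# suggest the cache is too small for the directory or is being invalidated.
import os
import sys
import time

def stat_pass(path):
    count = 0
    for name in os.listdir(path):
        try:
            os.lstat(os.path.join(path, name))
            count += 1
        except OSError:
            pass  # file may have been deleted between listdir() and lstat()
    return count

if __name__ == '__main__':
    path = sys.argv[1] if len(sys.argv) > 1 else '.'
    passes = int(sys.argv[2]) if len(sys.argv) > 2 else 3
    for n in range(1, passes + 1):
        t0 = time.time()
        count = stat_pass(path)
        print('pass %d: %d entries stat()ed in %.3f s' % (n, count, time.time() - t0))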
Regards, -- chekh at stanford.edu From oehmes at us.ibm.com Sat Sep 6 01:12:42 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Fri, 5 Sep 2014 17:12:42 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540996E5.5000502@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> <540996E5.5000502@ebi.ac.uk> Message-ID: on your GSS nodes you have tuning files we suggest customers to use for mixed workloads clients. the files in /usr/lpp/mmfs/samples/gss/ if you create a nodeclass for all your clients you can run /usr/lpp/mmfs/samples/gss/gssClientConfig.sh NODECLASS and it applies all the settings to them so they will be active on next restart of the gpfs daemon. this should be a very good starting point for your config. please try that and let me know if it doesn't. there are also several enhancements in GPFS 4.1 which reduce contention in multiple areas, which would help as well, if you have the choice to update the nodes. btw. the GSS 2.0 package will update your GSS nodes to 4.1 also Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Salvatore Di Nardo To: gpfsug main discussion list Date: 09/05/2014 03:57 AM Subject: Re: [gpfsug-discuss] gpfs performance monitoring Sent by: gpfsug-discuss-bounces at gpfsug.org Little clarification: Our ls its plain ls, there is no alias. Consider that all those things are already set up properly as EBI run hi computing farms from many years, so those things are already fixed loong time ago. We have very little experience with GPFS, but good knowledge with LSF farms and own multiple NFS stotages ( several petabyte sized). about NIS, all clients run NSCD that cashes all informations to avoid such tipe of slownes, in fact then ls isslow, also ls -n is slow. Beside that, also a "cd" sometimes hangs, so it have nothing to do with getting attributes. Just to clarify a bit more. Now GSS usually seems working fine, we have users that run jobs on the farms that pushes 180Gb/s read ( reading and writing files of 100GB size). GPFS works very well there, where other systems had performance problems accessing portion of data in so huge files. Sadly, on the other hand, other users run jobs that do suge ammount of metadata operations, like toons of ls in directory with many files, or creating a silly amount of temporary files just to synchronize the jobs between the farm nodes, or just to store temporary data for few milliseconds and them immediately delete those temporary files. Imagine to create constantly thousands files just to write few bytes and they delete them after few milliseconds... When those thing happens we see 10-15Gb/sec throughput, low CPU usage on the server ( 80% iddle), but any cd, or ls or wathever takes few seconds. So my question is, if the bottleneck could be the spindles, or if the clients could be tuned a bit more? I read your PDF and all the paramenters seems already well configured except "maxFilesToCache", but I'm not sure how we should configure few of those parameters on the clients. As an example I cannot immagine a client that require 38g pagepool size. so what's the correct pagepool on a client? what about those others? maxFilestoCache maxBufferdescs worker1threads worker3threads Right now all the clients have 1 GB pagepool size. 
In theory, we can afford to use more ( i thing we can easily go up to 8GB) as they have plenty or available memory. If this could help, we can do that, but the client really really need more than 1G? They are just clients after all, so the memory in theory should be used for jobs not just for "caching". Last question about "maxFIlesToCache" you say that must be large on small cluster but small on large clusters. What do you consider 6 servers and almost 700 clients? on clienst we have: maxFilesToCache 4000 on servers we have maxFilesToCache 12288 Regards, Salvatore On 05/09/14 01:48, Sven Oehme wrote: ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM: > From: Salvatore Di Nardo > To: gpfsug main discussion list > Date: 09/04/2014 03:44 AM > Subject: Re: [gpfsug-discuss] gpfs performance monitoring > Sent by: gpfsug-discuss-bounces at gpfsug.org > > On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just > show the value since start (or last reset) time. > OK, you confirmed my toughts, thatks > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. 
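(Picking up the node-class suggestion from earlier in this message, a minimal sketch of how the sample client tuning could be applied; the class name and node names below are invented for illustration.)

# create a user-defined node class covering the compute clients
mmcrnodeclass computeClients -N client001,client002,client003

# apply the GSS sample client settings to every node in the class;
# they take effect the next time the GPFS daemon restarts on those nodes
/usr/lpp/mmfs/samples/gss/gssClientConfig.sh computeClients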
> what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. > No.. what I'm looking its exactly how the disks are busy to keep the > requests. Obviously i'm not looking just that, but I feel the needs > to monitor also those things. Ill explain you why. > > It happens when our storage is quite busy ( 180Gb/s of read/write ) > that the FS start to be slowin normal cd or ls requests. This might > be normal, but in those situation i want to know where the > bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing > where the bottlenek is might help me to understand if we can tweak > the system a bit more. if cd or ls is very slow in GPFS in the majority of the cases it has nothing to do with NSD Server bottlenecks, only indirect. the main reason ls is slow in the field is you have some very powerful nodes that all do buffered writes into the same directory into 1 or multiple files while you do the ls on a different node. what happens now is that the ls you did run most likely is a alias for ls -l or something even more complex with color display, etc, but the point is it most likely returns file size. GPFS doesn't lie about the filesize, we only return accurate stat informations and while this is arguable, its a fact today. so what happens is that the stat on each file triggers a token revoke on the node that currently writing to the file you do stat on, lets say it has 1 gb of dirty data in its memory for this file (as its writes data buffered) this 1 GB of data now gets written to the NSD server, the client updates the inode info and returns the correct size. lets say you have very fast network and you have a fast storage device like GSS (which i see you have) it will be able to do this in a few 100 ms, but the problem is this now happens serialized for each single file in this directory that people write into as for each we need to get the exact stat info to satisfy your ls -l request. this is what takes so long, not the fact that the storage device might be slow or to much metadata activity is going on , this is token , means network traffic and obviously latency dependent. the best way to see this is to look at waiters on the client where you run the ls and see what they are waiting for. there are various ways to tune this to get better 'felt' ls responses but its not completely going away if all you try to with ls is if there is a file in the directory run unalias ls and check if ls after that runs fast as it shouldn't do the -l under the cover anymore. > > If its the CPU on the servers then there is no much to do beside > replacing or add more servers.If its not the CPU, maybe more memory > would help? Maybe its just the network that filled up? so i can add > more links > > Or if we reached the point there the bottleneck its the spindles, > then there is no much point o look somethere else, we just reached > the hardware limit.. > > Sometimes, it also happens that there is very low IO (10Gb/s ), > almost no cpu usage on the servers but huge slownes ( ls can take 10 > seconds). Why that happens? There is not much data ops , but we > think there is a huge ammount of metadata ops. So what i want to > know is if the metadata vdisks are busy or not. If this is our > problem, could some SSD disks dedicated to metadata help? 
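(A short sketch of the client-side checks suggested above for a slow ls; run it on the node where the slowness is seen, and treat the directory as a placeholder.)

# what is this client actually waiting on right now?
mmdiag --waiters

# rule out the "ls is really ls -l" effect: without the alias, a plain ls
# should not need exact file sizes, so it should not trigger token revokes
unalias ls 2>/dev/null
time ls /some/busy/dir > /dev/null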
the answer if ssd's would help or not are hard to say without knowing the root case and as i tried to explain above the most likely case is token revoke, not disk i/o. obviously as more busy your disks are as longer the token revoke will take. > > > In particular im, a bit puzzled with the design of our GSS storage. > Each recovery groups have 3 declustered arrays, and each declustered > aray have 1 data and 1 metadata vdisk, but in the end both metadata > and data vdisks use the same spindles. The problem that, its that I > dont understand if we have a metadata bottleneck there. Maybe some > SSD disks in a dedicated declustered array would perform much > better, but this is just theory. I really would like to be able to > monitor IO activities on the metadata vdisks. the short answer is we WANT the metadata disks to be with the data disks on the same spindles. compared to other storage systems, GSS is capable to handle different raid codes for different virtual disks on the same physical disks, this way we create raid1'ish 'LUNS' for metadata and raid6'is 'LUNS' for data so the small i/o penalty for a metadata is very small compared to a read/modify/write on the data disks. > > > > > so the Layer you care about is the NSD Server layer, which sits on > top of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be > to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats > counts again. > > Counters its not a problem. I can collect them and create some > graphs in a monitoring tool. I will check that. if you (let) upgrade your system to GSS 2.0 you get a graphical monitoring as part of it. if you want i can send you some direct email outside the group with additional informations on that. > > the alternative option is to use mmdiag --iohist. 
this shows you a > history of the last X numbers of io operations on either the client > or the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms > qTime ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 > 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 > 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 > 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 > 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 > 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 > 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 > 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 > 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. 
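(A hedged example of narrowing the client-side history shown above down to the metadata-related records; the buf-type names come from the classification discussed further down in this thread, and the ioHistorySize value below is only an example.)

# keep only the metadata-related buffer types from the iohist output
mmdiag --iohist | grep -E ' (inode|allocSeg|iallocSeg|indBlock|LLIndBlock|metadata) '

# if the history wraps too quickly to be useful, the table can be made larger
# via the ioHistorySize parameter mentioned later in this message
mmchconfig ioHistorySize=65536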
> > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > > mmdiag --iohist its another think i looked at it, but i could not > find good explanation for all the "buf type" ( third column ) > allocSeg > data > iallocSeg > indBlock > inode > LLIndBlock > logData > logDesc > logWrap > metadata > vdiskAULog > vdiskBuf > vdiskFWLog > vdiskMDLog > vdiskMeta > vdiskRGDesc > If i want to monifor metadata operation whan should i look at? just inodes =inodes , *alloc* = file or data allocation blocks , *ind* = indirect blocks (for very large files) and metadata , everyhing else is data or internal i/o's > the metadata flag or also inode? this command takes also long to > run, especially if i run it a second time it hangs for a lot before > to rerun again, so i'm not sure that run it every 30secs or minute > its viable, but i will look also into that. THere is any > documentation that descibes clearly the whole output? what i found > its quite generic and don't go into details... the reason it takes so long is because it collects 10's of thousands of i/os in a table and to not slow down the system when we dump the data we copy it to a separate buffer so we don't need locks :-) you can adjust the number of entries you want to collect by adjusting the ioHistorySize config parameter > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and > the physical disk, so you can only reliably look at this from the > client, the iohist output only shows you the Server disk i/o > processing time, but that can be a fraction of the overall time (in > other cases this obviously can also be the dominant part depending > on your workload). 
> > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. > > We already do that, but as I said, I want to check specifically how > gss servers are keeping the requests to identify or exlude server > side bottlenecks. > > > Thanks for your help, you gave me definitely few things where to look at. > > Salvatore > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From luke.raimbach at oerc.ox.ac.uk Tue Sep 9 11:23:47 2014 From: luke.raimbach at oerc.ox.ac.uk (Luke Raimbach) Date: Tue, 9 Sep 2014 10:23:47 +0000 Subject: [gpfsug-discuss] mmdiag output questions Message-ID: Hi All, When tracing a problem recently (which turned out to be a NIC failure), mmdiag proved useful in tracing broken cluster connections. I have some questions about the output of mmdiag using the --network switch: Occasionally I see nodes in the same cluster grouped, but in no readily identifiable way - for example, the following output has three headings "Device bon0:" with some nodes listed, but the nodes don't seem to share anything in common like status, err, ostype, etc. Also, is anyone able to explain what might be seen under the err column? Do these correspond to GPFS error codes as one might see in mmfs.log.latest? What is the sock column displaying - the number of open sockets or the socket state? Lastly, the sent/recvd columns seem very low. Is there a rolling time window within which these statistics are kept in some internal mmfsd buffer? Cheers. 
=== mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.51/22 (eth0) my addr list 10.200.21.1/16 (bond0)/cpdn.oerc.local 10.100.10.51/22 (eth0) my node number 9 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs01 10.200.1.1 connected 0 32 110 110 Linux/L gpfs02 10.200.2.1 connected 0 36 104 104 Linux/L linux 10.200.101.1 connected 0 37 0 0 Linux/L jupiter 10.200.102.1 connected 0 35 0 0 Windows/L cnfs0 10.200.10.10 connected 0 39 0 0 Linux/L cnfs1 10.200.10.11 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cnfs2 10.200.10.12 connected 0 33 5 5 Linux/L cnfs3 10.200.10.13 init 0 -1 0 0 Linux/L cpdn-ppc02 10.200.61.1 init 0 -1 0 0 Linux/L cpdn-ppc03 10.200.62.1 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cpdn-ppc01 10.200.60.1 connected 0 38 0 0 Linux/L diag verbs: VERBS RDMA class not initialized Conversely, the output of mmdiag --network on the file system manager node for the same cluster looks like this: === mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.21/22 (eth0) my addr list 10.200.1.1/16 (bond0)/cpdn.oerc.local 10.100.10.21/22 (eth0) my node number 1 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs02 10.200.2.1 connected 0 73 219 219 Linux/L linux 10.200.101.1 connected 0 49 180 181 Linux/L jupiter 10.200.102.1 connected 0 33 3 3 Windows/L cnfs0 10.200.10.10 connected 0 61 3 3 Linux/L cnfs1 10.200.10.11 connected 0 81 0 0 Linux/L cnfs2 10.200.10.12 connected 0 64 23 23 Linux/L cnfs3 10.200.10.13 connected 0 60 2 2 Linux/L tsm01 10.200.21.1 connected 0 50 110 110 Linux/L cpdn-ppc02 10.200.61.1 connected 0 63 0 0 Linux/L cpdn-ppc03 10.200.62.1 connected 0 65 0 0 Linux/L cpdn-ppc01 10.200.60.1 connected 0 62 94 94 Linux/L diag verbs: VERBS RDMA class not initialized All neatly connected! -- Luke Raimbach IT Manager Oxford e-Research Centre 7 Keble Road, Oxford, OX1 3QG +44(0)1865 610639 From chair at gpfsug.org Wed Sep 10 15:33:24 2014 From: chair at gpfsug.org (Jez Tucker (Chair)) Date: Wed, 10 Sep 2014 15:33:24 +0100 Subject: [gpfsug-discuss] GPFS Request for Enhancements Message-ID: <54106134.7010902@gpfsug.org> Hi all Just a quick reminder that the RFEs that you all gave feedback at the last UG on are live on IBM's RFE site: goo.gl/1K6LBa Please take the time to have a look and add your votes to the GPFS RFEs. Jez -------------- next part -------------- An HTML attachment was scrubbed... URL: From dmetcalfe at ocf.co.uk Thu Sep 11 21:18:58 2014 From: dmetcalfe at ocf.co.uk (Daniel Metcalfe) Date: Thu, 11 Sep 2014 21:18:58 +0100 Subject: [gpfsug-discuss] mmdiag output questions In-Reply-To: References: Message-ID: Hi Luke, I've seen the same apparent grouping of nodes, I don't believe the nodes are actually being grouped but instead the "Device Bond0:" and column headers are being re-printed to screen whenever there is a node that has the "init" status followed by a node that is "connected". It is something I've noticed on many different versions of GPFS so I imagine it's a "feature". I've not noticed anything but '0' in the err column so I'm not sure if these correspond to error codes in the GPFS logs. 
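(For the kind of broken-connection tracing described in this thread, a small sketch that prints any peer whose connection to this node is not fully established; the field positions follow the sample output above.)

# show peers that are not in the "connected" state, with their err value
mmdiag --network | awk '$2 ~ /^[0-9]+(\.[0-9]+){3}$/ && $3 != "connected" {print $1, $2, "status=" $3, "err=" $4}'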
If you run the command "mmfsadm dump tscomm", you'll see a bit more detail than the mmdiag -network shows. This suggests the sock column is number of sockets. I've seen the low numbers to for sent / recv using mmdiag --network, again the mmfsadm command above gives a better representation I've found. All that being said, if you want to get in touch with us then we'll happily open a PMR for you and find out the answer to any of your questions. Kind regards, Danny Metcalfe Systems Engineer OCF plc Tel: 0114 257 2200 [cid:image001.jpg at 01CFCE04.575B8380] Twitter Fax: 0114 257 0022 [cid:image002.jpg at 01CFCE04.575B8380] Blog Mob: 07960 503404 [cid:image003.jpg at 01CFCE04.575B8380] Web Please note, any emails relating to an OCF Support request must always be sent to support at ocf.co.uk for a ticket number to be generated or existing support ticket to be updated. Should this not be done then OCF cannot be held responsible for requests not dealt with in a timely manner. OCF plc is a company registered in England and Wales. Registered number 4132533. Registered office address: OCF plc, 5 Rotunda Business Centre, Thorncliffe Park, Chapeltown, Sheffield, S35 2PG This message is private and confidential. If you have received this message in error, please notify us immediately and remove it from your system. -----Original Message----- From: gpfsug-discuss-bounces at gpfsug.org [mailto:gpfsug-discuss-bounces at gpfsug.org] On Behalf Of Luke Raimbach Sent: 09 September 2014 11:24 To: gpfsug-discuss at gpfsug.org Subject: [gpfsug-discuss] mmdiag output questions Hi All, When tracing a problem recently (which turned out to be a NIC failure), mmdiag proved useful in tracing broken cluster connections. I have some questions about the output of mmdiag using the --network switch: Occasionally I see nodes in the same cluster grouped, but in no readily identifiable way - for example, the following output has three headings "Device bon0:" with some nodes listed, but the nodes don't seem to share anything in common like status, err, ostype, etc. Also, is anyone able to explain what might be seen under the err column? Do these correspond to GPFS error codes as one might see in mmfs.log.latest? What is the sock column displaying - the number of open sockets or the socket state? Lastly, the sent/recvd columns seem very low. Is there a rolling time window within which these statistics are kept in some internal mmfsd buffer? Cheers. 
=== mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.51/22 (eth0) my addr list 10.200.21.1/16 (bond0)/cpdn.oerc.local 10.100.10.51/22 (eth0) my node number 9 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs01 10.200.1.1 connected 0 32 110 110 Linux/L gpfs02 10.200.2.1 connected 0 36 104 104 Linux/L linux 10.200.101.1 connected 0 37 0 0 Linux/L jupiter 10.200.102.1 connected 0 35 0 0 Windows/L cnfs0 10.200.10.10 connected 0 39 0 0 Linux/L cnfs1 10.200.10.11 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cnfs2 10.200.10.12 connected 0 33 5 5 Linux/L cnfs3 10.200.10.13 init 0 -1 0 0 Linux/L cpdn-ppc02 10.200.61.1 init 0 -1 0 0 Linux/L cpdn-ppc03 10.200.62.1 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cpdn-ppc01 10.200.60.1 connected 0 38 0 0 Linux/L diag verbs: VERBS RDMA class not initialized Conversely, the output of mmdiag --network on the file system manager node for the same cluster looks like this: === mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.21/22 (eth0) my addr list 10.200.1.1/16 (bond0)/cpdn.oerc.local 10.100.10.21/22 (eth0) my node number 1 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs02 10.200.2.1 connected 0 73 219 219 Linux/L linux 10.200.101.1 connected 0 49 180 181 Linux/L jupiter 10.200.102.1 connected 0 33 3 3 Windows/L cnfs0 10.200.10.10 connected 0 61 3 3 Linux/L cnfs1 10.200.10.11 connected 0 81 0 0 Linux/L cnfs2 10.200.10.12 connected 0 64 23 23 Linux/L cnfs3 10.200.10.13 connected 0 60 2 2 Linux/L tsm01 10.200.21.1 connected 0 50 110 110 Linux/L cpdn-ppc02 10.200.61.1 connected 0 63 0 0 Linux/L cpdn-ppc03 10.200.62.1 connected 0 65 0 0 Linux/L cpdn-ppc01 10.200.60.1 connected 0 62 94 94 Linux/L diag verbs: VERBS RDMA class not initialized All neatly connected! -- Luke Raimbach IT Manager Oxford e-Research Centre 7 Keble Road, Oxford, OX1 3QG +44(0)1865 610639 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ----- No virus found in this message. Checked by AVG - www.avg.com Version: 2014.0.4765 / Virus Database: 4015/8158 - Release Date: 09/05/14 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 4696 bytes Desc: image001.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 4725 bytes Desc: image002.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.jpg Type: image/jpeg Size: 4820 bytes Desc: image003.jpg URL: From stuartb at 4gh.net Tue Sep 23 16:47:09 2014 From: stuartb at 4gh.net (Stuart Barkley) Date: Tue, 23 Sep 2014 11:47:09 -0400 (EDT) Subject: [gpfsug-discuss] filesets and mountpoint naming Message-ID: When we first started using GPFS we created several filesystems and just directly mounted them where seemed appropriate. 
We have something like: /home /scratch /projects /reference /applications We are finding the overhead of separate filesystems to be troublesome and are looking at using filesets inside fewer filesystems to accomplish our goals (we will probably keep /home separate for now). We can put symbolic links in place to provide the same user experience, but I'm looking for suggestions as to where to mount the actual gpfs filesystems. We have multiple compute clusters with multiple gpfs systems, one cluster has a traditional gpfs system and a separate gss system which will obviously need multiple mount points. We also want to consider possible future cross cluster mounts. Some thoughts are to just do filesystems as: /gpfs01, /gpfs02, etc. /mnt/gpfs01, etc /mnt/clustera/gpfs01, etc. What have other people done? Are you happy with it? What would you do differently? Thanks, Stuart -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone From sabujp at gmail.com Thu Sep 25 13:39:14 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 25 Sep 2014 07:39:14 -0500 Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs -u in bin/mmnfsfuncs unexportAll and unexportFS Message-ID: Hi all, We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip failover times > 4.5mins . It looks like it's being caused by all the exportfs -u calls being made in the unexportAll and the unexportFS function in bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the exported directories? We're running only NFSv3 and have lots of exports and for security reasons can't have one giant NFS export. That may be a possibility with GPFS4.1 and NFSv4 but we won't be migrating to that anytime soon. Assume the network went down for the cnfs server or the system panicked/crashed, what would be the purpose of exportfs -u be in that case, so what's the purpose at all? Thanks, Sabuj -------------- next part -------------- An HTML attachment was scrubbed... URL: From sabujp at gmail.com Thu Sep 25 14:11:18 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 25 Sep 2014 08:11:18 -0500 Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs -u in bin/mmnfsfuncs unexportAll and unexportFS In-Reply-To: References: Message-ID: our support engineer suggests adding & to the end of the exportfs -u lines in the mmnfsfunc script, which is a good workaround, can this be added to future gpfs 3.5 and 4.1 rels (haven't even looked at 4.1 yet). I was looking at the unexportfs all in nfs-utils/exportfs.c and it looks like the limiting factor there would be all the hostname lookups? I don't see what exportfs -u is doing other than doing slow reverse lookups and removing the export from the nfs stack. On Thu, Sep 25, 2014 at 7:39 AM, Sabuj Pattanayek wrote: > Hi all, > > We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip failover > times > 4.5mins . It looks like it's being caused by all the exportfs -u > calls being made in the unexportAll and the unexportFS function in > bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the > exported directories? We're running only NFSv3 and have lots of exports and > for security reasons can't have one giant NFS export. That may be a > possibility with GPFS4.1 and NFSv4 but we won't be migrating to that > anytime soon. 
> > Assume the network went down for the cnfs server or the system > panicked/crashed, what would be the purpose of exportfs -u be in that case, > so what's the purpose at all? > > Thanks, > Sabuj > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sabujp at gmail.com Thu Sep 25 14:15:19 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 25 Sep 2014 08:15:19 -0500 Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs -u in bin/mmnfsfuncs unexportAll and unexportFS In-Reply-To: References: Message-ID: yes, it's doing a getaddrinfo() call for every hostname that's a fqdn and not an ip addr, which we have lots of in our export entries since sometimes clients update their dns (ip's). On Thu, Sep 25, 2014 at 8:11 AM, Sabuj Pattanayek wrote: > our support engineer suggests adding & to the end of the exportfs -u lines > in the mmnfsfunc script, which is a good workaround, can this be added to > future gpfs 3.5 and 4.1 rels (haven't even looked at 4.1 yet). I was > looking at the unexportfs all in nfs-utils/exportfs.c and it looks like the > limiting factor there would be all the hostname lookups? I don't see what > exportfs -u is doing other than doing slow reverse lookups and removing the > export from the nfs stack. > > On Thu, Sep 25, 2014 at 7:39 AM, Sabuj Pattanayek > wrote: > >> Hi all, >> >> We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip >> failover times > 4.5mins . It looks like it's being caused by all the >> exportfs -u calls being made in the unexportAll and the unexportFS function >> in bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the >> exported directories? We're running only NFSv3 and have lots of exports and >> for security reasons can't have one giant NFS export. That may be a >> possibility with GPFS4.1 and NFSv4 but we won't be migrating to that >> anytime soon. >> >> Assume the network went down for the cnfs server or the system >> panicked/crashed, what would be the purpose of exportfs -u be in that case, >> so what's the purpose at all? >> >> Thanks, >> Sabuj >> >> >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL:
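(To make the workaround in this thread concrete: the point is simply that the per-export unexport calls no longer run one after another, so a single slow getaddrinfo() lookup cannot stall the whole failover. The hostnames and paths below are invented; this is an illustration of the idea, not the actual contents of bin/mmnfsfuncs.)

# serial unexport: each slow reverse lookup delays the next one
#   exportfs -u "client1.example.com:/gpfs/fs1/projects"
#   exportfs -u "client2.example.com:/gpfs/fs1/scratch"

# backgrounded, as suggested above: the lookups overlap instead of queueing
exportfs -u "client1.example.com:/gpfs/fs1/projects" &
exportfs -u "client2.example.com:/gpfs/fs1/scratch" &
wait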
Or do I really want to be creating the cluster using the 40GbE network hostnames and set the admin node name to the name of the 1GbE network interface? (there's actually also an FDR switch in there somewhere for verbs as well) Thanks Simon From ewahl at osc.edu Tue Sep 2 14:44:29 2014 From: ewahl at osc.edu (Ed Wahl) Date: Tue, 2 Sep 2014 13:44:29 +0000 Subject: [gpfsug-discuss] GPFS admin host name vs subnets In-Reply-To: References: Message-ID: Seems like you are on the correct track. This is similar to my setup. subnett'ed daemon 10GbE, 1GbE with main being QDR RDMA, admin 1GbE. To my mind the most important part is Setting "privateSubnetOverride" to 1. This allows both your 1GbE and your 40GbE to be on a private subnet. Serving block over public IPs just seems wrong on SO many levels. Whether truly private/internal or not. And how many people use public IPs internally? Wait, maybe I don't want to know... Using 'verbsRdma enable' for your FDR seems to override Daemon node name for block, at least in my experience. I love the fallback to 10GbE and then 1GbE in case of disaster when using IB. Lately we seem to be generating bugs in OpenSM at a frightening rate so that has been _extremely_ helpful. Now if we could just monitor when it happens more easily than running mmfsadm test verbs conn, say by logging a failure of RDMA? Ed OSC ________________________________________ From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Simon Thompson (Research Computing - IT Services) [S.J.Thompson at bham.ac.uk] Sent: Monday, September 01, 2014 3:44 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] GPFS admin host name vs subnets I was just reading through the docs at: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/GPFS+Network+Communication+Overview And was wondering about using admin host name bs using subnets. My reading of the page is that if say I have a 1GbE network and a 40GbE network, I could have an admin host name on the 1GbE network. But equally from the docs, it looks like I could also use subnets to achieve the same whilst allowing the admin network to be a fall back for data if necessary. For example, create the cluster using the primary name on the 1GbE network, then use the subnets property to use set the network on the 40GbE network as the first and the network on the 1GbE network as the second in the list, thus GPFS data will pass over the 40GbE network in preference and the 1GbE network will, by default only be used for admin traffic as the admin host name will just be the name of the host on the 1GbE network. Is my reading of the docs correct? Or do I really want to be creating the cluster using the 40GbE network hostnames and set the admin node name to the name of the 1GbE network interface? (there's actually also an FDR switch in there somewhere for verbs as well) Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From oehmes at gmail.com Tue Sep 2 15:11:03 2014 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 2 Sep 2014 07:11:03 -0700 Subject: [gpfsug-discuss] GPFS admin host name vs subnets In-Reply-To: References: Message-ID: Ed, if you enable RDMA, GPFS will always use this as preferred data transfer. if you have subnets configured, GPFS will prefer this for communication with higher priority as the default interface. 
so the order is RDMA , subnets, default. if RDMA will fail for whatever reason we will use the subnets defined interface and if that fails as well we will use the default interface. the easiest way to see what is used is to run mmdiag --network (only avail on more recent versions of GPFS) it will tell you if RDMA is enabled between individual nodes as well as if a subnet connection is used or not : [root at client05 ~]# mmdiag --network === mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 192.167.13.5/16 (eth0) my addr list 192.1.13.5/16 (ib1) 192.0.13.5/16 (ib0)/ client04.clientad.almaden.ibm.com 192.167.13.5/16 (eth0) my node number 17 TCP Connections between nodes: Device ib0: hostname node destination status err sock sent(MB) recvd(MB) ostype client04n1 192.0.4.1 connected 0 69 0 37 Linux/L client04n2 192.0.4.2 connected 0 70 0 37 Linux/L client04n3 192.0.4.3 connected 0 68 0 0 Linux/L Device ib1: hostname node destination status err sock sent(MB) recvd(MB) ostype clientcl21 192.1.201.21 connected 0 65 0 0 Linux/L clientcl25 192.1.201.25 connected 0 66 0 0 Linux/L clientcl26 192.1.201.26 connected 0 67 0 0 Linux/L clientcl21 192.1.201.21 connected 0 71 0 0 Linux/L clientcl22 192.1.201.22 connected 0 63 0 0 Linux/L client10 192.1.13.10 connected 0 73 0 0 Linux/L client08 192.1.13.8 connected 0 72 0 0 Linux/L RDMA Connections between nodes: Fabric 1 - Device mlx4_0 Port 1 Width 4x Speed FDR lid 13 hostname idx CM state VS buff RDMA_CT(ERR) RDMA_RCV_MB RDMA_SND_MB VS_CT(ERR) VS_SND_MB VS_RCV_MB WAIT_CON_SLOT WAIT_NODE_SLOT clientcl21 0 N RTS (Y)903 0 (0 ) 0 0 192 (0 ) 0 0 0 0 client04n1 0 N RTS (Y)477 0 (0 ) 0 0 12367404(0 ) 107905 594 0 0 client04n1 1 N RTS (Y)477 0 (0 ) 0 0 12367404(0 ) 107901 593 0 0 client04n2 0 N RTS (Y)477 0 (0 ) 0 0 12371352(0 ) 107911 594 0 0 client04n2 2 N RTS (Y)477 0 (0 ) 0 0 12371352(0 ) 107902 594 0 0 clientcl21 0 N RTS (Y)880 0 (0 ) 0 0 11 (0 ) 0 0 0 0 client04n3 0 N RTS (Y)969 0 (0 ) 0 0 5 (0 ) 0 0 0 0 clientcl26 0 N RTS (Y)702 0 (0 ) 0 0 35 (0 ) 0 0 0 0 client08 0 N RTS (Y)637 0 (0 ) 0 0 16 (0 ) 0 0 0 0 clientcl25 0 N RTS (Y)574 0 (0 ) 0 0 14 (0 ) 0 0 0 0 clientcl22 0 N RTS (Y)507 0 (0 ) 0 0 2 (0 ) 0 0 0 0 client10 0 N RTS (Y)568 0 (0 ) 0 0 121 (0 ) 0 0 0 0 Fabric 2 - Device mlx4_0 Port 2 Width 4x Speed FDR lid 65 hostname idx CM state VS buff RDMA_CT(ERR) RDMA_RCV_MB RDMA_SND_MB VS_CT(ERR) VS_SND_MB VS_RCV_MB WAIT_CON_SLOT WAIT_NODE_SLOT clientcl21 1 N RTS (Y)904 0 (0 ) 0 0 192 (0 ) 0 0 0 0 client04n1 2 N RTS (Y)477 0 (0 ) 0 0 12367404(0 ) 107897 593 0 0 client04n2 1 N RTS (Y)477 0 (0 ) 0 0 12371352(0 ) 107903 594 0 0 clientcl21 1 N RTS (Y)881 0 (0 ) 0 0 10 (0 ) 0 0 0 0 clientcl26 1 N RTS (Y)701 0 (0 ) 0 0 35 (0 ) 0 0 0 0 client08 1 N RTS (Y)637 0 (0 ) 0 0 16 (0 ) 0 0 0 0 clientcl25 1 N RTS (Y)574 0 (0 ) 0 0 14 (0 ) 0 0 0 0 clientcl22 1 N RTS (Y)507 0 (0 ) 0 0 2 (0 ) 0 0 0 0 client10 1 N RTS (Y)568 0 (0 ) 0 0 121 (0 ) 0 0 0 0 in this example you can see thet my client (client05) has multiple subnets configured as well as RDMA. so to connected to the various TCP devices (ib0 and ib1) to different cluster nodes and also has a RDMA connection to a different set of nodes. as you can see there is basically no traffic on the TCP devices, as all the traffic uses the 2 defined RDMA fabrics. there is not a single connection using the daemon interface (eth0) as all nodes are either connected via subnets or via RDMA. hope this helps. 
Sven On Tue, Sep 2, 2014 at 6:44 AM, Ed Wahl wrote: > Seems like you are on the correct track. This is similar to my setup. > subnett'ed daemon 10GbE, 1GbE with main being QDR RDMA, admin 1GbE. To > my mind the most important part is Setting "privateSubnetOverride" to 1. > This allows both your 1GbE and your 40GbE to be on a private subnet. > Serving block over public IPs just seems wrong on SO many levels. Whether > truly private/internal or not. And how many people use public IPs > internally? Wait, maybe I don't want to know... > > Using 'verbsRdma enable' for your FDR seems to override Daemon node > name for block, at least in my experience. I love the fallback to 10GbE > and then 1GbE in case of disaster when using IB. Lately we seem to be > generating bugs in OpenSM at a frightening rate so that has been > _extremely_ helpful. Now if we could just monitor when it happens more > easily than running mmfsadm test verbs conn, say by logging a failure of > RDMA? > > > Ed > OSC > > ________________________________________ > From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] > on behalf of Simon Thompson (Research Computing - IT Services) [ > S.J.Thompson at bham.ac.uk] > Sent: Monday, September 01, 2014 3:44 PM > To: gpfsug main discussion list > Subject: [gpfsug-discuss] GPFS admin host name vs subnets > > I was just reading through the docs at: > > > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/GPFS+Network+Communication+Overview > > And was wondering about using admin host name bs using subnets. My reading > of the page is that if say I have a 1GbE network and a 40GbE network, I > could have an admin host name on the 1GbE network. But equally from the > docs, it looks like I could also use subnets to achieve the same whilst > allowing the admin network to be a fall back for data if necessary. > > For example, create the cluster using the primary name on the 1GbE > network, then use the subnets property to use set the network on the 40GbE > network as the first and the network on the 1GbE network as the second in > the list, thus GPFS data will pass over the 40GbE network in preference and > the 1GbE network will, by default only be used for admin traffic as the > admin host name will just be the name of the host on the 1GbE network. > > Is my reading of the docs correct? Or do I really want to be creating the > cluster using the 40GbE network hostnames and set the admin node name to > the name of the 1GbE network interface? > > (there's actually also an FDR switch in there somewhere for verbs as well) > > Thanks > > Simon > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Wed Sep 3 18:27:44 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Wed, 03 Sep 2014 18:27:44 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring Message-ID: <54074F90.7000303@ebi.ac.uk> Hello everybody, here i come here again, this time to ask some hint about how to monitor GPFS. 
I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is that they return number based only on the request done in the current host, so i have to run them on all the clients ( over 600 nodes) so its quite unpractical. Instead i would like to know from the servers whats going on, and i came across the vio_s statistics wich are less documented and i dont know exacly what they mean. There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that runs VIO_S. My problems with the output of this command: echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second timestamp: 1409763206/477366 recovery group: * declustered array: * vdisk: * client reads: 2584229 client short writes: 55299693 client medium writes: 190071 client promoted full track writes: 465145 client full track writes: 9249 flushed update writes: 4187708 flushed promoted full track writes: 123 migrate operations: 114 scrub operations: 450590 log writes: 28509602 it sais "VIOPS per second", but they seem to me just counters as every time i re-run the command, the numbers increase by a bit.. Can anyone confirm if those numbers are counter or if they are OPS/sec. On a closer eye about i dont understand what most of thosevalues mean. For example, what exacly are "flushed promoted full track write" ?? I tried to find a documentation about this output , but could not find any. can anyone point me a link where output of vio_s is explained? Another thing i dont understand about those numbers is if they are just operations, or the number of blocks that was read/write/etc . I'm asking that because if they are just ops, i don't know how much they could be usefull. For example one write operation could eman write 1 block or write a file of 100GB. If those are oprations, there is a way to have the oupunt in bytes or blocks? Last but not least.. and this is what i really would like to accomplish, i would to be able to monitor the latency of metadata operations. In my environment there are users that litterally overhelm our storages with metadata request, so even if there is no massive throughput or huge waiters, any "ls" could take ages. I would like to be able to monitor metadata behaviour. There is a way to to do that from the NSD servers? Thanks in advance for any tip/help. Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Wed Sep 3 21:55:14 2014 From: chekh at stanford.edu (Alex Chekholko) Date: Wed, 03 Sep 2014 13:55:14 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54074F90.7000303@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> Message-ID: <54078032.2050605@stanford.edu> The usual way to do that is to re-architect your filesystem so that the system pool is metadata-only, and then you can just look at the storage layer and see total metadata throughput that way. Otherwise your metadata ops are mixed in with your data ops. Of course, both NSDs and clients also have metadata caches. On 09/03/2014 10:27 AM, Salvatore Di Nardo wrote: > > Last but not least.. and this is what i really would like to accomplish, > i would to be able to monitor the latency of metadata operations. > In my environment there are users that litterally overhelm our storages > with metadata request, so even if there is no massive throughput or huge > waiters, any "ls" could take ages. I would like to be able to monitor > metadata behaviour. There is a way to to do that from the NSD servers? 
-- Alex Chekholko chekh at stanford.edu 347-401-4860 From oehmes at us.ibm.com Thu Sep 4 01:50:25 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Wed, 3 Sep 2014 17:50:25 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54074F90.7000303@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> Message-ID: > Hello everybody, Hi > here i come here again, this time to ask some hint about how to monitor GPFS. > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > that they return number based only on the request done in the > current host, so i have to run them on all the clients ( over 600 > nodes) so its quite unpractical. Instead i would like to know from > the servers whats going on, and i came across the vio_s statistics > wich are less documented and i dont know exacly what they mean. > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > runs VIO_S. > > My problems with the output of this command: > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > timestamp: 1409763206/477366 > recovery group: * > declustered array: * > vdisk: * > client reads: 2584229 > client short writes: 55299693 > client medium writes: 190071 > client promoted full track writes: 465145 > client full track writes: 9249 > flushed update writes: 4187708 > flushed promoted full track writes: 123 > migrate operations: 114 > scrub operations: 450590 > log writes: 28509602 > > it sais "VIOPS per second", but they seem to me just counters as > every time i re-run the command, the numbers increase by a bit.. > Can anyone confirm if those numbers are counter or if they are OPS/sec. the numbers are accumulative so everytime you run them they just show the value since start (or last reset) time. > > On a closer eye about i dont understand what most of thosevalues > mean. For example, what exacly are "flushed promoted full track write" ?? > I tried to find a documentation about this output , but could not > find any. can anyone point me a link where output of vio_s is explained? > > Another thing i dont understand about those numbers is if they are > just operations, or the number of blocks that was read/write/etc . its just operations and if i would explain what the numbers mean i might confuse you even more because this is not what you are really looking for. what you are looking for is what the client io's look like on the Server side, while the VIO layer is the Server side to the disks, so one lever lower than what you are looking for from what i could read out of the description above. so the Layer you care about is the NSD Server layer, which sits on top of the VIO layer (which is essentially the SW RAID Layer in GNR) > I'm asking that because if they are just ops, i don't know how much > they could be usefull. For example one write operation could eman > write 1 block or write a file of 100GB. If those are oprations, > there is a way to have the oupunt in bytes or blocks? there are multiple ways to get infos on the NSD layer, one would be to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts again. the alternative option is to use mmdiag --iohist. 
this shows you a history of the last X numbers of io operations on either the client or the server side like on a client : # mmdiag --iohist === mmdiag: iohist === I/O history: I/O start time RW Buf type disk:sectorNum nSec time ms qTime ms RpcTimes ms Type Device/NSD ID NSD server --------------- -- ----------- ----------------- ----- ------- -------- ----------------- ---- ------------------ --------------- 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.668262 R inode 2:1081373696 8 14.117 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.692019 R inode 2:1064356608 8 14.899 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 14:25:53.707100 R inode 2:1077830152 8 16.499 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.906556 R inode 2:1083476520 8 11.723 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.941441 R inode 2:1069885984 8 11.686 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 you basically see if its a inode , data block , what size it has (in sectors) , which nsd server you did send this request to, etc. 
on the Server side you see the type , which physical disk it goes to and also what size of disk i/o it causes like : 14:26:50.129995 R inode 12:3211886376 64 14.261 0.000 0.000 0.000 pd sdis 14:26:50.137102 R inode 19:3003969520 64 9.004 0.000 0.000 0.000 pd sdad 14:26:50.136116 R inode 55:3591710992 64 11.057 0.000 0.000 0.000 pd sdoh 14:26:50.141510 R inode 21:3066810504 64 5.909 0.000 0.000 0.000 pd sdaf 14:26:50.130529 R inode 89:2962370072 64 17.437 0.000 0.000 0.000 pd sddi 14:26:50.131063 R inode 78:1889457000 64 17.062 0.000 0.000 0.000 pd sdsj 14:26:50.143403 R inode 36:3323035688 64 4.807 0.000 0.000 0.000 pd sdmw 14:26:50.131044 R inode 37:2513579736 128 17.181 0.000 0.000 0.000 pd sddv 14:26:50.138181 R inode 72:3868810400 64 10.951 0.000 0.000 0.000 pd sdbz 14:26:50.138188 R inode 131:2443484784 128 11.792 0.000 0.000 0.000 pd sdug 14:26:50.138003 R inode 102:3696843872 64 11.994 0.000 0.000 0.000 pd sdgp 14:26:50.137099 R inode 145:3370922504 64 13.225 0.000 0.000 0.000 pd sdmi 14:26:50.141576 R inode 62:2668579904 64 9.313 0.000 0.000 0.000 pd sdou 14:26:50.134689 R inode 159:2786164648 64 16.577 0.000 0.000 0.000 pd sdpq 14:26:50.145034 R inode 34:2097217320 64 7.409 0.000 0.000 0.000 pd sdmt 14:26:50.138140 R inode 139:2831038792 64 14.898 0.000 0.000 0.000 pd sdlw 14:26:50.130954 R inode 164:282120312 64 22.274 0.000 0.000 0.000 pd sdzd 14:26:50.137038 R inode 41:3421909608 64 16.314 0.000 0.000 0.000 pd sdef 14:26:50.137606 R inode 104:1870962416 64 16.644 0.000 0.000 0.000 pd sdgx 14:26:50.141306 R inode 65:2276184264 64 16.593 0.000 0.000 0.000 pd sdrk > > Last but not least.. and this is what i really would like to > accomplish, i would to be able to monitor the latency of metadata operations. you can't do this on the server side as you don't know how much time you spend on the client , network or anything between the app and the physical disk, so you can only reliably look at this from the client, the iohist output only shows you the Server disk i/o processing time, but that can be a fraction of the overall time (in other cases this obviously can also be the dominant part depending on your workload). the easiest way on the client is to run mmfsadm vfsstats enable from now on vfs stats are collected until you restart GPFS. then run : vfs statistics currently enabled started at: Fri Aug 29 13:15:05.380 2014 duration: 448446.970 sec name calls time per call total time -------------------- -------- -------------- -------------- statfs 9 0.000002 0.000021 startIO 246191176 0.005853 1441049.976740 to dump what ever you collected so far on this node. > In my environment there are users that litterally overhelm our > storages with metadata request, so even if there is no massive > throughput or huge waiters, any "ls" could take ages. I would like > to be able to monitor metadata behaviour. There is a way to to do > that from the NSD servers? not this simple as described above. > > Thanks in advance for any tip/help. > > Regards, > Salvatore_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From service at metamodul.com Thu Sep 4 11:05:18 2014 From: service at metamodul.com (service at metamodul.com) Date: Thu, 4 Sep 2014 12:05:18 +0200 (CEST) Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54074F90.7000303@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> Message-ID: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> > , any "ls" could take ages. Check if you large directories either with many files or simply large. Verify if you have NFS exported GPFS. Verify that your cache settings on the clients are large enough ( maxStatCache , maxFilesToCache , sharedMemLimit ) Verify that you have dedicated metadata luns ( metadataOnly ) Reference: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters Note: If possible monitor your metadata luns on the storage directly. hth Hajo -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Sep 4 11:43:36 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 11:43:36 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: References: <54074F90.7000303@ebi.ac.uk> Message-ID: <54084258.90508@ebi.ac.uk> On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just show > the value since start (or last reset) time. OK, you confirmed my toughts, thatks > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track > write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. 
> what you are looking for is what the client io's look like on the
> Server side, while the VIO layer is the Server side to the disks, so
> one lever lower than what you are looking for from what i could read
> out of the description above.

No.. what I'm looking at is exactly how busy the disks are in keeping up with the requests. Obviously I'm not looking at just that, but I feel the need to monitor _*also*_ those things. I'll explain why.

It happens, when our storage is quite busy ( 180Gb/s of read/write ), that the FS starts to be slow in normal /*cd*/ or /*ls*/ requests. This might be normal, but in those situations I want to know where the bottleneck is. Is it the server CPU? Memory? Network? Spindles? Knowing where the bottleneck is might help me to understand whether we can tweak the system a bit more. If it's the CPU on the servers, then there is not much to do besides replacing or adding more servers. If it's not the CPU, maybe more memory would help? Maybe it's just the network that has filled up, so I can add more links? Or, if we have reached the point where the bottleneck is the spindles, then there is not much point looking somewhere else: we have simply reached the hardware limit.

Sometimes it also happens that there is very low IO ( 10Gb/s ), almost no CPU usage on the servers, but huge slowness ( ls can take 10 seconds ). Why does that happen? There are not many data ops, but we think there is a huge amount of metadata ops. So what I want to know is whether the metadata vdisks are busy or not. If this is our problem, could some SSD disks dedicated to metadata help?

In particular, I'm a bit puzzled by the design of our GSS storage. Each recovery group has 3 declustered arrays, and each declustered array has 1 data and 1 metadata vdisk, but in the end both metadata and data vdisks use the same spindles. The problem is that I don't understand whether we have a metadata bottleneck there. Maybe some SSD disks in a dedicated declustered array would perform much better, but this is just theory. I really would like to be able to monitor IO activity on the metadata vdisks.

>
> so the Layer you care about is the NSD Server layer, which sits on top
> of the VIO layer (which is essentially the SW RAID Layer in GNR)
>
> > I'm asking that because if they are just ops, i don't know how much
> > they could be usefull. For example one write operation could eman
> > write 1 block or write a file of 100GB. If those are oprations,
> > there is a way to have the oupunt in bytes or blocks?
>
> there are multiple ways to get infos on the NSD layer, one would be to
> use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts
> again.

Counters are not a problem. I can collect them and create some graphs in a monitoring tool. I will check that.

> the alternative option is to use mmdiag --iohist.
this shows you a > history of the last X numbers of io operations on either the client or > the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms qTime > ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 > 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 > 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 > 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 > 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 > 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 > 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 > 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 > 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. 
> > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > mmdiag --iohist its another think i looked at it, but i could not find good explanation for all the "buf type" ( third column ) allocSeg data iallocSeg indBlock inode LLIndBlock logData logDesc logWrap metadata vdiskAULog vdiskBuf vdiskFWLog vdiskMDLog vdiskMeta vdiskRGDesc If i want to monifor metadata operation whan should i look at? just the metadata flag or also inode? this command takes also long to run, especially if i run it a second time it hangs for a lot before to rerun again, so i'm not sure that run it every 30secs or minute its viable, but i will look also into that. THere is any documentation that descibes clearly the whole output? what i found its quite generic and don't go into details... > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and the > physical disk, so you can only reliably look at this from the client, > the iohist output only shows you the Server disk i/o processing time, > but that can be a fraction of the overall time (in other cases this > obviously can also be the dominant part depending on your workload). > > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. 
>
We already do that, but as I said, I want to check specifically how the GSS servers are keeping up with the requests, to identify or exclude server-side bottlenecks.

Thanks for your help, you have definitely given me a few things to look at.

Salvatore
-------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Sep 4 11:58:51 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 11:58:51 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> References: <54074F90.7000303@ebi.ac.uk> <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> Message-ID: <540845EB.1020202@ebi.ac.uk>

A little clarification: the filesystem is not always slow. It becomes very slow with particular users' jobs in the farm. Maybe that is just an indication that we have a huge amount of metadata requests; that's why I want to be able to monitor them.

On 04/09/14 11:05, service at metamodul.com wrote:
> > , any "ls" could take ages.
> Check if you large directories either with many files or simply large.

It happens that the files are very large ( over 100G ), but usually there are not many files.

> Verify if you have NFS exported GPFS.

No NFS.

> Verify that your cache settings on the clients are large enough (
> maxStatCache , maxFilesToCache , sharedMemLimit )

Will look at them, but I'm not sure what the best numbers will be on the client. Obviously I cannot use all the memory of the client, because those clients are meant to run jobs....

> Verify that you have dedicated metadata luns ( metadataOnly )

Yes, we have dedicated vdisks for metadata, but they are in the same declustered arrays/recoverygroups, so they share the same spindles.

> Reference:
> https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters
>
> Note:
> If possible monitor your metadata luns on the storage directly.

That's exactly what I'm trying to do !!!! :-D

> hth
> Hajo
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Thu Sep 4 13:04:21 2014 From: service at metamodul.com (service at metamodul.com) Date: Thu, 4 Sep 2014 14:04:21 +0200 (CEST) Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540845EB.1020202@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> <540845EB.1020202@ebi.ac.uk> Message-ID: <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de>

... , any "ls" could take ages.
>Check if you large directories either with many files or simply large.
>> It happens that the files are very large ( over 100G ), but usually
>> there are not many files.

Please check that the directory size is not large. In the worst case you have a directory 10MB in size that contains only one file. In any case GPFS must fetch the whole directory structure, which might cause unnecessary IO. Hence my request that you check your directory sizes.

>Verify that your cache settings on the clients are large enough ( maxStatCache
>, maxFilesToCache , sharedMemLimit )
>> Will look at them, but I'm not sure what the best numbers will be on the
>> client. Obviously I cannot use all the memory of the client, because those
>> clients are meant to run jobs....

Use lsof on the client to determine the number of open files. mmdiag --stats ( from memory ) shows a little bit about the cache usage. maxStatCache does not use that much memory.

> Verify that you have dedicated metadata luns ( metadataOnly )
>> Yes, we have dedicated vdisks for metadata, but they are in the same
>> declustered arrays/recoverygroups, so they share the same spindles.

That's IMHO not a good approach. Metadata operations are small and random; data IO is large and streaming.

Just imagine you have a highway full of large trucks and you try to get to your destination on a high-speed bike. You will be blocked. You have the same problem at your destination: if many large trucks want to unload their stuff, there is no time for somebody with a small parcel.

That's the same reason why you should not access tape storage and disk storage via the same FC adapter. ( Streaming IO versus random/small IO )

So even without your current problem and motivation for measuring, I would strongly suggest having at least dedicated SSDs for metadata and, if possible, even dedicated NSD servers for the metadata. Meaning: have a dedicated path for your data and a dedicated path for your metadata.

All from a user's point of view
Hajo
-------------- next part -------------- An HTML attachment was scrubbed... URL: From sdinardo at ebi.ac.uk Thu Sep 4 14:25:09 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 14:25:09 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de> References: <54074F90.7000303@ebi.ac.uk> <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de> <540845EB.1020202@ebi.ac.uk> <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de> Message-ID: <54086835.6050603@ebi.ac.uk>

> >> Yes, we have dedicated vdisks for metadata, but they are in the same
> >> declustered arrays/recoverygroups, so they share the same spindles.
>
> That's IMHO not a good approach. Metadata operations are small and
> random; data IO is large and streaming.
>
> Just imagine you have a highway full of large trucks and you try to get
> to your destination on a high-speed bike. You will be blocked.
> You have the same problem at your destination: if many large trucks
> want to unload their stuff, there is no time for somebody with a
> small parcel.
>
> That's the same reason why you should not access tape storage and disk
> storage via the same FC adapter. ( Streaming IO versus
> random/small IO )
>
> So even without your current problem and motivation for measuring, I
> would strongly suggest having at least dedicated SSDs for metadata and,
> if possible, even dedicated NSD servers for the metadata.
> Meaning: have a dedicated path for your data and a dedicated path for
> your metadata.
>
> All from a user's point of view
> Hajo
>

That's where I was puzzled too. GSS is a GPFS appliance and came configured this way. Also, the official GSS documentation suggests creating separate vdisks for data and metadata, but in the same declustered arrays. I always felt this was a strange choice, especially if we consider that metadata requires a very small amount of space, so a few SSDs could do the trick....
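As a rough sketch of the counter-graphing approach discussed above, something like the following could poll the cumulative vio_s counters through mmpmon and print the per-interval deltas that a monitoring tool would ingest. It assumes the "-p" machine-readable output is the flat stream of "_key_ value" tokens that the sample dstat plugin also parses, and the counter names themselves vary between GPFS/GNR levels, so treat it as a starting point rather than a finished tool:

import subprocess, time

MMPMON = '/usr/lpp/mmfs/bin/mmpmon'

def vio_counters():
    # one-shot call, same idea as: echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -p -r 1
    p = subprocess.Popen([MMPMON, '-p', '-r', '1'],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    tokens = p.communicate(b'vio_s\n')[0].decode().split()
    vals = {}
    for i in range(len(tokens) - 1):
        # keep only "_key_ <integer>" pairs, skipping node names, IP addresses, etc.
        if tokens[i].startswith('_') and tokens[i + 1].isdigit():
            vals[tokens[i]] = int(tokens[i + 1])
    return vals

prev = vio_counters()
while True:
    time.sleep(30)                      # sample interval in seconds
    cur = vio_counters()
    for key in sorted(cur):
        delta = cur[key] - prev.get(key, 0)
        if delta:
            print('%-10s %d' % (key.strip('_'), delta))
    print('---')
    prev = cur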
From sdinardo at ebi.ac.uk Thu Sep 4 14:32:15 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 14:32:15 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: References: <54074F90.7000303@ebi.ac.uk> Message-ID: <540869DF.5060100@ebi.ac.uk> Sorry to bother you again but dstat have some issues with the plugin: [root at gss01a util]# dstat --gpfs /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) Module dstat_gpfs failed to load. (global name 'select' is not defined) None of the stats you selected are available. I found this solution , but involve dstat recompile.... https://github.com/dagwieers/dstat/issues/44 Are you aware about any easier solution (we use RHEL6.3) ? Regards, Salvatore On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just show > the value since start (or last reset) time. > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track > write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. > what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. > > so the Layer you care about is the NSD Server layer, which sits on top > of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. 
For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be to > use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts > again. > > the alternative option is to use mmdiag --iohist. this shows you a > history of the last X numbers of io operations on either the client or > the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms qTime > ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 > 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 > 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 > 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 > 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 > 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 > 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 > 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 > 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. 
> > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and the > physical disk, so you can only reliably look at this from the client, > the iohist output only shows you the Server disk i/o processing time, > but that can be a fraction of the overall time (in other cases this > obviously can also be the dominant part depending on your workload). > > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. > > > In my environment there are users that litterally overhelm our > > storages with metadata request, so even if there is no massive > > throughput or huge waiters, any "ls" could take ages. I would like > > to be able to monitor metadata behaviour. There is a way to to do > > that from the NSD servers? > > not this simple as described above. > > > > > Thanks in advance for any tip/help. 
> > > > Regards, > > Salvatore_______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at gpfsug.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Sep 4 14:54:37 2014 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 04 Sep 2014 14:54:37 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540869DF.5060100@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> Message-ID: <54086F1D.1000401@ed.ac.uk> On 04/09/14 14:32, Salvatore Di Nardo wrote: > Sorry to bother you again but dstat have some issues with the plugin: > > [root at gss01a util]# dstat --gpfs > /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is > deprecated. Use the subprocess module. > pipes[cmd] = os.popen3(cmd, 't', 0) > Module dstat_gpfs failed to load. (global name 'select' is not > defined) > None of the stats you selected are available. > > I found this solution , but involve dstat recompile.... > > https://github.com/dagwieers/dstat/issues/44 > > Are you aware about any easier solution (we use RHEL6.3) ? > This worked for me the other day on a dev box I was poking at: # rm /usr/share/dstat/dstat_gpfsops* # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 /usr/share/dstat/dstat_gpfsops.py # dstat --gpfsops /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- cr del op/cl rd wr trunc fsync looku gattr sattr other mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... > > Regards, > Salvatore > > On 04/09/14 01:50, Sven Oehme wrote: >> > Hello everybody, >> >> Hi >> >> > here i come here again, this time to ask some hint about how to >> monitor GPFS. >> > >> > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is >> > that they return number based only on the request done in the >> > current host, so i have to run them on all the clients ( over 600 >> > nodes) so its quite unpractical. Instead i would like to know from >> > the servers whats going on, and i came across the vio_s statistics >> > wich are less documented and i dont know exacly what they mean. >> > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that >> > runs VIO_S. >> > >> > My problems with the output of this command: >> > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 >> > >> > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second >> > timestamp: 1409763206/477366 >> > recovery group: * >> > declustered array: * >> > vdisk: * >> > client reads: 2584229 >> > client short writes: 55299693 >> > client medium writes: 190071 >> > client promoted full track writes: 465145 >> > client full track writes: 9249 >> > flushed update writes: 4187708 >> > flushed promoted full track writes: 123 >> > migrate operations: 114 >> > scrub operations: 450590 >> > log writes: 28509602 >> > >> > it sais "VIOPS per second", but they seem to me just counters as >> > every time i re-run the command, the numbers increase by a bit.. 
>> > Can anyone confirm if those numbers are counter or if they are OPS/sec. >> >> the numbers are accumulative so everytime you run them they just show >> the value since start (or last reset) time. >> >> > >> > On a closer eye about i dont understand what most of thosevalues >> > mean. For example, what exacly are "flushed promoted full track >> write" ?? >> > I tried to find a documentation about this output , but could not >> > find any. can anyone point me a link where output of vio_s is explained? >> > >> > Another thing i dont understand about those numbers is if they are >> > just operations, or the number of blocks that was read/write/etc . >> >> its just operations and if i would explain what the numbers mean i >> might confuse you even more because this is not what you are really >> looking for. >> what you are looking for is what the client io's look like on the >> Server side, while the VIO layer is the Server side to the disks, so >> one lever lower than what you are looking for from what i could read >> out of the description above. >> >> so the Layer you care about is the NSD Server layer, which sits on top >> of the VIO layer (which is essentially the SW RAID Layer in GNR) >> >> > I'm asking that because if they are just ops, i don't know how much >> > they could be usefull. For example one write operation could eman >> > write 1 block or write a file of 100GB. If those are oprations, >> > there is a way to have the oupunt in bytes or blocks? >> >> there are multiple ways to get infos on the NSD layer, one would be to >> use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts >> again. >> >> the alternative option is to use mmdiag --iohist. this shows you a >> history of the last X numbers of io operations on either the client or >> the server side like on a client : >> >> # mmdiag --iohist >> >> === mmdiag: iohist === >> >> I/O history: >> >> I/O start time RW Buf type disk:sectorNum nSec time ms qTime >> ms RpcTimes ms Type Device/NSD ID NSD server >> --------------- -- ----------- ----------------- ----- ------- >> -------- ----------------- ---- ------------------ --------------- >> 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 >> 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:22.182723 R inode 1:1071252480 8 6.970 0.000 >> 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 >> 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.668262 R inode 2:1081373696 8 14.117 >> 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 >> 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.692019 R inode 2:1064356608 8 14.899 >> 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:53.707100 R inode 2:1077830152 8 16.499 >> 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 >> 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:53.728082 R inode 2:1081918976 8 7.760 0.000 >> 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.877416 R metadata 2:678978560 16 13.343 0.000 >> 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 >> 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.906556 R inode 2:1083476520 8 11.723 >> 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 >> 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 >> 
14:25:57.926592 R inode 1:1076503480 8 8.087 0.000 >> 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 >> 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.941441 R inode 2:1069885984 8 11.686 >> 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.953294 R inode 2:1083476936 8 8.951 0.000 >> 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.965475 R inode 1:1076503504 8 0.477 0.000 >> 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 >> 14:25:57.965755 R inode 2:1083476488 8 0.410 0.000 >> 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 >> 14:25:57.965787 R inode 2:1083476512 8 0.439 0.000 >> 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 >> >> you basically see if its a inode , data block , what size it has (in >> sectors) , which nsd server you did send this request to, etc. >> >> on the Server side you see the type , which physical disk it goes to >> and also what size of disk i/o it causes like : >> >> 14:26:50.129995 R inode 12:3211886376 64 14.261 >> 0.000 0.000 0.000 pd sdis >> 14:26:50.137102 R inode 19:3003969520 64 9.004 >> 0.000 0.000 0.000 pd sdad >> 14:26:50.136116 R inode 55:3591710992 64 11.057 >> 0.000 0.000 0.000 pd sdoh >> 14:26:50.141510 R inode 21:3066810504 64 5.909 >> 0.000 0.000 0.000 pd sdaf >> 14:26:50.130529 R inode 89:2962370072 64 17.437 >> 0.000 0.000 0.000 pd sddi >> 14:26:50.131063 R inode 78:1889457000 64 17.062 >> 0.000 0.000 0.000 pd sdsj >> 14:26:50.143403 R inode 36:3323035688 64 4.807 >> 0.000 0.000 0.000 pd sdmw >> 14:26:50.131044 R inode 37:2513579736 128 17.181 >> 0.000 0.000 0.000 pd sddv >> 14:26:50.138181 R inode 72:3868810400 64 10.951 >> 0.000 0.000 0.000 pd sdbz >> 14:26:50.138188 R inode 131:2443484784 128 11.792 >> 0.000 0.000 0.000 pd sdug >> 14:26:50.138003 R inode 102:3696843872 64 11.994 >> 0.000 0.000 0.000 pd sdgp >> 14:26:50.137099 R inode 145:3370922504 64 13.225 >> 0.000 0.000 0.000 pd sdmi >> 14:26:50.141576 R inode 62:2668579904 64 9.313 >> 0.000 0.000 0.000 pd sdou >> 14:26:50.134689 R inode 159:2786164648 64 16.577 >> 0.000 0.000 0.000 pd sdpq >> 14:26:50.145034 R inode 34:2097217320 64 7.409 >> 0.000 0.000 0.000 pd sdmt >> 14:26:50.138140 R inode 139:2831038792 64 14.898 >> 0.000 0.000 0.000 pd sdlw >> 14:26:50.130954 R inode 164:282120312 64 22.274 >> 0.000 0.000 0.000 pd sdzd >> 14:26:50.137038 R inode 41:3421909608 64 16.314 >> 0.000 0.000 0.000 pd sdef >> 14:26:50.137606 R inode 104:1870962416 64 16.644 >> 0.000 0.000 0.000 pd sdgx >> 14:26:50.141306 R inode 65:2276184264 64 16.593 >> 0.000 0.000 0.000 pd sdrk >> >> >> > >> > Last but not least.. and this is what i really would like to >> > accomplish, i would to be able to monitor the latency of metadata >> operations. >> >> you can't do this on the server side as you don't know how much time >> you spend on the client , network or anything between the app and the >> physical disk, so you can only reliably look at this from the client, >> the iohist output only shows you the Server disk i/o processing time, >> but that can be a fraction of the overall time (in other cases this >> obviously can also be the dominant part depending on your workload). >> >> the easiest way on the client is to run >> >> mmfsadm vfsstats enable >> from now on vfs stats are collected until you restart GPFS. 
>> >> then run : >> >> vfs statistics currently enabled >> started at: Fri Aug 29 13:15:05.380 2014 >> duration: 448446.970 sec >> >> name calls time per call total time >> -------------------- -------- -------------- -------------- >> statfs 9 0.000002 0.000021 >> startIO 246191176 0.005853 1441049.976740 >> >> to dump what ever you collected so far on this node. >> >> > In my environment there are users that litterally overhelm our >> > storages with metadata request, so even if there is no massive >> > throughput or huge waiters, any "ls" could take ages. I would like >> > to be able to monitor metadata behaviour. There is a way to to do >> > that from the NSD servers? >> >> not this simple as described above. >> >> > >> > Thanks in advance for any tip/help. >> > >> > Regards, >> > Salvatore_______________________________________________ >> > gpfsug-discuss mailing list >> > gpfsug-discuss at gpfsug.org >> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Research Facilities (ECDF) Systems Leader Information Services IT Infrastructure Division Tel: 0131 650 4994 skype: orlando.richards The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From sdinardo at ebi.ac.uk Thu Sep 4 15:07:42 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 15:07:42 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54086F1D.1000401@ed.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> Message-ID: <5408722E.6060309@ebi.ac.uk> On 04/09/14 14:54, Orlando Richards wrote: > > > On 04/09/14 14:32, Salvatore Di Nardo wrote: >> Sorry to bother you again but dstat have some issues with the plugin: >> >> [root at gss01a util]# dstat --gpfs >> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is >> deprecated. Use the subprocess module. >> pipes[cmd] = os.popen3(cmd, 't', 0) >> Module dstat_gpfs failed to load. (global name 'select' is not >> defined) >> None of the stats you selected are available. >> >> I found this solution , but involve dstat recompile.... >> >> https://github.com/dagwieers/dstat/issues/44 >> >> Are you aware about any easier solution (we use RHEL6.3) ? >> > > This worked for me the other day on a dev box I was poking at: > > # rm /usr/share/dstat/dstat_gpfsops* > > # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 > /usr/share/dstat/dstat_gpfsops.py > > # dstat --gpfsops > /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use > the subprocess module. > pipes[cmd] = os.popen3(cmd, 't', 0) > ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- > > cr del op/cl rd wr trunc fsync looku gattr sattr other > mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w > 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 > > ... > NICE!! 
The only problem is that the box seems lacking those python scripts: ls /usr/lpp/mmfs/samples/util/ makefile README tsbackup tsbackup.C tsbackup.h tsfindinode tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c tslistall tsreaddir tsreaddir.c tstimes tstimes.c Do you mind sending me those py files? They should be 3 as i see e gpfs options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) Regards, Salvatore -------------- next part -------------- An HTML attachment was scrubbed... URL: From orlando.richards at ed.ac.uk Thu Sep 4 15:14:02 2014 From: orlando.richards at ed.ac.uk (Orlando Richards) Date: Thu, 04 Sep 2014 15:14:02 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <5408722E.6060309@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> <5408722E.6060309@ebi.ac.uk> Message-ID: <540873AA.5070401@ed.ac.uk> On 04/09/14 15:07, Salvatore Di Nardo wrote: > > On 04/09/14 14:54, Orlando Richards wrote: >> >> >> On 04/09/14 14:32, Salvatore Di Nardo wrote: >>> Sorry to bother you again but dstat have some issues with the plugin: >>> >>> [root at gss01a util]# dstat --gpfs >>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is >>> deprecated. Use the subprocess module. >>> pipes[cmd] = os.popen3(cmd, 't', 0) >>> Module dstat_gpfs failed to load. (global name 'select' is not >>> defined) >>> None of the stats you selected are available. >>> >>> I found this solution , but involve dstat recompile.... >>> >>> https://github.com/dagwieers/dstat/issues/44 >>> >>> Are you aware about any easier solution (we use RHEL6.3) ? >>> >> >> This worked for me the other day on a dev box I was poking at: >> >> # rm /usr/share/dstat/dstat_gpfsops* >> >> # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 >> /usr/share/dstat/dstat_gpfsops.py >> >> # dstat --gpfsops >> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use >> the subprocess module. >> pipes[cmd] = os.popen3(cmd, 't', 0) >> ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- >> >> cr del op/cl rd wr trunc fsync looku gattr sattr other >> mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w >> 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 >> >> ... >> > > NICE!! The only problem is that the box seems lacking those python scripts: > > ls /usr/lpp/mmfs/samples/util/ > makefile README tsbackup tsbackup.C tsbackup.h tsfindinode > tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c > tslistall tsreaddir tsreaddir.c tstimes tstimes.c > It came from the gpfs.base rpm: # rpm -qf /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 gpfs.base-3.5.0-13.x86_64 > Do you mind sending me those py files? They should be 3 as i see e gpfs > options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) > Only the gpfsops.py is included in the bundle - one for dstat 0.6 and one for dstat 0.7. I've attached it to this mail as well (it seems to be GPL'd). > Regards, > Salvatore > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- Dr Orlando Richards Research Facilities (ECDF) Systems Leader Information Services IT Infrastructure Division Tel: 0131 650 4994 skype: orlando.richards The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
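For reference, the header comments of the attached plugin also document environment variables for choosing which counter groups are shown; the examples given in the file itself are along the lines of:

DSTAT_GPFS_WHAT=vfs,lroc dstat -M gpfsops
DSTAT_GPFS_LIST=1 dstat -M gpfsops

The first restricts the columns to the vfs_s and lroc_s counters, the second prints the available counter names and the current selection; the "dstat --gpfsops" form used above should behave the same way.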
-------------- next part -------------- # # Copyright (C) 2009, 2010 IBM Corporation # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2, or (at your option) # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. # global string, select, os, re, fnmatch import string, select, os, re, fnmatch # Dstat class to display selected gpfs performance counters returned by the # mmpmon "vfs_s", "ioc_s", "vio_s", "vflush_s", and "lroc_s" commands. # # The set of counters displayed can be customized via environment variables: # # DSTAT_GPFS_WHAT # # Selects which of the five mmpmon commands to display. # It is a comma separated list of any of the following: # "vfs": show mmpmon "vfs_s" counters # "ioc": show mmpmon "ioc_s" counters related to NSD client I/O # "nsd": show mmpmon "ioc_s" counters related to NSD server I/O # "vio": show mmpmon "vio_s" counters # "vflush": show mmpmon "vflush_s" counters # "lroc": show mmpmon "lroc_s" counters # "all": equivalent to specifying all of the above # # Example: # # DSTAT_GPFS_WHAT=vfs,lroc dstat -M gpfsops # # will display counters for mmpmon "vfs_s" and "lroc" commands. # # The default setting is "vfs,ioc", i.e., by default only "vfs_s" and NSD # client related "ioc_s" counters are displayed. # # DSTAT_GPFS_VFS # DSTAT_GPFS_IOC # DSTAT_GPFS_VIO # DSTAT_GPFS_VFLUSH # DSTAT_GPFS_LROC # # Allow finer grain control over exactly which values will be displayed for # each of the five mmpmon commands. Each variable is a comma separated list # of counter names with optional column header string. # # Example: # # export DSTAT_GPFS_VFS='create, remove, rd/wr=read+write' # export DSTAT_GPFS_IOC='sync*, logwrap*, oth_rd=*_rd, oth_wr=*_wr' # dstat -M gpfsops # # Under "vfs-ops" this will display three columns, showing creates, deletes # (removes), and a third column labelled "rd/wr" with a combined count of # read and write operations. # Under "disk-i/o" it will display four columns, showing all disk I/Os # initiated by sync, and log wrap, plus two columns labeled "oth_rd" and # "oth_wr" showing counts of all other disk reads and disk writes, # respectively. # # Note: setting one of these environment variables overrides the # corrosponding setting in DSTAT_GPFS_WHAT. For example, setting # DSTAT_GPFS_VFS="" will omit all "vfs_s" counters regardless of whether # "vfs" appears in DSTAT_GPFS_WHAT or not. # # Counter sets are specified as a comma-separated list of entries of one # of the following forms # # counter # label = counter # label = counter1 + counter2 + ... # # If no label is specified, the name of the counter is used as the column # header (truncated to 5 characters). # Counter names may contain shell-style wildcards. For example, the # pattern "sync*" matches the two ioc_s counters "sync_rd" and "sync_wr" and # therefore produce a column containing the combined count of disk reads and # disk writes initiated by sync. 
If a counter appears in or matches a name # pattern in more than one entry, it is included only in the count under the # first entry in which it appears. For example, adding an entry "other = *" # at the end of the list will add a column labeled "other" that shows the # sum of all counter values *not* included in any of the previous columns. # # DSTAT_GPFS_LIST=1 dstat -M gpfsops # # This will show all available counter names and the default definition # for which sets of counter values are displayed. # # An alternative to setting environment variables is to create a file # ~/.dstat_gpfs_rc # with python statements that sets any of the following variables # vfs_wanted: equivalent to setting DSTAT_GPFS_VFS # ioc_wanted: equivalent to setting DSTAT_GPFS_IOC # vio_wanted: equivalent to setting DSTAT_GPFS_VIO # vflush_wanted: equivalent to setting DSTAT_GPFS_VFLUSH # lroc_wanted: equivalent to setting DSTAT_GPFS_LROC # # For example, the following ~/.dstat_gpfs_rc file will produce the same # result as the environment variables in the example above: # # vfs_wanted = 'create, remove, rd/wr=read+write' # ioc_wanted = 'sync*, logwrap*, oth_rd=*_rd, oth_wr=*_wr' # # See also the default vfs_wanted, ioc_wanted, and vio_wanted settings in # the dstat_gpfsops __init__ method below. class dstat_plugin(dstat): def __init__(self): # list of all stats counters returned by mmpmon "vfs_s", "ioc_s", "vio_s", "vflush_s", and "lroc_s" # always ignore the first few chars like : io_s _io_s_ _n_ 172.31.136.2 _nn_ mgmt001st001 _rc_ 0 _t_ 1322526286 _tu_ 415518 vfs_keys = ('_access_', '_close_', '_create_', '_fclear_', '_fsync_', '_fsync_range_', '_ftrunc_', '_getattr_', '_link_', '_lockctl_', '_lookup_', '_map_lloff_', '_mkdir_', '_mknod_', '_open_', '_read_', '_write_', '_mmapRead_', '_mmapWrite_', '_aioRead_', '_aioWrite_','_readdir_', '_readlink_', '_readpage_', '_remove_', '_rename_', '_rmdir_', '_setacl_', '_setattr_', '_symlink_', '_unmap_', '_writepage_', '_tsfattr_', '_tsfsattr_', '_flock_', '_setxattr_', '_getxattr_', '_listxattr_', '_removexattr_', '_encode_fh_', '_decode_fh_', '_get_dentry_', '_get_parent_', '_mount_', '_statfs_', '_sync_', '_vget_') ioc_keys = ('_other_rd_', '_other_wr_','_mb_rd_', '_mb_wr_', '_steal_rd_', '_steal_wr_', '_cleaner_rd_', '_cleaner_wr_', '_sync_rd_', '_sync_wr_', '_logwrap_rd_', '_logwrap_wr_', '_revoke_rd_', '_revoke_wr_', '_prefetch_rd_', '_prefetch_wr_', '_logdata_rd_', '_logdata_wr_', '_nsdworker_rd_', '_nsdworker_wr_','_nsdlocal_rd_','_nsdlocal_wr_', '_vdisk_rd_','_vdisk_wr_', '_pdisk_rd_','_pdisk_wr_', '_logtip_rd_', '_logtip_wr_') vio_keys = ('_r_', '_sw_', '_mw_', '_pfw_', '_ftw_', '_fuw_', '_fpw_', '_m_', '_s_', '_l_', '_rgd_', '_meta_') vflush_keys = ('_ndt_', '_ngdb_', '_nfwlmb_', '_nfipt_', '_nfwwt_', '_ahwm_', '_susp_', '_uwrttf_', '_fftc_', '_nalth_', '_nasth_', '_nsigth_', '_ntgtth_') lroc_keys = ('_Inode_s_', '_Inode_sf_', '_Inode_smb_', '_Inode_r_', '_Inode_rf_', '_Inode_rmb_', '_Inode_i_', '_Inode_imb_', '_Directory_s_', '_Directory_sf_', '_Directory_smb_', '_Directory_r_', '_Directory_rf_', '_Directory_rmb_', '_Directory_i_', '_Directory_imb_', '_Data_s_', '_Data_sf_', '_Data_smb_', '_Data_r_', '_Data_rf_', '_Data_rmb_', '_Data_i_', '_Data_imb_', '_agt_i_', '_agt_i_rm_', '_agt_i_rM_', '_agt_i_ra_', '_agt_r_', '_agt_r_rm_', '_agt_r_rM_', '_agt_r_ra_', '_ssd_w_', '_ssd_w_p_', '_ssd_w_rm_', '_ssd_w_rM_', '_ssd_w_ra_', '_ssd_r_', '_ssd_r_p_', '_ssd_r_rm_', '_ssd_r_rM_', '_ssd_r_ra_') # Default counters to display for each mmpmon category vfs_wanted = '''cr 
= create + mkdir + link + symlink, del = remove + rmdir, op/cl = open + close + map_lloff + unmap, rd = read + readdir + readlink + mmapRead + readpage + aioRead + aioWrite, wr = write + mmapWrite + writepage, trunc = ftrunc + fclear, fsync = fsync + fsync_range, lookup, gattr = access + getattr + getxattr + getacl, sattr = setattr + setxattr + setacl, other = * ''' ioc_wanted1 = '''mb_rd, mb_wr, pref=prefetch_rd, wrbeh=prefetch_wr, steal*, cleaner*, sync*, revoke*, logwrap*, logdata*, oth_rd = other_rd, oth_wr = other_wr ''' ioc_wanted2 = '''rns_r=nsdworker_rd, rns_w=nsdworker_wr, lns_r=nsdlocal_rd, lns_w=nsdlocal_wr, vd_r=vdisk_rd, vd_w=vdisk_wr, pd_r=pdisk_rd, pd_w=pdisk_wr, ''' vio_wanted = '''ClRead=r, ClShWr=sw, ClMdWr=mw, ClPFTWr=pfw, ClFTWr=ftw, FlUpWr=fuw, FlPFTWr=fpw, Migrte=m, Scrub=s, LgWr=l, RGDsc=rgd, Meta=meta ''' vflush_wanted = '''DiTrk = ndt, DiBuf = ngdb, FwLog = nfwlmb, FinPr = nfipt, WraTh = nfwwt, HiWMa = ahwm, Suspd = susp, WrThF = uwrttf, Force = fftc, TrgTh = ntgtth, other = nalth + nasth + nsigth ''' lroc_wanted = '''StorS = Inode_s + Directory_s + Data_s, StorF = Inode_sf + Directory_sf + Data_sf, FetcS = Inode_r + Directory_r + Data_r, FetcF = Inode_rf + Directory_rf + Data_rf, InVAL = Inode_i + Directory_i + Data_i ''' # Coarse counter selection via DSTAT_GPFS_WHAT if 'DSTAT_GPFS_WHAT' in os.environ: what_wanted = os.environ['DSTAT_GPFS_WHAT'].split(',') else: what_wanted = [ 'vfs', 'ioc' ] # If ".dstat_gpfs_rc" exists in user's home directory, run it. # Otherwise, use DSTAT_GPFS_WHAT for counter selection and look for other # DSTAT_GPFS_XXX environment variables for additional customization. userprofile = os.path.join(os.environ['HOME'], '.dstat_gpfs_rc') if os.path.exists(userprofile): ioc_wanted = ioc_wanted1 + ioc_wanted2 exec file(userprofile) else: if 'all' not in what_wanted: if 'vfs' not in what_wanted: vfs_wanted = '' if 'ioc' not in what_wanted: ioc_wanted1 = '' if 'nsd' not in what_wanted: ioc_wanted2 = '' if 'vio' not in what_wanted: vio_wanted = '' if 'vflush' not in what_wanted: vflush_wanted = '' if 'lroc' not in what_wanted: lroc_wanted = '' ioc_wanted = ioc_wanted1 + ioc_wanted2 # Fine grain counter cusomization via DSTAT_GPFS_XXX if 'DSTAT_GPFS_VFS' in os.environ: vfs_wanted = os.environ['DSTAT_GPFS_VFS'] if 'DSTAT_GPFS_IOC' in os.environ: ioc_wanted = os.environ['DSTAT_GPFS_IOC'] if 'DSTAT_GPFS_VIO' in os.environ: vio_wanted = os.environ['DSTAT_GPFS_VIO'] if 'DSTAT_GPFS_VFLUSH' in os.environ: vflush_wanted = os.environ['DSTAT_GPFS_VFLUSH'] if 'DSTAT_GPFS_LROC' in os.environ: lroc_wanted = os.environ['DSTAT_GPFS_LROC'] self.debug = 0 vars1, nick1, keymap1 = self.make_keymap(vfs_keys, vfs_wanted, 'gpfs-vfs-') vars2, nick2, keymap2 = self.make_keymap(ioc_keys, ioc_wanted, 'gpfs-io-') vars3, nick3, keymap3 = self.make_keymap(vio_keys, vio_wanted, 'gpfs-vio-') vars4, nick4, keymap4 = self.make_keymap(vflush_keys, vflush_wanted, 'gpfs-vflush-') vars5, nick5, keymap5 = self.make_keymap(lroc_keys, lroc_wanted, 'gpfs-lroc-') if 'DSTAT_GPFS_LIST' in os.environ or self.debug: self.show_keymap('vfs_s', 'DSTAT_GPFS_VFS', vfs_keys, vfs_wanted, vars1, keymap1, 'gpfs-vfs-') self.show_keymap('ioc_s', 'DSTAT_GPFS_IOC', ioc_keys, ioc_wanted, vars2, keymap2, 'gpfs-io-') self.show_keymap('vio_s', 'DSTAT_GPFS_VIO', vio_keys, vio_wanted, vars3, keymap3, 'gpfs-vio-') self.show_keymap('vflush_stat', 'DSTAT_GPFS_VFLUSH', vflush_keys, vflush_wanted, vars4, keymap4, 'gpfs-vflush-') self.show_keymap('lroc_s', 'DSTAT_GPFS_LROC', lroc_keys, lroc_wanted, vars5, keymap5, 
'gpfs-lroc-') print self.vars = vars1 + vars2 + vars3 + vars4 + vars5 self.varsrate = vars1 + vars2 + vars3 + vars5 self.varsconst = vars4 self.nick = nick1 + nick2 + nick3 + nick4 + nick5 self.vfs_keymap = keymap1 self.ioc_keymap = keymap2 self.vio_keymap = keymap3 self.vflush_keymap = keymap4 self.lroc_keymap = keymap5 names = [] self.addtitle(names, 'gpfs vfs ops', len(vars1)) self.addtitle(names, 'gpfs disk i/o', len(vars2)) self.addtitle(names, 'gpfs vio', len(vars3)) self.addtitle(names, 'gpfs vflush', len(vars4)) self.addtitle(names, 'gpfs lroc', len(vars5)) self.name = '#'.join(names) self.type = 'd' self.width = 5 self.scale = 1000 def make_keymap(self, keys, wanted, prefix): '''Parse the list of counter values to be displayd "keys" is the list of all available counters "wanted" is a string of the form "name1 = key1 + key2 + ..., name2 = key3 + key4 ..." Returns a list of all names found, e.g. ['name1', 'name2', ...], and a dictionary that maps counters to names, e.g., { 'key1': 'name1', 'key2': 'name1', 'key3': 'name2', ... }, ''' vars = [] nick = [] kmap = {} ## print re.split(r'\s*,\s*', wanted.strip()) for n in re.split(r'\s*,\s*', wanted.strip()): l = re.split(r'\s*=\s*', n, 2) if len(l) == 2: v = l[0] kl = re.split(r'\s*\+\s*', l[1]) elif l[0]: v = l[0].strip('*') kl = l else: continue nick.append(v[0:5]) v = prefix + v.replace('/', '-') vars.append(v) for s in kl: for k in keys: if fnmatch.fnmatch(k.strip('_'), s) and k not in kmap: kmap[k] = v return vars, nick, kmap def show_keymap(self, label, envname, keys, wanted, vars, kmap, prefix): 'show available counter names and current counter set definition' linewd = 100 print '\nAvailable counters for "%s":' % label mlen = max([len(k.strip('_')) for k in keys]) ncols = linewd // (mlen + 1) nrows = (len(keys) + ncols - 1) // ncols for r in range(nrows): print ' ', for c in range(ncols): i = c *nrows + r if not i < len(keys): break print keys[i].strip('_').ljust(mlen), print print '\nCurrent counter set selection:' print "\n%s='%s'\n" % (envname, re.sub(r'\s+', '', wanted).strip().replace(',', ', ')) if not vars: return mlen = 5 for v in vars: if v.startswith(prefix): s = v[len(prefix):] else: s = v n = ' %s = ' % s[0:mlen].rjust(mlen) kl = [ k.strip('_') for k in keys if kmap.get(k) == v ] i = 0 while i < len(kl): slen = len(n) + 3 + len(kl[i]) j = i + 1 while j < len(kl) and slen + 3 + len(kl[j]) < linewd: slen += 3 + len(kl[j]) j += 1 print n + ' + '.join(kl[i:j]) i = j n = ' %s + ' % ''.rjust(mlen) def addtitle(self, names, name, ncols): 'pad title given by "name" with minus signs to span "ncols" columns' if ncols == 1: names.append(name.split()[-1].center(6*ncols - 1)) elif ncols > 1: names.append(name.center(6*ncols - 1)) def check(self): 'start mmpmon command' if os.access('/usr/lpp/mmfs/bin/mmpmon', os.X_OK): try: self.stdin, self.stdout, self.stderr = dpopen('/usr/lpp/mmfs/bin/mmpmon -p -s') self.stdin.write('reset\n') readpipe(self.stdout) except IOError: raise Exception, 'Cannot interface with gpfs mmpmon binary' return True raise Exception, 'Needs GPFS mmpmon binary' def extract_vfs(self): 'collect "vfs_s" counter values' self.stdin.write('vfs_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, len(l), 3): try: self.set2[self.vfs_keymap[l[i]]] += long(l[i+1]) except KeyError: pass def extract_ioc(self): 'collect "ioc_s" counter values' self.stdin.write('ioc_s\n') l = [] for line in readpipe(self.stdout): if not line: continue l += line.split() for i in range(11, 
            try:
                self.set2[self.ioc_keymap[l[i]+'rd_']] += long(l[i+1])
            except KeyError:
                pass
            try:
                self.set2[self.ioc_keymap[l[i]+'wr_']] += long(l[i+2])
            except KeyError:
                pass

    def extract_vio(self):
        'collect "vio_s" counter values'
        self.stdin.write('vio_s\n')
        l = []
        for line in readpipe(self.stdout):
            if not line: continue
            l += line.split()
        for i in range(19, len(l), 2):
            try:
                if l[i] in self.vio_keymap:
                    self.set2[self.vio_keymap[l[i]]] += long(l[i+1])
            except KeyError:
                pass

    def extract_vflush(self):
        'collect "vflush_stat" counter values'
        self.stdin.write('vflush_stat\n')
        l = []
        for line in readpipe(self.stdout):
            if not line: continue
            l += line.split()
        for i in range(11, len(l), 2):
            try:
                if l[i] in self.vflush_keymap:
                    self.set2[self.vflush_keymap[l[i]]] += long(l[i+1])
            except KeyError:
                pass

    def extract_lroc(self):
        'collect "lroc_s" counter values'
        self.stdin.write('lroc_s\n')
        l = []
        for line in readpipe(self.stdout):
            if not line: continue
            l += line.split()
        for i in range(11, len(l), 2):
            try:
                if l[i] in self.lroc_keymap:
                    self.set2[self.lroc_keymap[l[i]]] += long(l[i+1])
            except KeyError:
                pass

    def extract(self):
        try:
            for name in self.vars:
                self.set2[name] = 0
            self.extract_ioc()
            self.extract_vfs()
            self.extract_vio()
            self.extract_vflush()
            self.extract_lroc()
            for name in self.varsrate:
                self.val[name] = (self.set2[name] - self.set1[name]) * 1.0 / elapsed
            for name in self.varsconst:
                self.val[name] = self.set2[name]
        except IOError, e:
            for name in self.vars:
                self.val[name] = -1
            ## print 'dstat_gpfs: lost pipe to mmpmon,', e
        except Exception, e:
            for name in self.vars:
                self.val[name] = -1
            print 'dstat_gpfs: exception', e
            if self.debug >= 0:
                self.debug -= 1
        if step == op.delay:
            self.set1.update(self.set2)

From ewahl at osc.edu  Thu Sep  4 15:13:48 2014
From: ewahl at osc.edu (Ed Wahl)
Date: Thu, 4 Sep 2014 14:13:48 +0000
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de>
References: <54074F90.7000303@ebi.ac.uk>,
	<1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de>
Message-ID: 

Another known issue with slow "ls" can be the annoyance that is 'sssd' under
newer OSs (RHEL 6) and configuring it properly for remote auth. I know on my
NSDs I never did, and the first ls in a directory whose cache has expired takes
forever to make all the remote LDAP calls to get the UID info. Bleh.

Ed

________________________________________
From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of service at metamodul.com [service at metamodul.com]
Sent: Thursday, September 04, 2014 6:05 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] gpfs performance monitoring

> , any "ls" could take ages.

Check if you have large directories, either with many files or simply large ones.
Verify whether you have NFS-exported GPFS.
Verify that your cache settings on the clients are large enough ( maxStatCache , maxFilesToCache , sharedMemLimit ).
Verify that you have dedicated metadata LUNs ( metadataOnly ).

Reference:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters

Note: If possible, monitor your metadata LUNs on the storage directly.

hth
Hajo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From sdinardo at ebi.ac.uk Thu Sep 4 15:18:02 2014 From: sdinardo at ebi.ac.uk (Salvatore Di Nardo) Date: Thu, 04 Sep 2014 15:18:02 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540873AA.5070401@ed.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> <5408722E.6060309@ebi.ac.uk> <540873AA.5070401@ed.ac.uk> Message-ID: <5408749A.9080306@ebi.ac.uk> On 04/09/14 15:14, Orlando Richards wrote: > > > On 04/09/14 15:07, Salvatore Di Nardo wrote: >> >> On 04/09/14 14:54, Orlando Richards wrote: >>> >>> >>> On 04/09/14 14:32, Salvatore Di Nardo wrote: >>>> Sorry to bother you again but dstat have some issues with the plugin: >>>> >>>> [root at gss01a util]# dstat --gpfs >>>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is >>>> deprecated. Use the subprocess module. >>>> pipes[cmd] = os.popen3(cmd, 't', 0) >>>> Module dstat_gpfs failed to load. (global name 'select' is not >>>> defined) >>>> None of the stats you selected are available. >>>> >>>> I found this solution , but involve dstat recompile.... >>>> >>>> https://github.com/dagwieers/dstat/issues/44 >>>> >>>> Are you aware about any easier solution (we use RHEL6.3) ? >>>> >>> >>> This worked for me the other day on a dev box I was poking at: >>> >>> # rm /usr/share/dstat/dstat_gpfsops* >>> >>> # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 >>> /usr/share/dstat/dstat_gpfsops.py >>> >>> # dstat --gpfsops >>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use >>> the subprocess module. >>> pipes[cmd] = os.popen3(cmd, 't', 0) >>> ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- >>> >>> >>> cr del op/cl rd wr trunc fsync looku gattr sattr other >>> mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w >>> 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 0 0 0 0 0 0 0 0 0 0 >>> >>> ... >>> >> >> NICE!! The only problem is that the box seems lacking those python >> scripts: >> >> ls /usr/lpp/mmfs/samples/util/ >> makefile README tsbackup tsbackup.C tsbackup.h tsfindinode >> tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c >> tslistall tsreaddir tsreaddir.c tstimes tstimes.c >> > > It came from the gpfs.base rpm: > > # rpm -qf /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 > gpfs.base-3.5.0-13.x86_64 > > >> Do you mind sending me those py files? They should be 3 as i see e gpfs >> options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) >> > > Only the gpfsops.py is included in the bundle - one for dstat 0.6 and > one for dstat 0.7. > > > I've attached it to this mail as well (it seems to be GPL'd). > Thanks. From J.R.Jones at soton.ac.uk Thu Sep 4 16:15:48 2014 From: J.R.Jones at soton.ac.uk (Jones J.R.) Date: Thu, 4 Sep 2014 15:15:48 +0000 Subject: [gpfsug-discuss] Building the portability layer for Xeon Phi Message-ID: <1409843748.7733.31.camel@uos-204812.clients.soton.ac.uk> Hi folks Has anyone managed to successfully build the portability layer for Xeon Phi? At the moment we are having to export the GPFS mounts from the host machine over NFS, which is proving rather unreliable. 
Jess From oehmes at us.ibm.com Fri Sep 5 01:48:40 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Thu, 4 Sep 2014 17:48:40 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54084258.90508@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> Message-ID: ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM: > From: Salvatore Di Nardo > To: gpfsug main discussion list > Date: 09/04/2014 03:44 AM > Subject: Re: [gpfsug-discuss] gpfs performance monitoring > Sent by: gpfsug-discuss-bounces at gpfsug.org > > On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just > show the value since start (or last reset) time. > OK, you confirmed my toughts, thatks > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. > what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. > No.. what I'm looking its exactly how the disks are busy to keep the > requests. Obviously i'm not looking just that, but I feel the needs > to monitor also those things. Ill explain you why. 
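Since those vio_s numbers are cumulative, per-second figures have to be derived the same way the dstat plugin above does it internally: sample the counters twice and divide the difference by the interval. A minimal sketch of that idea, assuming the same "echo vio_s | mmpmon -r 1" style of invocation shown earlier in the thread and the human-readable label format of the sample output (the labels can differ between releases):

    from __future__ import print_function

    import subprocess
    import time

    MMPMON = '/usr/lpp/mmfs/bin/mmpmon'

    def vio_sample():
        # one vio_s request, same as: echo "vio_s" | mmpmon -r 1
        p = subprocess.Popen([MMPMON, '-r', '1'], stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE, universal_newlines=True)
        out = p.communicate('vio_s\n')[0]
        counters = {}
        for line in out.splitlines():
            # counter lines look like "client reads: 2584229"
            name, _, value = line.rpartition(':')
            if name and value.strip().isdigit():
                counters[name.strip()] = int(value)
        return counters

    first = vio_sample()
    start = time.time()
    time.sleep(10)
    second = vio_sample()
    elapsed = time.time() - start
    for name in sorted(second):
        rate = (second[name] - first.get(name, 0)) / elapsed
        print('%-40s %12.1f ops/sec' % (name, rate))

Run on an NSD server this gives a rough server-side operations rate without touching the clients; it is the same two-sample approach the plugin's extract() method uses with its set1/set2 dictionaries.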
> > It happens when our storage is quite busy ( 180Gb/s of read/write ) > that the FS start to be slowin normal cd or ls requests. This might > be normal, but in those situation i want to know where the > bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing > where the bottlenek is might help me to understand if we can tweak > the system a bit more. if cd or ls is very slow in GPFS in the majority of the cases it has nothing to do with NSD Server bottlenecks, only indirect. the main reason ls is slow in the field is you have some very powerful nodes that all do buffered writes into the same directory into 1 or multiple files while you do the ls on a different node. what happens now is that the ls you did run most likely is a alias for ls -l or something even more complex with color display, etc, but the point is it most likely returns file size. GPFS doesn't lie about the filesize, we only return accurate stat informations and while this is arguable, its a fact today. so what happens is that the stat on each file triggers a token revoke on the node that currently writing to the file you do stat on, lets say it has 1 gb of dirty data in its memory for this file (as its writes data buffered) this 1 GB of data now gets written to the NSD server, the client updates the inode info and returns the correct size. lets say you have very fast network and you have a fast storage device like GSS (which i see you have) it will be able to do this in a few 100 ms, but the problem is this now happens serialized for each single file in this directory that people write into as for each we need to get the exact stat info to satisfy your ls -l request. this is what takes so long, not the fact that the storage device might be slow or to much metadata activity is going on , this is token , means network traffic and obviously latency dependent. the best way to see this is to look at waiters on the client where you run the ls and see what they are waiting for. there are various ways to tune this to get better 'felt' ls responses but its not completely going away if all you try to with ls is if there is a file in the directory run unalias ls and check if ls after that runs fast as it shouldn't do the -l under the cover anymore. > > If its the CPU on the servers then there is no much to do beside > replacing or add more servers.If its not the CPU, maybe more memory > would help? Maybe its just the network that filled up? so i can add > more links > > Or if we reached the point there the bottleneck its the spindles, > then there is no much point o look somethere else, we just reached > the hardware limit.. > > Sometimes, it also happens that there is very low IO (10Gb/s ), > almost no cpu usage on the servers but huge slownes ( ls can take 10 > seconds). Why that happens? There is not much data ops , but we > think there is a huge ammount of metadata ops. So what i want to > know is if the metadata vdisks are busy or not. If this is our > problem, could some SSD disks dedicated to metadata help? the answer if ssd's would help or not are hard to say without knowing the root case and as i tried to explain above the most likely case is token revoke, not disk i/o. obviously as more busy your disks are as longer the token revoke will take. > > > In particular im, a bit puzzled with the design of our GSS storage. > Each recovery groups have 3 declustered arrays, and each declustered > aray have 1 data and 1 metadata vdisk, but in the end both metadata > and data vdisks use the same spindles. 
The problem that, its that I > dont understand if we have a metadata bottleneck there. Maybe some > SSD disks in a dedicated declustered array would perform much > better, but this is just theory. I really would like to be able to > monitor IO activities on the metadata vdisks. the short answer is we WANT the metadata disks to be with the data disks on the same spindles. compared to other storage systems, GSS is capable to handle different raid codes for different virtual disks on the same physical disks, this way we create raid1'ish 'LUNS' for metadata and raid6'is 'LUNS' for data so the small i/o penalty for a metadata is very small compared to a read/modify/write on the data disks. > > > > > so the Layer you care about is the NSD Server layer, which sits on > top of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be > to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats > counts again. > > Counters its not a problem. I can collect them and create some > graphs in a monitoring tool. I will check that. if you (let) upgrade your system to GSS 2.0 you get a graphical monitoring as part of it. if you want i can send you some direct email outside the group with additional informations on that. > > the alternative option is to use mmdiag --iohist. this shows you a > history of the last X numbers of io operations on either the client > or the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms > qTime ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 > 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 > 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 > 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 > 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 
R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 > 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 > 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 > 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 > 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. > > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > > mmdiag --iohist its another think i looked at it, but i could not > find good explanation for all the "buf type" ( third column ) > allocSeg > data > iallocSeg > indBlock > inode > LLIndBlock > logData > logDesc > logWrap > metadata > vdiskAULog > vdiskBuf > vdiskFWLog > vdiskMDLog > vdiskMeta > vdiskRGDesc > If i want to monifor metadata operation whan should i look at? just inodes =inodes , *alloc* = file or data allocation blocks , *ind* = indirect blocks (for very large files) and metadata , everyhing else is data or internal i/o's > the metadata flag or also inode? this command takes also long to > run, especially if i run it a second time it hangs for a lot before > to rerun again, so i'm not sure that run it every 30secs or minute > its viable, but i will look also into that. THere is any > documentation that descibes clearly the whole output? what i found > its quite generic and don't go into details... 
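One way to get a rough summary out of that listing is to tally the "Buf type" column per type, together with the average service time, which makes it easy to see how much of the recent history was inode/metadata traffic versus data blocks. A sketch along those lines, assuming the column layout shown above (the third whitespace-separated field is the buffer type, the sixth is the time in ms); the exact format may differ between GPFS releases:

    from __future__ import print_function

    import subprocess
    from collections import defaultdict

    out = subprocess.Popen(['/usr/lpp/mmfs/bin/mmdiag', '--iohist'],
                           stdout=subprocess.PIPE,
                           universal_newlines=True).communicate()[0]

    count = defaultdict(int)
    total_ms = defaultdict(float)

    for line in out.splitlines():
        fields = line.split()
        # data lines start with a timestamp like 14:25:22.169617
        if len(fields) < 6 or fields[0].count(':') != 2:
            continue
        buftype = fields[2]
        try:
            total_ms[buftype] += float(fields[5])
        except ValueError:
            continue
        count[buftype] += 1

    for buftype in sorted(count):
        print('%-12s %6d I/Os  avg %8.3f ms'
              % (buftype, count[buftype], total_ms[buftype] / count[buftype]))

Because --iohist only keeps the last ioHistorySize entries, this is a snapshot of recent activity rather than a long-term average.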
the reason it takes so long is because it collects 10's of thousands of i/os in a table and to not slow down the system when we dump the data we copy it to a separate buffer so we don't need locks :-) you can adjust the number of entries you want to collect by adjusting the ioHistorySize config parameter > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and > the physical disk, so you can only reliably look at this from the > client, the iohist output only shows you the Server disk i/o > processing time, but that can be a fraction of the overall time (in > other cases this obviously can also be the dominant part depending > on your workload). > > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. > > We already do that, but as I said, I want to check specifically how > gss servers are keeping the requests to identify or exlude server > side bottlenecks. > > > Thanks for your help, you gave me definitely few things where to look at. > > Salvatore > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at us.ibm.com Fri Sep 5 01:53:17 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Thu, 4 Sep 2014 17:53:17 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <5408722E.6060309@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <540869DF.5060100@ebi.ac.uk> <54086F1D.1000401@ed.ac.uk> <5408722E.6060309@ebi.ac.uk> Message-ID: if you don't have the files you need to update to a newer version of the GPFS client software on the node. they started shipping with 3.5.0.13 even you get the files you still wouldn't see many values as they never got exposed before. some more details are in a presentation i gave earlier this year which is archived in the list or here --> http://www.gpfsug.org/wp-content/uploads/2014/05/UG10_GPFS_Performance_Session_v10.pdf Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Salvatore Di Nardo To: gpfsug-discuss at gpfsug.org Date: 09/04/2014 07:08 AM Subject: Re: [gpfsug-discuss] gpfs performance monitoring Sent by: gpfsug-discuss-bounces at gpfsug.org On 04/09/14 14:54, Orlando Richards wrote: On 04/09/14 14:32, Salvatore Di Nardo wrote: Sorry to bother you again but dstat have some issues with the plugin: [root at gss01a util]# dstat --gpfs /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) Module dstat_gpfs failed to load. (global name 'select' is not defined) None of the stats you selected are available. 
I found this solution , but involve dstat recompile.... https://github.com/dagwieers/dstat/issues/44 Are you aware about any easier solution (we use RHEL6.3) ? This worked for me the other day on a dev box I was poking at: # rm /usr/share/dstat/dstat_gpfsops* # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 /usr/share/dstat/dstat_gpfsops.py # dstat --gpfsops /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module. pipes[cmd] = os.popen3(cmd, 't', 0) ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- cr del op/cl rd wr trunc fsync looku gattr sattr other mb_rd mb_wr pref wrbeh steal clean sync revok logwr logda oth_r oth_w 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... NICE!! The only problem is that the box seems lacking those python scripts: ls /usr/lpp/mmfs/samples/util/ makefile README tsbackup tsbackup.C tsbackup.h tsfindinode tsfindinode.c tsgetusage tsgetusage.c tsinode tsinode.c tslistall tsreaddir tsreaddir.c tstimes tstimes.c Do you mind sending me those py files? They should be 3 as i see e gpfs options: gpfs, gpfs-ops, gpfsops (dunno what are the differences ) Regards, Salvatore _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Fri Sep 5 10:29:27 2014 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Fri, 05 Sep 2014 10:29:27 +0100 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <54084258.90508@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> Message-ID: <1409909367.30257.151.camel@buzzard.phy.strath.ac.uk> On Thu, 2014-09-04 at 11:43 +0100, Salvatore Di Nardo wrote: [SNIP] > > Sometimes, it also happens that there is very low IO (10Gb/s ), almost > no cpu usage on the servers but huge slownes ( ls can take 10 > seconds). Why that happens? There is not much data ops , but we think > there is a huge ammount of metadata ops. So what i want to know is if > the metadata vdisks are busy or not. If this is our problem, could > some SSD disks dedicated to metadata help? > This is almost always because you are using an external LDAP/NIS server for GECOS information and the values that you need are not cached for whatever reason and you are having to look them up again. Note that the standard aliasing for RHEL based distros of ls also causes it to do a stat on every file for the colouring etc. Also be aware that if you are trying to fill out your cd with TAB auto-completion you will run into similar issues. That is had you typed the path for the cd out in full you would get in instantly, doing a couple of letters and hitting cd it could take a while. You can test this on a RHEL based distro by doing "/bin/ls -n" The idea being to avoid any aliasing and not look up GECOS data and just report the raw numerical stuff. What I would suggest is that you set the cache time on UID/GID lookups for positive lookups to a long time, in general as long as possible because the values should almost never change. Even for a positive look up of a group membership I would have that cached for a couple of hours. For negative lookups something like five or 10 minutes is a good starting point. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. 
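A quick way to tell whether slow listings are paying for GECOS lookups rather than GPFS itself is to time the same unaliased listing with and without name translation, along the lines of the /bin/ls -n test above. A small sketch; the directory path is only a placeholder:

    from __future__ import print_function

    import os
    import subprocess
    import time

    DIRECTORY = '/gpfs/some/busy/dir'   # placeholder: a directory that lists slowly

    def timed(cmd):
        with open(os.devnull, 'w') as null:
            start = time.time()
            subprocess.call(cmd, stdout=null)
            return time.time() - start

    # -n prints numeric UIDs/GIDs, so no LDAP/NIS name resolution is involved;
    # it also warms the stat cache, so if -l is still much slower afterwards
    # the extra time is being spent on identity lookups, not in GPFS.
    numeric = timed(['/bin/ls', '-n', DIRECTORY])
    named = timed(['/bin/ls', '-l', DIRECTORY])
    print('ls -n: %.2fs   ls -l: %.2fs' % (numeric, named))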
From sdinardo at ebi.ac.uk  Fri Sep  5 11:56:37 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Fri, 05 Sep 2014 11:56:37 +0100
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: 
References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk>
Message-ID: <540996E5.5000502@ebi.ac.uk>

A little clarification: our ls is plain ls, there is no alias. All of those
things are already set up properly; EBI has been running big compute farms for
many years, so they were fixed a long time ago. We have very little experience
with GPFS, but good knowledge of LSF farms, and we run multiple NFS storage
systems (several petabytes in size).

About NIS: all clients run nscd, which caches that information to avoid this
type of slowness, and in fact when ls is slow, ls -n is slow too. Besides that,
a plain "cd" sometimes hangs as well, so it has nothing to do with fetching
attributes.

Just to clarify a bit more: GSS usually works fine. We have users whose farm
jobs push 180Gb/s of reads (reading and writing files of 100GB size), and GPFS
works very well there, where other systems had performance problems accessing
portions of such huge files. Sadly, other users run jobs that generate a huge
amount of metadata operations: tons of ls in directories with many files,
creating a silly number of temporary files just to synchronize jobs between
farm nodes, or storing temporary data for a few milliseconds and then
immediately deleting it. Imagine constantly creating thousands of files just
to write a few bytes, then deleting them a few milliseconds later. When that
happens we see 10-15Gb/s of throughput and low CPU usage on the servers (80%
idle), but any cd or ls or whatever takes a few seconds.

So my question is whether the bottleneck could be the spindles, or whether the
clients could be tuned a bit more. I read your PDF and all the parameters seem
well configured already, except "maxFilesToCache", but I'm not sure how we
should configure a few of those parameters on the clients. As an example, I
cannot imagine a client that requires a 38g pagepool. So what is the correct
*pagepool* on a client? What about these others?

*maxFilesToCache*
*maxBufferDescs*
*worker1Threads*
*worker3Threads*

Right now all the clients have a 1 GB pagepool.
on clienst we have: maxFilesToCache 4000 on servers we have maxFilesToCache 12288 Regards, Salvatore On 05/09/14 01:48, Sven Oehme wrote: > ------------------------------------------ > Sven Oehme > Scalable Storage Research > email: oehmes at us.ibm.com > Phone: +1 (408) 824-8904 > IBM Almaden Research Lab > ------------------------------------------ > > gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM: > > > From: Salvatore Di Nardo > > To: gpfsug main discussion list > > Date: 09/04/2014 03:44 AM > > Subject: Re: [gpfsug-discuss] gpfs performance monitoring > > Sent by: gpfsug-discuss-bounces at gpfsug.org > > > > On 04/09/14 01:50, Sven Oehme wrote: > > > Hello everybody, > > > > Hi > > > > > here i come here again, this time to ask some hint about how to > > monitor GPFS. > > > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > > that they return number based only on the request done in the > > > current host, so i have to run them on all the clients ( over 600 > > > nodes) so its quite unpractical. Instead i would like to know from > > > the servers whats going on, and i came across the vio_s statistics > > > wich are less documented and i dont know exacly what they mean. > > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > > runs VIO_S. > > > > > > My problems with the output of this command: > > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > > timestamp: 1409763206/477366 > > > recovery group: * > > > declustered array: * > > > vdisk: * > > > client reads: 2584229 > > > client short writes: 55299693 > > > client medium writes: 190071 > > > client promoted full track writes: 465145 > > > client full track writes: 9249 > > > flushed update writes: 4187708 > > > flushed promoted full track writes: 123 > > > migrate operations: 114 > > > scrub operations: 450590 > > > log writes: 28509602 > > > > > > it sais "VIOPS per second", but they seem to me just counters as > > > every time i re-run the command, the numbers increase by a bit.. > > > Can anyone confirm if those numbers are counter or if they are > OPS/sec. > > > > the numbers are accumulative so everytime you run them they just > > show the value since start (or last reset) time. > > OK, you confirmed my toughts, thatks > > > > > > > > > On a closer eye about i dont understand what most of thosevalues > > > mean. For example, what exacly are "flushed promoted full track > write" ?? > > > I tried to find a documentation about this output , but could not > > > find any. can anyone point me a link where output of vio_s is > explained? > > > > > > Another thing i dont understand about those numbers is if they are > > > just operations, or the number of blocks that was read/write/etc . > > > > its just operations and if i would explain what the numbers mean i > > might confuse you even more because this is not what you are really > > looking for. > > what you are looking for is what the client io's look like on the > > Server side, while the VIO layer is the Server side to the disks, so > > one lever lower than what you are looking for from what i could read > > out of the description above. > > No.. what I'm looking its exactly how the disks are busy to keep the > > requests. Obviously i'm not looking just that, but I feel the needs > > to monitor also those things. Ill explain you why. 
> > > > It happens when our storage is quite busy ( 180Gb/s of read/write ) > > that the FS start to be slowin normal cd or ls requests. This might > > be normal, but in those situation i want to know where the > > bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing > > where the bottlenek is might help me to understand if we can tweak > > the system a bit more. > > if cd or ls is very slow in GPFS in the majority of the cases it has > nothing to do with NSD Server bottlenecks, only indirect. > the main reason ls is slow in the field is you have some very powerful > nodes that all do buffered writes into the same directory into 1 or > multiple files while you do the ls on a different node. what happens > now is that the ls you did run most likely is a alias for ls -l or > something even more complex with color display, etc, but the point is > it most likely returns file size. GPFS doesn't lie about the filesize, > we only return accurate stat informations and while this is arguable, > its a fact today. > so what happens is that the stat on each file triggers a token revoke > on the node that currently writing to the file you do stat on, lets > say it has 1 gb of dirty data in its memory for this file (as its > writes data buffered) this 1 GB of data now gets written to the NSD > server, the client updates the inode info and returns the correct size. > lets say you have very fast network and you have a fast storage device > like GSS (which i see you have) it will be able to do this in a few > 100 ms, but the problem is this now happens serialized for each single > file in this directory that people write into as for each we need to > get the exact stat info to satisfy your ls -l request. > this is what takes so long, not the fact that the storage device might > be slow or to much metadata activity is going on , this is token , > means network traffic and obviously latency dependent. > > the best way to see this is to look at waiters on the client where you > run the ls and see what they are waiting for. > > there are various ways to tune this to get better 'felt' ls responses > but its not completely going away > if all you try to with ls is if there is a file in the directory run > unalias ls and check if ls after that runs fast as it shouldn't do the > -l under the cover anymore. > > > > > If its the CPU on the servers then there is no much to do beside > > replacing or add more servers.If its not the CPU, maybe more memory > > would help? Maybe its just the network that filled up? so i can add > > more links > > > > Or if we reached the point there the bottleneck its the spindles, > > then there is no much point o look somethere else, we just reached > > the hardware limit.. > > > > Sometimes, it also happens that there is very low IO (10Gb/s ), > > almost no cpu usage on the servers but huge slownes ( ls can take 10 > > seconds). Why that happens? There is not much data ops , but we > > think there is a huge ammount of metadata ops. So what i want to > > know is if the metadata vdisks are busy or not. If this is our > > problem, could some SSD disks dedicated to metadata help? > > the answer if ssd's would help or not are hard to say without knowing > the root case and as i tried to explain above the most likely case is > token revoke, not disk i/o. obviously as more busy your disks are as > longer the token revoke will take. > > > > > > > In particular im, a bit puzzled with the design of our GSS storage. 
> > Each recovery groups have 3 declustered arrays, and each declustered > > aray have 1 data and 1 metadata vdisk, but in the end both metadata > > and data vdisks use the same spindles. The problem that, its that I > > dont understand if we have a metadata bottleneck there. Maybe some > > SSD disks in a dedicated declustered array would perform much > > better, but this is just theory. I really would like to be able to > > monitor IO activities on the metadata vdisks. > > the short answer is we WANT the metadata disks to be with the data > disks on the same spindles. compared to other storage systems, GSS is > capable to handle different raid codes for different virtual disks on > the same physical disks, this way we create raid1'ish 'LUNS' for > metadata and raid6'is 'LUNS' for data so the small i/o penalty for a > metadata is very small compared to a read/modify/write on the data disks. > > > > > > > > > > > > so the Layer you care about is the NSD Server layer, which sits on > > top of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > > > I'm asking that because if they are just ops, i don't know how much > > > they could be usefull. For example one write operation could eman > > > write 1 block or write a file of 100GB. If those are oprations, > > > there is a way to have the oupunt in bytes or blocks? > > > > there are multiple ways to get infos on the NSD layer, one would be > > to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats > > counts again. > > > > Counters its not a problem. I can collect them and create some > > graphs in a monitoring tool. I will check that. > > if you (let) upgrade your system to GSS 2.0 you get a graphical > monitoring as part of it. if you want i can send you some direct email > outside the group with additional informations on that. > > > > > the alternative option is to use mmdiag --iohist. 
this shows you a > > history of the last X numbers of io operations on either the client > > or the server side like on a client : > > > > # mmdiag --iohist > > > > === mmdiag: iohist === > > > > I/O history: > > > > I/O start time RW Buf type disk:sectorNum nSec time ms > > qTime ms RpcTimes ms Type Device/NSD ID NSD server > > --------------- -- ----------- ----------------- ----- ------- > > -------- ----------------- ---- ------------------ --------------- > > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:22.182723 R inode 1:1071252480 8 6.970 > > 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.668262 R inode 2:1081373696 8 14.117 > > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.692019 R inode 2:1064356608 8 14.899 > > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:53.707100 R inode 2:1077830152 8 16.499 > > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:53.728082 R inode 2:1081918976 8 7.760 > > 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.877416 R metadata 2:678978560 16 13.343 > > 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.906556 R inode 2:1083476520 8 11.723 > > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.926592 R inode 1:1076503480 8 8.087 > > 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.941441 R inode 2:1069885984 8 11.686 > > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.953294 R inode 2:1083476936 8 8.951 > > 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.965475 R inode 1:1076503504 8 0.477 > > 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > > 14:25:57.965755 R inode 2:1083476488 8 0.410 > > 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > > 14:25:57.965787 R inode 2:1083476512 8 0.439 > > 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > > > you basically see if its a inode , data block , what size it has (in > > sectors) , which nsd server you did send this request to, etc. 
> > > > on the Server side you see the type , which physical disk it goes to > > and also what size of disk i/o it causes like : > > > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > > 0.000 0.000 0.000 pd sdis > > 14:26:50.137102 R inode 19:3003969520 64 9.004 > > 0.000 0.000 0.000 pd sdad > > 14:26:50.136116 R inode 55:3591710992 64 11.057 > > 0.000 0.000 0.000 pd sdoh > > 14:26:50.141510 R inode 21:3066810504 64 5.909 > > 0.000 0.000 0.000 pd sdaf > > 14:26:50.130529 R inode 89:2962370072 64 17.437 > > 0.000 0.000 0.000 pd sddi > > 14:26:50.131063 R inode 78:1889457000 64 17.062 > > 0.000 0.000 0.000 pd sdsj > > 14:26:50.143403 R inode 36:3323035688 64 4.807 > > 0.000 0.000 0.000 pd sdmw > > 14:26:50.131044 R inode 37:2513579736 128 17.181 > > 0.000 0.000 0.000 pd sddv > > 14:26:50.138181 R inode 72:3868810400 64 10.951 > > 0.000 0.000 0.000 pd sdbz > > 14:26:50.138188 R inode 131:2443484784 128 11.792 > > 0.000 0.000 0.000 pd sdug > > 14:26:50.138003 R inode 102:3696843872 64 11.994 > > 0.000 0.000 0.000 pd sdgp > > 14:26:50.137099 R inode 145:3370922504 64 13.225 > > 0.000 0.000 0.000 pd sdmi > > 14:26:50.141576 R inode 62:2668579904 64 9.313 > > 0.000 0.000 0.000 pd sdou > > 14:26:50.134689 R inode 159:2786164648 64 16.577 > > 0.000 0.000 0.000 pd sdpq > > 14:26:50.145034 R inode 34:2097217320 64 7.409 > > 0.000 0.000 0.000 pd sdmt > > 14:26:50.138140 R inode 139:2831038792 64 14.898 > > 0.000 0.000 0.000 pd sdlw > > 14:26:50.130954 R inode 164:282120312 64 22.274 > > 0.000 0.000 0.000 pd sdzd > > 14:26:50.137038 R inode 41:3421909608 64 16.314 > > 0.000 0.000 0.000 pd sdef > > 14:26:50.137606 R inode 104:1870962416 64 16.644 > > 0.000 0.000 0.000 pd sdgx > > 14:26:50.141306 R inode 65:2276184264 64 16.593 > > 0.000 0.000 0.000 pd sdrk > > > > > > > mmdiag --iohist its another think i looked at it, but i could not > > find good explanation for all the "buf type" ( third column ) > > > allocSeg > > data > > iallocSeg > > indBlock > > inode > > LLIndBlock > > logData > > logDesc > > logWrap > > metadata > > vdiskAULog > > vdiskBuf > > vdiskFWLog > > vdiskMDLog > > vdiskMeta > > vdiskRGDesc > > If i want to monifor metadata operation whan should i look at? just > > inodes =inodes , *alloc* = file or data allocation blocks , *ind* = > indirect blocks (for very large files) and metadata , everyhing else > is data or internal i/o's > > > the metadata flag or also inode? this command takes also long to > > run, especially if i run it a second time it hangs for a lot before > > to rerun again, so i'm not sure that run it every 30secs or minute > > its viable, but i will look also into that. THere is any > > documentation that descibes clearly the whole output? what i found > > its quite generic and don't go into details... > > the reason it takes so long is because it collects 10's of thousands > of i/os in a table and to not slow down the system when we dump the > data we copy it to a separate buffer so we don't need locks :-) > you can adjust the number of entries you want to collect by adjusting > the ioHistorySize config parameter > > > > > > > > Last but not least.. and this is what i really would like to > > > accomplish, i would to be able to monitor the latency of metadata > > operations. 
> > > > you can't do this on the server side as you don't know how much time > > you spend on the client , network or anything between the app and > > the physical disk, so you can only reliably look at this from the > > client, the iohist output only shows you the Server disk i/o > > processing time, but that can be a fraction of the overall time (in > > other cases this obviously can also be the dominant part depending > > on your workload). > > > > the easiest way on the client is to run > > > > mmfsadm vfsstats enable > > from now on vfs stats are collected until you restart GPFS. > > > > then run : > > > > vfs statistics currently enabled > > started at: Fri Aug 29 13:15:05.380 2014 > > duration: 448446.970 sec > > > > name calls time per call total time > > -------------------- -------- -------------- -------------- > > statfs 9 0.000002 0.000021 > > startIO 246191176 0.005853 1441049.976740 > > > > to dump what ever you collected so far on this node. > > > > > We already do that, but as I said, I want to check specifically how > > gss servers are keeping the requests to identify or exlude server > > side bottlenecks. > > > > > > Thanks for your help, you gave me definitely few things where to > look at. > > > > Salvatore > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at gpfsug.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Fri Sep 5 22:17:47 2014 From: chekh at stanford.edu (Alex Chekholko) Date: Fri, 05 Sep 2014 14:17:47 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540996E5.5000502@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> <540996E5.5000502@ebi.ac.uk> Message-ID: <540A287B.1050202@stanford.edu> On 9/5/14, 3:56 AM, Salvatore Di Nardo wrote: > Little clarification: > Our ls its plain ls, there is no alias. ... > Last question about "maxFIlesToCache" you say that must be large on > small cluster but small on large clusters. What do you consider 6 > servers and almost 700 clients? > > on clienst we have: > maxFilesToCache 4000 > > on servers we have > maxFilesToCache 12288 > > One thing to do is to try your 'ls', see it is slow, then immediately run it again. If it is fast the second and consecutive times, it's because now the stat info is coming out of local cache. e.g. /usr/bin/time ls /path/to/some/dir && /usr/bin/time ls /path/to/some/dir The second time is likely to be almost immediate. So long as your local cache is big enough. I see on one of our older clusters we have: tokenMemLimit 2G maxFilesToCache 40000 maxStatCache 80000 You can also interrogate the local cache to see how full it is. Of course, if many nodes are writing to same dirs, then the cache will need to be invalidated often which causes some overhead. Big local cache is good if clients are usually working in different directories. 
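The same cold-versus-warm comparison can be made without ls at all by stat()ing every entry in a directory twice; if the second pass is dramatically faster, the first pass was paying for stat/token traffic that afterwards comes out of the local cache (maxFilesToCache and maxStatCache permitting). A rough sketch, with the path again a placeholder:

    from __future__ import print_function

    import os
    import time

    DIRECTORY = '/gpfs/some/busy/dir'   # placeholder

    def stat_all(path):
        start = time.time()
        names = os.listdir(path)
        for name in names:
            try:
                os.lstat(os.path.join(path, name))
            except OSError:
                pass
        return len(names), time.time() - start

    n, cold = stat_all(DIRECTORY)   # may need remote stats / token revokes
    n, warm = stat_all(DIRECTORY)   # should be served from the local cache
    print('%d entries: cold %.2fs, warm %.2fs' % (n, cold, warm))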
Regards, -- chekh at stanford.edu From oehmes at us.ibm.com Sat Sep 6 01:12:42 2014 From: oehmes at us.ibm.com (Sven Oehme) Date: Fri, 5 Sep 2014 17:12:42 -0700 Subject: [gpfsug-discuss] gpfs performance monitoring In-Reply-To: <540996E5.5000502@ebi.ac.uk> References: <54074F90.7000303@ebi.ac.uk> <54084258.90508@ebi.ac.uk> <540996E5.5000502@ebi.ac.uk> Message-ID: on your GSS nodes you have tuning files we suggest customers to use for mixed workloads clients. the files in /usr/lpp/mmfs/samples/gss/ if you create a nodeclass for all your clients you can run /usr/lpp/mmfs/samples/gss/gssClientConfig.sh NODECLASS and it applies all the settings to them so they will be active on next restart of the gpfs daemon. this should be a very good starting point for your config. please try that and let me know if it doesn't. there are also several enhancements in GPFS 4.1 which reduce contention in multiple areas, which would help as well, if you have the choice to update the nodes. btw. the GSS 2.0 package will update your GSS nodes to 4.1 also Sven ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ From: Salvatore Di Nardo To: gpfsug main discussion list Date: 09/05/2014 03:57 AM Subject: Re: [gpfsug-discuss] gpfs performance monitoring Sent by: gpfsug-discuss-bounces at gpfsug.org Little clarification: Our ls its plain ls, there is no alias. Consider that all those things are already set up properly as EBI run hi computing farms from many years, so those things are already fixed loong time ago. We have very little experience with GPFS, but good knowledge with LSF farms and own multiple NFS stotages ( several petabyte sized). about NIS, all clients run NSCD that cashes all informations to avoid such tipe of slownes, in fact then ls isslow, also ls -n is slow. Beside that, also a "cd" sometimes hangs, so it have nothing to do with getting attributes. Just to clarify a bit more. Now GSS usually seems working fine, we have users that run jobs on the farms that pushes 180Gb/s read ( reading and writing files of 100GB size). GPFS works very well there, where other systems had performance problems accessing portion of data in so huge files. Sadly, on the other hand, other users run jobs that do suge ammount of metadata operations, like toons of ls in directory with many files, or creating a silly amount of temporary files just to synchronize the jobs between the farm nodes, or just to store temporary data for few milliseconds and them immediately delete those temporary files. Imagine to create constantly thousands files just to write few bytes and they delete them after few milliseconds... When those thing happens we see 10-15Gb/sec throughput, low CPU usage on the server ( 80% iddle), but any cd, or ls or wathever takes few seconds. So my question is, if the bottleneck could be the spindles, or if the clients could be tuned a bit more? I read your PDF and all the paramenters seems already well configured except "maxFilesToCache", but I'm not sure how we should configure few of those parameters on the clients. As an example I cannot immagine a client that require 38g pagepool size. so what's the correct pagepool on a client? what about those others? maxFilestoCache maxBufferdescs worker1threads worker3threads Right now all the clients have 1 GB pagepool size. 
In theory, we can afford to use more ( i thing we can easily go up to 8GB) as they have plenty or available memory. If this could help, we can do that, but the client really really need more than 1G? They are just clients after all, so the memory in theory should be used for jobs not just for "caching". Last question about "maxFIlesToCache" you say that must be large on small cluster but small on large clusters. What do you consider 6 servers and almost 700 clients? on clienst we have: maxFilesToCache 4000 on servers we have maxFilesToCache 12288 Regards, Salvatore On 05/09/14 01:48, Sven Oehme wrote: ------------------------------------------ Sven Oehme Scalable Storage Research email: oehmes at us.ibm.com Phone: +1 (408) 824-8904 IBM Almaden Research Lab ------------------------------------------ gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM: > From: Salvatore Di Nardo > To: gpfsug main discussion list > Date: 09/04/2014 03:44 AM > Subject: Re: [gpfsug-discuss] gpfs performance monitoring > Sent by: gpfsug-discuss-bounces at gpfsug.org > > On 04/09/14 01:50, Sven Oehme wrote: > > Hello everybody, > > Hi > > > here i come here again, this time to ask some hint about how to > monitor GPFS. > > > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is > > that they return number based only on the request done in the > > current host, so i have to run them on all the clients ( over 600 > > nodes) so its quite unpractical. Instead i would like to know from > > the servers whats going on, and i came across the vio_s statistics > > wich are less documented and i dont know exacly what they mean. > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that > > runs VIO_S. > > > > My problems with the output of this command: > > echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1 > > > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second > > timestamp: 1409763206/477366 > > recovery group: * > > declustered array: * > > vdisk: * > > client reads: 2584229 > > client short writes: 55299693 > > client medium writes: 190071 > > client promoted full track writes: 465145 > > client full track writes: 9249 > > flushed update writes: 4187708 > > flushed promoted full track writes: 123 > > migrate operations: 114 > > scrub operations: 450590 > > log writes: 28509602 > > > > it sais "VIOPS per second", but they seem to me just counters as > > every time i re-run the command, the numbers increase by a bit.. > > Can anyone confirm if those numbers are counter or if they are OPS/sec. > > the numbers are accumulative so everytime you run them they just > show the value since start (or last reset) time. > OK, you confirmed my toughts, thatks > > > > > On a closer eye about i dont understand what most of thosevalues > > mean. For example, what exacly are "flushed promoted full track write" ?? > > I tried to find a documentation about this output , but could not > > find any. can anyone point me a link where output of vio_s is explained? > > > > Another thing i dont understand about those numbers is if they are > > just operations, or the number of blocks that was read/write/etc . > > its just operations and if i would explain what the numbers mean i > might confuse you even more because this is not what you are really > looking for. 
> what you are looking for is what the client io's look like on the > Server side, while the VIO layer is the Server side to the disks, so > one lever lower than what you are looking for from what i could read > out of the description above. > No.. what I'm looking its exactly how the disks are busy to keep the > requests. Obviously i'm not looking just that, but I feel the needs > to monitor also those things. Ill explain you why. > > It happens when our storage is quite busy ( 180Gb/s of read/write ) > that the FS start to be slowin normal cd or ls requests. This might > be normal, but in those situation i want to know where the > bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing > where the bottlenek is might help me to understand if we can tweak > the system a bit more. if cd or ls is very slow in GPFS in the majority of the cases it has nothing to do with NSD Server bottlenecks, only indirect. the main reason ls is slow in the field is you have some very powerful nodes that all do buffered writes into the same directory into 1 or multiple files while you do the ls on a different node. what happens now is that the ls you did run most likely is a alias for ls -l or something even more complex with color display, etc, but the point is it most likely returns file size. GPFS doesn't lie about the filesize, we only return accurate stat informations and while this is arguable, its a fact today. so what happens is that the stat on each file triggers a token revoke on the node that currently writing to the file you do stat on, lets say it has 1 gb of dirty data in its memory for this file (as its writes data buffered) this 1 GB of data now gets written to the NSD server, the client updates the inode info and returns the correct size. lets say you have very fast network and you have a fast storage device like GSS (which i see you have) it will be able to do this in a few 100 ms, but the problem is this now happens serialized for each single file in this directory that people write into as for each we need to get the exact stat info to satisfy your ls -l request. this is what takes so long, not the fact that the storage device might be slow or to much metadata activity is going on , this is token , means network traffic and obviously latency dependent. the best way to see this is to look at waiters on the client where you run the ls and see what they are waiting for. there are various ways to tune this to get better 'felt' ls responses but its not completely going away if all you try to with ls is if there is a file in the directory run unalias ls and check if ls after that runs fast as it shouldn't do the -l under the cover anymore. > > If its the CPU on the servers then there is no much to do beside > replacing or add more servers.If its not the CPU, maybe more memory > would help? Maybe its just the network that filled up? so i can add > more links > > Or if we reached the point there the bottleneck its the spindles, > then there is no much point o look somethere else, we just reached > the hardware limit.. > > Sometimes, it also happens that there is very low IO (10Gb/s ), > almost no cpu usage on the servers but huge slownes ( ls can take 10 > seconds). Why that happens? There is not much data ops , but we > think there is a huge ammount of metadata ops. So what i want to > know is if the metadata vdisks are busy or not. If this is our > problem, could some SSD disks dedicated to metadata help? 
the answer if ssd's would help or not are hard to say without knowing the root case and as i tried to explain above the most likely case is token revoke, not disk i/o. obviously as more busy your disks are as longer the token revoke will take. > > > In particular im, a bit puzzled with the design of our GSS storage. > Each recovery groups have 3 declustered arrays, and each declustered > aray have 1 data and 1 metadata vdisk, but in the end both metadata > and data vdisks use the same spindles. The problem that, its that I > dont understand if we have a metadata bottleneck there. Maybe some > SSD disks in a dedicated declustered array would perform much > better, but this is just theory. I really would like to be able to > monitor IO activities on the metadata vdisks. the short answer is we WANT the metadata disks to be with the data disks on the same spindles. compared to other storage systems, GSS is capable to handle different raid codes for different virtual disks on the same physical disks, this way we create raid1'ish 'LUNS' for metadata and raid6'is 'LUNS' for data so the small i/o penalty for a metadata is very small compared to a read/modify/write on the data disks. > > > > > so the Layer you care about is the NSD Server layer, which sits on > top of the VIO layer (which is essentially the SW RAID Layer in GNR) > > > I'm asking that because if they are just ops, i don't know how much > > they could be usefull. For example one write operation could eman > > write 1 block or write a file of 100GB. If those are oprations, > > there is a way to have the oupunt in bytes or blocks? > > there are multiple ways to get infos on the NSD layer, one would be > to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats > counts again. > > Counters its not a problem. I can collect them and create some > graphs in a monitoring tool. I will check that. if you (let) upgrade your system to GSS 2.0 you get a graphical monitoring as part of it. if you want i can send you some direct email outside the group with additional informations on that. > > the alternative option is to use mmdiag --iohist. 
this shows you a > history of the last X numbers of io operations on either the client > or the server side like on a client : > > # mmdiag --iohist > > === mmdiag: iohist === > > I/O history: > > I/O start time RW Buf type disk:sectorNum nSec time ms > qTime ms RpcTimes ms Type Device/NSD ID NSD server > --------------- -- ----------- ----------------- ----- ------- > -------- ----------------- ---- ------------------ --------------- > 14:25:22.169617 R LLIndBlock 1:1075622848 64 13.073 > 0.000 12.959 0.063 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:22.182723 R inode 1:1071252480 8 6.970 > 0.000 6.908 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.659918 R LLIndBlock 1:1081202176 64 8.309 > 0.000 8.210 0.046 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.668262 R inode 2:1081373696 8 14.117 > 0.000 14.032 0.058 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.682750 R LLIndBlock 1:1065508736 64 9.254 > 0.000 9.180 0.038 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.692019 R inode 2:1064356608 8 14.899 > 0.000 14.847 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.707100 R inode 2:1077830152 8 16.499 > 0.000 16.449 0.025 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:53.723788 R LLIndBlock 1:1081202432 64 4.280 > 0.000 4.203 0.040 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:53.728082 R inode 2:1081918976 8 7.760 > 0.000 7.710 0.027 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.877416 R metadata 2:678978560 16 13.343 > 0.000 13.254 0.053 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.891048 R LLIndBlock 1:1065508608 64 15.491 > 0.000 15.401 0.058 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.906556 R inode 2:1083476520 8 11.723 > 0.000 11.676 0.029 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.918516 R LLIndBlock 1:1075622720 64 8.062 > 0.000 8.001 0.032 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.926592 R inode 1:1076503480 8 8.087 > 0.000 8.043 0.026 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.934856 R LLIndBlock 1:1071088512 64 6.572 > 0.000 6.510 0.033 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.941441 R inode 2:1069885984 8 11.686 > 0.000 11.641 0.024 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.953294 R inode 2:1083476936 8 8.951 > 0.000 8.912 0.021 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965475 R inode 1:1076503504 8 0.477 > 0.000 0.053 0.000 cli C0A70401:53BEEA7F 192.167.4.1 > 14:25:57.965755 R inode 2:1083476488 8 0.410 > 0.000 0.061 0.321 cli C0A70402:53BEEA5E 192.167.4.2 > 14:25:57.965787 R inode 2:1083476512 8 0.439 > 0.000 0.053 0.342 cli C0A70402:53BEEA5E 192.167.4.2 > > you basically see if its a inode , data block , what size it has (in > sectors) , which nsd server you did send this request to, etc. 
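To get a quick feel for which buffer types dominate the recent history and how long they take (inode and metadata entries versus plain data), the history can be aggregated per type with a bit of awk. A rough sketch based on the column layout of the sample above; the positions are an assumption and may differ between releases:

  # Sketch: summarize mmdiag --iohist per buffer type (count, average and
  # worst service time). Column positions follow the sample output above.
  /usr/lpp/mmfs/bin/mmdiag --iohist | awk '
      $2 ~ /^[RW]$/ && $6 ~ /^[0-9.]+$/ {     # data rows: $3 = buf type, $6 = time in ms
          n[$3]++; t[$3] += $6
          if ($6 > max[$3]) max[$3] = $6
      }
      END {
          printf "%-14s %8s %10s %10s\n", "buf type", "count", "avg ms", "max ms"
          for (b in n) printf "%-14s %8d %10.3f %10.3f\n", b, n[b], t[b]/n[b], max[b]
      }'

Run on an NSD server, the same aggregation summarizes the physical disk service times shown in the server-side example below.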
> > on the Server side you see the type , which physical disk it goes to > and also what size of disk i/o it causes like : > > 14:26:50.129995 R inode 12:3211886376 64 14.261 > 0.000 0.000 0.000 pd sdis > 14:26:50.137102 R inode 19:3003969520 64 9.004 > 0.000 0.000 0.000 pd sdad > 14:26:50.136116 R inode 55:3591710992 64 11.057 > 0.000 0.000 0.000 pd sdoh > 14:26:50.141510 R inode 21:3066810504 64 5.909 > 0.000 0.000 0.000 pd sdaf > 14:26:50.130529 R inode 89:2962370072 64 17.437 > 0.000 0.000 0.000 pd sddi > 14:26:50.131063 R inode 78:1889457000 64 17.062 > 0.000 0.000 0.000 pd sdsj > 14:26:50.143403 R inode 36:3323035688 64 4.807 > 0.000 0.000 0.000 pd sdmw > 14:26:50.131044 R inode 37:2513579736 128 17.181 > 0.000 0.000 0.000 pd sddv > 14:26:50.138181 R inode 72:3868810400 64 10.951 > 0.000 0.000 0.000 pd sdbz > 14:26:50.138188 R inode 131:2443484784 128 11.792 > 0.000 0.000 0.000 pd sdug > 14:26:50.138003 R inode 102:3696843872 64 11.994 > 0.000 0.000 0.000 pd sdgp > 14:26:50.137099 R inode 145:3370922504 64 13.225 > 0.000 0.000 0.000 pd sdmi > 14:26:50.141576 R inode 62:2668579904 64 9.313 > 0.000 0.000 0.000 pd sdou > 14:26:50.134689 R inode 159:2786164648 64 16.577 > 0.000 0.000 0.000 pd sdpq > 14:26:50.145034 R inode 34:2097217320 64 7.409 > 0.000 0.000 0.000 pd sdmt > 14:26:50.138140 R inode 139:2831038792 64 14.898 > 0.000 0.000 0.000 pd sdlw > 14:26:50.130954 R inode 164:282120312 64 22.274 > 0.000 0.000 0.000 pd sdzd > 14:26:50.137038 R inode 41:3421909608 64 16.314 > 0.000 0.000 0.000 pd sdef > 14:26:50.137606 R inode 104:1870962416 64 16.644 > 0.000 0.000 0.000 pd sdgx > 14:26:50.141306 R inode 65:2276184264 64 16.593 > 0.000 0.000 0.000 pd sdrk > > > mmdiag --iohist its another think i looked at it, but i could not > find good explanation for all the "buf type" ( third column ) > allocSeg > data > iallocSeg > indBlock > inode > LLIndBlock > logData > logDesc > logWrap > metadata > vdiskAULog > vdiskBuf > vdiskFWLog > vdiskMDLog > vdiskMeta > vdiskRGDesc > If i want to monifor metadata operation whan should i look at? just inodes =inodes , *alloc* = file or data allocation blocks , *ind* = indirect blocks (for very large files) and metadata , everyhing else is data or internal i/o's > the metadata flag or also inode? this command takes also long to > run, especially if i run it a second time it hangs for a lot before > to rerun again, so i'm not sure that run it every 30secs or minute > its viable, but i will look also into that. THere is any > documentation that descibes clearly the whole output? what i found > its quite generic and don't go into details... the reason it takes so long is because it collects 10's of thousands of i/os in a table and to not slow down the system when we dump the data we copy it to a separate buffer so we don't need locks :-) you can adjust the number of entries you want to collect by adjusting the ioHistorySize config parameter > > > > Last but not least.. and this is what i really would like to > > accomplish, i would to be able to monitor the latency of metadata > operations. > > you can't do this on the server side as you don't know how much time > you spend on the client , network or anything between the app and > the physical disk, so you can only reliably look at this from the > client, the iohist output only shows you the Server disk i/o > processing time, but that can be a fraction of the overall time (in > other cases this obviously can also be the dominant part depending > on your workload). 
> > the easiest way on the client is to run > > mmfsadm vfsstats enable > from now on vfs stats are collected until you restart GPFS. > > then run : > > vfs statistics currently enabled > started at: Fri Aug 29 13:15:05.380 2014 > duration: 448446.970 sec > > name calls time per call total time > -------------------- -------- -------------- -------------- > statfs 9 0.000002 0.000021 > startIO 246191176 0.005853 1441049.976740 > > to dump what ever you collected so far on this node. > > We already do that, but as I said, I want to check specifically how > gss servers are keeping the requests to identify or exlude server > side bottlenecks. > > > Thanks for your help, you gave me definitely few things where to look at. > > Salvatore > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From luke.raimbach at oerc.ox.ac.uk Tue Sep 9 11:23:47 2014 From: luke.raimbach at oerc.ox.ac.uk (Luke Raimbach) Date: Tue, 9 Sep 2014 10:23:47 +0000 Subject: [gpfsug-discuss] mmdiag output questions Message-ID: Hi All, When tracing a problem recently (which turned out to be a NIC failure), mmdiag proved useful in tracing broken cluster connections. I have some questions about the output of mmdiag using the --network switch: Occasionally I see nodes in the same cluster grouped, but in no readily identifiable way - for example, the following output has three headings "Device bon0:" with some nodes listed, but the nodes don't seem to share anything in common like status, err, ostype, etc. Also, is anyone able to explain what might be seen under the err column? Do these correspond to GPFS error codes as one might see in mmfs.log.latest? What is the sock column displaying - the number of open sockets or the socket state? Lastly, the sent/recvd columns seem very low. Is there a rolling time window within which these statistics are kept in some internal mmfsd buffer? Cheers. 
=== mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.51/22 (eth0) my addr list 10.200.21.1/16 (bond0)/cpdn.oerc.local 10.100.10.51/22 (eth0) my node number 9 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs01 10.200.1.1 connected 0 32 110 110 Linux/L gpfs02 10.200.2.1 connected 0 36 104 104 Linux/L linux 10.200.101.1 connected 0 37 0 0 Linux/L jupiter 10.200.102.1 connected 0 35 0 0 Windows/L cnfs0 10.200.10.10 connected 0 39 0 0 Linux/L cnfs1 10.200.10.11 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cnfs2 10.200.10.12 connected 0 33 5 5 Linux/L cnfs3 10.200.10.13 init 0 -1 0 0 Linux/L cpdn-ppc02 10.200.61.1 init 0 -1 0 0 Linux/L cpdn-ppc03 10.200.62.1 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cpdn-ppc01 10.200.60.1 connected 0 38 0 0 Linux/L diag verbs: VERBS RDMA class not initialized Conversely, the output of mmdiag --network on the file system manager node for the same cluster looks like this: === mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.21/22 (eth0) my addr list 10.200.1.1/16 (bond0)/cpdn.oerc.local 10.100.10.21/22 (eth0) my node number 1 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs02 10.200.2.1 connected 0 73 219 219 Linux/L linux 10.200.101.1 connected 0 49 180 181 Linux/L jupiter 10.200.102.1 connected 0 33 3 3 Windows/L cnfs0 10.200.10.10 connected 0 61 3 3 Linux/L cnfs1 10.200.10.11 connected 0 81 0 0 Linux/L cnfs2 10.200.10.12 connected 0 64 23 23 Linux/L cnfs3 10.200.10.13 connected 0 60 2 2 Linux/L tsm01 10.200.21.1 connected 0 50 110 110 Linux/L cpdn-ppc02 10.200.61.1 connected 0 63 0 0 Linux/L cpdn-ppc03 10.200.62.1 connected 0 65 0 0 Linux/L cpdn-ppc01 10.200.60.1 connected 0 62 94 94 Linux/L diag verbs: VERBS RDMA class not initialized All neatly connected! -- Luke Raimbach IT Manager Oxford e-Research Centre 7 Keble Road, Oxford, OX1 3QG +44(0)1865 610639 From chair at gpfsug.org Wed Sep 10 15:33:24 2014 From: chair at gpfsug.org (Jez Tucker (Chair)) Date: Wed, 10 Sep 2014 15:33:24 +0100 Subject: [gpfsug-discuss] GPFS Request for Enhancements Message-ID: <54106134.7010902@gpfsug.org> Hi all Just a quick reminder that the RFEs that you all gave feedback at the last UG on are live on IBM's RFE site: goo.gl/1K6LBa Please take the time to have a look and add your votes to the GPFS RFEs. Jez -------------- next part -------------- An HTML attachment was scrubbed... URL: From dmetcalfe at ocf.co.uk Thu Sep 11 21:18:58 2014 From: dmetcalfe at ocf.co.uk (Daniel Metcalfe) Date: Thu, 11 Sep 2014 21:18:58 +0100 Subject: [gpfsug-discuss] mmdiag output questions In-Reply-To: References: Message-ID: Hi Luke, I've seen the same apparent grouping of nodes, I don't believe the nodes are actually being grouped but instead the "Device Bond0:" and column headers are being re-printed to screen whenever there is a node that has the "init" status followed by a node that is "connected". It is something I've noticed on many different versions of GPFS so I imagine it's a "feature". I've not noticed anything but '0' in the err column so I'm not sure if these correspond to error codes in the GPFS logs. 
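A simple way to take the repeated headers out of the picture when checking for broken connections is to filter the data rows directly; a rough sketch against the column layout in the output above (the layout is an assumption and may change between GPFS releases):

  # Sketch: list every node that is not in the "connected" state.
  /usr/lpp/mmfs/bin/mmdiag --network | awk '
      NF == 8 && $8 ~ /\// && $3 != "connected" {   # data rows end in an ostype such as Linux/L
          printf "%-20s %-16s status=%-10s err=%s\n", $1, $2, $3, $4
      }'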
If you run the command "mmfsadm dump tscomm", you'll see a bit more detail than the mmdiag -network shows. This suggests the sock column is number of sockets. I've seen the low numbers to for sent / recv using mmdiag --network, again the mmfsadm command above gives a better representation I've found. All that being said, if you want to get in touch with us then we'll happily open a PMR for you and find out the answer to any of your questions. Kind regards, Danny Metcalfe Systems Engineer OCF plc Tel: 0114 257 2200 [cid:image001.jpg at 01CFCE04.575B8380] Twitter Fax: 0114 257 0022 [cid:image002.jpg at 01CFCE04.575B8380] Blog Mob: 07960 503404 [cid:image003.jpg at 01CFCE04.575B8380] Web Please note, any emails relating to an OCF Support request must always be sent to support at ocf.co.uk for a ticket number to be generated or existing support ticket to be updated. Should this not be done then OCF cannot be held responsible for requests not dealt with in a timely manner. OCF plc is a company registered in England and Wales. Registered number 4132533. Registered office address: OCF plc, 5 Rotunda Business Centre, Thorncliffe Park, Chapeltown, Sheffield, S35 2PG This message is private and confidential. If you have received this message in error, please notify us immediately and remove it from your system. -----Original Message----- From: gpfsug-discuss-bounces at gpfsug.org [mailto:gpfsug-discuss-bounces at gpfsug.org] On Behalf Of Luke Raimbach Sent: 09 September 2014 11:24 To: gpfsug-discuss at gpfsug.org Subject: [gpfsug-discuss] mmdiag output questions Hi All, When tracing a problem recently (which turned out to be a NIC failure), mmdiag proved useful in tracing broken cluster connections. I have some questions about the output of mmdiag using the --network switch: Occasionally I see nodes in the same cluster grouped, but in no readily identifiable way - for example, the following output has three headings "Device bon0:" with some nodes listed, but the nodes don't seem to share anything in common like status, err, ostype, etc. Also, is anyone able to explain what might be seen under the err column? Do these correspond to GPFS error codes as one might see in mmfs.log.latest? What is the sock column displaying - the number of open sockets or the socket state? Lastly, the sent/recvd columns seem very low. Is there a rolling time window within which these statistics are kept in some internal mmfsd buffer? Cheers. 
=== mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.51/22 (eth0) my addr list 10.200.21.1/16 (bond0)/cpdn.oerc.local 10.100.10.51/22 (eth0) my node number 9 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs01 10.200.1.1 connected 0 32 110 110 Linux/L gpfs02 10.200.2.1 connected 0 36 104 104 Linux/L linux 10.200.101.1 connected 0 37 0 0 Linux/L jupiter 10.200.102.1 connected 0 35 0 0 Windows/L cnfs0 10.200.10.10 connected 0 39 0 0 Linux/L cnfs1 10.200.10.11 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cnfs2 10.200.10.12 connected 0 33 5 5 Linux/L cnfs3 10.200.10.13 init 0 -1 0 0 Linux/L cpdn-ppc02 10.200.61.1 init 0 -1 0 0 Linux/L cpdn-ppc03 10.200.62.1 init 0 -1 0 0 Linux/L Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype cpdn-ppc01 10.200.60.1 connected 0 38 0 0 Linux/L diag verbs: VERBS RDMA class not initialized Conversely, the output of mmdiag --network on the file system manager node for the same cluster looks like this: === mmdiag: network === Pending messages: (none) Inter-node communication configuration: tscTcpPort 1191 my address 10.100.10.21/22 (eth0) my addr list 10.200.1.1/16 (bond0)/cpdn.oerc.local 10.100.10.21/22 (eth0) my node number 1 TCP Connections between nodes: Device bond0: hostname node destination status err sock sent(MB) recvd(MB) ostype gpfs02 10.200.2.1 connected 0 73 219 219 Linux/L linux 10.200.101.1 connected 0 49 180 181 Linux/L jupiter 10.200.102.1 connected 0 33 3 3 Windows/L cnfs0 10.200.10.10 connected 0 61 3 3 Linux/L cnfs1 10.200.10.11 connected 0 81 0 0 Linux/L cnfs2 10.200.10.12 connected 0 64 23 23 Linux/L cnfs3 10.200.10.13 connected 0 60 2 2 Linux/L tsm01 10.200.21.1 connected 0 50 110 110 Linux/L cpdn-ppc02 10.200.61.1 connected 0 63 0 0 Linux/L cpdn-ppc03 10.200.62.1 connected 0 65 0 0 Linux/L cpdn-ppc01 10.200.60.1 connected 0 62 94 94 Linux/L diag verbs: VERBS RDMA class not initialized All neatly connected! -- Luke Raimbach IT Manager Oxford e-Research Centre 7 Keble Road, Oxford, OX1 3QG +44(0)1865 610639 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ----- No virus found in this message. Checked by AVG - www.avg.com Version: 2014.0.4765 / Virus Database: 4015/8158 - Release Date: 09/05/14 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 4696 bytes Desc: image001.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 4725 bytes Desc: image002.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.jpg Type: image/jpeg Size: 4820 bytes Desc: image003.jpg URL: From stuartb at 4gh.net Tue Sep 23 16:47:09 2014 From: stuartb at 4gh.net (Stuart Barkley) Date: Tue, 23 Sep 2014 11:47:09 -0400 (EDT) Subject: [gpfsug-discuss] filesets and mountpoint naming Message-ID: When we first started using GPFS we created several filesystems and just directly mounted them where seemed appropriate. 
We have something like: /home /scratch /projects /reference /applications We are finding the overhead of separate filesystems to be troublesome and are looking at using filesets inside fewer filesystems to accomplish our goals (we will probably keep /home separate for now). We can put symbolic links in place to provide the same user experience, but I'm looking for suggestions as to where to mount the actual gpfs filesystems. We have multiple compute clusters with multiple gpfs systems, one cluster has a traditional gpfs system and a separate gss system which will obviously need multiple mount points. We also want to consider possible future cross cluster mounts. Some thoughts are to just do filesystems as: /gpfs01, /gpfs02, etc. /mnt/gpfs01, etc /mnt/clustera/gpfs01, etc. What have other people done? Are you happy with it? What would you do differently? Thanks, Stuart -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone From sabujp at gmail.com Thu Sep 25 13:39:14 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 25 Sep 2014 07:39:14 -0500 Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs -u in bin/mmnfsfuncs unexportAll and unexportFS Message-ID: Hi all, We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip failover times > 4.5mins . It looks like it's being caused by all the exportfs -u calls being made in the unexportAll and the unexportFS function in bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the exported directories? We're running only NFSv3 and have lots of exports and for security reasons can't have one giant NFS export. That may be a possibility with GPFS4.1 and NFSv4 but we won't be migrating to that anytime soon. Assume the network went down for the cnfs server or the system panicked/crashed, what would be the purpose of exportfs -u be in that case, so what's the purpose at all? Thanks, Sabuj -------------- next part -------------- An HTML attachment was scrubbed... URL: From sabujp at gmail.com Thu Sep 25 14:11:18 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 25 Sep 2014 08:11:18 -0500 Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs -u in bin/mmnfsfuncs unexportAll and unexportFS In-Reply-To: References: Message-ID: our support engineer suggests adding & to the end of the exportfs -u lines in the mmnfsfunc script, which is a good workaround, can this be added to future gpfs 3.5 and 4.1 rels (haven't even looked at 4.1 yet). I was looking at the unexportfs all in nfs-utils/exportfs.c and it looks like the limiting factor there would be all the hostname lookups? I don't see what exportfs -u is doing other than doing slow reverse lookups and removing the export from the nfs stack. On Thu, Sep 25, 2014 at 7:39 AM, Sabuj Pattanayek wrote: > Hi all, > > We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip failover > times > 4.5mins . It looks like it's being caused by all the exportfs -u > calls being made in the unexportAll and the unexportFS function in > bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the > exported directories? We're running only NFSv3 and have lots of exports and > for security reasons can't have one giant NFS export. That may be a > possibility with GPFS4.1 and NFSv4 but we won't be migrating to that > anytime soon. 
> > Assume the network went down for the cnfs server or the system > panicked/crashed, what would be the purpose of exportfs -u be in that case, > so what's the purpose at all? > > Thanks, > Sabuj > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sabujp at gmail.com Thu Sep 25 14:15:19 2014 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 25 Sep 2014 08:15:19 -0500 Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs -u in bin/mmnfsfuncs unexportAll and unexportFS In-Reply-To: References: Message-ID: yes, it's doing a getaddrinfo() call for every hostname that's a fqdn and not an ip addr, which we have lots of in our export entries since sometimes clients update their dns (ip's). On Thu, Sep 25, 2014 at 8:11 AM, Sabuj Pattanayek wrote: > our support engineer suggests adding & to the end of the exportfs -u lines > in the mmnfsfunc script, which is a good workaround, can this be added to > future gpfs 3.5 and 4.1 rels (haven't even looked at 4.1 yet). I was > looking at the unexportfs all in nfs-utils/exportfs.c and it looks like the > limiting factor there would be all the hostname lookups? I don't see what > exportfs -u is doing other than doing slow reverse lookups and removing the > export from the nfs stack. > > On Thu, Sep 25, 2014 at 7:39 AM, Sabuj Pattanayek > wrote: > >> Hi all, >> >> We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip >> failover times > 4.5mins . It looks like it's being caused by all the >> exportfs -u calls being made in the unexportAll and the unexportFS function >> in bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the >> exported directories? We're running only NFSv3 and have lots of exports and >> for security reasons can't have one giant NFS export. That may be a >> possibility with GPFS4.1 and NFSv4 but we won't be migrating to that >> anytime soon. >> >> Assume the network went down for the cnfs server or the system >> panicked/crashed, what would be the purpose of exportfs -u be in that case, >> so what's the purpose at all? >> >> Thanks, >> Sabuj >> >> >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: