From S.J.Thompson at bham.ac.uk  Mon Sep  1 20:44:45 2014
From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services))
Date: Mon, 1 Sep 2014 19:44:45 +0000
Subject: [gpfsug-discuss] GPFS admin host name vs subnets
Message-ID: <CF45EE16DEF2FE4B9AA7FF2B6EE26545074A0E09@EX5.adf.bham.ac.uk>

I was just reading through the docs at:

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/GPFS+Network+Communication+Overview

And was wondering about using admin host name bs using subnets. My reading of the page is that if say I have a 1GbE network and a 40GbE network, I could have an admin host name on the 1GbE network. But equally from the docs, it looks like I could also use subnets to achieve the same whilst allowing the admin network to be a fall back for data if necessary.

For example, create the cluster using the primary name on the 1GbE network, then use the subnets property to use set the network on the 40GbE network as the first and the network on the 1GbE network as the second in the list, thus GPFS data will pass over the 40GbE network in preference and the 1GbE network will, by default only be used for admin traffic as the admin host name will just be the name of the host on the 1GbE network.

Is my reading of the docs correct? Or do I really want to be creating the cluster using the 40GbE network hostnames and set the admin node name to the name of the 1GbE network interface?

(there's actually also an FDR switch in there somewhere for verbs as well)

Thanks

Simon

From ewahl at osc.edu  Tue Sep  2 14:44:29 2014
From: ewahl at osc.edu (Ed Wahl)
Date: Tue, 2 Sep 2014 13:44:29 +0000
Subject: [gpfsug-discuss] GPFS admin host name vs subnets
In-Reply-To: <CF45EE16DEF2FE4B9AA7FF2B6EE26545074A0E09@EX5.adf.bham.ac.uk>
References: <CF45EE16DEF2FE4B9AA7FF2B6EE26545074A0E09@EX5.adf.bham.ac.uk>
Message-ID: <C59E5201836F7147BAD35189FFBB35D101164D7385@USOAPP09V04P.si.lan>

Seems like you are on the correct track.  This is similar to my setup.   subnett'ed daemon 10GbE, 1GbE with main being QDR RDMA,   admin 1GbE.   To my mind the most important part is  Setting "privateSubnetOverride" to 1. This allows both your 1GbE and your 40GbE to be on a private subnet.  Serving block over public IPs just seems wrong on SO many levels. Whether truly private/internal or not.  And how many people use public IPs internally? Wait, maybe I don't want to know...

   Using 'verbsRdma enable' for your FDR seems to override Daemon node name for block, at least in my experience.  I love the fallback to 10GbE and then 1GbE in case of disaster when using IB.  Lately we seem to be generating bugs in OpenSM at a frightening rate so that has been _extremely_ helpful. Now if we could just monitor when it happens more easily than running mmfsadm test verbs conn, say by logging a failure of RDMA?


Ed
OSC

________________________________________
From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Simon Thompson (Research Computing - IT Services) [S.J.Thompson at bham.ac.uk]
Sent: Monday, September 01, 2014 3:44 PM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] GPFS admin host name vs subnets

I was just reading through the docs at:

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/GPFS+Network+Communication+Overview

And was wondering about using admin host name bs using subnets. My reading of the page is that if say I have a 1GbE network and a 40GbE network, I could have an admin host name on the 1GbE network. But equally from the docs, it looks like I could also use subnets to achieve the same whilst allowing the admin network to be a fall back for data if necessary.

For example, create the cluster using the primary name on the 1GbE network, then use the subnets property to use set the network on the 40GbE network as the first and the network on the 1GbE network as the second in the list, thus GPFS data will pass over the 40GbE network in preference and the 1GbE network will, by default only be used for admin traffic as the admin host name will just be the name of the host on the 1GbE network.

Is my reading of the docs correct? Or do I really want to be creating the cluster using the 40GbE network hostnames and set the admin node name to the name of the 1GbE network interface?

(there's actually also an FDR switch in there somewhere for verbs as well)

Thanks

Simon
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


From oehmes at gmail.com  Tue Sep  2 15:11:03 2014
From: oehmes at gmail.com (Sven Oehme)
Date: Tue, 2 Sep 2014 07:11:03 -0700
Subject: [gpfsug-discuss] GPFS admin host name vs subnets
In-Reply-To: <C59E5201836F7147BAD35189FFBB35D101164D7385@USOAPP09V04P.si.lan>
References: <CF45EE16DEF2FE4B9AA7FF2B6EE26545074A0E09@EX5.adf.bham.ac.uk>
	<C59E5201836F7147BAD35189FFBB35D101164D7385@USOAPP09V04P.si.lan>
Message-ID: <CALssuR2R=Pv+sm3w_uem3_EP3_9=j-2EmXRHMjvKBLXgp1O6Jg@mail.gmail.com>

Ed,

if you enable RDMA, GPFS will always use this as preferred data transfer.
if you have subnets configured, GPFS will prefer this for communication
with higher priority as the default interface.
so the order is RDMA , subnets, default.
if RDMA will fail for whatever reason we will use the subnets defined
interface and if that fails as well we will use the default interface.

the easiest way to see what is used is to run mmdiag --network (only avail
on more recent versions of GPFS) it will tell you if RDMA is enabled
between individual nodes as well as if a subnet connection is used or not :

[root at client05 ~]# mmdiag --network

=== mmdiag: network ===

Pending messages:
  (none)
Inter-node communication configuration:
  tscTcpPort      1191
  my address      192.167.13.5/16 (eth0) <c0n6>
  my addr list    192.1.13.5/16 (ib1)  192.0.13.5/16 (ib0)/
client04.clientad.almaden.ibm.com  192.167.13.5/16 (eth0)
  my node number  17
TCP Connections between nodes:
  Device ib0:
    hostname                            node     destination     status
err  sock  sent(MB)  recvd(MB)  ostype
    client04n1                           <c1n0>   192.0.4.1       connected
 0    69    0         37         Linux/L
    client04n2                           <c1n1>   192.0.4.2       connected
 0    70    0         37         Linux/L
    client04n3                           <c1n2>   192.0.4.3       connected
 0    68    0         0          Linux/L
  Device ib1:
    hostname                            node     destination     status
err  sock  sent(MB)  recvd(MB)  ostype
    clientcl21                           <c0n0>   192.1.201.21    connected
 0    65    0         0          Linux/L
    clientcl25                           <c0n3>   192.1.201.25    connected
 0    66    0         0          Linux/L
    clientcl26                           <c0n4>   192.1.201.26    connected
 0    67    0         0          Linux/L
    clientcl21                           <c1n3>   192.1.201.21    connected
 0    71    0         0          Linux/L
    clientcl22                           <c1n4>   192.1.201.22    connected
 0    63    0         0          Linux/L
    client10                            <c1n5>   192.1.13.10     connected
 0    73    0         0          Linux/L
    client08                            <c1n7>   192.1.13.8      connected
 0    72    0         0          Linux/L
RDMA Connections between nodes:
  Fabric 1 - Device mlx4_0 Port 1 Width 4x Speed FDR lid 13
    hostname                            idx CM state VS buff RDMA_CT(ERR)
RDMA_RCV_MB RDMA_SND_MB VS_CT(ERR) VS_SND_MB VS_RCV_MB WAIT_CON_SLOT
WAIT_NODE_SLOT
    clientcl21                           0   N  RTS   (Y)903  0      (0  )
0           0           192  (0  ) 0         0         0             0
    client04n1                           0   N  RTS   (Y)477  0      (0  )
0           0           12367404(0  ) 107905    594       0             0
    client04n1                           1   N  RTS   (Y)477  0      (0  )
0           0           12367404(0  ) 107901    593       0             0
    client04n2                           0   N  RTS   (Y)477  0      (0  )
0           0           12371352(0  ) 107911    594       0             0
    client04n2                           2   N  RTS   (Y)477  0      (0  )
0           0           12371352(0  ) 107902    594       0             0
    clientcl21                           0   N  RTS   (Y)880  0      (0  )
0           0           11   (0  ) 0         0         0             0
    client04n3                           0   N  RTS   (Y)969  0      (0  )
0           0           5    (0  ) 0         0         0             0
    clientcl26                           0   N  RTS   (Y)702  0      (0  )
0           0           35   (0  ) 0         0         0             0
    client08                            0   N  RTS   (Y)637  0      (0  ) 0
          0           16   (0  ) 0         0         0             0
    clientcl25                           0   N  RTS   (Y)574  0      (0  )
0           0           14   (0  ) 0         0         0             0
    clientcl22                           0   N  RTS   (Y)507  0      (0  )
0           0           2    (0  ) 0         0         0             0
    client10                            0   N  RTS   (Y)568  0      (0  ) 0
          0           121  (0  ) 0         0         0             0
  Fabric 2 - Device mlx4_0 Port 2 Width 4x Speed FDR lid 65
    hostname                            idx CM state VS buff RDMA_CT(ERR)
RDMA_RCV_MB RDMA_SND_MB VS_CT(ERR) VS_SND_MB VS_RCV_MB WAIT_CON_SLOT
WAIT_NODE_SLOT
    clientcl21                           1   N  RTS   (Y)904  0      (0  )
0           0           192  (0  ) 0         0         0             0
    client04n1                           2   N  RTS   (Y)477  0      (0  )
0           0           12367404(0  ) 107897    593       0             0
    client04n2                           1   N  RTS   (Y)477  0      (0  )
0           0           12371352(0  ) 107903    594       0             0
    clientcl21                           1   N  RTS   (Y)881  0      (0  )
0           0           10   (0  ) 0         0         0             0
    clientcl26                           1   N  RTS   (Y)701  0      (0  )
0           0           35   (0  ) 0         0         0             0
    client08                            1   N  RTS   (Y)637  0      (0  ) 0
          0           16   (0  ) 0         0         0             0
    clientcl25                           1   N  RTS   (Y)574  0      (0  )
0           0           14   (0  ) 0         0         0             0
    clientcl22                           1   N  RTS   (Y)507  0      (0  )
0           0           2    (0  ) 0         0         0             0
    client10                            1   N  RTS   (Y)568  0      (0  ) 0
          0           121  (0  ) 0         0         0             0

in this example you can see thet my client (client05) has multiple subnets
configured as well as RDMA.
so to connected to the various TCP devices (ib0 and ib1) to different
cluster nodes and also has a RDMA connection to a different set of nodes.
as you can see there is basically no traffic on the TCP devices, as all the
traffic uses the 2 defined RDMA fabrics.
there is not a single connection using the daemon interface (eth0) as all
nodes are either connected via subnets or via RDMA.

hope this helps. Sven


On Tue, Sep 2, 2014 at 6:44 AM, Ed Wahl <ewahl at osc.edu> wrote:

> Seems like you are on the correct track.  This is similar to my setup.
>  subnett'ed daemon 10GbE, 1GbE with main being QDR RDMA,   admin 1GbE.   To
> my mind the most important part is  Setting "privateSubnetOverride" to 1.
> This allows both your 1GbE and your 40GbE to be on a private subnet.
> Serving block over public IPs just seems wrong on SO many levels. Whether
> truly private/internal or not.  And how many people use public IPs
> internally? Wait, maybe I don't want to know...
>
>    Using 'verbsRdma enable' for your FDR seems to override Daemon node
> name for block, at least in my experience.  I love the fallback to 10GbE
> and then 1GbE in case of disaster when using IB.  Lately we seem to be
> generating bugs in OpenSM at a frightening rate so that has been
> _extremely_ helpful. Now if we could just monitor when it happens more
> easily than running mmfsadm test verbs conn, say by logging a failure of
> RDMA?
>
>
> Ed
> OSC
>
> ________________________________________
> From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org]
> on behalf of Simon Thompson (Research Computing - IT Services) [
> S.J.Thompson at bham.ac.uk]
> Sent: Monday, September 01, 2014 3:44 PM
> To: gpfsug main discussion list
> Subject: [gpfsug-discuss] GPFS admin host name vs subnets
>
> I was just reading through the docs at:
>
>
> https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/GPFS+Network+Communication+Overview
>
> And was wondering about using admin host name bs using subnets. My reading
> of the page is that if say I have a 1GbE network and a 40GbE network, I
> could have an admin host name on the 1GbE network. But equally from the
> docs, it looks like I could also use subnets to achieve the same whilst
> allowing the admin network to be a fall back for data if necessary.
>
> For example, create the cluster using the primary name on the 1GbE
> network, then use the subnets property to use set the network on the 40GbE
> network as the first and the network on the 1GbE network as the second in
> the list, thus GPFS data will pass over the 40GbE network in preference and
> the 1GbE network will, by default only be used for admin traffic as the
> admin host name will just be the name of the host on the 1GbE network.
>
> Is my reading of the docs correct? Or do I really want to be creating the
> cluster using the 40GbE network hostnames and set the admin node name to
> the name of the 1GbE network interface?
>
> (there's actually also an FDR switch in there somewhere for verbs as well)
>
> Thanks
>
> Simon
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140902/16c2aa99/attachment.htm>

From sdinardo at ebi.ac.uk  Wed Sep  3 18:27:44 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Wed, 03 Sep 2014 18:27:44 +0100
Subject: [gpfsug-discuss] gpfs performance monitoring
Message-ID: <54074F90.7000303@ebi.ac.uk>

Hello everybody,
here i come here again, this time to ask some hint about how to monitor 
GPFS.

I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is that 
they return number based only on the request done in the current host, 
so i have to run them on all the clients ( over 600 nodes) so its quite 
unpractical.  Instead i would like to know from the servers whats going 
on, and i came across the vio_s statistics wich are less documented and 
i dont know exacly what they mean. There is also this script 
"/usr/lpp/mmfs/samples/vdisk/viostat" that runs VIO_S.

My problems with the output of this command:

          echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1

        mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second
        timestamp: 1409763206/477366
        recovery group:                     *
        declustered array:                  *
        vdisk:                              *
        client reads: 2584229
        client short writes: 55299693
        client medium writes: 190071
        client promoted full track writes: 465145
        client full track writes: 9249
        flushed update writes: 4187708
        flushed promoted full track writes: 123
        migrate operations: 114
        scrub operations: 450590
        log writes: 28509602


it sais "VIOPS per second", but they seem to me just counters as every 
time i re-run the command, the numbers increase by a bit..
Can anyone confirm if those numbers are counter or if they are OPS/sec.

On a closer eye about i dont understand what most of thosevalues mean. 
For example, what exacly are "flushed promoted full track write" ??
I tried to find a documentation about this output , but could not find 
any. can anyone point me a link where output of vio_s is explained?

Another thing i dont understand about those numbers is if they are just 
operations, or the number of blocks that was read/write/etc . I'm asking 
that because if they are just ops, i don't know how much they could be 
usefull. For example one write operation could eman write 1 block or 
write a file of 100GB. If those are oprations, there is a way to have 
the oupunt in bytes or blocks?

Last but not least.. and this is what i really would like to accomplish, 
i would to be able to monitor the latency of metadata operations.
In my environment there are users that litterally overhelm our storages 
with metadata request, so even if there is no massive throughput or huge 
waiters, any "ls" could take ages. I would like to be able to monitor 
metadata behaviour. There is a way to to do that from the NSD servers?

Thanks in advance for any tip/help.

Regards,
Salvatore
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140903/d96e6643/attachment.htm>

From chekh at stanford.edu  Wed Sep  3 21:55:14 2014
From: chekh at stanford.edu (Alex Chekholko)
Date: Wed, 03 Sep 2014 13:55:14 -0700
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <54074F90.7000303@ebi.ac.uk>
References: <54074F90.7000303@ebi.ac.uk>
Message-ID: <54078032.2050605@stanford.edu>

The usual way to do that is to re-architect your filesystem so that the 
system pool is metadata-only, and then you can just look at the storage 
layer and see total metadata throughput that way.  Otherwise your 
metadata ops are mixed in with your data ops.

Of course, both NSDs and clients also have metadata caches.

On 09/03/2014 10:27 AM, Salvatore Di Nardo wrote:
>
> Last but not least.. and this is what i really would like to accomplish,
> i would to be able to monitor the latency of metadata operations.
> In my environment there are users that litterally overhelm our storages
> with metadata request, so even if there is no massive throughput or huge
> waiters, any "ls" could take ages. I would like to be able to monitor
> metadata behaviour. There is a way to to do that from the NSD servers?

-- 
Alex Chekholko chekh at stanford.edu 347-401-4860


From oehmes at us.ibm.com  Thu Sep  4 01:50:25 2014
From: oehmes at us.ibm.com (Sven Oehme)
Date: Wed, 3 Sep 2014 17:50:25 -0700
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <54074F90.7000303@ebi.ac.uk>
References: <54074F90.7000303@ebi.ac.uk>
Message-ID: <OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>

> Hello everybody,

Hi

> here i come here again, this time to ask some hint about how to monitor 
GPFS.
> 
> I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is 
> that they return number based only on the request done in the 
> current host, so i have to run them on all the clients ( over 600 
> nodes) so its quite unpractical.  Instead i would like to know from 
> the servers whats going on, and i came across the vio_s statistics 
> wich are less documented and i dont know exacly what they mean. 
> There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that
> runs VIO_S.
> 
> My problems with the output of this command:
>  echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1
> 
> mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second
> timestamp:                          1409763206/477366
> recovery group:                     *
> declustered array:                  *
> vdisk:                              *
> client reads:                          2584229
> client short writes:                  55299693
> client medium writes:                   190071
> client promoted full track writes:      465145
> client full track writes:                 9249
> flushed update writes:                 4187708
> flushed promoted full track writes:        123
> migrate operations:                        114
> scrub operations:                       450590
> log writes:                           28509602
> 
> it sais "VIOPS per second", but they seem to me just counters as 
> every time i re-run the command, the numbers increase by a bit..  
> Can anyone confirm if those numbers are counter or if they are OPS/sec.

the numbers are accumulative so everytime you run them they just show the 
value since start (or last reset) time.

> 
> On a closer eye about i dont understand what most of thosevalues 
> mean. For example, what exacly are "flushed promoted full track write" 
?? 
> I tried to find a documentation about this output , but could not 
> find any. can anyone point me a link where output of vio_s is explained?
> 
> Another thing i dont understand about those numbers is if they are 
> just operations, or the number of blocks that was read/write/etc . 

its just operations and if i would explain what the numbers mean i might 
confuse you even more because this is not what you are really looking for. 

what you are looking for is what the client io's look like on the Server 
side, while the VIO layer is the Server side to the disks, so one lever 
lower than what you are looking for from what i could read out of the 
description above. 

so the Layer you care about is the NSD Server layer, which sits on top of 
the VIO layer (which is essentially the SW RAID Layer in GNR) 

> I'm asking that because if they are just ops, i don't know how much 
> they could be usefull. For example one write operation could eman 
> write 1 block or write a file of 100GB. If those are oprations, 
> there is a way to have the oupunt in bytes or blocks?

there are multiple ways to get infos on the NSD layer, one would be to use 
the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts again. 

the alternative option is to use mmdiag --iohist. this shows you a history 
of the last X numbers of io operations on either the client or the server 
side like on a client : 

# mmdiag --iohist

=== mmdiag: iohist ===

I/O history:

 I/O start time RW    Buf type disk:sectorNum     nSec  time ms qTime ms   
 RpcTimes ms  Type  Device/NSD ID         NSD server
--------------- -- ----------- ----------------- -----  ------- -------- 
-----------------  ---- ------------------ ---------------
14:25:22.169617  R  LLIndBlock    1:1075622848      64   13.073    0.000 
12.959    0.063  cli   C0A70401:53BEEA7F     192.167.4.1
14:25:22.182723  R       inode    1:1071252480       8    6.970    0.000  
6.908    0.038  cli   C0A70401:53BEEA7F     192.167.4.1
14:25:53.659918  R  LLIndBlock    1:1081202176      64    8.309    0.000  
8.210    0.046  cli   C0A70401:53BEEA7F     192.167.4.1
14:25:53.668262  R       inode    2:1081373696       8   14.117    0.000 
14.032    0.058  cli   C0A70402:53BEEA5E     192.167.4.2
14:25:53.682750  R  LLIndBlock    1:1065508736      64    9.254    0.000  
9.180    0.038  cli   C0A70401:53BEEA7F     192.167.4.1
14:25:53.692019  R       inode    2:1064356608       8   14.899    0.000 
14.847    0.029  cli   C0A70402:53BEEA5E     192.167.4.2
14:25:53.707100  R       inode    2:1077830152       8   16.499    0.000 
16.449    0.025  cli   C0A70402:53BEEA5E     192.167.4.2
14:25:53.723788  R  LLIndBlock    1:1081202432      64    4.280    0.000  
4.203    0.040  cli   C0A70401:53BEEA7F     192.167.4.1
14:25:53.728082  R       inode    2:1081918976       8    7.760    0.000  
7.710    0.027  cli   C0A70402:53BEEA5E     192.167.4.2
14:25:57.877416  R    metadata    2:678978560       16   13.343    0.000 
13.254    0.053  cli   C0A70402:53BEEA5E     192.167.4.2
14:25:57.891048  R  LLIndBlock    1:1065508608      64   15.491    0.000 
15.401    0.058  cli   C0A70401:53BEEA7F     192.167.4.1
14:25:57.906556  R       inode    2:1083476520       8   11.723    0.000 
11.676    0.029  cli   C0A70402:53BEEA5E     192.167.4.2
14:25:57.918516  R  LLIndBlock    1:1075622720      64    8.062    0.000  
8.001    0.032  cli   C0A70401:53BEEA7F     192.167.4.1
14:25:57.926592  R       inode    1:1076503480       8    8.087    0.000  
8.043    0.026  cli   C0A70401:53BEEA7F     192.167.4.1
14:25:57.934856  R  LLIndBlock    1:1071088512      64    6.572    0.000  
6.510    0.033  cli   C0A70401:53BEEA7F     192.167.4.1
14:25:57.941441  R       inode    2:1069885984       8   11.686    0.000 
11.641    0.024  cli   C0A70402:53BEEA5E     192.167.4.2
14:25:57.953294  R       inode    2:1083476936       8    8.951    0.000  
8.912    0.021  cli   C0A70402:53BEEA5E     192.167.4.2
14:25:57.965475  R       inode    1:1076503504       8    0.477    0.000  
0.053    0.000  cli   C0A70401:53BEEA7F     192.167.4.1
14:25:57.965755  R       inode    2:1083476488       8    0.410    0.000  
0.061    0.321  cli   C0A70402:53BEEA5E     192.167.4.2
14:25:57.965787  R       inode    2:1083476512       8    0.439    0.000  
0.053    0.342  cli   C0A70402:53BEEA5E     192.167.4.2

you basically see if its a inode , data block , what size it has (in 
sectors) , which nsd server you did send this request to, etc. 

on the Server side you see the type , which physical disk it goes to and 
also what size of disk i/o it causes like : 

14:26:50.129995  R       inode   12:3211886376      64   14.261    0.000  
0.000    0.000  pd   sdis
14:26:50.137102  R       inode   19:3003969520      64    9.004    0.000  
0.000    0.000  pd   sdad
14:26:50.136116  R       inode   55:3591710992      64   11.057    0.000  
0.000    0.000  pd   sdoh
14:26:50.141510  R       inode   21:3066810504      64    5.909    0.000  
0.000    0.000  pd   sdaf
14:26:50.130529  R       inode   89:2962370072      64   17.437    0.000  
0.000    0.000  pd   sddi
14:26:50.131063  R       inode   78:1889457000      64   17.062    0.000  
0.000    0.000  pd   sdsj
14:26:50.143403  R       inode   36:3323035688      64    4.807    0.000  
0.000    0.000  pd   sdmw
14:26:50.131044  R       inode   37:2513579736     128   17.181    0.000  
0.000    0.000  pd   sddv
14:26:50.138181  R       inode   72:3868810400      64   10.951    0.000  
0.000    0.000  pd   sdbz
14:26:50.138188  R       inode  131:2443484784     128   11.792    0.000  
0.000    0.000  pd   sdug
14:26:50.138003  R       inode  102:3696843872      64   11.994    0.000  
0.000    0.000  pd   sdgp
14:26:50.137099  R       inode  145:3370922504      64   13.225    0.000  
0.000    0.000  pd   sdmi
14:26:50.141576  R       inode   62:2668579904      64    9.313    0.000  
0.000    0.000  pd   sdou
14:26:50.134689  R       inode  159:2786164648      64   16.577    0.000  
0.000    0.000  pd   sdpq
14:26:50.145034  R       inode   34:2097217320      64    7.409    0.000  
0.000    0.000  pd   sdmt
14:26:50.138140  R       inode  139:2831038792      64   14.898    0.000  
0.000    0.000  pd   sdlw
14:26:50.130954  R       inode  164:282120312       64   22.274    0.000  
0.000    0.000  pd   sdzd
14:26:50.137038  R       inode   41:3421909608      64   16.314    0.000  
0.000    0.000  pd   sdef
14:26:50.137606  R       inode  104:1870962416      64   16.644    0.000  
0.000    0.000  pd   sdgx
14:26:50.141306  R       inode   65:2276184264      64   16.593    0.000  
0.000    0.000  pd   sdrk


> 
> Last but not least.. and this is what i really would like to 
> accomplish, i would to be able to monitor the latency of metadata 
operations. 

you can't do this on the server side as you don't know how much time you 
spend on the client , network or anything between the app and the physical 
disk, so you can only reliably look at this from the client, the iohist 
output only shows you the Server disk i/o processing time, but that can be 
a fraction of the overall time (in other cases this obviously can also be 
the dominant part depending on your workload).

the easiest way on the client is to run 

mmfsadm vfsstats enable
from now on vfs stats are collected until you restart GPFS. 

then run :

vfs statistics currently enabled
started at: Fri Aug 29 13:15:05.380 2014
  duration: 448446.970 sec

 name                    calls  time per call     total time
 -------------------- -------- -------------- --------------
 statfs                      9       0.000002       0.000021
 startIO              246191176       0.005853 1441049.976740

to dump what ever you collected so far on this node. 

> In my environment there are users that litterally overhelm our 
> storages with metadata request, so even if there is no massive 
> throughput or huge waiters, any "ls" could take ages. I would like 
> to be able to monitor metadata behaviour. There is a way to to do 
> that from the NSD servers?

not this simple as described above. 

> 
> Thanks in advance for any tip/help.
> 
> Regards,
> Salvatore_______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140903/d9a18474/attachment.htm>

From service at metamodul.com  Thu Sep  4 11:05:18 2014
From: service at metamodul.com (service at metamodul.com)
Date: Thu, 4 Sep 2014 12:05:18 +0200 (CEST)
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <54074F90.7000303@ebi.ac.uk>
References: <54074F90.7000303@ebi.ac.uk>
Message-ID: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de>

> , any "ls" could take ages.

Check if you large directories either with many files or simply large.
Verify if you have NFS exported GPFS.
Verify that your cache settings on the clients are large enough ( maxStatCache ,
maxFilesToCache , sharedMemLimit )
Verify that you have dedicated metadata luns ( metadataOnly )

Reference:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters

Note:
If possible monitor your metadata luns on the storage directly.

hth
Hajo


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140904/4e0aaadb/attachment.htm>

From sdinardo at ebi.ac.uk  Thu Sep  4 11:43:36 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Thu, 04 Sep 2014 11:43:36 +0100
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>
References: <54074F90.7000303@ebi.ac.uk>
	<OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>
Message-ID: <54084258.90508@ebi.ac.uk>


On 04/09/14 01:50, Sven Oehme wrote:
> > Hello everybody,
>
> Hi
>
> > here i come here again, this time to ask some hint about how to 
> monitor GPFS.
> >
> > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is
> > that they return number based only on the request done in the
> > current host, so i have to run them on all the clients ( over 600
> > nodes) so its quite unpractical.  Instead i would like to know from
> > the servers whats going on, and i came across the vio_s statistics
> > wich are less documented and i dont know exacly what they mean.
> > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that
> > runs VIO_S.
> >
> > My problems with the output of this command:
> >  echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1
> >
> > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second
> > timestamp: 1409763206/477366
> > recovery group: *
> > declustered array: *
> > vdisk: *
> > client reads: 2584229
> > client short writes: 55299693
> > client medium writes: 190071
> > client promoted full track writes:      465145
> > client full track writes: 9249
> > flushed update writes: 4187708
> > flushed promoted full track writes: 123
> > migrate operations: 114
> > scrub operations: 450590
> > log writes: 28509602
> >
> > it sais "VIOPS per second", but they seem to me just counters as
> > every time i re-run the command, the numbers increase by a bit..
> > Can anyone confirm if those numbers are counter or if they are OPS/sec.
>
> the numbers are accumulative so everytime you run them they just show 
> the value since start (or last reset) time.
OK, you confirmed my toughts, thatks

>
> >
> > On a closer eye about i dont understand what most of thosevalues
> > mean. For example, what exacly are "flushed promoted full track 
> write" ??
> > I tried to find a documentation about this output , but could not
> > find any. can anyone point me a link where output of vio_s is explained?
> >
> > Another thing i dont understand about those numbers is if they are
> > just operations, or the number of blocks that was read/write/etc .
>
> its just operations and if i would explain what the numbers mean i 
> might confuse you even more because this is not what you are really 
> looking for.
> what you are looking for is what the client io's look like on the 
> Server side, while the VIO layer is the Server side to the disks, so 
> one lever lower than what you are looking for from what i could read 
> out of the description above.
No.. what I'm looking its exactly how the disks are busy to keep the 
requests. Obviously i'm not looking just that, but I feel the needs to 
monitor _*also*_ those things. Ill explain you why.

It happens when our storage is quite busy ( 180Gb/s of read/write ) that 
the FS start to be slowin normal /*cd*/ or /*ls*/ requests. This might 
be normal, but in those situation i want to know where the bottleneck 
is. Is the server CPU? Memory? Network? Spindles? knowing where the 
bottlenek is might help me to understand if we can tweak the system a 
bit more.

If its the CPU on the servers then there is no much to do beside 
replacing or add more servers.If its not the CPU, maybe more memory 
would help? Maybe its just the network that filled up? so i can add more 
links

Or if we reached the point there the bottleneck its the spindles, then 
there is no much point o look somethere else, we just reached the 
hardware limit..

Sometimes, it also happens that there is very low IO (10Gb/s ), almost 
no cpu usage on the servers but huge slownes ( ls can take 10 seconds).  
Why that happens? There is not much data ops , but we think there is a 
huge ammount of metadata ops. So what i want to know is if the metadata 
vdisks are busy or not. If this is our problem, could some SSD disks 
dedicated to metadata help?


In particular im, a bit puzzled with the design of our GSS storage.
Each recovery groups have 3 declustered arrays, and each declustered 
aray have 1 data and 1 metadata vdisk, but in the end both metadata and 
data vdisks use the same spindles. The problem that, its that I dont 
understand if we have a metadata bottleneck there. Maybe some SSD disks 
in a dedicated declustered array would perform much better, but this is 
just theory. I really would like to be able to monitor IO activities on 
the metadata vdisks.


>
>
> so the Layer you care about is the NSD Server layer, which sits on top 
> of the VIO layer (which is essentially the SW RAID Layer in GNR)
>
> > I'm asking that because if they are just ops, i don't know how much
> > they could be usefull. For example one write operation could eman
> > write 1 block or write a file of 100GB. If those are oprations,
> > there is a way to have the oupunt in bytes or blocks?
>
> there are multiple ways to get infos on the NSD layer, one would be to 
> use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts 
> again.

Counters its not a problem. I can collect them and create some graphs in 
a monitoring tool. I will check that.

>
> the alternative option is to use mmdiag --iohist. this shows you a 
> history of the last X numbers of io operations on either the client or 
> the server side like on a client :
>
> # mmdiag --iohist
>
> === mmdiag: iohist ===
>
> I/O history:
>
>  I/O start time RW    Buf type disk:sectorNum     nSec  time ms qTime 
> ms       RpcTimes ms  Type  Device/NSD ID         NSD server
> --------------- -- ----------- ----------------- -----  ------- 
> -------- -----------------  ---- ------------------ ---------------
> 14:25:22.169617  R  LLIndBlock    1:1075622848      64   13.073   
>  0.000   12.959  0.063  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:22.182723  R       inode    1:1071252480       8    6.970  0.000 
>    6.908    0.038  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:53.659918  R  LLIndBlock    1:1081202176      64    8.309   
>  0.000    8.210    0.046  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:53.668262  R       inode    2:1081373696       8   14.117   
>  0.000   14.032    0.058  cli   C0A70402:53BEEA5E   192.167.4.2
> 14:25:53.682750  R  LLIndBlock    1:1065508736      64    9.254   
>  0.000    9.180    0.038  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:53.692019  R       inode    2:1064356608       8   14.899   
>  0.000   14.847    0.029  cli   C0A70402:53BEEA5E   192.167.4.2
> 14:25:53.707100  R       inode    2:1077830152       8   16.499   
>  0.000   16.449    0.025  cli   C0A70402:53BEEA5E   192.167.4.2
> 14:25:53.723788  R  LLIndBlock    1:1081202432      64    4.280   
>  0.000    4.203    0.040  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:53.728082  R       inode    2:1081918976       8    7.760  0.000 
>    7.710    0.027  cli   C0A70402:53BEEA5E     192.167.4.2
> 14:25:57.877416  R    metadata  2:678978560       16   13.343    0.000 
>   13.254    0.053  cli   C0A70402:53BEEA5E   192.167.4.2
> 14:25:57.891048  R  LLIndBlock    1:1065508608      64   15.491   
>  0.000   15.401  0.058  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:57.906556  R       inode    2:1083476520       8   11.723   
>  0.000   11.676    0.029  cli   C0A70402:53BEEA5E   192.167.4.2
> 14:25:57.918516  R  LLIndBlock    1:1075622720      64    8.062   
>  0.000    8.001    0.032  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:57.926592  R       inode    1:1076503480       8    8.087  0.000 
>    8.043    0.026  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:57.934856  R  LLIndBlock    1:1071088512      64    6.572   
>  0.000    6.510    0.033  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:57.941441  R       inode    2:1069885984       8   11.686   
>  0.000   11.641    0.024  cli   C0A70402:53BEEA5E   192.167.4.2
> 14:25:57.953294  R       inode    2:1083476936       8    8.951  0.000 
>    8.912    0.021  cli   C0A70402:53BEEA5E     192.167.4.2
> 14:25:57.965475  R       inode    1:1076503504       8    0.477  0.000 
>    0.053    0.000  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:57.965755  R       inode    2:1083476488       8    0.410  0.000 
>    0.061    0.321  cli   C0A70402:53BEEA5E     192.167.4.2
> 14:25:57.965787  R       inode    2:1083476512       8    0.439  0.000 
>    0.053    0.342  cli   C0A70402:53BEEA5E     192.167.4.2
>
> you basically see if its a inode , data block , what size it has (in 
> sectors) , which nsd server you did send this request to, etc.
>
> on the Server side you see the type , which physical disk it goes to 
> and also what size of disk i/o it causes like :
>
> 14:26:50.129995  R       inode   12:3211886376      64   14.261   
>  0.000    0.000    0.000  pd   sdis
> 14:26:50.137102  R       inode   19:3003969520      64    9.004   
>  0.000    0.000    0.000  pd   sdad
> 14:26:50.136116  R       inode   55:3591710992      64   11.057   
>  0.000    0.000    0.000  pd   sdoh
> 14:26:50.141510  R       inode   21:3066810504      64    5.909   
>  0.000    0.000    0.000  pd   sdaf
> 14:26:50.130529  R       inode   89:2962370072      64   17.437   
>  0.000    0.000    0.000  pd   sddi
> 14:26:50.131063  R       inode   78:1889457000      64   17.062   
>  0.000    0.000    0.000  pd   sdsj
> 14:26:50.143403  R       inode   36:3323035688      64    4.807   
>  0.000    0.000    0.000  pd   sdmw
> 14:26:50.131044  R       inode   37:2513579736     128   17.181   
>  0.000    0.000    0.000  pd   sddv
> 14:26:50.138181  R       inode   72:3868810400      64   10.951   
>  0.000    0.000    0.000  pd   sdbz
> 14:26:50.138188  R       inode  131:2443484784     128   11.792   
>  0.000    0.000    0.000  pd   sdug
> 14:26:50.138003  R       inode  102:3696843872      64   11.994   
>  0.000    0.000    0.000  pd   sdgp
> 14:26:50.137099  R       inode  145:3370922504      64   13.225   
>  0.000    0.000    0.000  pd   sdmi
> 14:26:50.141576  R       inode   62:2668579904      64    9.313   
>  0.000    0.000    0.000  pd   sdou
> 14:26:50.134689  R       inode  159:2786164648      64   16.577   
>  0.000    0.000    0.000  pd   sdpq
> 14:26:50.145034  R       inode   34:2097217320      64    7.409   
>  0.000    0.000    0.000  pd   sdmt
> 14:26:50.138140  R       inode  139:2831038792      64   14.898   
>  0.000    0.000    0.000  pd   sdlw
> 14:26:50.130954  R       inode  164:282120312       64   22.274   
>  0.000    0.000    0.000  pd   sdzd
> 14:26:50.137038  R       inode   41:3421909608      64   16.314   
>  0.000    0.000    0.000  pd   sdef
> 14:26:50.137606  R       inode  104:1870962416      64   16.644   
>  0.000    0.000    0.000  pd   sdgx
> 14:26:50.141306  R       inode   65:2276184264      64   16.593   
>  0.000    0.000    0.000  pd   sdrk
>
>

mmdiag --iohist its another think i looked at it, but i could not find 
good explanation for all the "buf type" ( third column )

            allocSeg
            data
            iallocSeg
            indBlock
            inode
            LLIndBlock
            logData
            logDesc
            logWrap
            metadata
            vdiskAULog
            vdiskBuf
            vdiskFWLog
            vdiskMDLog
            vdiskMeta
            vdiskRGDesc

If i want to monifor metadata operation whan should i look at? just the 
metadata flag or also inode? this command takes also long to run, 
especially if i run it a second time it hangs for a lot before to rerun 
again, so i'm not sure that run it every 30secs or minute its viable, 
but i will look also into that. THere is any documentation that descibes 
clearly the whole output? what i found its quite generic and don't go 
into details...

> >
> > Last but not least.. and this is what i really would like to
> > accomplish, i would to be able to monitor the latency of metadata 
> operations.
>
> you can't do this on the server side as you don't know how much time 
> you spend on the client , network or anything between the app and the 
> physical disk, so you can only reliably look at this from the client, 
> the iohist output only shows you the Server disk i/o processing time, 
> but that can be a fraction of the overall time (in other cases this 
> obviously can also be the dominant part depending on your workload).
>
> the easiest way on the client is to run
>
> mmfsadm vfsstats enable
> from now on vfs stats are collected until you restart GPFS.
>
> then run :
>
> vfs statistics currently enabled
> started at: Fri Aug 29 13:15:05.380 2014
>   duration: 448446.970 sec
>
>  name        calls  time per call     total time
>  -------------------- -------- -------------- --------------
>  statfs          9       0.000002     0.000021
>  startIO  246191176       0.005853 1441049.976740
>
> to dump what ever you collected so far on this node.
>

We already do that, but as I said, I want to check specifically how gss 
servers are keeping the requests to identify or exlude server side 
bottlenecks.


Thanks for your help, you gave me definitely few things where to look at.

Salvatore

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140904/c99359d9/attachment.htm>

From sdinardo at ebi.ac.uk  Thu Sep  4 11:58:51 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Thu, 04 Sep 2014 11:58:51 +0100
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de>
References: <54074F90.7000303@ebi.ac.uk>
	<1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de>
Message-ID: <540845EB.1020202@ebi.ac.uk>

Little clarification, the filsystemn its not always slow. It happens 
that became very slow with particular users jobs in the farm. Maybe its 
just an indication thant we have huge ammount of metadata requestes, 
that's why i want to be able to monitor them

On 04/09/14 11:05, service at metamodul.com wrote:
> > , any "ls" could take ages.
> Check if you large directories either with many files or simply large.

     it happens that the files are very large ( over 100G), but there 
usually ther are no many files.
> Verify if you have NFS exported GPFS.

No NFS
> Verify that your cache settings on the clients are large enough ( 
> maxStatCache , maxFilesToCache , sharedMemLimit )
will look at them, but i'm not sure that the best number will be on the 
client. Obviously i cannot use all the memory of the client because 
those blients are meant to run jobs....

> Verify that you have dedicated metadata luns ( metadataOnly )
Yes, we have dedicate vdisks for metadata, but they are in the same 
declustered arrays/recoverygroups, so they whare the same spindles

> Reference:
> https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters 
>
> Note:
> If possible monitor your metadata luns on the storage directly.

that?s exactly than I'm trying to do !!!! :-D
> hth
> Hajo
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140904/68ff9b7a/attachment.htm>

From service at metamodul.com  Thu Sep  4 13:04:21 2014
From: service at metamodul.com (service at metamodul.com)
Date: Thu, 4 Sep 2014 14:04:21 +0200 (CEST)
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <540845EB.1020202@ebi.ac.uk>
References: <54074F90.7000303@ebi.ac.uk>
	<1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de>
	<540845EB.1020202@ebi.ac.uk>
Message-ID: <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de>

... , any "ls" could take ages.

>Check if you large directories either with many files or simply large.

>>    it happens that the files are very large ( over 100G), but there usually
>> ther are no many files.

>>> Please check that the directory size is not large.
In a worst case you have a directory with 10MB in size but it contains only one
file. In any way GPFS must fetch the whole directory structure might causing
unnecassery IO. Thus my request that you check your directory sizes.


>Verify that your cache settings on the clients are large enough ( maxStatCache
>, maxFilesToCache , sharedMemLimit )
>>will look at them, but i'm not sure that the best number will be on the
>>client. Obviously i cannot use all the memory of the client because those
>>blients are meant to run jobs....

Use lsof on the client to determine the amount of open filese. mmdiag --stats  (
>From my memory ) shows a little bit about the cache usage. maxStatCache does not
use that much memory.


> Verify that you have dedicated metadata luns ( metadataOnly )
>> Yes, we have dedicate vdisks for metadata, but they are in the same
>> declustered arrays/recoverygroups, so they whare the same spindles

Thats imho not a good approach. Metadata operation are small and random, data io
is large and streaming.

Just think you have a highway full of large trucks and you try to get with a
high speed bike to your destination. You will be blocked.
The same problem you have at your destiation. If many large trucks would like to
get their stuff off there is no time for somebody with a small parcel.

Thats the same reason why you should not access tape storage and disk storage
via the same FC adapter. ( Streaming IO version v. random/small IO )

So even without your current problem and motivation for measureing i would
strongly suggest to have at least dediacted SSD for metadata and if possible
even dedicated NSD server for the metadata.
Meaning have a dedicated path for your data and a dedicated path for your
metadata.

All from a users point of view
Hajo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140904/a74a36fb/attachment.htm>

From sdinardo at ebi.ac.uk  Thu Sep  4 14:25:09 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Thu, 04 Sep 2014 14:25:09 +0100
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de>
References: <54074F90.7000303@ebi.ac.uk>
	<1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de>
	<540845EB.1020202@ebi.ac.uk>
	<1374296431.733076.1409832261718.open-xchange@oxbaltgw12.schlund.de>
Message-ID: <54086835.6050603@ebi.ac.uk>


> >> Yes, we have dedicate vdisks for metadata, but they are in the same 
> declustered arrays/recoverygroups, so they whare the same spindles
>
> Thats imho not a good approach. Metadata operation are small and 
> random, data io is large and streaming.
>
> Just think you have a highway full of large trucks and you try to get 
> with a high speed bike to your destination. You will be blocked.
> The same problem you have at your destiation. If many large trucks 
> would like to get their stuff off there is no time for somebody with a 
> small parcel.
>
> Thats the same reason why you should not access tape storage and disk 
> storage via the same FC adapter. ( Streaming IO version v. 
> random/small IO )
>
> So even without your current problem and motivation for measureing i 
> would strongly suggest to have at least dediacted SSD for metadata and 
> if possible even dedicated NSD server for the metadata.
> Meaning have a dedicated path for your data and a dedicated path for 
> your metadata.
>
> All from a users point of view
> Hajo
>
That's where i  was puzzled too. GSS its a gpfs appliance and came 
configured this way. Also official GSS documentation suggest to create 
separate vdisks for data and meatadata, but in the same declustered 
arrays. I always felt this a strange choice, specially if we consider 
that metadata require a very small abbount of space, so few ssd could do 
the trick....


From sdinardo at ebi.ac.uk  Thu Sep  4 14:32:15 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Thu, 04 Sep 2014 14:32:15 +0100
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>
References: <54074F90.7000303@ebi.ac.uk>
	<OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>
Message-ID: <540869DF.5060100@ebi.ac.uk>

Sorry to bother you again but dstat have some issues with the plugin:

        [root at gss01a util]# dstat --gpfs
        /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is
        deprecated.  Use the subprocess module.
           pipes[cmd] = os.popen3(cmd, 't', 0)
        Module dstat_gpfs failed to load. (global name 'select' is not
        defined)
        None of the stats you selected are available.

I found this solution , but involve dstat recompile....

https://github.com/dagwieers/dstat/issues/44

Are you aware about any easier solution (we use RHEL6.3) ?


Regards,
Salvatore

On 04/09/14 01:50, Sven Oehme wrote:
> > Hello everybody,
>
> Hi
>
> > here i come here again, this time to ask some hint about how to 
> monitor GPFS.
> >
> > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is
> > that they return number based only on the request done in the
> > current host, so i have to run them on all the clients ( over 600
> > nodes) so its quite unpractical.  Instead i would like to know from
> > the servers whats going on, and i came across the vio_s statistics
> > wich are less documented and i dont know exacly what they mean.
> > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that
> > runs VIO_S.
> >
> > My problems with the output of this command:
> >  echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1
> >
> > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second
> > timestamp: 1409763206/477366
> > recovery group: *
> > declustered array: *
> > vdisk: *
> > client reads: 2584229
> > client short writes: 55299693
> > client medium writes: 190071
> > client promoted full track writes:      465145
> > client full track writes: 9249
> > flushed update writes: 4187708
> > flushed promoted full track writes: 123
> > migrate operations: 114
> > scrub operations: 450590
> > log writes: 28509602
> >
> > it sais "VIOPS per second", but they seem to me just counters as
> > every time i re-run the command, the numbers increase by a bit..
> > Can anyone confirm if those numbers are counter or if they are OPS/sec.
>
> the numbers are accumulative so everytime you run them they just show 
> the value since start (or last reset) time.
>
> >
> > On a closer eye about i dont understand what most of thosevalues
> > mean. For example, what exacly are "flushed promoted full track 
> write" ??
> > I tried to find a documentation about this output , but could not
> > find any. can anyone point me a link where output of vio_s is explained?
> >
> > Another thing i dont understand about those numbers is if they are
> > just operations, or the number of blocks that was read/write/etc .
>
> its just operations and if i would explain what the numbers mean i 
> might confuse you even more because this is not what you are really 
> looking for.
> what you are looking for is what the client io's look like on the 
> Server side, while the VIO layer is the Server side to the disks, so 
> one lever lower than what you are looking for from what i could read 
> out of the description above.
>
> so the Layer you care about is the NSD Server layer, which sits on top 
> of the VIO layer (which is essentially the SW RAID Layer in GNR)
>
> > I'm asking that because if they are just ops, i don't know how much
> > they could be usefull. For example one write operation could eman
> > write 1 block or write a file of 100GB. If those are oprations,
> > there is a way to have the oupunt in bytes or blocks?
>
> there are multiple ways to get infos on the NSD layer, one would be to 
> use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts 
> again.
>
> the alternative option is to use mmdiag --iohist. this shows you a 
> history of the last X numbers of io operations on either the client or 
> the server side like on a client :
>
> # mmdiag --iohist
>
> === mmdiag: iohist ===
>
> I/O history:
>
>  I/O start time RW    Buf type disk:sectorNum     nSec  time ms qTime 
> ms       RpcTimes ms  Type  Device/NSD ID         NSD server
> --------------- -- ----------- ----------------- -----  ------- 
> -------- -----------------  ---- ------------------ ---------------
> 14:25:22.169617  R  LLIndBlock    1:1075622848      64   13.073   
>  0.000   12.959  0.063  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:22.182723  R       inode    1:1071252480       8    6.970  0.000 
>    6.908    0.038  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:53.659918  R  LLIndBlock    1:1081202176      64    8.309   
>  0.000    8.210    0.046  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:53.668262  R       inode    2:1081373696       8   14.117   
>  0.000   14.032    0.058  cli   C0A70402:53BEEA5E   192.167.4.2
> 14:25:53.682750  R  LLIndBlock    1:1065508736      64    9.254   
>  0.000    9.180    0.038  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:53.692019  R       inode    2:1064356608       8   14.899   
>  0.000   14.847    0.029  cli   C0A70402:53BEEA5E   192.167.4.2
> 14:25:53.707100  R       inode    2:1077830152       8   16.499   
>  0.000   16.449    0.025  cli   C0A70402:53BEEA5E   192.167.4.2
> 14:25:53.723788  R  LLIndBlock    1:1081202432      64    4.280   
>  0.000    4.203    0.040  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:53.728082  R       inode    2:1081918976       8    7.760  0.000 
>    7.710    0.027  cli   C0A70402:53BEEA5E     192.167.4.2
> 14:25:57.877416  R    metadata  2:678978560       16   13.343    0.000 
>   13.254    0.053  cli   C0A70402:53BEEA5E   192.167.4.2
> 14:25:57.891048  R  LLIndBlock    1:1065508608      64   15.491   
>  0.000   15.401  0.058  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:57.906556  R       inode    2:1083476520       8   11.723   
>  0.000   11.676    0.029  cli   C0A70402:53BEEA5E   192.167.4.2
> 14:25:57.918516  R  LLIndBlock    1:1075622720      64    8.062   
>  0.000    8.001    0.032  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:57.926592  R       inode    1:1076503480       8    8.087  0.000 
>    8.043    0.026  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:57.934856  R  LLIndBlock    1:1071088512      64    6.572   
>  0.000    6.510    0.033  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:57.941441  R       inode    2:1069885984       8   11.686   
>  0.000   11.641    0.024  cli   C0A70402:53BEEA5E   192.167.4.2
> 14:25:57.953294  R       inode    2:1083476936       8    8.951  0.000 
>    8.912    0.021  cli   C0A70402:53BEEA5E     192.167.4.2
> 14:25:57.965475  R       inode    1:1076503504       8    0.477  0.000 
>    0.053    0.000  cli   C0A70401:53BEEA7F     192.167.4.1
> 14:25:57.965755  R       inode    2:1083476488       8    0.410  0.000 
>    0.061    0.321  cli   C0A70402:53BEEA5E     192.167.4.2
> 14:25:57.965787  R       inode    2:1083476512       8    0.439  0.000 
>    0.053    0.342  cli   C0A70402:53BEEA5E     192.167.4.2
>
> you basically see if its a inode , data block , what size it has (in 
> sectors) , which nsd server you did send this request to, etc.
>
> on the Server side you see the type , which physical disk it goes to 
> and also what size of disk i/o it causes like :
>
> 14:26:50.129995  R       inode   12:3211886376      64   14.261   
>  0.000    0.000    0.000  pd   sdis
> 14:26:50.137102  R       inode   19:3003969520      64    9.004   
>  0.000    0.000    0.000  pd   sdad
> 14:26:50.136116  R       inode   55:3591710992      64   11.057   
>  0.000    0.000    0.000  pd   sdoh
> 14:26:50.141510  R       inode   21:3066810504      64    5.909   
>  0.000    0.000    0.000  pd   sdaf
> 14:26:50.130529  R       inode   89:2962370072      64   17.437   
>  0.000    0.000    0.000  pd   sddi
> 14:26:50.131063  R       inode   78:1889457000      64   17.062   
>  0.000    0.000    0.000  pd   sdsj
> 14:26:50.143403  R       inode   36:3323035688      64    4.807   
>  0.000    0.000    0.000  pd   sdmw
> 14:26:50.131044  R       inode   37:2513579736     128   17.181   
>  0.000    0.000    0.000  pd   sddv
> 14:26:50.138181  R       inode   72:3868810400      64   10.951   
>  0.000    0.000    0.000  pd   sdbz
> 14:26:50.138188  R       inode  131:2443484784     128   11.792   
>  0.000    0.000    0.000  pd   sdug
> 14:26:50.138003  R       inode  102:3696843872      64   11.994   
>  0.000    0.000    0.000  pd   sdgp
> 14:26:50.137099  R       inode  145:3370922504      64   13.225   
>  0.000    0.000    0.000  pd   sdmi
> 14:26:50.141576  R       inode   62:2668579904      64    9.313   
>  0.000    0.000    0.000  pd   sdou
> 14:26:50.134689  R       inode  159:2786164648      64   16.577   
>  0.000    0.000    0.000  pd   sdpq
> 14:26:50.145034  R       inode   34:2097217320      64    7.409   
>  0.000    0.000    0.000  pd   sdmt
> 14:26:50.138140  R       inode  139:2831038792      64   14.898   
>  0.000    0.000    0.000  pd   sdlw
> 14:26:50.130954  R       inode  164:282120312       64   22.274   
>  0.000    0.000    0.000  pd   sdzd
> 14:26:50.137038  R       inode   41:3421909608      64   16.314   
>  0.000    0.000    0.000  pd   sdef
> 14:26:50.137606  R       inode  104:1870962416      64   16.644   
>  0.000    0.000    0.000  pd   sdgx
> 14:26:50.141306  R       inode   65:2276184264      64   16.593   
>  0.000    0.000    0.000  pd   sdrk
>
>
> >
> > Last but not least.. and this is what i really would like to
> > accomplish, i would to be able to monitor the latency of metadata 
> operations.
>
> you can't do this on the server side as you don't know how much time 
> you spend on the client , network or anything between the app and the 
> physical disk, so you can only reliably look at this from the client, 
> the iohist output only shows you the Server disk i/o processing time, 
> but that can be a fraction of the overall time (in other cases this 
> obviously can also be the dominant part depending on your workload).
>
> the easiest way on the client is to run
>
> mmfsadm vfsstats enable
> from now on vfs stats are collected until you restart GPFS.
>
> then run :
>
> vfs statistics currently enabled
> started at: Fri Aug 29 13:15:05.380 2014
>   duration: 448446.970 sec
>
>  name        calls  time per call     total time
>  -------------------- -------- -------------- --------------
>  statfs          9       0.000002     0.000021
>  startIO  246191176       0.005853 1441049.976740
>
> to dump what ever you collected so far on this node.
>
> > In my environment there are users that litterally overhelm our
> > storages with metadata request, so even if there is no massive
> > throughput or huge waiters, any "ls" could take ages. I would like
> > to be able to monitor metadata behaviour. There is a way to to do
> > that from the NSD servers?
>
> not this simple as described above.
>
> >
> > Thanks in advance for any tip/help.
> >
> > Regards,
> > Salvatore_______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at gpfsug.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140904/ba28845b/attachment.htm>

From orlando.richards at ed.ac.uk  Thu Sep  4 14:54:37 2014
From: orlando.richards at ed.ac.uk (Orlando Richards)
Date: Thu, 04 Sep 2014 14:54:37 +0100
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <540869DF.5060100@ebi.ac.uk>
References: <54074F90.7000303@ebi.ac.uk>	<OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>
	<540869DF.5060100@ebi.ac.uk>
Message-ID: <54086F1D.1000401@ed.ac.uk>


On 04/09/14 14:32, Salvatore Di Nardo wrote:
> Sorry to bother you again but dstat have some issues with the plugin:
>
>         [root at gss01a util]# dstat --gpfs
>         /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is
>         deprecated.  Use the subprocess module.
>            pipes[cmd] = os.popen3(cmd, 't', 0)
>         Module dstat_gpfs failed to load. (global name 'select' is not
>         defined)
>         None of the stats you selected are available.
>
> I found this solution , but involve dstat recompile....
>
> https://github.com/dagwieers/dstat/issues/44
>
> Are you aware about any easier solution (we use RHEL6.3) ?
>

This worked for me the other day on a dev box I was poking at:

# rm /usr/share/dstat/dstat_gpfsops*

# cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 
/usr/share/dstat/dstat_gpfsops.py

# dstat --gpfsops
/usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated.  Use 
the subprocess module.
   pipes[cmd] = os.popen3(cmd, 't', 0)
---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o-----------------------------
   cr   del  op/cl   rd    wr  trunc fsync looku gattr sattr other mb_rd 
mb_wr  pref wrbeh steal clean  sync revok logwr logda oth_r oth_w
    0     0     0     0     0     0     0     0     0     0     0     0 
     0     0     0     0     0     0     0     0     0     0     0

...


>
> Regards,
> Salvatore
>
> On 04/09/14 01:50, Sven Oehme wrote:
>> > Hello everybody,
>>
>> Hi
>>
>> > here i come here again, this time to ask some hint about how to
>> monitor GPFS.
>> >
>> > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is
>> > that they return number based only on the request done in the
>> > current host, so i have to run them on all the clients ( over 600
>> > nodes) so its quite unpractical.  Instead i would like to know from
>> > the servers whats going on, and i came across the vio_s statistics
>> > wich are less documented and i dont know exacly what they mean.
>> > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that
>> > runs VIO_S.
>> >
>> > My problems with the output of this command:
>> >  echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1
>> >
>> > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second
>> > timestamp: 1409763206/477366
>> > recovery group: *
>> > declustered array: *
>> > vdisk: *
>> > client reads: 2584229
>> > client short writes: 55299693
>> > client medium writes: 190071
>> > client promoted full track writes:      465145
>> > client full track writes: 9249
>> > flushed update writes: 4187708
>> > flushed promoted full track writes: 123
>> > migrate operations: 114
>> > scrub operations: 450590
>> > log writes: 28509602
>> >
>> > it sais "VIOPS per second", but they seem to me just counters as
>> > every time i re-run the command, the numbers increase by a bit..
>> > Can anyone confirm if those numbers are counter or if they are OPS/sec.
>>
>> the numbers are accumulative so everytime you run them they just show
>> the value since start (or last reset) time.
>>
>> >
>> > On a closer eye about i dont understand what most of thosevalues
>> > mean. For example, what exacly are "flushed promoted full track
>> write" ??
>> > I tried to find a documentation about this output , but could not
>> > find any. can anyone point me a link where output of vio_s is explained?
>> >
>> > Another thing i dont understand about those numbers is if they are
>> > just operations, or the number of blocks that was read/write/etc .
>>
>> its just operations and if i would explain what the numbers mean i
>> might confuse you even more because this is not what you are really
>> looking for.
>> what you are looking for is what the client io's look like on the
>> Server side, while the VIO layer is the Server side to the disks, so
>> one lever lower than what you are looking for from what i could read
>> out of the description above.
>>
>> so the Layer you care about is the NSD Server layer, which sits on top
>> of the VIO layer (which is essentially the SW RAID Layer in GNR)
>>
>> > I'm asking that because if they are just ops, i don't know how much
>> > they could be usefull. For example one write operation could eman
>> > write 1 block or write a file of 100GB. If those are oprations,
>> > there is a way to have the oupunt in bytes or blocks?
>>
>> there are multiple ways to get infos on the NSD layer, one would be to
>> use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats counts
>> again.
>>
>> the alternative option is to use mmdiag --iohist. this shows you a
>> history of the last X numbers of io operations on either the client or
>> the server side like on a client :
>>
>> # mmdiag --iohist
>>
>> === mmdiag: iohist ===
>>
>> I/O history:
>>
>>  I/O start time RW    Buf type disk:sectorNum     nSec  time ms qTime
>> ms       RpcTimes ms  Type  Device/NSD ID         NSD server
>> --------------- -- ----------- ----------------- -----  -------
>> -------- -----------------  ---- ------------------ ---------------
>> 14:25:22.169617  R  LLIndBlock    1:1075622848      64   13.073
>>  0.000   12.959  0.063  cli   C0A70401:53BEEA7F     192.167.4.1
>> 14:25:22.182723  R       inode    1:1071252480       8    6.970  0.000
>>    6.908    0.038  cli   C0A70401:53BEEA7F     192.167.4.1
>> 14:25:53.659918  R  LLIndBlock    1:1081202176      64    8.309
>>  0.000    8.210    0.046  cli   C0A70401:53BEEA7F     192.167.4.1
>> 14:25:53.668262  R       inode    2:1081373696       8   14.117
>>  0.000   14.032    0.058  cli   C0A70402:53BEEA5E   192.167.4.2
>> 14:25:53.682750  R  LLIndBlock    1:1065508736      64    9.254
>>  0.000    9.180    0.038  cli   C0A70401:53BEEA7F     192.167.4.1
>> 14:25:53.692019  R       inode    2:1064356608       8   14.899
>>  0.000   14.847    0.029  cli   C0A70402:53BEEA5E   192.167.4.2
>> 14:25:53.707100  R       inode    2:1077830152       8   16.499
>>  0.000   16.449    0.025  cli   C0A70402:53BEEA5E   192.167.4.2
>> 14:25:53.723788  R  LLIndBlock    1:1081202432      64    4.280
>>  0.000    4.203    0.040  cli   C0A70401:53BEEA7F     192.167.4.1
>> 14:25:53.728082  R       inode    2:1081918976       8    7.760  0.000
>>    7.710    0.027  cli   C0A70402:53BEEA5E     192.167.4.2
>> 14:25:57.877416  R    metadata  2:678978560       16   13.343    0.000
>>   13.254    0.053  cli   C0A70402:53BEEA5E   192.167.4.2
>> 14:25:57.891048  R  LLIndBlock    1:1065508608      64   15.491
>>  0.000   15.401  0.058  cli   C0A70401:53BEEA7F     192.167.4.1
>> 14:25:57.906556  R       inode    2:1083476520       8   11.723
>>  0.000   11.676    0.029  cli   C0A70402:53BEEA5E   192.167.4.2
>> 14:25:57.918516  R  LLIndBlock    1:1075622720      64    8.062
>>  0.000    8.001    0.032  cli   C0A70401:53BEEA7F     192.167.4.1
>> 14:25:57.926592  R       inode    1:1076503480       8    8.087  0.000
>>    8.043    0.026  cli   C0A70401:53BEEA7F     192.167.4.1
>> 14:25:57.934856  R  LLIndBlock    1:1071088512      64    6.572
>>  0.000    6.510    0.033  cli   C0A70401:53BEEA7F     192.167.4.1
>> 14:25:57.941441  R       inode    2:1069885984       8   11.686
>>  0.000   11.641    0.024  cli   C0A70402:53BEEA5E   192.167.4.2
>> 14:25:57.953294  R       inode    2:1083476936       8    8.951  0.000
>>    8.912    0.021  cli   C0A70402:53BEEA5E     192.167.4.2
>> 14:25:57.965475  R       inode    1:1076503504       8    0.477  0.000
>>    0.053    0.000  cli   C0A70401:53BEEA7F     192.167.4.1
>> 14:25:57.965755  R       inode    2:1083476488       8    0.410  0.000
>>    0.061    0.321  cli   C0A70402:53BEEA5E     192.167.4.2
>> 14:25:57.965787  R       inode    2:1083476512       8    0.439  0.000
>>    0.053    0.342  cli   C0A70402:53BEEA5E     192.167.4.2
>>
>> you basically see if its a inode , data block , what size it has (in
>> sectors) , which nsd server you did send this request to, etc.
>>
>> on the Server side you see the type , which physical disk it goes to
>> and also what size of disk i/o it causes like :
>>
>> 14:26:50.129995  R       inode   12:3211886376      64   14.261
>>  0.000    0.000    0.000  pd   sdis
>> 14:26:50.137102  R       inode   19:3003969520      64    9.004
>>  0.000    0.000    0.000  pd   sdad
>> 14:26:50.136116  R       inode   55:3591710992      64   11.057
>>  0.000    0.000    0.000  pd   sdoh
>> 14:26:50.141510  R       inode   21:3066810504      64    5.909
>>  0.000    0.000    0.000  pd   sdaf
>> 14:26:50.130529  R       inode   89:2962370072      64   17.437
>>  0.000    0.000    0.000  pd   sddi
>> 14:26:50.131063  R       inode   78:1889457000      64   17.062
>>  0.000    0.000    0.000  pd   sdsj
>> 14:26:50.143403  R       inode   36:3323035688      64    4.807
>>  0.000    0.000    0.000  pd   sdmw
>> 14:26:50.131044  R       inode   37:2513579736     128   17.181
>>  0.000    0.000    0.000  pd   sddv
>> 14:26:50.138181  R       inode   72:3868810400      64   10.951
>>  0.000    0.000    0.000  pd   sdbz
>> 14:26:50.138188  R       inode  131:2443484784     128   11.792
>>  0.000    0.000    0.000  pd   sdug
>> 14:26:50.138003  R       inode  102:3696843872      64   11.994
>>  0.000    0.000    0.000  pd   sdgp
>> 14:26:50.137099  R       inode  145:3370922504      64   13.225
>>  0.000    0.000    0.000  pd   sdmi
>> 14:26:50.141576  R       inode   62:2668579904      64    9.313
>>  0.000    0.000    0.000  pd   sdou
>> 14:26:50.134689  R       inode  159:2786164648      64   16.577
>>  0.000    0.000    0.000  pd   sdpq
>> 14:26:50.145034  R       inode   34:2097217320      64    7.409
>>  0.000    0.000    0.000  pd   sdmt
>> 14:26:50.138140  R       inode  139:2831038792      64   14.898
>>  0.000    0.000    0.000  pd   sdlw
>> 14:26:50.130954  R       inode  164:282120312       64   22.274
>>  0.000    0.000    0.000  pd   sdzd
>> 14:26:50.137038  R       inode   41:3421909608      64   16.314
>>  0.000    0.000    0.000  pd   sdef
>> 14:26:50.137606  R       inode  104:1870962416      64   16.644
>>  0.000    0.000    0.000  pd   sdgx
>> 14:26:50.141306  R       inode   65:2276184264      64   16.593
>>  0.000    0.000    0.000  pd   sdrk
>>
>>
>> >
>> > Last but not least.. and this is what i really would like to
>> > accomplish, i would to be able to monitor the latency of metadata
>> operations.
>>
>> you can't do this on the server side as you don't know how much time
>> you spend on the client , network or anything between the app and the
>> physical disk, so you can only reliably look at this from the client,
>> the iohist output only shows you the Server disk i/o processing time,
>> but that can be a fraction of the overall time (in other cases this
>> obviously can also be the dominant part depending on your workload).
>>
>> the easiest way on the client is to run
>>
>> mmfsadm vfsstats enable
>> from now on vfs stats are collected until you restart GPFS.
>>
>> then run :
>>
>> vfs statistics currently enabled
>> started at: Fri Aug 29 13:15:05.380 2014
>>   duration: 448446.970 sec
>>
>>  name        calls  time per call     total time
>>  -------------------- -------- -------------- --------------
>>  statfs          9       0.000002     0.000021
>>  startIO  246191176       0.005853 1441049.976740
>>
>> to dump what ever you collected so far on this node.
>>
>> > In my environment there are users that litterally overhelm our
>> > storages with metadata request, so even if there is no massive
>> > throughput or huge waiters, any "ls" could take ages. I would like
>> > to be able to monitor metadata behaviour. There is a way to to do
>> > that from the NSD servers?
>>
>> not this simple as described above.
>>
>> >
>> > Thanks in advance for any tip/help.
>> >
>> > Regards,
>> > Salvatore_______________________________________________
>> > gpfsug-discuss mailing list
>> > gpfsug-discuss at gpfsug.org
>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at gpfsug.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

-- 
             --
        Dr Orlando Richards
Research Facilities (ECDF) Systems Leader
        Information Services
    IT Infrastructure Division
        Tel: 0131 650 4994
      skype: orlando.richards

The University of Edinburgh is a charitable body, registered in 
Scotland, with registration number SC005336.


From sdinardo at ebi.ac.uk  Thu Sep  4 15:07:42 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Thu, 04 Sep 2014 15:07:42 +0100
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <54086F1D.1000401@ed.ac.uk>
References: <54074F90.7000303@ebi.ac.uk>	<OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>	<540869DF.5060100@ebi.ac.uk>
	<54086F1D.1000401@ed.ac.uk>
Message-ID: <5408722E.6060309@ebi.ac.uk>


On 04/09/14 14:54, Orlando Richards wrote:
>
>
> On 04/09/14 14:32, Salvatore Di Nardo wrote:
>> Sorry to bother you again but dstat have some issues with the plugin:
>>
>>         [root at gss01a util]# dstat --gpfs
>>         /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is
>>         deprecated.  Use the subprocess module.
>>            pipes[cmd] = os.popen3(cmd, 't', 0)
>>         Module dstat_gpfs failed to load. (global name 'select' is not
>>         defined)
>>         None of the stats you selected are available.
>>
>> I found this solution , but involve dstat recompile....
>>
>> https://github.com/dagwieers/dstat/issues/44
>>
>> Are you aware about any easier solution (we use RHEL6.3) ?
>>
>
> This worked for me the other day on a dev box I was poking at:
>
> # rm /usr/share/dstat/dstat_gpfsops*
>
> # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 
> /usr/share/dstat/dstat_gpfsops.py
>
> # dstat --gpfsops
> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use 
> the subprocess module.
>   pipes[cmd] = os.popen3(cmd, 't', 0)
> ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- 
>
>   cr   del  op/cl   rd    wr  trunc fsync looku gattr sattr other 
> mb_rd mb_wr  pref wrbeh steal clean  sync revok logwr logda oth_r oth_w
>    0     0     0     0     0     0     0     0     0     0 0     0     
> 0     0     0     0     0     0     0     0     0 0     0
>
> ...
>

NICE!! The only problem  is that the box seems lacking those python scripts:

        ls /usr/lpp/mmfs/samples/util/
        makefile  README  tsbackup  tsbackup.C  tsbackup.h tsfindinode 
        tsfindinode.c  tsgetusage  tsgetusage.c  tsinode tsinode.c 
        tslistall  tsreaddir  tsreaddir.c  tstimes tstimes.c

Do you mind sending me those py files? They should be 3 as i see e gpfs 
options:  gpfs, gpfs-ops, gpfsops  (dunno what are the differences )

Regards,
Salvatore


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140904/8feb0d5b/attachment.htm>

From orlando.richards at ed.ac.uk  Thu Sep  4 15:14:02 2014
From: orlando.richards at ed.ac.uk (Orlando Richards)
Date: Thu, 04 Sep 2014 15:14:02 +0100
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <5408722E.6060309@ebi.ac.uk>
References: <54074F90.7000303@ebi.ac.uk>	<OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>	<540869DF.5060100@ebi.ac.uk>	<54086F1D.1000401@ed.ac.uk>
	<5408722E.6060309@ebi.ac.uk>
Message-ID: <540873AA.5070401@ed.ac.uk>


On 04/09/14 15:07, Salvatore Di Nardo wrote:
>
> On 04/09/14 14:54, Orlando Richards wrote:
>>
>>
>> On 04/09/14 14:32, Salvatore Di Nardo wrote:
>>> Sorry to bother you again but dstat have some issues with the plugin:
>>>
>>>         [root at gss01a util]# dstat --gpfs
>>>         /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is
>>>         deprecated.  Use the subprocess module.
>>>            pipes[cmd] = os.popen3(cmd, 't', 0)
>>>         Module dstat_gpfs failed to load. (global name 'select' is not
>>>         defined)
>>>         None of the stats you selected are available.
>>>
>>> I found this solution , but involve dstat recompile....
>>>
>>> https://github.com/dagwieers/dstat/issues/44
>>>
>>> Are you aware about any easier solution (we use RHEL6.3) ?
>>>
>>
>> This worked for me the other day on a dev box I was poking at:
>>
>> # rm /usr/share/dstat/dstat_gpfsops*
>>
>> # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7
>> /usr/share/dstat/dstat_gpfsops.py
>>
>> # dstat --gpfsops
>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use
>> the subprocess module.
>>   pipes[cmd] = os.popen3(cmd, 't', 0)
>> ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o-----------------------------
>>
>>   cr   del  op/cl   rd    wr  trunc fsync looku gattr sattr other
>> mb_rd mb_wr  pref wrbeh steal clean  sync revok logwr logda oth_r oth_w
>>    0     0     0     0     0     0     0     0     0     0 0     0
>> 0     0     0     0     0     0     0     0     0 0     0
>>
>> ...
>>
>
> NICE!! The only problem  is that the box seems lacking those python scripts:
>
>         ls /usr/lpp/mmfs/samples/util/
>         makefile  README  tsbackup  tsbackup.C  tsbackup.h tsfindinode
>         tsfindinode.c  tsgetusage  tsgetusage.c  tsinode tsinode.c
>         tslistall  tsreaddir  tsreaddir.c  tstimes tstimes.c
>

It came from the gpfs.base rpm:

# rpm -qf /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7
gpfs.base-3.5.0-13.x86_64


> Do you mind sending me those py files? They should be 3 as i see e gpfs
> options:  gpfs, gpfs-ops, gpfsops  (dunno what are the differences )
>

Only the gpfsops.py is included in the bundle - one for dstat 0.6 and 
one for dstat 0.7.


I've attached it to this mail as well (it seems to be GPL'd).


> Regards,
> Salvatore
>
>
>
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

-- 
             --
        Dr Orlando Richards
Research Facilities (ECDF) Systems Leader
        Information Services
    IT Infrastructure Division
        Tel: 0131 650 4994
      skype: orlando.richards

The University of Edinburgh is a charitable body, registered in 
Scotland, with registration number SC005336.
-------------- next part --------------
#
# Copyright (C) 2009, 2010 IBM Corporation
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2, or (at your option)
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#

global string, select, os, re, fnmatch
import string, select, os, re, fnmatch

# Dstat class to display selected gpfs performance counters returned by the
# mmpmon "vfs_s", "ioc_s", "vio_s", "vflush_s", and "lroc_s" commands.
#
# The set of counters displayed can be customized via environment variables:
#
#   DSTAT_GPFS_WHAT
#
#     Selects which of the five mmpmon commands to display.
#     It is a comma separated list of any of the following:
#       "vfs":    show mmpmon "vfs_s" counters
#       "ioc":    show mmpmon "ioc_s" counters related to NSD client I/O
#       "nsd":    show mmpmon "ioc_s" counters related to NSD server I/O
#       "vio":    show mmpmon "vio_s" counters
#       "vflush": show mmpmon "vflush_s" counters
#       "lroc":   show mmpmon "lroc_s" counters
#       "all":    equivalent to specifying all of the above
#
#     Example:
#
#       DSTAT_GPFS_WHAT=vfs,lroc dstat -M gpfsops
#
#     will display counters for mmpmon "vfs_s" and "lroc" commands.
#
#     The default setting is "vfs,ioc", i.e., by default only "vfs_s" and NSD
#     client related "ioc_s" counters are displayed.
#
#   DSTAT_GPFS_VFS
#   DSTAT_GPFS_IOC
#   DSTAT_GPFS_VIO
#   DSTAT_GPFS_VFLUSH
#   DSTAT_GPFS_LROC
#
#     Allow finer grain control over exactly which values will be displayed for
#     each of the five mmpmon commands.  Each variable is a comma separated list
#     of counter names with optional column header string.
#
#     Example:
#
#       export DSTAT_GPFS_VFS='create, remove, rd/wr=read+write'
#       export DSTAT_GPFS_IOC='sync*, logwrap*, oth_rd=*_rd, oth_wr=*_wr'
#       dstat -M gpfsops
#
#     Under "vfs-ops" this will display three columns, showing creates, deletes
#     (removes), and a third column labelled "rd/wr" with a combined count of
#     read and write operations.
#     Under "disk-i/o" it will display four columns, showing all disk  I/Os
#     initiated by sync, and log wrap, plus two columns labeled "oth_rd" and
#     "oth_wr" showing counts of all other disk reads and disk writes,
#     respectively.
#
#     Note: setting one of these environment variables overrides the
#     corrosponding setting in DSTAT_GPFS_WHAT.  For example, setting
#     DSTAT_GPFS_VFS="" will omit all "vfs_s" counters regardless of whether
#     "vfs" appears in DSTAT_GPFS_WHAT or not.
#
#     Counter sets are specified as a comma-separated list of entries of one
#     of the following forms
#
#       counter
#       label = counter
#       label = counter1 + counter2 + ...
#
#     If no label is specified, the name of the counter is used as the column
#     header (truncated to 5 characters).
#     Counter names may contain shell-style wildcards.  For example, the
#     pattern "sync*" matches the two ioc_s counters "sync_rd" and "sync_wr" and
#     therefore produce a column containing the combined count of disk reads and
#     disk writes initiated by sync.  If a counter appears in or matches a name
#     pattern in more than one entry, it is included only in the count under the
#     first entry in which it appears.  For example, adding an entry "other = *"
#     at the end of the list will add a column labeled "other" that shows the
#     sum of all counter values *not* included in any of the previous columns.
#
#   DSTAT_GPFS_LIST=1 dstat -M gpfsops
#
#     This will show all available counter names and the default definition 
#     for which sets of counter values are displayed.
#
# An alternative to setting environment variables is to create a file
#   ~/.dstat_gpfs_rc
# with python statements that sets any of the following variables
#   vfs_wanted:    equivalent to setting DSTAT_GPFS_VFS
#   ioc_wanted:    equivalent to setting DSTAT_GPFS_IOC
#   vio_wanted:    equivalent to setting DSTAT_GPFS_VIO
#   vflush_wanted: equivalent to setting DSTAT_GPFS_VFLUSH
#   lroc_wanted:   equivalent to setting DSTAT_GPFS_LROC
#
# For example, the following ~/.dstat_gpfs_rc file will produce the same 
# result as the environment variables in the example above:
#
#   vfs_wanted = 'create, remove, rd/wr=read+write'
#   ioc_wanted = 'sync*, logwrap*, oth_rd=*_rd, oth_wr=*_wr'
#
# See also the default vfs_wanted, ioc_wanted, and vio_wanted  settings in 
# the dstat_gpfsops __init__ method below.


class dstat_plugin(dstat):
    def __init__(self):
        # list of all stats counters returned by mmpmon "vfs_s", "ioc_s", "vio_s", "vflush_s", and "lroc_s"
        # always ignore the first few chars like : io_s _io_s_ _n_ 172.31.136.2 _nn_ mgmt001st001 _rc_ 0 _t_ 1322526286 _tu_ 415518
        vfs_keys = ('_access_', '_close_', '_create_', '_fclear_', '_fsync_', '_fsync_range_', '_ftrunc_', '_getattr_',
                    '_link_', '_lockctl_', '_lookup_', '_map_lloff_', '_mkdir_', '_mknod_', '_open_', '_read_', '_write_',
                    '_mmapRead_', '_mmapWrite_', '_aioRead_', '_aioWrite_','_readdir_', '_readlink_', '_readpage_', '_remove_', '_rename_', '_rmdir_',
                    '_setacl_', '_setattr_', '_symlink_', '_unmap_', '_writepage_', '_tsfattr_', '_tsfsattr_', '_flock_',
                    '_setxattr_', '_getxattr_', '_listxattr_', '_removexattr_', '_encode_fh_', '_decode_fh_', '_get_dentry_',
                    '_get_parent_', '_mount_', '_statfs_', '_sync_', '_vget_')

        ioc_keys = ('_other_rd_', '_other_wr_','_mb_rd_', '_mb_wr_', '_steal_rd_', '_steal_wr_', '_cleaner_rd_', '_cleaner_wr_',
                    '_sync_rd_', '_sync_wr_', '_logwrap_rd_', '_logwrap_wr_', '_revoke_rd_', '_revoke_wr_',
                    '_prefetch_rd_', '_prefetch_wr_', '_logdata_rd_', '_logdata_wr_', '_nsdworker_rd_', '_nsdworker_wr_','_nsdlocal_rd_','_nsdlocal_wr_',
                    '_vdisk_rd_','_vdisk_wr_', '_pdisk_rd_','_pdisk_wr_', '_logtip_rd_', '_logtip_wr_')

        vio_keys = ('_r_', '_sw_', '_mw_', '_pfw_', '_ftw_', '_fuw_', '_fpw_', '_m_', '_s_', '_l_', '_rgd_', '_meta_')

        vflush_keys = ('_ndt_', '_ngdb_', '_nfwlmb_', '_nfipt_', '_nfwwt_', '_ahwm_', '_susp_', '_uwrttf_', '_fftc_', '_nalth_', '_nasth_', '_nsigth_', '_ntgtth_')

        lroc_keys = ('_Inode_s_', '_Inode_sf_', '_Inode_smb_', '_Inode_r_', '_Inode_rf_', '_Inode_rmb_', '_Inode_i_', '_Inode_imb_',
                     '_Directory_s_', '_Directory_sf_', '_Directory_smb_', '_Directory_r_', '_Directory_rf_', '_Directory_rmb_', '_Directory_i_', '_Directory_imb_',
                     '_Data_s_', '_Data_sf_', '_Data_smb_', '_Data_r_', '_Data_rf_', '_Data_rmb_', '_Data_i_',  '_Data_imb_',
                     '_agt_i_', '_agt_i_rm_', '_agt_i_rM_', '_agt_i_ra_', '_agt_r_', '_agt_r_rm_', '_agt_r_rM_', '_agt_r_ra_', 
                     '_ssd_w_', '_ssd_w_p_', '_ssd_w_rm_', '_ssd_w_rM_', '_ssd_w_ra_', '_ssd_r_', '_ssd_r_p_', '_ssd_r_rm_', '_ssd_r_rM_', '_ssd_r_ra_') 

        # Default counters to display for each mmpmon category
        vfs_wanted = '''cr = create + mkdir + link + symlink,
                        del = remove + rmdir,
                        op/cl = open + close + map_lloff + unmap,
                        rd = read + readdir + readlink + mmapRead + readpage + aioRead + aioWrite,
                        wr = write + mmapWrite + writepage,
                        trunc = ftrunc + fclear,
                        fsync = fsync + fsync_range,
                        lookup,
                        gattr = access + getattr + getxattr + getacl,
                        sattr = setattr + setxattr + setacl,
                        other = *
                     '''
        ioc_wanted1 = '''mb_rd, mb_wr, pref=prefetch_rd, wrbeh=prefetch_wr,
                         steal*, cleaner*, sync*, revoke*, logwrap*, logdata*,
                         oth_rd = other_rd, oth_wr = other_wr
                      '''
        ioc_wanted2 = '''rns_r=nsdworker_rd, rns_w=nsdworker_wr,
                         lns_r=nsdlocal_rd, lns_w=nsdlocal_wr,
                         vd_r=vdisk_rd, vd_w=vdisk_wr, pd_r=pdisk_rd, pd_w=pdisk_wr,
                      '''
        vio_wanted = '''ClRead=r, ClShWr=sw, ClMdWr=mw, ClPFTWr=pfw, ClFTWr=ftw,
                        FlUpWr=fuw, FlPFTWr=fpw, Migrte=m, Scrub=s, LgWr=l,
                        RGDsc=rgd, Meta=meta
                     '''
        vflush_wanted = '''DiTrk = ndt,
                           DiBuf = ngdb,
                           FwLog = nfwlmb,
                           FinPr = nfipt,
                           WraTh = nfwwt,
                           HiWMa = ahwm,
                           Suspd = susp,
                           WrThF = uwrttf,
                           Force = fftc,
                           TrgTh = ntgtth,
                           other = nalth + nasth + nsigth
                        '''
        lroc_wanted = '''StorS = Inode_s + Directory_s + Data_s,
                         StorF = Inode_sf + Directory_sf + Data_sf,      
                         FetcS = Inode_r + Directory_r + Data_r,
                         FetcF = Inode_rf + Directory_rf + Data_rf, 
                         InVAL = Inode_i + Directory_i + Data_i
                      '''

        # Coarse counter selection via DSTAT_GPFS_WHAT
        if 'DSTAT_GPFS_WHAT' in os.environ:
            what_wanted = os.environ['DSTAT_GPFS_WHAT'].split(',')
        else:
            what_wanted = [ 'vfs', 'ioc' ]

        # If ".dstat_gpfs_rc" exists in user's home directory, run it.
        # Otherwise, use DSTAT_GPFS_WHAT for counter selection and look for other
        # DSTAT_GPFS_XXX environment variables for additional customization.
        userprofile = os.path.join(os.environ['HOME'], '.dstat_gpfs_rc')
        if os.path.exists(userprofile):
            ioc_wanted = ioc_wanted1 + ioc_wanted2
            exec file(userprofile)
        else:
            if 'all' not in what_wanted:
                if 'vfs' not in what_wanted:
                    vfs_wanted = ''
                if 'ioc' not in what_wanted:
                    ioc_wanted1 = ''
                if 'nsd' not in what_wanted:
                    ioc_wanted2 = ''
                if 'vio' not in what_wanted:
                    vio_wanted = ''
                if 'vflush' not in what_wanted:
                    vflush_wanted = ''
                if 'lroc' not in what_wanted:
                    lroc_wanted = ''
            ioc_wanted = ioc_wanted1 + ioc_wanted2

            # Fine grain counter cusomization via DSTAT_GPFS_XXX
            if 'DSTAT_GPFS_VFS' in os.environ:
                vfs_wanted = os.environ['DSTAT_GPFS_VFS']
            if 'DSTAT_GPFS_IOC' in os.environ:
                ioc_wanted = os.environ['DSTAT_GPFS_IOC']
            if 'DSTAT_GPFS_VIO' in os.environ:
                vio_wanted = os.environ['DSTAT_GPFS_VIO']
            if 'DSTAT_GPFS_VFLUSH' in os.environ:
                vflush_wanted = os.environ['DSTAT_GPFS_VFLUSH']
            if 'DSTAT_GPFS_LROC' in os.environ:
                lroc_wanted = os.environ['DSTAT_GPFS_LROC']

        self.debug = 0

        vars1, nick1, keymap1 = self.make_keymap(vfs_keys, vfs_wanted, 'gpfs-vfs-')
        vars2, nick2, keymap2 = self.make_keymap(ioc_keys, ioc_wanted, 'gpfs-io-')
        vars3, nick3, keymap3 = self.make_keymap(vio_keys, vio_wanted, 'gpfs-vio-')
        vars4, nick4, keymap4 = self.make_keymap(vflush_keys, vflush_wanted, 'gpfs-vflush-')
        vars5, nick5, keymap5 = self.make_keymap(lroc_keys, lroc_wanted, 'gpfs-lroc-')

        if 'DSTAT_GPFS_LIST' in os.environ or self.debug:
            self.show_keymap('vfs_s', 'DSTAT_GPFS_VFS', vfs_keys, vfs_wanted, vars1, keymap1, 'gpfs-vfs-')
            self.show_keymap('ioc_s', 'DSTAT_GPFS_IOC', ioc_keys, ioc_wanted, vars2, keymap2, 'gpfs-io-')
            self.show_keymap('vio_s', 'DSTAT_GPFS_VIO', vio_keys, vio_wanted, vars3, keymap3, 'gpfs-vio-')
            self.show_keymap('vflush_stat', 'DSTAT_GPFS_VFLUSH', vflush_keys, vflush_wanted, vars4, keymap4, 'gpfs-vflush-')
            self.show_keymap('lroc_s', 'DSTAT_GPFS_LROC', lroc_keys, lroc_wanted, vars5, keymap5, 'gpfs-lroc-')
            print

        self.vars = vars1 + vars2 + vars3 + vars4 + vars5
        self.varsrate = vars1 + vars2 + vars3 + vars5
        self.varsconst = vars4
        self.nick = nick1 + nick2 + nick3 + nick4 + nick5
        self.vfs_keymap = keymap1  
        self.ioc_keymap = keymap2
        self.vio_keymap = keymap3
        self.vflush_keymap = keymap4
        self.lroc_keymap = keymap5

        names = []
        self.addtitle(names, 'gpfs vfs ops', len(vars1))
        self.addtitle(names, 'gpfs disk i/o', len(vars2))
        self.addtitle(names, 'gpfs vio', len(vars3))
        self.addtitle(names, 'gpfs vflush', len(vars4))
        self.addtitle(names, 'gpfs lroc', len(vars5))

        self.name = '#'.join(names)
        self.type = 'd'
        self.width = 5
        self.scale = 1000


    def make_keymap(self, keys, wanted, prefix):
        '''Parse the list of counter values to be displayd
           "keys" is the list of all available counters
           "wanted" is a string of the form "name1 = key1 + key2 + ..., name2 = key3 + key4  ..."
           Returns a list of all names found, e.g. ['name1', 'name2', ...], and a dictionary that
           maps counters to names, e.g., { 'key1': 'name1', 'key2': 'name1', 'key3': 'name2', ... },
        '''
        vars = []
        nick = []
        kmap = {}
        ## print re.split(r'\s*,\s*', wanted.strip())

        for n in re.split(r'\s*,\s*', wanted.strip()):
            l = re.split(r'\s*=\s*', n, 2)
            if len(l) == 2:
                v = l[0]
                kl = re.split(r'\s*\+\s*', l[1])
            elif l[0]:
                v = l[0].strip('*')
                kl = l
            else:
                continue
            nick.append(v[0:5])
            v = prefix + v.replace('/', '-')
            vars.append(v)
            for s in kl:
                for k in keys:
                    if fnmatch.fnmatch(k.strip('_'), s) and k not in kmap:
                        kmap[k] = v
        return vars, nick, kmap

    def show_keymap(self, label, envname, keys, wanted, vars, kmap, prefix):
        'show available counter names and current counter set definition'
        linewd = 100

        print '\nAvailable counters for "%s":' % label
        mlen = max([len(k.strip('_')) for k in keys])
        ncols = linewd // (mlen + 1)
        nrows = (len(keys) + ncols - 1) // ncols
        for r in range(nrows):
            print ' ',
            for c in range(ncols):
                i = c *nrows + r
                if not i < len(keys):
                    break
                print keys[i].strip('_').ljust(mlen),
            print

        print '\nCurrent counter set selection:'
        print "\n%s='%s'\n" % (envname, re.sub(r'\s+', '', wanted).strip().replace(',', ', '))
        if not vars:
            return
        mlen = 5
        for v in vars:
            if v.startswith(prefix):
                s = v[len(prefix):]
            else:
                s = v
            n = ' %s = ' % s[0:mlen].rjust(mlen)
            kl = [ k.strip('_') for k in keys if kmap.get(k) == v ]
            i = 0
            while i < len(kl):
                slen = len(n) + 3 + len(kl[i])
                j = i + 1
                while j < len(kl) and slen + 3 + len(kl[j]) < linewd:
                    slen += 3 + len(kl[j])
                    j += 1
                print n + ' + '.join(kl[i:j])
                i = j
                n = ' %s + ' % ''.rjust(mlen)

    def addtitle(self, names, name, ncols):
        'pad title given by "name" with minus signs to span "ncols" columns'
        if ncols == 1:
            names.append(name.split()[-1].center(6*ncols - 1))
        elif ncols > 1:
            names.append(name.center(6*ncols - 1))

    def check(self):
        'start mmpmon command'
        if os.access('/usr/lpp/mmfs/bin/mmpmon', os.X_OK):
            try:
                self.stdin, self.stdout, self.stderr = dpopen('/usr/lpp/mmfs/bin/mmpmon -p -s')
                self.stdin.write('reset\n')
                readpipe(self.stdout)
            except IOError:
                raise Exception, 'Cannot interface with gpfs mmpmon binary'
            return True
        raise Exception, 'Needs GPFS mmpmon binary'

    def extract_vfs(self):
        'collect "vfs_s" counter values'
        self.stdin.write('vfs_s\n')
        l = []
        for line in readpipe(self.stdout):
            if not line: continue
            l += line.split()
        for i in range(11, len(l), 3):
            try:
                self.set2[self.vfs_keymap[l[i]]] += long(l[i+1])
            except KeyError:
                pass

    def extract_ioc(self):
        'collect "ioc_s" counter values'
        self.stdin.write('ioc_s\n')
        l = []
        for line in readpipe(self.stdout):
            if not line: continue
            l += line.split()
        for i in range(11, len(l), 3):
            try:
                self.set2[self.ioc_keymap[l[i]+'rd_']] += long(l[i+1])
            except KeyError:
                pass
            try:
                self.set2[self.ioc_keymap[l[i]+'wr_']] += long(l[i+2])
            except KeyError:
                pass

    def extract_vio(self):
        'collect "vio_s" counter values'
        self.stdin.write('vio_s\n')
        l = []
        for line in readpipe(self.stdout):
            if not line: continue
            l += line.split()
        for i in range(19, len(l), 2):
            try:
                if l[i] in self.vio_keymap:
                    self.set2[self.vio_keymap[l[i]]] += long(l[i+1])
            except KeyError:
                pass

    def extract_vflush(self):
        'collect "vflush_stat" counter values'
        self.stdin.write('vflush_stat\n')
        l = []
        for line in readpipe(self.stdout):
            if not line: continue
            l += line.split()
        for i in range(11, len(l), 2):
            try:
                if l[i] in self.vflush_keymap:
                    self.set2[self.vflush_keymap[l[i]]] += long(l[i+1])
            except KeyError:
                pass

    def extract_lroc(self):
        'collect "lroc_s" counter values'
        self.stdin.write('lroc_s\n')
        l = []
        for line in readpipe(self.stdout):
            if not line: continue
            l += line.split()
        for i in range(11, len(l), 2):
            try:
                if l[i] in self.lroc_keymap:
                    self.set2[self.lroc_keymap[l[i]]] += long(l[i+1])
            except KeyError:
                pass


    def extract(self):
        try:
            for name in self.vars:
                self.set2[name] = 0
            self.extract_ioc()
            self.extract_vfs()
            self.extract_vio()
            self.extract_vflush()
            self.extract_lroc()
            for name in self.varsrate:
                self.val[name] = (self.set2[name] - self.set1[name]) * 1.0 / elapsed
            for name in self.varsconst:
                self.val[name] = self.set2[name]
        except IOError, e:
            for name in self.vars: self.val[name] = -1
            ## print 'dstat_gpfs: lost pipe to mmpmon,', e
        except Exception, e:
            for name in self.vars: self.val[name] = -1
            print 'dstat_gpfs: exception', e

        if self.debug >= 0:
            self.debug -= 1

        if step == op.delay:
            self.set1.update(self.set2)

From ewahl at osc.edu  Thu Sep  4 15:13:48 2014
From: ewahl at osc.edu (Ed Wahl)
Date: Thu, 4 Sep 2014 14:13:48 +0000
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de>
References: <54074F90.7000303@ebi.ac.uk>,
	<1117005717.712028.1409825118676.open-xchange@oxbaltgw13.schlund.de>
Message-ID: <C59E5201836F7147BAD35189FFBB35D101164D83A1@USOAPP09V04P.si.lan>

Another known issue with slow "ls" can be the annoyance that is 'sssd' under newer OSs (rhel 6) and properly configuring this for remote auth.  I know on my nsd's I never did and the first ls in a directory where the cache is expired takes forever to make all the remote LDAP calls to get the UID info. bleh.

Ed

________________________________
From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of service at metamodul.com [service at metamodul.com]
Sent: Thursday, September 04, 2014 6:05 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] gpfs performance monitoring

> , any "ls" could take ages.

Check if you large directories either with many files or simply large.
Verify if you have NFS exported GPFS.
Verify that your cache settings on the clients are large enough ( maxStatCache , maxFilesToCache , sharedMemLimit )
Verify that you have dedicated metadata luns ( metadataOnly )

Reference:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters

Note:
If possible monitor your metadata luns on the storage directly.

hth
Hajo


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140904/129caa21/attachment.htm>

From sdinardo at ebi.ac.uk  Thu Sep  4 15:18:02 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Thu, 04 Sep 2014 15:18:02 +0100
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <540873AA.5070401@ed.ac.uk>
References: <54074F90.7000303@ebi.ac.uk>	<OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>	<540869DF.5060100@ebi.ac.uk>	<54086F1D.1000401@ed.ac.uk>
	<5408722E.6060309@ebi.ac.uk> <540873AA.5070401@ed.ac.uk>
Message-ID: <5408749A.9080306@ebi.ac.uk>


On 04/09/14 15:14, Orlando Richards wrote:
>
>
> On 04/09/14 15:07, Salvatore Di Nardo wrote:
>>
>> On 04/09/14 14:54, Orlando Richards wrote:
>>>
>>>
>>> On 04/09/14 14:32, Salvatore Di Nardo wrote:
>>>> Sorry to bother you again but dstat have some issues with the plugin:
>>>>
>>>>         [root at gss01a util]# dstat --gpfs
>>>>         /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is
>>>>         deprecated.  Use the subprocess module.
>>>>            pipes[cmd] = os.popen3(cmd, 't', 0)
>>>>         Module dstat_gpfs failed to load. (global name 'select' is not
>>>>         defined)
>>>>         None of the stats you selected are available.
>>>>
>>>> I found this solution , but involve dstat recompile....
>>>>
>>>> https://github.com/dagwieers/dstat/issues/44
>>>>
>>>> Are you aware about any easier solution (we use RHEL6.3) ?
>>>>
>>>
>>> This worked for me the other day on a dev box I was poking at:
>>>
>>> # rm /usr/share/dstat/dstat_gpfsops*
>>>
>>> # cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7
>>> /usr/share/dstat/dstat_gpfsops.py
>>>
>>> # dstat --gpfsops
>>> /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use
>>> the subprocess module.
>>>   pipes[cmd] = os.popen3(cmd, 't', 0)
>>> ---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- 
>>>
>>>
>>>   cr   del  op/cl   rd    wr  trunc fsync looku gattr sattr other
>>> mb_rd mb_wr  pref wrbeh steal clean  sync revok logwr logda oth_r oth_w
>>>    0     0     0     0     0     0     0     0     0     0 0     0
>>> 0     0     0     0     0     0     0     0     0 0     0
>>>
>>> ...
>>>
>>
>> NICE!! The only problem  is that the box seems lacking those python 
>> scripts:
>>
>>         ls /usr/lpp/mmfs/samples/util/
>>         makefile  README  tsbackup  tsbackup.C  tsbackup.h tsfindinode
>>         tsfindinode.c  tsgetusage  tsgetusage.c  tsinode tsinode.c
>>         tslistall  tsreaddir  tsreaddir.c  tstimes tstimes.c
>>
>
> It came from the gpfs.base rpm:
>
> # rpm -qf /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7
> gpfs.base-3.5.0-13.x86_64
>
>
>> Do you mind sending me those py files? They should be 3 as i see e gpfs
>> options:  gpfs, gpfs-ops, gpfsops  (dunno what are the differences )
>>
>
> Only the gpfsops.py is included in the bundle - one for dstat 0.6 and 
> one for dstat 0.7.
>
>
> I've attached it to this mail as well (it seems to be GPL'd).
>
Thanks.


From J.R.Jones at soton.ac.uk  Thu Sep  4 16:15:48 2014
From: J.R.Jones at soton.ac.uk (Jones J.R.)
Date: Thu, 4 Sep 2014 15:15:48 +0000
Subject: [gpfsug-discuss] Building the portability layer for Xeon Phi
Message-ID: <1409843748.7733.31.camel@uos-204812.clients.soton.ac.uk>

Hi folks

Has anyone managed to successfully build the portability layer for Xeon
Phi?  At the moment we are having to export the GPFS mounts from the
host machine over NFS, which is proving rather unreliable.

Jess

From oehmes at us.ibm.com  Fri Sep  5 01:48:40 2014
From: oehmes at us.ibm.com (Sven Oehme)
Date: Thu, 4 Sep 2014 17:48:40 -0700
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <54084258.90508@ebi.ac.uk>
References: <54074F90.7000303@ebi.ac.uk>	<OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>
	<54084258.90508@ebi.ac.uk>
Message-ID: <OF9ACC3F87.D8E14E88-ON88257D4A.0002DEA2-88257D4A.000474AF@us.ibm.com>

------------------------------------------
Sven Oehme 
Scalable Storage Research 
email: oehmes at us.ibm.com 
Phone: +1 (408) 824-8904 
IBM Almaden Research Lab 
------------------------------------------

gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM:

> From: Salvatore Di Nardo <sdinardo at ebi.ac.uk>
> To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
> Date: 09/04/2014 03:44 AM
> Subject: Re: [gpfsug-discuss] gpfs performance monitoring
> Sent by: gpfsug-discuss-bounces at gpfsug.org
> 
> On 04/09/14 01:50, Sven Oehme wrote:
> > Hello everybody,
> 
> Hi 
> 
> > here i come here again, this time to ask some hint about how to 
> monitor GPFS.
> > 
> > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is 
> > that they return number based only on the request done in the 
> > current host, so i have to run them on all the clients ( over 600 
> > nodes) so its quite unpractical.  Instead i would like to know from 
> > the servers whats going on, and i came across the vio_s statistics 
> > wich are less documented and i dont know exacly what they mean. 
> > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that
> > runs VIO_S.
> > 
> > My problems with the output of this command: 
> >  echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1
> > 
> > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second
> > timestamp:                          1409763206/477366
> > recovery group:                     *
> > declustered array:                  *
> > vdisk:                              *
> > client reads:                          2584229
> > client short writes:                  55299693
> > client medium writes:                   190071
> > client promoted full track writes:      465145
> > client full track writes:                 9249
> > flushed update writes:                 4187708
> > flushed promoted full track writes:        123
> > migrate operations:                        114
> > scrub operations:                       450590
> > log writes:                           28509602 
> > 
> > it sais "VIOPS per second", but they seem to me just counters as 
> > every time i re-run the command, the numbers increase by a bit..  
> > Can anyone confirm if those numbers are counter or if they are 
OPS/sec.
> 
> the numbers are accumulative so everytime you run them they just 
> show the value since start (or last reset) time. 
> OK, you confirmed my toughts, thatks

> 
> > 
> > On a closer eye about i dont understand what most of thosevalues 
> > mean. For example, what exacly are "flushed promoted full track write" 
?? 
> > I tried to find a documentation about this output , but could not 
> > find any. can anyone point me a link where output of vio_s is 
explained?
> > 
> > Another thing i dont understand about those numbers is if they are 
> > just operations, or the number of blocks that was read/write/etc . 
> 
> its just operations and if i would explain what the numbers mean i 
> might confuse you even more because this is not what you are really 
> looking for. 
> what you are looking for is what the client io's look like on the 
> Server side, while the VIO layer is the Server side to the disks, so
> one lever lower than what you are looking for from what i could read
> out of the description above.  
> No.. what I'm looking its exactly how the disks are busy to keep the
> requests. Obviously i'm not looking just that, but I feel the needs 
> to monitor also those things. Ill explain you why. 
> 
> It happens when our storage is quite busy ( 180Gb/s of read/write ) 
> that the FS start to be slowin normal cd or ls requests. This might 
> be normal, but in those situation i want to know where the 
> bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing
> where the bottlenek is might help me to understand if we can tweak 
> the system a bit more.

if cd or ls is very slow in GPFS in the  majority of the cases it has 
nothing to do with NSD Server bottlenecks, only indirect. 
the main reason ls is slow in the field is you have some very powerful 
nodes that all do buffered writes into the same directory into 1 or 
multiple files while you do the ls on a different node. what happens now 
is that the ls you did run most likely is a alias for ls -l or something 
even more complex with color display, etc, but the point is it most likely 
returns file size. GPFS doesn't lie about the filesize, we only return 
accurate stat informations and while this is arguable, its a fact today. 
so what happens is that the stat on each file triggers a token  revoke on 
the node that currently writing to the file you do stat on, lets say it 
has 1 gb of dirty data in its memory for this file (as its writes data 
buffered) this 1 GB of data now gets written to the NSD server, the client 
updates the inode info and returns the correct size. 
lets say you have very fast network and you have a fast storage device 
like GSS (which i see you have) it will be able to do this in a few 100 
ms, but the problem is this now happens serialized for each single file in 
this directory that people  write into as for each we need to get the 
exact stat info to satisfy your ls -l request. 
this is what takes so long, not the fact that the storage device might be 
slow or to much metadata activity is going on , this is token , means 
network traffic and obviously latency dependent. 

the best way to see this is to look at waiters on the client where you run 
the ls and see what they are waiting for.

there are various ways to tune this to get better 'felt' ls responses but 
its not completely going away 
if all you try to with ls is if there is a file in the directory run 
unalias ls and check if ls after that runs fast as it shouldn't do the -l 
under the cover anymore. 

> 
> If its the CPU on the servers then there is no much to do beside 
> replacing or add more servers.If its not the CPU, maybe more memory 
> would help? Maybe its just the network that filled up? so i can add 
> more links 
> 
> Or if we reached the point there the bottleneck its the spindles, 
> then there is no much point o look somethere else, we just reached 
> the hardware limit..
> 
> Sometimes, it also happens that there is very low IO (10Gb/s ), 
> almost no cpu usage on the servers but huge slownes ( ls can take 10
> seconds).  Why that happens? There is not much data ops , but we 
> think there is a huge ammount of metadata ops. So what i want to 
> know is if the metadata vdisks are busy or not. If this is our 
> problem, could some SSD disks dedicated to metadata help? 

the answer if ssd's would help or not are hard to say without knowing the 
root case and as i tried to explain above the most likely case is token 
revoke, not disk i/o. obviously as more busy your disks are as longer the 
token revoke will take. 

> 
> 
> In particular im, a bit puzzled with the design of our GSS storage.
> Each recovery groups have 3 declustered arrays, and each declustered
> aray have 1 data and 1 metadata vdisk, but in the end both metadata 
> and data vdisks use the same spindles. The problem that, its that I 
> dont understand if we have a metadata bottleneck there. Maybe some 
> SSD disks in a dedicated declustered array would perform much 
> better, but this is just theory. I really would like to be able to 
> monitor IO activities on the metadata vdisks.

the short answer is we WANT the metadata disks to be with the data disks 
on the same spindles. compared to other storage systems, GSS is capable to 
handle different raid codes for different virtual disks on the same 
physical disks, this way we create raid1'ish 'LUNS' for metadata and 
raid6'is 'LUNS' for data so the small i/o penalty for a metadata is very 
small compared to a read/modify/write on the data disks.

> 
> 

> 
> 
> so the Layer you care about is the NSD Server layer, which sits on 
> top of the VIO layer (which is essentially the SW RAID Layer in GNR) 
> 
> > I'm asking that because if they are just ops, i don't know how much 
> > they could be usefull. For example one write operation could eman 
> > write 1 block or write a file of 100GB. If those are oprations, 
> > there is a way to have the oupunt in bytes or blocks? 
> 
> there are multiple ways to get infos on the NSD layer, one would be 
> to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats 
> counts again. 
> 
> Counters its not a problem. I can collect them and create some 
> graphs in a monitoring tool. I will check that.

if you (let) upgrade your system to GSS 2.0 you get a graphical monitoring 
as part of it. if you want i can send you some direct email outside the 
group with additional informations on that. 

> 
> the alternative option is to use mmdiag --iohist. this shows you a 
> history of the last X numbers of io operations on either the client 
> or the server side like on a client : 
> 
> # mmdiag --iohist 
> 
> === mmdiag: iohist === 
> 
> I/O history: 
> 
>  I/O start time RW    Buf type disk:sectorNum     nSec  time ms 
> qTime ms       RpcTimes ms  Type  Device/NSD ID         NSD server 
> --------------- -- ----------- ----------------- -----  ------- 
> -------- -----------------  ---- ------------------ --------------- 
> 14:25:22.169617  R  LLIndBlock    1:1075622848      64   13.073   
>  0.000   12.959    0.063  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:22.182723  R       inode    1:1071252480       8    6.970   
>  0.000    6.908    0.038  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:53.659918  R  LLIndBlock    1:1081202176      64    8.309   
>  0.000    8.210    0.046  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:53.668262  R       inode    2:1081373696       8   14.117   
>  0.000   14.032    0.058  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:53.682750  R  LLIndBlock    1:1065508736      64    9.254   
>  0.000    9.180    0.038  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:53.692019  R       inode    2:1064356608       8   14.899   
>  0.000   14.847    0.029  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:53.707100  R       inode    2:1077830152       8   16.499   
>  0.000   16.449    0.025  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:53.723788  R  LLIndBlock    1:1081202432      64    4.280   
>  0.000    4.203    0.040  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:53.728082  R       inode    2:1081918976       8    7.760   
>  0.000    7.710    0.027  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:57.877416  R    metadata    2:678978560       16   13.343   
>  0.000   13.254    0.053  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:57.891048  R  LLIndBlock    1:1065508608      64   15.491   
>  0.000   15.401    0.058  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:57.906556  R       inode    2:1083476520       8   11.723   
>  0.000   11.676    0.029  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:57.918516  R  LLIndBlock    1:1075622720      64    8.062   
>  0.000    8.001    0.032  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:57.926592  R       inode    1:1076503480       8    8.087   
>  0.000    8.043    0.026  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:57.934856  R  LLIndBlock    1:1071088512      64    6.572   
>  0.000    6.510    0.033  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:57.941441  R       inode    2:1069885984       8   11.686   
>  0.000   11.641    0.024  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:57.953294  R       inode    2:1083476936       8    8.951   
>  0.000    8.912    0.021  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:57.965475  R       inode    1:1076503504       8    0.477   
>  0.000    0.053    0.000  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:57.965755  R       inode    2:1083476488       8    0.410   
>  0.000    0.061    0.321  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:57.965787  R       inode    2:1083476512       8    0.439   
>  0.000    0.053    0.342  cli   C0A70402:53BEEA5E     192.167.4.2 
> 
> you basically see if its a inode , data block , what size it has (in
> sectors) , which nsd server you did send this request to, etc. 
> 
> on the Server side you see the type , which physical disk it goes to
> and also what size of disk i/o it causes like :   
> 
> 14:26:50.129995  R       inode   12:3211886376      64   14.261   
>  0.000    0.000    0.000  pd   sdis 
> 14:26:50.137102  R       inode   19:3003969520      64    9.004   
>  0.000    0.000    0.000  pd   sdad 
> 14:26:50.136116  R       inode   55:3591710992      64   11.057   
>  0.000    0.000    0.000  pd   sdoh 
> 14:26:50.141510  R       inode   21:3066810504      64    5.909   
>  0.000    0.000    0.000  pd   sdaf 
> 14:26:50.130529  R       inode   89:2962370072      64   17.437   
>  0.000    0.000    0.000  pd   sddi 
> 14:26:50.131063  R       inode   78:1889457000      64   17.062   
>  0.000    0.000    0.000  pd   sdsj 
> 14:26:50.143403  R       inode   36:3323035688      64    4.807   
>  0.000    0.000    0.000  pd   sdmw 
> 14:26:50.131044  R       inode   37:2513579736     128   17.181   
>  0.000    0.000    0.000  pd   sddv 
> 14:26:50.138181  R       inode   72:3868810400      64   10.951   
>  0.000    0.000    0.000  pd   sdbz 
> 14:26:50.138188  R       inode  131:2443484784     128   11.792   
>  0.000    0.000    0.000  pd   sdug 
> 14:26:50.138003  R       inode  102:3696843872      64   11.994   
>  0.000    0.000    0.000  pd   sdgp 
> 14:26:50.137099  R       inode  145:3370922504      64   13.225   
>  0.000    0.000    0.000  pd   sdmi 
> 14:26:50.141576  R       inode   62:2668579904      64    9.313   
>  0.000    0.000    0.000  pd   sdou 
> 14:26:50.134689  R       inode  159:2786164648      64   16.577   
>  0.000    0.000    0.000  pd   sdpq 
> 14:26:50.145034  R       inode   34:2097217320      64    7.409   
>  0.000    0.000    0.000  pd   sdmt 
> 14:26:50.138140  R       inode  139:2831038792      64   14.898   
>  0.000    0.000    0.000  pd   sdlw 
> 14:26:50.130954  R       inode  164:282120312       64   22.274   
>  0.000    0.000    0.000  pd   sdzd 
> 14:26:50.137038  R       inode   41:3421909608      64   16.314   
>  0.000    0.000    0.000  pd   sdef 
> 14:26:50.137606  R       inode  104:1870962416      64   16.644   
>  0.000    0.000    0.000  pd   sdgx 
> 14:26:50.141306  R       inode   65:2276184264      64   16.593   
>  0.000    0.000    0.000  pd   sdrk 
> 

> 
> mmdiag --iohist its another think i looked at it, but i could not 
> find good explanation for all the "buf type" ( third column )

> allocSeg
> data
> iallocSeg
> indBlock
> inode
> LLIndBlock
> logData
> logDesc
> logWrap
> metadata
> vdiskAULog
> vdiskBuf
> vdiskFWLog
> vdiskMDLog
> vdiskMeta
> vdiskRGDesc
> If i want to monifor metadata operation whan should i look at? just 

inodes =inodes , *alloc* = file or data allocation blocks , *ind* = 
indirect blocks (for very large files) and metadata , everyhing else is 
data or internal i/o's 
 
> the metadata flag or also inode? this command takes also long to 
> run, especially if i run it a second time it hangs for a lot before 
> to rerun again, so i'm not sure that run it every 30secs or minute 
> its viable, but i will look also into that. THere is any 
> documentation that descibes clearly the whole output? what i found 
> its quite generic and don't go into details...

the reason it takes so long is because it collects 10's of thousands of 
i/os in a table and to not slow down the system when we dump the data we 
copy it to a separate buffer so we don't need locks :-) 
you can adjust the number of entries you want to collect by adjusting the 
ioHistorySize config parameter


> > 
> > Last but not least.. and this is what i really would like to 
> > accomplish, i would to be able to monitor the latency of metadata 
> operations. 
> 
> you can't do this on the server side as you don't know how much time
> you spend on the client , network or anything between the app and 
> the physical disk, so you can only reliably look at this from the 
> client, the iohist output only shows you the Server disk i/o 
> processing time, but that can be a fraction of the overall time (in 
> other cases this obviously can also be the dominant part depending 
> on your workload). 
> 
> the easiest way on the client is to run 
> 
> mmfsadm vfsstats enable 
> from now on vfs stats are collected until you restart GPFS. 
> 
> then run : 
> 
> vfs statistics currently enabled 
> started at: Fri Aug 29 13:15:05.380 2014 
>   duration: 448446.970 sec 
> 
>  name                    calls  time per call     total time 
>  -------------------- -------- -------------- -------------- 
>  statfs                      9       0.000002       0.000021 
>  startIO              246191176       0.005853 1441049.976740 
> 
> to dump what ever you collected so far on this node. 

> 
> We already do that, but as I said, I want to check specifically how 
> gss servers are keeping the requests to identify or exlude server 
> side bottlenecks.
> 
> 
> Thanks for your help, you gave me definitely few things where to look 
at.
> 
> Salvatore
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140904/a2f9be76/attachment.htm>

From oehmes at us.ibm.com  Fri Sep  5 01:53:17 2014
From: oehmes at us.ibm.com (Sven Oehme)
Date: Thu, 4 Sep 2014 17:53:17 -0700
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <5408722E.6060309@ebi.ac.uk>
References: <54074F90.7000303@ebi.ac.uk>	<OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>
	<540869DF.5060100@ebi.ac.uk>	<54086F1D.1000401@ed.ac.uk>
	<5408722E.6060309@ebi.ac.uk>
Message-ID: <OFE53FDAA6.B366A778-ON88257D4A.00049748-88257D4A.0004E08F@us.ibm.com>

if you don't have the files you need to update to a newer version of the 
GPFS client software on the node.
they started shipping with 3.5.0.13
even you get the files you still wouldn't see many values as they never 
got exposed before. 

some more details are in a presentation i gave earlier this year which is 
archived in the list or here --> 
http://www.gpfsug.org/wp-content/uploads/2014/05/UG10_GPFS_Performance_Session_v10.pdf

Sven


------------------------------------------
Sven Oehme 
Scalable Storage Research 
email: oehmes at us.ibm.com 
Phone: +1 (408) 824-8904 
IBM Almaden Research Lab 
------------------------------------------


From:   Salvatore Di Nardo <sdinardo at ebi.ac.uk>
To:     gpfsug-discuss at gpfsug.org
Date:   09/04/2014 07:08 AM
Subject:        Re: [gpfsug-discuss] gpfs performance monitoring
Sent by:        gpfsug-discuss-bounces at gpfsug.org


On 04/09/14 14:54, Orlando Richards wrote:


On 04/09/14 14:32, Salvatore Di Nardo wrote: 
Sorry to bother you again but dstat have some issues with the plugin: 

        [root at gss01a util]# dstat --gpfs 
        /usr/bin/dstat:1672: DeprecationWarning: os.popen3 is 
        deprecated.  Use the subprocess module. 
           pipes[cmd] = os.popen3(cmd, 't', 0) 
        Module dstat_gpfs failed to load. (global name 'select' is not 
        defined) 
        None of the stats you selected are available. 

I found this solution , but involve dstat recompile.... 

https://github.com/dagwieers/dstat/issues/44 

Are you aware about any easier solution (we use RHEL6.3) ? 


This worked for me the other day on a dev box I was poking at: 

# rm /usr/share/dstat/dstat_gpfsops* 

# cp /usr/lpp/mmfs/samples/util/dstat_gpfsops.py.dstat.0.7 
/usr/share/dstat/dstat_gpfsops.py 

# dstat --gpfsops 
/usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated.  Use the 
subprocess module. 
  pipes[cmd] = os.popen3(cmd, 't', 0) 
---------------------------gpfs-vfs-ops--------------------------#-----------------------------gpfs-disk-i/o----------------------------- 

  cr   del  op/cl   rd    wr  trunc fsync looku gattr sattr other mb_rd 
mb_wr  pref wrbeh steal clean  sync revok logwr logda oth_r oth_w 
   0     0     0     0     0     0     0     0     0     0     0     0     
0     0     0     0     0     0     0     0     0     0     0 

... 


NICE!! The only problem  is that the box seems lacking those python 
scripts:
ls /usr/lpp/mmfs/samples/util/
makefile  README  tsbackup  tsbackup.C  tsbackup.h  tsfindinode  
tsfindinode.c  tsgetusage  tsgetusage.c  tsinode  tsinode.c  tslistall  
tsreaddir  tsreaddir.c  tstimes  tstimes.c
Do you mind sending me those py files? They should be 3 as i see e gpfs 
options:  gpfs, gpfs-ops, gpfsops  (dunno what are the differences )

Regards,
Salvatore


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140904/609d1997/attachment.htm>

From jonathan at buzzard.me.uk  Fri Sep  5 10:29:27 2014
From: jonathan at buzzard.me.uk (Jonathan Buzzard)
Date: Fri, 05 Sep 2014 10:29:27 +0100
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <54084258.90508@ebi.ac.uk>
References: <54074F90.7000303@ebi.ac.uk>
	<OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>
	<54084258.90508@ebi.ac.uk>
Message-ID: <1409909367.30257.151.camel@buzzard.phy.strath.ac.uk>

On Thu, 2014-09-04 at 11:43 +0100, Salvatore Di Nardo wrote:

[SNIP]

> 
> Sometimes, it also happens that there is very low IO (10Gb/s ), almost
> no cpu usage on the servers but huge slownes ( ls can take 10
> seconds).  Why that happens? There is not much data ops , but we think
> there is a huge ammount of metadata ops. So what i want to know is if
> the metadata vdisks are busy or not. If this is our problem, could
> some SSD disks dedicated to metadata help? 
> 

This is almost always because you are using an external LDAP/NIS server
for GECOS information and the values that you need are not cached for
whatever reason and you are having to look them up again. Note that the
standard aliasing for RHEL based distros of ls also causes it to do a
stat on every file for the colouring etc. Also be aware that if you are
trying to fill out your cd with TAB auto-completion you will run into
similar issues. That is had you typed the path for the cd out in full
you would get in instantly, doing a couple of letters and hitting cd it
could take a while.

You can test this on a RHEL based distro by doing "/bin/ls -n" The idea
being to avoid any aliasing and not look up GECOS data and just report
the raw numerical stuff.

What I would suggest is that you set the cache time on UID/GID lookups
for positive lookups to a long time, in general as long as possible
because the values should almost never change. Even for a positive look
up of a group membership I would have that cached for a couple of hours.
For negative lookups something like five or 10 minutes is a good
starting point.


JAB.

-- 
Jonathan A. Buzzard                 Email: jonathan (at) buzzard.me.uk
Fife, United Kingdom.


From sdinardo at ebi.ac.uk  Fri Sep  5 11:56:37 2014
From: sdinardo at ebi.ac.uk (Salvatore Di Nardo)
Date: Fri, 05 Sep 2014 11:56:37 +0100
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <OF9ACC3F87.D8E14E88-ON88257D4A.0002DEA2-88257D4A.000474AF@us.ibm.com>
References: <54074F90.7000303@ebi.ac.uk>	<OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>	<54084258.90508@ebi.ac.uk>
	<OF9ACC3F87.D8E14E88-ON88257D4A.0002DEA2-88257D4A.000474AF@us.ibm.com>
Message-ID: <540996E5.5000502@ebi.ac.uk>

Little clarification:
Our ls its plain ls, there is no alias.

Consider that all those things are already set up properly as EBI run hi 
computing farms from many years, so those things are already fixed loong 
time ago. We have very little experience with GPFS, but good knowledge 
with LSF farms and own multiple NFS stotages ( several petabyte sized). 
about NIS, all clients run NSCD that cashes all  informations to avoid 
such tipe of slownes, in fact then ls isslow, also ls -n is slow.

Beside that,  also a "cd" sometimes hangs, so it have nothing to do with 
getting attributes.

Just to clarify a bit more. Now GSS usually seems working fine, we have 
users that run jobs on the farms that pushes 180Gb/s read ( reading and 
writing files of 100GB size). GPFS works very well there, where other 
systems had performance problems accessing portion of data in so huge files.

Sadly, on the other hand, other users run jobs that do suge ammount of 
metadata operations, like toons of ls in directory with many files, or 
creating a silly amount of temporary files just to synchronize the jobs 
between the farm nodes, or just to store temporary data for few 
milliseconds and them immediately delete those temporary files. Imagine 
to create constantly thousands files just to write few bytes and they 
delete them after few milliseconds...

When those thing happens we see 10-15Gb/sec throughput, low CPU usage on 
the server ( 80% iddle), but any cd, or ls or wathever takes few 
seconds. So my question is, if the bottleneck could be the spindles, or 
if the clients could be tuned a bit more?

I read your PDF and all the paramenters seems already well configured  
except "maxFilesToCache", but  I'm not sure how we should configure few 
of those  parameters on the clients. As an example I cannot immagine a 
client that require 38g pagepool size.

so what's the correct *pagepool* on a client? what about those others?

*maxFilestoCache**
**maxBufferdescs**
**worker1threads**
**worker3threads*

Right now all the clients have 1 GB pagepool size. In theory, we can 
afford to use more ( i thing we can easily go up to 8GB) as they have 
plenty or available memory. If this could help, we can do that, but the 
client really really need more than 1G? They are just clients after all, 
so the memory in theory should be used for jobs not just for "caching".

Last question about "maxFIlesToCache" you say that must be large on 
small cluster but small on large clusters. What do you consider 6 
servers and  almost 700 clients?

on clienst we have:
    maxFilesToCache 4000

on servers we have
   maxFilesToCache 12288


Regards,
Salvatore
On 05/09/14 01:48, Sven Oehme wrote:
> ------------------------------------------
> Sven Oehme
> Scalable Storage Research
> email: oehmes at us.ibm.com
> Phone: +1 (408) 824-8904
> IBM Almaden Research Lab
> ------------------------------------------
>
> gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM:
>
> > From: Salvatore Di Nardo <sdinardo at ebi.ac.uk>
> > To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
> > Date: 09/04/2014 03:44 AM
> > Subject: Re: [gpfsug-discuss] gpfs performance monitoring
> > Sent by: gpfsug-discuss-bounces at gpfsug.org
> >
> > On 04/09/14 01:50, Sven Oehme wrote:
> > > Hello everybody,
> >
> > Hi
> >
> > > here i come here again, this time to ask some hint about how to
> > monitor GPFS.
> > >
> > > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is
> > > that they return number based only on the request done in the
> > > current host, so i have to run them on all the clients ( over 600
> > > nodes) so its quite unpractical.  Instead i would like to know from
> > > the servers whats going on, and i came across the vio_s statistics
> > > wich are less documented and i dont know exacly what they mean.
> > > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that
> > > runs VIO_S.
> > >
> > > My problems with the output of this command:
> > >  echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1
> > >
> > > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second
> > > timestamp: 1409763206/477366
> > > recovery group: *
> > > declustered array: *
> > > vdisk: *
> > > client reads: 2584229
> > > client short writes: 55299693
> > > client medium writes: 190071
> > > client promoted full track writes: 465145
> > > client full track writes: 9249
> > > flushed update writes: 4187708
> > > flushed promoted full track writes: 123
> > > migrate operations: 114
> > > scrub operations: 450590
> > > log writes: 28509602
> > >
> > > it sais "VIOPS per second", but they seem to me just counters as
> > > every time i re-run the command, the numbers increase by a bit..
> > > Can anyone confirm if those numbers are counter or if they are 
> OPS/sec.
> >
> > the numbers are accumulative so everytime you run them they just
> > show the value since start (or last reset) time.
> > OK, you confirmed my toughts, thatks
>
> >
> > >
> > > On a closer eye about i dont understand what most of thosevalues
> > > mean. For example, what exacly are "flushed promoted full track 
> write" ??
> > > I tried to find a documentation about this output , but could not
> > > find any. can anyone point me a link where output of vio_s is 
> explained?
> > >
> > > Another thing i dont understand about those numbers is if they are
> > > just operations, or the number of blocks that was read/write/etc .
> >
> > its just operations and if i would explain what the numbers mean i
> > might confuse you even more because this is not what you are really
> > looking for.
> > what you are looking for is what the client io's look like on the
> > Server side, while the VIO layer is the Server side to the disks, so
> > one lever lower than what you are looking for from what i could read
> > out of the description above.
> > No.. what I'm looking its exactly how the disks are busy to keep the
> > requests. Obviously i'm not looking just that, but I feel the needs
> > to monitor also those things. Ill explain you why.
> >
> > It happens when our storage is quite busy ( 180Gb/s of read/write )
> > that the FS start to be slowin normal cd or ls requests. This might
> > be normal, but in those situation i want to know where the
> > bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing
> > where the bottlenek is might help me to understand if we can tweak
> > the system a bit more.
>
> if cd or ls is very slow in GPFS in the  majority of the cases it has 
> nothing to do with NSD Server bottlenecks, only indirect.
> the main reason ls is slow in the field is you have some very powerful 
> nodes that all do buffered writes into the same directory into 1 or 
> multiple files while you do the ls on a different node. what happens 
> now is that the ls you did run most likely is a alias for ls -l or 
> something even more complex with color display, etc, but the point is 
> it most likely returns file size. GPFS doesn't lie about the filesize, 
> we only return accurate stat informations and while this is arguable, 
> its a fact today.
> so what happens is that the stat on each file triggers a token  revoke 
> on the node that currently writing to the file you do stat on, lets 
> say it has 1 gb of dirty data in its memory for this file (as its 
> writes data buffered) this 1 GB of data now gets written to the NSD 
> server, the client updates the inode info and returns the correct size.
> lets say you have very fast network and you have a fast storage device 
> like GSS (which i see you have) it will be able to do this in a few 
> 100 ms, but the problem is this now happens serialized for each single 
> file in this directory that people  write into as for each we need to 
> get the exact stat info to satisfy your ls -l request.
> this is what takes so long, not the fact that the storage device might 
> be slow or to much metadata activity is going on , this is token , 
> means network traffic and obviously latency dependent.
>
> the best way to see this is to look at waiters on the client where you 
> run the ls and see what they are waiting for.
>
> there are various ways to tune this to get better 'felt' ls responses 
> but its not completely going away
> if all you try to with ls is if there is a file in the directory run 
> unalias ls and check if ls after that runs fast as it shouldn't do the 
> -l under the cover anymore.
>
> >
> > If its the CPU on the servers then there is no much to do beside
> > replacing or add more servers.If its not the CPU, maybe more memory
> > would help? Maybe its just the network that filled up? so i can add
> > more links
> >
> > Or if we reached the point there the bottleneck its the spindles,
> > then there is no much point o look somethere else, we just reached
> > the hardware limit..
> >
> > Sometimes, it also happens that there is very low IO (10Gb/s ),
> > almost no cpu usage on the servers but huge slownes ( ls can take 10
> > seconds).  Why that happens? There is not much data ops , but we
> > think there is a huge ammount of metadata ops. So what i want to
> > know is if the metadata vdisks are busy or not. If this is our
> > problem, could some SSD disks dedicated to metadata help?
>
> the answer if ssd's would help or not are hard to say without knowing 
> the root case and as i tried to explain above the most likely case is 
> token revoke, not disk i/o. obviously as more busy your disks are as 
> longer the token revoke will take.
>
> >
> >
> > In particular im, a bit puzzled with the design of our GSS storage.
> > Each recovery groups have 3 declustered arrays, and each declustered
> > aray have 1 data and 1 metadata vdisk, but in the end both metadata
> > and data vdisks use the same spindles. The problem that, its that I
> > dont understand if we have a metadata bottleneck there. Maybe some
> > SSD disks in a dedicated declustered array would perform much
> > better, but this is just theory. I really would like to be able to
> > monitor IO activities on the metadata vdisks.
>
> the short answer is we WANT the metadata disks to be with the data 
> disks on the same spindles. compared to other storage systems, GSS is 
> capable to handle different raid codes for different virtual disks on 
> the same physical disks, this way we create raid1'ish 'LUNS' for 
> metadata and raid6'is 'LUNS' for data so the small i/o penalty for a 
> metadata is very small compared to a read/modify/write on the data disks.
>
> >
> >
>
> >
> >
> > so the Layer you care about is the NSD Server layer, which sits on
> > top of the VIO layer (which is essentially the SW RAID Layer in GNR)
> >
> > > I'm asking that because if they are just ops, i don't know how much
> > > they could be usefull. For example one write operation could eman
> > > write 1 block or write a file of 100GB. If those are oprations,
> > > there is a way to have the oupunt in bytes or blocks?
> >
> > there are multiple ways to get infos on the NSD layer, one would be
> > to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats
> > counts again.
> >
> > Counters its not a problem. I can collect them and create some
> > graphs in a monitoring tool. I will check that.
>
> if you (let) upgrade your system to GSS 2.0 you get a graphical 
> monitoring as part of it. if you want i can send you some direct email 
> outside the group with additional informations on that.
>
> >
> > the alternative option is to use mmdiag --iohist. this shows you a
> > history of the last X numbers of io operations on either the client
> > or the server side like on a client :
> >
> > # mmdiag --iohist
> >
> > === mmdiag: iohist ===
> >
> > I/O history:
> >
> >  I/O start time RW    Buf type disk:sectorNum   nSec  time ms
> > qTime ms       RpcTimes ms  Type  Device/NSD ID         NSD server
> > --------------- -- ----------- ----------------- -----  -------
> > -------- -----------------  ---- ------------------ ---------------
> > 14:25:22.169617  R  LLIndBlock    1:1075622848      64   13.073
> >  0.000   12.959    0.063  cli   C0A70401:53BEEA7F     192.167.4.1
> > 14:25:22.182723  R       inode    1:1071252480       8    6.970
> >  0.000    6.908    0.038  cli C0A70401:53BEEA7F     192.167.4.1
> > 14:25:53.659918  R  LLIndBlock    1:1081202176      64    8.309
> >  0.000    8.210    0.046  cli C0A70401:53BEEA7F     192.167.4.1
> > 14:25:53.668262  R       inode    2:1081373696       8   14.117
> >  0.000   14.032    0.058  cli   C0A70402:53BEEA5E     192.167.4.2
> > 14:25:53.682750  R  LLIndBlock    1:1065508736      64    9.254
> >  0.000    9.180    0.038  cli C0A70401:53BEEA7F     192.167.4.1
> > 14:25:53.692019  R       inode    2:1064356608       8   14.899
> >  0.000   14.847    0.029  cli   C0A70402:53BEEA5E     192.167.4.2
> > 14:25:53.707100  R       inode    2:1077830152       8   16.499
> >  0.000   16.449    0.025  cli   C0A70402:53BEEA5E     192.167.4.2
> > 14:25:53.723788  R  LLIndBlock    1:1081202432      64    4.280
> >  0.000    4.203    0.040  cli C0A70401:53BEEA7F     192.167.4.1
> > 14:25:53.728082  R       inode    2:1081918976       8    7.760
> >  0.000    7.710    0.027  cli C0A70402:53BEEA5E     192.167.4.2
> > 14:25:57.877416  R    metadata    2:678978560       16   13.343
> >  0.000   13.254    0.053  cli   C0A70402:53BEEA5E     192.167.4.2
> > 14:25:57.891048  R  LLIndBlock    1:1065508608      64   15.491
> >  0.000   15.401    0.058  cli   C0A70401:53BEEA7F     192.167.4.1
> > 14:25:57.906556  R       inode    2:1083476520       8   11.723
> >  0.000   11.676    0.029  cli   C0A70402:53BEEA5E     192.167.4.2
> > 14:25:57.918516  R  LLIndBlock    1:1075622720      64    8.062
> >  0.000    8.001    0.032  cli C0A70401:53BEEA7F     192.167.4.1
> > 14:25:57.926592  R       inode    1:1076503480       8    8.087
> >  0.000    8.043    0.026  cli C0A70401:53BEEA7F     192.167.4.1
> > 14:25:57.934856  R  LLIndBlock    1:1071088512      64    6.572
> >  0.000    6.510    0.033  cli C0A70401:53BEEA7F     192.167.4.1
> > 14:25:57.941441  R       inode    2:1069885984       8   11.686
> >  0.000   11.641    0.024  cli   C0A70402:53BEEA5E     192.167.4.2
> > 14:25:57.953294  R       inode    2:1083476936       8    8.951
> >  0.000    8.912    0.021  cli C0A70402:53BEEA5E     192.167.4.2
> > 14:25:57.965475  R       inode    1:1076503504       8    0.477
> >  0.000    0.053    0.000  cli C0A70401:53BEEA7F     192.167.4.1
> > 14:25:57.965755  R       inode    2:1083476488       8    0.410
> >  0.000    0.061    0.321  cli C0A70402:53BEEA5E     192.167.4.2
> > 14:25:57.965787  R       inode    2:1083476512       8    0.439
> >  0.000    0.053    0.342  cli C0A70402:53BEEA5E     192.167.4.2
> >
> > you basically see if its a inode , data block , what size it has (in
> > sectors) , which nsd server you did send this request to, etc.
> >
> > on the Server side you see the type , which physical disk it goes to
> > and also what size of disk i/o it causes like :
> >
> > 14:26:50.129995  R       inode   12:3211886376      64   14.261
> >  0.000    0.000    0.000  pd sdis
> > 14:26:50.137102  R       inode   19:3003969520      64    9.004
> >  0.000    0.000    0.000  pd sdad
> > 14:26:50.136116  R       inode   55:3591710992      64   11.057
> >  0.000    0.000    0.000  pd sdoh
> > 14:26:50.141510  R       inode   21:3066810504      64    5.909
> >  0.000    0.000    0.000  pd sdaf
> > 14:26:50.130529  R       inode   89:2962370072      64   17.437
> >  0.000    0.000    0.000  pd sddi
> > 14:26:50.131063  R       inode   78:1889457000      64   17.062
> >  0.000    0.000    0.000  pd sdsj
> > 14:26:50.143403  R       inode   36:3323035688      64    4.807
> >  0.000    0.000    0.000  pd sdmw
> > 14:26:50.131044  R       inode   37:2513579736     128   17.181
> >  0.000    0.000    0.000  pd sddv
> > 14:26:50.138181  R       inode   72:3868810400      64   10.951
> >  0.000    0.000    0.000  pd sdbz
> > 14:26:50.138188  R       inode  131:2443484784     128   11.792
> >  0.000    0.000    0.000  pd sdug
> > 14:26:50.138003  R       inode  102:3696843872      64   11.994
> >  0.000    0.000    0.000  pd sdgp
> > 14:26:50.137099  R       inode  145:3370922504      64   13.225
> >  0.000    0.000    0.000  pd sdmi
> > 14:26:50.141576  R       inode   62:2668579904      64    9.313
> >  0.000    0.000    0.000  pd sdou
> > 14:26:50.134689  R       inode  159:2786164648      64   16.577
> >  0.000    0.000    0.000  pd sdpq
> > 14:26:50.145034  R       inode   34:2097217320      64    7.409
> >  0.000    0.000    0.000  pd sdmt
> > 14:26:50.138140  R       inode  139:2831038792      64   14.898
> >  0.000    0.000    0.000  pd sdlw
> > 14:26:50.130954  R       inode  164:282120312       64   22.274
> >  0.000    0.000    0.000  pd sdzd
> > 14:26:50.137038  R       inode   41:3421909608      64   16.314
> >  0.000    0.000    0.000  pd sdef
> > 14:26:50.137606  R       inode  104:1870962416      64   16.644
> >  0.000    0.000    0.000  pd sdgx
> > 14:26:50.141306  R       inode   65:2276184264      64   16.593
> >  0.000    0.000    0.000  pd sdrk
> >
>
> >
> > mmdiag --iohist its another think i looked at it, but i could not
> > find good explanation for all the "buf type" ( third column )
>
> > allocSeg
> > data
> > iallocSeg
> > indBlock
> > inode
> > LLIndBlock
> > logData
> > logDesc
> > logWrap
> > metadata
> > vdiskAULog
> > vdiskBuf
> > vdiskFWLog
> > vdiskMDLog
> > vdiskMeta
> > vdiskRGDesc
> > If i want to monifor metadata operation whan should i look at? just
>
> inodes =inodes , *alloc* = file or data allocation blocks , *ind* = 
> indirect blocks (for very large files) and metadata , everyhing else 
> is data or internal i/o's
>
> > the metadata flag or also inode? this command takes also long to
> > run, especially if i run it a second time it hangs for a lot before
> > to rerun again, so i'm not sure that run it every 30secs or minute
> > its viable, but i will look also into that. THere is any
> > documentation that descibes clearly the whole output? what i found
> > its quite generic and don't go into details...
>
> the reason it takes so long is because it collects 10's of thousands 
> of i/os in a table and to not slow down the system when we dump the 
> data we copy it to a separate buffer so we don't need locks :-)
> you can adjust the number of entries you want to collect by adjusting 
> the ioHistorySize config parameter
>
>
> > >
> > > Last but not least.. and this is what i really would like to
> > > accomplish, i would to be able to monitor the latency of metadata
> > operations.
> >
> > you can't do this on the server side as you don't know how much time
> > you spend on the client , network or anything between the app and
> > the physical disk, so you can only reliably look at this from the
> > client, the iohist output only shows you the Server disk i/o
> > processing time, but that can be a fraction of the overall time (in
> > other cases this obviously can also be the dominant part depending
> > on your workload).
> >
> > the easiest way on the client is to run
> >
> > mmfsadm vfsstats enable
> > from now on vfs stats are collected until you restart GPFS.
> >
> > then run :
> >
> > vfs statistics currently enabled
> > started at: Fri Aug 29 13:15:05.380 2014
> >   duration: 448446.970 sec
> >
> >  name    calls  time per call     total time
> >  -------------------- -------- -------------- --------------
> >  statfs      9       0.000002 0.000021
> >  startIO              246191176       0.005853 1441049.976740
> >
> > to dump what ever you collected so far on this node.
>
> >
> > We already do that, but as I said, I want to check specifically how
> > gss servers are keeping the requests to identify or exlude server
> > side bottlenecks.
> >
> >
> > Thanks for your help, you gave me definitely few things where to 
> look at.
> >
> > Salvatore
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at gpfsug.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140905/0a7459f6/attachment.htm>

From chekh at stanford.edu  Fri Sep  5 22:17:47 2014
From: chekh at stanford.edu (Alex Chekholko)
Date: Fri, 05 Sep 2014 14:17:47 -0700
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <540996E5.5000502@ebi.ac.uk>
References: <54074F90.7000303@ebi.ac.uk>	<OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>	<54084258.90508@ebi.ac.uk>	<OF9ACC3F87.D8E14E88-ON88257D4A.0002DEA2-88257D4A.000474AF@us.ibm.com>
	<540996E5.5000502@ebi.ac.uk>
Message-ID: <540A287B.1050202@stanford.edu>

On 9/5/14, 3:56 AM, Salvatore Di Nardo wrote:
> Little clarification:
> Our ls its plain ls, there is no alias.
...
> Last question about "maxFIlesToCache" you say that must be large on
> small cluster but small on large clusters. What do you consider 6
> servers and  almost 700 clients?
>
> on clienst we have:
>     maxFilesToCache 4000
>
> on servers we have
>    maxFilesToCache 12288
>
>

One thing to do is to try your 'ls', see it is slow, then immediately 
run it again.  If it is fast the second and consecutive times, it's 
because now the stat info is coming out of local cache.

e.g. /usr/bin/time ls /path/to/some/dir && /usr/bin/time ls 
/path/to/some/dir

The second time is likely to be almost immediate.  So long as your local 
cache is big enough.

I see on one of our older clusters we have:
tokenMemLimit 2G
maxFilesToCache 40000
maxStatCache 80000

You can also interrogate the local cache to see how full it is.

Of course, if many nodes are writing to same dirs, then the cache will 
need to be invalidated often which causes some overhead.  Big local 
cache is good if clients are usually working in different directories.

Regards,
-- 
chekh at stanford.edu


From oehmes at us.ibm.com  Sat Sep  6 01:12:42 2014
From: oehmes at us.ibm.com (Sven Oehme)
Date: Fri, 5 Sep 2014 17:12:42 -0700
Subject: [gpfsug-discuss] gpfs performance monitoring
In-Reply-To: <540996E5.5000502@ebi.ac.uk>
References: <54074F90.7000303@ebi.ac.uk>	<OFB11720D9.723B90AC-ON88257D49.00032BBC-88257D49.00049D93@us.ibm.com>
	<54084258.90508@ebi.ac.uk>	<OF9ACC3F87.D8E14E88-ON88257D4A.0002DEA2-88257D4A.000474AF@us.ibm.com>
	<540996E5.5000502@ebi.ac.uk>
Message-ID: <OF2C5947FD.D235ACC4-ON88257D4B.0000CF86-88257D4B.000129FD@us.ibm.com>

on your GSS nodes you have tuning files we suggest customers to use for 
mixed workloads clients.

the files in  /usr/lpp/mmfs/samples/gss/

if you create a nodeclass for all your clients you can run 
/usr/lpp/mmfs/samples/gss/gssClientConfig.sh NODECLASS and it applies all 
the settings to them so they will be active on next restart of the gpfs 
daemon. 

this should be a very good starting point for your config. please try that 
and let me know if it doesn't.
there are also several enhancements in GPFS 4.1 which reduce contention in 
multiple areas, which would help as well, if you have the choice to update 
the nodes.

btw. the GSS 2.0 package will update your GSS nodes to 4.1 also

Sven

------------------------------------------
Sven Oehme 
Scalable Storage Research 
email: oehmes at us.ibm.com 
Phone: +1 (408) 824-8904 
IBM Almaden Research Lab 
------------------------------------------


From:   Salvatore Di Nardo <sdinardo at ebi.ac.uk>
To:     gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
Date:   09/05/2014 03:57 AM
Subject:        Re: [gpfsug-discuss] gpfs performance monitoring
Sent by:        gpfsug-discuss-bounces at gpfsug.org


Little clarification:
Our ls its plain ls, there is no alias. 

Consider that all those things are already set up properly as EBI run hi 
computing farms from many years, so those things are already fixed loong 
time ago. We have very little experience with GPFS, but good knowledge 
with LSF farms and own multiple NFS stotages ( several petabyte sized). 
about NIS, all clients run NSCD that cashes all  informations to avoid 
such tipe of slownes, in fact then ls isslow, also ls -n is slow.

Beside that,  also a "cd" sometimes hangs, so it have nothing to do with 
getting attributes.

Just to clarify a bit more. Now GSS usually seems working fine, we have 
users that run jobs on the farms that pushes 180Gb/s read ( reading and 
writing files of 100GB size). GPFS works very well there, where other 
systems had performance problems accessing portion of data in so huge 
files.

Sadly, on the other hand, other users run jobs that do suge ammount of 
metadata operations, like toons of ls in directory with many files, or 
creating a silly amount of temporary files just to synchronize the jobs 
between the farm nodes, or just to store temporary data for few 
milliseconds and them immediately delete those temporary files. Imagine to 
create constantly thousands files just to write few bytes and they delete 
them after few milliseconds...

When those thing happens we see 10-15Gb/sec throughput, low CPU usage on 
the server ( 80% iddle), but any cd, or ls or wathever takes few seconds. 
So my question is, if the bottleneck could be the spindles, or if the 
clients could be tuned a bit more?

I read your PDF and all the paramenters seems already well configured  
except "maxFilesToCache", but  I'm not sure how we should configure few of 
those  parameters on the clients. As an example I cannot immagine a client 
that require 38g pagepool size.

so what's the correct pagepool on a client? what about those others?

maxFilestoCache
maxBufferdescs
worker1threads
worker3threads

Right now all the clients have 1 GB pagepool size. In theory, we can 
afford to use more ( i thing we can easily go up to 8GB) as they have 
plenty or available memory. If this could help, we can do that, but the 
client really really need more than 1G? They are just clients after all, 
so the memory in theory should be used for jobs not just for "caching".

Last question about "maxFIlesToCache" you say that must be large on small 
cluster but small on large clusters. What do you consider 6 servers and  
almost 700 clients?

on clienst we have:
   maxFilesToCache 4000

on servers we have
  maxFilesToCache 12288


Regards,
Salvatore
On 05/09/14 01:48, Sven Oehme wrote:
------------------------------------------
Sven Oehme 
Scalable Storage Research 
email: oehmes at us.ibm.com 
Phone: +1 (408) 824-8904 
IBM Almaden Research Lab 
------------------------------------------ 

gpfsug-discuss-bounces at gpfsug.org wrote on 09/04/2014 03:43:36 AM:

> From: Salvatore Di Nardo <sdinardo at ebi.ac.uk> 
> To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org> 
> Date: 09/04/2014 03:44 AM 
> Subject: Re: [gpfsug-discuss] gpfs performance monitoring 
> Sent by: gpfsug-discuss-bounces at gpfsug.org 
> 
> On 04/09/14 01:50, Sven Oehme wrote: 
> > Hello everybody,
> 
> Hi 
> 
> > here i come here again, this time to ask some hint about how to 
> monitor GPFS.
> > 
> > I know about mmpmon, but the issue with its "fs_io_s" and "io_s" is 
> > that they return number based only on the request done in the 
> > current host, so i have to run them on all the clients ( over 600 
> > nodes) so its quite unpractical.  Instead i would like to know from 
> > the servers whats going on, and i came across the vio_s statistics 
> > wich are less documented and i dont know exacly what they mean. 
> > There is also this script "/usr/lpp/mmfs/samples/vdisk/viostat" that
> > runs VIO_S.
> > 
> > My problems with the output of this command: 
> >  echo "vio_s" | /usr/lpp/mmfs/bin/mmpmon -r 1
> > 
> > mmpmon> mmpmon node 10.7.28.2 name gss01a vio_s OK VIOPS per second
> > timestamp:                          1409763206/477366
> > recovery group:                     *
> > declustered array:                  *
> > vdisk:                              *
> > client reads:                          2584229
> > client short writes:                  55299693
> > client medium writes:                   190071
> > client promoted full track writes:      465145
> > client full track writes:                 9249
> > flushed update writes:                 4187708
> > flushed promoted full track writes:        123
> > migrate operations:                        114
> > scrub operations:                       450590
> > log writes:                           28509602 
> > 
> > it sais "VIOPS per second", but they seem to me just counters as 
> > every time i re-run the command, the numbers increase by a bit..  
> > Can anyone confirm if those numbers are counter or if they are 
OPS/sec.
> 
> the numbers are accumulative so everytime you run them they just 
> show the value since start (or last reset) time. 
> OK, you confirmed my toughts, thatks

> 
> > 
> > On a closer eye about i dont understand what most of thosevalues 
> > mean. For example, what exacly are "flushed promoted full track write" 
?? 
> > I tried to find a documentation about this output , but could not 
> > find any. can anyone point me a link where output of vio_s is 
explained?
> > 
> > Another thing i dont understand about those numbers is if they are 
> > just operations, or the number of blocks that was read/write/etc . 
> 
> its just operations and if i would explain what the numbers mean i 
> might confuse you even more because this is not what you are really 
> looking for. 
> what you are looking for is what the client io's look like on the 
> Server side, while the VIO layer is the Server side to the disks, so
> one lever lower than what you are looking for from what i could read
> out of the description above.  
> No.. what I'm looking its exactly how the disks are busy to keep the
> requests. Obviously i'm not looking just that, but I feel the needs 
> to monitor also those things. Ill explain you why. 
> 
> It happens when our storage is quite busy ( 180Gb/s of read/write ) 
> that the FS start to be slowin normal cd or ls requests. This might 
> be normal, but in those situation i want to know where the 
> bottleneck is. Is the server CPU? Memory? Network? Spindles? knowing
> where the bottlenek is might help me to understand if we can tweak 
> the system a bit more.

if cd or ls is very slow in GPFS in the  majority of the cases it has 
nothing to do with NSD Server bottlenecks, only indirect. 
the main reason ls is slow in the field is you have some very powerful 
nodes that all do buffered writes into the same directory into 1 or 
multiple files while you do the ls on a different node. what happens now 
is that the ls you did run most likely is a alias for ls -l or something 
even more complex with color display, etc, but the point is it most likely 
returns file size. GPFS doesn't lie about the filesize, we only return 
accurate stat informations and while this is arguable, its a fact today. 
so what happens is that the stat on each file triggers a token  revoke on 
the node that currently writing to the file you do stat on, lets say it 
has 1 gb of dirty data in its memory for this file (as its writes data 
buffered) this 1 GB of data now gets written to the NSD server, the client 
updates the inode info and returns the correct size. 
lets say you have very fast network and you have a fast storage device 
like GSS (which i see you have) it will be able to do this in a few 100 
ms, but the problem is this now happens serialized for each single file in 
this directory that people  write into as for each we need to get the 
exact stat info to satisfy your ls -l request. 
this is what takes so long, not the fact that the storage device might be 
slow or to much metadata activity is going on , this is token , means 
network traffic and obviously latency dependent. 

the best way to see this is to look at waiters on the client where you run 
the ls and see what they are waiting for. 

there are various ways to tune this to get better 'felt' ls responses but 
its not completely going away 
if all you try to with ls is if there is a file in the directory run 
unalias ls and check if ls after that runs fast as it shouldn't do the -l 
under the cover anymore. 

> 
> If its the CPU on the servers then there is no much to do beside 
> replacing or add more servers.If its not the CPU, maybe more memory 
> would help? Maybe its just the network that filled up? so i can add 
> more links 
> 
> Or if we reached the point there the bottleneck its the spindles, 
> then there is no much point o look somethere else, we just reached 
> the hardware limit..
> 
> Sometimes, it also happens that there is very low IO (10Gb/s ), 
> almost no cpu usage on the servers but huge slownes ( ls can take 10
> seconds).  Why that happens? There is not much data ops , but we 
> think there is a huge ammount of metadata ops. So what i want to 
> know is if the metadata vdisks are busy or not. If this is our 
> problem, could some SSD disks dedicated to metadata help? 

the answer if ssd's would help or not are hard to say without knowing the 
root case and as i tried to explain above the most likely case is token 
revoke, not disk i/o. obviously as more busy your disks are as longer the 
token revoke will take.   

> 
> 
> In particular im, a bit puzzled with the design of our GSS storage.
> Each recovery groups have 3 declustered arrays, and each declustered
> aray have 1 data and 1 metadata vdisk, but in the end both metadata 
> and data vdisks use the same spindles. The problem that, its that I 
> dont understand if we have a metadata bottleneck there. Maybe some 
> SSD disks in a dedicated declustered array would perform much 
> better, but this is just theory. I really would like to be able to 
> monitor IO activities on the metadata vdisks.

the short answer is we WANT the metadata disks to be with the data disks 
on the same spindles. compared to other storage systems, GSS is capable to 
handle different raid codes for different virtual disks on the same 
physical disks, this way we create raid1'ish 'LUNS' for metadata and 
raid6'is 'LUNS' for data so the small i/o penalty for a metadata is very 
small compared to a read/modify/write on the data disks. 

> 
> 

> 
> 
> so the Layer you care about is the NSD Server layer, which sits on 
> top of the VIO layer (which is essentially the SW RAID Layer in GNR) 
> 
> > I'm asking that because if they are just ops, i don't know how much 
> > they could be usefull. For example one write operation could eman 
> > write 1 block or write a file of 100GB. If those are oprations, 
> > there is a way to have the oupunt in bytes or blocks? 
> 
> there are multiple ways to get infos on the NSD layer, one would be 
> to use the dstat plugin (see /usr/lpp/mmfs/sample/util) but thats 
> counts again. 
> 
> Counters its not a problem. I can collect them and create some 
> graphs in a monitoring tool. I will check that.

if you (let) upgrade your system to GSS 2.0 you get a graphical monitoring 
as part of it. if you want i can send you some direct email outside the 
group with additional informations on that. 

> 
> the alternative option is to use mmdiag --iohist. this shows you a 
> history of the last X numbers of io operations on either the client 
> or the server side like on a client : 
> 
> # mmdiag --iohist 
> 
> === mmdiag: iohist === 
> 
> I/O history: 
> 
>  I/O start time RW    Buf type disk:sectorNum     nSec  time ms 
> qTime ms       RpcTimes ms  Type  Device/NSD ID         NSD server 
> --------------- -- ----------- ----------------- -----  ------- 
> -------- -----------------  ---- ------------------ --------------- 
> 14:25:22.169617  R  LLIndBlock    1:1075622848      64   13.073   
>  0.000   12.959    0.063  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:22.182723  R       inode    1:1071252480       8    6.970   
>  0.000    6.908    0.038  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:53.659918  R  LLIndBlock    1:1081202176      64    8.309   
>  0.000    8.210    0.046  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:53.668262  R       inode    2:1081373696       8   14.117   
>  0.000   14.032    0.058  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:53.682750  R  LLIndBlock    1:1065508736      64    9.254   
>  0.000    9.180    0.038  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:53.692019  R       inode    2:1064356608       8   14.899   
>  0.000   14.847    0.029  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:53.707100  R       inode    2:1077830152       8   16.499   
>  0.000   16.449    0.025  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:53.723788  R  LLIndBlock    1:1081202432      64    4.280   
>  0.000    4.203    0.040  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:53.728082  R       inode    2:1081918976       8    7.760   
>  0.000    7.710    0.027  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:57.877416  R    metadata    2:678978560       16   13.343   
>  0.000   13.254    0.053  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:57.891048  R  LLIndBlock    1:1065508608      64   15.491   
>  0.000   15.401    0.058  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:57.906556  R       inode    2:1083476520       8   11.723   
>  0.000   11.676    0.029  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:57.918516  R  LLIndBlock    1:1075622720      64    8.062   
>  0.000    8.001    0.032  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:57.926592  R       inode    1:1076503480       8    8.087   
>  0.000    8.043    0.026  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:57.934856  R  LLIndBlock    1:1071088512      64    6.572   
>  0.000    6.510    0.033  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:57.941441  R       inode    2:1069885984       8   11.686   
>  0.000   11.641    0.024  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:57.953294  R       inode    2:1083476936       8    8.951   
>  0.000    8.912    0.021  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:57.965475  R       inode    1:1076503504       8    0.477   
>  0.000    0.053    0.000  cli   C0A70401:53BEEA7F     192.167.4.1 
> 14:25:57.965755  R       inode    2:1083476488       8    0.410   
>  0.000    0.061    0.321  cli   C0A70402:53BEEA5E     192.167.4.2 
> 14:25:57.965787  R       inode    2:1083476512       8    0.439   
>  0.000    0.053    0.342  cli   C0A70402:53BEEA5E     192.167.4.2 
> 
> you basically see if its a inode , data block , what size it has (in
> sectors) , which nsd server you did send this request to, etc. 
> 
> on the Server side you see the type , which physical disk it goes to
> and also what size of disk i/o it causes like :   
> 
> 14:26:50.129995  R       inode   12:3211886376      64   14.261   
>  0.000    0.000    0.000  pd   sdis 
> 14:26:50.137102  R       inode   19:3003969520      64    9.004   
>  0.000    0.000    0.000  pd   sdad 
> 14:26:50.136116  R       inode   55:3591710992      64   11.057   
>  0.000    0.000    0.000  pd   sdoh 
> 14:26:50.141510  R       inode   21:3066810504      64    5.909   
>  0.000    0.000    0.000  pd   sdaf 
> 14:26:50.130529  R       inode   89:2962370072      64   17.437   
>  0.000    0.000    0.000  pd   sddi 
> 14:26:50.131063  R       inode   78:1889457000      64   17.062   
>  0.000    0.000    0.000  pd   sdsj 
> 14:26:50.143403  R       inode   36:3323035688      64    4.807   
>  0.000    0.000    0.000  pd   sdmw 
> 14:26:50.131044  R       inode   37:2513579736     128   17.181   
>  0.000    0.000    0.000  pd   sddv 
> 14:26:50.138181  R       inode   72:3868810400      64   10.951   
>  0.000    0.000    0.000  pd   sdbz 
> 14:26:50.138188  R       inode  131:2443484784     128   11.792   
>  0.000    0.000    0.000  pd   sdug 
> 14:26:50.138003  R       inode  102:3696843872      64   11.994   
>  0.000    0.000    0.000  pd   sdgp 
> 14:26:50.137099  R       inode  145:3370922504      64   13.225   
>  0.000    0.000    0.000  pd   sdmi 
> 14:26:50.141576  R       inode   62:2668579904      64    9.313   
>  0.000    0.000    0.000  pd   sdou 
> 14:26:50.134689  R       inode  159:2786164648      64   16.577   
>  0.000    0.000    0.000  pd   sdpq 
> 14:26:50.145034  R       inode   34:2097217320      64    7.409   
>  0.000    0.000    0.000  pd   sdmt 
> 14:26:50.138140  R       inode  139:2831038792      64   14.898   
>  0.000    0.000    0.000  pd   sdlw 
> 14:26:50.130954  R       inode  164:282120312       64   22.274   
>  0.000    0.000    0.000  pd   sdzd 
> 14:26:50.137038  R       inode   41:3421909608      64   16.314   
>  0.000    0.000    0.000  pd   sdef 
> 14:26:50.137606  R       inode  104:1870962416      64   16.644   
>  0.000    0.000    0.000  pd   sdgx 
> 14:26:50.141306  R       inode   65:2276184264      64   16.593   
>  0.000    0.000    0.000  pd   sdrk 
> 

> 
> mmdiag --iohist its another think i looked at it, but i could not 
> find good explanation for all the "buf type" ( third column )

> allocSeg
> data
> iallocSeg
> indBlock
> inode
> LLIndBlock
> logData
> logDesc
> logWrap
> metadata
> vdiskAULog
> vdiskBuf
> vdiskFWLog
> vdiskMDLog
> vdiskMeta
> vdiskRGDesc 
> If i want to monifor metadata operation whan should i look at? just 

inodes =inodes , *alloc* = file or data allocation blocks , *ind* = 
indirect blocks (for very large files) and metadata , everyhing else is 
data or internal i/o's 
  
> the metadata flag or also inode? this command takes also long to 
> run, especially if i run it a second time it hangs for a lot before 
> to rerun again, so i'm not sure that run it every 30secs or minute 
> its viable, but i will look also into that. THere is any 
> documentation that descibes clearly the whole output? what i found 
> its quite generic and don't go into details...

the reason it takes so long is because it collects 10's of thousands of 
i/os in a table and to not slow down the system when we dump the data we 
copy it to a separate buffer so we don't need locks :-) 
you can adjust the number of entries you want to collect by adjusting the 
ioHistorySize config parameter 


> > 
> > Last but not least.. and this is what i really would like to 
> > accomplish, i would to be able to monitor the latency of metadata 
> operations. 
> 
> you can't do this on the server side as you don't know how much time
> you spend on the client , network or anything between the app and 
> the physical disk, so you can only reliably look at this from the 
> client, the iohist output only shows you the Server disk i/o 
> processing time, but that can be a fraction of the overall time (in 
> other cases this obviously can also be the dominant part depending 
> on your workload). 
> 
> the easiest way on the client is to run 
> 
> mmfsadm vfsstats enable 
> from now on vfs stats are collected until you restart GPFS. 
> 
> then run : 
> 
> vfs statistics currently enabled 
> started at: Fri Aug 29 13:15:05.380 2014 
>   duration: 448446.970 sec 
> 
>  name                    calls  time per call     total time 
>  -------------------- -------- -------------- -------------- 
>  statfs                      9       0.000002       0.000021 
>  startIO              246191176       0.005853 1441049.976740 
> 
> to dump what ever you collected so far on this node. 

> 
> We already do that, but as I said, I want to check specifically how 
> gss servers are keeping the requests to identify or exlude server 
> side bottlenecks.
> 
> 
> Thanks for your help, you gave me definitely few things where to look 
at.
> 
> Salvatore
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140905/930d4664/attachment.htm>

From luke.raimbach at oerc.ox.ac.uk  Tue Sep  9 11:23:47 2014
From: luke.raimbach at oerc.ox.ac.uk (Luke Raimbach)
Date: Tue, 9 Sep 2014 10:23:47 +0000
Subject: [gpfsug-discuss] mmdiag output questions
Message-ID: <F68FAAD16AEC9744BD921F91014D2F003A7E0B@MBX06.ad.oak.ox.ac.uk>

Hi All,

When tracing a problem recently (which turned out to be a NIC failure), mmdiag proved useful in tracing broken cluster connections. I have some questions about the output of mmdiag using the --network switch:

Occasionally I see nodes in the same cluster grouped, but in no readily identifiable way - for example, the following output has three headings "Device bon0:" with some nodes listed, but the nodes don't seem to share anything in common like status, err, ostype, etc.

Also, is anyone able to explain what might be seen under the err column? Do these correspond to GPFS error codes as one might see in mmfs.log.latest? What is the sock column displaying - the number of open sockets or the socket state? Lastly, the sent/recvd columns seem very low. Is there a rolling time window within which these statistics are kept in some internal mmfsd buffer?

Cheers.

=== mmdiag: network ===

Pending messages:
  (none)
Inter-node communication configuration:
  tscTcpPort      1191
  my address      10.100.10.51/22 (eth0) <c0n8>
  my addr list    10.200.21.1/16 (bond0)/cpdn.oerc.local  10.100.10.51/22 (eth0)
  my node number  9
TCP Connections between nodes:
  Device bond0:
    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype
    gpfs01                              <c0n0>   10.200.1.1      connected  0    32    110       110        Linux/L
    gpfs02                              <c0n1>   10.200.2.1      connected  0    36    104       104        Linux/L
    linux                               <c0n2>   10.200.101.1    connected  0    37    0         0          Linux/L
    jupiter                             <c0n3>   10.200.102.1    connected  0    35    0         0          Windows/L
    cnfs0                               <c0n4>   10.200.10.10    connected  0    39    0         0          Linux/L
    cnfs1                               <c0n5>   10.200.10.11    init       0    -1    0         0          Linux/L
  Device bond0:
    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype
    cnfs2                               <c0n6>   10.200.10.12    connected  0    33    5         5          Linux/L
    cnfs3                               <c0n7>   10.200.10.13    init       0    -1    0         0          Linux/L
    cpdn-ppc02                          <c0n9>   10.200.61.1     init       0    -1    0         0          Linux/L
    cpdn-ppc03                          <c0n10>  10.200.62.1     init       0    -1    0         0          Linux/L
  Device bond0:
    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype
    cpdn-ppc01                          <c0n11>  10.200.60.1     connected  0    38    0         0          Linux/L
diag verbs: VERBS RDMA class not initialized


Conversely, the output of mmdiag --network on the file system manager node for the same cluster looks like this:

=== mmdiag: network ===

Pending messages:
  (none)
Inter-node communication configuration:
  tscTcpPort      1191
  my address      10.100.10.21/22 (eth0) <c0n0>
  my addr list    10.200.1.1/16 (bond0)/cpdn.oerc.local  10.100.10.21/22 (eth0)
  my node number  1
TCP Connections between nodes:
  Device bond0:
    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype
    gpfs02                              <c0n1>   10.200.2.1      connected  0    73    219       219        Linux/L
    linux                               <c0n2>   10.200.101.1    connected  0    49    180       181        Linux/L
    jupiter                             <c0n3>   10.200.102.1    connected  0    33    3         3          Windows/L
    cnfs0                               <c0n4>   10.200.10.10    connected  0    61    3         3          Linux/L
    cnfs1                               <c0n5>   10.200.10.11    connected  0    81    0         0          Linux/L
    cnfs2                               <c0n6>   10.200.10.12    connected  0    64    23        23         Linux/L
    cnfs3                               <c0n7>   10.200.10.13    connected  0    60    2         2          Linux/L
    tsm01                               <c0n8>   10.200.21.1     connected  0    50    110       110        Linux/L
    cpdn-ppc02                          <c0n9>   10.200.61.1     connected  0    63    0         0          Linux/L
    cpdn-ppc03                          <c0n10>  10.200.62.1     connected  0    65    0         0          Linux/L
    cpdn-ppc01                          <c0n11>  10.200.60.1     connected  0    62    94        94         Linux/L
diag verbs: VERBS RDMA class not initialized


All neatly connected!


--

Luke Raimbach
IT Manager
Oxford e-Research Centre
7 Keble Road,
Oxford,
OX1 3QG

+44(0)1865 610639


From chair at gpfsug.org  Wed Sep 10 15:33:24 2014
From: chair at gpfsug.org (Jez Tucker (Chair))
Date: Wed, 10 Sep 2014 15:33:24 +0100
Subject: [gpfsug-discuss] GPFS Request for Enhancements
Message-ID: <54106134.7010902@gpfsug.org>

Hi all

   Just a quick reminder that the RFEs that you all gave feedback at the 
last UG on are live on IBM's RFE site:

goo.gl/1K6LBa

Please take the time to have a look and add your votes to the GPFS RFEs.

Jez


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140910/26b484ea/attachment.htm>

From dmetcalfe at ocf.co.uk  Thu Sep 11 21:18:58 2014
From: dmetcalfe at ocf.co.uk (Daniel Metcalfe)
Date: Thu, 11 Sep 2014 21:18:58 +0100
Subject: [gpfsug-discuss] mmdiag output questions
In-Reply-To: <F68FAAD16AEC9744BD921F91014D2F003A7E0B@MBX06.ad.oak.ox.ac.uk>
References: <F68FAAD16AEC9744BD921F91014D2F003A7E0B@MBX06.ad.oak.ox.ac.uk>
Message-ID: <A42128435E851644B9B011BB824F6C816F56F25B1D@MAIL.ocf.local>

Hi Luke,


I've seen the same apparent grouping of nodes, I don't believe the nodes are actually being grouped but instead the "Device Bond0:" and column headers are being re-printed to screen whenever there is a node that has the "init" status followed by a node that is "connected".  It is something I've noticed on many different versions of GPFS so I imagine it's a "feature".


I've not noticed anything but '0' in the err column so I'm not sure if these correspond to error codes in the GPFS logs.  If you run the command "mmfsadm dump tscomm", you'll see a bit more detail than the mmdiag -network shows.  This suggests the sock column is number of sockets. I've seen the low numbers to for sent / recv using mmdiag --network, again the mmfsadm command above gives a better representation I've found.


All that being said, if you want to get in touch with us then we'll happily open a PMR for you and find out the answer to any of your questions.


Kind regards,


Danny Metcalfe
Systems Engineer
OCF plc

Tel: 0114 257 2200


[cid:image001.jpg at 01CFCE04.575B8380]


Twitter<http://twitter.com/ocfplc>

Fax: 0114 257 0022

[cid:image002.jpg at 01CFCE04.575B8380]

Blog<http://blog.ocf.co.uk/>

Mob: 07960 503404

[cid:image003.jpg at 01CFCE04.575B8380]

Web<http://www.ocf.co.uk/>


Please note, any emails relating to an OCF Support request must always be sent to support at ocf.co.uk<mailto:support at ocf.co.uk> for a ticket number to be generated or existing support ticket to be updated. Should this not be done then OCF cannot be held responsible for requests not dealt with in a timely manner.

OCF plc is a company registered in England and Wales.  Registered number 4132533. Registered office address: OCF plc, 5 Rotunda Business Centre, Thorncliffe Park, Chapeltown, Sheffield, S35 2PG

This message is private and confidential. If you have received this message in error, please notify us immediately and remove it from your system.


-----Original Message-----
From: gpfsug-discuss-bounces at gpfsug.org [mailto:gpfsug-discuss-bounces at gpfsug.org] On Behalf Of Luke Raimbach
Sent: 09 September 2014 11:24
To: gpfsug-discuss at gpfsug.org
Subject: [gpfsug-discuss] mmdiag output questions


Hi All,


When tracing a problem recently (which turned out to be a NIC failure), mmdiag proved useful in tracing broken cluster connections. I have some questions about the output of mmdiag using the --network switch:


Occasionally I see nodes in the same cluster grouped, but in no readily identifiable way - for example, the following output has three headings "Device bon0:" with some nodes listed, but the nodes don't seem to share anything in common like status, err, ostype, etc.


Also, is anyone able to explain what might be seen under the err column? Do these correspond to GPFS error codes as one might see in mmfs.log.latest? What is the sock column displaying - the number of open sockets or the socket state? Lastly, the sent/recvd columns seem very low. Is there a rolling time window within which these statistics are kept in some internal mmfsd buffer?


Cheers.


=== mmdiag: network ===


Pending messages:

  (none)

Inter-node communication configuration:

  tscTcpPort      1191

  my address      10.100.10.51/22 (eth0) <c0n8>

  my addr list    10.200.21.1/16 (bond0)/cpdn.oerc.local  10.100.10.51/22 (eth0)

  my node number  9

TCP Connections between nodes:

  Device bond0:

    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype

    gpfs01                              <c0n0>   10.200.1.1      connected  0    32    110       110        Linux/L

    gpfs02                              <c0n1>   10.200.2.1      connected  0    36    104       104        Linux/L

    linux                               <c0n2>   10.200.101.1    connected  0    37    0         0          Linux/L

    jupiter                             <c0n3>   10.200.102.1    connected  0    35    0         0          Windows/L

    cnfs0                               <c0n4>   10.200.10.10    connected  0    39    0         0          Linux/L

    cnfs1                               <c0n5>   10.200.10.11    init       0    -1    0         0          Linux/L

  Device bond0:

    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype

    cnfs2                               <c0n6>   10.200.10.12    connected  0    33    5         5          Linux/L

    cnfs3                               <c0n7>   10.200.10.13    init       0    -1    0         0          Linux/L

    cpdn-ppc02                          <c0n9>   10.200.61.1     init       0    -1    0         0          Linux/L

    cpdn-ppc03                          <c0n10>  10.200.62.1     init       0    -1    0         0          Linux/L

  Device bond0:

    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype

    cpdn-ppc01                          <c0n11>  10.200.60.1     connected  0    38    0         0          Linux/L

diag verbs: VERBS RDMA class not initialized


Conversely, the output of mmdiag --network on the file system manager node for the same cluster looks like this:


=== mmdiag: network ===


Pending messages:

  (none)

Inter-node communication configuration:

  tscTcpPort      1191

  my address      10.100.10.21/22 (eth0) <c0n0>

  my addr list    10.200.1.1/16 (bond0)/cpdn.oerc.local  10.100.10.21/22 (eth0)

  my node number  1

TCP Connections between nodes:

  Device bond0:

    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype

    gpfs02                              <c0n1>   10.200.2.1      connected  0    73    219       219        Linux/L

    linux                               <c0n2>   10.200.101.1    connected  0    49    180       181        Linux/L

    jupiter                             <c0n3>   10.200.102.1    connected  0    33    3         3          Windows/L

    cnfs0                               <c0n4>   10.200.10.10    connected  0    61    3         3          Linux/L

    cnfs1                               <c0n5>   10.200.10.11    connected  0    81    0         0          Linux/L

    cnfs2                               <c0n6>   10.200.10.12    connected  0    64    23        23         Linux/L

    cnfs3                               <c0n7>   10.200.10.13    connected  0    60    2         2          Linux/L

    tsm01                               <c0n8>   10.200.21.1     connected  0    50    110       110        Linux/L

    cpdn-ppc02                          <c0n9>   10.200.61.1     connected  0    63    0         0          Linux/L

    cpdn-ppc03                          <c0n10>  10.200.62.1     connected  0    65    0         0          Linux/L

    cpdn-ppc01                          <c0n11>  10.200.60.1     connected  0    62    94        94         Linux/L

diag verbs: VERBS RDMA class not initialized


All neatly connected!


--


Luke Raimbach

IT Manager

Oxford e-Research Centre

7 Keble Road,

Oxford,

OX1 3QG


+44(0)1865 610639

_______________________________________________

gpfsug-discuss mailing list

gpfsug-discuss at gpfsug.org

http://gpfsug.org/mailman/listinfo/gpfsug-discuss


-----

No virus found in this message.

Checked by AVG - www.avg.com<http://www.avg.com>

Version: 2014.0.4765 / Virus Database: 4015/8158 - Release Date: 09/05/14
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140911/dd761352/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.jpg
Type: image/jpeg
Size: 4696 bytes
Desc: image001.jpg
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140911/dd761352/attachment.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.jpg
Type: image/jpeg
Size: 4725 bytes
Desc: image002.jpg
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140911/dd761352/attachment-0001.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image003.jpg
Type: image/jpeg
Size: 4820 bytes
Desc: image003.jpg
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140911/dd761352/attachment-0002.jpg>

From stuartb at 4gh.net  Tue Sep 23 16:47:09 2014
From: stuartb at 4gh.net (Stuart Barkley)
Date: Tue, 23 Sep 2014 11:47:09 -0400 (EDT)
Subject: [gpfsug-discuss] filesets and mountpoint naming
Message-ID: <alpine.BSF.2.11.1409231127550.1507@freeman.4gh.net>

When we first started using GPFS we created several filesystems and
just directly mounted them where seemed appropriate.  We have
something like:

    /home
    /scratch
    /projects
    /reference
    /applications

We are finding the overhead of separate filesystems to be troublesome
and are looking at using filesets inside fewer filesystems to
accomplish our goals (we will probably keep /home separate for now).

We can put symbolic links in place to provide the same user
experience, but I'm looking for suggestions as to where to mount the
actual gpfs filesystems.

We have multiple compute clusters with multiple gpfs systems, one
cluster has a traditional gpfs system and a separate gss system which
will obviously need multiple mount points.  We also want to consider
possible future cross cluster mounts.

Some thoughts are to just do filesystems as:

    /gpfs01, /gpfs02, etc.
    /mnt/gpfs01, etc
    /mnt/clustera/gpfs01, etc.

What have other people done?  Are you happy with it?  What would you
do differently?

Thanks,
Stuart
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone


From sabujp at gmail.com  Thu Sep 25 13:39:14 2014
From: sabujp at gmail.com (Sabuj Pattanayek)
Date: Thu, 25 Sep 2014 07:39:14 -0500
Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs -u
 in bin/mmnfsfuncs unexportAll and unexportFS
Message-ID: <CAEeMGHsP=7XXPkAp=3sYu6+=0+9DnMCCJdhZ=TUF_AFf2AAu6g@mail.gmail.com>

Hi all,

We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip failover
times > 4.5mins . It looks like it's being caused by all the exportfs -u
calls being made in the unexportAll and the unexportFS function in
bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the
exported directories? We're running only NFSv3 and have lots of exports and
for security reasons can't have one giant NFS export. That may be a
possibility with GPFS4.1 and NFSv4 but we won't be migrating to that
anytime soon.

Assume the network went down for the cnfs server or the system
panicked/crashed, what would be the purpose of exportfs -u be in that case,
so what's the purpose at all?

Thanks,
Sabuj
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140925/5e532686/attachment.htm>

From sabujp at gmail.com  Thu Sep 25 14:11:18 2014
From: sabujp at gmail.com (Sabuj Pattanayek)
Date: Thu, 25 Sep 2014 08:11:18 -0500
Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs
 -u in bin/mmnfsfuncs unexportAll and unexportFS
In-Reply-To: <CAEeMGHsP=7XXPkAp=3sYu6+=0+9DnMCCJdhZ=TUF_AFf2AAu6g@mail.gmail.com>
References: <CAEeMGHsP=7XXPkAp=3sYu6+=0+9DnMCCJdhZ=TUF_AFf2AAu6g@mail.gmail.com>
Message-ID: <CAEeMGHtNqp78aU5nSRegAg=BERNLHzF6kvut=vz7tYsdzSuCXw@mail.gmail.com>

our support engineer suggests adding & to the end of the exportfs -u lines
in the mmnfsfunc script, which is a good workaround, can this be added to
future gpfs 3.5 and 4.1 rels (haven't even looked at 4.1 yet). I was
looking at the unexportfs all in nfs-utils/exportfs.c and it looks like the
limiting factor there would be all the hostname lookups? I don't see what
exportfs -u is doing other than doing slow reverse lookups and removing the
export from the nfs stack.

On Thu, Sep 25, 2014 at 7:39 AM, Sabuj Pattanayek <sabujp at gmail.com> wrote:

> Hi all,
>
> We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip failover
> times > 4.5mins . It looks like it's being caused by all the exportfs -u
> calls being made in the unexportAll and the unexportFS function in
> bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the
> exported directories? We're running only NFSv3 and have lots of exports and
> for security reasons can't have one giant NFS export. That may be a
> possibility with GPFS4.1 and NFSv4 but we won't be migrating to that
> anytime soon.
>
> Assume the network went down for the cnfs server or the system
> panicked/crashed, what would be the purpose of exportfs -u be in that case,
> so what's the purpose at all?
>
> Thanks,
> Sabuj
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140925/a8d7487e/attachment.htm>

From sabujp at gmail.com  Thu Sep 25 14:15:19 2014
From: sabujp at gmail.com (Sabuj Pattanayek)
Date: Thu, 25 Sep 2014 08:15:19 -0500
Subject: [gpfsug-discuss] really slow cnfs vip failover due to exportfs
 -u in bin/mmnfsfuncs unexportAll and unexportFS
In-Reply-To: <CAEeMGHtNqp78aU5nSRegAg=BERNLHzF6kvut=vz7tYsdzSuCXw@mail.gmail.com>
References: <CAEeMGHsP=7XXPkAp=3sYu6+=0+9DnMCCJdhZ=TUF_AFf2AAu6g@mail.gmail.com>
	<CAEeMGHtNqp78aU5nSRegAg=BERNLHzF6kvut=vz7tYsdzSuCXw@mail.gmail.com>
Message-ID: <CAEeMGHunYVBE4BRcjQcWtJ9Gdt9K0rqJPbJaeDxY5JFXi_pVow@mail.gmail.com>

yes, it's doing a getaddrinfo() call for every hostname that's a fqdn and
not an ip addr, which we have lots of in our export entries since sometimes
clients update their dns (ip's).

On Thu, Sep 25, 2014 at 8:11 AM, Sabuj Pattanayek <sabujp at gmail.com> wrote:

> our support engineer suggests adding & to the end of the exportfs -u lines
> in the mmnfsfunc script, which is a good workaround, can this be added to
> future gpfs 3.5 and 4.1 rels (haven't even looked at 4.1 yet). I was
> looking at the unexportfs all in nfs-utils/exportfs.c and it looks like the
> limiting factor there would be all the hostname lookups? I don't see what
> exportfs -u is doing other than doing slow reverse lookups and removing the
> export from the nfs stack.
>
> On Thu, Sep 25, 2014 at 7:39 AM, Sabuj Pattanayek <sabujp at gmail.com>
> wrote:
>
>> Hi all,
>>
>> We're running 3.5.0.19 with CNFS and noticed really slow CNFS vip
>> failover times > 4.5mins . It looks like it's being caused by all the
>> exportfs -u calls being made in the unexportAll and the unexportFS function
>> in bin/mmnfsfuncs . What's the purpose of running exportfs -u on all the
>> exported directories? We're running only NFSv3 and have lots of exports and
>> for security reasons can't have one giant NFS export. That may be a
>> possibility with GPFS4.1 and NFSv4 but we won't be migrating to that
>> anytime soon.
>>
>> Assume the network went down for the cnfs server or the system
>> panicked/crashed, what would be the purpose of exportfs -u be in that case,
>> so what's the purpose at all?
>>
>> Thanks,
>> Sabuj
>>
>>
>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140925/1f7cc4cf/attachment.htm>