[gpfsug-discuss] Executing Callbacks on other Nodes

Jeffrey R. Lang JRLang at uwyo.edu
Mon Apr 18 17:28:25 BST 2016


Roland

  Here's a tool written by NCAR that provides waiter information on a per-node basis using a lightweight daemon on the monitored node. I have been using it for a while, and it has helped me find and diagnose nodes with long waiters.

  It might do what you are looking for.

  https://sourceforge.net/projects/gpfsmonitorsuite/

jeff

-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Roland Pabel
Sent: Monday, April 18, 2016 9:10 AM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] Executing Callbacks on other Nodes

Hi Bob,

I'll try the second approach, i.e., collecting "mmfsadm dump waiters" locally and then summing the values up, since it can be done without the overhead of ssh.
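
A minimal sketch of that second approach, not a tested implementation. It assumes that each waiter line in the "mmfsadm dump waiters" output contains the text "waiting <seconds> seconds" and that mmfsadm lives in /usr/lpp/mmfs/bin; the function names and the 45-second timeout are hypothetical choices made for illustration.

```python
import re
import subprocess

# Assumption: every waiter line contains "waiting <seconds> seconds".
# Adjust the pattern if your GPFS release prints something different.
WAITER_RE = re.compile(r"waiting\s+(\d+(?:\.\d+)?)\s+seconds")


def summarize_waiters(dump_text):
    """Return (waiter count, longest wait in seconds) for one node's dump."""
    waits = [float(m.group(1)) for m in WAITER_RE.finditer(dump_text)]
    return len(waits), max(waits, default=0.0)


def dump_local_waiters(mmfsadm="/usr/lpp/mmfs/bin/mmfsadm", timeout_s=45):
    """Run "mmfsadm dump waiters" on this node only, so no ssh is involved.

    The path and the timeout are assumptions; subprocess.run raises
    subprocess.TimeoutExpired if the command does not return in time.
    """
    result = subprocess.run(
        [mmfsadm, "dump", "waiters"],
        capture_output=True, text=True, timeout=timeout_s, check=True,
    )
    return result.stdout


def cluster_totals(per_node):
    """Combine per-node (count, longest) pairs into cluster-wide values."""
    total = sum(count for count, _ in per_node.values())
    longest = max((wait for _, wait in per_node.values()), default=0.0)
    return total, longest
```

Each node would call summarize_waiters(dump_local_waiters()) and send the pair to the database; cluster_totals shows the kind of summing the graphing side would then do.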

You mentioned that mmlsnode starts all these ssh commands, which made me look into the file itself. I then noticed that most of the mm commands are actually scripts. This helps a lot with regard to my original question. mmdsh seems to do what I need.

Thanks,

Roland


> This command is just using ssh to all the nodes and dumping the waiter 
> information and collecting it. That means if a node is down or slow to 
> respond, or there are a large number of nodes, it could take a while 
> to return.  In my 400-500 node clusters this command usually takes less 
> than 10 seconds. I do prefix the command with a timeout value in case 
> a node is hung up and ssh never returns (which it sometimes does, and 
> that's not the fault of GPFS). Something like this:
> 
> timeout 45s /usr/lpp/mmfs/bin/mmlsnode -N waiters -L
> 
> This means I get incomplete information, but if you don't, you end up 
> piling up a lot of hung-up commands. I would check over your cluster 
> carefully to see if there are other issues that might cause ssh to 
> hang up, which could impact other GPFS commands that distribute via ssh.
> 
> Another approach would be to dump the waiters locally on each node, 
> send node-specific information to the database, and then sum it up 
> using the graphing software.
> 
> Bob Oesterlin
> Sr Storage Engineer, Nuance HPC Grid
> 
> From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Roland Pabel
> <dr.roland.pabel at gmail.com>
> Organization: RRZK Uni Köln
> Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date: Friday, April 15, 2016 at 10:50 AM
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Subject: Re: [gpfsug-discuss] Executing Callbacks on other Nodes
> 
> Hi,
> 
> In our cluster, mmlsnode -N waiters -L takes about 25 seconds to run. 
> So running it every 30 seconds is cutting it a bit close. I'll try running 
> it once a minute and then incorporating this into our graphing.
> 
> Maybe the command is so slow for me because a few nodes are down?
> Is there a parameter to mmlsnode to configure the timeout?
> 
> 

--
Dr. Roland Pabel
Regionales Rechenzentrum der Universität zu Köln (RRZK) Weyertal 121, Raum 3.07
D-50931 Köln

Tel.: +49 (221) 470-89589
E-Mail: pabel at uni-koeln.de
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

