[gpfsug-discuss] Looking for a way to see which node is having an impact on server?
chekh at stanford.edu
Mon Dec 9 21:21:24 GMT 2013
For IB traffic, you can use 'collectl -sx'
or else mmpmon (which is what 'dstat --gpfs' uses underneath anyway)
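With 600 clients you would want to rank nodes rather than eyeball them, so here is a rough sketch of how mmpmon output could be compared. It assumes you have captured the parseable output of the fs_io_s request (e.g. from `mmpmon -p`) on each client twice, a few seconds apart, and that each record is a `_fs_io_s_` line of alternating `_key_ value` tokens with `_nn_` (node name), `_br_` (bytes read) and `_bw_` (bytes written). The function names are made up for illustration, and how you gather the output from every node is left out.

```python
from collections import defaultdict


def parse_record(line):
    """Parse one mmpmon '-p' style fs_io_s record into a dict.

    Assumed layout: '_fs_io_s_ _key_ value _key_ value ...'.
    Returns {} for any line that is not an fs_io_s record.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "_fs_io_s_":
        return {}
    rest = tokens[1:]
    # Pair each '_key_' token with the value that follows it.
    return {key.strip("_"): value for key, value in zip(rest[0::2], rest[1::2])}


def bytes_per_node(lines):
    """Sum bytes read + bytes written per node name over all records."""
    totals = defaultdict(int)
    for line in lines:
        rec = parse_record(line)
        if rec:
            totals[rec["nn"]] += int(rec["br"]) + int(rec["bw"])
    return totals


def busiest_nodes(before, after):
    """Rank nodes by bytes moved between two samples, busiest first.

    The counters are cumulative, so only the difference between the
    two samples says anything about current activity.
    """
    start, end = bytes_per_node(before), bytes_per_node(after)
    deltas = {node: end[node] - start.get(node, 0) for node in end}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)
```

Byte counts alone will not flag a user doing tiny random I/O, so the same approach may be more telling with the read/write call counters instead of `_br_` and `_bw_`.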
If your other NSDs are full, then of course all new writes will go to the
empty NSDs. And when those new files are read back, your performance will
be limited to what the new NSDs alone can deliver.
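To confirm that the waits really are concentrated on a couple of disks, you could tally the waiters per disk. This is only a sketch: it assumes every relevant line contains text in the same shape as the waiter you quoted ("waiting N seconds, NSDThread: for I/O completion on disk dX"), and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches the waiter text as quoted in this thread; any prefix on the
# line (addresses, node names) is ignored because we use search().
WAITER_RE = re.compile(
    r"waiting (?P<secs>\d+\.\d+) seconds, (?P<thread>\w+): "
    r"for I/O completion on disk (?P<disk>\S+)"
)


def waiters_by_disk(lines):
    """Return (disk, waiter_count, total_wait_seconds), worst disk first."""
    stats = defaultdict(lambda: [0, 0.0])  # disk -> [count, total seconds]
    for line in lines:
        match = WAITER_RE.search(line)
        if match:
            entry = stats[match.group("disk")]
            entry[0] += 1
            entry[1] += float(match.group("secs"))
    rows = [(disk, count, secs) for disk, (count, secs) in stats.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Feed it the saved waiters output line by line. If the same two disks stay at the top across several samples, that points at data placement rather than a single bad client.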
On 12/09/2013 01:05 PM, Richard Lefebvre wrote:
> Hi Alex,
> I should have mentioned that my GPFS network is done through
> infiniband/RDMA, so looking at the TCP probably won't work. I will try
> to see if the traffic can be seen through ib0 (instead of eth0), but I
> have my doubts.
> As for the placement: the file system was 95% full when I added the new
> NSDs. I know from the waiters command that what is waiting now is I/O
> to the 2 NSDs:
> waiting 0.791707000 seconds, NSDThread: for I/O completion on disk d9
> I have added more NSDs since then, but the waiting is still on those 2
> disks and none of the others.
> On 12/09/2013 02:52 PM, Alex Chekholko wrote:
>> Hi Richard,
>> I would just use something like 'iftop' to look at the traffic between
>> the nodes. Or 'collectl'. Or 'dstat'.
>> e.g. dstat -N eth0 --gpfs --gpfs-ops --top-cpu-adv --top-io 2 10
>> For the NSD balance question, since GPFS stripes the blocks evenly
>> across all the NSDs, they will end up balanced over time. Or you can
>> rebalance manually with 'mmrestripefs -b' or similar.
>> It is unlikely that particular files ended up on a single NSD, unless
>> the other NSDs are totally full.
>> On 12/06/2013 04:31 PM, Richard Lefebvre wrote:
>>> I'm looking for a way to see which node (or nodes) is having an impact
>>> on the gpfs server nodes and slowing down the whole file system. What
>>> usually happens is that a user is doing some I/O that doesn't fit the
>>> configuration of the gpfs file system or the way it was explained how
>>> to use it efficiently. It is usually a lot of unbuffered, byte-sized,
>>> very random I/O on a file system that was made for large files
>>> and large block sizes.
>>> My problem is finding out who is doing that. I haven't found a way to
>>> pinpoint the node or nodes that could be the source of the problem, with
>>> over 600 client nodes.
>>> I tried to use "mmlsnode -N waiters -L" but there are so many waiters
>>> that I cannot pinpoint anything.
>>> I must be missing something simple. Anyone got any help?
>>> Note: there is another thing I'm trying to pinpoint. A temporary
>>> imbalance was created by adding a new NSD. It seems that a group of
>>> files have been created on that same NSD and a user keeps hitting that
>>> NSD causing a high load. I'm trying to pinpoint the origin of that too.
>>> At least until everything is balanced again. But will rebalancing spread
>>> those files, since they are already on the emptiest NSD?
>>> gpfsug-discuss mailing list
>>> gpfsug-discuss at gpfsug.org
Alex Chekholko chekh at stanford.edu 347-401-4860