[gpfsug-discuss] WAS: alternative path; Now: RDMA
Jonathan Buzzard
jonathan.buzzard at strath.ac.uk
Mon Dec 13 23:55:23 GMT 2021
On 13/12/2021 00:03, Andrew Beattie wrote:
> What is the main outcome or business requirement of the teaching cluster
> ( i notice your specific in the use of defining it as a teaching cluster)
> It is entirely possible that the use case for this cluster does not
> warrant the use of high speed low latency networking, and it simply
> needs the benefits of a parallel filesystem.
While we call it the "teaching cluster" it would be more appropriate to
call them "teaching nodes" that share resources (storage and login
nodes) with the main research cluster. It's mainly used by
undergraduates doing final year projects and M.Sc. students. It's
getting a bit long in the tooth now, but not many undergraduates have
access to a 16 core machine with 64GB of RAM. Even if they did, being
able to let something run flat out for 48 hours means their personal
laptop is available for other things :-)
I was just musing that the cards in the teaching nodes being Intel
82599ES would be a stumbling block for RDMA over Ethernet, but on
checking the Intel X710 doesn't do RDMA either so it would all be a bust
anyway. I was clearly on the crack pipe when I thought they did. So
aside from the DSS-G and GPU nodes with ConnectX-4 cards nothing does RDMA.
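For anyone wanting to double-check their own kit before assuming RDMA will work, a quick sketch of a check (assumes a Linux host; RDMA-capable HCAs such as the ConnectX-4 register under /sys/class/infiniband once their driver is loaded, while plain NICs like the 82599ES or X710 will not appear there):

```shell
#!/bin/sh
# List any RDMA-capable devices on this host. An empty (or missing)
# /sys/class/infiniband directory means no RDMA-capable hardware or
# no loaded RDMA driver.
check_rdma() {
    if [ -d /sys/class/infiniband ] && \
       [ -n "$(ls -A /sys/class/infiniband 2>/dev/null)" ]; then
        echo "RDMA-capable device(s): $(ls /sys/class/infiniband)"
    else
        echo "no RDMA-capable devices found"
    fi
}

check_rdma
```

On a box with the rdma-core tools installed, `ibv_devinfo` gives the same answer with more detail.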
[SNIP]
> For some of my research clients this is the ability to run 20-30% more
> compute jobs on the same HPC resources in the same 24H period, which
> means that they can reduce the amount of time they need on the HPC
> cluster to get the data results that they are looking for.
Except that, as I said, in our cluster the storage servers have never been
maxed out except when running benchmarks. Individual compute nodes have
been maxed out (mainly Gaussian writing 800GB temporary files) but as I
explained that's a good thing from my perspective because I don't want
one or two users to be able to pound the storage into oblivion and cause
problems for everyone else.
We have enough problems with users tanking the login nodes by running
computations on them. That should go away with our upgrade to RHEL8 and
the wonders of per-user cgroups; me, I love systemd.
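For anyone wanting to do the same, one way to cap every user on a login node is a template drop-in for user slices (a sketch; the path works on RHEL8's systemd, and the limit values below are purely illustrative, not what we run):

```ini
# /etc/systemd/system/user-.slice.d/50-login-limits.conf
# Applied to every user-<UID>.slice, i.e. per logged-in user.
[Slice]
CPUQuota=200%    # at most two cores' worth of CPU per user
MemoryMax=8G     # hard memory ceiling per user
TasksMax=512     # cap processes/threads per user
```

After dropping the file in, `systemctl daemon-reload` picks it up for new sessions.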
In the end nobody has complained that the storage speed is a problem
yet, and putting the metadata on SSD would be my first port of call if
they did and funds were available to make things go faster.
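For reference, moving GPFS metadata onto SSD is mostly a matter of adding NSDs marked metadata-only to the system pool, something like the stanza below for mmcrnsd/mmadddisk (a sketch; the device, NSD and server names are made up, and failure groups would need to match the existing layout):

```
%nsd: device=/dev/nvme0n1
  nsd=ssd_meta_01
  servers=dssg01,dssg02
  usage=metadataOnly
  failureGroup=1
  pool=system
```

Existing dataAndMetadata disks can then be switched to dataOnly and the metadata migrated off with mmrestripefs.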
To be honest I think the users are just happy that GPFS doesn't eat
itself and end up out of action for a few weeks every couple of years
like Lustre did on the previous system.
JAB.
--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG