[gpfsug-discuss] Joining RDMA over different networks?

Jonathan Buzzard jonathan.buzzard at strath.ac.uk
Thu Aug 24 16:26:34 BST 2023


On 22/08/2023 11:52, Alec wrote:

> I wouldn't want to use GPFS if I didn't want my nodes to be able to go 
> nuts, why bother to be frank.
> 

Because there are multiple users of the system. Do you want to be the 
one explaining to 50 other users that they can't use the system today 
because John from Chemistry is pounding the filesystem to death for his 
jobs? Didn't think so.

There is not an infinite amount of money available, and it is not 
possible for a reasonable amount of money to build a file system that 
all the nodes can max out their network connections against at once.
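
As a rough back-of-the-envelope sketch (the node count and backend 
throughput below are hypothetical figures for illustration, not our 
actual numbers), the oversubscription falls straight out of the 
arithmetic:

    # Why every node cannot max out its link at once.
    # Illustrative numbers only, not the actual ARCHIE-WeSt figures.
    GiB = 2**30

    nodes = 400                     # hypothetical cluster size
    nic_gbps = 10                   # per-node Ethernet link, as per the 2017 spec
    backend_bytes_s = 20 * GiB      # hypothetical aggregate file system throughput

    demand_bytes_s = nodes * nic_gbps / 8 * 1e9   # every node running its link flat out
    print(f"Aggregate demand : {demand_bytes_s / GiB:6.1f} GiB/s")
    print(f"Backend capacity : {backend_bytes_s / GiB:6.1f} GiB/s")
    print(f"Oversubscription : {demand_bytes_s / backend_bytes_s:4.1f}x")

However you pick the numbers, an affordable backend ends up 
oversubscribed many times over, so the only real question is how you 
share it out.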

> I had tested a configuration with a single x86 box and 4 x 100Gbe 
> adapters talking to an ESS, that thing did amazing performance in excess 
> of 25 GB/s over Ethernet.  If you have a node that needs that 
> performance build to it.  Spend more time configuring QoS to fair share 
> your bandwidth than baking bottlenecks into your configuration.
> 

There are finite budgets and compromises have to be made. The 
compromises we made back in 2017 when the specification was written and 
put out to tender have held up really well.

> The reasoning of holding end nodes to a smaller bandwidth than the 
> backend doesn't make sense.  You want to clear "the work" as efficiently 
> as possible, more than keep IT from having any constraints popping up.  
> That's what leads to just endless dithering and diluting of 
> infrastructure until no one can figure out how to get real performance.
> 

It does, because a small number of jobs can hold the system to ransom 
for lots of other users. I have to balance things across a large number 
of nodes. There is only a finite amount of bandwidth to the storage and 
it has to be shared out fairly. I could attempt to do it with QoS on 
the switches, or I could go "sod that for a lark, 10Gbps is all you 
get" and keep it simple. Though like I said, today it would be 25Gbps; 
this was a specification written six years ago, when 25Gbps Ethernet 
was rather exotic and too expensive.
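
To put some rough numbers on why the per-node cap is itself a crude 
fair-share mechanism (again using a hypothetical backend throughput, 
not our actual figure):

    # How many nodes at a given link speed it takes to saturate the backend.
    # Backend throughput is a hypothetical figure for illustration.
    GiB = 2**30
    backend_bytes_s = 20 * GiB

    def nodes_to_saturate(nic_gbps):
        """Nodes at a given link speed needed to saturate the backend."""
        per_node_bytes_s = nic_gbps / 8 * 1e9
        return backend_bytes_s / per_node_bytes_s

    for nic in (10, 25, 100):
        print(f"{nic:3d} Gbps NICs: ~{nodes_to_saturate(nic):5.1f} nodes to saturate the backend")

With 10Gbps links it takes well over a dozen nodes all running flat out 
before the storage saturates; a single 100Gbps node can do it more or 
less on its own.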

> So yeah 95% of the workloads don't care about their performance and can 
> live on dithered and diluted infrastructure that costs a zillion times 
> more money than what the 5% of workload that does care about bandwidth 
> needs to spend to actually deliver.
> 

They do care about performance, they just don't need to max out the 
allotted performance per node. However, if the performance of the file 
system is bad, the performance of their jobs will also be bad, and the 
total FLOPS I get from the system will plummet through the floor.

Note it is more like 0.1% of jobs that peg the 10Gbps network interface 
for any period of time at all.

> Build your infrastructure storage as high bandwidth as possible per node 
> because compared to all the other costs it's a drop in the bucket... 
> Don't cheap out on "cables".

No it's not. The Omnipath network (which, by the way, is deliberately 
reserved for MPI) cost a *LOT* of money. We are having serious 
conversations about whether, with current core counts per node, an 
Infiniband/Omnipath network makes sense any more, and whether 25Gbps 
Ethernet will do just fine for a standard compute node.

Around 85% of our jobs run on 40 cores (aka one node) or fewer. If you 
go to 128 cores a node, it's more like 95% of all jobs; at 192 cores 
it's about 98% of all jobs. The maximum job size we currently allow is 
400 cores.

The current thinking is that it is better to ditch the expensive 
interconnect and use the hundreds of thousands of dollars saved to buy 
more compute nodes. The 2% of users will just have longer runtimes, but 
there will be a lot more FLOPS available in total, and they rarely have 
just one job in the queue, so it will all balance out in the wash and 
be positive for most users.

In consultation the users are on board with this direction of travel. 
From our perspective, if a user absolutely needs more than 192 cores on 
a modern system, it would not be unreasonable to direct them to a 
national facility that can handle the really huge jobs. We are an 
institutional HPC facility after all. We don't claim to be able to 
handle a 1000-core job, for example.


JAB.

-- 
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG



