[gpfsug-discuss] Using VMs as quorum / admin nodes in a GPFS infiniband cluster

Jonathan Buzzard jonathan.buzzard at strath.ac.uk
Thu Jun 17 13:15:32 BST 2021


On 17/06/2021 09:29, Jan-Frode Myklebust wrote:

> *All* nodes needs to be able to communicate on the daemon network. If 
> they don't have access to this network, they can't join the cluster.

Not strictly true.

TL;DR: if all your NSD/master nodes are connected to both Ethernet and 
Infiniband then you will be able to join the node to the cluster. Doing 
so is not advisable, however, as you will then start experiencing node 
evictions left, right and centre.

> It doesn't need to be same subnet, it can be routed. But they all have to 
> be able to reach each other. If you use IPoIB, you likely need something 
> to route between the IPoIB network and the outside world to reach the IP 
> you have on your VM. I don't think you will be able to use an IP address 
> in the IPoIB range for your VM, unless your vmware hypervisor is 
> connected to the IB fabric, and can bridge it.. (doubt that's possible).

ESXi and pretty much every other hypervisor worth its salt has been 
able to do PCI passthrough since forever. So whack an Infiniband card in 
your ESXi node, pass it through to the VM and the job's a good 'un.
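
That said, it is worth sanity checking inside the guest that the card 
really has been handed over wholesale. Assuming a Mellanox HCA and the 
stock RHEL infiniband-diags and opensm packages, something like

lspci | grep -i mellanox   # the HCA shows up as an ordinary PCI device
ibstat                     # physical state should report LinkUp

systemctl enable --now opensm   # then run the subnet manager as on bare metal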

However it is something a lot of people are completely unaware of, 
including Infiniband/Omnipath vendors. The conversation goes: can I run 
my fabric manager on a VM in ESXi rather than burn the planet on 
dedicated nodes for the job? The response comes back that the fabric is 
not supported on ESXi, which shows utter ignorance on the part of the 
fabric vendor.

> I've seen some customers avoid using IPoIB, and rather mix an ethernet 
> for daemon network, and dedicate the infiniband network to RDMA.
> 
What's the point of RDMA for GPFS, lower CPU overhead? To my mind it 
creates a lot of inflexibility. If your next cluster uses a different 
fabric, migration is now a whole bunch more complicated. It's also a 
"minority sport", so something to be avoided unless there is a 
compelling reason to use it.
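
For reference, if memory serves the RDMA side of GPFS hangs off a 
couple of cluster-wide settings, and it is the fabric-specific 
verbsPorts values (the device/port names below are purely illustrative) 
that make a later change of fabric more painful than plain TCP/IP:

mmchconfig verbsRdma=enable
mmchconfig verbsPorts="mlx5_0/1"
# and to fall back to plain TCP/IP
mmchconfig verbsRdma=disable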

In general you need a machine to act as a gateway between the Ethernet 
and Infiniband fabrics. The configuration for this is minimal; the 
following works just fine on RHEL7 and its derivatives, though you will 
need to change the interface names to suit.

enable the kernel to forward IPv4 packets

sysctl -w net.ipv4.ip_forward=1
echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf

tell the firewall to forward packets between the Ethernet and Infiniband
interfaces

iptables -A FORWARD -i eth0 -o ib0 -j ACCEPT
iptables -A FORWARD -i ib0 -o eth0 -j ACCEPT
echo "-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-A FORWARD -i eth0 -o ib0 -j ACCEPT
-A FORWARD -i ib0 -o eth0 -j ACCEPT" > /etc/sysconfig/iptables

make sure the rules are loaded at boot by the iptables service (from 
the iptables-services package) rather than firewalld, which ignores 
/etc/sysconfig/iptables

systemctl disable --now firewalld
systemctl enable --now iptables

However this approach has "issues", as you now have a single point of 
failure in the system. TL;DR: if the gateway goes away for any reason 
node ejections abound, which means you can't even restart it to apply 
security updates.

Our system is mainly a plain Ethernet (minimum 10Gbps) GPFS fabric 
using plain TCP/IP. However the teaching HPC cluster nodes only have 
1Gbps Ethernet and 40Gbps Infiniband (they were kept from a previous 
system that used Lustre over Infiniband), so their storage traffic goes 
over Infiniband and we hooked a spare port on the ConnectX-4 cards in 
the DSS-G nodes to the Infiniband fabric.

So the Ethernet/Infiniband gateway is only used when the nodes chat to 
one another. Further, when a teaching node responds on the daemon 
network to a compute node, the reply actually goes out over the node's 
Ethernet interface. You could fix that, but the configuration is 
complicated (see the sketch below).
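
For the curious, the fix is source-based policy routing on the teaching 
nodes, so that anything sent from the IPoIB address goes back out over 
ib0. A rough sketch, with entirely made-up addresses and table name:

# 10.20.0.0/16 is the (invented) IPoIB daemon network, 10.20.0.1 the gateway
echo "200 ipoib" >> /etc/iproute2/rt_tables
ip rule add from 10.20.0.0/16 table ipoib
ip route add 10.20.0.0/16 dev ib0 table ipoib
ip route add default via 10.20.0.1 dev ib0 table ipoib

Making that survive a reboot on every node is where it gets tedious.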

This leads to the option of running a pair of nodes that route between 
the networks, with keepalived on the Ethernet side providing redundancy 
using VRRP to shift the gateway IP between the two nodes. You might be 
able to do the same on the Infiniband side, though I have never tried; 
in general it's unnecessary IMHO.
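
The keepalived end of that is only a handful of lines. A minimal 
sketch, with an invented VIP, interface and router ID:

# /etc/keepalived/keepalived.conf on the first gateway
vrrp_instance gpfs_gw {
    state MASTER                # BACKUP on the second gateway
    interface eth0
    virtual_router_id 51
    priority 150                # lower, e.g. 100, on the second gateway
    advert_int 1
    virtual_ipaddress {
        192.168.1.254/24        # the gateway address the nodes route through
    }
}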

I initially wanted to run this on the DSS-G nodes themselves because 
the amount of routed traffic is tiny: in the 110 days since my gateway 
was last rebooted it has forwarded a bit under 16GB (see the counter 
check below). The DSS-G nodes are ideally placed to do the routing, 
having loads of redundant Ethernet connectivity. However it turns out 
running keepalived on the DSS-G nodes is not allowed :-(
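
For anyone wanting the equivalent figure on their own gateway, the byte 
counters on the FORWARD rules give it directly:

iptables -L FORWARD -v -x -n

They reset when the rules are reloaded or the box reboots, so they 
cover the same interval as the uptime.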

So I still have a single point of failure on the system and am debating 
what to do next. Given that RHEL8 has removed driver support for the 
Intel Quickpath Infiniband cards, a wholesale upgrade to 10Gbps 
Ethernet is looking attractive.


JAB.

-- 
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG


