[gpfsug-discuss] Using VMs as quorum / admin nodes in a GPFS infiniband cluster
Jonathan Buzzard
jonathan.buzzard at strath.ac.uk
Thu Jun 17 13:15:32 BST 2021
On 17/06/2021 09:29, Jan-Frode Myklebust wrote:
> *All* nodes needs to be able to communicate on the daemon network. If
> they don't have access to this network, they can't join the cluster.
Not strictly true.
TL;DR if all your NSD/master nodes are connected to both Ethernet and
Infiniband then you will be able to join the node to the cluster. Doing
so is not advisable however, as you will then start experiencing node
evictions left, right and centre.
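As an aside, if you want to see which addresses the daemon network is
actually using, the standard commands will tell you (run on any node in
the cluster):

mmlscluster
mmdiag --network

mmlscluster shows the daemon node name each member joined with, and
mmdiag --network the daemon connections currently established.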
> It doesn't need to be same subnet, it can be routed. But they all have to
> be able to reach each other. If you use IPoIB, you likely need something
> to route between the IPoIB network and the outside world to reach the IP
> you have on your VM. I don't think you will be able to use an IP address
> in the IPoIB range for your VM, unless your vmware hypervisor is
> connected to the IB fabric, and can bridge it.. (doubt that's possible).
ESXi and pretty much every other hypervisor worth its salt has been
able to do PCI pass-through since forever. So whack an Infiniband card in
your ESXi node, pass it through to the VM and the job's a good'un.
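Once the card is passed through, a quick sanity check from inside the
guest is something like the following (assuming a Mellanox HCA and that
the usual Infiniband userspace tools are installed):

lspci | grep -i mellanox
ibstat

If ibstat reports the port state as Active you are in business.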
However it is something a lot of people are completely unaware of,
including Infiniband/Omnipath vendors. The conversation goes: can I run my
fabric manager on a VM in ESXi rather than burn the planet on dedicated
nodes for the job? The response comes back that the fabric is not supported
on ESXi, which shows utter ignorance on the part of the fabric vendor.
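For the record, running the subnet manager on such a VM is no different
to running it anywhere else; on a RHEL-ish guest it amounts to the
following (package and service names as shipped with RHEL7/8):

yum install opensm
systemctl enable --now opensm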
> I've seen some customers avoid using IPoIB, and rather mix an ethernet
> for daemon network, and dedicate the infiniband network to RDMA.
>
What's the point of RDMA for GPFS, lower CPU overhead? To my mind it
creates a lot of inflexibility. If your next cluster uses a different
fabric, migration is now a whole bunch more complicated. It's also a
"minority sport", so something to be avoided unless there is a compelling
reason to use it.
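For reference, RDMA in Spectrum Scale comes down to a couple of cluster
configuration settings, so at least backing it out later is easy enough.
Roughly (the port name here is only an example, use whatever ibstat
reports on your nodes):

mmchconfig verbsRdma=enable
mmchconfig verbsPorts="mlx5_0/1"

Setting verbsRdma=disable puts you back on plain TCP/IP.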
In general you need a machine to act as a gateway between the Ethernet
and Infiniband fabrics. The configuration for this is minimal; the
following works just fine on RHEL7 and its derivatives, though you will
need to change the interface names to suit.
Enable the kernel to forward IPv4 packets:

sysctl -w net.ipv4.ip_forward=1
echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf

Tell the firewall to forward packets between the Ethernet and Infiniband
interfaces, and save the rules so they survive a reboot (the file has to
be in iptables-save format, hence the *filter/COMMIT wrapper):

iptables -A FORWARD -i eth0 -o ib0 -j ACCEPT
iptables -A FORWARD -i ib0 -o eth0 -j ACCEPT
echo "*filter
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-A FORWARD -i eth0 -o ib0 -j ACCEPT
-A FORWARD -i ib0 -o eth0 -j ACCEPT
COMMIT" > /etc/sysconfig/iptables

Enable and start the iptables service (from the iptables-services
package) so the saved rules are loaded at boot; note that firewalld does
not read /etc/sysconfig/iptables:

systemctl enable --now iptables
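To check the gateway is actually forwarding traffic, the FORWARD chain
counters will tell you (the -x just gives exact byte counts):

iptables -L FORWARD -v -n -x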
However this approach has "issues", as you now have a single point of
failure on your system. TL;DR if the gateway goes away for any reason
node ejections abound, so you can't restart it to apply security updates.
On our system it is mainly a plain Ethernet (minimum 10Gbps) GPFS fabric
using plain TCP/IP. However the teaching HPC cluster nodes only have
1Gbps Ethernet and 40Gbps Infiniband (they were kept from a previous
system that used Lustre over Infiniband), so the storage goes over
Infiniband and we hooked a spare port on the ConnectX-4 cards on the
DSS-G nodes up to the Infiniband fabric.
So the Ethernet/Infiniband gateway is only used for the nodes chatting to
one another. Further, when a teaching node responds on the daemon network
to a compute node the reply actually goes out over the node's Ethernet
interface. You could fix that, but it's a complicated configuration.
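Fixing it would mean source-based policy routing on the teaching nodes,
roughly along these lines (the subnet, gateway address and table name are
purely illustrative, not what we run):

echo "100 ipoib" >> /etc/iproute2/rt_tables
ip route add 10.20.0.0/16 dev ib0 table ipoib
ip route add default via 10.20.0.1 dev ib0 table ipoib
ip rule add from 10.20.0.0/16 table ipoib

Which is exactly the sort of per-node fiddling I would rather not have to
maintain.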
This leads to the option of running a pair of nodes that route between
the networks, and then running keepalived on the Ethernet side to provide
redundancy, using VRRP to shift the gateway IP between the two nodes. You
might be able to do the same for the Infiniband side, I have never tried,
but in general it's unnecessary IMHO.
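For anyone who wants to try it, the keepalived side is only a few lines
per node; a minimal sketch, with the interface name and gateway IP being
examples, and the second node carrying the same block with state BACKUP
and a lower priority:

vrrp_instance gpfs_gw {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150
    advert_int 1
    virtual_ipaddress {
        192.168.1.254/24
    }
}

Drop that in /etc/keepalived/keepalived.conf and
systemctl enable --now keepalived.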
I initially wanted to run this on the DSS-G nodes themselves because the
amount of forwarded traffic is tiny: in the 110 days since my gateway was
last rebooted it has forwarded a bit under 16GB. The DSS-G nodes are
ideally placed to do the routing, having loads of redundant Ethernet
connectivity. However it turns out running keepalived on the DSS-G nodes
is not allowed :-(
So I still have a single point of failure on the system and am debating
what to do next. Given that RHEL8 has removed driver support for the
Intel Quickpath Infiniband cards, a wholesale upgrade to 10Gbps Ethernet
is looking attractive.
JAB.
--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG