[gpfsug-discuss] Changing verbsPorts On Single Node

Douglas Duckworth dod2014 at med.cornell.edu
Wed Feb 22 15:57:46 GMT 2017


Hello!

I am an HPC admin at Weill Cornell Medicine on the Upper East Side of
Manhattan.  It's a great place with researchers working in many
computationally demanding fields.  I am asked to do new things all the
time, so it's never boring.  Yesterday we deployed a server intended to
create an atomic-level image of a ribosome.  Pretty serious science!

We have two DDN GridScaler GPFS clusters with around 3PB of storage.  FDR
InfiniBand provides the interconnect.  Our compute nodes are Dell PowerEdge
12G/13G servers running CentOS 6 and 7, and we use SGE for scheduling,
hopefully Slurm soon.  We also have some GPU servers from Penguin
Computing, with GTX 1080s, as well as a new Ryft FPGA accelerator.  I am
hoping our next round of computing power will come from AMD...

Anyway, I've been using Ansible to deploy our new GPFS nodes as well as to
build everything else we need at WCM.  I thought this was complete.
However, it turns out the GPFS client has been trying RDMA over port
mlx4_0/2 when we need it to use mlx4_0/1!  Rather than running mmchconfig
against the entire cluster, I have been trying it locally on the node that
needs the change.  For example:

sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155
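The single-node change above can be wrapped in a small POSIX shell function. This is a minimal sketch under stated assumptions, not an IBM tool: the name `set_verbs_ports` and the `DRY_RUN` variable are hypothetical, the entry check is deliberately loose (device, device/port or device/port/fabric, e.g. `mlx4_0/1`), and by default it only prints the `mmchconfig` command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of GPFS): change verbsPorts on ONE node.
# DRY_RUN unset or 1 (the default) only prints the mmchconfig command.
# The body runs in a subshell, so variables and "set -f" stay local.
set_verbs_ports() (
    node=$1
    ports=$2
    if [ -z "$node" ] || [ -z "$ports" ]; then
        echo "usage: set_verbs_ports NODE 'device/port [device/port ...]'" >&2
        exit 2
    fi
    set -f  # disable globbing while splitting $ports into entries
    for entry in $ports; do
        # Loose check: device, device/port or device/port/fabric.
        if ! printf '%s\n' "$entry" |
            grep -Eq '^[A-Za-z0-9_]+(/[0-9]+(/[0-9]+)?)?$'; then
            echo "invalid verbsPorts entry: $entry" >&2
            exit 1
        fi
    done
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf 'mmchconfig verbsPorts="%s" -i -N %s\n' "$ports" "$node"
    else
        # -i: apply now and keep the change after GPFS restarts.
        sudo mmchconfig "verbsPorts=$ports" -i -N "$node"
    fi
)
```

For example, `set_verbs_ports node155 mlx4_0/1` prints `mmchconfig verbsPorts="mlx4_0/1" -i -N node155`. With `DRY_RUN=0` it runs the same `sudo mmchconfig ... -i -N` command as above, so on its own it does not stop mmchconfig from contacting the other nodes.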

When run locally, the change becomes permanent, and we see RDMA active
after restarting the GPFS service on the node.  However, mmchconfig still
tries to run against all nodes in the cluster!  Of course, I kill it at
the known_hosts step.
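Restarting GPFS on only the affected node and then checking the result could be scripted along these lines. It is a sketch that assumes the `mmshutdown -N`, `mmstartup -N`, `mmgetstate -N` and `mmlsconfig verbsPorts` forms behave as described in the command reference for 4.1.1 and 4.2.1; the `run` and `restart_and_check` names are hypothetical, and commands are only printed unless `DRY_RUN=0`.

```shell
#!/bin/sh
# Hypothetical helpers; DRY_RUN unset or 1 means "print, don't execute".
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        sudo "$@"
    fi
}

# Restart GPFS on one node only, then show its state and the verbsPorts
# values recorded in the cluster configuration.
restart_and_check() (
    node=$1
    if [ -z "$node" ]; then
        echo "usage: restart_and_check NODE" >&2
        exit 2
    fi
    run mmshutdown -N "$node" || exit 1
    run mmstartup -N "$node" || exit 1
    run mmgetstate -N "$node"
    run mmlsconfig verbsPorts
)
```

Stopping if the shutdown or startup fails avoids reporting a stale state. On the node itself, /var/adm/ras/mmfs.log.latest is the usual place to look for the VERBS RDMA messages that show which device and port were opened.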

In addition I tried:

sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 NodeClass=localhost

However, the result was the same.

When using a capital "I" (-I), mmchconfig does attempt to ssh to all
nodes, yet the change does not persist after restarting GPFS.
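The result with `-I` is consistent with how the mmchconfig documentation describes the two flags: `-i` applies a change immediately and makes it permanent, while `-I` applies it immediately but the change does not survive a GPFS restart. A tiny helper makes that choice explicit in deployment scripts; `mmchconfig_flag` is a hypothetical name, not a GPFS command.

```shell
#!/bin/sh
# Hypothetical helper: map the desired lifetime of a change to a flag.
#   persistent -> -i  (takes effect now and survives a GPFS restart)
#   runtime    -> -I  (takes effect now, lost when GPFS restarts)
mmchconfig_flag() {
    case $1 in
        persistent) printf '%s\n' "-i" ;;
        runtime)    printf '%s\n' "-I" ;;
        *)
            echo "usage: mmchconfig_flag persistent|runtime" >&2
            return 2
            ;;
    esac
}

# Prints: mmchconfig verbsPorts=mlx4_0/1 -i -N node155
echo "mmchconfig verbsPorts=mlx4_0/1 $(mmchconfig_flag persistent) -N node155"
```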

So far I have consulted the following documentation:

http://ibm.co/2mcjK3P
http://ibm.co/2lFSInH

Could anyone please help?

We're using GPFS client version 4.1.1-3 on the CentOS 6 nodes and
4.2.1-2 on those running CentOS 7.

Thanks so much!

Best
Doug


Thanks,

Douglas Duckworth, MSc, LFCS
HPC System Administrator
Scientific Computing Unit
Physiology and Biophysics
Weill Cornell Medicine
E: doug at med.cornell.edu
O: 212-746-6305
F: 212-746-8690