<font size=2 face="sans-serif">Hi Simon, </font><br><br><font size=2 face="sans-serif">you have likely run into the following issue:</font><br><br><font size=2 face="sans-serif">APAR IV93896 - </font><a href="https://www-01.ibm.com/support/docview.wss?uid=isg1IV93896"><font size=2 color=blue face="sans-serif">https://www-01.ibm.com/support/docview.wss?uid=isg1IV93896</font></a><br><br><font size=2 face="sans-serif">This problem happens only if you use

different host domains within a cluster and will mostly impact CES. It

is unrelated to upgrades or mixed-version clusters.</font><br><br><font size=2 face="sans-serif">It has been fixed in 5.0.2, therefore

I recommend upgrading soon. </font><br><br><br><font size=2 face="sans-serif">Mit freundlichen Grüßen / Kind regards<br><br>Mathias Dietz<br><br>Spectrum Scale Development - Release Lead Architect (4.2.x)<br>Spectrum Scale RAS Architect<br>---------------------------------------------------------------------------<br>IBM Deutschland<br>Am Weiher 24<br>65451 Kelsterbach<br>Phone: +49 70342744105<br>Mobile: +49-15152801035<br>E-Mail: mdietz@de.ibm.com<br>-----------------------------------------------------------------------------<br>IBM Deutschland Research & Development GmbH<br>Vorsitzender des Aufsichtsrats: Martina Koederitz, Geschäftsführung: Dirk

Wittkopp<br>Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht

Stuttgart, HRB 243294</font><br><br><br><br><font size=1 color=#5f5f5f face="sans-serif">From:      

 </font><font size=1 face="sans-serif">Simon Thompson <S.J.Thompson@bham.ac.uk></font><br><font size=1 color=#5f5f5f face="sans-serif">To:      

 </font><font size=1 face="sans-serif">"gpfsug-discuss@spectrumscale.org"

<gpfsug-discuss@spectrumscale.org></font><br><font size=1 color=#5f5f5f face="sans-serif">Date:      

 </font><font size=1 face="sans-serif">11/01/2019 15:19</font><br><font size=1 color=#5f5f5f face="sans-serif">Subject:    

   </font><font size=1 face="sans-serif">[gpfsug-discuss]

A cautionary tale of upgrades</font><br><font size=1 color=#5f5f5f face="sans-serif">Sent by:    

   </font><font size=1 face="sans-serif">gpfsug-discuss-bounces@spectrumscale.org</font><br><hr noshade><br><br><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">I’ll start by saying this is our experience,

maybe we did something stupid along the way, but just in case others see

similar issues …</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">We have a cluster which contains protocol

nodes, these were all happily running GPFS 5.0.1-2 code. But the cluster

was only 4 nodes + 1 quorum node – manager and quorum functions were

handled by the 4 protocol nodes.</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">Then one day we needed to reboot a protocol

node. We did so and its disk controller appeared to have failed. Oh well,

we thought, we’ll fix that another day; we still have three other quorum

nodes.</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">As they are all getting a little long in

the tooth and were starting to struggle, we thought, well we have DME,

let’s add some new nodes for quorum and token functions. Being shiny and

new they were all installed with GPFS 5.0.2-1 code.</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">All was well.</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">Then some time later, we needed to restart

another of the CES nodes. When we started GPFS on the node, it caused

havoc in our cluster – CES IPs were constantly being assigned, then removed

from the remaining nodes in the cluster. Crap, we thought, and disabled the

node in the cluster. This made things stabilise and as we’d been having

other GPFS issues, we didn’t want service to be interrupted whilst we

dug into this. Besides, it was nearly Christmas and we had conferences

and other work to contend with.</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">More time passes and we’re about to cut

over all our backend storage to some shiny new DSS-G kit, so we plan a

whole system maintenance window. We finish all our data syncs and then

try to start our protocol nodes to test them. No dice … we can’t get

any of the nodes to bring up IPs, the logs look like they start the assignment

process, but then give up.</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">After a lot of digging in the mm Korn shell scripts,

and some studious use of DEBUG=1 when testing, we find that mmcesnetmvaddress

is calling “tsctl shownodes up”. On our protocol nodes, we find output

of the form:</font><br><font size=3 face="Calibri">bear-er-dtn01.bb2.cluster.cluster,rds-aw-ctdb01-data.bb2.cluster.cluster,rds-er-ctdb01-data.bb2.cluster.cluster,bber-irods-ires01-data.bb2.cluster.cluster,bber-irods-icat01-data.bb2.cluster.cluster,bbaw-irods-icat01-data.bb2.cluster.cluster,proto-pg-mgr01.bear.cluster.cluster,proto-pg-pf01.bear.cluster.cluster,proto-pg-dtn01.bear.cluster.cluster,proto-er-mgr01.bear.cluster.cluster,proto-er-pf01.bear.cluster.cluster,proto-aw-mgr01.bear.cluster.cluster,proto-aw-pf01.bear.cluster.cluster</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">Now our DNS name for these nodes is bb2.cluster

… something is repeating the DNS name.</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">So we dig around: resolv.conf, /etc/hosts,

etc. all look good and name resolution seems fine.</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">We look around on the manager/quorum nodes

and they don’t do this cluster.cluster thing. We can’t find anything

else Linux config-wise that looks bad. In fact, the only difference is that

our CES nodes are running 5.0.1-2 and the manager nodes 5.0.2-1. Given

we’re changing the whole storage hardware, we didn’t want to change the

GPFS/NFS/SMB code on the CES nodes (we’ve been bitten before with SMB

packages not working properly in our environment), but we go ahead and

upgrade the GPFS and NFS packages.</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">Suddenly, magically all is working again.

CES starts fine and IPs get assigned OK. And tsctl gives the correct output.</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">So, my supposition is that there is some

incompatibility between 5.0.1-2 and 5.0.2-1 when running CES and the cluster

manager is running on 5.0.2-1. As I said before, I don’t have hard evidence

we did something stupid, but it certainly is fishy. We’re guessing this

same “feature” was the cause of the CES issues we saw when we rebooted

a CES node and the IPs kept deassigning… It looks like all was well as

we added the manager nodes after CES was started, but when a CES node restarted,

things broke.</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">We got everything working again in-house

so didn’t raise a PMR, but if you find yourself in this upgrade path,

beware!</font><br><font size=3 face="Calibri"> </font><br><font size=3 face="Calibri">Simon</font><br>