[gpfsug-discuss] A cautionary tale of upgrades

Mathias Dietz MDIETZ at de.ibm.com
Fri Jan 11 14:58:20 GMT 2019


Hi Simon, 

you likely ran into the following issue:

APAR IV93896 - https://www-01.ibm.com/support/docview.wss?uid=isg1IV93896

This problem happens only if you use different host domains within a 
cluster, and it mostly impacts CES. It is unrelated to upgrades or 
mixed-version clusters.

It has been fixed in 5.0.2, so I recommend upgrading soon. 
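
A rough way to check whether a cluster mixes host domains (a sketch only; 
it assumes the usual mmlscluster column layout and that the mm commands 
are in your PATH):

   # List the distinct DNS domains of the daemon node names.
   # More than one domain here means the cluster can be exposed to IV93896.
   mmlscluster | awk '$1 ~ /^[0-9]+$/ {print $2}' | cut -d. -f2- | sort -u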


Mit freundlichen Grüßen / Kind regards

Mathias Dietz

Spectrum Scale Development - Release Lead Architect (4.2.x)
Spectrum Scale RAS Architect
---------------------------------------------------------------------------
IBM Deutschland
Am Weiher 24
65451 Kelsterbach
Phone: +49 70342744105
Mobile: +49-15152801035
E-Mail: mdietz at de.ibm.com
-----------------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Chair of the Supervisory Board: Martina Koederitz; Managing Director: Dirk 
Wittkopp; Registered office: Böblingen / Registration court: Amtsgericht 
Stuttgart, HRB 243294



From:   Simon Thompson <S.J.Thompson at bham.ac.uk>
To:     "gpfsug-discuss at spectrumscale.org" 
<gpfsug-discuss at spectrumscale.org>
Date:   11/01/2019 15:19
Subject:        [gpfsug-discuss] A cautionary tale of upgrades
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



 
I'll start by saying this is our experience; maybe we did something stupid 
along the way, but just in case others see similar issues ...
 
We have a cluster which contains protocol nodes; these were all happily 
running GPFS 5.0.1-2 code. But the cluster was only 4 nodes + 1 quorum 
node, and manager and quorum functions were handled by the 4 protocol nodes.
 
Then one day we needed to reboot a protocol node. We did so, and its disk 
controller appeared to have failed. Oh well, we thought, we'll fix that 
another day; we still have three other quorum nodes.
 
As they were all getting a little long in the tooth and starting to 
struggle, we thought: well, we have DME, let's add some new nodes for 
quorum and token functions. Being shiny and new, they were all installed 
with GPFS 5.0.2-1 code.
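
(For the record, "adding nodes for quorum and token functions" was just the 
usual sequence, roughly like the sketch below; the node names are made up.)

   # Rough sketch with hypothetical node names - add the nodes, accept the
   # server licence, then make them quorum + manager nodes.
   mmaddnode -N newq01,newq02
   mmchlicense server --accept -N newq01,newq02
   mmchnode --quorum --manager -N newq01,newq02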
 
All was well.
 
Then, some time later, we needed to restart another of the CES nodes. When 
we started GPFS on the node, it caused havoc in our cluster: CES IPs were 
constantly being assigned to, then removed from, the remaining nodes in the 
cluster. Crap, we thought, and disabled the node in the cluster. This made 
things stabilise, and as we'd been having other GPFS issues, we didn't want 
service to be interrupted whilst we dug into this. Besides, it was nearly 
Christmas and we had conferences and other work to contend with.
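
("Disabled the node" here means taking it out of CES service; depending on 
whether you want to suspend it or drop its CES role entirely, that is 
roughly one of the following - a sketch only, with a made-up node name.)

   # Suspend the node so its CES IPs move elsewhere and none get assigned back:
   mmces node suspend -N proto-node01
   # Or remove the CES role from the node altogether:
   mmchnode --ces-disable -N proto-node01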
 
More time passes and we're about to cut over all our backend storage to 
some shiny new DSS-G kit, so we plan a whole-system maintenance window. We 
finish all our data syncs and then try to start our protocol nodes to test 
them. No dice: we can't get any of the nodes to bring up IPs; the logs show 
them starting the assignment process, but then giving up.
 
After a lot of digging in the mm Korn shell scripts, and some studious use 
of DEBUG=1 when testing, we found that mmcesnetmvaddress calls "tsctl 
shownodes up". On our protocol nodes, that gives output of the form:
bear-er-dtn01.bb2.cluster.cluster,rds-aw-ctdb01-data.bb2.cluster.cluster,rds-er-ctdb01-data.bb2.cluster.cluster,bber-irods-ires01-data.bb2.cluster.cluster,bber-irods-icat01-data.bb2.cluster.cluster,bbaw-irods-icat01-data.bb2.cluster.cluster,proto-pg-mgr01.bear.cluster.cluster,proto-pg-pf01.bear.cluster.cluster,proto-pg-dtn01.bear.cluster.cluster,proto-er-mgr01.bear.cluster.cluster,proto-er-pf01.bear.cluster.cluster,proto-aw-mgr01.bear.cluster.cluster,proto-aw-pf01.bear.cluster.cluster
 
Now, the DNS domain for these nodes is bb2.cluster, so something is 
appending the domain name a second time.
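
(If you want to check your own nodes for the same symptom, something like 
the following works - a sketch only, and the ".cluster.cluster" suffix is 
specific to our bb2.cluster domain, so substitute your own.)

   # Split tsctl's comma-separated node list and look for a doubled domain suffix.
   /usr/lpp/mmfs/bin/tsctl shownodes up | tr ',' '\n' | grep '\.cluster\.cluster$'
   # Compare against what the OS itself reports as the FQDN:
   hostname -f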
 
So we dig around: resolv.conf, /etc/hosts etc. all look good, and name 
resolution seems fine.
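
(The sort of checks we mean, roughly:)

   cat /etc/resolv.conf          # search/domain entries look sane
   hostname -f                   # the FQDN the OS reports
   getent hosts $(hostname -s)   # what the resolver returns for the short name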
 
We look around on the manager/quorum nodes and they don't do this 
cluster.cluster thing. We can't find anything else Linux-config-wise that 
looks bad. In fact, the only difference is that our CES nodes are running 
5.0.1-2 and the manager nodes 5.0.2-1. Given we're changing the whole 
storage hardware, we didn't want to change the GPFS/NFS/SMB code on the 
CES nodes (we've been bitten before by SMB packages not working properly 
in our environment), but we go ahead and update the GPFS and NFS 
packages.
 
Suddenly, magically all is working again. CES starts fine and IPs get 
assigned OK. And tsctl gives the correct output.
 
So, my supposition is that there is some incompatibility between 5.0.1-2 
and 5.0.2-1 when running CES while the cluster manager is running 5.0.2-1. 
As I said before, I don't have hard evidence, and maybe we did something 
stupid, but it certainly is fishy. We're guessing this same "feature" was 
the cause of the CES issues we saw when we rebooted a CES node and the IPs 
kept deassigning. It looks like all was well because we added the manager 
nodes after CES was started, but when a CES node restarted, things broke.
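
(If you want to check a cluster for the same mix of levels, something like 
this shows what each node is actually running - a rough sketch that assumes 
RPM-based nodes and a working mmdsh.)

   # Daemon build and package level per node:
   mmdsh -N all "/usr/lpp/mmfs/bin/mmdiag --version"
   mmdsh -N all "rpm -q gpfs.base" | sort
   # The committed cluster level (only moves when you run mmchconfig release=LATEST):
   mmlsconfig minReleaseLevel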
 
We got everything working again in-house, so we didn't raise a PMR, but if 
you find yourself on this upgrade path, beware!
 
Simon
 
 _______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss




