[gpfsug-discuss] ESS bring up the GPFS in recovery group without takeover

Sat Jan 6 11:17:34 GMT 2018

Hi Veera,

Could you please help answer Damir's query?

---------------------------
My question is: is there a way of bringing the IO server back into the mix
without the recovery group takeover happening? Could I just start GPFS
and have it back in the mix as a backup server for the recovery group and,
if so, how do you do that? Right now that server is designated as the primary
server for the recovery group. I would like to have both IO servers in the
mix for redundancy purposes.
---------------------------

Regards, The Spectrum Scale (GPFS) team

------------------------------------------------------------------------------------------------------------------
If you feel that your question can benefit other users of Spectrum Scale
(GPFS), then please post it to the public IBM developerWorks Forum:
https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479

If your query concerns a potential software error in Spectrum Scale (GPFS)
and you have an IBM software maintenance contract, please contact
1-800-237-5511 in the United States or your local IBM Service Center in
other countries.

The forum is informally monitored as time permits and should not be used 
for priority messages to the Spectrum Scale (GPFS) team.

From:   Damir Krstic <damir.krstic at gmail.com>
To:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:   12/22/2017 11:15 PM
Subject:        [gpfsug-discuss] ESS bring up the GPFS in recovery group without takeover
Sent by:        gpfsug-discuss-bounces at spectrumscale.org

It's been a very frustrating couple of months with our 2 ESS systems. IBM
tells us we had the blueflame bug, and they came on site and updated our ESS
to the latest version back in the middle of November. Wednesday night one of
the NSD servers in one of our ESS building blocks kernel panicked. We have no
idea why, and none of the logs are insightful. We have a PMR open with IBM.
I am not very confident we will get to the bottom of what's causing kernel
panics on our IO servers. The system has gone down over 4 times now in 2
months.

When we tried bringing it back up, it rejoined the recovery group and the
IO on the entire cluster locked up until we were able to find a couple of
compute nodes showing a pending state in mmfsadm dump tscomm. Killing GPFS
on those nodes resolved the issue of the filesystem locking up.
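
The search for such nodes could be scripted along the following lines. This
is only a minimal sketch: it assumes the output of mmfsadm dump tscomm has
already been captured per node by some other means, and that a stuck node can
be recognised by the word "pending" appearing in that output. The function
name, the node names and the sample text are all hypothetical and do not
reproduce real mmfsadm output.

```python
def nodes_with_pending(dumps):
    """Return the sorted names of nodes whose captured dump mentions 'pending'.

    `dumps` maps a node name to the text captured from running
    `mmfsadm dump tscomm` on that node. How the text is collected
    (ssh, a parallel shell, etc.) is deliberately left out of this sketch.
    """
    flagged = []
    for node, text in dumps.items():
        # Case-insensitive substring match; no particular line format is assumed.
        if any("pending" in line.lower() for line in text.splitlines()):
            flagged.append(node)
    return sorted(flagged)


if __name__ == "__main__":
    # Made-up sample text for illustration only, not real mmfsadm output.
    sample = {
        "compute01": "connection state: established\n",
        "compute02": "message to server status pending\n",
    }
    print(nodes_with_pending(sample))
```

A plain substring match is used because the exact layout of the dump is not
assumed here; anything it flags would still need to be checked by hand before
GPFS is stopped on that node.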

So far we have never been successful in bringing back an IO server without
the filesystem locking up until we find a node with a pending state in
tscomm. Anyway, the system was stable for a few minutes until the same IO
server that went down on Wednesday night went into an arbitrating state. It
never recovered. We stopped GPFS on that server and IO recovered again. We
left GPFS down and the cluster seems to be OK.

My question is: is there a way of bringing the IO server back into the mix
without the recovery group takeover happening? Could I just start GPFS
and have it back in the mix as a backup server for the recovery group and,
if so, how do you do that? Right now that server is designated as the primary
server for the recovery group. I would like to have both IO servers in the
mix for redundancy purposes.

This ESS situation is beyond frustrating and I don't see an end in sight.

Any help is appreciated.
