[gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned permanently

Jan-Frode Myklebust janfrode at tanso.net
Thu Aug 24 20:56:16 BST 2023


mmvdisk rg change --active is a very common operation. It should be
perfectly safe.

mmvdisk rg change --restart is an option I didn’t know about, so likely not
something that’s commonly used.
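
For reference, the two forms being compared look roughly like this (just a
sketch, reusing the RG and node names from the thread below):

  # temporarily make one node the active server for the whole shared RG
  mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --active ess-n2-hs

  # try to restart a recovery group that has gone down
  mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --restart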

I wouldn’t be too worried about losing the RGs. I don’t think that’s
something that can happen without support being able to help get them
back online. I once had a situation similar to yours, with an RG not wanting
to become active again during an upgrade (around 5 years ago), and I believe
we solved it by rebooting the io-nodes. It must have been some stuck process
I was unable to understand… or was it a CCR issue caused by some nodes
being way back-level? I don’t remember.



  -jf

tor. 24. aug. 2023 kl. 20:22 skrev Walter Sklenka <
Walter.Sklenka at edv-design.at>:

> Hi Jan-Frode!
>
> We did the “switch” with “mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs
> --active ess-n2-hs”
>
> Both nodes were up and we did not see any anomalies. And the rg was
> successfully created with the log groups
>
> Maybe the method of switching the RG (with --active) is a bad idea? (Because
> the manual says:
>
>
> https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=command-mmvdisk-recoverygroup
>
> For a shared recovery group, the mmvdisk recoverygroup change --active
> Node command means to make the specified node the server for all four
> user log groups and the root log group. The specified node therefore
> temporarily becomes the sole active server for the entire shared recovery
> group, leaving the other server idle. This should only be done in unusual
> maintenance situations, since it is normally considered an error condition
> for one of the servers of a shared recovery group to be idle. If the
> keyword DEFAULT is used instead of a server name, it restores the normal
> default balance of log groups, making each of the two servers responsible
> for two user log groups.)
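>
> Per that description, putting things back to the balanced state should
> presumably just be (a sketch, same RG name as above):
>
>   mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --active DEFAULT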
>
>
>
>
> This was the state before we tried the restart; no logs were seen, and we got
> “unable to reset server list”:
>
> ~]$ sudo mmvdisk server list --rg ess3500_ess_n1_hs_ess_n2_hs
>
>
>
>
>
>  node
> number  server                            active   remarks
> ------  --------------------------------  -------  -------
>     98  ess-n1-hs                         yes      configured
>     99  ess-n2-hs                         yes      configured
>
>
>
>
>
> ~]$ sudo mmvdisk recoverygroup list --rg ess3500_ess_n1_hs_ess_n2_hs
>
>
>
>
>
>
>                                                                                                      needs    user
> recovery group               node class                            active  current or master server  service  vdisks  remarks
> ---------------------------  ------------------------------------  ------  ------------------------  -------  ------  -------
> ess3500_ess_n1_hs_ess_n2_hs  ess3500_mmvdisk_ess_n1_hs_ess_n2_hs   no      -                          unknown  0
>
>
>
>
>
> ~]$ ^C
>
> ~]$ sudo mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --restart
>
> mmvdisk:
>
> mmvdisk:
>
> mmvdisk: Unable to reset server list for recovery group
> 'ess3500_ess_n1_hs_ess_n2_hs'.
>
> mmvdisk: Command failed. Examine previous error messages to determine
> cause.
>
>
>
>
>
> Well, in the logs we did not find anything.
>
> And finally we had to delete the RG, because we urgently needed new space.
>
> With the new one we tested again: we did an mmshutdown/mmstartup, and also
> used the --active flag, and all went OK. And now we have data on the RG.
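>
> That test was roughly along these lines (just a sketch; the node list and the
> new RG name are assumptions/placeholders here):
>
>   mmshutdown -N ess-n1-hs,ess-n2-hs
>   mmstartup -N ess-n1-hs,ess-n2-hs
>   mmvdisk rg change --rg <new_rg> --active ess-n2-hs
>   mmvdisk rg change --rg <new_rg> --active DEFAULT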
>
> But we are concerned that this might happen again and that we might not be
> able to re-enable the RG, leading to a disaster.
>
>
>
> So if you have any idea, I would appreciate it very much 😊
>
>
>
> Best regards
>
> Walter
>
> *From:* gpfsug-discuss <gpfsug-discuss-bounces at gpfsug.org> *On Behalf Of *Jan-Frode
> Myklebust
> *Sent:* Thursday, 24 August 2023 14:51
> *To:* gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
> *Subject:* Re: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned
> permanently
>
>
>
> It does sound like "mmvdisk rg change --restart" is the "varyon" command
> you're looking for... but it's not clear why it's failing. I would start by
> checking whether there are any lower-level issues with your cluster. Are your
> nodes healthy on a GPFS level? Does "mmnetverify -N all" say the network is
> OK? Does "mmhealth node show -N all" indicate any issues? Have you checked
> mmfs.log.latest?
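>
> A quick health pass along those lines could look like this (a sketch; the
> mmfs.log.latest location is the usual /var/adm/ras path, assumed here):
>
>   mmgetstate -a
>   mmnetverify -N all
>   mmhealth node show -N all
>   tail -n 200 /var/adm/ras/mmfs.log.latest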
>
>
>
> On Thu, Aug 24, 2023 at 1:41 PM Walter Sklenka <
> Walter.Sklenka at edv-design.at> wrote:
>
>
>
> Hi !
>
> Does anyone happen to have experience with the ESS 3500 (no hybrid config,
> only NL-SAS with 5 enclosures)?
>
>
>
> We have issues with a shared recovery group. After creating it, we made a
> test of setting only one node active (maybe not an optimal idea).
>
> But since then the recovery group has been down.
>
> We have opened a PMR but have not received any response so far.
>
>
>
> The RG has no vdisks of any filesystem:
>
> [gpfsadmin at hgess02-m ~]$ ^C
> [gpfsadmin at hgess02-m ~]$ sudo mmvdisk rg change --rg
> ess3500_hgess02_n1_hs_hgess02_n2_hs --restart
> mmvdisk:
> mmvdisk:
> mmvdisk: Unable to reset server list for recovery group
> 'ess3500_hgess02_n1_hs_hgess02_n2_hs'.
> mmvdisk: Command failed. Examine previous error messages to determine
> cause.
>
>
>
> We also tried this (log excerpt):
>
> 2023-08-21_16:57:26.174+0200: [I] Command: tsrecgroupserver
> ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid
> 2023-08-21_16:57:26.201+0200: [I] Recovery group
> ess3500_hgess02_n1_hs_hgess02_n2_hs has resigned permanently
> 2023-08-21_16:57:26.201+0200: [E] Command: err 2: tsrecgroupserver
> ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid
> 2023-08-21_16:57:26.201+0200: Specified entity, such as a disk or file
> system, does not exist.
> 2023-08-21_16:57:26.207+0200: [I] Command: tsrecgroupserver
> ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid.
> 2023-08-21_16:57:26.207+0200: [E] Command: err 212: tsrecgroupserver
> ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid
> 2023-08-21_16:57:26.207+0200: The current file system manager failed and
> no new manager will be appointed. This may cause nodes mounting the file
> system to experience mount failures.
> 2023-08-21_16:57:26.213+0200: [I] Command: tsrecgroupserver
> ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid
> 2023-08-21_16:57:26.213+0200: [E] Command: err 212: tsrecgroupserver
> ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid
> 2023-08-21_16:57:26.213+0200: The current file system manager failed and
> no new manager will be appointed. This may cause nodes mounting the file
> system to experience mount failures.
>
>
>
>
>
> For us it is crucial to know what we can do if this happens again (it has no
> vdisks yet, so it is not critical).
>
>
>
> Do you know whether there is an undocumented way to “vary on”, or reactivate,
> a recovery group?
>
> The doc:
>
>
> https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=rgi-recovery-group-issues-shared-recovery-groups-in-ess
>
> tells you to mmshutdown and mmstartup, but the RGCM says nothing.
>
> When trying to execute any vdisk command, it only says “rg down”; we have no
> idea how we could recover from that without deleting the RG (I hope it will
> never happen once we have vdisks on it).
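>
> For reference, that documented fallback boils down to restarting GPFS on the
> two RG server nodes, roughly (a sketch; the server node names are assumed
> from the RG name, so verify them with the server list first):
>
>   mmvdisk server list --rg ess3500_hgess02_n1_hs_hgess02_n2_hs
>   mmshutdown -N hgess02-n1-hs,hgess02-n2-hs
>   mmstartup -N hgess02-n1-hs,hgess02-n2-hs
>   mmvdisk recoverygroup list --rg ess3500_hgess02_n1_hs_hgess02_n2_hs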
>
>
>
>
>
>
>
> Have a nice day
>
> Walter
>
>
>
>
>
>
>
>
>
> Mit freundlichen Grüßen
> *Walter Sklenka*
> *Technical Consultant*
>
>
>
> EDV-Design Informationstechnologie GmbH
> Giefinggasse 6/1/2, A-1210 Wien
> Tel: +43 1 29 22 165-31
> Fax: +43 1 29 22 165-90
> E-Mail: sklenka at edv-design.at
> Internet: www.edv-design.at
>
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
>