[gpfsug-discuss] gpfs client expels

Salvatore Di Nardo sdinardo at ebi.ac.uk
Wed Aug 20 08:57:23 BST 2014


Still having problems. Here are some more detailed examples:

*EXAMPLE 1:*

            *EBI5-220 (CLIENT)*
            Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a
            reply from node <GSS02B IP> gss02b*
            Tue Aug 19 11:03:04.981 2014: Request sent to <GSS02A IP>
            (gss02a in GSS.ebi.ac.uk) to expel <GSS02B IP> (gss02b in
            GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk
            Tue Aug 19 11:03:04.982 2014: This node will be expelled
            from cluster GSS.ebi.ac.uk due to expel msg from <EBI5-220
            IP> (ebi5-220)
            Tue Aug 19 11:03:09.319 2014: Cluster Manager connection
            broke. Probing cluster GSS.ebi.ac.uk
            Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum
            nodes during cluster probe.
            Tue Aug 19 11:03:10.322 2014: Lost membership in cluster
            GSS.ebi.ac.uk. Unmounting file systems.
            Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked. 
            File system: gpfs1  Reason: SGPanic
            Tue Aug 19 11:03:12.066 2014: Connecting to <GSS02A IP>
            gss02a <c1p687>
            Tue Aug 19 11:03:12.070 2014: Connected to <GSS02A IP>
            gss02a <c1p687>
            Tue Aug 19 11:03:17.071 2014: Connecting to <GSS02B IP>
            gss02b <c1p686>
            Tue Aug 19 11:03:17.072 2014: Connecting to <GSS03B IP>
            gss03b <c1p685>
            Tue Aug 19 11:03:17.079 2014: Connecting to <GSS03A IP>
            gss03a <c1p684>
            Tue Aug 19 11:03:17.080 2014: Connecting to <GSS01B IP>
            gss01b <c1p683>
            Tue Aug 19 11:03:17.079 2014: Connecting to <GSS01A IP>
            gss01a <c1p1>
            Tue Aug 19 11:04:23.105 2014: Connected to <GSS02B IP>
            gss02b <c1p686>
            Tue Aug 19 11:04:23.107 2014: Connected to <GSS03B IP>
            gss03b <c1p685>
            Tue Aug 19 11:04:23.112 2014: Connected to <GSS03A IP>
            gss03a <c1p684>
            Tue Aug 19 11:04:23.115 2014: Connected to <GSS01B IP>
            gss01b <c1p683>
            Tue Aug 19 11:04:23.121 2014: Connected to <GSS01A IP>
            gss01a <c1p1>
            Tue Aug 19 11:12:28.992 2014: Node <GSS02A IP> (gss02a in
            GSS.ebi.ac.uk) is now the Group Leader.

            *GSS02B (NSD SERVER)*
            ...
            Tue Aug 19 11:03:17.070 2014: Killing connection from
            *<EBI5-220 IP>* because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:03:25.016 2014: Killing connection from
            <EBI5-102 IP> because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:03:28.080 2014: Killing connection from
            *<EBI5-220 IP>* because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:03:36.019 2014: Killing connection from
            <EBI5-102 IP> because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:03:39.083 2014: Killing connection from
            *<EBI5-220 IP>* because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:03:47.023 2014: Killing connection from
            <EBI5-102 IP> because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:03:50.088 2014: Killing connection from
            *<EBI5-220 IP>* because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:03:52.218 2014: Killing connection from
            <EBI5-043 IP> because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:03:58.030 2014: Killing connection from
            <EBI5-102 IP> because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:04:01.092 2014: Killing connection from
            *<EBI5-220 IP>* because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:04:03.220 2014: Killing connection from
            <EBI5-043 IP> because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:04:09.034 2014: Killing connection from
            <EBI5-102 IP> because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:04:12.096 2014: Killing connection from
            *<EBI5-220 IP>* because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:04:14.224 2014: Killing connection from
            <EBI5-043 IP> because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:04:20.037 2014: Killing connection from
            <EBI5-102 IP> because the group is not ready for it to
            rejoin, err 46
            Tue Aug 19 11:04:23.103 2014: Accepted and connected to
            *<EBI5-220 IP>* ebi5-220 <c0n618>
            ...

            *GSS02A (NSD SERVER)*
            Tue Aug 19 11:03:04.980 2014: Expel <GSS02B IP> (gss02b)
            request from <EBI5-220 IP> (ebi5-220 in
            ebi-cluster.ebi.ac.uk). Expelling: <EBI5-220 IP> (ebi5-220
            in ebi-cluster.ebi.ac.uk)
            Tue Aug 19 11:03:12.069 2014: Accepted and connected to
            <EBI5-220 IP> ebi5-220 <c0n618>


===============================================
*EXAMPLE 2*:

            *EBI5-038*
            Tue Aug 19 11:32:34.227 2014: *Disk lease period expired in
            cluster GSS.ebi.ac.uk. Attempting to reacquire lease.*
            Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing
            cluster GSS.ebi.ac.uk*
            Tue Aug 19 11:35:24.265 2014: Close connection to <GSS02A
            IP> gss02a <c1n2> (Connection reset by peer). Attempting
            reconnect.
            Tue Aug 19 11:35:24.865 2014: Close connection to <EBI5-014
            IP> ebi5-014 <c1n457> (Connection reset by peer). Attempting
            reconnect.
            ...
            MANY MORE RESETS BY PEER
            ...
            Tue Aug 19 11:35:25.096 2014: Close connection to <EBI5-167
            IP> ebi5-167 <c1n155> (Connection reset by peer). Attempting
            reconnect.
            Tue Aug 19 11:35:25.267 2014: Connecting to <GSS02A IP>
            gss02a <c1n2>
            Tue Aug 19 11:35:25.268 2014: Close connection to <GSS02A
            IP> gss02a <c1n2> (Connection failed because destination is
            still processing previous node failure)
            Tue Aug 19 11:35:26.267 2014: Retry connection to <GSS02A
            IP> gss02a <c1n2>
            Tue Aug 19 11:35:26.268 2014: Close connection to <GSS02A
            IP> gss02a <c1n2> (Connection failed because destination is
            still processing previous node failure)
            Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum
            nodes during cluster probe.
            Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster
            GSS.ebi.ac.uk. Unmounting file systems.*

            *GSS02a*
            Tue Aug 19 11:35:24.263 2014: Node <EBI5-038 IP> (ebi5-038
            in ebi-cluster.ebi.ac.uk) *is being expelled because of an
            expired lease.* Pings sent: 60. Replies received: 60.




In example 1 it seems that an NSD server was not replying to the client,
but the servers seem to be working fine. How can I trace the problem
better in order to solve it?
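
One way to put numbers on example 1 is to measure how long each peer takes
to go from "Connecting to" to "Connected to" in the client log. The
following is only a minimal sketch: it assumes each log entry sits on a
single line (the excerpts above are wrapped by the mail client) and that
the address field contains no spaces. The function name and the regular
expression are illustrative, not part of any GPFS tooling.

```python
import re
from datetime import datetime

# Assumed line shape, based on the excerpts above re-joined onto one line:
#   Tue Aug 19 11:03:17.071 2014: Connecting to <ip> gss02b <c1p686>
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d+ \d\d:\d\d:\d\d\.\d+ \d{4}): "
    r"(?P<verb>Connecting|Connected) to \S+ (?P<node>\S+)"
)

def connect_delays(lines):
    """Return {node: seconds from first 'Connecting to' to 'Connected to'}."""
    started, delays = {}, {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a connect event, skip it
        ts = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S.%f %Y")
        node = m["node"]
        if m["verb"] == "Connecting":
            started.setdefault(node, ts)  # keep the first attempt only
        elif node in started:
            delays[node] = (ts - started.pop(node)).total_seconds()
    return delays
```

Fed the EBI5-220 entries above, this would show gss02a reconnecting within
milliseconds while the other servers take about 66 seconds, which is the
window during which gss02b was still refusing the rejoin.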

In example 2 it seems to me that for some reason the manager is not
renewing the leases in time. When this happens, it is not a single client:
loads of them fail to get their lease renewed. Why is this happening? How
can I trace it back to the source of the problem?
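
To check whether the lease expirations in example 2 really arrive in
bursts, the expel messages in the manager log can be counted per minute.
Again, this is only a sketch under the same assumptions as before (one log
entry per line, no spaces in the address field); the helper name is
hypothetical and the pattern is taken from the GSS02a message quoted above.

```python
import re
from collections import Counter

# Assumed line shape, from the GSS02a excerpt above re-joined onto one line:
#   Tue Aug 19 11:35:24.263 2014: Node <ip> (ebi5-038 in ebi-cluster.ebi.ac.uk)
#   is being expelled because of an expired lease. Pings sent: 60. ...
EXPEL_RE = re.compile(
    r"^\w{3} \w{3} +\d+ (?P<minute>\d\d:\d\d):\d\d\.\d+ \d{4}: "
    r"Node \S+ \((?P<node>\S+) in \S+\) "
    r"is being expelled because of an expired lease"
)

def lease_expels_per_minute(lines):
    """Count 'expired lease' expels per HH:MM bucket."""
    per_minute = Counter()
    for line in lines:
        m = EXPEL_RE.match(line.strip())
        if m:
            per_minute[m["minute"]] += 1
    return per_minute
```

A minute with many expels would support the idea that the problem is on
the manager or network side rather than on any individual client.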



Thanks in advance for any tips.

Regards,
Salvatore





