[gpfsug-discuss] Node expels
Tomer Perry
TOMP at il.ibm.com
Thu Jan 17 11:46:19 GMT 2019
Simon,
Take a look at
http://files.gpfsug.org/presentations/2018/USA/Scale_Network_Flow-0.8.pdf
slide 13.
Regards,
Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: tomp at il.ibm.com
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel: +1 720 3422758
Israel Tel: +972 3 9188625
Mobile: +972 52 2554625
From: Simon Thompson <S.J.Thompson at bham.ac.uk>
To: "gpfsug-discuss at spectrumscale.org"
<gpfsug-discuss at spectrumscale.org>
Date: 17/01/2019 13:35
Subject: [gpfsug-discuss] Node expels
Sent by: gpfsug-discuss-bounces at spectrumscale.org
We've recently been seeing quite a few node expels with messages of the
form:
2019-01-17_11:19:30.882+0000: [W] The TCP connection to IP address
10.20.0.58 proto-pg-pf01.bear.cluster <c0n236> (socket 153) state is
unexpected: state=1 ca_state=4 snd_cwnd=1 snd_ssthresh=5 unacked=5
probes=0 backoff=7 retransmits=7 rto=26496000 rcv_ssthresh=102828 rtt=6729
rttvar=12066 sacked=0 retrans=1 reordering=3 lost=5
2019-01-17_11:19:30.882+0000: [I] tscCheckTcpConn: Sending debug data
collection request to node 10.20.0.58 proto-pg-pf01.bear.cluster
2019-01-17_11:19:30.882+0000: Sending request to collect TCP debug data to
proto-pg-pf01.bear.cluster localNode
2019-01-17_11:19:30.882+0000: [I] Calling user exit script
gpfsSendRequestToNodes: event sendRequestToNodes, Async command
/usr/lpp/mmfs/bin/mmcommon.
2019-01-17_11:24:52.611+0000: [E] Timed out in 300 seconds waiting for a
commMsgCheckMessages reply from node 10.20.0.58
proto-pg-pf01.bear.cluster. Sending expel message.
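The counters in the [W] line are easier to compare across expels once they are pulled out into named values. The following is a minimal sketch, not a GPFS tool: it assumes the warning keeps the `name=value` layout shown above, and `parse_tcp_fields` is a hypothetical helper name.

```python
import re

# Warning text copied from the log above, with the mail line wraps joined.
LINE = (
    "2019-01-17_11:19:30.882+0000: [W] The TCP connection to IP address "
    "10.20.0.58 proto-pg-pf01.bear.cluster <c0n236> (socket 153) state is "
    "unexpected: state=1 ca_state=4 snd_cwnd=1 snd_ssthresh=5 unacked=5 "
    "probes=0 backoff=7 retransmits=7 rto=26496000 rcv_ssthresh=102828 "
    "rtt=6729 rttvar=12066 sacked=0 retrans=1 reordering=3 lost=5"
)


def parse_tcp_fields(line):
    """Return every name=value counter in the line as a dict of ints."""
    return {name: int(value) for name, value in re.findall(r"(\w+)=(\d+)", line)}


fields = parse_tcp_fields(LINE)
print(fields["retransmits"], fields["lost"], fields["snd_cwnd"])
```

Run over the mmfs log on the expelling node, the same helper would make it straightforward to check whether every expel shows the same pattern of retransmits and loss.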
On the client node, we see messages of the form:
2019-01-17_11:19:31.101+0000: [N] sdrServ: Received Tcp data collection
request from 10.10.0.33
2019-01-17_11:19:31.102+0000: [N] GPFS will attempt to collect Tcp debug
data on this node.
2019-01-17_11:24:52.838+0000: [N] sdrServ: Received expel data collection
request from 10.10.0.33
2019-01-17_11:24:52.838+0000: [N] GPFS will attempt to collect debug data
on this node.
2019-01-17_11:25:02.741+0000: [N] This node will be expelled from cluster
rds.gpfs.servers due to expel msg from 10.10.12.41 (b
ber-les-nsd01-data.bb2.cluster in rds.gpfs.server
2019-01-17_11:25:03.160+0000: [N] sdrServ: Received expel data collection
request from 10.20.0.56
The expels always appear to involve a specific type of hardware with the
same Ethernet controller, though the nodes are split across three data
centres and we aren't seeing congestion on the links between them.
The node I listed above isn't actually doing anything either, as the
software on it is still being installed (i.e. it's not doing GPFS or any
other I/O beyond a couple of home directories).
Any suggestions on what "(socket 153) state is unexpected" means?
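One possible reading, sketched here under a stated assumption: the field names in the warning resemble those of the Linux `struct tcp_info` (`tcpi_state`, `tcpi_ca_state`, `tcpi_rto`, and so on). If GPFS is printing those values unmodified, which this thread does not confirm, `state` and `ca_state` can be decoded with the kernel's enumerations, and `rto` and `rtt` are in microseconds. The `describe` helper below is hypothetical; the numeric values are copied from the log message.

```python
# Linux TCP socket states (kernel enum, include/net/tcp_states.h).
TCP_STATES = {
    1: "ESTABLISHED", 2: "SYN_SENT", 3: "SYN_RECV", 4: "FIN_WAIT1",
    5: "FIN_WAIT2", 6: "TIME_WAIT", 7: "CLOSE", 8: "CLOSE_WAIT",
    9: "LAST_ACK", 10: "LISTEN", 11: "CLOSING",
}

# Linux congestion-control states (enum tcp_ca_state).
CA_STATES = {0: "Open", 1: "Disorder", 2: "CWR", 3: "Recovery", 4: "Loss"}


def describe(fields):
    """Decode a few counters, ASSUMING they are raw Linux tcp_info values."""
    return {
        "state": TCP_STATES.get(fields["state"], "unknown"),
        "ca_state": CA_STATES.get(fields["ca_state"], "unknown"),
        "rto_seconds": fields["rto"] / 1_000_000,  # assumed microseconds
        "rtt_ms": fields["rtt"] / 1_000,           # assumed microseconds
    }


# Values taken from the [W] message quoted above.
observed = {"state": 1, "ca_state": 4, "rto": 26496000, "rtt": 6729}
summary = describe(observed)
print(summary)
```

Under that assumption the socket would be established but in loss recovery, with the retransmission timeout backed off to roughly 26.5 seconds. That would point towards packet loss on the path to that node rather than load on it, though the slide referenced in the reply is the better authority on what GPFS itself treats as "unexpected".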
Thanks
Simon