[gpfsug-discuss] Fw: What is this error message telling me?

John Lewars jlewars at us.ibm.com
Thu Sep 27 17:37:34 BST 2018


Hi Kevin,

The message below indicates that the mmfsd code had a pending message on a 
socket and, when it looked at the low-level socket statistics, GPFS found 
indications that the TCP connection was in a 'bad state'.  GPFS considers 
a connection to be in a 'bad state' if:

1) the CA_STATE for the socket is not 0 (the open state), which means 
the state must be disorder, CWR, recovery, or loss.  See this paper for 
more details on CA_STATE: 
https://wiki.aalto.fi/download/attachments/69901948/TCP-CongestionControlFinal.pdf

or

2) the RTO is greater than 10 seconds and there are unacknowledged 
messages pending on the socket (unacked > 0). 
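
As a rough illustration of the two conditions above, here is a minimal 
Python sketch.  It is not the mmfsd code; the function and constant names 
are hypothetical, and it assumes rto is expressed in microseconds, as the 
example value in the log message suggests.

```python
# Hypothetical restatement of the two 'bad state' checks; not the mmfsd source.
TCP_CA_OPEN = 0               # ca_state value for the open state
RTO_LIMIT_USEC = 10_000_000   # 10 seconds, assuming rto is in microseconds


def is_bad_state(ca_state: int, unacked: int, rto_usec: int) -> bool:
    """Return True if either condition described above holds."""
    # Condition 1: the congestion-avoidance state is anything but open.
    not_open = ca_state != TCP_CA_OPEN
    # Condition 2: RTO above 10 seconds while data is still unacknowledged.
    slow_and_pending = rto_usec > RTO_LIMIT_USEC and unacked > 0
    return not_open or slow_and_pending


# Values taken from the log message quoted later in this thread.
print(is_bad_state(ca_state=1, unacked=3, rto_usec=27008000))
```

With the values from the message, both conditions hold: ca_state is 
non-zero, and the RTO exceeds 10 seconds with 3 unacknowledged messages.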

In the example below we see that rto=27008000.  The value is reported in 
microseconds, so the non-fast-path TCP retransmission timeout is about 27 
seconds, which probably means the connection has experienced significant 
packet loss.  If no expel followed this message, I would suspect transient 
packet loss that the connection recovered from.
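
To pull these fields out of a log for many connections at once, something 
like the following Python sketch may help.  The regular expression and the 
function name are my own assumptions, based only on the single message 
format quoted in this thread; other releases may format the line 
differently.

```python
import re

# Matches the trailing "ca_state=N unacked=N rto=N" fields of the message.
# The pattern is an assumption derived from the one example in this thread.
FIELDS = re.compile(r"ca_state=(\d+)\s+unacked=(\d+)\s+rto=(\d+)")


def parse_tcp_state(line: str):
    """Return the three fields as a dict, or None if the line does not match."""
    match = FIELDS.search(line)
    if match is None:
        return None
    ca_state, unacked, rto_usec = (int(v) for v in match.groups())
    return {
        "ca_state": ca_state,
        "unacked": unacked,
        # rto is treated as microseconds and converted to seconds.
        "rto_seconds": rto_usec / 1_000_000,
    }


line = ("2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address "
        "1.2.3.4 some client <c0n509> (socket 442) state is unexpected: "
        "ca_state=1 unacked=3 rto=27008000")
print(parse_tcp_state(line))
```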

There are plenty of places to find more details on RTO, but you might 
want to start with Wikipedia (
https://en.wikipedia.org/wiki/Transmission_Control_Protocol), which states:

In addition, senders employ a retransmission timeout (RTO) that is based 
on the estimated round-trip time (or RTT) between the sender and receiver, 
as well as the variance in this round trip time. The behavior of this 
timer is specified in RFC 6298. There are subtleties in the estimation of 
RTT. For example, senders must be careful when calculating RTT samples for 
retransmitted packets; typically they use Karn's Algorithm or TCP 
timestamps (see RFC 1323). These individual RTT samples are then averaged 
over time to create a Smoothed Round Trip Time (SRTT) using Jacobson's 
algorithm. This SRTT value is what is finally used as the round-trip time 
estimate. 
[. . .]
Reliability is achieved by the sender detecting lost data and 
retransmitting it. TCP uses two primary techniques to identify loss. 
Retransmission timeout (abbreviated as RTO) and duplicate cumulative 
acknowledgements (DupAcks). 
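
The RTO calculation described in that excerpt can be sketched in Python as 
follows.  This follows the formulas and constants in RFC 6298 (alpha = 1/8, 
beta = 1/4, K = 4, a 1-second floor, doubling on timeout); the class name, 
the clock granularity, and the 60-second cap are illustrative assumptions, 
and real TCP stacks use their own bounds.

```python
ALPHA = 1 / 8      # SRTT smoothing gain (RFC 6298)
BETA = 1 / 4       # RTTVAR smoothing gain (RFC 6298)
K = 4              # variance multiplier (RFC 6298)
MIN_RTO = 1.0      # seconds; RFC 6298 floor, real stacks often use less
MAX_RTO = 60.0     # seconds; an assumed cap (RFC 6298 allows one >= 60 s)


class RtoEstimator:
    """Illustrative RFC 6298 estimator; all times are in seconds."""

    def __init__(self, clock_granularity: float = 0.001):
        self.g = clock_granularity
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # initial RTO before any RTT sample

    def on_rtt_sample(self, r: float) -> float:
        if self.srtt is None:
            # First measurement.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - r)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * r
        self.rto = max(MIN_RTO, self.srtt + max(self.g, K * self.rttvar))
        return self.rto

    def on_timeout(self) -> float:
        # Exponential backoff: each expiry of the timer doubles the RTO.
        self.rto = min(self.rto * 2, MAX_RTO)
        return self.rto


est = RtoEstimator()
est.on_rtt_sample(0.5)
print([est.on_timeout() for _ in range(4)])
```

The backoff step is the relevant part here: each retransmission timeout 
doubles the RTO, so a run of consecutive timeouts is one way the value can 
climb well past the 10-second threshold mentioned above.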


Note that older versions of the Spectrum Scale code had a third criterion 
when checking for 'bad state': whether unacked was greater than 8.  That 
check would sometimes flag a socket that was working fine, so it has been 
removed via APAR IJ02566.  All Spectrum Scale V5 code has this fix, and 
the 4.2.X code stream picked it up in PTF 7 (4.2.3.7 ships APAR IJ02566).
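
If you need to check a list of node versions against that rule, a small 
Python sketch such as the one below could be used.  The function name and 
the dotted version-string format are assumptions on my part; the rule it 
encodes is only what is stated above (any V5 level, or 4.2.3 at PTF 7 or 
later).

```python
def has_ij02566(version: str) -> bool:
    """Return True if the given level should include the IJ02566 fix.

    Encodes only the rule stated above; the 'V.R.M.F' string format
    is an assumption.
    """
    parts = tuple(int(p) for p in version.split("."))
    # Pad short versions such as "4.2.3" with zeros to four fields.
    parts = parts + (0,) * (4 - len(parts))
    if parts[0] >= 5:
        return True
    return parts[:3] == (4, 2, 3) and parts[3] >= 7


for level in ("4.2.2.3", "4.2.3.6", "4.2.3.7", "5.0.1.0"):
    print(level, has_ij02566(level))
```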

More details on debugging expels using these TCP connection messages are 
in the presentation you referred to, which I posted here:
https://www.ibm.com/developerworks/community/wikis/home?lang=en_us#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/DEBUG%20Expels

Regards,
John Lewars 
Technical Computing Development, IBM Poughkeepsie


----- Forwarded by Lyle Gayne/Poughkeepsie/IBM on 09/27/2018 11:15 AM 
-----


Hi All, 

2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address 1.2.3.4 
some client <c0n509> (socket 442) state is unexpected: ca_state=1 
unacked=3 rto=27008000

I'm seeing errors like the above and trying to track down the root cause.  
I know that at last week’s GPFS User Group meeting at ORNL this very error 
message was discussed, but I don’t recall the details and the slides 
haven’t been posted to the website yet.  IIRC, the “rto” is significant … 

I’ve Googled, but haven’t gotten any hits, nor have I found anything in 
the GPFS 4.2.2 Problem Determination Guide.

Thanks in advance…

—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and 
Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

