[gpfsug-discuss] Reasons for DiskLeaseThread Overloaded

Felipe Knop knop at us.ibm.com
Wed Feb 15 14:59:19 GMT 2023


Walter,

Thanks for the details.

The stack trace below captures the lease thread in the middle of sending the “lease” RPC. That operation is normally non-blocking, and we rarely block while sending the RPC, but the stack trace does not show whether anything was blocking the thread before the point where the RPC is sent.

At a first glance:

2023-02-14_19:44:07.430+0100: [W]  counter: 0 (mark-idle: 0 mark-active: 0 pre-work: 0 post-work: 0) sched: (nvcsw: 0 nivcsw: 10)

I believe nivcsw: 10 means that the thread was scheduled off the CPU involuntarily ten times, possibly indicating that there is some CPU contention going on.
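
For reference, those counters come straight from the kernel. Below is a minimal sketch, assuming a Linux node, that samples the voluntary/involuntary context-switch counts for the reported thread over a 10-second window; a steadily climbing nonvoluntary count while the lease deadline is being missed would support the CPU-contention theory. (7294 is the PID from the watchdog line above; for an individual thread the same counters also appear under /proc/<pid>/task/<tid>/status.)

#!/usr/bin/env python3
# Sample context-switch counters for a thread (sketch; assumes Linux).
# TID 7294 is taken from the "PID: 7294" watchdog line above.
import time

TID = 7294

def ctxt_switches(tid):
    counts = {}
    with open(f"/proc/{tid}/status") as f:
        for line in f:
            if line.startswith(("voluntary_ctxt_switches",
                                "nonvoluntary_ctxt_switches")):
                key, val = line.split(":")
                counts[key.strip()] = int(val)
    return counts

before = ctxt_switches(TID)
time.sleep(10)
after = ctxt_switches(TID)
for key, val in before.items():
    print(f"{key}: +{after[key] - val} in 10s")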

Could you open a case to get debug data collected? If the problem can be recreated, I think we will need to recreate it with traces enabled.

Thanks,

  Felipe

----
Felipe Knop        knop at us.ibm.com
GPFS Development and Security
IBM Systems
IBM Building 008
2455 South Rd, Poughkeepsie, NY 12601


From: gpfsug-discuss <gpfsug-discuss-bounces at gpfsug.org> on behalf of Walter Sklenka <Walter.Sklenka at EDV-Design.at>
Reply-To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
Date: Wednesday, February 15, 2023 at 4:23 AM
To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
Subject: [EXTERNAL] Re: [gpfsug-discuss] Reasons for DiskLeaseThread Overloaded

Hi!
This is a “full” sequence in mmfs.log.latest. Fortunately, this was also the last event until now (yesterday evening).

Maybe you can have a look?
2023-02-14_19:43:51.474+0100: [N] Disk lease period expired 0.030 seconds ago in cluster xxx-cluster. Attempting to reacquire the lease.
2023-02-14_19:44:07.430+0100: [W] ------------------[GPFS Critical Thread Watchdog]------------------
2023-02-14_19:44:07.430+0100: [W] PID: 7294 State: R (DiskLeaseThread) is overloaded for more than 8 seconds
2023-02-14_19:44:07.430+0100: [W]  counter: 0 (mark-idle: 0 mark-active: 0 pre-work: 0 post-work: 0) sched: (nvcsw: 0 nivcsw: 10)
2023-02-14_19:44:07.430+0100: [W] Call Trace(PID: 7294):
2023-02-14_19:44:07.431+0100: [W] #0: 0x000055CABE4A56AB NodeConn::sendMessage(TcpConn**, iovec*, int, unsigned char, int, int, int, unsigned int, DestTag*, int*, unsigned long long*, unsigned long long*, unsigned int*, CondvarName, vsendCallback_t*) + 0x42B at ??:0
2023-02-14_19:44:07.432+0100: [W] #1: 0x000055CABE4A595F llc_send_msg(ClusterConfiguration*, NodeAddr, iovec*, int, unsigned char, int, int, int, unsigned int, DestTag*, int*, TcpConn**, unsigned long long*, unsigned long long*, unsigned int*, CondvarName, vsendCallback_t*, int, unsigned int) + 0xDF at ??:0
2023-02-14_19:44:07.437+0100: [W] #2: 0x000055CABE479A55 MsgRecord::send() + 0x1345 at ??:0
2023-02-14_19:44:07.438+0100: [W] #3: 0x000055CABE47A169 tscSendInternal(ClusterConfiguration*, unsigned int, unsigned char, int, int, NodeAddr*, TscReply*, TscScatteredBuff*, int, int (*)(void*, ClusterConfiguration*, int, NodeAddr*, TscReply*), void*, ChainedCallback**, __va_list_tag*) + 0x339 at ??:0
2023-02-14_19:44:07.439+0100: [W] #4: 0x000055CABE47C39A tscSendWithCallback(ClusterConfiguration*, unsigned int, unsigned char, int, NodeAddr*, TscReply*, int (*)(void*, ClusterConfiguration*, int, NodeAddr*, TscReply*), void*, void**, int, ...) + 0x1DA at ??:0
2023-02-14_19:44:07.440+0100: [W] #5: 0x000055CABE5F9853 MyLeaseState::renewLease(NodeAddr, TickTime) + 0x6E3 at ??:0
2023-02-14_19:44:07.440+0100: [W] #6: 0x000055CABE5FA682 ClusterConfiguration::checkAndRenewLease(TickTime) + 0x192 at ??:0
2023-02-14_19:44:07.441+0100: [W] #7: 0x000055CABE5FAAC6 ClusterConfiguration::RunLeaseChecks(void*) + 0x366 at ??:0
2023-02-14_19:44:07.441+0100: [W] #8: 0x000055CABDF2B662 Thread::callBody(Thread*) + 0x42 at ??:0
2023-02-14_19:44:07.441+0100: [W] #9: 0x000055CABDF18680 Thread::callBodyWrapper(Thread*) + 0xA0 at ??:0
2023-02-14_19:44:07.441+0100: [W] #10: 0x00007F3B7563D1CA start_thread + 0xEA at ??:0
2023-02-14_19:44:07.441+0100: [W] #11: 0x00007F3B7435BE73 __GI___clone + 0x43 at ??:0
2023-02-14_19:44:10.512+0100: [N] Disk lease reacquired in cluster xxx-cluster.
2023-02-14_19:44:10.512+0100: [N] Disk lease period expired 7.970 seconds ago in cluster xxx-cluster. Attempting to reacquire the lease.
2023-02-14_19:44:12.563+0100: [N] Disk lease reacquired in cluster xxx-cluster.
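
In case it helps, here is a minimal sketch (assuming Python 3 on the node, and the exact message format shown above) that scans the log for these lease-expiry lines and reports how far overdue each one was, e.g. python3 lease_gaps.py /var/adm/ras/mmfs.log.latest:

#!/usr/bin/env python3
# Summarize "Disk lease period expired N seconds ago" events (sketch;
# the regex matches the log format shown in the excerpt above).
import re
import sys

pattern = re.compile(
    r"^(\S+): \[N\] Disk lease period expired (\d+\.\d+) seconds ago")

events = []
with open(sys.argv[1]) as log:  # e.g. /var/adm/ras/mmfs.log.latest
    for line in log:
        m = pattern.match(line)
        if m:
            events.append((m.group(1), float(m.group(2))))

for stamp, overdue in events:
    print(f"{stamp}  lease overdue by {overdue:.3f}s")
if events:
    print(f"{len(events)} events, worst: {max(e[1] for e in events):.3f}s")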

Thank you very much!
Best regards
Walter

From: gpfsug-discuss <gpfsug-discuss-bounces at gpfsug.org> On Behalf Of Felipe Knop
Sent: Wednesday, 15 February 2023 00:06
To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
Subject: Re: [gpfsug-discuss] Reasons for DiskLeaseThread Overloaded

All,

These messages like

[W] ------------------[GPFS Critical Thread Watchdog]------------------

indicate that a “critical thread”, in this case the lease thread, was apparently blocked for longer than expected. This is usually not caused by network delays, but by excessive CPU load, blockage while accessing the local file system, or mutex contention.
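
One way to check the CPU-load angle is to record load and iowait continuously and match spikes against the watchdog timestamps in mmfs.log. A minimal sketch, assuming a Linux node (only standard /proc files are read):

#!/usr/bin/env python3
# Record 1-min load average and CPU iowait every 5 seconds so spikes
# can be matched against watchdog timestamps (sketch; assumes Linux).
import time

def cpu_ticks():
    # /proc/stat first line: cpu user nice system idle iowait irq ...
    fields = open("/proc/stat").readline().split()
    return int(fields[5]), sum(int(x) for x in fields[1:])

prev_wait, prev_total = cpu_ticks()
while True:
    time.sleep(5)
    wait, total = cpu_ticks()
    load1 = open("/proc/loadavg").read().split()[0]
    dw, dt = wait - prev_wait, total - prev_total
    pct = 100.0 * dw / dt if dt else 0.0
    print(time.strftime("%Y-%m-%d_%H:%M:%S"), f"load1={load1} iowait={pct:.1f}%")
    prev_wait, prev_total = wait, total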

Do you have other samples of the message, with a more complete stack trace? Or was the instance below the only one?

  Felipe

----
Felipe Knop        knop at us.ibm.com
GPFS Development and Security
IBM Systems
IBM Building 008
2455 South Rd, Poughkeepsie, NY 12601


From: gpfsug-discuss <gpfsug-discuss-bounces at gpfsug.org> on behalf of Walter Sklenka <Walter.Sklenka at EDV-Design.at>
Reply-To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
Date: Tuesday, February 14, 2023 at 10:49 AM
To: "gpfsug-discuss at gpfsug.org" <gpfsug-discuss at gpfsug.org>
Subject: [EXTERNAL] Re: [gpfsug-discuss] Reasons for DiskLeaseThread Overloaded

Hi!
I started with 5.1.6.0 and am now at:
[root@ogpfs1 ~]# mmfsadm dump version
Dump level: verbose
Build branch "5.1.6.1 ".

The messages started from the beginning.



From: gpfsug-discuss <gpfsug-discuss-bounces at gpfsug.org> On Behalf Of Christian Vieser
Sent: Tuesday, 14 February 2023 15:34
To: gpfsug-discuss at gpfsug.org
Subject: Re: [gpfsug-discuss] Reasons for DiskLeaseThread Overloaded


What version of Spectrum Scale is running there? Do these errors appear since your last version update?
Am 14.02.23 um 14:09 schrieb Walter Sklenka:
Dear Colleagues!
May I ask if anyone has a hint what could be the reason for Critical Thread Watchdog warnings for disk lease threads?
Is this a “local node” problem or a network problem?
I sometimes see these messages when NSD servers that also act as NFS servers come under heavy NFS load.
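
To see whether the watchdog events really line up with NFS load peaks, one could sample the kernel nfsd RPC counter alongside the log; a minimal sketch, assuming the in-kernel NFS server on Linux:

#!/usr/bin/env python3
# Sample the nfsd RPC call rate every 5 seconds (sketch; assumes the
# Linux kernel NFS server, which exposes /proc/net/rpc/nfsd).
import time

def nfsd_calls():
    with open("/proc/net/rpc/nfsd") as f:
        for line in f:
            if line.startswith("rpc "):
                return int(line.split()[1])  # cumulative RPC call count
    return 0

prev = nfsd_calls()
while True:
    time.sleep(5)
    cur = nfsd_calls()
    print(time.strftime("%Y-%m-%d_%H:%M:%S"), f"nfsd RPC/s={(cur - prev) / 5:.0f}")
    prev = cur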



Following is an excerpt from mmfs.log.latest

2023-02-14_12:06:53.235+0100: [N] Disk lease period expired 0.040 seconds ago in cluster xxx-cluster. Attempting to reacquire the lease.
2023-02-14_12:06:53.600+0100: [W] ------------------[GPFS Critical Thread Watchdog]------------------
2023-02-14_12:06:53.600+0100: [W] PID: 7294 State: R (DiskLeaseThread) is overloaded for more than 8 seconds
2023-02-14_12:06:53.600+0100: [W]  counter: 0 (mark-idle: 0 mark-active: 0 pre-work: 0 post-work: 0) sched: (nvcsw: 0 nivcsw: 8)
2023-02-14_12:06:53.600+0100: [W] Call Trace(PID: 7294):
2023-02-14_12:06:53.600+0100: [W] #0: 0x000055CABDF49521 BaseMutexClass::release() + 0x12 at ??:0
2023-02-14_12:06:53.600+0100: [W] #1: 0xB1557721BBABD900 _etext + 0xB154F7E646041C0E at ??:0
2023-02-14_12:07:09.554+0100: [N] Disk lease reacquired in cluster xxx-cluster.
2023-02-14_12:07:09.554+0100: [N] Disk lease period expired 5.680 seconds ago in cluster xxx-cluster. Attempting to reacquire the lease.
2023-02-14_12:07:11.605+0100: [N] Disk lease reacquired in cluster xxx-cluster.
2023-02-14_12:10:55.990+0100: [I] Command: mmlspool /dev/fs4vm all -L -Y
2023-02-14_12:10:55.990+0100: [I] Command: successful mmlspool /dev/fs4vm all -L -Y
2023-02-14_12:30:58.756+0100: [I] Command: mmlspool /dev/fs4vm all -L -Y
2023-02-14_12:30:58.756+0100: [I] Command: successful mmlspool /dev/fs4vm all -L -Y
2023-02-14_13:10:55.988+0100: [I] Command: mmlspool /dev/fs4vm all -L -Y
2023-02-14_13:10:55.989+0100: [I] Command: successful mmlspool /dev/fs4vm all -L -Y
2023-02-14_13:21:40.892+0100: [N] Node 10.20.30.2 (ogpfs2-hs.local) lease renewal is overdue. Pinging to check if it is alive
2023-02-14_13:21:40.892+0100: [I] The TCP connection to IP address 10.20.30.2 ogpfs2-hs.local <c0n1>:[1] (socket 106) state: state=1 ca_state=0 snd_cwnd=10 snd_ssthresh=2147483647 unacked=0 probes=0 backoff=0 retransmits=0 rto=201000 rcv_ssthresh=1219344 rtt=121 rttvar=69 sacked=0 retrans=0 reordering=3 lost=0
2023-02-14_13:22:00.220+0100: [N] Disk lease period expired 0.010 seconds ago in cluster xxx-cluster. Attempting to reacquire the lease.
2023-02-14_13:22:08.298+0100: [N] Disk lease reacquired in cluster xxx-cluster.
2023-02-14_13:30:58.760+0100: [I] Command: mmlspool /dev/fs4vm all -L -Y
2023-02-14_13:30:58.760+0100: [I] Command: successful mmlspool /dev/fs4vm all -L -Y
Best regards
Walter Sklenka
Technical Consultant


