[gpfsug-discuss] GPFS and POWER9

Simon Thompson S.J.Thompson at bham.ac.uk
Thu Sep 19 16:18:47 BST 2019


Hi Andrew,

Yes, but not exclusively. We use the two SFP+ ports on the Broadcom-supplied card plus the bifurcated Mellanox card in them.

Simon

From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of "abeattie at au1.ibm.com" <abeattie at au1.ibm.com>
Reply-To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Date: Thursday, 19 September 2019 at 11:45
To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] GPFS and POWER9

Simon,

Are you using Intel 10Gb network adapters with RHEL 7.6 by any chance?

regards
Andrew Beattie
File and Object Storage Technical Specialist - A/NZ
IBM Systems - Storage
Phone: 614-2133-7927
E-mail: abeattie at au1.ibm.com


----- Original message -----
From: Simon Thompson <S.J.Thompson at bham.ac.uk>
Sent by: gpfsug-discuss-bounces at spectrumscale.org
To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Cc:
Subject: [EXTERNAL] [gpfsug-discuss] GPFS and POWER9
Date: Thu, Sep 19, 2019 8:42 PM



Recently we’ve been having some issues with some of our POWER9 systems. They occasionally hang or reboot, and in one case we’ve found we can trigger it by running an MPI IOR workload against GPFS. Every instance that has logged something to syslog has referenced mmfsd, but we don’t know whether that is a symptom or a cause. (Sometimes they just hang and we don’t see any such message.) We see the following in the kern log:



Sep 18 18:45:14 bear-pg0306u11a kernel: Hypervisor Maintenance interrupt [Recovered]
Sep 18 18:45:14 bear-pg0306u11a kernel: Error detail: Malfunction Alert
Sep 18 18:45:14 bear-pg0306u11a kernel: #011HMER: 8040000000000000
Sep 18 18:45:14 bear-pg0306u11a kernel: #011Unknown Malfunction Alert of type 3
Sep 18 18:45:14 bear-pg0306u11a kernel: Hypervisor Maintenance interrupt [Recovered]
Sep 18 18:45:14 bear-pg0306u11a kernel: Error detail: Malfunction Alert
Sep 18 18:45:14 bear-pg0306u11a kernel: #011HMER: 8040000000000000
Sep 18 18:45:14 bear-pg0306u11a kernel: Severe Machine check interrupt [Not recovered]
Sep 18 18:45:14 bear-pg0306u11a kernel:  NIP: [00000000115a2478] PID: 141380 Comm: mmfsd
Sep 18 18:45:14 bear-pg0306u11a kernel:  Initiator: CPU
Sep 18 18:45:14 bear-pg0306u11a kernel:  Error type: UE [Load/Store]
Sep 18 18:45:14 bear-pg0306u11a kernel:    Effective address: 000003002a2a8400
Sep 18 18:45:14 bear-pg0306u11a kernel:    Physical address:  000003c016590000
Sep 18 18:45:14 bear-pg0306u11a kernel: Severe Machine check interrupt [Not recovered]
Sep 18 18:45:14 bear-pg0306u11a kernel:  NIP: [000000001150b160] PID: 141380 Comm: mmfsd
Sep 18 18:45:14 bear-pg0306u11a kernel:  Initiator: CPU
Sep 18 18:45:14 bear-pg0306u11a kernel:  Error type: UE [Instruction fetch]
Sep 18 18:45:14 bear-pg0306u11a kernel:    Effective address: 000000001150b160
Sep 18 18:45:14 bear-pg0306u11a kernel:    Physical address:  000003c01fe80000
Sep 18 18:45:14 bear-pg0306u11a kernel: Severe Machine check interrupt [Not recovered]
Sep 18 18:45:14 bear-pg0306u11a kernel:  NIP: [000000001086a7f0] PID: 25926 Comm: mmfsd
Sep 18 18:45:14 bear-pg0306u11a kernel:  Initiator: CPU
Sep 18 18:45:14 bear-pg0306u11a kernel:  Error type: UE [Instruction fetch]
Sep 18 18:45:14 bear-pg0306u11a kernel:    Effective address: 000000001086a7f0
Sep 18 18:45:14 bear-pg0306u11a kernel:    Physical address:  000003c00fe70000
Sep 18 18:45:14 bear-pg0306u11a kernel: mmfsd[25926]: unhandled signal 7 at 000000001086a7f0 nip 000000001086a7f0 lr 000000001086a7f0 code 4
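For anyone comparing the same pattern across several nodes, here is a minimal Python sketch that groups the "Severe Machine check interrupt" lines from a kern log into one record per event. It assumes the exact line layout shown in the excerpt above; the function name, the dictionary keys and the SAMPLE constant are my own illustration and not part of any GPFS or kernel tooling.

```python
import re
from collections import Counter

# Line that follows each "Severe Machine check interrupt" header.
NIP_RE = re.compile(
    r"NIP: \[(?P<nip>[0-9a-f]+)\] PID: (?P<pid>\d+) Comm: (?P<comm>\S+)"
)
# Detail lines that belong to a machine check record.
FIELD_RE = re.compile(
    r"kernel:\s+(?P<key>Initiator|Error type|Effective address|Physical address)"
    r":\s+(?P<value>.+?)\s*$"
)


def parse_machine_checks(lines):
    """Return one dict per machine check event found in the given log lines."""
    records = []
    current = None
    for line in lines:
        if "Machine check interrupt" in line:
            # Start a new record; "[Not recovered]" does not contain "[Recovered]".
            current = {"recovered": "[Recovered]" in line}
            records.append(current)
            continue
        if "Hypervisor Maintenance interrupt" in line or "unhandled signal" in line:
            # A different kind of event: stop attaching lines to the last record.
            current = None
            continue
        if current is None:
            continue
        m = NIP_RE.search(line)
        if m:
            current.update(nip=m["nip"], pid=int(m["pid"]), comm=m["comm"])
            continue
        m = FIELD_RE.search(line)
        if m:
            current[m["key"].lower().replace(" ", "_")] = m["value"]
    return records


# A few lines copied from the excerpt above, used only as sample input.
SAMPLE = [
    "Sep 18 18:45:14 bear-pg0306u11a kernel: Hypervisor Maintenance interrupt [Recovered]",
    "Sep 18 18:45:14 bear-pg0306u11a kernel: Error detail: Malfunction Alert",
    "Sep 18 18:45:14 bear-pg0306u11a kernel: Severe Machine check interrupt [Not recovered]",
    "Sep 18 18:45:14 bear-pg0306u11a kernel:  NIP: [00000000115a2478] PID: 141380 Comm: mmfsd",
    "Sep 18 18:45:14 bear-pg0306u11a kernel:  Initiator: CPU",
    "Sep 18 18:45:14 bear-pg0306u11a kernel:  Error type: UE [Load/Store]",
    "Sep 18 18:45:14 bear-pg0306u11a kernel:    Effective address: 000003002a2a8400",
    "Sep 18 18:45:14 bear-pg0306u11a kernel:    Physical address:  000003c016590000",
    "Sep 18 18:45:14 bear-pg0306u11a kernel: Severe Machine check interrupt [Not recovered]",
    "Sep 18 18:45:14 bear-pg0306u11a kernel:  NIP: [000000001086a7f0] PID: 25926 Comm: mmfsd",
    "Sep 18 18:45:14 bear-pg0306u11a kernel:  Initiator: CPU",
    "Sep 18 18:45:14 bear-pg0306u11a kernel:  Error type: UE [Instruction fetch]",
]

if __name__ == "__main__":
    events = parse_machine_checks(SAMPLE)
    # Count events per process name and per error type.
    print(Counter(e.get("comm") for e in events))
    print(Counter(e.get("error_type") for e in events))
```

Running the same function over the kern log from each affected node would show whether every unrecovered event names mmfsd and whether the physical addresses cluster in one region, which may be useful detail for the hardware ticket.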



I’ve raised a hardware ticket with IBM, as traditionally a machine check exception would likely be a hardware/firmware issue. Has anyone else seen this sort of behaviour? It’s multiple boxes doing this, but they do all have the same firmware/RHEL/GPFS stack installed.



Asking here as the logs always reference mmfsd PIDs (but maybe that’s a symptom rather than a cause).



Simon
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


