[gpfsug-discuss] Kernel BUG/panic in mm/slub.c:3772 on Spectrum Scale Data Access Edition installed via gpfs.gplbin RPM on KVM guests

Ryan Novosielski novosirj at rutgers.edu
Fri Jan 17 15:55:54 GMT 2020


That /is/ interesting.

I’m a little confused about how that could be playing out in a case where I’m building on -1062.9.1, building for -1062.9.1, and running on -1062.9.1. Is there something inherent in the RPM building process that hasn’t caught up, or am I misunderstanding that change’s impact on it?

--
____
|| \\UTGERS,       |---------------------------*O*---------------------------
||_// the State     |         Ryan Novosielski - novosirj at rutgers.edu<mailto:novosirj at rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark
    `'

On Jan 17, 2020, at 10:35, Felipe Knop <knop at us.ibm.com> wrote:


Hi Ryan,

Some interesting IBM-internal communication overnight. The problem seems related to a change made to LINUX_KERNEL_VERSION_VERBOSE to handle the additional digit in the kernel numbering (3.10.0-1000+). The GPL layer expected LINUX_KERNEL_VERSION_VERBOSE to have that extra digit, and its absence resulted in an incorrect function being compiled in, which led to the crash.

This, at least, seems to make sense, in terms of matching to the symptoms of the problem.
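For what it's worth, here is a toy illustration of how that kind of version-encoding assumption can break. This is purely my own sketch of an *assumed* encoding scheme; I have not seen the actual LINUX_KERNEL_VERSION_VERBOSE macro, and the names here are hypothetical:

```shell
#!/bin/sh
# Sketch only (assumed encoding, not IBM's actual macro): pack a kernel
# release into a single integer, reserving three digits for the build
# number, the way a pre-1000 scheme might.
pack() {
    # pack MAJOR MINOR PATCH BUILD -> e.g. 3 10 0 957 -> 31000957
    printf '%d%02d%02d%03d\n' "$1" "$2" "$3" "$4"
}

a=$(pack 3 10 0 957)    # 3.10.0-957.x  -> 31000957  (8 digits, as intended)
b=$(pack 3 10 0 1062)   # 3.10.0-1062.x -> 310001062 (9 digits: %03d overflows)

# Any code that slices fixed digit positions out of this value, or that
# was compiled expecting the wider 9-digit form but got the 8-digit one,
# now tests the wrong component and selects the wrong #ifdef branch.
echo "$a $b"
```

The real macro presumably differs, but the failure mode matches what's described above: the GPL layer expecting the extra digit, and its absence causing the wrong function to be compiled in.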

We are still in internal debates on whether/how to update our guidelines for gplbin generation ...

Regards,

  Felipe

----
Felipe Knop knop at us.ibm.com
GPFS Development and Security
IBM Systems
IBM Building 008
2455 South Rd, Poughkeepsie, NY 12601
(845) 433-9314 T/L 293-9314



----- Original message -----
From: Ryan Novosielski <novosirj at rutgers.edu>
Sent by: gpfsug-discuss-bounces at spectrumscale.org
To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Cc:
Subject: [EXTERNAL] Re: [gpfsug-discuss] Kernel BUG/panic in mm/slub.c:3772 on Spectrum Scale Data Access Edition installed via gpfs.gplbin RPM on KVM guests
Date: Thu, Jan 16, 2020 4:33 PM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Felipe,

I either misunderstood support or convinced them to take further
action. It at first looked like they were suggesting "mmbuildgpl fixed
it: case closed" (I know they wanted to close the SalesForce case
anyway, which would prevent communication on the issue). At this
point, they've asked for a bunch more information.

Support is asking similar questions re: the speculations, and I'll
provide them with the relevant output ASAP, but I did confirm all of
that, including that there were no stray mmfs26/tracedev kernel
modules anywhere else in the relevant /lib/modules PATHs. In the
original case, I built on a machine running 3.10.0-957.27.2, but
pointed to the 3.10.0-1062.9.1 source code/defined the relevant
portions of /usr/lpp/mmfs/src/config/env.mcr. That's always worked
before, and rebuilding once the build system was running
3.10.0-1062.9.1 as well did not change anything either. In all cases,
the GPFS version was Spectrum Scale Data Access Edition 5.0.4-1. If
you build against either the wrong kernel version or the wrong GPFS
version, both will appear right in the filename of the gpfs.gplbin RPM
you build. Mine is called:

gpfs.gplbin-3.10.0-1062.9.1.el7.x86_64-5.0.4-1.x86_64.rpm
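In case it helps anyone else double-check a gplbin install, these are the sanity checks I'd run (standard rpm/kmod tooling; the module names are from the trace below, and the RPM filename is just my local build from above):

```shell
#!/bin/sh
# 1. Which kernel tree does the gplbin RPM actually ship its modules for?
rpm -qlp gpfs.gplbin-3.10.0-1062.9.1.el7.x86_64-5.0.4-1.x86_64.rpm | grep '\.ko$'

# 2. On the guest: does the installed module's vermagic match the running kernel?
uname -r
modinfo -F vermagic mmfs26 mmfslinux tracedev

# 3. Any stray copies of the portability layer lurking under other kernel trees?
find /lib/modules -name 'mmfs26*' -o -name 'mmfslinux*' -o -name 'tracedev*'
```

If step 2 disagrees with `uname -r`, or step 3 turns up modules under more than one kernel directory, that would point at the "leftover module" theory rather than the build itself.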

Anyway, thanks for your response; I know you might not be
following/working on this directly, but I figured the extra info might
be of interest.

On 1/16/20 8:41 AM, Felipe Knop wrote:
> Hi Ryan,
>
> I'm aware of this ticket, and I understand that there has been
> active communication with the service team on this problem.
>
> The crash itself, as you indicate, looks like a problem that has
> been fixed:
>
> https://www.ibm.com/support/pages/ibm-spectrum-scale-gpfs-releases-42313-or-later-and-5022-or-later-have-issues-where-kernel-crashes-rhel76-0
>
>  The fact that the problem goes away when *mmbuildgpl* is issued
> appears to point to some incompatibility with kernel levels and/or
> Scale version levels. Just speculating, some possible areas may
> be:
>
>
> * The RPM might have been built on a version of Scale without the fix
> * The RPM might have been built on a different (minor) version of the kernel
> * Somehow the VM picked a "leftover" GPFS kernel module, as opposed to the one included in gpfs.gplbin -- given that mmfsd never complained about a missing GPL kernel module
>
>
> Felipe
>
> ----
> Felipe Knop knop at us.ibm.com
> GPFS Development and Security
> IBM Systems
> IBM Building 008
> 2455 South Rd, Poughkeepsie, NY 12601
> (845) 433-9314 T/L 293-9314
>
>
>
>
> ----- Original message -----
> From: Ryan Novosielski <novosirj at rutgers.edu>
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Cc:
> Subject: [EXTERNAL] [gpfsug-discuss] Kernel BUG/panic in mm/slub.c:3772 on Spectrum Scale Data Access Edition installed via gpfs.gplbin RPM on KVM guests
> Date: Wed, Jan 15, 2020 4:11 PM
>
> Hi there,
>
> I know some of the Spectrum Scale developers look at this list.
> I’m having a little trouble with support on this problem.
>
> We are seeing crashes with GPFS 5.0.4-1 Data Access Edition on KVM
> guests with a portability layer that has been installed via
> gpfs.gplbin RPMs that we built at our site and have used to
> install GPFS all over our environment. We’ve not seen this problem
> so far on any physical hosts, but have now experienced it on guests
> running on a number of our KVM hypervisors, across vendors and
> firmware versions, etc. At one time I thought it was all happening
> on systems using Mellanox virtual functions for Infiniband, but
> we’ve now seen it on VMs without VFs. There may be an SELinux
> interaction, but some of our hosts have it disabled outright, some
> are Permissive, and some were working successfully with 5.0.2.x
> GPFS.
>
> What I’ve been instructed to try in order to solve this problem has
> been to run “mmbuildgpl”, and it did resolve it. I don’t consider
> running "mmbuildgpl" a real solution, however. If RPMs are a
> supported means of installation, it should work. Support told me
> that they’d seen this solve the problem at another site as well.
>
> Does anyone have any more information about this problem/whether
> there’s a fix in the pipeline, or something that can be done to
> cause this problem that we could remedy? Is there an easy place to
> see a list of eFixes to check whether this has come up? I know it’s
> very similar to a problem that happened, I believe, after 5.0.2.2
> and Linux 3.10.0-957.19.1, but that was already fixed in 5.0.3.x.
>
> Below is a sample of the crash output:
>
> [  156.733477] kernel BUG at mm/slub.c:3772!
> [  156.734212] invalid opcode: 0000 [#1] SMP
> [  156.735017] Modules linked in: ebtable_nat ebtable_filter ebtable_broute bridge stp llc ebtables mmfs26(OE) mmfslinux(OE) tracedev(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_multiport xt_conntrack nf_conntrack iptable_filter iptable_security nfit libnvdimm ppdev iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper sg joydev pcspkr cryptd parport_pc parport i2c_piix4 virtio_balloon knem(OE) binfmt_misc ip_tables xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sr_mod cdrom ata_generic pata_acpi virtio_console virtio_net virtio_blk crct10dif_pclmul crct10dif_common mlx5_core(OE) mlxfw(OE) crc32c_intel ptp pps_core devlink ata_piix serio_raw mlx_compat(OE) libata virtio_pci floppy virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
> [  156.754814] CPU: 3 PID: 11826 Comm: request_handle* Tainted: G OE  ------------   3.10.0-1062.9.1.el7.x86_64 #1
> [  156.756782] Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014
> [  156.757978] task: ffff8aeca5bf8000 ti: ffff8ae9f7a24000 task.ti: ffff8ae9f7a24000
> [  156.759326] RIP: 0010:[<ffffffffbbe23dec>]  [<ffffffffbbe23dec>] kfree+0x13c/0x140
> [  156.760749] RSP: 0018:ffff8ae9f7a27278  EFLAGS: 00010246
> [  156.761717] RAX: 001fffff00000400 RBX: ffffffffbc6974bf RCX: ffffa74dc1bcfb60
> [  156.763030] RDX: 001fffff00000000 RSI: ffff8aed90fc6500 RDI: ffffffffbc6974bf
> [  156.764321] RBP: ffff8ae9f7a27290 R08: 0000000000000014 R09: 0000000000000003
> [  156.765612] R10: 0000000000000048 R11: ffffdb5a82d125c0 R12: ffffa74dc4fd36c0
> [  156.766938] R13: ffffffffc0a1c562 R14: ffff8ae9f7a272f8 R15: ffff8ae9f7a27938
> [  156.768229] FS:  00007f8ffff05700(0000) GS:ffff8aedbfd80000(0000) knlGS:0000000000000000
> [  156.769708] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  156.770754] CR2: 000055963330e2b0 CR3: 0000000325ad2000 CR4: 00000000003606e0
> [  156.772076] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  156.773367] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  156.774663] Call Trace:
> [  156.775154]  [<ffffffffc0a1c562>] cxiInitInodeSecurityCleanup+0x12/0x20 [mmfslinux]
> [  156.776568]  [<ffffffffc0b50562>] _Z17newInodeInitLinuxP15KernelOperationP13gpfsVfsData_tPP8OpenFilePPvPP10gpfsNode_tP7FileUIDS6_N5LkObj12LockModeEnumE+0x152/0x290 [mmfs26]
> [  156.779378]  [<ffffffffc0b5cdfa>] _Z9gpfsMkdirP13gpfsVfsData_tP15KernelOperationP9cxiNode_tPPvPS4_PyS5_PcjjjP10ext_cred_t+0x46a/0x7e0 [mmfs26]
> [  156.781689]  [<ffffffffc0bdb928>] ? _ZN14BaseMutexClass15releaseLockHeldEP16KernelSynchState+0x18/0x130 [mmfs26]
> [  156.783565]  [<ffffffffc0c3db2d>] _ZL21pcacheHandleCacheMissP13gpfsVfsData_tP15KernelOperationP10gpfsNode_tPvPcPyP12pCacheResp_tPS5_PS4_PjSA_j+0x4bd/0x760 [mmfs26]
> [  156.786228]  [<ffffffffc0c40675>] _Z12pcacheLookupP13gpfsVfsData_tP15KernelOperationP10gpfsNode_tPvPcP7FilesetjjjPS5_PS4_PyPjS9_+0x1ff5/0x21a0 [mmfs26]
> [  156.788681]  [<ffffffffc0c023ef>] ? _Z15findFilesetByIdP15KernelOperationjjPP7Filesetj+0x4f/0xa0 [mmfs26]
> [  156.790448]  [<ffffffffc0b6d59c>] _Z10gpfsLookupP13gpfsVfsData_tPvP9cxiNode_tS1_S1_PcjPS1_PS3_PyP10cxiVattr_tPjP10ext_cred_tjS5_PiS4_SD_+0x65c/0xad0 [mmfs26]
> [  156.793032]  [<ffffffffc0b8b022>] ? _Z33gpfsIsCifsBypassTraversalCheckingv+0xe2/0x130 [mmfs26]
> [  156.794588]  [<ffffffffc0a36d96>] gpfs_i_lookup+0x2e6/0x5a0 [mmfslinux]
> [  156.795838]  [<ffffffffc0b6cf40>] ? _Z8gpfsLinkP13gpfsVfsData_tP9cxiNode_tS2_PvPcjjP10ext_cred_t+0x6c0/0x6c0 [mmfs26]
> [  156.797753]  [<ffffffffbbe65d52>] ? __d_alloc+0x122/0x180
> [  156.798763]  [<ffffffffbbe65e10>] ? d_alloc+0x60/0x70
> [  156.799700]  [<ffffffffbbe556d3>] lookup_real+0x23/0x60
> [  156.800651]  [<ffffffffbbe560f2>] __lookup_hash+0x42/0x60
> [  156.801675]  [<ffffffffbc377874>] lookup_slow+0x42/0xa7
> [  156.802634]  [<ffffffffbbe5ac3f>] link_path_walk+0x80f/0x8b0
> [  156.803666]  [<ffffffffbbe5ae4a>] path_lookupat+0x7a/0x8b0
> [  156.804690]  [<ffffffffbbdcd2fe>] ? lru_cache_add+0xe/0x10
> [  156.805690]  [<ffffffffbbe24ef5>] ? kmem_cache_alloc+0x35/0x1f0
> [  156.806766]  [<ffffffffbbe5c45f>] ? getname_flags+0x4f/0x1a0
> [  156.807817]  [<ffffffffbbe5b6ab>] filename_lookup+0x2b/0xc0
> [  156.808834]  [<ffffffffbbe5d5f7>] user_path_at_empty+0x67/0xc0
> [  156.809923]  [<ffffffffbbdf3ecd>] ? handle_mm_fault+0x39d/0x9b0
> [  156.811017]  [<ffffffffbbe5d661>] user_path_at+0x11/0x20
> [  156.811983]  [<ffffffffbbe50343>] vfs_fstatat+0x63/0xc0
> [  156.812951]  [<ffffffffbbe506fe>] SYSC_newstat+0x2e/0x60
> [  156.813931]  [<ffffffffbc388a26>] ? trace_do_page_fault+0x56/0x150
> [  156.815050]  [<ffffffffbbe50bbe>] SyS_newstat+0xe/0x10
> [  156.816010]  [<ffffffffbc38dede>] system_call_fastpath+0x25/0x2a
> [  156.817104] Code: 49 8b 03 31 f6 f6 c4 40 74 04 41 8b 73 68 4c 89 df e8 89 2f fa ff eb 84 4c 8b 58 30 48 8b 10 80 e6 80 4c 0f 44 d8 e9 28 ff ff ff <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54
> [  156.822192] RIP  [<ffffffffbbe23dec>] kfree+0x13c/0x140
> [  156.823180]  RSP <ffff8ae9f7a27278>
> [  156.823872] ---[ end trace 142960be4a4feed8 ]---
> [  156.824806] Kernel panic - not syncing: Fatal exception
> [  156.826475] Kernel Offset: 0x3ac00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> --
> ____
> || \\UTGERS,       |---------------------------*O*---------------------------
> ||_// the State     |         Ryan Novosielski - novosirj at rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark
>     `'
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>
>
>
>

- --
 ____
 || \\UTGERS,     |----------------------*O*------------------------
 ||_// the State  |    Ryan Novosielski - novosirj at rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
 ||  \\    of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
      `'
-----BEGIN PGP SIGNATURE-----

iF0EARECAB0WIQST3OUUqPn4dxGCSm6Zv6Bp0RyxvgUCXiDWSgAKCRCZv6Bp0Ryx
vpCsAKCQ2ykmeycbOVrHTGaFqb2SsU26NwCg3YyYi4Jy2d+xZjJkE6Vfht8O8gM=
=9rKb
-----END PGP SIGNATURE-----



