[gpfsug-discuss] Kernel BUG/panic in mm/slub.c:3772 on Spectrum Scale Data Access Edition installed via gpfs.gplbin RPM on KVM guests

Ryan Novosielski novosirj at rutgers.edu
Fri Jan 17 16:58:58 GMT 2020


Yeah, support got back to me earlier today with a similar response, which I hadn’t seen yet and which made it a lot clearer what I “did wrong”. This would appear to be the cause in my case:

[root@master config]# diff env.mcr env.mcr-1062.9.1
4,5c4,5
< #define LINUX_KERNEL_VERSION 31000999
< #define LINUX_KERNEL_VERSION_VERBOSE 310001062009001
---
> #define LINUX_KERNEL_VERSION 31001062
> #define LINUX_KERNEL_VERSION_VERBOSE 31001062009001


…the former having been generated by “make Autoconfig” and the latter generated by my brain. The first line surprises me; I’d have caught on that something different was needed if 3.10.0-1062 hadn’t already fit in the old number of digits.
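
For anyone else who edits env.mcr by hand: the field widths below are only my reading of the diff above (3.10.0-1062.9.1 -> 31001062 / 31001062009001), not anything from IBM documentation, and the variable names are mine, but a quick sanity check against the running kernel might look something like this:

# Field layout *guessed* from the env.mcr diff above; verify before trusting it.
kver=$(uname -r)                      # e.g. 3.10.0-1062.9.1.el7.x86_64
IFS='.-' read -r maj min patch fix sub1 sub2 _ <<<"$kver"
printf 'LINUX_KERNEL_VERSION         %d%02d%d%04d\n' "$maj" "$min" "$patch" "$fix"
printf 'LINUX_KERNEL_VERSION_VERBOSE %d%02d%d%04d%03d%03d\n' \
    "$maj" "$min" "$patch" "$fix" "$sub1" "$sub2"
grep 'LINUX_KERNEL_VERSION' /usr/lpp/mmfs/src/config/env.mcr   # compare with what's there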

Anyway, I explained to support that the reason I do this is that I maintain a couple of copies of env.mcr because occasionally there will be reasons to need gpfs.gplbin for a few different kernel versions (other software that doesn't want to be upgraded, etc.). I see I originally got this practice from the README (or possibly our original installer consultants).

Basically, what’s missing here, so far as I can see, is a way to use mmbuildgpl/“make Autoconfig” while specifying a target kernel version (and, I guess, an update to the docs, or at least to /usr/lpp/mmfs/src/README, so that it no longer suggests manually editing env.mcr). Is there a way to at least find out what “make Autoconfig” would use for a target kernel’s LINUX_KERNEL_VERSION_VERBOSE? From what I can see of the makefile and config/configure, there’s no option for specifying anything.
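
In the meantime, the least error-prone workaround I’ve come up with is to stop typing the numbers entirely: stash the hand-edited copy, let “make Autoconfig” regenerate env.mcr on a machine actually booted into the target kernel (as far as I can tell it only ever targets the running kernel), and diff the result against what I had. Roughly; the .byhand suffix is just my own convention:

cd /usr/lpp/mmfs/src
cp config/env.mcr config/env.mcr.byhand    # keep the hand-edited version
make Autoconfig                            # regenerates env.mcr for the *running* kernel
grep 'LINUX_KERNEL_VERSION' config/env.mcr # what the build would actually use
diff config/env.mcr.byhand config/env.mcr  # spot any hand-typed mistakes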

--
____
|| \\UTGERS,  	 |---------------------------*O*---------------------------
||_// the State	 |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ	 | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Jan 17, 2020, at 11:36 AM, Felipe Knop <knop at us.ibm.com> wrote:
> 
> Hi Ryan,
>  
> My interpretation of the analysis so far is that the content of LINUX_KERNEL_VERSION_VERBOSE in 'env.mcr' became incorrect. That is, it used to work well in a prior release of Scale, but not with 5.0.4.1. This is because of a code change that added another digit to the version in LINUX_KERNEL_VERSION_VERBOSE to account for the 4-digit "fix level" (3.10.0-1000+). Then, when the GPL layer was built, its sources saw the content of LINUX_KERNEL_VERSION_VERBOSE with the missing extra digit and compiled the 'wrong' pieces in -- in particular the incorrect value of SECURITY_INODE_INIT_SECURITY(). And that led to the crash.
>  
> The problem did not happen when mmbuildgpl was used since the correct value of LINUX_KERNEL_VERSION_VERBOSE was then set up.
>  
>   Felipe
>  
> ----
> Felipe Knop knop at us.ibm.com
> GPFS Development and Security
> IBM Systems
> IBM Building 008
> 2455 South Rd, Poughkeepsie, NY 12601
> (845) 433-9314 T/L 293-9314
>  
>  
>  
> ----- Original message -----
> From: Ryan Novosielski <novosirj at rutgers.edu>
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Cc:
> Subject: [EXTERNAL] Re: [gpfsug-discuss] Kernel BUG/panic in mm/slub.c:3772 on Spectrum Scale Data Access Edition installed via gpfs.gplbin RPM on KVM guests
> Date: Fri, Jan 17, 2020 10:56 AM
>  
> That /is/ interesting. 
>  
> I’m a little confused about how that could be playing out in a case where I’m building on -1062.9.1, building for -1062.9.1, and running on -1062.9.1. Is there something inherent in the RPM building process that hasn’t caught up, or am I misunderstanding that change’s impact on it?
>  
> --
> ____
> || \\UTGERS,       |---------------------------*O*---------------------------
> ||_// the State     |         Ryan Novosielski - novosirj at rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark
>     `'
>  
>> On Jan 17, 2020, at 10:35, Felipe Knop <knop at us.ibm.com> wrote:
>>  
>> 
>> Hi Ryan,
>>  
>> Some interesting IBM-internal communication overnight. The problem seems related to a change made to LINUX_KERNEL_VERSION_VERBOSE to handle the additional digit in the kernel numbering (3.10.0-1000+). The GPL layer expected LINUX_KERNEL_VERSION_VERBOSE to have that extra digit, and its absence resulted in an incorrect function being compiled in, which led to the crash.
>>  
>> This, at least, seems to make sense, in terms of matching to the symptoms of the problem.
>>  
>> We are still in internal debates on whether/how to update our guidelines for gplbin generation...
>>  
>> Regards,
>>  
>>   Felipe
>>  
>> ----
>> Felipe Knop knop at us.ibm.com
>> GPFS Development and Security
>> IBM Systems
>> IBM Building 008
>> 2455 South Rd, Poughkeepsie, NY 12601
>> (845) 433-9314 T/L 293-9314
>>  
>>  
>>  
>> ----- Original message -----
>> From: Ryan Novosielski <novosirj at rutgers.edu>
>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>> To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
>> Cc:
>> Subject: [EXTERNAL] Re: [gpfsug-discuss] Kernel BUG/panic in mm/slub.c:3772 on Spectrum Scale Data Access Edition installed via gpfs.gplbin RPM on KVM guests
>> Date: Thu, Jan 16, 2020 4:33 PM
>>  
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>> 
>> Hi Felipe,
>> 
>> I either misunderstood support or convinced them to take further
>> action. It at first looked like they were suggesting "mmbuildgpl fixed
>> it: case closed" (I know they wanted to close the SalesForce case
>> anyway, which would prevent communication on the issue). At this
>> point, they've asked for a bunch more information.
>> 
>> Support is asking similar questions re: the speculations, and I'll
>> provide them with the relevant output ASAP, but I did confirm all of
>> that, including that there were no stray mmfs26/tracedev kernel
>> modules anywhere else in the relevant /lib/modules PATHs. In the
>> original case, I built on a machine running 3.10.0-957.27.2, but
>> pointed to the 3.10.0-1062.9.1 source code and defined the relevant
>> portions of /usr/lpp/mmfs/src/config/env.mcr. That's always worked
>> before, and rebuilding once the build system was running
>> 3.10.0-1062.9.1 as well did not change anything either. In all cases,
>> the GPFS version was Spectrum Scale Data Access Edition 5.0.4-1. If
>> you build against either the wrong kernel version or the wrong GPFS
>> version, both will appear right in the filename of the gpfs.gplbin RPM
>> you build. Mine is called:
>> 
>> gpfs.gplbin-3.10.0-1062.9.1.el7.x86_64-5.0.4-1.x86_64.rpm
>> 
>> Anyway, thanks for your response; I know you might not be
>> following/working on this directly, but I figured the extra info might
>> be of interest.
>> 
>> On 1/16/20 8:41 AM, Felipe Knop wrote:
>> > Hi Ryan,
>> >
>> > I'm aware of this ticket, and I understand that there has been
>> > active communication with the service team on this problem.
>> >
>> > The crash itself, as you indicate, looks like a problem that has
>> > been fixed:
>> >
>> > https://www.ibm.com/support/pages/ibm-spectrum-scale-gpfs-releases-42313-or-later-and-5022-or-later-have-issues-where-kernel-crashes-rhel76-0
>> >
>> >  The fact that the problem goes away when *mmbuildgpl* is issued
>> > appears to point to some incompatibility with kernel levels and/or
>> > Scale version levels. Just speculating, some possible areas may
>> > be:
>> >
>> >
>> > * The RPM might have been built on a version of Scale without the fix
>> > * The RPM might have been built on a different (minor) version of the kernel
>> > * Somehow the VM picked a "leftover" GPFS kernel module, as opposed to the one
>> >   included in gpfs.gplbin -- given that mmfsd never complained about a missing
>> >   GPL kernel module
>> >
>> >
>> > Felipe
>> >
>> > ----
>> > Felipe Knop knop at us.ibm.com
>> > GPFS Development and Security
>> > IBM Systems
>> > IBM Building 008
>> > 2455 South Rd, Poughkeepsie, NY 12601
>> > (845) 433-9314 T/L 293-9314
>> >
>> >
>> >
>> >
>> > ----- Original message -----
>> > From: Ryan Novosielski <novosirj at rutgers.edu>
>> > Sent by: gpfsug-discuss-bounces at spectrumscale.org
>> > To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> > Cc:
>> > Subject: [EXTERNAL] [gpfsug-discuss] Kernel BUG/panic in mm/slub.c:3772 on Spectrum Scale Data Access Edition installed via gpfs.gplbin RPM on KVM guests
>> > Date: Wed, Jan 15, 2020 4:11 PM
>> >
>> > Hi there,
>> >
>> > I know some of the Spectrum Scale developers look at this list.
>> > I’m having a little trouble with support on this problem.
>> >
>> > We are seeing crashes with GPFS 5.0.4-1 Data Access Edition on KVM
>> > guests with a portability layer that has been installed via
>> > gpfs.gplbin RPMs that we built at our site and have used to
>> > install GPFS all over our environment. We’ve not seen this problem
>> > so far on any physical hosts, but have now experienced it on guests
>> > running on a number of our KVM hypervisors, across vendors and
>> > firmware versions, etc. At one time I thought it was all happening
>> > on systems using Mellanox virtual functions for Infiniband, but
>> > we’ve now seen it on VMs without VFs. There may be an SELinux
>> > interaction, but some of our hosts have it disabled outright, some
>> > are Permissive, and some were working successfully with 5.0.2.x
>> > GPFS.
>> >
>> > What I’ve been instructed to try to solve this problem has been to
>> > run “mmbuildgpl”, and it has solved the problem. I don’t consider
>> > running "mmbuildgpl" a real solution, however. If RPMs are a
>> > supported means of installation, it should work. Support told me
>> > that they’d seen this solve the problem at another site as well.
>> >
>> > Does anyone have any more information about this problem/whether
>> > there’s a fix in the pipeline, or something that can be done to
>> > cause this problem that we could remedy? Is there an easy place to
>> > see a list of eFixes to see if this has come up? I know it’s very
>> > similar to a problem that happened, I believe, after 5.0.2.2 and
>> > Linux 3.10.0-957.19.1, but that was already fixed in 5.0.3.x.
>> >
>> > Below is a sample of the crash output:
>> >
>> > [  156.733477] kernel BUG at mm/slub.c:3772!
>> > [  156.734212] invalid opcode: 0000 [#1] SMP
>> > [  156.735017] Modules linked in: ebtable_nat ebtable_filter ebtable_broute bridge stp llc ebtables mmfs26(OE) mmfslinux(OE) tracedev(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_multiport xt_conntrack nf_conntrack iptable_filter iptable_security nfit libnvdimm ppdev iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper sg joydev pcspkr cryptd parport_pc parport i2c_piix4 virtio_balloon knem(OE) binfmt_misc ip_tables xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sr_mod cdrom ata_generic pata_acpi virtio_console virtio_net virtio_blk crct10dif_pclmul crct10dif_common mlx5_core(OE) mlxfw(OE) crc32c_intel ptp pps_core devlink ata_piix serio_raw mlx_compat(OE) libata virtio_pci floppy virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
>> > [  156.754814] CPU: 3 PID: 11826 Comm: request_handle* Tainted: G OE ------------   3.10.0-1062.9.1.el7.x86_64 #1
>> > [  156.756782] Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014
>> > [  156.757978] task: ffff8aeca5bf8000 ti: ffff8ae9f7a24000 task.ti: ffff8ae9f7a24000
>> > [  156.759326] RIP: 0010:[<ffffffffbbe23dec>]  [<ffffffffbbe23dec>] kfree+0x13c/0x140
>> > [  156.760749] RSP: 0018:ffff8ae9f7a27278  EFLAGS: 00010246
>> > [  156.761717] RAX: 001fffff00000400 RBX: ffffffffbc6974bf RCX: ffffa74dc1bcfb60
>> > [  156.763030] RDX: 001fffff00000000 RSI: ffff8aed90fc6500 RDI: ffffffffbc6974bf
>> > [  156.764321] RBP: ffff8ae9f7a27290 R08: 0000000000000014 R09: 0000000000000003
>> > [  156.765612] R10: 0000000000000048 R11: ffffdb5a82d125c0 R12: ffffa74dc4fd36c0
>> > [  156.766938] R13: ffffffffc0a1c562 R14: ffff8ae9f7a272f8 R15: ffff8ae9f7a27938
>> > [  156.768229] FS:  00007f8ffff05700(0000) GS:ffff8aedbfd80000(0000) knlGS:0000000000000000
>> > [  156.769708] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [  156.770754] CR2: 000055963330e2b0 CR3: 0000000325ad2000 CR4: 00000000003606e0
>> > [  156.772076] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > [  156.773367] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> > [  156.774663] Call Trace:
>> > [  156.775154]  [<ffffffffc0a1c562>] cxiInitInodeSecurityCleanup+0x12/0x20 [mmfslinux]
>> > [  156.776568]  [<ffffffffc0b50562>] _Z17newInodeInitLinuxP15KernelOperationP13gpfsVfsData_tPP8OpenFilePPvPP10gpfsNode_tP7FileUIDS6_N5LkObj12LockModeEnumE+0x152/0x290 [mmfs26]
>> > [  156.779378]  [<ffffffffc0b5cdfa>] _Z9gpfsMkdirP13gpfsVfsData_tP15KernelOperationP9cxiNode_tPPvPS4_PyS5_PcjjjP10ext_cred_t+0x46a/0x7e0 [mmfs26]
>> > [  156.781689]  [<ffffffffc0bdb928>] ? _ZN14BaseMutexClass15releaseLockHeldEP16KernelSynchState+0x18/0x130 [mmfs26]
>> > [  156.783565]  [<ffffffffc0c3db2d>] _ZL21pcacheHandleCacheMissP13gpfsVfsData_tP15KernelOperationP10gpfsNode_tPvPcPyP12pCacheResp_tPS5_PS4_PjSA_j+0x4bd/0x760 [mmfs26]
>> > [  156.786228]  [<ffffffffc0c40675>] _Z12pcacheLookupP13gpfsVfsData_tP15KernelOperationP10gpfsNode_tPvPcP7FilesetjjjPS5_PS4_PyPjS9_+0x1ff5/0x21a0 [mmfs26]
>> > [  156.788681]  [<ffffffffc0c023ef>] ? _Z15findFilesetByIdP15KernelOperationjjPP7Filesetj+0x4f/0xa0 [mmfs26]
>> > [  156.790448]  [<ffffffffc0b6d59c>] _Z10gpfsLookupP13gpfsVfsData_tPvP9cxiNode_tS1_S1_PcjPS1_PS3_PyP10cxiVattr_tPjP10ext_cred_tjS5_PiS4_SD_+0x65c/0xad0 [mmfs26]
>> > [  156.793032]  [<ffffffffc0b8b022>] ? _Z33gpfsIsCifsBypassTraversalCheckingv+0xe2/0x130 [mmfs26]
>> > [  156.794588]  [<ffffffffc0a36d96>] gpfs_i_lookup+0x2e6/0x5a0 [mmfslinux]
>> > [  156.795838]  [<ffffffffc0b6cf40>] ? _Z8gpfsLinkP13gpfsVfsData_tP9cxiNode_tS2_PvPcjjP10ext_cred_t+0x6c0/0x6c0 [mmfs26]
>> > [  156.797753]  [<ffffffffbbe65d52>] ? __d_alloc+0x122/0x180
>> > [  156.798763]  [<ffffffffbbe65e10>] ? d_alloc+0x60/0x70
>> > [  156.799700]  [<ffffffffbbe556d3>] lookup_real+0x23/0x60
>> > [  156.800651]  [<ffffffffbbe560f2>] __lookup_hash+0x42/0x60
>> > [  156.801675]  [<ffffffffbc377874>] lookup_slow+0x42/0xa7
>> > [  156.802634]  [<ffffffffbbe5ac3f>] link_path_walk+0x80f/0x8b0
>> > [  156.803666]  [<ffffffffbbe5ae4a>] path_lookupat+0x7a/0x8b0
>> > [  156.804690]  [<ffffffffbbdcd2fe>] ? lru_cache_add+0xe/0x10
>> > [  156.805690]  [<ffffffffbbe24ef5>] ? kmem_cache_alloc+0x35/0x1f0
>> > [  156.806766]  [<ffffffffbbe5c45f>] ? getname_flags+0x4f/0x1a0
>> > [  156.807817]  [<ffffffffbbe5b6ab>] filename_lookup+0x2b/0xc0
>> > [  156.808834]  [<ffffffffbbe5d5f7>] user_path_at_empty+0x67/0xc0
>> > [  156.809923]  [<ffffffffbbdf3ecd>] ? handle_mm_fault+0x39d/0x9b0
>> > [  156.811017]  [<ffffffffbbe5d661>] user_path_at+0x11/0x20
>> > [  156.811983]  [<ffffffffbbe50343>] vfs_fstatat+0x63/0xc0
>> > [  156.812951]  [<ffffffffbbe506fe>] SYSC_newstat+0x2e/0x60
>> > [  156.813931]  [<ffffffffbc388a26>] ? trace_do_page_fault+0x56/0x150
>> > [  156.815050]  [<ffffffffbbe50bbe>] SyS_newstat+0xe/0x10
>> > [  156.816010]  [<ffffffffbc38dede>] system_call_fastpath+0x25/0x2a
>> > [  156.817104] Code: 49 8b 03 31 f6 f6 c4 40 74 04 41 8b 73 68 4c 89 df e8 89 2f fa ff eb 84 4c 8b 58 30 48 8b 10 80 e6 80 4c 0f 44 d8 e9 28 ff ff ff <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54
>> > [  156.822192] RIP  [<ffffffffbbe23dec>] kfree+0x13c/0x140
>> > [  156.823180]  RSP <ffff8ae9f7a27278>
>> > [  156.823872] ---[ end trace 142960be4a4feed8 ]---
>> > [  156.824806] Kernel panic - not syncing: Fatal exception
>> > [  156.826475] Kernel Offset: 0x3ac00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> >
>> > --
>> > ____
>> > || \\UTGERS,     |---------------------------*O*---------------------------
>> > ||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
>> > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> > ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>> >      `'
>> >
>> 
>> - --
>>  ____
>>  || \\UTGERS,     |----------------------*O*------------------------
>>  ||_// the State  |    Ryan Novosielski - novosirj at rutgers.edu
>>  || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
>>  ||  \\    of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
>>       `'
>> -----BEGIN PGP SIGNATURE-----
>> 
>> iF0EARECAB0WIQST3OUUqPn4dxGCSm6Zv6Bp0RyxvgUCXiDWSgAKCRCZv6Bp0Ryx
>> vpCsAKCQ2ykmeycbOVrHTGaFqb2SsU26NwCg3YyYi4Jy2d+xZjJkE6Vfht8O8gM=
>> =9rKb
>> -----END PGP SIGNATURE-----


