[gpfsug-discuss] /sbin/rmmod mmfs26 hangs on mmshutdown

Sven Oehme oehmes at gmail.com
Wed Jul 11 14:47:06 BST 2018


Hi,

What does numactl -H report?

Also check whether this is set to yes:

root@fab3a:~# mmlsconfig numaMemoryInterleave
numaMemoryInterleave yes
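
To collect both values in one pass, here is a minimal shell sketch. It assumes the usual `numactl -H` layout with lines such as "node 0 size: 131037 MB" and the two-column mmlsconfig output shown above. The function names are hypothetical, and whether the pagepool size relative to a single NUMA node has anything to do with the hang is only a guess until the numbers are in.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of GPFS or numactl.

# Read `numactl -H` output on stdin and print the size (in MB) of the
# smallest NUMA node, assuming lines of the form "node 0 size: 131037 MB".
smallest_numa_node_mb() {
    awk '/^node [0-9]+ size:/ {
             v = $4 + 0
             if (!seen || v < min) { min = v; seen = 1 }
         }
         END { if (seen) print min }'
}

# Read `mmlsconfig numaMemoryInterleave` output on stdin and print the
# value column ("yes" in the example above).
numa_interleave_value() {
    awk '$1 == "numaMemoryInterleave" { print $2; exit }'
}

# Succeed if a pagepool of $1 MB fits within a single NUMA node of $2 MB.
pagepool_fits_one_node() {
    [ "$1" -le "$2" ]
}

# Possible use on an affected node:
#   numactl -H | smallest_numa_node_mb
#   mmlsconfig numaMemoryInterleave | numa_interleave_value
```

The helpers only parse text from stdin, so they can be tried against saved command output from the affected nodes without touching the running cluster.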

Sven

On Wed, Jul 11, 2018 at 6:40 AM Billich Heinrich Rainer (PSI) <
heiner.billich at psi.ch> wrote:

> Hello,
>
>
>
> I have two nodes which hang on ‘mmshutdown’; specifically, the command
> ‘/sbin/rmmod mmfs26’ hangs. I get the kernel messages appended below.
> Does this look familiar to anybody? Is it a known bug? I can avoid
> the issue if I reduce pagepool from 128G to 64G.
>
>
>
> Running ‘systemctl stop gpfs’ shows the same issue. It forcefully
> terminates after a while, but ‘rmmod’ stays stuck.
>
>
>
> Two functions, cxiReleaseAndForgetPages and put_page, seem to be involved:
> the first is part of GPFS, the second is a kernel function.
>
>
>
> The servers have 256G of memory and 72 (virtual) cores each.
>
> I run 5.0.1-1 on RHEL7.4 with kernel 3.10.0-693.17.1.el7.x86_64.
>
>
>
> I can try to switch back to 5.0.0.
>
>
>
> Thank you & kind regards,
>
>
>
> Heiner
>
>
>
>
>
>
>
> Jul 11 14:12:04 node-1.x.y mmremote[1641]: Unloading module mmfs26
>
> Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [E] Event raised: The Spectrum
> Scale service process not running on this node. Normal operation cannot be
> done
>
> Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [I] Event raised: The Spectrum
> Scale service process is running
>
> Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [E] Event raised: The node is
> not able to form a quorum with the other available nodes.
>
> Jul 11 14:12:38 node-1.x.y sshd[2826]: Connection closed by xxx port 52814
> [preauth]
>
>
>
> Jul 11 14:12:41 node-1.x.y kernel: NMI watchdog: BUG: soft lockup - CPU#28
> stuck for 23s! [rmmod:2695]
>
>
>
> Jul 11 14:12:41 node-1.x.y kernel: Modules linked in: mmfs26(OE-)
> mmfslinux(OE) tracedev(OE) tcp_diag inet_diag rdma_ucm(OE) ib_ucm(OE)
> rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE)
> mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE)
> mlx4_ib(OE) ib_core(OE) vfat fat ext4 sb_edac edac_core intel_powerclamp
> coretemp intel_rapl iosf_mbi mbcache jbd2 kvm irqbypass crc32_pclmul
> ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd
> iTCO_wdt iTCO_vendor_support ipmi_ssif pcc_cpufreq hpilo ipmi_si sg hpwdt
> pcspkr i2c_i801 lpc_ich ipmi_devintf wmi ioatdma shpchp ipmi_msghandler
> acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc
> ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200
> i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
>
> Jul 11 14:12:41 node-1.x.y kernel:  sysimgblt fb_sys_fops ttm ixgbe
> mlx4_core(OE) crct10dif_pclmul mdio mlx_compat(OE) crct10dif_common drm ptp
> crc32c_intel devlink hpsa pps_core i2c_core scsi_transport_sas dca
> dm_mirror dm_region_hash dm_log dm_mod [last unloaded: tracedev]
>
> Jul 11 14:12:41 node-1.x.y kernel: CPU: 28 PID: 2695 Comm: rmmod Tainted:
> G        W  OEL ------------   3.10.0-693.17.1.el7.x86_64 #1
>
> Jul 11 14:12:41 node-1.x.y kernel: Hardware name: HP ProLiant DL380
> Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
>
> Jul 11 14:12:41 node-1.x.y kernel: task: ffff8808c4814f10 ti:
> ffff881619778000 task.ti: ffff881619778000
>
> Jul 11 14:12:41 node-1.x.y kernel: RIP: 0010:[<ffffffff816a2970>]
> [<ffffffff816a2970>] put_compound_page+0xc3/0x174
>
> Jul 11 14:12:41 node-1.x.y kernel: RSP: 0018:ffff88161977bd50  EFLAGS:
> 00000246
>
> Jul 11 14:12:41 node-1.x.y kernel: RAX: 0000000000000283 RBX:
> 00000000fae3d201 RCX: 0000000000000284
>
> Jul 11 14:12:41 node-1.x.y kernel: RDX: 0000000000000283 RSI:
> 0000000000000246 RDI: ffffea003d478000
>
> Jul 11 14:12:41 node-1.x.y kernel: RBP: ffff88161977bd68 R08:
> ffff881ffae3d1e0 R09: 0000000180800059
>
> Jul 11 14:12:41 node-1.x.y kernel: R10: 00000000fae3d201 R11:
> ffffea007feb8f40 R12: 00000000fae3d201
>
> Jul 11 14:12:41 node-1.x.y kernel: R13: ffff88161977bd40 R14:
> 0000000000000000 R15: ffff88161977bd40
>
> Jul 11 14:12:41 node-1.x.y kernel: FS:  00007f81a1db0740(0000)
> GS:ffff883ffee80000(0000) knlGS:0000000000000000
>
> Jul 11 14:12:41 node-1.x.y kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
>
> Jul 11 14:12:41 node-1.x.y kernel: CR2: 00007fa96e38f980 CR3:
> 0000000c36b2c000 CR4: 00000000001607e0
>
> Jul 11 14:12:41 node-1.x.y kernel: DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000
>
> Jul 11 14:12:41 node-1.x.y kernel: DR3: 0000000000000000 DR6:
> 00000000fffe0ff0 DR7: 0000000000000400
>
>
>
> Jul 11 14:12:41 node-1.x.y kernel: Call Trace:
>
> Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff81192275>] put_page+0x45/0x50
>
> Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08e3562>]
> cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
>
> Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08e3ae5>]
> cxiDeallocPageList+0x45/0x110 [mmfslinux]
>
> Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff811e0b02>] ?
> kmem_cache_free+0x1e2/0x200
>
> Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08e3cda>]
> cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
>
> Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc0c70c12>]
> kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
>
> Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc0c5bd15>] mmfs+0xc85/0xca0
> [mmfs26]
>
> Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08c8f16>]
> gpfs_clean+0x26/0x30 [mmfslinux]
>
> Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc0da5565>]
> cleanup_module+0x25/0x30 [mmfs26]
>
> Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff8110044b>]
> SyS_delete_module+0x19b/0x300
>
> Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff816b89fd>]
> system_call_fastpath+0x16/0x1b
>
> Jul 11 14:12:41 node-1.x.y kernel: Code: d1 00 00 00 4c 89 e7 e8 3a ff ff
> ff e9 c4 00 00 00 4c 39 e3 74 c1 41 8b 54 24 1c 85 d2 74 b8 8d 4a 01 89 d0
> f0 41 0f b1 4c 24 1c <39> c2 74 04 89 c2 eb e8 e8 f3 f0 ae ff 49 89 c5 f0
> 41 0f ba 2c
>
>
>
> Jul 11 14:13:23 node-1.x.y systemd[1]: gpfs.service stopping timed out.
> Terminating.
>
>
>
> Jul 11 14:13:27 node-1.x.y kernel: NMI watchdog: BUG: soft lockup - CPU#28
> stuck for 21s! [rmmod:2695]
>
>
>
> Jul 11 14:13:27 node-1.x.y kernel: Modules linked in: mmfs26(OE-)
> mmfslinux(OE) tracedev(OE) tcp_diag inet_diag rdma_ucm(OE) ib_ucm(OE)
> rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE)
> mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE)
> mlx4_ib(OE) ib_core(OE) vfat fat ext4 sb_edac edac_core intel_powerclamp
> coretemp intel_rapl iosf_mbi mbcache jbd2 kvm irqbypass crc32_pclmul
> ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd
> iTCO_wdt iTCO_vendor_support ipmi_ssif pcc_cpufreq hpilo ipmi_si sg hpwdt
> pcspkr i2c_i801 lpc_ich ipmi_devintf wmi ioatdma shpchp ipmi_msghandler
>
> Jul 11 14:13:27 node-1.x.y kernel: INFO: rcu_sched detected stalls on
> CPUs/tasks:
>
> Jul 11 14:13:27 node-1.x.y kernel:  {
>
> Jul 11 14:13:27 node-1.x.y kernel:  28
>
> Jul 11 14:13:27 node-1.x.y kernel: }
>
> Jul 11 14:13:27 node-1.x.y kernel: (detected by 17, t=60002 jiffies,
> g=267734, c=267733, q=36089)
>
> Jul 11 14:13:27 node-1.x.y kernel: Task dump for CPU 28:
>
> Jul 11 14:13:27 node-1.x.y kernel: rmmod           R
>
> Jul 11 14:13:27 node-1.x.y kernel:   running task
>
> Jul 11 14:13:27 node-1.x.y kernel:     0  2695   2642 0x00000008
>
> Jul 11 14:13:27 node-1.x.y kernel: Call Trace:
>
> Jul 11 14:13:27 node-1.x.y kernel:
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff811dea1c>] ?
> __free_slab+0xdc/0x200
>
> Jul 11 14:13:27 node-1.x.y kernel:
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff816a28ad>] ?
> __put_compound_page+0x22/0x22
>
> Jul 11 14:13:27 node-1.x.y kernel:
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff81192275>] ?
> put_page+0x45/0x50
>
> Jul 11 14:13:27 node-1.x.y kernel:
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3562>] ?
> cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
>
> Jul 11 14:13:27 node-1.x.y kernel:
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3ae5>] ?
> cxiDeallocPageList+0x45/0x110 [mmfslinux]
>
> Jul 11 14:13:27 node-1.x.y kernel:
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3cda>] ?
> cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
>
> Jul 11 14:13:27 node-1.x.y kernel:
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0c70c12>] ?
> kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
>
> Jul 11 14:13:27 node-1.x.y kernel:
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0c5bd15>] ?
> mmfs+0xc85/0xca0 [mmfs26]
>
> Jul 11 14:13:27 node-1.x.y kernel:
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08c8f16>] ?
> gpfs_clean+0x26/0x30 [mmfslinux]
>
> Jul 11 14:13:27 node-1.x.y kernel:
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0da5565>] ?
> cleanup_module+0x25/0x30 [mmfs26]
>
> Jul 11 14:13:27 node-1.x.y kernel:
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff8110044b>] ?
> SyS_delete_module+0x19b/0x300
>
> Jul 11 14:13:27 node-1.x.y kernel:
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff816b89fd>] ?
> system_call_fastpath+0x16/0x1b
>
> Jul 11 14:13:27 node-1.x.y kernel:  acpi_power_meter
>
> Jul 11 14:13:27 node-1.x.y kernel:  binfmt_misc nfsd auth_rpcgss nfs_acl
> lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif
> crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea
> sysfillrect sysimgblt fb_sys_fops ttm ixgbe mlx4_core(OE) crct10dif_pclmul
> mdio mlx_compat(OE) crct10dif_common drm ptp crc32c_intel devlink hpsa
> pps_core i2c_core scsi_transport_sas dca dm_mirror dm_region_hash dm_log
> dm_mod [last unloaded: tracedev]
>
> Jul 11 14:13:27 node-1.x.y kernel: CPU: 28 PID: 2695 Comm: rmmod Tainted:
> G        W  OEL ------------   3.10.0-693.17.1.el7.x86_64 #1
>
> Jul 11 14:13:27 node-1.x.y kernel: Hardware name: HP ProLiant DL380
> Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
>
> Jul 11 14:13:27 node-1.x.y kernel: task: ffff8808c4814f10 ti:
> ffff881619778000 task.ti: ffff881619778000
>
> Jul 11 14:13:27 node-1.x.y kernel: RIP: 0010:[<ffffffff816a28ad>]
> [<ffffffff816a28ad>] __put_compound_page+0x22/0x22
>
> Jul 11 14:13:27 node-1.x.y kernel: RSP: 0018:ffff88161977bd70  EFLAGS:
> 00000282
>
> Jul 11 14:13:27 node-1.x.y kernel: RAX: 002fffff00008010 RBX:
> 0000000000000135 RCX: 00000000000001c1
>
> Jul 11 14:13:27 node-1.x.y kernel: RDX: ffff8814adbbf000 RSI:
> 0000000000000246 RDI: ffffea00650e7040
>
> Jul 11 14:13:27 node-1.x.y kernel: RBP: ffff88161977bd78 R08:
> ffff881ffae3df60 R09: 0000000180800052
>
> Jul 11 14:13:27 node-1.x.y kernel: R10: 00000000fae3db01 R11:
> ffffea007feb8f40 R12: ffff881ffae3df60
>
> Jul 11 14:13:27 node-1.x.y kernel: R13: 0000000180800052 R14:
> 00000000fae3db01 R15: ffffea007feb8f40
>
> Jul 11 14:13:27 node-1.x.y kernel: FS:  00007f81a1db0740(0000)
> GS:ffff883ffee80000(0000) knlGS:0000000000000000
>
> Jul 11 14:13:27 node-1.x.y kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
>
> Jul 11 14:13:27 node-1.x.y kernel: CR2: 00007fa96e38f980 CR3:
> 0000000c36b2c000 CR4: 00000000001607e0
>
> Jul 11 14:13:27 node-1.x.y kernel: DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000
>
> Jul 11 14:13:27 node-1.x.y kernel: DR3: 0000000000000000 DR6:
> 00000000fffe0ff0 DR7: 0000000000000400
>
> Jul 11 14:13:27 node-1.x.y kernel: Call Trace:
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff81192275>] ?
> put_page+0x45/0x50
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3562>]
> cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3ae5>]
> cxiDeallocPageList+0x45/0x110 [mmfslinux]
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3cda>]
> cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0c70c12>]
> kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0c5bd15>] mmfs+0xc85/0xca0
> [mmfs26]
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08c8f16>]
> gpfs_clean+0x26/0x30 [mmfslinux]
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0da5565>]
> cleanup_module+0x25/0x30 [mmfs26]
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff8110044b>]
> SyS_delete_module+0x19b/0x300
>
> Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff816b89fd>]
> system_call_fastpath+0x16/0x1b
>
> Jul 11 14:13:27 node-1.x.y kernel: Code: c0 0f 95 c0 0f b6 c0 5d c3 0f 1f
> 44 00 00 55 48 89 e5 53 48 8b 07 48 89 fb a8 20 74 05 e8 0c f8 ae ff 48 89
> df ff 53 60 5b 5d c3 <0f> 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 8b 07
> 48 89 fb f6
>
>
>
> --
>
> Paul Scherrer Institut
>
> Science IT
>
> Heiner Billich
>
> WHGA 106
>
> CH 5232  Villigen PSI
>
> 056 310 36 02
>
> https://www.psi.ch
>
>
>
>
>
>