[gpfsug-discuss] mmfsd segfault/signal 6 on dirop.C:4548 in GPFS 5.0.2.x

Ryan Novosielski novosirj at rutgers.edu
Wed Aug 21 17:03:12 BST 2019


I posted this on Slack, but it’s serious enough that I want to make sure everyone sees it. Does anyone, from IBM or otherwise, have any more information about this/whether it was even announced anyplace? Thanks!

A little late, but we ran into a relatively serious problem at our site with 5.0.2.3. The symptom is an mmfsd crash/segfault related to fs/dirop.C:4548. We hit it only sporadically in general, but it was repeatable on the problem workload. From IBM Support:

2. This is a known defect.
The problem has been fixed through
D.1073563: CTM_A_XW_FOR_DATA_IN_INODE related assert in DirLTE::lock
A companion fix is
D.1073753: Assert that the lock mode in DirLTE::lock is strong enough


The rep further said "It's not an APAR since it's found in internal testing. It's an internal function at a place it should not assert but a part of the condition as the code path is specific to the DIR_UPDATE_LOCKMODE optimization code... The assert was meant for certain file creation code path, but the condition wasn't set strictly for that code path that some other code path could also run into the assert. So we cannot predict on which node it would happen."

The workaround was setting disableAssert="dirop.C:4548", which can be done live. Has anyone seen anything else about this anyplace? The bug was introduced in 5.0.2.0/1 and is fixed in 5.0.3.x (not sure what this version number means; I’ve seen them listed as X.X.X.X.X.X, X.X.X-X.X, and others).

--
____
|| \\UTGERS,  	 |---------------------------*O*---------------------------
||_// the State	 |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ	 | Office of Advanced Research Computing - MSB C630, Newark
     `'


