[gpfsug-discuss] GPFS 3.5 to 4.1 Upgrade Question

Sander Kuusemets sander.kuusemets at ut.ee
Wed Dec 7 16:56:52 GMT 2016


It might have been some kind of bug that only we hit, but I thought I'd
share, just in case.

The email in which they said they had opened a ticket for this bug's fix
came almost exactly a month ago, so I doubt they've fixed it yet, as they
said it might take a while.

I don't know if this is of any help, but a paragraph from the explanation:

> The assert "msgLen >= (sizeof(Pad32) + 0)" is from routine 
> PIT_HelperGetWorkMH(). There are two RPC structures used in this routine:
> - PitHelperWorkReport
> - PitInodeListPacket
>
> The problematic one is the 'PitInodeListPacket' subrpc, which is part
> of an "interesting inode" code change. Looking at the dumps it's
> evident that node 'stage3', which sent the RPC, is not capable of
> supporting interesting inodes (max feature level is 1340), and node
> tank1, which is receiving it, is trying to interpret the RPC beyond the
> valid region (as its feature level 1502 supports PIT interesting
> inodes). This is resulting in the assert you see. As a short-term
> measure, bringing all the nodes to the same feature level should make
> the problem go away. But since we support backward compatibility, we
> are opening an APAR to create a code fix. It's unfortunately going to
> be a tricky fix, which is going to take a significant amount of time.
> Therefore I don't expect the team will be able to provide an efix
> anytime soon. We recommend you bring all nodes in all clusters up to
> the latest level, 4.2.0.4, and run the "mmchconfig release=latest" and
> "mmchfs -V full" commands, which will ensure all daemon and fs levels
> are at the level that supports the 1502 RPC feature level.
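
For reference, the finalisation step they describe boils down to something
like this once every node in every cluster is on 4.2.0.4 (gpfs0 below is
just a placeholder for the actual filesystem device name):

    mmlsconfig minReleaseLevel   # current cluster-wide minimum release level
    mmchconfig release=LATEST    # raise the cluster release level
    mmchfs gpfs0 -V full         # raise the filesystem format/feature level
    mmlsfs gpfs0 -V              # verify the new filesystem format version
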
Best regards,

-- 
Sander Kuusemets
University of Tartu, High Performance Computing, IT Specialist

On 12/07/2016 06:31 PM, Aaron Knister wrote:
> Thanks, Sander. That's disconcerting... yikes! Sorry for your trouble,
> but thank you for sharing.
>
> I'm surprised this didn't shake out during testing of GPFS 3.5 and
> 4.1. I wonder if, in light of this, it's wise to do the clients first?
> My logic being that there's clearly an example here of 4.1 servers
> expecting behavior that only 4.1 clients provide. I suppose, though,
> that there's just as likely a chance of a yet-to-be-discovered bug in a
> situation where a 4.1 client expects something not provided by a 3.5
> server. Our current plan is still to take the servers first, but I
> suspect we'll do a fair bit of testing with the PIT commands in our
> test environment, just out of curiosity.
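>
> Probably even something as simple as a scratch fileset create/delete
> cycle would exercise that code path, e.g. (testfs, testfset and the
> junction path being placeholders):
>
>     mmcrfileset testfs testfset
>     mmlinkfileset testfs testfset -J /gpfs/testfs/testfset
>     mmunlinkfileset testfs testfset
>     mmdelfileset testfs testfset -f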
>
> Also out of curiosity, how long ago did you open that PMR? I'm
> wondering if there's a chance they've fixed this issue. I'm also
> perplexed and concerned that I can't find any mention of the PIT
> commands to avoid during upgrades anywhere in the GPFS upgrade
> documentation.
>
> -Aaron
>
> On 12/6/16 2:25 AM, Sander Kuusemets wrote:
>> Hello Aaron,
>>
>> I thought I'd share my two cents, as I just went through the process. My
>> plan was the same: start upgrading where I could and wait until machines
>> became available. It took me around 5 weeks to complete the process, but
>> the last two were only because I was being super careful.
>>
>> At first nothing happened, but at one point, a week into the upgrade
>> cycle, when I was messing around with a fileset (create, delete, test),
>> I suddenly got the weirdest of error messages while trying to delete the
>> fileset for the third time from a client node - I sadly cannot remember
>> exactly what it said, but I can describe what happened.
>>
>> After the error message, the current manager of our cluster fell into
>> the arbitrating state, its metadata disks were put into the down state,
>> manager status was given to our other server node, and its log was
>> spammed with a lot of error messages, something like this:
>>
>>> mmfsd:
>>> /project/sprelbmd0/build/rbmd0s004a/src/avs/fs/mmfs/ts/cfgmgr/pitrpc.h:1411: 
>>>
>>> void logAssertFailed(UInt32, const char*, UInt32, Int32, Int32,
>>> UInt32, const char*, const char*): Assertion `msgLen >= (sizeof(Pad32)
>>> + 0)' failed.
>>> Wed Nov  2 19:24:01.967 2016: [N] Signal 6 at location 0x7F9426EFF625
>>> in process 15113, link reg 0xFFFFFFFFFFFFFFFF.
>>> Wed Nov  2 19:24:05.058 2016: [X] *** Assert exp(msgLen >=
>>> (sizeof(Pad32) + 0)) in line 1411 of file
>>> /project/sprelbmd0/build/rbmd0s004a/src/avs/fs/mmfs/ts/cfgmgr/pitrpc.h
>>> Wed Nov  2 19:24:05.059 2016: [E] *** Traceback:
>>> Wed Nov  2 19:24:05.060 2016: [E]         2:0x7F9428BAFBB6
>>> logAssertFailed + 0x2D6 at ??:0
>>> Wed Nov  2 19:24:05.061 2016: [E]         3:0x7F9428CBEF62
>>> PIT_GetWorkMH(RpcContext*, char*) + 0x6E2 at ??:0
>>> Wed Nov  2 19:24:05.062 2016: [E]         4:0x7F9428BCBF62
>>> tscHandleMsg(RpcContext*, MsgDataBuf*) + 0x512 at ??:0
>>> Wed Nov  2 19:24:05.063 2016: [E]         5:0x7F9428BE62A7
>>> RcvWorker::RcvMain() + 0x107 at ??:0
>>> Wed Nov  2 19:24:05.064 2016: [E]         6:0x7F9428BE644B
>>> RcvWorker::thread(void*) + 0x5B at ??:0
>>> Wed Nov  2 19:24:05.065 2016: [E]         7:0x7F94286F6F36
>>> Thread::callBody(Thread*) + 0x46 at ??:0
>>> Wed Nov  2 19:24:05.066 2016: [E]         8:0x7F94286E5402
>>> Thread::callBodyWrapper(Thread*) + 0xA2 at ??:0
>>> Wed Nov  2 19:24:05.067 2016: [E]         9:0x7F9427E0E9D1
>>> start_thread + 0xD1 at ??:0
>>> Wed Nov  2 19:24:05.068 2016: [E]         10:0x7F9426FB58FD clone +
>>> 0x6D at ??:0
>> After this I tried to bring the disks up again, which failed half-way
>> through and did the same thing to the other server node (the current
>> manager). So after this my cluster had effectively failed, because all
>> the metadata disks were down and there was no path to the data disks.
>> When I tried to bring all the metadata disks up with one start command,
>> it worked on the third try and the cluster got into a working state
>> again. Downtime was about an hour.
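>>
>> In rough terms the recovery amounted to something like this (gpfs0 here
>> is just a stand-in for the actual filesystem device):
>>
>>     mmgetstate -a             # check which nodes are down or arbitrating
>>     mmlsdisk gpfs0 -e         # list the disks that are not up and ready
>>     mmchdisk gpfs0 start -a   # try to start all the down disks at once
>>     mmlsdisk gpfs0 -e         # confirm nothing is still down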
>>
>> I created a PMR with this information and they said that it's a bug, but
>> a tricky one, so it's going to take a while; in the meantime it's not
>> recommended to use any of the commands on this list:
>>
>>> Our apologies for the delayed response. Based on the debug data we
>>> have and looking at the source code, we believe the assert is due to
>>> an incompatibility arising from the feature level version of the
>>> RPCs. In this case the culprit is the PIT "interesting inode" code.
>>>
>>> Several user commands employ PIT (Parallel Inode Traversal) code to
>>> traverse each data block of every file:
>>>
>>>>
>>>>     mmdelfileset
>>>>     mmdelsnapshot
>>>>     mmdefragfs
>>>>     mmfileid
>>>>     mmrestripefs
>>>>     mmdeldisk
>>>>     mmrpldisk
>>>>     mmchdisk
>>>>     mmadddisk
>>> The problematic one is the 'PitInodeListPacket' subrpc, which is part
>>> of an "interesting inode" code change. Looking at the dumps it's
>>> evident that node 'node3', which sent the RPC, is not capable of
>>> supporting interesting inodes (max feature level is 1340), and node
>>> server11, which is receiving it, is trying to interpret the RPC beyond
>>> the valid region (as its feature level 1502 supports PIT interesting
>>> inodes).
>>
>> And apparently any of the fileset commands shouldn't be used either, as
>> those are what failed for me.
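>>
>> If one of those commands can't wait until the upgrade is finished, it's
>> probably worth first checking that every node involved reports the same
>> daemon level, for example with:
>>
>>     mmdiag --version             # run on each node
>>     mmlsconfig minReleaseLevel   # cluster-wide minimum release level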
>>
>> After I finished the upgrade, everything has been working wonderfully.
>> But during the upgrade itself I'd recommend treading really carefully.
>>
>> Best regards,
>>
>



