[gpfsug-discuss] mounts taking longer in 4.2 vs 4.1?

Aaron Knister aaron.s.knister at nasa.gov
Wed Feb 7 21:28:46 GMT 2018


I noticed something curious after migrating some nodes from 4.1 to 4.2:
mounts can now take foorrreeevverrr. It seems to boil down to the point
in the mount process where getEFOptions is called.

To highlight the difference:

4.1:
# /usr/bin/time /usr/lpp/mmfs/bin/mmcommon getEFOptions dnb02 skipMountPointCheck >/dev/null
0.16user 0.04system 0:00.43elapsed 45%CPU (0avgtext+0avgdata 9108maxresident)k
0inputs+2768outputs (0major+15404minor)pagefaults 0swaps

4.2:
# /usr/bin/time /usr/lpp/mmfs/bin/mmcommon getEFOptions dnb02 skipMountPointCheck >/dev/null
9.75user 3.79system 0:23.35elapsed 58%CPU (0avgtext+0avgdata 10832maxresident)k
0inputs+38104outputs (0major+3135097minor)pagefaults 0swaps

That's, uh... roughly a 54x increase (0.43s to 23.35s elapsed). If you
have 25+ filesystems and 3500 nodes, that time really starts to add up.
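For a rough sense of scale, here's the arithmetic on the elapsed times above as a small sh/awk sketch. The serial_cost figure is hypothetical: it assumes a node runs one getEFOptions call per filesystem, one after another, each costing about what I measured, which may not match how mounts are actually scheduled.

```shell
#!/bin/sh
# Ratio of two elapsed times (new / old), to one decimal place.
slowdown() {
    awk -v o="$1" -v n="$2" 'BEGIN { printf "%.1f\n", n / o }'
}

# Total seconds for COUNT calls at SECONDS each, assuming they run one
# after another (a hypothetical serial-mount scenario).
serial_cost() {
    awk -v c="$1" -v t="$2" 'BEGIN { printf "%.2f\n", c * t }'
}

# Elapsed times from the /usr/bin/time runs above: 4.1 = 0.43 s, 4.2 = 23.35 s.
echo "slowdown: $(slowdown 0.43 23.35)x"
echo "25 filesystems on 4.1: $(serial_cost 25 0.43) s"
echo "25 filesystems on 4.2: $(serial_cost 25 23.35) s"
```

That works out to 54.3x per call, and 583.75 s (about 9.7 minutes) versus 10.75 s per node for 25 filesystems.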

It looks like under 4.2 this getEFOptions call triggers a bunch of
mmsdrfs parsing and node-list generation, whereas on 4.1 that doesn't
happen. Digging in a little deeper, it looks to me like the big
difference is in gpfsClusterInit, after the node fetches the "shadow"
mmsdrfs file. Here's a 4.1 node:

gpfsClusterInit:mmsdrfsdef.sh[2827]> loginPrefix=''
gpfsClusterInit:mmsdrfsdef.sh[2828]> [[ -n '' ]]
gpfsClusterInit:mmsdrfsdef.sh[2829]> /usr/bin/scp supersecrethost:/var/mmfs/gen/mmsdrfs /var/mmfs/gen/mmsdrfs.25326
gpfsClusterInit:mmsdrfsdef.sh[2830]> rc=0
gpfsClusterInit:mmsdrfsdef.sh[2831]> [[ 0 -ne 0 ]]
gpfsClusterInit:mmsdrfsdef.sh[2863]> [[ -f /var/mmfs/gen/mmsdrfs.25326 ]]
gpfsClusterInit:mmsdrfsdef.sh[2867]> /usr/bin/diff /var/mmfs/gen/mmsdrfs.25326 /var/mmfs/gen/mmsdrfs
gpfsClusterInit:mmsdrfsdef.sh[2867]> 1> /dev/null 2> /dev/null
gpfsClusterInit:mmsdrfsdef.sh[2868]> rc=0
gpfsClusterInit:mmsdrfsdef.sh[2869]> [[ 0 -ne 0 ]]
gpfsClusterInit:mmsdrfsdef.sh[2874]> sdrfsFile=/var/mmfs/gen/mmsdrfs
gpfsClusterInit:mmsdrfsdef.sh[2875]> /bin/rm -f /var/mmfs/gen/mmsdrfs.25326


Here's a 4.2 node:

gpfsClusterInit:mmsdrfsdef.sh[2938]> loginPrefix=''
gpfsClusterInit:mmsdrfsdef.sh[2939]> [[ -n '' ]]
gpfsClusterInit:mmsdrfsdef.sh[2940]> /usr/bin/scp supersecrethost:/var/mmfs/gen/mmsdrfs /var/mmfs/gen/mmsdrfs.8534
gpfsClusterInit:mmsdrfsdef.sh[2941]> rc=0
gpfsClusterInit:mmsdrfsdef.sh[2942]> [[ 0 -ne 0 ]]
gpfsClusterInit:mmsdrfsdef.sh[2974]> /bin/rm -f /var/mmfs/tmp/cmdTmpDir.mmcommon.8534/tmpsdrfs.gpfsClusterInit
gpfsClusterInit:mmsdrfsdef.sh[2975]> [[ -f /var/mmfs/gen/mmsdrfs.8534 ]]
gpfsClusterInit:mmsdrfsdef.sh[2979]> /usr/bin/diff /var/mmfs/gen/mmsdrfs.8534 /var/mmfs/gen/mmsdrfs
gpfsClusterInit:mmsdrfsdef.sh[2979]> 1> /dev/null 2> /dev/null
gpfsClusterInit:mmsdrfsdef.sh[2980]> rc=0
gpfsClusterInit:mmsdrfsdef.sh[2981]> [[ 0 -ne 0 ]]
gpfsClusterInit:mmsdrfsdef.sh[2986]> sdrfsFile=/var/mmfs/gen/mmsdrfs


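It looks like the 4.1 code deletes the shadow mmsdrfs file if it's no
different from the local copy on the node, whereas 4.2 does *not* do
that. (The only rm in the 4.2 trace removes
/var/mmfs/tmp/cmdTmpDir.mmcommon.8534/tmpsdrfs.gpfsClusterInit, not the
shadow copy.)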
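To make the difference concrete, here's a stripped-down, hypothetical shell sketch of just the compare-and-cleanup step as it shows up in the two traces. It is not the actual mmsdrfsdef.sh code: the function name fetch_compare_sketch, its remove/keep switch, the scratch directory standing in for /var/mmfs/gen, and cp standing in for the scp from the config server are all my own assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of the shadow-file handling seen in the gpfsClusterInit
# traces; NOT the real mmsdrfsdef.sh code.
#   $1 = directory standing in for /var/mmfs/gen
#   $2 = file standing in for the config server's mmsdrfs (the scp source)
#   $3 = "remove" to mimic the 4.1 trace, anything else to mimic 4.2
fetch_compare_sketch() {
    gen=$1; server_copy=$2; mode=$3
    shadow="$gen/mmsdrfs.$$"                  # shadow copy, suffixed with our PID
    cp "$server_copy" "$shadow" || return 1   # cp stands in for scp
    if diff "$shadow" "$gen/mmsdrfs" >/dev/null 2>&1; then
        sdrfsFile="$gen/mmsdrfs"              # identical: keep using the local copy
        if [ "$mode" = remove ]; then
            rm -f "$shadow"                   # 4.1: identical shadow is deleted
        fi                                    # 4.2: no rm, so the shadow lingers
    fi
    # The "files differ" path isn't visible in the traces, so it's omitted.
    return 0
}

# Report whether the shadow copy for this PID still exists in directory $1.
report_shadow() {
    if [ -f "$1/mmsdrfs.$$" ]; then echo "shadow present"; else echo "shadow removed"; fi
}

# Example with an identical server copy, using a scratch directory.
dir=$(mktemp -d)
printf 'cluster config\n' > "$dir/mmsdrfs"
cp "$dir/mmsdrfs" "$dir/server-mmsdrfs"
fetch_compare_sketch "$dir" "$dir/server-mmsdrfs" remove
report_shadow "$dir"     # shadow removed  (4.1-style)
fetch_compare_sketch "$dir" "$dir/server-mmsdrfs" keep
report_shadow "$dir"     # shadow present  (4.2-style)
rm -rf "$dir"
```

The 4.2-style call leaves an mmsdrfs.<pid> file behind even though its contents match the local mmsdrfs exactly.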

This seems to cause a problem when checkMmfsEnvironment is called,
because it returns 1 if the shadow file exists, which, according to the
function's comments, indicates "something is not right". That triggers
the environment update, which is where the slowdown is incurred.

On 4.1, checkMmfsEnvironment returned 0 because the shadow mmsdrfs file
had been removed; on 4.2 it returned 1 because the shadow file still
existed, despite being identical to the mmsdrfs on the node.
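The checkMmfsEnvironment side of this can be reduced to a very small test. This is hypothetical and based only on the behavior described above (return 1 when the shadow file exists); check_env_sketch and its arguments are made up for illustration, and the real checkMmfsEnvironment may well test other conditions too.

```shell
#!/bin/sh
# Hypothetical stand-in for the shadow-file test in checkMmfsEnvironment;
# NOT the real GPFS function.
#   $1 = directory standing in for /var/mmfs/gen
#   $2 = PID suffix of the shadow copy
# Returns 1 ("something is not right") if the shadow copy still exists,
# which is the case that triggers the slow environment update.
check_env_sketch() {
    if [ -f "$1/mmsdrfs.$2" ]; then
        return 1
    fi
    return 0
}

# Example: an identical shadow copy left behind (as on 4.2) still trips it.
dir=$(mktemp -d)
printf 'cluster config\n' > "$dir/mmsdrfs"
cp "$dir/mmsdrfs" "$dir/mmsdrfs.8534"          # leftover shadow, as on 4.2
status=0; check_env_sketch "$dir" 8534 || status=$?
echo "with shadow: rc=$status"                 # with shadow: rc=1
rm -f "$dir/mmsdrfs.8534"                      # what 4.1 effectively did
status=0; check_env_sketch "$dir" 8534 || status=$?
echo "without shadow: rc=$status"              # without shadow: rc=0
rm -rf "$dir"
```

Note the check only looks at whether the file exists, not at whether its contents differ, which is why an identical leftover copy is enough to send 4.2 down the slow path.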

I've looked at 4.2.3.6 (efix12). It doesn't look like 4.2.3.7 has
dropped yet, so it's possible this has been fixed there.

Maybe it's time for a PMR...

-Aaron

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


