[gpfsug-discuss] TSM 6.x + Space Management + GPFS 3.4.xx (all linux) ... try again

Jonathan Buzzard j.buzzard at dundee.ac.uk
Thu Mar 22 12:04:50 GMT 2012


On 03/21/2012 06:28 PM, Jez Tucker wrote:
> Yup.
>
> We’re running svr 6.2.2-30 and BA/HSM 6.2.4-1, GPFS 3.4.0-10 on RH Ent
> 5.4 x64
>
> At the moment, we’re running migration policies ‘auto-manually’ via a
> script which checks if it needs to be run as the THRESHOLDs are not working.
>
> We’ve noticed the behaviour/stability of thresholds change each release
> from 3.4.0-8 onwards. 3.4.0-5 worked, but we jumped to .8 as we were
> told DMAPI for Windows was available [but undocumented], alas not.
>
> I had a previous PMR with support who told me to set:
>
> enableLowSpaceEvents=no
>
> -z=yes on our filesystem(s)
>
> Our tsm server has the correct callback setup:
>
> [root at tsm01 ~]# mmlscallback
> DISKSPACE
>         command       = /usr/lpp/mmfs/bin/mmstartpolicy
>         event         = lowDiskSpace,noDiskSpace
>         node          = tsm01.rushesfx.co.uk
>         parms         = %eventName %fsName
>
> N.B. I set the node just to be tsm01 as other nodes do not have HSM
> installed, hence if the callback occurred on those nodes, they’d run
> mmstartpolicy which would run dsmmigrate which is not installed on those
> nodes.

Note that you can have more than one node with the HSM client installed, 
which gives you some redundancy should a node fail.
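
For example, registering the callback on a pair of HSM nodes is a 
one-liner (the second node name here is made up):

   mmaddcallback DISKSPACE --command /usr/lpp/mmfs/bin/mmstartpolicy \
       --event lowDiskSpace,noDiskSpace -N tsm01,tsm02 \
       --parms "%eventName %fsName"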

Apart from that, your current setup is a *REALLY* bad idea. As I 
understand it, once you hit the lowDiskSpace event it is raised every 
two minutes, and each time it will call the mmstartpolicy command. 
That's fine if your policy can run inside two minutes and bring the 
usage back below the threshold.

As that is extremely unlikely, you need to write a script with locking 
to prevent it happening; otherwise you will have multiple instances of 
the policy running all at once and bringing everything to its knees.
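
For example, a minimal sketch of such a wrapper using flock(1) (the 
lock file path is an arbitrary choice; the mkdir-based script I use is 
inline below):

#!/bin/bash
# refuse to start a second policy run while one is still active
exec 9> /var/lock/mmstartpolicy.lock
if ! flock -n 9; then
        echo "policy run already in progress, skipping" >&2
        exit 0
fi
# the callback hands us %eventName and %fsName as $1 and $2
/usr/lpp/mmfs/bin/mmstartpolicy "$1" "$2"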

I would add that the GPFS documentation surrounding this is *very* poor, 
and coupled with the utter failure of the release notes to mention the 
change of behaviour between 3.2 and 3.3, this whole area needs to be 
approached with caution, as clearly IBM are happy to break things 
without telling us.

That said, I run with the following callback on 3.4.0-6:

DISKSPACE
         command       = /usr/local/bin/run_ilm_cycle
         event         = lowDiskSpace
         node          = nsdnodes
         parms         = %eventName %fsName


And the run_ilm_cycle script works just fine; it is included inline 
below and is installed on all NSD nodes. This is not strict HSM, as it 
is pushing from my fast disk to my slow disk. However, as my nearline 
pool is not full, I have not yet applied HSM to that pool. In fact, 
although I have HSM enabled and working on the file system, it is all 
turned off: because we are still running 5.5 servers we cannot install 
the 6.3 client, and without the 6.3 client you cannot turn off 
dsmscoutd, which just tanks our file system when it starts.
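
For reference, the rules.txt that run_ilm_cycle applies is just an 
ordinary migration policy, something along these lines (the pool names 
and thresholds here are illustrative, not my actual rules):

RULE 'migrate_fast_to_slow' MIGRATE FROM POOL 'system'
        THRESHOLD(90,80) WEIGHT(KB_ALLOCATED)
        TO POOL 'nearline'

Once a policy with a THRESHOLD clause is installed with mmchpolicy, it 
is that high-water mark which raises the lowDiskSpace event in the 
first place.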


For anyone still reading, I urge you to read

http://www-01.ibm.com/support/docview.wss?uid=swg1IC73091

and upgrade your TSM client if necessary.


JAB.


#!/bin/bash
#
# Wrapper script to run an mmapplypolicy on a GPFS file system when a callback
# is triggered. Specifically it is intended to be triggered by a lowDiskSpace
# event registered with a callback like the following.
#
#   mmaddcallback DISKSPACE --command /usr/local/bin/run_ilm_cycle --event
#   lowDiskSpace -N nsdnodes --parms "%eventName %fsName"
#
# The script includes cluster-wide quiescence locking so that it plays nicely
# with other automated scripts that need GPFS quiescence to run.
#

# script name, used in the log messages below
PROG=`/bin/basename $0`
EVENT_NAME=$1
FS=$2
# determine the mount point for the file system
MOUNT_POINT=`/usr/lpp/mmfs/bin/mmlsfs ${FS} | grep "\-T" | awk '{print $2}'`
HOSTNAME=`/bin/hostname -s`


# lock file
LOCKDIR="${MOUNT_POINT}/ctdb/quiescence.lock"

# exit codes and text for them
ENO_SUCCESS=0; ETXT[0]="ENO_SUCCESS"
ENO_GENERAL=1; ETXT[1]="ENO_GENERAL"
ENO_LOCKFAIL=2; ETXT[2]="ENO_LOCKFAIL"
ENO_RECVSIG=3; ETXT[3]="ENO_RECVSIG"

#
# Attempt to get a lock
#
trap 'ECODE=$?; echo "[${PROG}] Exit: ${ETXT[ECODE]}($ECODE)" >&2' 0
echo -n "[${PROG}] Locking: " >&2

if mkdir "${LOCKDIR}" &>/dev/null; then

         # lock succeeded, install signal handlers
         trap 'ECODE=$?;
         echo "[${PROG}] Removing lock. Exit: ${ETXT[ECODE]}($ECODE)" >&2
                 rm -rf "${LOCKDIR}"' 0
         # the following handler will exit the script on receiving these
         # signals; the trap on "0" (EXIT) from above will be triggered by
         # this script's "exit" command!
         trap 'echo "[${PROG}] Killed by a signal." >&2
                 exit ${ENO_RECVSIG}' 1 2 3 15
         echo "success, installed signal handlers"
else
         # exit, we're locked!
         echo "lock failed other operation running" >&2
         exit ${ENO_LOCKFAIL}
fi

# note what we are doing and where we are doing it
/bin/touch $LOCKDIR/${EVENT_NAME}.${HOSTNAME}

# apply the policy
echo "running mmapplypolicy for the file system: ${FS}"
/usr/lpp/mmfs/bin/mmapplypolicy $FS -N nsdnodes -P $MOUNT_POINT/rules.txt

exit 0



-- 
Jonathan A. Buzzard             Tel: +441382-386998
Storage Administrator, College of Life Sciences
University of Dundee, DD1 5EH


