[gpfsug-discuss] TSM 6.x + Space Management + GPFS 3.4.xx (all linux) ... try again
Jonathan Buzzard
j.buzzard at dundee.ac.uk
Thu Mar 22 12:04:50 GMT 2012
On 03/21/2012 06:28 PM, Jez Tucker wrote:
> Yup.
>
> We’re running svr 6.2.2-30 and BA/HSM 6.2.4-1, GPFS 3.4.0-10 on RH Ent
> 5.4 x64
>
> At the moment, we’re running migration policies ‘auto-manually’ via a
> script which checks if it needs to be run as the THRESHOLDs are not working.
>
> We’ve noticed the behaviour/stability of thresholds change each release
> from 3.4.0-8 onwards. 3.4.0-5 worked, but we jumped to .8 as we were
> told DMAPI for Windows was available [but undocumented], alas not.
>
> I had a previous PMR with support who told me to set:
>
> enableLowSpaceEvents=no
>
> -z=yes on our filesystem(s)
>
> Our tsm server has the correct callback setup:
>
> [root at tsm01 ~]# mmlscallback
>
> DISKSPACE
>
> command = /usr/lpp/mmfs/bin/mmstartpolicy
>
> event = lowDiskSpace,noDiskSpace
>
> node = tsm01.rushesfx.co.uk
>
> parms = %eventName %fsName
>
> N.B. I set the node just to be tsm01 as other nodes do not have HSM
> installed, hence if the callback occurred on those nodes, they’d run
> mmstartpolicy which would run dsmmigrate which is not installed on those
> nodes.
Note that you can have more than one node with the HSM client installed;
that gives you some redundancy should a node fail.
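For reference, restricting the callback to the HSM-capable nodes is done
with -N at registration time. A hypothetical example (tsm01/tsm02 are
placeholder node names, adjust to your cluster):

```shell
# Hypothetical registration: fire the callback only on nodes that
# actually have the HSM client installed (node names are placeholders).
mmaddcallback DISKSPACE \
    --command /usr/lpp/mmfs/bin/mmstartpolicy \
    --event lowDiskSpace,noDiskSpace \
    -N tsm01,tsm02 \
    --parms "%eventName %fsName"
```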
Apart from that, your current setup is a *REALLY* bad idea. As I
understand it, GPFS raises the lowDiskSpace event every two minutes for
as long as you are over the threshold, and each time it will call the
mmstartpolicy command. That's fine if your policy can run inside two
minutes and bring the usage back below the threshold. As that is
extremely unlikely, you need to write a script with locking to prevent
it happening; otherwise you will have multiple instances of the policy
running all at once and bringing everything to its knees.
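The core of the locking idea can be sketched in a few lines (paths here
are my own assumptions, not from the script below): mkdir is atomic, so
only one process can create the lock directory, and on GPFS you put it
on the shared filesystem so the exclusion is cluster-wide:

```shell
#!/bin/sh
# Minimal locking sketch (assumed path): mkdir is atomic, so only one
# process at a time can create the lock directory. On GPFS the lock
# should live on the shared filesystem so it is cluster-wide.
LOCKDIR="${TMPDIR:-/tmp}/ilm_cycle.lock"

if mkdir "$LOCKDIR" 2>/dev/null; then
    echo "lock acquired"
    # a second instance arriving now fails the same mkdir and backs off
    if mkdir "$LOCKDIR" 2>/dev/null; then
        echo "double acquire (should never happen)"
    else
        echo "concurrent acquire refused"
    fi
    # ... mmapplypolicy would run here ...
    rmdir "$LOCKDIR"    # release the lock when done
else
    echo "lock held by another instance, exiting" >&2
    exit 1
fi
```

The full run_ilm_cycle script further below uses the same technique, plus
traps so the lock is released even if the script is killed.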
I would add that the GPFS documentation surrounding this is *very* poor,
and coupled with the utter failure of the release notes to mention the
change of behaviour between 3.2 and 3.3, this whole area needs to be
approached with caution, as clearly IBM are happy to break things
without telling us.
That said, I run with the following on 3.4.0-6:
DISKSPACE
command = /usr/local/bin/run_ilm_cycle
event = lowDiskSpace
node = nsdnodes
parms = %eventName %fsName
And run_ilm_cycle works just fine; it is included inline below and is
installed on all NSD nodes. This is not strict HSM, as it is pushing
data from my fast disk to my slow disk. However, as my nearline pool is
not full, I have not yet applied HSM to that pool. In fact, although I
have HSM enabled and working on the file system, it is all turned off:
because we are still running 5.5 servers we cannot install the 6.3
client, and without the 6.3 client you cannot turn off dsmscoutd, which
just tanks our file system when it starts.
Note to anyone still reading: I urge you to read
http://www-01.ibm.com/support/docview.wss?uid=swg1IC73091
and upgrade your TSM client if necessary.
JAB.
#!/bin/bash
#
# Wrapper script to run mmapplypolicy on a GPFS file system when a callback
# is triggered. Specifically it is intended to be triggered by a lowDiskSpace
# event registered with a callback like the following.
#
# mmaddcallback DISKSPACE --command /usr/local/bin/run_ilm_cycle --event
# lowDiskSpace -N nsdnodes --parms "%eventName %fsName"
#
# The script includes cluster-wide quiescence locking so that it plays nicely
# with other automated scripts that need GPFS quiescence to run.
#
EVENT_NAME=$1
FS=$2
PROG=$(basename "$0")
# determine the mount point for the file system
MOUNT_POINT=$(/usr/lpp/mmfs/bin/mmlsfs "${FS}" | grep "\-T" | awk '{print $2}')
HOSTNAME=$(/bin/hostname -s)
# lock file
LOCKDIR="${MOUNT_POINT}/ctdb/quiescence.lock"
# exit codes and text for them
ENO_SUCCESS=0; ETXT[0]="ENO_SUCCESS"
ENO_GENERAL=1; ETXT[1]="ENO_GENERAL"
ENO_LOCKFAIL=2; ETXT[2]="ENO_LOCKFAIL"
ENO_RECVSIG=3; ETXT[3]="ENO_RECVSIG"
#
# Attempt to get a lock
#
trap 'ECODE=$?; echo "[${PROG}] Exit: ${ETXT[ECODE]}($ECODE)" >&2' 0
echo -n "[${PROG}] Locking: " >&2
if mkdir "${LOCKDIR}" &>/dev/null; then
    # lock succeeded, install signal handlers
    trap 'ECODE=$?;
          echo "[${PROG}] Removing lock. Exit: ${ETXT[ECODE]}($ECODE)" >&2
          rm -rf "${LOCKDIR}"' 0
    # the following handler will exit the script on receiving these signals;
    # the trap on "0" (EXIT) from above will be triggered by this script's
    # "exit" command!
    trap 'echo "[${PROG}] Killed by a signal." >&2
          exit ${ENO_RECVSIG}' 1 2 3 15
    echo "success, installed signal handlers"
else
    # exit, we're locked!
    echo "lock failed, other operation running" >&2
    exit ${ENO_LOCKFAIL}
fi
# note what we are doing and where we are doing it
/bin/touch "${LOCKDIR}/${EVENT_NAME}.${HOSTNAME}"
# apply the policy
echo "running mmapplypolicy for the file system: ${FS}"
/usr/lpp/mmfs/bin/mmapplypolicy "${FS}" -N nsdnodes -P "${MOUNT_POINT}/rules.txt"
exit 0
--
Jonathan A. Buzzard Tel: +441382-386998
Storage Administrator, College of Life Sciences
University of Dundee, DD1 5EH