[gpfsug-discuss] mmfsadm test pit
Aaron Knister
aaron.s.knister at nasa.gov
Tue Aug 16 03:22:17 BST 2016
I just discovered this interesting gem poking at mmfsadm:
test pit fsname list|suspend|status|resume|stop [jobId]
There have been times where I've kicked off a restripe and either
intentionally or accidentally ctrl-c'd it, only to realize that, more
often than not, it has disappeared into the ether and is still running.
The only way I've known so far to stop it is with a chmgr.
A far more painful instance happened when I ran a rebalance on an fs
w/more than 31 nsds using more than 31 pit workers and hit *that* fun
APAR which locked up access for a single filesystem to all 3.5k nodes.
We spent 48 hours round the clock rebooting nodes as jobs drained to
clear it up. I would have killed in that instance for a way to cancel
the PIT job (the chmgr trick didn't work). It looks like you might
actually be able to do this with mmfsadm, although how wise this is, I
do not know (kinda curious about that).
Here's an example. I kicked off a restripe and then ctrl-c'd it on a
client node. Then ran these commands from the fs manager:
root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list
JobId 785979015170 PitJobStatus PIT_JOB_RUNNING progress 0.00
debug: statusListP D40E2C70
root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal stop 785979015170
debug: statusListP 0
root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list
JobId 785979015170 PitJobStatus PIT_JOB_STOPPING progress 4.01
debug: statusListP D4013E70
... some time passes ...
root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list
debug: statusListP 0
Interesting.
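If you wanted to script against the list output, something like the
following might work. It's only a sketch: pit_jobs is a made-up helper
name (not part of GPFS), and it assumes the "JobId ... PitJobStatus ...
progress ..." line format is exactly what the session above shows, which
I've only seen on this one cluster. It just parses text; it doesn't call
mmfsadm or stop anything.

```shell
#!/bin/sh
# pit_jobs: hypothetical helper, not a GPFS command.
# Reads `mmfsadm test pit <fsname> list` output on stdin and prints
# "<jobId> <status> <progress>" for each job line. The "debug:" lines
# and anything else that doesn't match the assumed format are skipped.
pit_jobs() {
    awk '$1 == "JobId" && $3 == "PitJobStatus" && $5 == "progress" {
        print $2, $4, $6
    }'
}

# Sample input copied from the session above. On a real fs manager the
# input would come from: /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list
sample='JobId 785979015170 PitJobStatus PIT_JOB_RUNNING progress 0.00
debug: statusListP D40E2C70'

printf '%s\n' "$sample" | pit_jobs
```

An empty result would correspond to the last list call above, where only
the debug line comes back once the job is gone.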
-Aaron
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776