[gpfsug-discuss] mmfsadm test pit

Aaron Knister aaron.s.knister at nasa.gov
Tue Aug 16 03:22:17 BST 2016


I just discovered this interesting gem while poking at mmfsadm:

  test pit fsname list|suspend|status|resume|stop [jobId]
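
For anyone who wants to script this, here is a minimal Python sketch that only builds the argument list for the usage line above; it does not run anything. The names `pit_command` and `PIT_ACTIONS` are hypothetical helpers made up for illustration, the binary path is the one used in the session later in this post, and treating the job ID as optional for every action is an assumption read off the `[jobId]` in the usage text.

```python
# Path as used in the example session in this post.
MMFSADM = "/usr/lpp/mmfs/bin/mmfsadm"

# Actions taken verbatim from the usage line; nothing else is assumed to exist.
PIT_ACTIONS = ("list", "suspend", "status", "resume", "stop")


def pit_command(fsname, action, job_id=None):
    """Build the argv list for 'mmfsadm test pit'; this does not execute it."""
    if action not in PIT_ACTIONS:
        raise ValueError(f"unknown pit action: {action!r}")
    argv = [MMFSADM, "test", "pit", fsname, action]
    # Assumption: the job ID is simply appended when given, as in the usage line.
    if job_id is not None:
        argv.append(str(job_id))
    return argv


if __name__ == "__main__":
    print(pit_command("tlocal", "stop", 785979015170))
```

Keeping the builder separate from execution means the command can be printed and checked by eye before anything is sent to the filesystem manager.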

There have been times when I've kicked off a restripe and then, either 
intentionally or accidentally, ctrl-c'd it, only to realize that in many 
cases it has disappeared into the ether but is still running. The only 
way I've known so far to stop it is with an mmchmgr.

A far more painful instance happened when I ran a rebalance on a 
filesystem with more than 31 NSDs using more than 31 PIT workers and hit 
*that* fun APAR, which locked up access to a single filesystem on all 
3.5k nodes. We spent 48 hours around the clock rebooting nodes as jobs 
drained to clear it up. In that instance I would have killed for a way 
to cancel the PIT job (the mmchmgr trick didn't work). It looks like you 
might actually be able to do this with mmfsadm, although I don't know 
how wise that is (I'm kind of curious about that).

Here's an example. I kicked off a restripe and then ctrl-c'd it on a 
client node. I then ran these commands from the filesystem manager node:

root at loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list
JobId 785979015170 PitJobStatus PIT_JOB_RUNNING progress 0.00
debug: statusListP D40E2C70

root at loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal stop 785979015170
debug: statusListP 0

root at loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list
JobId 785979015170 PitJobStatus PIT_JOB_STOPPING progress 4.01
debug: statusListP D4013E70

... some time passes ...

root at loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list
debug: statusListP 0

Interesting.
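
If you wanted to watch for the job to go away without retyping the list command, something like the following Python sketch might do. It is based only on the one `JobId ... PitJobStatus ... progress ...` line shown above, so the regular expression is an assumption about the output format, not a documented one. `parse_pit_list` and `wait_until_gone` are hypothetical names, and `list_output` is a hypothetical callable you would supply that returns the text printed by the list command.

```python
import re
import time

# Pattern assumed from the single sample line in this post, e.g.
# "JobId 785979015170 PitJobStatus PIT_JOB_RUNNING progress 0.00"
JOB_LINE = re.compile(
    r"^JobId\s+(\d+)\s+PitJobStatus\s+(\S+)\s+progress\s+(\d+(?:\.\d+)?)\s*$"
)


def parse_pit_list(output):
    """Return (job_id, status, progress) tuples; 'debug:' and other lines are skipped."""
    jobs = []
    for line in output.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            jobs.append((int(match.group(1)), match.group(2), float(match.group(3))))
    return jobs


def wait_until_gone(list_output, job_id, interval=30.0, max_polls=120, sleep=time.sleep):
    """Poll list_output() until job_id is no longer listed.

    Returns True once the job is absent, False if max_polls is exhausted first.
    """
    for _ in range(max_polls):
        listed = [jid for jid, _status, _progress in parse_pit_list(list_output())]
        if job_id not in listed:
            return True
        sleep(interval)
    return False
```

In practice `list_output` could wrap `subprocess.run(..., capture_output=True, text=True).stdout` around the list command from the session above. Injecting it as a callable also lets the parsing be tried against pasted output first, which seems prudent given how little is known about this command.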

-Aaron

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


