<font size=2 face="sans-serif">I was surprised to read that Ctrl-C did

not really kill restripe.  It's supposed to!  If it doesn't,

that's a bug.  </font><br><br><font size=2 face="sans-serif">I ran this by my expert within IBM and

he wrote to me:</font><br><br><font size=2 face="sans-serif">First of all, a "PIT job" such

as restripe, deldisk, delsnapshot, and so on should be easy to stop by sending ^C to

the management program that started it.  The SG manager daemon holds

open a socket to the client program for the purposes of sending command

output, progress updates, error messages and the like.  The PIT code

checks this socket periodically and aborts the PIT process cleanly if the

socket is closed.  If this cleanup doesn't occur, it is a bug and

should be worth reporting.  However, there's no exact guarantee on

how quickly each thread on the SG mgr will notice and then how quickly

the helper nodes can be stopped and so forth.  The interval between

socket checks depends, among other things, on how long it takes to process

each file; if there are a few very large files, the delay can be significant.

 In the limiting case, where most of the FS storage is contained in

a few files, this mechanism doesn't work [elided] well.  So it can

be quite involved and slow sometimes to wrap up a PIT operation.</font><br><br><font size=2 face="sans-serif">The simplest way to determine if the

command has really stopped is to run mmdiag --commands on the

SG manager node.  This shows running commands with the command line,

start time, socket, flags, etc.  After ^Cing the client program, the

entry here should linger for a while, then go away.  When the command exits,

you'll see an entry in the GPFS log file where it fails with err 50.  If

this doesn't stop the command after a while, it is worth looking into.</font><br><br><font size=2 face="sans-serif">If the command wasn't issued on the

SG mgr node and you can't find where the client command is running,

the socket is still a useful hint.  While tedious, it should be possible

to trace this socket back to the node where that command was originally run

using netstat or equivalent.  Poking around inside a GPFS internaldump

will also provide clues; there should be an outstanding sgmMsgSGClientCmd

command listed in the dump tscomm section.  Once you find it, just run

'kill `pidof mmrestripefs`' or similar.</font><br><br><font size=2 face="sans-serif">I'd like to warn the OP away from </font><tt><font size=2>mmfsadm

test pit</font></tt><font size=2 face="sans-serif">.  These commands

are of course unsupported and unrecommended for any purpose (even internal

test and development purposes, as far as I know).  You are definitely

working without a net there.  When I was improving the integration

between PIT and snapshot quiesce a few years ago, I looked into this and

couldn't figure out how to (easily) make these stop and resume commands

safe to use, so as far as I know they remain unsafe.  The list command,

however, is probably fairly okay, but it would probably be better to use

mmfsadm saferdump pit.<br></font><br><br><br><br><br><font size=1 color=#5f5f5f face="sans-serif">From:      

 </font><font size=1 face="sans-serif">Aaron Knister <aaron.s.knister@nasa.gov></font><br><font size=1 color=#5f5f5f face="sans-serif">To:      

 </font><font size=1 face="sans-serif"><gpfsug-discuss@spectrumscale.org></font><br><font size=1 color=#5f5f5f face="sans-serif">Date:      

 </font><font size=1 face="sans-serif">08/15/2016 10:49 PM</font><br><font size=1 color=#5f5f5f face="sans-serif">Subject:    

   </font><font size=1 face="sans-serif">[gpfsug-discuss]

mmfsadm test pit</font><br><font size=1 color=#5f5f5f face="sans-serif">Sent by:    

   </font><font size=1 face="sans-serif">gpfsug-discuss-bounces@spectrumscale.org</font><br><hr noshade><br><br><br><tt><font size=2>I just discovered this interesting gem poking at mmfsadm:<br><br>  test pit fsname list|suspend|status|resume|stop [jobId]<br><br>There have been times where I've kicked off a restripe and either <br>intentionally or accidentally ctrl-c'd it only to realize that many <br>times it's disappeared into the ether and is still running. The only way

I've known so far to stop it is with a chmgr. A far more painful instance happened when I ran a rebalance on an fs w/more than 31 nsds using more than 31 pit workers and hit *that* fun APAR which locked up access for a single filesystem to all 3.5k nodes.

<br>We spent 48 hours round the clock rebooting nodes as jobs drained to <br>clear it up. I would have killed in that instance for a way to cancel <br>the PIT job (the chmgr trick didn't work). It looks like you might <br>actually be able to do this with mmfsadm, although how wise this is, I

<br>do not know (kinda curious about that).<br><br>Here's an example. I kicked off a restripe and then ctrl-c'd it on a <br>client node. Then ran these commands from the fs manager:<br><br>root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list<br>JobId 785979015170 PitJobStatus PIT_JOB_RUNNING progress 0.00<br>debug: statusListP D40E2C70<br><br>root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal stop <br>785979015170<br>debug: statusListP 0<br><br>root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list<br>JobId 785979015170 PitJobStatus PIT_JOB_STOPPING progress 4.01<br>debug: statusListP D4013E70<br><br>... some time passes ...<br><br>root@loremds19:~ # /usr/lpp/mmfs/bin/mmfsadm test pit tlocal list<br>debug: statusListP 0<br><br>Interesting.<br><br>-Aaron<br><br>-- <br>Aaron Knister<br>NASA Center for Climate Simulation (Code 606.2)<br>Goddard Space Flight Center<br>(301) 286-2776<br></font></tt><br><BR>
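<br><font size=2 face="sans-serif">A minimal sketch of the mmdiag --commands check described in the reply at the top of this message: poll on the SG manager node until no running command matches a given name. The function name wait_for_cmd_exit, the MMDIAG override variable, and the retry defaults are assumptions made for illustration and are not part of GPFS. The match is a plain fixed-string grep on the command name, since the exact layout of the mmdiag output is not shown in this thread.</font><br><br>

```shell
#!/bin/sh
# Sketch only: wait until `mmdiag --commands` no longer lists a command
# matching PATTERN. The function name, the MMDIAG override and the
# defaults are illustrative assumptions, not part of GPFS.
#
# usage: wait_for_cmd_exit PATTERN [TRIES] [DELAY_SECONDS]
wait_for_cmd_exit() {
    wfc_pattern=$1
    wfc_tries=${2:-30}
    wfc_delay=${3:-10}
    wfc_i=0
    while [ "$wfc_i" -lt "$wfc_tries" ]; do
        # MMDIAG is deliberately unquoted so a stub with arguments can be used.
        if ! wfc_out=$(${MMDIAG:-/usr/lpp/mmfs/bin/mmdiag} --commands 2>/dev/null); then
            echo "mmdiag failed" >&2
            return 2
        fi
        # Fixed-string match on the command name; no assumption is made
        # about the column layout of the mmdiag output.
        if ! printf '%s\n' "$wfc_out" | grep -qF -- "$wfc_pattern"; then
            echo "gone"
            return 0
        fi
        wfc_i=$((wfc_i + 1))
        sleep "$wfc_delay"
    done
    echo "still running"
    return 1
}
```

<font size=2 face="sans-serif">Run on the SG manager node after ^Cing the client program, for example: wait_for_cmd_exit mmrestripefs 60 30. If it still reports "still running" after a generous wait, that is the case the reply describes as worth looking into.</font><br>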