[gpfsug-discuss] fast search for archivable data sets

Jaime Pinto pinto at scinet.utoronto.ca
Sat Apr 4 00:45:18 BST 2020


Hi Jim,

If you have never worked with policy rules before, you may want to start by getting comfortable with them.

In the /usr/lpp/mmfs/samples/ilm path you will find several example templates that you can play around with. I would start with the 'list' rules first.
Some of those templates are a bit complex, so here is one script that I use on a regular basis to detect files larger than 1MB (you can even exclude specific filesets):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dss-mgt1:/scratch/r/root/mmpolicyRules # cat mmpolicyRules-list-large
/* A macro to abbreviate VARCHAR */
define([vc],[VARCHAR($1)])

/* Define one external list */
RULE EXTERNAL LIST 'largefiles' EXEC '/gpfs/fs0/scratch/r/root/mmpolicyRules/mmpolicyExec-list'

/* Generate a list of all files that have more than 1MB of space allocated. */
RULE 'r2' LIST 'largefiles'
	SHOW('-u' vc(USER_ID) || ' -s' || vc(FILE_SIZE))
	/*FROM POOL 'system'*/
	FROM POOL 'data'
         /*FOR FILESET('root')*/
	WEIGHT(FILE_SIZE)
	WHERE KB_ALLOCATED > 1024

/* Files in special filesets, such as mmpolicyRules, are never moved or deleted */
RULE 'ExcSpecialFile' EXCLUDE
         FOR FILESET('mmpolicyRules','todelete','tapenode-stuff','toarchive')
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



And here is another to detect files not looked at for more than 6 months. I found it more effective to test both the access time and the creation time (ACCESS_TIME and CREATION_TIME in the rule below). You could combine this with the one above to filter on file size as well.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dss-mgt1:/scratch/r/root/mmpolicyRules # cat mmpolicyRules-list-atime-ctime-gt-6months
/* A macro to abbreviate VARCHAR */
define([vc],[VARCHAR($1)])

/* Define one external list */
RULE EXTERNAL LIST 'accessedfiles' EXEC '/gpfs/fs0/scratch/r/root/mmpolicyRules/mmpolicyExec-list'

/* Generate a list of all files, directories, plus all other file system objects,
    like symlinks, named pipes, etc, accessed prior to a certain date AND that are
    not owned by root. Include the owner's id with each object and sort them by
    the owner's id */

/* Files in special filesets, such as mmpolicyRules, are never moved or deleted */
RULE 'ExcSpecialFile' EXCLUDE
  	FOR FILESET ('scratch-root','todelete','root')

RULE 'r5' LIST 'accessedfiles'
	DIRECTORIES_PLUS
	FROM POOL 'data'
	SHOW('-u' vc(USER_ID) || ' -a' || vc(ACCESS_TIME) || ' -c' || vc(CREATION_TIME) || ' -s ' || vc(FILE_SIZE))
	WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) > 183) AND (DAYS(CURRENT_TIMESTAMP) - DAYS(CREATION_TIME) > 183) AND NOT USER_ID = 0
		AND NOT (PATH_NAME LIKE '/gpfs/fs0/scratch/r/root/%')
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Note that both of these scripts work on a system-wide (or root fileset) basis and will not give you per-directory results unless you run them several times, once per directory (not very efficient). To produce general lists per directory you would need to do some post-processing on the lists, with 'awk' or some other scripting language. If you need some samples, I can send you some.
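
As a rough illustration of that post-processing step, here is a minimal sketch in Python (rather than awk). It rests on assumptions you should check against your own output: that each list record looks like "inode generation snapid SHOW-text -- /full/path", and that the SHOW text carries the size as "-s<bytes>", as in rule 'r2' above. The function names and the depth parameter are made up for this example.

```python
from collections import defaultdict


def parse_list_line(line):
    """Return (size_bytes, path) for one list record, or None if it does not match.

    Assumed record layout: 'inode gen snapid  -u<uid> -s<bytes> -- /full/path'
    """
    head, sep, path = line.rstrip("\n").partition(" -- ")
    if not sep:
        return None
    size = None
    for token in head.split():
        # Pick out the '-s<bytes>' token produced by the SHOW clause.
        if token.startswith("-s") and token[2:].isdigit():
            size = int(token[2:])
    if size is None:
        return None
    return size, path


def dir_totals(lines, depth):
    """Sum file sizes per directory, truncating each path to `depth` components."""
    totals = defaultdict(int)
    for line in lines:
        record = parse_list_line(line)
        if record is None:
            continue  # skip anything that is not a recognizable record
        size, path = record
        parts = path.strip("/").split("/")
        # Drop the file name, then keep at most `depth` leading directories.
        key = "/" + "/".join(parts[:-1][:depth])
        totals[key] += size
    return dict(totals)


def larger_than(totals, threshold_bytes):
    """Return (directory, total) pairs above the threshold, largest first."""
    hits = [(d, s) for d, s in totals.items() if s > threshold_bytes]
    return sorted(hits, key=lambda item: -item[1])


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        "1001 1 0  -u500 -s2048 -- /gpfs/fs0/scratch/a/alice/run1/out.dat",
        "1002 1 0  -u500 -s4096 -- /gpfs/fs0/scratch/a/alice/run2/out.dat",
        "1003 1 0  -u501 -s1024 -- /gpfs/fs0/scratch/b/bob/data.bin",
    ]
    for directory, total in larger_than(dir_totals(sample, depth=5), 2000):
        print(directory, total)
```

In practice you would feed it the lines of the list file produced by the policy run and set the threshold to whatever "large" means for you (several TB, for instance). A record from the second rule, which writes '-s ' followed by a space before the size, would need a slightly different parser.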


And finally, you need to be more specific about what you mean by 'archivable'. Once you produce the lists you can do several things with them, or leverage the rules to actually execute actions such as move, delete, or HSM operations. The /usr/lpp/mmfs/samples/ilm path has samples for those as well.



On 4/3/2020 18:25:33, Jim Kavitsky wrote:
> Hello everyone,
> I'm managing a low-multi-petabyte Scale filesystem with hundreds of millions of inodes, and I'm looking for the best way to locate archivable directories. For example, these might be directories whose contents were greater than 5 or 10TB, and whose contents had atimes older than two years.
> 
> Has anyone found a great way to do this with a policy engine run? If not, is there another good way that anyone would recommend? Thanks in advance,

Yes, there is another way: the 'mmfind' utility, also in the same samples path. You have to compile it for your OS (see mmfind.README). It is a very powerful canned procedure that lets you use the "-exec" option just as in the normal Linux version of 'find'. I use it very often, and it is just as efficient as the policy-rules-based alternative.

Good luck.

Keep safe and confined.

Jaime


> 
> Jim Kavitsky

---
Jaime Pinto - Storage Analyst
SciNet HPC Consortium - Compute/Calcul Canada
www.scinet.utoronto.ca - www.computecanada.ca
University of Toronto
661 University Ave. (MaRS), Suite 1140
Toronto, ON, M5G1M1
P: 416-978-2755
C: 416-505-1477


