<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


<meta name="Generator" content="Microsoft Word 15 (filtered medium)">


<style><!--


/* Font Definitions */


@font-face


        {font-family:"Cambria Math";


        panose-1:2 4 5 3 5 4 6 3 2 4;}


@font-face


        {font-family:Calibri;


        panose-1:2 15 5 2 2 2 4 3 2 4;}


/* Style Definitions */


p.MsoNormal, li.MsoNormal, div.MsoNormal


        {margin:0in;


        margin-bottom:.0001pt;


        font-size:11.0pt;


        font-family:"Calibri",sans-serif;}


a:link, span.MsoHyperlink


        {mso-style-priority:99;


        color:blue;


        text-decoration:underline;}


span.EmailStyle18


        {mso-style-type:personal-reply;


        font-family:"Calibri",sans-serif;


        color:windowtext;}


.MsoChpDefault


        {mso-style-type:export-only;


        font-size:10.0pt;}


@page WordSection1


        {size:8.5in 11.0in;


        margin:1.0in 1.0in 1.0in 1.0in;}


div.WordSection1


        {page:WordSection1;}


--></style>


</head>


<body lang="EN-US" link="blue" vlink="purple">


<div class="WordSection1">


<p class="MsoNormal">As Alex mentioned, there are tools that will keep filesystem metadata in a database and provide query tools.<o:p></o:p></p>


<p class="MsoNormal">NYGC uses Starfish and we’ve had good experience with it. At first the only feature we used was “sfdu”, which is a quick replacement for recursive du. Using this we can script CSV reports for selections of directories. As we have used Starfish more, we’ve started opening the web interface to people so they can look at selected areas of our filesystems, sort directories by size, mtime, and atime, and run other reports and queries. We’ve also started using the tagging functionality so we can quickly get an aggregate total (and growth over time) by tag across multiple directories.<o:p></o:p></p>
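<p class="MsoNormal">For anyone without such a tool, the kind of per-directory CSV report described above can be roughly approximated with a plain filesystem walk. This is only a sketch under stated assumptions: it is not how sfdu works, it walks the live filesystem (so it will be slow on hundreds of millions of files), it sums apparent file sizes rather than allocated blocks, and the names dir_usage and write_csv_report are made up for illustration.<o:p></o:p></p>
<pre>
```python
import csv
import os


def dir_usage(root: str) -> dict[str, int]:
    """Return the total apparent size in bytes under each directory below root."""
    totals: dict[str, int] = {}
    # Walk bottom-up so every child total exists before its parent is visited.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = 0
        for name in filenames:
            try:
                # lstat so symlinks are counted as links, not as their targets.
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
        for name in dirnames:
            # Symlinked directories are not walked, so they contribute 0.
            total += totals.get(os.path.join(dirpath, name), 0)
        totals[dirpath] = total
    return totals


def write_csv_report(roots: list[str], out) -> None:
    """Write one 'directory,bytes' row per requested root to the open file out."""
    writer = csv.writer(out)
    writer.writerow(["directory", "bytes"])
    for root in roots:
        writer.writerow([root, dir_usage(root).get(root, 0)])
```
</pre>
<p class="MsoNormal">Calling write_csv_report with a list of project directories and an open file gives a report that can be diffed week to week, which is roughly what we script around sfdu.<o:p></o:p></p>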


<p class="MsoNormal"><o:p> </o:p></p>


<p class="MsoNormal">We tried Robinhood years ago but found it was taking too much work to get it to scale to hundreds of millions of files and tens of PiB on GPFS. It might be better now.<o:p></o:p></p>


<p class="MsoNormal"><o:p> </o:p></p>


<p class="MsoNormal">IBM has a metadata product called Spectrum Discover that has the benefit of using GPFS-specific interfaces to stay up to date at all times. Many of the other tools require scheduled scans to update their database.<o:p></o:p></p>


<p class="MsoNormal">Igneous has a commercial tool called DataDiscover which also looked promising. ClarityNow and MediaFlux are other similar tools.<o:p></o:p></p>


<p class="MsoNormal">I expect all of these tools at the very least have nice replacements for du and find as well as some sort of web directory tree view.<o:p></o:p></p>
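<p class="MsoNormal">To make the idea concrete, here is a minimal sketch of the scan-into-a-database approach these tools share, using SQLite. Everything in it is an assumption for illustration: the files table, its columns, and the build_db and archivable_dirs helpers are hypothetical and do not reflect the schema of Starfish, Robinhood, or any other product. It also groups by immediate parent directory only, so a real report would still need to roll totals up the tree.<o:p></o:p></p>
<pre>
```python
import sqlite3

TIB = 1024 ** 4                    # bytes in one TiB
TWO_YEARS = 2 * 365 * 24 * 3600    # seconds, ignoring leap days


def build_db(rows):
    """Load (dir, name, size, atime) tuples from a scan into an in-memory table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files (dir TEXT, name TEXT, size INTEGER, atime REAL)"
    )
    db.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)
    return db


def archivable_dirs(db, min_bytes, cutoff):
    """Return (dir, total_bytes, newest_atime) for directories whose files total
    at least min_bytes and where no file was accessed after cutoff."""
    query = """
        SELECT dir, SUM(size) AS total, MAX(atime) AS newest
        FROM files
        GROUP BY dir
        HAVING SUM(size) >= ? AND ? >= MAX(atime)
        ORDER BY total DESC
    """
    return db.execute(query, (min_bytes, cutoff)).fetchall()
```
</pre>
<p class="MsoNormal">With the scan loaded, something like archivable_dirs(db, 5 * TIB, time.time() - TWO_YEARS) would answer the original question of large directories untouched for two years, without another pass over the filesystem.<o:p></o:p></p>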


<p class="MsoNormal"><o:p> </o:p></p>


<p class="MsoNormal">We had been running Starfish for a while, re-evaluated a few options in 2019, and ultimately decided to stay with Starfish for now.<o:p></o:p></p>


<p class="MsoNormal"><o:p> </o:p></p>


<p class="MsoNormal">Best,<o:p></o:p></p>


<p class="MsoNormal">Chris<o:p></o:p></p>


<p class="MsoNormal"><o:p> </o:p></p>


<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">


<p class="MsoNormal"><b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">&lt;gpfsug-discuss-bounces@spectrumscale.org&gt; on behalf of Alex Chekholko &lt;alex@calicolabs.com&gt;<br>
<b>Reply-To: </b>gpfsug main discussion list &lt;gpfsug-discuss@spectrumscale.org&gt;<br>
<b>Date: </b>Friday, April 3, 2020 at 7:51 PM<br>
<b>To: </b>gpfsug main discussion list &lt;gpfsug-discuss@spectrumscale.org&gt;<br>
<b>Subject: </b>Re: [gpfsug-discuss] fast search for archivable data sets<o:p></o:p></span></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">Hi Jim, <o:p></o:p></p>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">The common non-GPFS-specific way is to use a tool that dumps all of your filesystem metadata into an SQL database and then you can have a webapp that makes nice graphs/reports from the SQL database, or do your own queries.<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">The Free Software example is "Robinhood" (use the POSIX scanner, not the Lustre-specific one) and one proprietary example is Starfish.<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">In both cases, you need a pretty beefy machine for the database, and scanning your filesystem may take a long time, depending on your filesystem performance. Without any filesystem-specific hooks like a transaction log, you'll need to rescan the entire filesystem to update your database.<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">Regards,<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal">Alex<o:p></o:p></p>


</div>


</div>


<p class="MsoNormal"><o:p> </o:p></p>


<div>


<div>


<p class="MsoNormal">On Fri, Apr 3, 2020 at 3:25 PM Jim Kavitsky &lt;<a href="mailto:jkavitsky@23andme.com">jkavitsky@23andme.com</a>&gt; wrote:<o:p></o:p></p>


</div>


<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">


<div>


<p class="MsoNormal">Hello everyone, <o:p></o:p></p>


<div>


<p class="MsoNormal">I'm managing a low-multi-petabyte Scale filesystem with hundreds of millions of inodes, and I'm looking for the best way to locate archivable directories. For example, these might be directories whose contents total more than 5 or 10 TB and whose atimes are more than two years old.<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">Has anyone found a great way to do this with a policy engine run? If not, is there another good way that anyone would recommend? Thanks in advance,<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">Jim Kavitsky<o:p></o:p></p>


</div>


</div>




</blockquote>


</div>


</div>




</body>


</html>