[gpfsug-discuss] fast search for archivable data sets

Christopher Black cblack at nygenome.org
Sat Apr 4 01:26:22 BST 2020


As Alex mentioned, there are tools that will keep filesystem metadata in a database and provide query tools.
NYGC uses Starfish and we’ve had a good experience with it. At first the only feature we used was “sfdu”, a quick replacement for recursive du; with it we can script CSV reports for selected directories. As we use Starfish more, we’ve started opening the web interface to people to look at selected areas of our filesystems, where they can sort directories by size, mtime, or atime and run other reports and queries. We’ve also started using the tagging functionality, which lets us quickly get an aggregate total (and growth over time) by tag across multiple directories.
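For illustration only (this is not Starfish’s API — sfdu answers the same question from its index without walking the tree, which is the whole point), a plain-Python sketch of the kind of per-directory CSV report we script, assuming a straightforward os.walk over the target directories:

```python
import csv
import os
import time

def dir_report(roots, out_path):
    """Write a CSV of (directory, total_bytes, newest_atime) for each root.

    Walks the tree directly with os.walk; an indexed tool like sfdu returns
    the same aggregates far faster on filesystems with 100s of millions of
    files, since it never has to stat anything at report time.
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["directory", "total_bytes", "newest_atime"])
        for root in roots:
            total = 0
            newest = 0.0
            for dirpath, _dirnames, filenames in os.walk(root):
                for name in filenames:
                    try:
                        st = os.stat(os.path.join(dirpath, name))
                    except OSError:
                        continue  # file vanished mid-scan; skip it
                    total += st.st_size
                    newest = max(newest, st.st_atime)
            stamp = time.strftime("%Y-%m-%d", time.localtime(newest)) if newest else ""
            writer.writerow([root, total, stamp])
```

The CSV can then be fed into whatever reporting you already have; the indexed version of this is what makes selections of dirs cheap to re-run.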

We tried Robinhood years ago but found it took too much work to scale to hundreds of millions of files and tens of PiB on GPFS. It might be better now.

IBM has a metadata product called Spectrum Discover that has the benefit of using GPFS-specific interfaces to stay continuously up to date; many of the other tools require scheduled scans to update their database.
Igneous has a commercial tool called DataDiscover which also looked promising. ClarityNow and MediaFlux are other similar tools.
I expect all of these tools at the very least have nice replacements for du and find as well as some sort of web directory tree view.

We had run Starfish for a while, re-evaluated a few options in 2019, and ultimately decided to stay with Starfish for now.

Best,
Chris

From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Alex Chekholko <alex at calicolabs.com>
Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: Friday, April 3, 2020 at 7:51 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] fast search for archivable data sets

Hi Jim,

The common, non-GPFS-specific approach is a tool that dumps all of your filesystem metadata into an SQL database; a webapp can then make nice graphs and reports from that database, or you can run your own queries against it.

The Free Software example is "Robinhood" (use the POSIX scanner, not the lustre-specific one) and one proprietary example is Starfish.

In both cases, you need a pretty beefy machine for the DB, and the scan of your filesystem may take a long time, depending on your filesystem performance. Then, without a filesystem-specific hook such as a transaction log, you'll need to rescan the entire filesystem to update your DB.
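To make the approach concrete, here is a minimal sketch (a toy stand-in for Robinhood's POSIX scanner, using SQLite instead of a real MySQL/MariaDB backend — names and schema are mine, not Robinhood's): one scan loads file metadata into a table, and a grouped query then finds directories over a size threshold whose most recent access is older than a cutoff.

```python
import os
import sqlite3

def scan_to_db(root, db_path):
    """One full POSIX scan into SQLite. Without a filesystem changelog,
    keeping this table current means rescanning the whole tree."""
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS entries (
                    dirpath TEXT, name TEXT, size INTEGER, atime REAL)""")
    db.execute("DELETE FROM entries")  # naive full refresh on each scan
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # file removed between walk and stat
            db.execute("INSERT INTO entries VALUES (?, ?, ?, ?)",
                       (dirpath, name, st.st_size, st.st_atime))
    db.commit()
    return db

def archivable_dirs(db, min_bytes, atime_cutoff):
    """Directories totalling at least min_bytes whose most recently
    accessed file is still older than atime_cutoff (epoch seconds)."""
    return db.execute("""
        SELECT dirpath, SUM(size) AS total, MAX(atime) AS newest
        FROM entries
        GROUP BY dirpath
        HAVING total >= ? AND newest < ?
        ORDER BY total DESC""", (min_bytes, atime_cutoff)).fetchall()
```

The real tools add incremental scanning, parallelism, and a web front end on top, but the archivable-directory question ultimately reduces to a GROUP BY query like this one.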

Regards,
Alex

On Fri, Apr 3, 2020 at 3:25 PM Jim Kavitsky <jkavitsky at 23andme.com> wrote:
Hello everyone,
I'm managing a low-multi-petabyte Scale filesystem with hundreds of millions of inodes, and I'm looking for the best way to locate archivable directories. For example, these might be directories whose contents total more than 5 or 10 TB and whose atimes are all older than two years.

Has anyone found a great way to do this with a policy engine run? If not, is there another good way that anyone would recommend? Thanks in advance,
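For concreteness, the sort of policy run I've been imagining is something along these lines (a sketch based on the ILM documentation, untested at our scale; the rule name, list name, and 730-day cutoff are placeholders):

```
/* cold.pol -- list files not accessed in roughly two years */
RULE 'cold' LIST 'archive_candidates'
  SHOW(VARCHAR(FILE_SIZE))
  WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 730
```

run deferred so it only produces a list (e.g. mmapplypolicy <fs> -P cold.pol -I defer -f /tmp/cold), with the listed sizes then aggregated per directory to find the 5-10 TB candidates.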

Jim Kavitsky
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

