[gpfsug-discuss] AFM experiences?

Venkateswara R Puvvada vpuvvada at in.ibm.com
Tue Nov 24 02:37:18 GMT 2020


Dean,

This is one of the corner cases associated with sparse files at the home 
cluster. You could try the latest versions of Scale; AFM independent-writer 
mode has many performance and functional improvements in newer releases. 

~Venkat (vpuvvada at in.ibm.com)



From:   "Flanders, Dean" <dean.flanders at fmi.ch>
To:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:   11/23/2020 11:44 PM
Subject:        [EXTERNAL] Re: [gpfsug-discuss] AFM experiences?
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Hello Rob,
 
We looked at AFM years ago for DR, but after reading the bug reports we 
avoided it, and we have also seen a case where it had to be removed at one 
customer, so we have kept things simple. Looking again a few years later 
there are still issues - e.g. "IBM Spectrum Scale Active File Management 
(AFM) issues which may result in undetected data corruption" was just my 
first Google hit. Instead we use a parallel rsync process driven by the 
policy engine, which keeps the two sites isolated and can hit wire speed 
(GB/s) when copying millions of small files. I am not saying AFM is bad, 
just that it needs an appropriate risk/reward ratio to implement, as it 
increases overall complexity.
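
For anyone wanting to try the same approach, the outline below is a rough 
sketch of the idea rather than our actual scripts; the filesystem paths, 
the policy rule, the list-file naming/format and the chunk size are all 
assumptions you would need to adapt and verify for your environment:

#!/usr/bin/env python3
# Illustrative sketch only: drive parallel rsync from a GPFS policy scan.
# The paths, the policy rule, the list-file naming (<prefix>.list.<name>)
# and the record format parsed below are assumptions - check the
# mmapplypolicy documentation for your release before relying on any of it.
import concurrent.futures
import itertools
import subprocess
import tempfile

FS          = "/gpfs/fs0"            # source filesystem mount (assumption)
DEST        = "backup01:/gpfs/fs0/"  # rsync destination (assumption)
WORKERS     = 8                      # parallel rsync streams
LIST_PREFIX = "/tmp/parsync"         # where mmapplypolicy drops its lists

POLICY = """
RULE EXTERNAL LIST 'sync' EXEC ''
RULE 'all' LIST 'sync' WHERE TRUE
"""

def scan():
    """Run a deferred policy scan; candidates land in LIST_PREFIX.list.sync."""
    with tempfile.NamedTemporaryFile("w", suffix=".pol", delete=False) as f:
        f.write(POLICY)
        policy_file = f.name
    subprocess.run(["mmapplypolicy", FS, "-P", policy_file,
                    "-I", "defer", "-f", LIST_PREFIX], check=True)
    return LIST_PREFIX + ".list.sync"

def paths(list_file):
    """List records look roughly like 'inode gen snapid -- /path'; keep the path."""
    with open(list_file) as fh:
        for line in fh:
            yield line.split(" -- ", 1)[-1].rstrip("\n")

def chunks(iterable, size=5000):
    it = iter(iterable)
    while batch := list(itertools.islice(it, size)):
        yield batch

def rsync_chunk(chunk):
    # Feed rsync a chunk of paths relative to FS on stdin; exit codes are
    # ignored here, a real script would collect and retry failures.
    rel = "\n".join(p[len(FS) + 1:] for p in chunk)
    subprocess.run(["rsync", "-a", "--files-from=-", FS + "/", DEST],
                   input=rel, text=True)

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(WORKERS) as pool:
        list(pool.map(rsync_chunk, chunks(paths(scan()))))

In practice you would restrict the LIST rule (e.g. on MODIFICATION_TIME) 
to make the sync incremental rather than scanning everything each run.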
 
Kind regards,
 
Dean
 
From: gpfsug-discuss-bounces at spectrumscale.org 
<gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Ryan Novosielski
Sent: Monday, November 23, 2020 4:31 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] AFM experiences?
 
We use it much as you describe. We now run 5.0.4.1 on the client side (I 
mean actual client nodes, not the home or cache clusters). Before that, we 
had reliability problems (failure to cache libraries of programs that were 
executing, etc.). The storage clusters in our case are on 5.0.3-2.3. 
 
We also got bit by the quotas thing. You have to set them the same on both 
sides, or you will have problems. It seems a little silly that they are 
not kept in sync by GPFS, but that’s how it is. If memory serves, the 
result looked like an AFM failure (queue not being cleared), but it turned 
out to be that the files just could not be written at the home cluster 
because the user was over quota there. I also think I’ve seen load average 
increase due to this sort of thing, but I may be mixing that up with 
another problem scenario. 
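
If you do end up mirroring the quotas by hand, something along these lines 
is the general idea (a minimal sketch only: the -Y field names, the block 
units and the mmsetquota syntax are assumptions to verify against your 
Scale release, and it assumes a node that can administer both filesystems):

#!/usr/bin/env python3
# Illustrative sketch only: copy per-user block/file quotas from the home
# filesystem to the cache filesystem. The -Y field names, the unit of the
# block columns (assumed KiB) and the mmsetquota syntax are assumptions -
# check the HEADER line and the man pages for your Scale release.
import subprocess

HOME_FS  = "homefs"   # filesystem on the home cluster (assumption)
CACHE_FS = "scratch"  # filesystem on the cache cluster (assumption)

def user_quotas(fs):
    """Return {user: (block_soft, block_hard, files_soft, files_hard)}."""
    out = subprocess.run(["mmrepquota", "-u", "-Y", fs],
                         capture_output=True, text=True, check=True).stdout
    rows = [line.split(":") for line in out.splitlines() if line]
    header = next(r for r in rows if "HEADER" in r)
    idx = {name: i for i, name in enumerate(header)}
    quotas = {}
    for r in rows:
        if "HEADER" in r or len(r) < len(header):
            continue
        user = r[idx["name"]]
        if user in ("", "root"):
            continue
        quotas[user] = (r[idx["blockQuota"]], r[idx["blockLimit"]],
                        r[idx["filesQuota"]], r[idx["filesLimit"]])
    return quotas

def apply_to_cache(quotas):
    for user, (bsoft, bhard, fsoft, fhard) in quotas.items():
        # Block columns are assumed to be KiB, hence the 'K' suffix.
        subprocess.run(["mmsetquota", CACHE_FS, "--user", user,
                        "--block", f"{bsoft}K:{bhard}K",
                        "--files", f"{fsoft}:{fhard}"], check=True)

if __name__ == "__main__":
    apply_to_cache(user_quotas(HOME_FS))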

We monitor via Nagios, which I believe uses mmafmctl commands. I really 
can't think of a single time, apart from the other day, when the queue 
backed up, and that instance only lasted a few minutes (if you suddenly 
create many small files, e.g. when installing new software, it may not 
catch up instantly). 
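
A stripped-down sketch of the sort of check this involves (not our exact 
plugin; the getstate column layout and the thresholds below are assumptions 
to confirm against your release):

#!/usr/bin/env python3
# Simplified Nagios-style check (not the exact plugin we run): warn/crit on
# the AFM queue from 'mmafmctl <fs> getstate'. The column layout parsed here
# (fileset, target, state, gateway, queue length, numExec) and the
# thresholds are assumptions - confirm the output format on your release.
import subprocess
import sys

FS   = "scratch"   # cache-side filesystem name (assumption)
WARN = 10000       # queued operations before WARNING
CRIT = 100000      # queued operations before CRITICAL

def main():
    out = subprocess.run(["mmafmctl", FS, "getstate"],
                         capture_output=True, text=True, check=True).stdout
    worst, problems = 0, []
    for line in out.splitlines():
        parts = line.split()
        if len(parts) < 6 or not parts[-2].isdigit():
            continue  # skip headers and separator lines
        fileset, state, qlen = parts[0], parts[2], int(parts[-2])
        if state not in ("Active", "Dirty") or qlen >= CRIT:
            worst = 2
            problems.append(f"{fileset}: state={state} queue={qlen}")
        elif qlen >= WARN:
            worst = max(worst, 1)
            problems.append(f"{fileset}: queue={qlen}")
    status = ["OK", "WARNING", "CRITICAL"][worst]
    print(f"{status} - " + ("; ".join(problems) or "all AFM filesets healthy"))
    sys.exit(worst)

if __name__ == "__main__":
    main()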
 
-- 
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
    `'


On Nov 23, 2020, at 10:19, Robert Horton <robert.horton at icr.ac.uk> wrote:
Hi all,

We're thinking about deploying AFM and would be interested in hearing
from anyone who has used it in anger - particularly independent writer.

Our scenario is that we have a relatively large but slow cluster (mainly
because it is stretched over two sites with a 10G link) for long/medium-
term storage, and a smaller but faster cluster for scratch storage in our
HPC system. What we're thinking of doing is using some/all of the scratch
capacity as an IW cache of some/all of the main cluster, the idea being to
reduce the need for people to manually move data between the two.

It seems to generally work as expected in a small test environment,
although we have a few concerns:

- Quota management on the home cluster - we need a way of ensuring
people don't write data to the cache which can't be accommodated on
home. Probably not insurmountable but needs a bit of thought...

- It seems inodes on the cache only get freed when they are deleted on
the cache cluster - not if they get deleted from the home cluster or
when the blocks are evicted from the cache. Does this become an issue
in time?

If anyone has done anything similar I'd be interested to hear how you
got on. It would be interesting to know whether you created a cache
fileset for each home fileset or just one for the whole lot, as well as
any other pearls of wisdom you may have to offer.

Thanks!
Rob

-- 
Robert Horton | Research Data Storage Lead
The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB
T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk |
Twitter @ICR_London
Facebook: www.facebook.com/theinstituteofcancerresearch

The Institute of Cancer Research: Royal Cancer Hospital, a charitable 
Company Limited by Guarantee, Registered in England under Company No. 
534147 with its Registered Office at 123 Old Brompton Road, London SW7 
3RP.

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss






