[gpfsug-discuss] AFM limitations in a multi-cluster environment, slow prefetch operations

Kalyan Gunda kgunda at in.ibm.com
Tue Oct 7 16:20:30 BST 2014


some clarifications inline:

Regards
Kalyan
GPFS Development
EGL D Block, Bangalore




From:	Bryan Banister <bbanister at jumptrading.com>
To:	gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
Date:	10/07/2014 08:12 PM
Subject:	Re: [gpfsug-discuss] AFM limitations in a multi-cluster
            environment, slow prefetch operations
Sent by:	gpfsug-discuss-bounces at gpfsug.org



Interesting that AFM is supposed to work in a multi-cluster environment.
We were using GPFS on the backend.  The new GPFS file system was AFM linked
over GPFS protocol to the old GPFS file system using the standard
multi-cluster mount.   The "gateway" nodes in the new cluster mounted the
old file system.  All systems were connected over the same QDR IB fabric.
The client compute nodes in the third cluster mounted both the old and new
file systems.  I looked for waiters on the client and NSD servers of the
new file system when the problem occurred, but none existed.  I tried
stracing the `ls` process, but it reported nothing and the strace itself
became unkillable.  There were no error messages in any GPFS or system logs
related to the `ls` failure.  NFS clients accessing cNFS servers in the new
cluster also worked as expected.  The `ls` from the NFS client in an AFM
fileset returned the expected directory listing.  Thus all symptoms
indicated the configuration wasn't supported.  I may try to replicate the
problem in a test environment at some point.

However AFM isn't really a great solution for file data migration between
file systems for these reasons:
1) It requires a complicated AFM setup, with manual operations to sync data
between the file systems (e.g. an mmapplypolicy run on the old file system to
get the file list, THEN an mmafmctl prefetch operation on the new AFM fileset
to pull the data).  There is no way to have it simply keep the two namespaces
in sync.  And you must be careful with the "Local Update" configuration not to
modify basically ANY file attributes in the new AFM fileset until a CLEAN
cutover of your application is performed; otherwise AFM will break the file's
link to the data stored on the old file system.  This is concerning, and it is
not easy to detect that this has occurred.

--> The LU mode is meant for scenarios where changes in cache are not meant
to be pushed back to the old filesystem.  If that's not what's desired, then
other AFM modes like IW can be used to keep the namespace in sync, and data
can flow from both sides.  Typically, for data migration, prefetch is run with
--metadata-only to pull in the full namespace first, and the data can then be
migrated on demand or via policy, as outlined above, using the prefetch cmd.
The AFM setup should be an extension of the GPFS multi-cluster setup when
using a GPFS backend.
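
As a rough sketch, the flow could look like the following (file system,
fileset, and path names are placeholders, and the list-generating policy rule
and exact prefetch option combinations should be checked against the AFM
migration docs):

  # on the old file system, build a file list with a LIST policy rule,
  # written as described in the AFM migration docs
  mmapplypolicy /gpfs/oldfs/dir -P /tmp/migrate-list.pol -f /tmp/migrate -I defer

  # pull the namespace into the new AFM fileset first (metadata only)
  mmafmctl fs1 prefetch -j migrateFset --metadata-only \
      --home-inode-file /tmp/migrate.filelist

  # later, pull the actual data for the same list (or let it come in on demand)
  mmafmctl fs1 prefetch -j migrateFset --home-inode-file /tmp/migrate.filelist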

2) The "Progressive migration with no downtime" directions actually states
that there is downtime required to move applications to the new cluster,
THUS DOWNTIME!  And it really requires a SECOND downtime to finally disable
AFM on the file set so that there is no longer a connection to the old file
system, THUS TWO DOWNTIMES!
--> I am not sure I follow the first downtime.  If applications have to
start using the new filesystem, then they have to be informed accordingly.
If this can be done without bringing down applications, then there is no
DOWNTIME.
Regarding the second downtime, you are right: disabling AFM after data
migration requires an unlink and hence downtime.  But there is an easy
workaround, where the revalidation intervals can be increased to the maximum,
or the GW nodes can be unconfigured, without downtime and with the same
effect.  Disabling AFM can then be done at a later point during a maintenance
window.  We plan to modify this so it is done online, i.e. without requiring
an unlink of the fileset.  This will get prioritized if there is enough
interest in AFM being used in this direction.
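
For instance, something along these lines could be used (the fileset name and
interval values are placeholders, and the exact parameter names should be
confirmed against the mmchfileset documentation):

  # push revalidation intervals very high so the cache stops going back
  # to the old home file system
  mmchfileset fs1 migrateFset -p afmFileLookupRefreshInterval=10000000
  mmchfileset fs1 migrateFset -p afmDirLookupRefreshInterval=10000000

  # later, in a maintenance window, disable AFM on the fileset for good
  mmunlinkfileset fs1 migrateFset
  mmchfileset fs1 migrateFset -p afmTarget=disable
  mmlinkfileset fs1 migrateFset -J /gpfs/fs1/migrateFset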

3) The prefetch operation can only run on a single node, and thus is not able
to take any advantage of the large number of NSD servers supporting both file
systems for the data migration.  Multiple threads from a single node just
don't cut it due to single-node bandwidth limits.  When I was running the
prefetch it was only executing roughly 100 "Queue numExec" operations per
second.  The prefetch operation for a directory with 12 million files was
going to take over 33 HOURS just to process the file list!
--> Prefetch can run on multiple nodes by configuring multiple GW nodes and
enabling parallel i/o as specified in the docs (link provided below).
In fact it can parallelize the transfer of a single file and also handle
multiple files in parallel, depending on file sizes and various tuning params.
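
For instance (node names are placeholders), additional gateway nodes can be
designated in the cache cluster and the flush thread count raised on the
fileset, along the lines of:

  # designate several nodes as AFM gateway nodes
  mmchnode --gateway -N nsd01,nsd02,nsd03,nsd04

  # give the fileset more flush threads (see the example further down)
  mmchfileset fs1 migrateFset -p afmNumFlushThreads=16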

4) In comparison, parallel rsync operations will require only ONE downtime
to run a final sync over MULTIPLE nodes in parallel at the time that
applications are migrated between file systems, and they do not require the
complicated AFM configuration.  Yes, there is of course some effort in
breaking up the namespace for the rsync operations.  This is really what AFM
should be doing for us... chopping up the namespace intelligently and spawning
prefetch operations across multiple nodes in a configurable way, to ensure
performance targets are met or to limit the overall impact of the operation
if desired.
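
As a rough illustration of what I mean (node names and paths made up), the
namespace can be split at the top-level directories and the copies fanned out
across several nodes:

  i=0; nodes=(node01 node02 node03 node04)
  for dir in /gpfs/oldfs/project/*/ ; do
      node=${nodes[$(( i++ % ${#nodes[@]} ))]}
      # both file systems are mounted on every node, so rsync runs locally there
      ssh "$node" rsync -a "$dir" "/gpfs/newfs/project/$(basename "$dir")/" &
  done
  wait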

--> AFM can be used for data migration without any downtime dictated by AFM
(see above), and it can in fact use multiple threads on multiple nodes to do
parallel i/o.

AFM, however, is great for what it is intended to be, a cached data access
mechanism across a WAN.

Thanks,
-Bryan

-----Original Message-----
From: gpfsug-discuss-bounces at gpfsug.org [
mailto:gpfsug-discuss-bounces at gpfsug.org] On Behalf Of Kalyan Gunda
Sent: Tuesday, October 07, 2014 12:03 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] AFM limitations in a multi-cluster
environment, slow prefetch operations

Hi Bryan,
 AFM supports GPFS multi-cluster, and we have customers already using this
successfully.  Are you using a GPFS backend?
Can you explain your configuration in detail?  If ls is hung, it would have
generated some long waiters.  Maybe this should be pursued separately via a
PMR.  You can ping me the details directly if needed, along with opening a
PMR per the IBM service process.

As far as prefetch is concerned, right now it's limited to one prefetch job
per fileset.  Each job in itself is multi-threaded and can use multiple nodes
to pull in data, based on configuration.
The "afmNumFlushThreads" tunable controls the number of threads used by AFM.
This parameter can be changed via the mmchfileset cmd (the mmchfileset pubs
don't show this param for some reason; I will have that updated.)

eg: mmchfileset fs1 prefetchIW -p afmnumflushthreads=5
Fileset prefetchIW changed.

List the change:
 mmlsfileset fs1 prefetchIW --afm -L
Filesets in file system 'fs1':

Attributes for fileset prefetchIW:
===================================
Status                                  Linked
Path                                    /gpfs/fs1/prefetchIW
Id                                      36
afm-associated                          Yes
Target                                  nfs://hs21n24/gpfs/fs1/singleTargetToUseForPrefetch
Mode                                    independent-writer
File Lookup Refresh Interval            30 (default)
File Open Refresh Interval              30 (default)
Dir Lookup Refresh Interval             60 (default)
Dir Open Refresh Interval               60 (default)
Async Delay                             15 (default)
Last pSnapId                            0
Display Home Snapshots                  no
Number of Gateway Flush Threads         5
Prefetch Threshold                      0 (default)
Eviction Enabled                        yes (default)

AFM parallel i/o can be set up such that multiple GW nodes can be used to
pull in data.  More details are available here:
http://www-01.ibm.com/support/knowledgecenter/SSFKCN_4.1.0/com.ibm.cluster.gpfs.v4r1.gpfs200.doc/bl1adv_afmparallelio.htm


and this link outlines tuning params for parallel i/o along with others:
http://www-01.ibm.com/support/knowledgecenter/SSFKCN_4.1.0/com.ibm.cluster.gpfs.v4r1.gpfs200.doc/bl1adv_afmtuning.htm%23afmtuning
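
For example (the values below are placeholders only; see the tuning page above
for the actual parameter defaults and units):

  # lower the file-size threshold above which a single file is read in
  # parallel across gateway nodes, and set the per-gateway chunk size
  mmchconfig afmParallelReadThreshold=100
  mmchconfig afmParallelReadChunkSize=134217728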


Regards
Kalyan
GPFS Development
EGL D Block, Bangalore




From:   Bryan Banister <bbanister at jumptrading.com>
To:     gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
Date:   10/06/2014 09:57 PM
Subject:        Re: [gpfsug-discuss] AFM limitations in a multi-cluster
            environment, slow prefetch operations
Sent by:        gpfsug-discuss-bounces at gpfsug.org



We are using 4.1.0.3 on the cluster with the AFM filesets, -Bryan

From: gpfsug-discuss-bounces at gpfsug.org [
mailto:gpfsug-discuss-bounces at gpfsug.org] On Behalf Of Sven Oehme
Sent: Monday, October 06, 2014 11:28 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] AFM limitations in a multi-cluster
environment, slow prefetch operations

Hi Bryan,

in 4.1 AFM uses multiple threads for reading data; this was different in
3.5.  What version are you using?

thx. Sven


On Mon, Oct 6, 2014 at 8:36 AM, Bryan Banister <bbanister at jumptrading.com>
wrote:
Just an FYI to the GPFS user community,

We have been testing out GPFS AFM file systems in our required process of
file data migration between two GPFS file systems.  The two GPFS file
systems are managed in two separate GPFS clusters.  We have a third GPFS
cluster for compute systems.  We created new independent AFM filesets in
the new GPFS file system that are linked to directories in the old file
system.  Unfortunately, access to the AFM filesets from the compute cluster
hangs completely.  Access to the other parts of the second file system is
fine.  This limitation/issue is not documented in the Advanced Admin Guide.

Further, we performed prefetch operations using the mmafmctl command with a
file list, but the process appears to be single threaded and the operation was
extremely slow as a result.  According to the Advanced Admin Guide, it is
not possible to run multiple prefetch jobs on the same fileset:
GPFS can prefetch the data using the mmafmctl Device prefetch -j
FilesetName command (which specifies a list of files to prefetch).  Note the
following about prefetching:
- It can be run in parallel on multiple filesets (although more than one
prefetching job cannot be run in parallel on a single fileset).

We were able to quickly create the “--home-inode-file” from the old file
system using the mmapplypolicy command as the documentation describes.
However, the AFM prefetch operation is so slow that we are better off
running parallel rsync operations between the file systems versus using the
GPFS AFM prefetch operation.

Cheers,
-Bryan





_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


More information about the gpfsug-discuss mailing list