[gpfsug-discuss] Tuning Spectrum Scale AFM for stability?

Venkateswara R Puvvada vpuvvada at in.ibm.com
Tue Apr 28 12:37:24 BST 2020


Hi,

What do you mean by a lock down of the AFM fileset? Are the messages stuck 
in the requeued state, so that AFM does not replicate any data? I would 
recommend opening a ticket and collecting the logs and internaldump from 
the gateway node while the replication is stuck.

You can also try increasing the value of the afmAsyncOpWaitTimeout option 
to see whether this solves the issue.

mmchconfig afmAsyncOpWaitTimeout=3600 -i
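
A minimal sketch of applying this change so that it can be rolled back 
later: it prints the current value with mmlsconfig before and after calling 
mmchconfig. The NEW_TIMEOUT variable and the PATH guard are illustrative 
additions, and 3600 is simply the value from the command above, not a 
measured recommendation.

```shell
#!/bin/sh
# Sketch only: show afmAsyncOpWaitTimeout, apply the suggested value,
# then show it again. NEW_TIMEOUT is an illustrative variable name.
NEW_TIMEOUT=3600

if command -v mmlsconfig >/dev/null 2>&1 && command -v mmchconfig >/dev/null 2>&1; then
    mmlsconfig afmAsyncOpWaitTimeout                     # value before the change
    mmchconfig afmAsyncOpWaitTimeout="$NEW_TIMEOUT" -i   # -i: immediate and persistent
    mmlsconfig afmAsyncOpWaitTimeout                     # value after the change
else
    # Assumption: the GPFS commands are normally found in /usr/lpp/mmfs/bin.
    echo "mmlsconfig/mmchconfig not found on PATH; nothing was changed"
fi
```

Noting the old value first makes it easy to revert if the longer timeout 
does not help.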

~Venkat (vpuvvada at in.ibm.com)



From:   Andi Christiansen <andi at christiansen.xxx>
To:     "gpfsug-discuss at spectrumscale.org" 
<gpfsug-discuss at spectrumscale.org>
Date:   04/28/2020 12:04 PM
Subject:        [EXTERNAL] [gpfsug-discuss] Tuning Spectrum Scale AFM for 
stability?
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Hi All, 

Can anyone share some thoughts on how to tune AFM for stability? At the 
moment we have OK performance between our sites (5-8 Gbit/s with 34 ms 
latency), but the cache fileset locks up about once a week. It was locking 
up daily before we applied the settings below. Is there any way to tune AFM 
further that I haven't found? 


Cache Site only: 
TCP Settings: 
sunrpc.tcp_slot_table_entries = 128 


Home and Cache: 
AFM / GPFS Settings: 
maxBufferDescs=163840 
afmHardMemThreshold=25G 
afmMaxWriteMergeLen=30G 


Cache fileset: 
Attributes for fileset AFMFILESET: 
================================ 
Status Linked 
Path /mnt/fs02/AFMFILESET 
Id 1 
Root inode 524291 
Parent Id 0 
Created Tue Apr 14 15:57:43 2020 
Comment 
Inode space 1 
Maximum number of inodes 10000384 
Allocated inodes 10000384 
Permission change flag chmodAndSetacl 
afm-associated Yes 
Target nfs://DK_VPN/mnt/fs01/AFMFILESET 
Mode single-writer 
File Lookup Refresh Interval 30 (default) 
File Open Refresh Interval 30 (default) 
Dir Lookup Refresh Interval 60 (default) 
Dir Open Refresh Interval 60 (default) 
Async Delay 15 (default) 
Last pSnapId 0 
Display Home Snapshots no 
Number of Read Threads per Gateway 64 
Parallel Read Chunk Size 128 
Parallel Read Threshold 1024 
Number of Gateway Flush Threads 48 
Prefetch Threshold 0 (default) 
Eviction Enabled yes (default) 
Parallel Write Threshold 1024 
Parallel Write Chunk Size 128 
Number of Write Threads per Gateway 16 
IO Flags 0 (default) 


mmfsadm dump afm: 
AFM Gateway: 
RpcQLen: 0 maxPoolSize: 4294967295 QOF: 0 MaxOF: 131072 
readThLimit 128 minIOBuf 1048576 maxIOBuf 1073741824 msgMaxWriteSize 
2147483648 
readBypassThresh 67108864 
QLen: 0 QMem: 0 SoftQMem: 10737418240 HardQMem 26843545600 
Ping thread: Started 
Fileset: AFMFILESET 1 (fs02) 
mode: single-writer queue: Normal MDS: <c0n1> QMem 0 CTL 577 
home: DK_VPN homeServer: 10.110.5.11 proto: nfs port: 2049 lastCmd: 16 
handler: Mounted Dirty refCount: 1 
queueTransfer: state: Idle senderVerified: 0 receiverVerified: 1 
terminate: 0 psnapWait: 0 
remoteAttrs: AsyncLookups 0 tsfindinode: success 0 failed 0 totalTime 0.0 
avgTime 0,000000 maxTime 0.0 
queue: delay 15 QLen 0+0 flushThds 0 maxFlushThds 48 numExec 8772518 qfs 0 
iwo 0 err 78 
handlerCreateTime : 2020-04-27_11:14:57.415+0200 numCreateSnaps : 0 
InflightAsyncLookups 0 
lastReplayTime : 2020-04-28_07:22:32.415+0200 lastSyncTime : 
2020-04-27_15:09:57.415+0200 
i/o: readBuf: 33554432 writeBuf: 2097152 sparseReadThresh: 134217728 
pReadThreads 64 
i/o: pReadChunkSize 33554432 pReadThresh: 1073741824 pWriteThresh: 
1073741824 
i/o: prefetchThresh 0 (Prefetch) 
Mnt status: 0:0 1:0 2:0 3:0 
Export Map: 10.110.5.10/<c0n0> 10.110.5.11/<c0n1> 10.110.5.12/<c0n2> 
10.110.5.13/<c0n9> 
Priority Queue: Empty (state: Active) 
Normal Queue: Empty (state: Active) 
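
To compare counters such as err or numExec between a healthy period and a 
locked one, a small helper can pull a single field out of this dump. This 
is only a sketch: afm_field is a hypothetical helper name, it returns the 
first match in the input (so it does not distinguish between filesets), and 
the meaning of each counter is not documented here.

```shell
#!/bin/sh
# afm_field NAME: read "mmfsadm dump afm" output on stdin and print the token
# that follows the first occurrence of NAME. State is kept across lines, so a
# "name value" pair split by line wrapping is still found.
afm_field() {
    awk -v name="$1" '{
        for (i = 1; i <= NF; i++) {
            if (prev == name) { print $i; exit }
            prev = $i
        }
    }'
}

# Demo on the two queue lines quoted above.
afm_field err <<'EOF'
queue: delay 15 QLen 0+0 flushThds 0 maxFlushThds 48 numExec 8772518 qfs 0
iwo 0 err 78
EOF

# On a gateway node the intended use would be:
#   mmfsadm dump afm | afm_field err
```

Logging this value periodically would show whether the counter climbs 
before the fileset locks up.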


Cluster Config Cache: 
maxFilesToCache 131072 
maxStatCache 524288 
afmDIO 2 
afmIOFlags 4096 
maxReceiverThreads 32 
afmNumReadThreads 64 
afmNumWriteThreads 8 
afmHardMemThreshold 26843545600 
maxBufferDescs 163840 
afmMaxWriteMergeLen 32212254720 
workerThreads 1024 
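
The byte values in this listing and in the dump above can be cross-checked 
against the G-suffixed settings given earlier. A quick sketch, assuming G 
means GiB (2^30 bytes); to_gib is a hypothetical helper name:

```shell
#!/bin/sh
# Convert a byte count to whole GiB (1 GiB = 1073741824 bytes).
to_gib() {
    echo $(( $1 / 1073741824 ))
}

to_gib 26843545600   # afmHardMemThreshold (also HardQMem in the dump)
to_gib 32212254720   # afmMaxWriteMergeLen
to_gib 10737418240   # SoftQMem in the dump
```

Under that assumption the three calls print 25, 30 and 10, so the values in 
effect match afmHardMemThreshold=25G and afmMaxWriteMergeLen=30G.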


The entries in the GPFS log state "AFM: Home is taking longer to 
respond...", but it is only AFM and the cache AFM fileset that enter a 
locked state. We have the same NFS exports from home mounted on the same 
gateway nodes to check when a file is transferred, and they are all OK 
while the AFM lock is happening. A simple GPFS restart on the AFM Master 
node is enough to make AFM restart and continue for another week. 
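
To see how often the message appears before a lock-up, the log on the 
gateway node can be searched for it. A rough sketch: count_slow_home is a 
hypothetical helper name, the pattern is only the part of the message 
quoted above, and /var/adm/ras/mmfs.log.latest is assumed to be the log 
location on the node.

```shell
#!/bin/sh
# count_slow_home FILE: count log lines containing the quoted AFM message.
count_slow_home() {
    grep -c 'AFM: Home is taking longer to respond' "$1"
}

LOG=/var/adm/ras/mmfs.log.latest   # assumed log location
if [ -r "$LOG" ]; then
    count_slow_home "$LOG" || true   # grep exits non-zero when the count is 0
else
    echo "$LOG is not readable on this node"
fi
```

Running grep without -c on the same pattern prints the matching lines with 
their timestamps, which can then be lined up against the time of the lock.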


The home target is exported through CES NFS from 4 CES nodes and a map is 
created at the Cache site to utilize the ParallelWrites feature. 


If anyone has ideas or knowledge on how to tune this further for more 
stability, I would be happy if you could share your thoughts! :-) 


Many Thanks in Advance! 
Andi Christiansen 